jpekkila
|
f1138b04ac
|
Cleaned up the MPI implementation and removed all older implementations (including the MPI window implementation, which might be handy in the future when CUDA-aware support is introduced). If the removed code is needed later, here are some keywords to help find this commit: MPI_window, sendrecv, bidirectional, unidirectional transfer, real-time pinning, a0s, b0s.
|
2020-05-28 16:42:50 +03:00 |
|
jpekkila
|
0d62f56e27
|
Tried an alternative approach to comm (it was worse than the current solution) and rewrote the current best solution for clarity (now easier to read)
|
2020-05-28 15:31:43 +03:00 |
|
jpekkila
|
f97005a75d
|
Added WIP version of the new bidirectional comm scheme
|
2020-05-27 19:09:32 +03:00 |
|
jpekkila
|
afe5b973ca
|
Added multiplication operator for int3
|
2020-05-27 19:08:39 +03:00 |
|
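A minimal sketch of what a component-wise multiplication operator for CUDA's int3 might look like; the placement and exact semantics are assumptions rather than Astaroth's actual definition.

    #include <cuda_runtime.h> // int3, make_int3

    // Assumed component-wise semantics.
    static __host__ __device__ inline int3
    operator*(const int3& a, const int3& b)
    {
        return make_int3(a.x * b.x, a.y * b.y, a.z * b.z);
    }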
jpekkila
|
7e59ea0eff
|
MPI: corners are no longer communicated. Slight performance impact (14 ms vs 15 ms). Tests still pass with 8 GPUs.
|
2020-05-26 19:00:14 +03:00 |
|
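A hedged sketch of what dropping the corners means: of the 26 possible halo neighbors, only the 6 faces and 12 edges are exchanged, while the 8 corner offsets (all three components nonzero) are skipped. The loop and the exchange_segment helper are illustrative, not the actual implementation.

    #include <cuda_runtime.h> // int3, make_int3

    void exchange_segment(const int3 offset); // hypothetical per-segment exchange

    void
    exchange_halos_without_corners(void)
    {
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const bool self   = (dx == 0 && dy == 0 && dz == 0);
                    const bool corner = (dx != 0 && dy != 0 && dz != 0);
                    if (self || corner)
                        continue; // corners are no longer communicated
                    exchange_segment(make_int3(dx, dy, dz));
                }
    }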
jpekkila
|
ec59cdb973
|
Some formatting and unimportant changes to samples
|
2020-05-26 18:57:46 +03:00 |
|
jpekkila
|
c93b3265e6
|
Made comm streams high prio
|
2020-04-22 17:03:53 +03:00 |
|
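A minimal sketch, assuming plain CUDA runtime calls, of how communication streams can be created at the highest allowed priority so packing/unpacking work preempts the bulk compute stream; the wrapper function is illustrative.

    #include <cuda_runtime.h>

    cudaStream_t
    create_comm_stream(void)
    {
        // Note: the greatest priority is the numerically *lowest* value.
        int least, greatest;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        cudaStream_t stream;
        cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, greatest);
        return stream;
    }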
jpekkila
|
22e01b7f1d
|
Rewrote partitioning code
|
2020-04-19 23:23:23 +03:00 |
|
jpekkila
|
4dd825f574
|
Proper decomposition when using Morton order to partition the computational domain
|
2020-04-19 22:50:26 +03:00 |
|
jpekkila
|
ffb274e16f
|
Linking the dynamic CUDA library instead of the static one (less prone to breaking since Astaroth does not have to be rebuilt when CUDA is updated)
|
2020-04-19 22:33:01 +03:00 |
|
jpekkila
|
8c210b3292
|
3D decomposition is now done using Morton order instead of linear indexing
|
2020-04-19 22:31:57 +03:00 |
|
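A sketch of the Morton (Z-order) idea: the process index is read as interleaved x/y/z bits, so nearby ranks map to spatially nearby subdomains, which keeps most halo traffic between close neighbors. The decode routine below is a generic textbook version, not necessarily Astaroth's exact code.

    #include <cuda_runtime.h> // int3
    #include <stdint.h>

    static int3
    morton3d_decode(const uint64_t pid)
    {
        int3 p = {0, 0, 0};
        for (int bit = 0; bit <= 20; ++bit) {
            const uint64_t mask = UINT64_C(1) << (3 * bit);
            p.x |= (int)((pid & (mask << 0)) >> (2 * bit + 0));
            p.y |= (int)((pid & (mask << 1)) >> (2 * bit + 1));
            p.z |= (int)((pid & (mask << 2)) >> (2 * bit + 2));
        }
        return p; // e.g. pids 0..7 decode to the 2x2x2 block of coordinates (0..1)^3
    }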
jpekkila
|
9cd5909f5a
|
BWtest now calculates aggregate bandwidths per process instead of assuming that all neighbor communication can be done in parallel (within a node one can have parallel P2P connections to all neighbors, giving an insane total bandwidth, but this is not the case over the network, where we seem to have only one bidirectional socket)
|
2020-04-09 20:28:04 +03:00 |
|
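A sketch of the aggregate measurement: a process posts sends and receives to all of its neighbors at once, waits for all of them, and divides the total bytes moved by the elapsed wall time, which reflects the single shared network link instead of crediting every neighbor with the full P2P rate. Buffer and neighbor bookkeeping is simplified and the names are illustrative.

    #include <mpi.h>
    #include <stdint.h>

    #define MAX_NEIGHBORS 32 // illustrative upper bound

    double
    aggregate_bandwidth(uint8_t* send_bufs[], uint8_t* recv_bufs[], const int bytes,
                        const int neighbors[], const int num_neighbors)
    {
        MPI_Request reqs[2 * MAX_NEIGHBORS];

        MPI_Barrier(MPI_COMM_WORLD);
        const double start = MPI_Wtime();

        for (int i = 0; i < num_neighbors; ++i) {
            MPI_Irecv(recv_bufs[i], bytes, MPI_BYTE, neighbors[i], 0,
                      MPI_COMM_WORLD, &reqs[i]);
            MPI_Isend(send_bufs[i], bytes, MPI_BYTE, neighbors[i], 0,
                      MPI_COMM_WORLD, &reqs[num_neighbors + i]);
        }
        MPI_Waitall(2 * num_neighbors, reqs, MPI_STATUSES_IGNORE);

        const double elapsed = MPI_Wtime() - start;
        return (2.0 * num_neighbors * bytes) / elapsed; // bytes/s moved by this process
    }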
jpekkila
|
d4a84fb887
|
Added a PCIe bandwidth test
|
2020-04-09 20:04:54 +03:00 |
|
jpekkila
|
d6e74ee270
|
Added missing files
|
2020-04-09 19:24:55 +03:00 |
|
jpekkila
|
ed8a0bf7e6
|
Added bwtest and benchmarkscript to CMakeLists
|
2020-04-07 18:35:12 +03:00 |
|
jpekkila
|
fb41741d74
|
Improvements to samples
|
2020-04-07 17:58:47 +03:00 |
|
jpekkila
|
427a3ac5d8
|
Rewrote the previous implementation, now fully works (verified) and gives the speedup we want. Communication latency is now completely hidden on at least two nodes (8 GPUs). Scaling looks very promising.
|
2020-04-06 17:28:02 +03:00 |
|
jpekkila
|
37f1c841a3
|
Added functions for pinning memory that is sent over the network. TODO: pack to and from pinned memory selectively (currently P2P results are overwritten with data in pinned memory)
|
2020-04-06 14:09:12 +03:00 |
|
jpekkila
|
cc9d3f1b9c
|
Found a workaround that gives good inter- and intra-node performance. The HPC-X MPI implementation does not know how to do P2P comm with pinned arrays (should be 80 GiB/s, measured 10 GiB/s), and internode comm is super slow without pinned arrays (should be 40 GiB/s, measured < 1 GiB/s). Made a proof-of-concept communicator that pins arrays that are sent to or received from another node.
|
2020-04-05 20:15:32 +03:00 |
|
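A simplified sketch of the workaround's core decision: buffers exchanged with an off-node rank go through pinned (page-locked) host memory so the interconnect reaches full bandwidth, while on-node exchanges keep using device pointers for direct P2P. The locality test and buffer management are placeholders, not the proof-of-concept communicator itself.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Allocate a communication buffer depending on where the neighbor lives.
    void*
    alloc_comm_buffer(const size_t bytes, const bool neighbor_on_same_node)
    {
        void* buf = nullptr;
        if (neighbor_on_same_node)
            cudaMalloc(&buf, bytes);     // device memory, exchanged via P2P
        else
            cudaMallocHost(&buf, bytes); // pinned host memory for the network path
        return buf;
    }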
jpekkila
|
88e53dfa21
|
Added a little program for testing the bandwidths of different MPI comm styles on n nodes and processes
|
2020-04-05 17:09:57 +03:00 |
|
jpekkila
|
fe14ae4665
|
Added an alternative MPI implementation which uses one-sided communication
|
2020-04-02 17:59:53 +03:00 |
|
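A hedged sketch of the one-sided pattern such an implementation typically uses: each rank exposes its receive buffer through an MPI window and the neighbor writes into it with MPI_Put inside a fence epoch. Creating the window per call and the names are purely illustrative.

    #include <mpi.h>

    void
    exchange_halo_one_sided(double* recv_buf, const double* send_buf, const int count,
                            const int neighbor, MPI_Comm comm)
    {
        MPI_Win win;
        MPI_Win_create(recv_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);
        MPI_Put(send_buf, count, MPI_DOUBLE, neighbor, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
    }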
Johannes Pekkila
|
9b6d927cf1
|
It might be better to benchmark MPI codes without synchronization because of the overhead of timing individual steps
|
2020-03-31 12:37:54 +02:00 |
|
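A sketch of that reasoning: synchronize once before and after the whole run and report an average, rather than paying a barrier plus timer on every step. integrate_step is a placeholder for the actual solver call.

    #include <mpi.h>

    void integrate_step(void); // placeholder for the actual solver call

    double
    avg_ms_per_step(const int num_steps)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        const double start = MPI_Wtime();

        for (int i = 0; i < num_steps; ++i)
            integrate_step(); // no per-step barriers or timers

        MPI_Barrier(MPI_COMM_WORLD);
        return 1e3 * (MPI_Wtime() - start) / num_steps;
    }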
Johannes Pekkila
|
742dcc2697
|
Optimized MPI synchronization a bit
|
2020-03-31 12:36:25 +02:00 |
|
jpekkila
|
24e65ab02d
|
Set decompositions for some nprocs by hand
|
2020-03-30 18:13:50 +03:00 |
|
jpekkila
|
9065381b2a
|
Added the configuration used for benchmarking (not to be merged to master)
|
2020-03-30 18:01:35 +03:00 |
|
jpekkila
|
850b37e8c8
|
Added a switch for generating strong and weak scaling results
|
2020-03-30 17:56:12 +03:00 |
|
jpekkila
|
d4eb3e0d35
|
Benchmarks are now written to a CSV file
|
2020-03-30 17:41:42 +03:00 |
|
jpekkila
|
9c5011d275
|
Renamed t to terr to avoid naming conflicts
|
2020-03-30 17:41:09 +03:00 |
|
jpekkila
|
864699360f
|
Better-looking autoformat
|
2020-03-30 17:40:38 +03:00 |
|
jpekkila
|
af531c1f96
|
Added a sample for benchmarking
|
2020-03-30 17:22:41 +03:00 |
|
jpekkila
|
cc64968b9e
|
GPUDirect was off, re-enabled
|
2020-03-26 18:24:42 +02:00 |
|
jpekkila
|
28792770f2
|
Better overlap of computation and comm when the inner integration is launched first
|
2020-03-26 18:00:01 +02:00 |
|
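A sketch of the launch ordering referred to above, with placeholder names: the inner domain needs no halo data, so launching it first lets the halo exchange overlap with it, and only the outer shell waits for communication to finish.

    #include <cuda_runtime.h>

    // Hypothetical stand-ins for Astaroth's kernels and communication calls.
    void launch_integration_inner(cudaStream_t stream);
    void launch_integration_outer(cudaStream_t stream);
    void exchange_halos_async(void);
    void wait_halo_exchange(void);

    void
    integrate_substep(cudaStream_t inner_stream, cudaStream_t outer_stream)
    {
        launch_integration_inner(inner_stream); // no halo dependency, starts immediately
        exchange_halos_async();                 // overlaps with the inner kernel
        wait_halo_exchange();                   // outer cells need fresh halo data
        launch_integration_outer(outer_stream);
        cudaDeviceSynchronize();
    }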
jpekkila
|
4c82e3c563
|
Removed old debug error check
|
2020-03-26 17:59:29 +02:00 |
|
jpekkila
|
5a898b8e95
|
mpitest now gives a warning instead of a compilation failure if MPI is not enabled
|
2020-03-26 15:31:29 +02:00 |
|
jpekkila
|
08f567619a
|
Removed old unused functions for MPI integration and comm
|
2020-03-26 15:04:57 +02:00 |
|
jpekkila
|
329a71d299
|
Added an example of how to run the code with MPI
|
2020-03-26 15:02:55 +02:00 |
|
jpekkila
|
ed7cf3f540
|
Added a production-ready interface for doing multi-node runs with Astaroth with MPI
|
2020-03-26 15:02:37 +02:00 |
|
jpekkila
|
dad84b361f
|
Renamed Grid structure to GridDims structure to avoid confusion with MPI Grids used in device.cc
|
2020-03-26 15:01:33 +02:00 |
|
jpekkila
|
db120c129e
|
Modelsolver now computes any built-in parameters automatically instead of relying on the user to supply them (inv_dsx etc.)
|
2020-03-26 14:59:07 +02:00 |
|
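A sketch of the derived-parameter idea with a simplified stand-in for the mesh configuration: inverse spacings such as inv_dsx are computed from the user-supplied spacings instead of being required as input.

    // Simplified stand-in for the model solver's configuration.
    struct ModelConfig {
        double dsx, dsy, dsz;             // grid spacings supplied by the user
        double inv_dsx, inv_dsy, inv_dsz; // derived automatically
    };

    static void
    update_builtin_params(ModelConfig* cfg)
    {
        cfg->inv_dsx = 1.0 / cfg->dsx;
        cfg->inv_dsy = 1.0 / cfg->dsy;
        cfg->inv_dsz = 1.0 / cfg->dsz;
    }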
jpekkila
|
fbd4b9a385
|
Made the MPI flag global instead of just core
|
2020-03-26 14:57:22 +02:00 |
|
jpekkila
|
e1bec4459b
|
Removed an unused variable
|
2020-03-25 13:54:43 +02:00 |
|
jpekkila
|
ce81df00e3
|
Merge branch 'master' of https://bitbucket.org/jpekkila/astaroth
|
2020-03-25 13:51:07 +02:00 |
|
jpekkila
|
e36ee7e2d6
|
AC_multigpu_offset tested to work on at least 2 nodes and 8 GPUs. Forcing should now work with MPI
|
2020-03-25 13:51:00 +02:00 |
|
jpekkila
|
0254628016
|
Updated API specification. The DSL syntax allows only C++-style casting.
|
2020-03-25 11:28:30 +00:00 |
|
jpekkila
|
672137f7f1
|
WIP further MPI optimizations
|
2020-03-24 19:02:58 +02:00 |
|
jpekkila
|
ef63813679
|
Added an explicit check that critical parameters like inv_dsx are properly initialized before calling integration
|
2020-03-24 17:01:24 +02:00 |
|
jpekkila
|
8c362b44f0
|
Added more warnings in case some of the model solver parameters are not initialized
|
2020-03-24 16:56:30 +02:00 |
|
jpekkila
|
d520835c42
|
Added integration to MPI comm; a full integration step now completes. Works on at least 2 nodes
|
2020-03-24 16:55:38 +02:00 |
|
jpekkila
|
37d6ad18d3
|
Fixed formatting in the API specification file
|
2020-03-04 15:09:23 +02:00 |
|
jpekkila
|
13b9b39c0d
|
Renamed sink_particle.md to .txt to avoid it showing up in the documentation
|
2020-02-28 14:44:51 +02:00 |
|