Author | Commit | Message | Date
jpekkila | 0d1c5b3911 | Autoformatted | 2020-06-24 15:56:30 +03:00
jpekkila | 3c3b2a1885 | Reverted the default settings to what they were before the merge. Note: LFORCING (1) is potentially not tested properly, TODO: recheck. | 2020-06-24 15:35:19 +03:00
jpekkila | 88f99c12e4 | Fixed #fi -> #endif | 2020-06-24 15:20:43 +03:00
jpekkila | f04e347c45 | Cleanup before merging to the master merge candidate branch | 2020-06-24 15:13:15 +03:00
jpekkila | 0e4b39d6d7 | Added a toggle for using pinned memory | 2020-06-11 11:28:52 +03:00
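The entry above only mentions a toggle, so as a rough illustration of the idea, here is a minimal sketch of a compile-time switch between page-locked (pinned) and pageable host buffers. The macro name USE_PINNED and the helper functions are hypothetical, not Astaroth's actual API.

    // Hypothetical compile-time toggle between pinned and pageable host buffers.
    #include <cstdlib>
    #include <cuda_runtime.h>

    static void* alloc_comm_buffer(size_t bytes)
    {
    #if USE_PINNED
        void* ptr = nullptr;
        cudaMallocHost(&ptr, bytes); // page-locked host memory: faster H2D/D2H copies
        return ptr;
    #else
        return malloc(bytes);        // regular pageable host memory
    #endif
    }

    static void free_comm_buffer(void* ptr)
    {
    #if USE_PINNED
        cudaFreeHost(ptr);
    #else
        free(ptr);
    #endif
    }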
jpekkila | 1cdb9e2ce7 | Added missing synchronization to the end of the new integration function | 2020-06-10 12:32:56 +03:00
jpekkila | fa422cf457 | Added a better-pipelined version of acGridIntegrate and a switch for toggling the transfer of corners | 2020-06-10 02:16:23 +03:00
jpekkila | 9840b817d0 | Added the (hopefully final) basic test case used for the benchmarks | 2020-06-07 21:59:33 +03:00
jpekkila | 17a4f31451 | Added the latest setup used for benchmarks | 2020-06-04 20:47:03 +03:00
jpekkila | 0d80834619 | Disabled forcing and upwinding for performance tests. Set the default grid size to 512^3. Set default CMake parameters so that benchmarks can be reproduced out of the box. | 2020-06-02 14:09:00 +03:00
jpekkila | a753ca92f2 | Made CMake handle MPI linking. Potentially a bad idea (it is usually better to use the mpicc and mpicxx wrappers). | 2020-05-30 22:02:39 +03:00
jpekkila | f97ed9e513 | For some reason git decided to remove integration from the most critical part of the program when merging. Luckily we have autotests. | 2020-05-30 20:59:39 +03:00
jpekkila | 9cafe88d13 | Merge branch 'mpi-to-master-merge-candidate-2020-06-01' of https://bitbucket.org/jpekkila/astaroth into mpi-to-master-merge-candidate-2020-06-01 | 2020-05-30 20:25:48 +03:00
jpekkila | 176ceae313 | Fixed various compilation warnings | 2020-05-30 20:23:53 +03:00
jpekkila | 2ddeef22ac | bitbucket-pipelines.yml edited online with Bitbucket | 2020-05-30 16:58:45 +00:00
jpekkila | f929b21ac0 | bitbucket-pipelines.yml edited online with Bitbucket | 2020-05-30 16:52:26 +00:00
jpekkila | 95275df3f2 | bitbucket-pipelines.yml edited online with Bitbucket | 2020-05-30 16:48:39 +00:00
jpekkila | c24996fdb3 | Added the official Kitware PPA for pulling the latest CMake when doing automated builds. | 2020-05-30 16:45:08 +00:00
jpekkila | b719306266 | Upped the required CMake version. This may be an issue on older machines. Instead of making the user compile CMake themselves in that case, we could maybe add CMake as a submodule. In any case, supporting older CMake versions is not really an option because their CUDA support is so poor that it would require dirty hacks in the clean CMake files we have now. | 2020-05-30 19:36:32 +03:00
jpekkila | e05338c128 | Merged the newest MPI changes | 2020-05-30 19:18:46 +03:00
jpekkila | 555bf8b252 | Reverted the default settings to the same as on master for an easier merge | 2020-05-30 19:06:21 +03:00
jpekkila | 4748e48c7d | Spelling fixes | 2020-05-28 17:10:17 +03:00
jpekkila | 01ad141d90 | Added comments and a short overview of the MPI implementation | 2020-05-28 17:05:12 +03:00
jpekkila | f1138b04ac | Cleaned up the MPI implementation and removed all older implementations (also removed the MPI window implementation, which might be handy in the future when CUDA-aware support is introduced). If the removed code is needed later, here are some keywords to help find this commit: MPI_window, sendrecv, bidirectional, unidirectional transfer, real-time pinning, a0s, b0s. | 2020-05-28 16:42:50 +03:00
jpekkila | 0d62f56e27 | Tried an alternative approach to comm (it was worse than the current solution) and rewrote the current best solution for clarity (now easier to read) | 2020-05-28 15:31:43 +03:00
jpekkila | f97005a75d | Added WIP version of the new bidirectional comm scheme | 2020-05-27 19:09:32 +03:00
jpekkila | afe5b973ca | Added multiplication operator for int3 | 2020-05-27 19:08:39 +03:00
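For context on the int3 commit above, a component-wise multiplication operator for CUDA's int3 vector type typically looks like the sketch below; the exact qualifiers and semantics in the repository may differ.

    // Sketch of a component-wise int3 multiplication operator.
    #include <cuda_runtime.h> // provides int3 and make_int3

    static __host__ __device__ inline int3
    operator*(const int3& a, const int3& b)
    {
        return make_int3(a.x * b.x, a.y * b.y, a.z * b.z);
    }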
jpekkila | 7e59ea0eff | MPI: corners are no longer communicated. Slight performance impact (14 ms vs 15 ms). Tests still pass with 8 GPUs. | 2020-05-26 19:00:14 +03:00
jpekkila | ec59cdb973 | Some formatting and unimportant changes to samples | 2020-05-26 18:57:46 +03:00
jpekkila | c93b3265e6 | Made communication streams high-priority | 2020-04-22 17:03:53 +03:00
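The change above refers to CUDA stream priorities. As an illustration (not the exact code in the repository), a communication stream can be created with the highest available priority so halo packing and transfers are scheduled ahead of work queued on a default-priority compute stream:

    // Sketch: create a communication stream with the highest available priority.
    #include <cuda_runtime.h>

    static cudaStream_t create_high_priority_stream(void)
    {
        int least, greatest; // numerically lower value = higher priority
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        cudaStream_t stream;
        cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, greatest);
        return stream;
    }

Note that stream priority only affects how ready work is scheduled on the GPU; it does not preempt kernels that are already running.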
jpekkila | 22e01b7f1d | Rewrote partitioning code | 2020-04-19 23:23:23 +03:00
jpekkila | 4dd825f574 | Proper decomposition when using Morton order to partition the computational domain | 2020-04-19 22:50:26 +03:00
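The Morton-order commits in this range replace linear rank indexing with a Z-order curve when assigning subdomains to processes. The sketch below shows the standard 3D bit-interleaving such a decomposition is usually built on; it is illustrative, not necessarily the exact code in the repository.

    // Sketch of a Morton (Z-order) rank <-> 3D block index mapping. Interleaving
    // the bits keeps ranks that are close in ID close in space, which tends to
    // keep halo-exchange partners on the same node.
    #include <stdint.h>

    typedef struct { int x, y, z; } BlockIdx3;

    static BlockIdx3 morton_decode_3d(uint64_t rank)
    {
        BlockIdx3 p = {0, 0, 0};
        for (int bit = 0; bit < 21; ++bit) {           // 3 * 21 = 63 bits
            p.x |= (int)((rank >> (3 * bit + 0)) & 1) << bit;
            p.y |= (int)((rank >> (3 * bit + 1)) & 1) << bit;
            p.z |= (int)((rank >> (3 * bit + 2)) & 1) << bit;
        }
        return p;
    }

    static uint64_t morton_encode_3d(BlockIdx3 p)
    {
        uint64_t code = 0;
        for (int bit = 0; bit < 21; ++bit) {
            code |= (uint64_t)((p.x >> bit) & 1) << (3 * bit + 0);
            code |= (uint64_t)((p.y >> bit) & 1) << (3 * bit + 1);
            code |= (uint64_t)((p.z >> bit) & 1) << (3 * bit + 2);
        }
        return code;
    }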
jpekkila | ffb274e16f | Linking dynamic CUDA library instead of static (less prone to breaking since Astaroth does not have to be rebuilt when CUDA is updated) | 2020-04-19 22:33:01 +03:00
jpekkila | 8c210b3292 | 3D decomposition is now done using Morton order instead of linear indexing | 2020-04-19 22:31:57 +03:00
jpekkila | 9cd5909f5a | BWtest now calculates aggregate bandwidths per process instead of assuming that all neighbor communication can be done in parallel (within a node one can have parallel P2P connections to all neighbors and the total bandwidth is enormous, but this is not the case over the network, where we seem to have only one bidirectional socket) | 2020-04-09 20:28:04 +03:00
jpekkila | d4a84fb887 | Added a PCIe bandwidth test | 2020-04-09 20:04:54 +03:00
jpekkila | d6e74ee270 | Added missing files | 2020-04-09 19:24:55 +03:00
jpekkila | ed8a0bf7e6 | Added bwtest and benchmarkscript to CMakeLists | 2020-04-07 18:35:12 +03:00
jpekkila | fb41741d74 | Improvements to samples | 2020-04-07 17:58:47 +03:00
jpekkila | 427a3ac5d8 | Rewrote the previous implementation; it now fully works (verified) and gives the speedup we want. Communication latency is now completely hidden on at least two nodes (8 GPUs). Scaling looks very promising. | 2020-04-06 17:28:02 +03:00
jpekkila | 37f1c841a3 | Added functions for pinning memory that is sent over the network. TODO: pack to and from pinned memory selectively (currently P2P results are overwritten with data in pinned memory). | 2020-04-06 14:09:12 +03:00
jpekkila | cc9d3f1b9c | Found a workaround that gives good inter- and intra-node performance. The HPC-X MPI implementation does not know how to do P2P comm with pinned arrays (should be 80 GiB/s, measured 10 GiB/s), and inter-node comm is very slow without pinned arrays (should be 40 GiB/s, measured < 1 GiB/s). Made a proof-of-concept communicator that pins arrays that are sent to or received from another node. | 2020-04-05 20:15:32 +03:00
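The workaround described above pins buffers only for messages that cross a node boundary. A minimal sketch of that pin-on-demand idea using the standard CUDA host-registration API follows; the function name and the node test are placeholders, and a real implementation would cache registrations rather than pin and unpin around every send.

    // Sketch: page-lock an existing host buffer only when the message crosses a
    // node boundary, so intra-node P2P keeps using regular allocations.
    #include <cuda_runtime.h>
    #include <mpi.h>

    static void send_halo(void* buf, size_t bytes, int dst_rank, int dst_on_same_node)
    {
        if (!dst_on_same_node)
            cudaHostRegister(buf, bytes, cudaHostRegisterDefault); // pin for fast NIC transfers

        MPI_Send(buf, (int)bytes, MPI_BYTE, dst_rank, 0, MPI_COMM_WORLD);

        if (!dst_on_same_node)
            cudaHostUnregister(buf); // registration is expensive; cache it in practice
    }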
jpekkila | 88e53dfa21 | Added a little program for testing the bandwidths of different MPI comm styles on n nodes and processes | 2020-04-05 17:09:57 +03:00
jpekkila | fe14ae4665 | Added an alternative MPI implementation which uses one-sided communication | 2020-04-02 17:59:53 +03:00
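For the one-sided implementation mentioned above (later removed, see commit f1138b04ac), the general shape of an MPI window based halo exchange is sketched below; buffer names and counts are placeholders, not the repository's actual code.

    // Sketch: each rank exposes its receive halo as an MPI window and the
    // neighbor writes into it with MPI_Put, bracketed by fences.
    #include <mpi.h>

    void exchange_one_sided(double* recv_halo, const double* send_halo,
                            int count, int neighbor_rank, MPI_Comm comm)
    {
        MPI_Win win;
        MPI_Win_create(recv_halo, (MPI_Aint)count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);

        MPI_Win_fence(0, win);                       // open the exposure epoch
        MPI_Put(send_halo, count, MPI_DOUBLE,        // write our halo...
                neighbor_rank, 0, count, MPI_DOUBLE, // ...into the neighbor's window
                win);
        MPI_Win_fence(0, win);                       // complete all puts

        MPI_Win_free(&win);
    }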
jpekkila | d6d5920553 | Pulled improvements to device.cc from the benchmark branch to master | 2020-03-31 14:23:36 +03:00
Johannes Pekkila | 9b6d927cf1 | It might be better to benchmark MPI codes without synchronization because of the overhead of timing individual steps | 2020-03-31 12:37:54 +02:00
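The point above is that per-step timing requires per-step synchronization, which distorts the measurement. A minimal sketch of the alternative, assuming a placeholder integration_step function: barrier once, time the whole loop, and divide by the number of steps.

    // Sketch: time the whole loop between two barriers instead of timing and
    // synchronizing every step.
    #include <mpi.h>
    #include <stdio.h>

    void benchmark(void (*integration_step)(void), int nsteps, MPI_Comm comm)
    {
        MPI_Barrier(comm);                 // align ranks once before timing
        const double t0 = MPI_Wtime();

        for (int i = 0; i < nsteps; ++i)
            integration_step();            // no per-step barrier or timer here

        MPI_Barrier(comm);                 // make sure every rank has finished
        const double elapsed = MPI_Wtime() - t0;

        int rank;
        MPI_Comm_rank(comm, &rank);
        if (rank == 0)
            printf("avg time per step: %g s\n", elapsed / nsteps);
    }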
Johannes Pekkila | 742dcc2697 | Optimized MPI synchronization a bit | 2020-03-31 12:36:25 +02:00
jpekkila | 24e65ab02d | Set decompositions for some nprocs by hand | 2020-03-30 18:13:50 +03:00
jpekkila | 9065381b2a | Added the configuration used for benchmarking (not to be merged to master) | 2020-03-30 18:01:35 +03:00
jpekkila | 850b37e8c8 | Added a switch for generating strong and weak scaling results | 2020-03-30 17:56:12 +03:00