Commit Graph

198 Commits

Author SHA1 Message Date
jpekkila
316d44b843 Fixed an out-of-bounds error with auto-optimization (introduced in the last few commits) 2019-12-03 16:04:44 +02:00
jpekkila
7e4212ddd9 Enabled the generation of API hooks for calling DSL functions (it was interfering with compilation earlier) 2019-12-03 15:17:27 +02:00
jpekkila
5a6a3110df Reformatted 2019-12-03 15:14:26 +02:00
jpekkila
f14e35620c Now nvcc is used to compile kernels only. All host code, incl. device.cc, the MPI communication code and the rest, is now compiled with the host C++ compiler. This should work around an nvcc/MPI bug on Puhti. 2019-12-03 15:12:17 +02:00
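
A minimal sketch of the compilation split described in the commit above, under the assumption (file and function names are made up) that CUDA kernels live in their own .cu translation unit behind a plain C++ launcher declaration, so device.cc and the MPI code can go through the host C++ compiler while only the kernel file is handed to nvcc:

// ---- kernel_launcher.h: included by host-compiled sources (e.g. device.cc) ----
#pragma once
#include <cstddef>

// Plain C++ declaration: no <<<...>>> launch syntax, no CUDA types.
void launch_scale(float* d_data, std::size_t n, float factor);

// ---- kernel_launcher.cu: the only translation unit compiled by nvcc ----
#include <cstddef>
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, std::size_t n, float factor)
{
    const std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void launch_scale(float* d_data, std::size_t n, float factor)
{
    const int tpb    = 256;
    const int blocks = (int)((n + tpb - 1) / tpb);
    scale_kernel<<<blocks, tpb>>>(d_data, n, factor);
    cudaDeviceSynchronize(); // block until the kernel has finished
}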
jpekkila
8bffb2a1d0 Fixed ambiguous logic in acNodeStoreVertexBufferWithOffset; now the halos of arbitrary GPUs do not overwrite valid data from the computational domain of a neighboring GPU. Also disabled p2p transfers temporarily until I figure out a clean way to avoid cudaErrorPeerAccessAlreadyEnabled errors 2019-12-02 12:58:09 +02:00
jpekkila
0178d4788c The core library now links to the CXX MPI library instead of the C one 2019-11-27 14:51:49 +02:00
jpekkila
ab539a98d6 Replaced old deprecated instances of DCONST_INT with DCONST 2019-11-27 13:48:42 +02:00
jpekkila
1270332f48 Fixed a small mistake in the last merge 2019-11-27 11:58:14 +02:00
Johannes Pekkila
3eabf94f92 Merge branch 'master' of https://bitbucket.org/jpekkila/astaroth 2019-11-27 08:55:23 +01:00
jpekkila
5e3caf086e Device id is now properly set when using MPI and there are multiple visible GPUs per node 2019-11-26 16:54:56 +02:00
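
A hedged sketch of per-rank device selection along the lines of the commit above (illustrative only, not the library's actual code): derive a node-local rank with MPI_Comm_split_type and map it onto the GPUs visible on that node.

#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Ranks that share a node end up in the same communicator.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    int device_count;
    cudaGetDeviceCount(&device_count);
    const int device_id = node_rank % device_count; // one GPU per node-local rank
    cudaSetDevice(device_id);

    printf("node-local rank %d -> device %d\n", node_rank, device_id);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}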
jpekkila
0b0ccd697a Added some explicit casts in get_neighbor (MPI) to fix warnings raised when compiling with older gcc 2019-11-20 10:18:10 +02:00
Johannes Pekkila
981331e7d7 Benchmark results now written out to a file 2019-10-24 15:53:08 +02:00
Johannes Pekkila
4ffde83215 Set default values for benchmarking 2019-10-24 15:22:47 +02:00
Johannes Pekkila
8894b7c7d6 Added a function for getting the pid of a neighboring process when decomposing in 3D 2019-10-23 19:26:35 +02:00
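
A sketch of the idea behind such a neighbor lookup (the row-major rank layout and the names below are assumptions, not necessarily what the library uses): place the processes on an nx x ny x nz grid and wrap the coordinates periodically when stepping to a neighbor.

#include <cstdio>

struct Int3 {
    int x, y, z;
};

// Non-negative modulo for periodic wrapping.
static int mod(const int a, const int m) { return ((a % m) + m) % m; }

// Rank -> 3D coordinates, assuming a row-major process layout.
static Int3 pid_to_coords(const int pid, const Int3 decomp)
{
    return Int3{pid % decomp.x, (pid / decomp.x) % decomp.y,
                pid / (decomp.x * decomp.y)};
}

// Coordinates (wrapped periodically) -> rank.
static int coords_to_pid(const Int3 c, const Int3 decomp)
{
    return mod(c.x, decomp.x) + mod(c.y, decomp.y) * decomp.x +
           mod(c.z, decomp.z) * decomp.x * decomp.y;
}

// Pid of the neighbor of `pid` in direction `dir` (components in {-1, 0, 1}).
static int neighbor_pid(const int pid, const Int3 dir, const Int3 decomp)
{
    const Int3 c = pid_to_coords(pid, decomp);
    return coords_to_pid(Int3{c.x + dir.x, c.y + dir.y, c.z + dir.z}, decomp);
}

int main()
{
    const Int3 decomp{2, 2, 2}; // eight processes decomposed as 2 x 2 x 2
    printf("+x neighbor of pid 0: %d\n", neighbor_pid(0, Int3{1, 0, 0}, decomp));
    printf("-z neighbor of pid 0: %d\n", neighbor_pid(0, Int3{0, 0, -1}, decomp));
    return 0;
}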
Johannes Pekkila
474bdf185d Cleaned up the MPI solution for 3D decomp test 2019-10-23 12:33:46 +02:00
Johannes Pekkila
1d81333ff7 More concurrent kernels and MPI comm 2019-10-23 12:07:23 +02:00
Johannes Pekkila
04867334e7 Full integration step with MPI comms 2019-10-22 19:59:15 +02:00
Johannes Pekkila
870cd91b5f Added the final MPI solution for the benchmark tests: RDMA is now used and I don't think we can go much faster with the current decomposition scheme. To get better scaling, we would probably have to switch to a 3D decomposition instead of the current simple 1D decomposition 2019-10-22 19:28:35 +02:00
jpekkila
3d7ad7c8f2 Code cleanup 2019-10-22 15:38:34 +03:00
jpekkila
64221c218d Made some warnings go away 2019-10-22 15:03:55 +03:00
Johannes Pekkila
e4a7cdcf1d Added functions for packing and unpacking data on the device 2019-10-22 13:48:47 +02:00
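
A minimal sketch of device-side packing in the spirit of the commit above (kernel and struct names are placeholders): copy a 3D sub-block of a larger buffer into a contiguous staging buffer that can be handed to MPI as a single message; unpacking is the mirror image, reading the contiguous buffer and writing back to the strided positions.

#include <cuda_runtime.h>

struct Dims {
    int x, y, z;
};

__global__ void pack_block(const float* __restrict__ src, const Dims src_dims,
                           const Dims offset, const Dims block,
                           float* __restrict__ dst)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    const int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= block.x || j >= block.y || k >= block.z)
        return;

    // Strided index into the full buffer, contiguous index into the staging buffer.
    const int src_idx = (offset.x + i) + (offset.y + j) * src_dims.x +
                        (offset.z + k) * src_dims.x * src_dims.y;
    const int dst_idx = i + j * block.x + k * block.x * block.y;
    dst[dst_idx]      = src[src_idx];
}

void pack_halo(const float* d_src, const Dims src_dims, const Dims offset,
               const Dims block, float* d_dst, const cudaStream_t stream)
{
    const dim3 tpb(32, 4, 4);
    const dim3 bpg((block.x + tpb.x - 1) / tpb.x, (block.y + tpb.y - 1) / tpb.y,
                   (block.z + tpb.z - 1) / tpb.z);
    pack_block<<<bpg, tpb, 0, stream>>>(d_src, src_dims, offset, block, d_dst);
}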
Johannes Pekkila
915e1c7c14 Trying to overlap MPI communication with computation of boundary conditions. However, NVIDIA seemed to forget one important detail in the documentation for CUDA-aware MPI: it looks like CUDA streams are not supported with CUDA-aware MPI communication. So in the end the fastest solution might be to use old-school gpu->cpu->cpu->gpu MPI communication after all 2019-10-21 15:50:53 +02:00
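
A hedged sketch of the old-school gpu->cpu->cpu->gpu exchange mentioned above (names are placeholders, not the library's API): stage the packed halo through pinned host buffers with asynchronous copies, exchange it with nonblocking MPI, and copy the received halo back to the device.

#include <cuda_runtime.h>
#include <mpi.h>

// h_send/h_recv are pinned host staging buffers (cudaMallocHost), `count` floats each.
void exchange_halo_staged(const float* d_send, float* d_recv, float* h_send,
                          float* h_recv, const int count, const int send_to,
                          const int recv_from, const cudaStream_t stream)
{
    // GPU -> CPU
    cudaMemcpyAsync(h_send, d_send, count * sizeof(float), cudaMemcpyDeviceToHost,
                    stream);
    cudaStreamSynchronize(stream); // MPI must not touch the buffer before the copy completes

    // CPU -> CPU (the neighboring process does the same in the opposite direction)
    MPI_Request reqs[2];
    MPI_Irecv(h_recv, count, MPI_FLOAT, recv_from, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(h_send, count, MPI_FLOAT, send_to, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // CPU -> GPU
    cudaMemcpyAsync(d_recv, h_recv, count * sizeof(float), cudaMemcpyHostToDevice,
                    stream);
}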
jpekkila
f120343110 Bugfix: peer access was not disabled when Node was destroyed, leading to cudaErrorPeerAccessAlreadyEnabled error when creating new Nodes 2019-10-21 16:23:24 +03:00
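
A minimal sketch of the peer-access bookkeeping the fix above implies (illustrative only): enable access per device pair when a Node is created and disable it again on teardown, tolerating the already-enabled/not-enabled error codes so that repeated setup does not fail with cudaErrorPeerAccessAlreadyEnabled.

#include <cuda_runtime.h>

void enable_peer_access(const int device, const int peer)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, device, peer);
    if (!can_access)
        return;

    cudaSetDevice(device);
    if (cudaDeviceEnablePeerAccess(peer, 0) == cudaErrorPeerAccessAlreadyEnabled)
        cudaGetLastError(); // already enabled: clear the sticky error and carry on
}

void disable_peer_access(const int device, const int peer)
{
    cudaSetDevice(device);
    if (cudaDeviceDisablePeerAccess(peer) == cudaErrorPeerAccessNotEnabled)
        cudaGetLastError(); // nothing to disable: clear the sticky error
}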
Johannes Pekkila
7b475b6dee Better MPI synchronization 2019-10-18 11:50:22 +02:00
Johannes Pekkila
155d369888 MPI communication now 10x faster 2019-10-17 22:39:57 +02:00
jpekkila
26bbfa089d Better multi-node communication: fire and forget. 2019-10-17 18:17:37 +03:00
jpekkila
3d852e5082 Added timing to the MPI benchmark 2019-10-17 17:43:54 +03:00
jpekkila
588a94c772 Added more MPI stuff. Now multi-node GPU-GPU communication with GPUDirect RDMA should work. Device memory is also now allocated as unified memory by default, as this makes MPI communication simpler if RDMA is not supported. This does not affect Astaroth in any other way, since different devices use different portions of the memory space and we continue managing memory transfers manually. 2019-10-17 16:09:05 +03:00
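
A hedged illustration of the allocation policy described above (the function below is made up): allocating with cudaMallocManaged gives a pointer that is valid on both host and device, so MPI can fall back to host-side access when GPUDirect RDMA is unavailable, while a prefetch keeps the pages resident on the owning GPU in the normal case.

#include <cuda_runtime.h>

float* alloc_vertex_buffer(const size_t count, const int device_id)
{
    float* buf = NULL;
    cudaMallocManaged(&buf, count * sizeof(float));

    // Keep the buffer resident on its owning device; pages migrate only if
    // something (e.g. MPI without RDMA support) touches them from the host.
    cudaMemPrefetchAsync(buf, count * sizeof(float), device_id, 0);
    return buf;
}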
jpekkila
0e88d6c339 Marked some internal functions static 2019-10-17 14:41:44 +03:00
jpekkila
f1e988ba6a Added stuff for the device layer for testing GPU-GPU MPI. This is a quick and dirty solution which is primarily meant for benchmarking/verification. Figuring out what the MPI interface should look like is more challenging and is not the priority right now 2019-10-17 14:40:53 +03:00
jpekkila
65a2d47ef7 Made grid.cu (multi-node) compile without errors. Not used yet, though. 2019-10-17 13:03:42 +03:00
jpekkila
0865f0499b Various improvements to the MPI-GPU implementation, but linking MPI libraries with both the host C-project and the core library seems to be a major pain. Currently the communication is done via gpu->cpu->cpu->gpu. 2019-10-15 19:32:16 +03:00
jpekkila
113be456d6 Undeprecated the wrong function in commit b693c8a 2019-10-15 18:11:07 +03:00
jpekkila
1ca089c163 New cmake option: MPI_ENABLED. Enables MPI functions on the device layer 2019-10-15 17:57:53 +03:00
jpekkila
b693c8adb4 Undeprecated acDeviceLoadMesh and acDeviceStoreMesh, these are actually very nice to have 2019-10-15 16:12:31 +03:00
jpekkila
08188f3f5b is_valid is now consistently overloaded (parameter passed as a reference). Older CUDA compilers complained about this. 2019-10-14 21:18:21 +03:00
jpekkila
08f155cbec Finetuning some error checks 2019-10-07 20:40:32 +03:00
jpekkila
5d4f47c3d2 Added overloads for vector in-place addition and subtraction 2019-10-07 19:40:54 +03:00
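
A minimal sketch of what such in-place vector overloads can look like (the AcReal3 definition here is a stand-in, not the library's exact type): += and -= defined as host/device functions so they work both in kernels and in host code.

#include <cstdio>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

typedef struct {
    double x, y, z;
} AcReal3;

static HOST_DEVICE inline AcReal3& operator+=(AcReal3& lhs, const AcReal3& rhs)
{
    lhs.x += rhs.x;
    lhs.y += rhs.y;
    lhs.z += rhs.z;
    return lhs;
}

static HOST_DEVICE inline AcReal3& operator-=(AcReal3& lhs, const AcReal3& rhs)
{
    lhs.x -= rhs.x;
    lhs.y -= rhs.y;
    lhs.z -= rhs.z;
    return lhs;
}

int main()
{
    AcReal3 a       = {1, 2, 3};
    const AcReal3 b = {0.5, 0.5, 0.5};
    a += b;
    a -= b;
    printf("%g %g %g\n", a.x, a.y, a.z); // prints 1 2 3 again
    return 0;
}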
jpekkila
ba49e7e400 Replaced deprecated DCONST_INT calls with overloaded DCONST() 2019-10-07 19:40:27 +03:00
jpekkila
66cfcefb34 More error checks 2019-10-07 17:00:23 +03:00
jpekkila
0e1d1b9fb4 Some optimizations for DSL compilation. Also a new feature: in-place addition and subtraction (+= and -=) are now allowed 2019-10-07 16:33:24 +03:00
jpekkila
f7c079be2a Removed everything unnecessary from integration.cuh. Now all derivatives etc are available in a standard library header (acc/stdlib/stdderiv.h) 2019-10-07 15:47:33 +03:00
jpekkila
9a16c79ce6 Renamed all uniform-related functions for consistency with the DSL, f.ex. loadScalarConstant -> loadScalarUniform 2019-10-01 17:12:20 +03:00
jpekkila
2c8c49ee24 Removed or updated some old .gitignore files 2019-09-24 17:50:41 +03:00
jpekkila
e4eea7db83 Added support for Volta GPUs 2019-09-24 17:19:45 +03:00
jpekkila
3bb6ca1712 The Astaroth Code Compiler (acc) is now built with cmake. Additionally, make is now used to generate the CUDA headers from DSL sources. The headers are also properly regenerated whenever a DSL file has been changed. With this commit, the DSL is now seamlessly integrated into the library and we no longer need complicated scripts to figure out the correct files. The current workflow for using custom DSL sources is to pass the DSL module directory to cmake, f.ex. cmake -DDSL_MODULE_DIR=/acc/mhd_solver. Note that the path must be absolute or given relative to the CMakeLists.txt directory; f.ex. cd build && cmake -DDSL_MODULE_DIR=../acc/mhd_solver does not work. CMake then takes all DSL files in that directory and handles the rest. 2019-09-18 17:28:29 +03:00
jpekkila
bce3e4de03 Made warnings about unused device functions go away 2019-09-18 16:58:04 +03:00
jpekkila
021e5f3774 Renamed NUM_STREAM_TYPES -> NUM_STREAMS 2019-09-12 15:48:38 +03:00
jpekkila
53230c9b61 Added error checking and more flexibility to the new acDeviceLoadScalarArray function 2019-09-05 19:56:04 +03:00
jpekkila
263a1d23a3 Added a function for loading ScalarArrays to the GPU 2019-09-05 16:35:08 +03:00
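
A hedged sketch of what a ScalarArray load with error checking might look like (the types and the signature below are illustrative, not the actual acDeviceLoadScalarArray declaration): validate the handle and the bounds on the host, then copy the data to the device asynchronously and report failures to the caller.

#include <cuda_runtime.h>
#include <stddef.h>

typedef double AcReal; // stand-in; the library picks single or double precision at build time

typedef struct {
    AcReal* data;  // device pointer
    size_t length; // number of elements allocated on the device
} ScalarArray;

static int load_scalar_array(const ScalarArray dst, const AcReal* host_src,
                             const size_t start, const size_t count,
                             const cudaStream_t stream)
{
    if (!dst.data || !host_src)
        return -1; // invalid handle or source
    if (start + count > dst.length)
        return -1; // refuse out-of-bounds writes

    const cudaError_t err = cudaMemcpyAsync(&dst.data[start], host_src,
                                            count * sizeof(AcReal),
                                            cudaMemcpyHostToDevice, stream);
    return err == cudaSuccess ? 0 : -1;
}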