Commit Graph

78 Commits

Author SHA1 Message Date
jpekkila
5e3caf086e Device id is now properly set when using MPI and there are multiple visible GPUs per node 2019-11-26 16:54:56 +02:00
jpekkila
0b0ccd697a Added some explicit casts in get_neighbor (MPI) to fix warnings raised when compiling with older gcc 2019-11-20 10:18:10 +02:00
Johannes Pekkila
8894b7c7d6 Added a function for getting pid of a neighboring process when decomposing in 3D 2019-10-23 19:26:35 +02:00
Johannes Pekkila
474bdf185d Cleaned up the MPI solution for 3D decomp test 2019-10-23 12:33:46 +02:00
Johannes Pekkila
1d81333ff7 More concurrent kernels and MPI comm 2019-10-23 12:07:23 +02:00
Johannes Pekkila
04867334e7 Full integration step with MPI comms 2019-10-22 19:59:15 +02:00
Johannes Pekkila
870cd91b5f Added the final MPI solution for the benchmark tests: RDMA is now used and I don't think we can go much faster with the current decomposition scheme. To get better scaling, we would probably have to switch to a 3D decomposition instead of the current simple 1D decomp 2019-10-22 19:28:35 +02:00
jpekkila
3d7ad7c8f2 Code cleanup 2019-10-22 15:38:34 +03:00
jpekkila
64221c218d Made some warnings go away 2019-10-22 15:03:55 +03:00
Johannes Pekkila
e4a7cdcf1d Added functions for packing and unpacking data on the device 2019-10-22 13:48:47 +02:00
Johannes Pekkila
915e1c7c14 Trying to overlap MPI communication with computation of boundary conditions. However, NVIDIA seemed to forget one important detail in the documentation for CUDA-aware MPI: it looks like CUDA streams are not supported with CUDA-aware MPI communication. So in the end the fastest solution might be to use old-school gpu->cpu->cpu->gpu MPI communication after all 2019-10-21 15:50:53 +02:00
Johannes Pekkila
7b475b6dee Better MPI synchronization 2019-10-18 11:50:22 +02:00
Johannes Pekkila
155d369888 MPI communication now 10x faster 2019-10-17 22:39:57 +02:00
jpekkila
26bbfa089d Better multi-node communication: fire and forget. 2019-10-17 18:17:37 +03:00
jpekkila
3d852e5082 Added timing to the MPI benchmark 2019-10-17 17:43:54 +03:00
jpekkila
588a94c772 Added more MPI stuff. Now multi-node GPU-GPU communication with GPUDirect RDMA should work. Also, device memory is now allocated in unified memory by default, as this makes MPI communication simpler if RDMA is not supported. This does not affect Astaroth in any other way, since different devices use different portions of the memory space and we continue managing memory transfers manually. 2019-10-17 16:09:05 +03:00
jpekkila
f1e988ba6a Added stuff for the device layer for testing GPU-GPU MPI. This is a quick and dirty solution which is primarily meant for benchmarking/verification. Figuring out what the MPI interface should look like is more challenging and is not the priority right now 2019-10-17 14:40:53 +03:00
jpekkila
0865f0499b Various improvements to the MPI-GPU implementation, but linking MPI libraries with both the host C-project and the core library seems to be a major pain. Currently the communication is done via gpu->cpu->cpu->gpu. 2019-10-15 19:32:16 +03:00
jpekkila
113be456d6 Undeprecated the wrong function in commit b693c8a 2019-10-15 18:11:07 +03:00
jpekkila
1ca089c163 New cmake option: MPI_ENABLED. Enables MPI functions on the device layer 2019-10-15 17:57:53 +03:00
jpekkila
b693c8adb4 Undeprecated acDeviceLoadMesh and acDeviceStoreMesh, these are actually very nice to have 2019-10-15 16:12:31 +03:00
jpekkila
08f155cbec Finetuning some error checks 2019-10-07 20:40:32 +03:00
jpekkila
ba49e7e400 Replaced deprecated DCONST_INT calls with overloaded DCONST() 2019-10-07 19:40:27 +03:00
jpekkila
66cfcefb34 More error checks 2019-10-07 17:00:23 +03:00
jpekkila
9a16c79ce6 Renamed all references to uniforms, f.ex. loadScalarConstant -> loadScalarUniform (for consistency with the DSL) 2019-10-01 17:12:20 +03:00
jpekkila
bce3e4de03 Made warnings about unused device functions go away 2019-09-18 16:58:04 +03:00
jpekkila
021e5f3774 Renamed NUM_STREAM_TYPES -> NUM_STREAMS 2019-09-12 15:48:38 +03:00
jpekkila
53230c9b61 Added error checking and more flexibility to the new acDeviceLoadScalarArray function 2019-09-05 19:56:04 +03:00
jpekkila
263a1d23a3 Added a function for loading ScalarArrays to the GPU 2019-09-05 16:35:08 +03:00
jpekkila
9e57aba9b7 New feature: ScalarArray. ScalarArrays are read-only 1D arrays containing max(mx, max(my, mz)) elements. ScalarArray is a new type of uniform and can be used for storing f.ex. forcing profiles. The DSL now also supports complex numbers and some basic arithmetic (exp, multiplication) 2019-09-02 21:26:57 +03:00
jpekkila
6ea02fa28e DSL now 'feature complete' with respect to what I had in mind before the summer. Users can now create multiple kernels and the library functions are generated automatically for them. The generated library functions are of the form acDeviceKernel_<name> and acNodeKernel_<name>. More features are needed though. The next features to be added at some point are 1D and 2D device constant arrays in order to support profiles for f.ex. forcing. 2019-08-27 18:19:20 +03:00
jpekkila
20138263f4 The previous attempt (dsl_feature_completeness_2019-08-23) to enable arbitrary kernel functions was a failure: we get a significant performance loss (25-100%) if step_number is not passed as a template parameter to the integration kernel. Apparently the CUDA compiler cannot perform some optimizations if there is an if/else construct in a performance-critical part that cannot be evaluated at compile time. This branch keeps step_number as a template parameter but takes the rest of the user parameters as uniforms (dt is no longer passed as a function parameter but as a uniform via the DSL instead). 2019-08-27 17:36:33 +03:00
jpekkila
39dcda4a04 Made warnings about unused functions go away (this is intended functionality: not all programs will use all types of device constants, so the warning is unnecessary) 2019-08-21 14:28:46 +03:00
jpekkila
d801ebdd41 Now parameters and vertexbuffers (fields) can be declared with the DSL only. TODO: translation from the DSL header to C 2019-08-19 17:35:03 +03:00
jpekkila
787363226b Added functions for loading int, int3, scalar and vector constants to the device layer (acDeviceLoad...Constant) 2019-08-19 15:28:16 +03:00
jpekkila
41805dcb68 Added some error checking for the case where user supplies an incomplete meshinfo to acDeviceLoadMeshInfo 2019-08-19 15:17:51 +03:00
jpekkila
598799d7c3 Added a new function to the device interface: acDeviceLoadMeshInfo 2019-08-19 15:14:00 +03:00
jpekkila
6d4d53342e Removed old comments 2019-08-15 11:14:52 +03:00
jpekkila
d5b2e5bb42 Added placeholders for new built-in variables in the DSL. Also overloads to DCONST_INT etc. Naming still pending and old DCONST_REAL etc calls still work. 2019-08-12 14:05:35 +03:00
jpekkila
8bbb2cd5df Now prints device info before trying to run the dummy kernel 2019-08-12 09:46:37 +03:00
jpekkila
daee456660 Merge branch 'cmakelist_rewrite_and_C_API_conformity_07-26' into node_device_interface_revision_07-23 2019-08-06 17:57:30 +03:00
jpekkila
abf4815174 Merge branch 'master' into cmakelist_rewrite_and_C_API_conformity_07-26 2019-08-06 17:53:53 +03:00
jpekkila
5870081645 Split kernels.cuh into boundconds.cuh, integration.cuh and reductions.cuh 2019-08-06 17:50:41 +03:00
jpekkila
3726847683 Made globalGridN and d_multigpu_offsets built-in parameters. Note the renaming from globalGrid.n to globalGridN. 2019-08-06 16:39:15 +03:00
jpekkila
fa6e1116cb The interface revision now actually works. The issue was incorrect order of src and dst indices when storing the mesh. 2019-08-05 17:26:05 +03:00
jpekkila
5f2378e91b Now compiles (does not work though) 2019-08-02 15:15:18 +03:00
jpekkila
92376588ba Merge branch 'master' into cmakelist_rewrite_and_C_API_conformity_07-26 2019-07-31 20:12:22 +03:00
jpekkila
fb0610c1ba Intermediate changes to the revised node interface 2019-07-31 20:04:39 +03:00
jpekkila
0a5d025172 Formatting 2019-07-31 19:08:16 +03:00
jpekkila
9b7f4277fc Fixed errors in device.cu 2019-07-31 19:07:26 +03:00