Commit Graph

78 Commits

Author SHA1 Message Date
jpekkila
5e3caf086e Device id is now properly set when using MPI and there are multiple visible GPUs per node 2019-11-26 16:54:56 +02:00
jpekkila
0b0ccd697a Added some explicit casts in get_neighbor (MPI) to fix warnings raised when compiling with older gcc 2019-11-20 10:18:10 +02:00
Johannes Pekkila
8894b7c7d6 Added a function for getting pid of a neighboring process when decomposing in 3D 2019-10-23 19:26:35 +02:00
Johannes Pekkila
474bdf185d Cleaned up the MPI solution for 3D decomp test 2019-10-23 12:33:46 +02:00
Johannes Pekkila
1d81333ff7 More concurrent kernels and MPI comm 2019-10-23 12:07:23 +02:00
Johannes Pekkila
04867334e7 Full integration step with MPI comms 2019-10-22 19:59:15 +02:00
Johannes Pekkila
870cd91b5f Added the final MPI solution for the benchmark tests: RDMA is now used and I don't think we can go much faster with the current decomposition scheme. To get better scaling, we would probably have to switch to a 3D decomposition instead of the current simple 1D decomp 2019-10-22 19:28:35 +02:00
jpekkila
3d7ad7c8f2 Code cleanup 2019-10-22 15:38:34 +03:00
jpekkila
64221c218d Made some warnings go away 2019-10-22 15:03:55 +03:00
Johannes Pekkila
e4a7cdcf1d Added functions for packing and unpacking data on the device 2019-10-22 13:48:47 +02:00
Johannes Pekkila
915e1c7c14 Trying to overlap MPI communication with computation of boundary conditions. However, NVIDIA seemed to forget one important detail in the documentation for CUDA-aware MPI: it looks like CUDA streams are not supported with CUDA-aware MPI communication. So in the end the fastest solution might be to use old-school gpu->cpu->cpu->gpu MPI communication after all 2019-10-21 15:50:53 +02:00
Johannes Pekkila
7b475b6dee Better MPI synchronization 2019-10-18 11:50:22 +02:00
Johannes Pekkila
155d369888 MPI communication now 10x faster 2019-10-17 22:39:57 +02:00
jpekkila
26bbfa089d Better multi-node communication: fire and forget. 2019-10-17 18:17:37 +03:00
jpekkila
3d852e5082 Added timing to the MPI benchmark 2019-10-17 17:43:54 +03:00
jpekkila
588a94c772 Added more MPI stuff. Now multi-node GPU-GPU communication with GPUDirect RDMA should work. Also, device memory is now allocated in unified memory by default, as this makes MPI communication simpler if RDMA is not supported. This does not affect Astaroth in any other way, since different devices use different portions of the memory space and we continue managing memory transfers manually. 2019-10-17 16:09:05 +03:00
jpekkila
f1e988ba6a Added stuff for the device layer for testing GPU-GPU MPI. This is a quick and dirty solution which is primarily meant for benchmarking/verification. Figuring out what the MPI interface should look like is more challenging and is not the priority right now 2019-10-17 14:40:53 +03:00
jpekkila
0865f0499b Various improvements to the MPI-GPU implementation, but linking MPI libraries with both the host C-project and the core library seems to be a major pain. Currently the communication is done via gpu->cpu->cpu->gpu. 2019-10-15 19:32:16 +03:00
jpekkila
113be456d6 Undeprecated the wrong function in commit b693c8a 2019-10-15 18:11:07 +03:00
jpekkila
1ca089c163 New cmake option: MPI_ENABLED. Enables MPI functions on the device layer 2019-10-15 17:57:53 +03:00
jpekkila
b693c8adb4 Undeprecated acDeviceLoadMesh and acDeviceStoreMesh, these are actually very nice to have 2019-10-15 16:12:31 +03:00
jpekkila
08f155cbec Finetuning some error checks 2019-10-07 20:40:32 +03:00
jpekkila
ba49e7e400 Replaced deprecated DCONST_INT calls with overloaded DCONST() 2019-10-07 19:40:27 +03:00
jpekkila
66cfcefb34 More error checks 2019-10-07 17:00:23 +03:00
jpekkila
9a16c79ce6 Renamed all references to uniforms, f.ex. loadScalarConstant -> loadScalarUniform (for consistency with the DSL) 2019-10-01 17:12:20 +03:00
jpekkila
bce3e4de03 Made warnings about unused device functions go away 2019-09-18 16:58:04 +03:00
jpekkila
021e5f3774 Renamed NUM_STREAM_TYPES -> NUM_STREAMS 2019-09-12 15:48:38 +03:00
jpekkila
53230c9b61 Added error checking and more flexibility to the new acDeviceLoadScalarArray function 2019-09-05 19:56:04 +03:00
jpekkila
263a1d23a3 Added a function for loading ScalarArrays to the GPU 2019-09-05 16:35:08 +03:00
jpekkila
9e57aba9b7 New feature: ScalarArray. ScalarArrays are read-only 1D arrays containing max(mx, max(my, mz)) elements. ScalarArray is a new type of uniform and can be used for storing f.ex. forcing profiles. The DSL now also supports complex numbers and some basic arithmetic (exp, multiplication) 2019-09-02 21:26:57 +03:00
jpekkila
6ea02fa28e DSL now 'feature complete' with respect to what I had in mind before the summer. Users can now create multiple kernels and the library functions are generated automatically for them. The generated library functions are of the form acDeviceKernel_<name> and acNodeKernel_<name>. More features are needed though. The next features to be added at some point are 1D and 2D device constant arrays in order to support profiles for f.ex. forcing. 2019-08-27 18:19:20 +03:00
jpekkila
20138263f4 The previous attempt (dsl_feature_completeness_2019-08-23) to enable arbitrary kernel functions was a failure: we get a significant performance loss (25-100%) if step_number is not passed as a template parameter to the integration kernel. Apparently the CUDA compiler cannot perform some optimizations if there is an if/else construct in a performance-critical part that cannot be evaluated at compile time. This branch keeps step_number as a template parameter but takes the rest of the user parameters as uniforms (dt is no longer passed as a function parameter but as a uniform via the DSL instead). 2019-08-27 17:36:33 +03:00
jpekkila
39dcda4a04 Made warnings about unused functions go away (this is intended functionality: not all programs will use all types of device constants, so the warning is unnecessary) 2019-08-21 14:28:46 +03:00
jpekkila
d801ebdd41 Now parameters and vertexbuffers (fields) can be declared with the DSL only. TODO: translation from the DSL header to C 2019-08-19 17:35:03 +03:00
jpekkila
787363226b Added functions for loading int, int3, scalar and vector constants to the device layer (acDeviceLoad...Constant) 2019-08-19 15:28:16 +03:00
jpekkila
41805dcb68 Added some error checking for the case where user supplies an incomplete meshinfo to acDeviceLoadMeshInfo 2019-08-19 15:17:51 +03:00
jpekkila
598799d7c3 Added a new function to the device interface: acDeviceLoadMeshInfo 2019-08-19 15:14:00 +03:00
jpekkila
6d4d53342e Removed old comments 2019-08-15 11:14:52 +03:00
jpekkila
d5b2e5bb42 Added placeholders for new built-in variables in the DSL. Also overloads to DCONST_INT etc. Naming still pending and old DCONST_REAL etc calls still work. 2019-08-12 14:05:35 +03:00
jpekkila
8bbb2cd5df Now prints device info before trying to run the dummy kernel 2019-08-12 09:46:37 +03:00
jpekkila
daee456660 Merge branch 'cmakelist_rewrite_and_C_API_conformity_07-26' into node_device_interface_revision_07-23 2019-08-06 17:57:30 +03:00
jpekkila
abf4815174 Merge branch 'master' into cmakelist_rewrite_and_C_API_conformity_07-26 2019-08-06 17:53:53 +03:00
jpekkila
5870081645 Split kernels.cuh into boundconds.cuh, integration.cuh and reductions.cuh 2019-08-06 17:50:41 +03:00
jpekkila
3726847683 Made globalGridN and d_multigpu_offsets built-in parameters. Note the renaming from globalGrid.n to globalGridN. 2019-08-06 16:39:15 +03:00
jpekkila
fa6e1116cb The interface revision now actually works. The issue was incorrect order of src and dst indices when storing the mesh. 2019-08-05 17:26:05 +03:00
jpekkila
5f2378e91b Now compiles (does not work though) 2019-08-02 15:15:18 +03:00
jpekkila
92376588ba Merge branch 'master' into cmakelist_rewrite_and_C_API_conformity_07-26 2019-07-31 20:12:22 +03:00
jpekkila
fb0610c1ba Intermediate changes to the revised node interface 2019-07-31 20:04:39 +03:00
jpekkila
0a5d025172 Formatting 2019-07-31 19:08:16 +03:00
jpekkila
9b7f4277fc Fixed errors in device.cu 2019-07-31 19:07:26 +03:00