Moved explanations and comments to the beginning of astaroth.cu. No code changes.

jpekkila
2019-07-05 15:39:52 +03:00
parent d87eb36f5a
commit ce8fe53f91


@@ -21,7 +21,106 @@
* @file
* \brief Multi-GPU implementation.
*
* Detailed info.
%JP: The old way of computing boundary conditions conflicts with the
way we have to do things with multiple GPUs.
The older approach relied on unified memory, which represented the whole
memory area as one huge mesh instead of several smaller ones. However, unified memory
in its current state is meant more for quick prototyping, when performance is not an issue.
Getting the CUDA driver to migrate data intelligently across GPUs is much harder
than managing the memory explicitly.
In this new approach, I have simplified the multi- and single-GPU layers significantly.
Quick rundown:
New struct: Grid. There are two global variables, "grid" and "subgrid", which
contain the extents of the whole simulation domain and the decomposed grids,
respectively. To simplify things, we require that each GPU is assigned the same amount of
work; therefore each GPU in the node is assigned a "subgrid.m"-sized block of data to
work with.
The whole simulation domain is decomposed with respect to the z dimension.
For example, if the grid contains (nx, ny, nz) vertices, then the subgrids
contain (nx, ny, nz / num_devices) vertices.
A local index (i, j, k) in some subgrid can be mapped to the global grid with
global idx = (i, j, k + device_id * subgrid.n.z)
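For illustration, a minimal sketch of this mapping as a helper function
(hypothetical, not part of this file; assumes the static "subgrid" variable
described above):

    static int3
    local_to_global(const int3 local, const int device_id)
    {
        // Subgrids are stacked along z, so only the z component is offset
        return (int3){local.x, local.y, local.z + device_id * subgrid.n.z};
    }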
Terminology:
- Single-GPU function: a function defined on the single-GPU layer (device.cu)
Changes required to this commented code block:
- The thread block dimensions (tpb) are no longer passed to the kernel here but in
device.cu instead. Same holds for any complex index calculations. Instead, the local
coordinates should be passed as an int3 type without having to consider how the data is
actually laid out in device memory
- The unified memory buffer no longer exists (d_buffer). Instead, we have an opaque
handle of type "Device" which should be passed to single-GPU functions. In this file, all
devices are stored in a global array "devices[num_devices]".
- Every single-GPU function is executed asynchronously by default such that we
can optimize Astaroth by executing memory transactions concurrently with
computation. Therefore a StreamType should be passed as a parameter to single-GPU functions.
Refresher: CUDA calls are non-blocking with respect to the host when a stream is
explicitly passed as a parameter, and commands issued to different streams can be
processed in parallel/concurrently.
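A generic CUDA sketch of this pattern (not code from this file; "kernel", "d_data",
"h_data", "bytes" and the launch dimensions bpg/tpb are placeholders):

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    kernel<<<bpg, tpb, 0, stream>>>(d_data);         // returns to the host immediately
    cudaMemcpyAsync(h_data, d_data, bytes,           // queued on the same stream,
                    cudaMemcpyDeviceToHost, stream); // so it runs after the kernel
    cudaStreamSynchronize(stream);                   // block only when the data is needed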
Note on periodic boundaries (might be helpful when implementing other boundary conditions):
With multiple GPUs, periodic boundary conditions applied on indices ranging from
(0, 0, STENCIL_ORDER/2) to (subgrid.m.x, subgrid.m.y, subgrid.m.z - STENCIL_ORDER/2)
on a single device are "local", in the sense that they can be computed without
having to exchange data with neighboring GPUs. Special care is needed only for transferring
the data to the front and back plates outside this range. In the solution we use
here, we solve the local boundaries first, and then just exchange the front and back plates
in a "ring", like so
device_id
(n) <-> 0 <-> 1 <-> ... <-> n <-> (0)
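A sketch of the exchange implied by this picture (in the actual code below, the
neighbor-to-neighbor part lives in acSynchronizeHalos() and the wrap-around between
the first and last device in acBoundcondStep(); the modulo here merely folds both
into one loop for illustration):

    const int num_vertices = subgrid.m.x * subgrid.m.y * NGHOST;
    for (int i = 0; i < num_devices; ++i) {
        const int next = (i + 1) % num_devices; // closes the ring: n <-> (0)
        // Back plate of device i -> front halo of device next
        copyMeshDeviceToDevice(devices[i], STREAM_PRIMARY, (int3){0, 0, subgrid.n.z},
                               devices[next], (int3){0, 0, 0}, num_vertices);
        // Front plate of device next -> back halo of device i
        copyMeshDeviceToDevice(devices[next], STREAM_PRIMARY, (int3){0, 0, NGHOST},
                               devices[i], (int3){0, 0, NGHOST + subgrid.n.z}, num_vertices);
    }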
### Throughout this file we use the following notation and names for various index offsets
Global coordinates: coordinates with respect to the global grid (static Grid grid)
Local coordinates: coordinates with respect to the local subgrid (static Grid subgrid)
s0, s1: source indices in global coordinates
d0, d1: destination indices in global coordinates
da = max(s0, d0);
db = min(s1, d1);
These are used in at least
acLoad()
acStore()
acSynchronizeHalos()
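A sketch of how the overlap is computed with these names (simplified; assumes
component-wise max/min helpers and operator- for int3):

    const int3 da = max(s0, d0); // first vertex of the overlap, global coordinates
    const int3 db = min(s1, d1); // one-past-last vertex of the overlap
    if (da.z < db.z) {
        // Non-empty overlap: translate into the device's local coordinates by
        // subtracting the decomposition offset d0, then issue the (async) copy
        const int3 da_local = da - d0;
        // ... transfer the vertices in da..db to/from this device here ...
    }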
Here we decompose the host mesh and distribute it among the GPUs in
the node.
The host mesh is a huge contiguous block of data. Its dimensions are given by
the global variable named "grid". A "grid" is decomposed into "subgrids",
one for each GPU. Here we check which parts of the range s0...s1 map
to the memory space stored by some GPU, ranging d0...d1, and transfer
the data if needed.
The index mapping is inherently quite involved, but here's a picture which
hopefully helps make sense out of all this.
Grid
                           |----num_vertices---|
xxx|....................................................|xxx
   ^              ^        ^                   ^
   d0             d1       s0 (src)            s1

Subgrid
           xxx|.............|xxx
           ^                   ^
           d0                  d1
               ^           ^
               db          da
*
*/
#include "astaroth.h"
@@ -151,36 +250,7 @@ acQuit(void)
AcResult
acLoadWithOffset(const AcMesh& host_mesh, const int3& src, const int num_vertices)
{
/*
Here we decompose the host mesh and distribute it among the GPUs in
the node.
The host mesh is a huge contiguous block of data. Its dimensions are given by
the global variable named "grid". A "grid" is decomposed into "subgrids",
one for each GPU. Here we check which parts of the range s0...s1 map
to the memory space stored by some GPU, ranging d0...d1, and transfer
the data if needed.
The index mapping is inherently quite involved, but here's a picture which
hopefully helps make sense out of all this.
Grid
                           |----num_vertices---|
xxx|....................................................|xxx
   ^              ^        ^                   ^
   d0             d1       s0 (src)            s1

Subgrid
           xxx|.............|xxx
           ^                   ^
           d0                  d1
               ^           ^
               db          da
*/
// See the beginning of the file for an explanation of the index mapping
for (int i = 0; i < num_devices; ++i) {
const int3 d0 = (int3){0, 0, i * subgrid.n.z}; // DECOMPOSITION OFFSET HERE
const int3 d1 = (int3){subgrid.m.x, subgrid.m.y, d0.z + subgrid.m.z};
@@ -216,7 +286,7 @@ acLoadWithOffset(const AcMesh& host_mesh, const int3& src, const int num_vertice
AcResult
acStoreWithOffset(const int3& src, const int num_vertices, AcMesh* host_mesh)
{
// See acLoadWithOffset() for an explanation of the index mapping
// See the beginning of the file for an explanation of the index mapping
for (int i = 0; i < num_devices; ++i) {
const int3 d0 = (int3){0, 0, i * subgrid.n.z}; // DECOMPOSITION OFFSET HERE
const int3 d1 = (int3){subgrid.m.x, subgrid.m.y, d0.z + subgrid.m.z};
@@ -270,6 +340,9 @@ acSynchronizeHalos(void)
// We loop only to num_devices - 1 since the front and back plates of the grid are not
// transferred: their contents depend on the boundary conditions.
// IMPORTANT NOTE: the boundary conditions must be applied before calling this function!
// I.e. the halos of subgrids must contain up-to-date data!
for (int i = 0; i < num_devices - 1; ++i) {
const int num_vertices = subgrid.m.x * subgrid.m.y * NGHOST;
// ...|ooooxxx|... -> xxx|ooooooo|...
@@ -324,70 +397,24 @@ acBoundcondStep(void)
const int3 d1 = (int3){subgrid.m.x, subgrid.m.y, d0.z + subgrid.n.z};
boundcondStep(devices[i], STREAM_PRIMARY, d0, d1);
}
// With periodic boundary conditions we exchange the front and back plates of the
// grid. The exchange is done between the first and last device (0 and num_devices - 1).
const int num_vertices = subgrid.m.x * subgrid.m.y * NGHOST;
// ...|ooooxxx|... -> xxx|ooooooo|...
{
const int3 src = (int3){0, 0, subgrid.n.z};
const int3 dst = (int3){0, 0, 0};
copyMeshDeviceToDevice(devices[num_devices - 1], STREAM_PRIMARY, src, devices[0], dst,
num_vertices);
}
// ...|ooooooo|xxx <- ...|xxxoooo|...
{
const int3 src = (int3){0, 0, NGHOST};
const int3 dst = (int3){0, 0, NGHOST + subgrid.n.z};
copyMeshDeviceToDevice(devices[0], STREAM_PRIMARY, src, devices[num_devices - 1], dst,
num_vertices);
}
/*
// ===MIIKKANOTE START==========================================
%JP: The old way of computing boundary conditions conflicts with the
way we have to do things with multiple GPUs.
The older approach relied on unified memory, which represented the whole
memory area as one huge mesh instead of several smaller ones. However, unified memory
in its current state is meant more for quick prototyping, when performance is not an issue.
Getting the CUDA driver to migrate data intelligently across GPUs is much harder
than managing the memory explicitly.
In this new approach, I have simplified the multi- and single-GPU layers significantly.
Quick rundown:
New struct: Grid. There are two global variables, "grid" and "subgrid", which
contain the extents of the whole simulation domain and the decomposed grids,
respectively. To simplify things, we require that each GPU is assigned the same amount of
work; therefore each GPU in the node is assigned a "subgrid.m"-sized block of data to
work with.
The whole simulation domain is decomposed with respect to the z dimension.
For example, if the grid contains (nx, ny, nz) vertices, then the subgrids
contain (nx, ny, nz / num_devices) vertices.
A local index (i, j, k) in some subgrid can be mapped to the global grid with
global idx = (i, j, k + device_id * subgrid.n.z)
Terminology:
- Single-GPU function: a function defined on the single-GPU layer (device.cu)
Changes required to this commented code block:
- The thread block dimensions (tpb) are no longer passed to the kernel here but in
device.cu instead. Same holds for any complex index calculations. Instead, the local
coordinates should be passed as an int3 type without having to consider how the data is
actually laid out in device memory
- The unified memory buffer no longer exists (d_buffer). Instead, we have an opaque
handle of type "Device" which should be passed to single-GPU functions. In this file, all
devices are stored in a global array "devices[num_devices]".
- Every single-GPU function is executed asynchronously by default such that we
can optimize Astaroth by executing memory transactions concurrently with
computation. Therefore a StreamType should be passed as a parameter to single-GPU functions.
Refresher: CUDA calls are non-blocking with respect to the host when a stream is
explicitly passed as a parameter, and commands issued to different streams can be
processed in parallel/concurrently.
Note on periodic boundaries (might be helpful when implementing other boundary conditions):
With multiple GPUs, periodic boundary conditions applied on indices ranging from
(0, 0, STENCIL_ORDER/2) to (subgrid.m.x, subgrid.m.y, subgrid.m.z - STENCIL_ORDER/2)
on a single device are "local", in the sense that they can be computed without
having to exchange data with neighboring GPUs. Special care is needed only for transferring
the data to the front and back plates outside this range. In the solution we use
here, we solve the local boundaries first, and then just exchange the front and back plates
in a "ring", like so
device_id
(n) <-> 0 <-> 1 <-> ... <-> n <-> (0)
// ======MIIKKANOTE END==========================================
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< MIIKKANOTE: This code block was essentially
moved into device.cu, function boundcondStep(). In astaroth.cu, we use acBoundcondStep()
just to distribute the work and
@@ -417,23 +444,6 @@ acBoundcondStep(void)
periodic_boundconds(0, tpb, start, end, d_buffer.in[i]);
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
*/
// With periodic boundary conditions we exchange the front and back plates of the
// grid. The exchange is done between the first and last device (0 and num_devices - 1).
const int num_vertices = subgrid.m.x * subgrid.m.y * NGHOST;
// ...|ooooxxx|... -> xxx|ooooooo|...
{
const int3 src = (int3){0, 0, subgrid.n.z};
const int3 dst = (int3){0, 0, 0};
copyMeshDeviceToDevice(devices[num_devices - 1], STREAM_PRIMARY, src, devices[0], dst,
num_vertices);
}
// ...|ooooooo|xxx <- ...|xxxoooo|...
{
const int3 src = (int3){0, 0, NGHOST};
const int3 dst = (int3){0, 0, NGHOST + subgrid.n.z};
copyMeshDeviceToDevice(devices[0], STREAM_PRIMARY, src, devices[num_devices - 1], dst,
num_vertices);
}
}
acSynchronize();
return AC_SUCCESS;
@@ -442,9 +452,9 @@ acBoundcondStep(void)
AcResult
acIntegrateStepWithOffset(const int& isubstep, const AcReal& dt, const int3& start, const int3& end)
{
// See the beginning of the file for an explanation of the index mapping
for (int i = 0; i < num_devices; ++i) {
// DECOMPOSITION OFFSET HERE
// Same naming here (d0, d1, da, db) as in acLoadWithOffset
const int3 d0 = (int3){NGHOST, NGHOST, NGHOST + i * subgrid.n.z};
const int3 d1 = d0 + (int3){subgrid.n.x, subgrid.n.y, subgrid.n.z};