Merge branch 'master' of https://bitbucket.org/jpekkila/astaroth

2020-01-22 15:21:10 +02:00
parent f77ab8a809 ba8960cd08
commit 868bf3ed5e
9 changed files with 1241 additions and 650 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Astaroth - A Multi-GPU Library for Generic Stencil Computations {#mainpage}

-[Specification](doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md) | [Contributing](CONTRIBUTING.md) | [Licence](LICENCE.md) | [Issue Tracker](https://bitbucket.org/jpekkila/astaroth/issues?status=new&status=open) | [Wiki](https://bitbucket.org/jpekkila/astaroth/wiki/Home)
+[Specification](doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md) | [Contributing](CONTRIBUTING.md) | [Licence](LICENCE.md) | [Repository](https://bitbucket.org/jpekkila/astaroth) | [Issue Tracker](https://bitbucket.org/jpekkila/astaroth/issues?status=new&status=open) | [Wiki](https://bitbucket.org/jpekkila/astaroth/wiki/Home)

 Astaroth is a multi-GPU library for three-dimensional stencil computations. It is designed especially for performing high-order stencil
 computations in structured grids, where several coupled fields are updated each time step. Astaroth consists of a multi-GPU and single-GPU
@@ -29,12 +29,9 @@ In the base directory, run
 3. `cmake ..`
 4. `make -j`

-> **Optional:** Documentation can be generated by running `doxygen` in the base directory. The
-generated documentation can be found in `doc/doxygen`.
+> **Optional:** Documentation can be generated by running `doxygen` in the base directory. Generated documentation can be found in `doc/doxygen`.

-> **Tip:**  The library is configured by passing [options](#markdown-header-cmake-options) to CMake with `-D[option]=[ON|OFF]`.
-For example, double precision can be enabled by calling `cmake -DBUILD_DOUBLE_PRECISION=ON ..`.
-See [CMakeLists.txt](https://bitbucket.org/jpekkila/astaroth/src/master/CMakeLists.txt) for an up-to-date list of options.
+> **Tip:**  The library is configured by passing [options](#markdown-header-cmake-options) to CMake with `-D[option]=[ON|OFF]`. For example, double precision can be enabled by calling `cmake -DBUILD_DOUBLE_PRECISION=ON ..`. See [CMakeLists.txt](https://bitbucket.org/jpekkila/astaroth/src/master/CMakeLists.txt) for an up-to-date list of options.

 > **Note:** CMake will inform you if there are missing dependencies.

--- a/doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md
+++ b/doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md
@@ -308,7 +308,7 @@ AcResult acNodeSynchronizeVertexBuffer(const Node node, const Stream stream,

 ```

-> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
+> **Note:** Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.

 ## Input and Output Buffers

@@ -325,7 +325,7 @@ is done via the API calls
 AcResult acDeviceSwapBuffers(const Device device);
 AcResult acNodeSwapBuffers(const Node node);
 ```
-> **NOTE**: All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
+> **Note:** All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.

 ## Devices

@@ -429,7 +429,7 @@ Let *i* be the device id. The portion of the halos shared by neighboring devices
 `acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` communicate these shared areas among
 the devices in the node.

-> **NOTE:** The decomposition scheme is subject to change.
+> **Note:** The decomposition scheme is subject to change.

 # Astaroth Domain-Specific Language

@@ -563,11 +563,11 @@ which use those uniforms.

 `Uniform`s can be of type `Scalar`, `Vector`, `int`, `int3`, `ScalarField` and `ScalarArray`.

-> Note: As of 2019-10-01, the types `ScalarField` (DSL) and `VertexBuffer` (CUDA) are aliases of the same type. This naming may be changed in the future.
+> **Note:** As of 2019-10-01, the types `ScalarField` (DSL) and `VertexBuffer` (CUDA) are aliases of the same type. This naming may be changed in the future.

-> Note: As of 2019-10-01, `VectorField`s cannot be declared as uniforms. Instead, one should declare each component as a `ScalarField` and use them to construct a `VectorField` during the stencil processing stage. For example, `in VectorField(A, B, C);`, where `A`, `B` and `C` are `uniform ScalarField`s.
+> **Note:** As of 2019-10-01, `VectorField`s cannot be declared as uniforms. Instead, one should declare each component as a `ScalarField` and use them to construct a `VectorField` during the stencil processing stage. For example, `in VectorField(A, B, C);`, where `A`, `B` and `C` are `uniform ScalarField`s.

-> Note: As of 2019-10-01, `uniform`s cannot be assigned values in the stencil definition headers. Instead, one should load the appropriate values during runtime using the `acLoadScalarUniform` and related functions.
+> **Note:** As of 2019-10-01, `uniform`s cannot be assigned values in the stencil definition headers. Instead, one should load the appropriate values during runtime using the `acLoadScalarUniform` and related functions.


 ## Standard Libraries
--- a/doc/manual/manual.txt
+++ b/doc/manual/manual.txt
--- a/2
+++ b/2
@@ -811,7 +811,7 @@ RECURSIVE              = YES
 # Note that relative paths are relative to the directory from which doxygen is
 # run.

-EXCLUDE                =
+EXCLUDE                = analysis 3rdparty samples scripts src/standalone src/core/kernels/deprecated

 # The EXCLUDE_SYMLINKS tag can be used to select whether or not files or
 # directories that are symbolic links (a Unix file system feature) are excluded
--- a/src/core/CMakeLists.txt
+++ b/src/core/CMakeLists.txt
@@ -14,7 +14,7 @@ find_package(CUDA REQUIRED)
 set(CMAKE_CUDA_FLAGS "-gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 --restrict")

 # Compile kernels
-add_library(astaroth_kernels STATIC kernels/boundconds.cu kernels/integration.cu kernels/reductions.cu)
+add_library(astaroth_kernels STATIC kernels/boundconds.cu kernels/integration.cu kernels/reductions.cu kernels/packing.cu)
 target_include_directories(astaroth_kernels PRIVATE .)
 target_compile_features(astaroth_kernels PRIVATE cxx_std_11)
 set_target_properties(astaroth_kernels PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
@@ -23,7 +23,7 @@ add_dependencies(astaroth_kernels dsl_headers)
 # Compile core
 add_library(astaroth_core STATIC astaroth.cc node.cc device.cc)
 target_include_directories(astaroth_core PRIVATE . "${CUDA_INCLUDE_DIRS}")
-target_link_libraries(astaroth_core m astaroth_kernels cudart)
+target_link_libraries(astaroth_core m astaroth_kernels ${CUDA_LIBRARIES})
 target_compile_definitions(astaroth_core PRIVATE AC_USE_CUDA_RUNTIME_API)
 set_target_properties(astaroth_core PROPERTIES POSITION_INDEPENDENT_CODE ON)
 add_dependencies(astaroth_kernels dsl_headers)
--- a/src/core/device.cc
+++ b/src/core/device.cc
--- a/src/core/kernels/packing.cu
+++ b/src/core/kernels/packing.cu
@@ -0,0 +1,116 @@
+/*
+    Copyright (C) 2014-2020, Johannes Pekkilae, Miikka Vaeisalae.
+
+    This file is part of Astaroth.
+
+    Astaroth is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    Astaroth is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with Astaroth.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+/**
+ * @file
+ * \brief Brief info.
+ *
+ * Detailed info.
+ *
+ */
+#include "packing.cuh"
+
+#include "src/core/errchk.h"
+
+__global__ void
+kernel_pack_data(const VertexBufferArray vba, const int3 vba_start, PackedData packed)
+{
+    const int i_packed = threadIdx.x + blockIdx.x * blockDim.x;
+    const int j_packed = threadIdx.y + blockIdx.y * blockDim.y;
+    const int k_packed = threadIdx.z + blockIdx.z * blockDim.z;
+
+    // If within the start-end range (this allows threadblock dims that are not
+    // divisible by end - start)
+    if (i_packed >= packed.dims.x || //
+        j_packed >= packed.dims.y || //
+        k_packed >= packed.dims.z) {
+        return;
+    }
+
+    const int i_unpacked = i_packed + vba_start.x;
+    const int j_unpacked = j_packed + vba_start.y;
+    const int k_unpacked = k_packed + vba_start.z;
+
+    const int unpacked_idx = DEVICE_VTXBUF_IDX(i_unpacked, j_unpacked, k_unpacked);
+    const int packed_idx   = i_packed +               //
+                           j_packed * packed.dims.x + //
+                           k_packed * packed.dims.x * packed.dims.y;
+
+    const size_t vtxbuf_offset = packed.dims.x * packed.dims.y * packed.dims.z;
+    for (int i = 0; i < NUM_VTXBUF_HANDLES; ++i)
+        packed.data[packed_idx + i * vtxbuf_offset] = vba.in[i][unpacked_idx];
+}
+
+__global__ void
+kernel_unpack_data(const PackedData packed, const int3 vba_start, VertexBufferArray vba)
+{
+    const int i_packed = threadIdx.x + blockIdx.x * blockDim.x;
+    const int j_packed = threadIdx.y + blockIdx.y * blockDim.y;
+    const int k_packed = threadIdx.z + blockIdx.z * blockDim.z;
+
+    // If within the start-end range (this allows threadblock dims that are not
+    // divisible by end - start)
+    if (i_packed >= packed.dims.x || //
+        j_packed >= packed.dims.y || //
+        k_packed >= packed.dims.z) {
+        return;
+    }
+
+    const int i_unpacked = i_packed + vba_start.x;
+    const int j_unpacked = j_packed + vba_start.y;
+    const int k_unpacked = k_packed + vba_start.z;
+
+    const int unpacked_idx = DEVICE_VTXBUF_IDX(i_unpacked, j_unpacked, k_unpacked);
+    const int packed_idx   = i_packed +               //
+                           j_packed * packed.dims.x + //
+                           k_packed * packed.dims.x * packed.dims.y;
+
+    const size_t vtxbuf_offset = packed.dims.x * packed.dims.y * packed.dims.z;
+    for (int i = 0; i < NUM_VTXBUF_HANDLES; ++i)
+        vba.in[i][unpacked_idx] = packed.data[packed_idx + i * vtxbuf_offset];
+}
+
+AcResult
+acKernelPackData(const cudaStream_t stream, const VertexBufferArray vba, const int3 vba_start,
+                 PackedData packed)
+{
+    const dim3 tpb(32, 8, 1);
+    const dim3 bpg((unsigned int)ceil(packed.dims.x / (float)tpb.x),
+                   (unsigned int)ceil(packed.dims.y / (float)tpb.y),
+                   (unsigned int)ceil(packed.dims.z / (float)tpb.z));
+
+    kernel_pack_data<<<bpg, tpb, 0, stream>>>(vba, vba_start, packed);
+    ERRCHK_CUDA_KERNEL();
+
+    return AC_SUCCESS;
+}
+
+AcResult
+acKernelUnpackData(const cudaStream_t stream, const PackedData packed, const int3 vba_start,
+                   VertexBufferArray vba)
+{
+    const dim3 tpb(32, 8, 1);
+    const dim3 bpg((unsigned int)ceil(packed.dims.x / (float)tpb.x),
+                   (unsigned int)ceil(packed.dims.y / (float)tpb.y),
+                   (unsigned int)ceil(packed.dims.z / (float)tpb.z));
+
+    kernel_unpack_data<<<bpg, tpb, 0, stream>>>(packed, vba_start, vba);
+    ERRCHK_CUDA_KERNEL();
+    return AC_SUCCESS;
+}
--- a/src/core/kernels/packing.cuh
+++ b/src/core/kernels/packing.cuh
@@ -0,0 +1,40 @@
+/*
+    Copyright (C) 2014-2020, Johannes Pekkilae, Miikka Vaeisalae.
+
+    This file is part of Astaroth.
+
+    Astaroth is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    Astaroth is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with Astaroth.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+/**
+ * @file
+ * \brief Brief info.
+ *
+ * Detailed info.
+ *
+ */
+#pragma once
+#include "astaroth.h"
+#include "common.cuh"
+
+typedef struct {
+    int3 dims;
+    AcReal* data;
+} PackedData;
+
+AcResult acKernelPackData(const cudaStream_t stream, const VertexBufferArray vba,
+                          const int3 vba_start, PackedData packed);
+
+AcResult acKernelUnpackData(const cudaStream_t stream, const PackedData packed,
+                            const int3 vba_start, VertexBufferArray vba);
--- a/src/core/math_utils.h
+++ b/src/core/math_utils.h
@@ -101,6 +101,11 @@ operator+(const int3& a, const int3& b)
    return (int3){a.x + b.x, a.y + b.y, a.z + b.z};
 }

+static HOST_DEVICE_INLINE int3 operator*(const int3& a, const int3& b)
+{
+    return (int3){a.x * b.x, a.y * b.y, a.z * b.z};
+}
+
 static HOST_DEVICE_INLINE void
 operator+=(AcReal3& lhs, const AcReal3& rhs)
 {