This commit is contained in:
jpekkila
2020-01-22 15:21:10 +02:00
9 changed files with 1241 additions and 650 deletions

View File

@@ -1,6 +1,6 @@
# Astaroth - A Multi-GPU Library for Generic Stencil Computations {#mainpage}
[Specification](doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md) | [Contributing](CONTRIBUTING.md) | [Licence](LICENCE.md) | [Issue Tracker](https://bitbucket.org/jpekkila/astaroth/issues?status=new&status=open) | [Wiki](https://bitbucket.org/jpekkila/astaroth/wiki/Home)
[Specification](doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md) | [Contributing](CONTRIBUTING.md) | [Licence](LICENCE.md) | [Repository](https://bitbucket.org/jpekkila/astaroth) | [Issue Tracker](https://bitbucket.org/jpekkila/astaroth/issues?status=new&status=open) | [Wiki](https://bitbucket.org/jpekkila/astaroth/wiki/Home)
Astaroth is a multi-GPU library for three-dimensional stencil computations. It is designed especially for performing high-order stencil
computations in structured grids, where several coupled fields are updated each time step. Astaroth consists of a multi-GPU and single-GPU
@@ -29,12 +29,9 @@ In the base directory, run
3. `cmake ..`
4. `make -j`
> **Optional:** Documentation can be generated by running `doxygen` in the base directory. The
generated documentation can be found in `doc/doxygen`.
> **Optional:** Documentation can be generated by running `doxygen` in the base directory. Generated documentation can be found in `doc/doxygen`.
> **Tip:** The library is configured by passing [options](#markdown-header-cmake-options) to CMake with `-D[option]=[ON|OFF]`.
For example, double precision can be enabled by calling `cmake -DBUILD_DOUBLE_PRECISION=ON ..`.
See [CMakeLists.txt](https://bitbucket.org/jpekkila/astaroth/src/master/CMakeLists.txt) for an up-to-date list of options.
> **Tip:** The library is configured by passing [options](#markdown-header-cmake-options) to CMake with `-D[option]=[ON|OFF]`. For example, double precision can be enabled by calling `cmake -DBUILD_DOUBLE_PRECISION=ON ..`. See [CMakeLists.txt](https://bitbucket.org/jpekkila/astaroth/src/master/CMakeLists.txt) for an up-to-date list of options.
> **Note:** CMake will inform you if there are missing dependencies.

View File

@@ -308,7 +308,7 @@ AcResult acNodeSynchronizeVertexBuffer(const Node node, const Stream stream,
```
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
> **Note:** Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
## Input and Output Buffers
@@ -325,7 +325,7 @@ is done via the API calls
AcResult acDeviceSwapBuffers(const Device device);
AcResult acNodeSwapBuffers(const Node node);
```
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
> **Note:** All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
## Devices
@@ -429,7 +429,7 @@ Let *i* be the device id. The portion of the halos shared by neighboring devices
`acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` communicate these shared areas among
the devices in the node.
> **NOTE:** The decomposition scheme is subject to change.
> **Note:** The decomposition scheme is subject to change.
# Astaroth Domain-Specific Language
@@ -563,11 +563,11 @@ which use those uniforms.
`Uniform`s can be of type `Scalar`, `Vector`, `int`, `int3`, `ScalarField` and `ScalarArray`.
> Note: As of 2019-10-01, the types `ScalarField` (DSL) and `VertexBuffer` (CUDA) are aliases of the same type. This naming may be changed in the future.
> **Note:** As of 2019-10-01, the types `ScalarField` (DSL) and `VertexBuffer` (CUDA) are aliases of the same type. This naming may be changed in the future.
> Note: As of 2019-10-01, `VectorField`s cannot be declared as uniforms. Instead, one should declare each component as a `ScalarField` and use them to construct a `VectorField` during the stencil processing stage. For example, `in VectorField(A, B, C);`, where `A`, `B` and `C` are `uniform ScalarField`s.
> **Note:** As of 2019-10-01, `VectorField`s cannot be declared as uniforms. Instead, one should declare each component as a `ScalarField` and use them to construct a `VectorField` during the stencil processing stage. For example, `in VectorField(A, B, C);`, where `A`, `B` and `C` are `uniform ScalarField`s.
> Note: As of 2019-10-01, `uniform`s cannot be assigned values in the stencil definition headers. Instead, one should load the appropriate values during runtime using the `acLoadScalarUniform` and related functions.
> **Note:** As of 2019-10-01, `uniform`s cannot be assigned values in the stencil definition headers. Instead, one should load the appropriate values during runtime using the `acLoadScalarUniform` and related functions.
## Standard Libraries

View File

@@ -811,7 +811,7 @@ RECURSIVE = YES
# Note that relative paths are relative to the directory from which doxygen is
# run.
EXCLUDE =
EXCLUDE = analysis 3rdparty samples scripts src/standalone src/core/kernels/deprecated
# The EXCLUDE_SYMLINKS tag can be used to select whether or not files or
# directories that are symbolic links (a Unix file system feature) are excluded

View File

@@ -14,7 +14,7 @@ find_package(CUDA REQUIRED)
set(CMAKE_CUDA_FLAGS "-gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 --restrict")
# Compile kernels
add_library(astaroth_kernels STATIC kernels/boundconds.cu kernels/integration.cu kernels/reductions.cu)
add_library(astaroth_kernels STATIC kernels/boundconds.cu kernels/integration.cu kernels/reductions.cu kernels/packing.cu)
target_include_directories(astaroth_kernels PRIVATE .)
target_compile_features(astaroth_kernels PRIVATE cxx_std_11)
set_target_properties(astaroth_kernels PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
@@ -23,7 +23,7 @@ add_dependencies(astaroth_kernels dsl_headers)
# Compile core
add_library(astaroth_core STATIC astaroth.cc node.cc device.cc)
target_include_directories(astaroth_core PRIVATE . "${CUDA_INCLUDE_DIRS}")
target_link_libraries(astaroth_core m astaroth_kernels cudart)
target_link_libraries(astaroth_core m astaroth_kernels ${CUDA_LIBRARIES})
target_compile_definitions(astaroth_core PRIVATE AC_USE_CUDA_RUNTIME_API)
set_target_properties(astaroth_core PROPERTIES POSITION_INDEPENDENT_CODE ON)
add_dependencies(astaroth_kernels dsl_headers)

File diff suppressed because it is too large Load Diff

116
src/core/kernels/packing.cu Normal file
View File

@@ -0,0 +1,116 @@
/*
Copyright (C) 2014-2020, Johannes Pekkilae, Miikka Vaeisalae.
This file is part of Astaroth.
Astaroth is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
Astaroth is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with Astaroth. If not, see <http://www.gnu.org/licenses/>.
*/
/**
* @file
* \brief Brief info.
*
* Detailed info.
*
*/
#include "packing.cuh"
#include "src/core/errchk.h"
__global__ void
kernel_pack_data(const VertexBufferArray vba, const int3 vba_start, PackedData packed)
{
const int i_packed = threadIdx.x + blockIdx.x * blockDim.x;
const int j_packed = threadIdx.y + blockIdx.y * blockDim.y;
const int k_packed = threadIdx.z + blockIdx.z * blockDim.z;
// If within the start-end range (this allows threadblock dims that are not
// divisible by end - start)
if (i_packed >= packed.dims.x || //
j_packed >= packed.dims.y || //
k_packed >= packed.dims.z) {
return;
}
const int i_unpacked = i_packed + vba_start.x;
const int j_unpacked = j_packed + vba_start.y;
const int k_unpacked = k_packed + vba_start.z;
const int unpacked_idx = DEVICE_VTXBUF_IDX(i_unpacked, j_unpacked, k_unpacked);
const int packed_idx = i_packed + //
j_packed * packed.dims.x + //
k_packed * packed.dims.x * packed.dims.y;
const size_t vtxbuf_offset = packed.dims.x * packed.dims.y * packed.dims.z;
for (int i = 0; i < NUM_VTXBUF_HANDLES; ++i)
packed.data[packed_idx + i * vtxbuf_offset] = vba.in[i][unpacked_idx];
}
__global__ void
kernel_unpack_data(const PackedData packed, const int3 vba_start, VertexBufferArray vba)
{
const int i_packed = threadIdx.x + blockIdx.x * blockDim.x;
const int j_packed = threadIdx.y + blockIdx.y * blockDim.y;
const int k_packed = threadIdx.z + blockIdx.z * blockDim.z;
// If within the start-end range (this allows threadblock dims that are not
// divisible by end - start)
if (i_packed >= packed.dims.x || //
j_packed >= packed.dims.y || //
k_packed >= packed.dims.z) {
return;
}
const int i_unpacked = i_packed + vba_start.x;
const int j_unpacked = j_packed + vba_start.y;
const int k_unpacked = k_packed + vba_start.z;
const int unpacked_idx = DEVICE_VTXBUF_IDX(i_unpacked, j_unpacked, k_unpacked);
const int packed_idx = i_packed + //
j_packed * packed.dims.x + //
k_packed * packed.dims.x * packed.dims.y;
const size_t vtxbuf_offset = packed.dims.x * packed.dims.y * packed.dims.z;
for (int i = 0; i < NUM_VTXBUF_HANDLES; ++i)
vba.in[i][unpacked_idx] = packed.data[packed_idx + i * vtxbuf_offset];
}
AcResult
acKernelPackData(const cudaStream_t stream, const VertexBufferArray vba, const int3 vba_start,
PackedData packed)
{
const dim3 tpb(32, 8, 1);
const dim3 bpg((unsigned int)ceil(packed.dims.x / (float)tpb.x),
(unsigned int)ceil(packed.dims.y / (float)tpb.y),
(unsigned int)ceil(packed.dims.z / (float)tpb.z));
kernel_pack_data<<<bpg, tpb, 0, stream>>>(vba, vba_start, packed);
ERRCHK_CUDA_KERNEL();
return AC_SUCCESS;
}
AcResult
acKernelUnpackData(const cudaStream_t stream, const PackedData packed, const int3 vba_start,
VertexBufferArray vba)
{
const dim3 tpb(32, 8, 1);
const dim3 bpg((unsigned int)ceil(packed.dims.x / (float)tpb.x),
(unsigned int)ceil(packed.dims.y / (float)tpb.y),
(unsigned int)ceil(packed.dims.z / (float)tpb.z));
kernel_unpack_data<<<bpg, tpb, 0, stream>>>(packed, vba_start, vba);
ERRCHK_CUDA_KERNEL();
return AC_SUCCESS;
}

View File

@@ -0,0 +1,40 @@
/*
Copyright (C) 2014-2020, Johannes Pekkilae, Miikka Vaeisalae.
This file is part of Astaroth.
Astaroth is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
Astaroth is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with Astaroth. If not, see <http://www.gnu.org/licenses/>.
*/
/**
* @file
* \brief Brief info.
*
* Detailed info.
*
*/
#pragma once
#include "astaroth.h"
#include "common.cuh"
typedef struct {
int3 dims;
AcReal* data;
} PackedData;
AcResult acKernelPackData(const cudaStream_t stream, const VertexBufferArray vba,
const int3 vba_start, PackedData packed);
AcResult acKernelUnpackData(const cudaStream_t stream, const PackedData packed,
const int3 vba_start, VertexBufferArray vba);

View File

@@ -101,6 +101,11 @@ operator+(const int3& a, const int3& b)
return (int3){a.x + b.x, a.y + b.y, a.z + b.z};
}
static HOST_DEVICE_INLINE int3 operator*(const int3& a, const int3& b)
{
return (int3){a.x * b.x, a.y * b.y, a.z * b.z};
}
static HOST_DEVICE_INLINE void
operator+=(AcReal3& lhs, const AcReal3& rhs)
{