Merge branch 'master' of https://bitbucket.org/jpekkila/astaroth
This commit is contained in:
@@ -1,6 +1,6 @@
|
||||
# Astaroth - A Multi-GPU Library for Generic Stencil Computations {#mainpage}
|
||||
|
||||
[Specification](doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md) | [Contributing](CONTRIBUTING.md) | [Licence](LICENCE.md) | [Issue Tracker](https://bitbucket.org/jpekkila/astaroth/issues?status=new&status=open) | [Wiki](https://bitbucket.org/jpekkila/astaroth/wiki/Home)
|
||||
[Specification](doc/Astaroth_API_specification_and_user_manual/API_specification_and_user_manual.md) | [Contributing](CONTRIBUTING.md) | [Licence](LICENCE.md) | [Repository](https://bitbucket.org/jpekkila/astaroth) | [Issue Tracker](https://bitbucket.org/jpekkila/astaroth/issues?status=new&status=open) | [Wiki](https://bitbucket.org/jpekkila/astaroth/wiki/Home)
|
||||
|
||||
Astaroth is a multi-GPU library for three-dimensional stencil computations. It is designed especially for performing high-order stencil
|
||||
computations in structured grids, where several coupled fields are updated each time step. Astaroth consists of a multi-GPU and single-GPU
|
||||
@@ -29,12 +29,9 @@ In the base directory, run
|
||||
3. `cmake ..`
|
||||
4. `make -j`
|
||||
|
||||
> **Optional:** Documentation can be generated by running `doxygen` in the base directory. The
|
||||
generated documentation can be found in `doc/doxygen`.
|
||||
> **Optional:** Documentation can be generated by running `doxygen` in the base directory. Generated documentation can be found in `doc/doxygen`.
|
||||
|
||||
> **Tip:** The library is configured by passing [options](#markdown-header-cmake-options) to CMake with `-D[option]=[ON|OFF]`.
|
||||
For example, double precision can be enabled by calling `cmake -DBUILD_DOUBLE_PRECISION=ON ..`.
|
||||
See [CMakeLists.txt](https://bitbucket.org/jpekkila/astaroth/src/master/CMakeLists.txt) for an up-to-date list of options.
|
||||
> **Tip:** The library is configured by passing [options](#markdown-header-cmake-options) to CMake with `-D[option]=[ON|OFF]`. For example, double precision can be enabled by calling `cmake -DBUILD_DOUBLE_PRECISION=ON ..`. See [CMakeLists.txt](https://bitbucket.org/jpekkila/astaroth/src/master/CMakeLists.txt) for an up-to-date list of options.
|
||||
|
||||
> **Note:** CMake will inform you if there are missing dependencies.
|
||||
|
||||
|
@@ -308,7 +308,7 @@ AcResult acNodeSynchronizeVertexBuffer(const Node node, const Stream stream,
|
||||
|
||||
```
|
||||
|
||||
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
|
||||
> **Note:** Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
|
||||
|
||||
## Input and Output Buffers
|
||||
|
||||
@@ -325,7 +325,7 @@ is done via the API calls
|
||||
AcResult acDeviceSwapBuffers(const Device device);
|
||||
AcResult acNodeSwapBuffers(const Node node);
|
||||
```
|
||||
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
|
||||
> **Note:** All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
|
||||
|
||||
## Devices
|
||||
|
||||
@@ -429,7 +429,7 @@ Let *i* be the device id. The portion of the halos shared by neighboring devices
|
||||
`acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` communicate these shared areas among
|
||||
the devices in the node.
|
||||
|
||||
> **NOTE:** The decomposition scheme is subject to change.
|
||||
> **Note:** The decomposition scheme is subject to change.
|
||||
|
||||
# Astaroth Domain-Specific Language
|
||||
|
||||
@@ -563,11 +563,11 @@ which use those uniforms.
|
||||
|
||||
`Uniform`s can be of type `Scalar`, `Vector`, `int`, `int3`, `ScalarField` and `ScalarArray`.
|
||||
|
||||
> Note: As of 2019-10-01, the types `ScalarField` (DSL) and `VertexBuffer` (CUDA) are aliases of the same type. This naming may be changed in the future.
|
||||
> **Note:** As of 2019-10-01, the types `ScalarField` (DSL) and `VertexBuffer` (CUDA) are aliases of the same type. This naming may be changed in the future.
|
||||
|
||||
> Note: As of 2019-10-01, `VectorField`s cannot be declared as uniforms. Instead, one should declare each component as a `ScalarField` and use them to construct a `VectorField` during the stencil processing stage. For example, `in VectorField(A, B, C);`, where `A`, `B` and `C` are `uniform ScalarField`s.
|
||||
> **Note:** As of 2019-10-01, `VectorField`s cannot be declared as uniforms. Instead, one should declare each component as a `ScalarField` and use them to construct a `VectorField` during the stencil processing stage. For example, `in VectorField(A, B, C);`, where `A`, `B` and `C` are `uniform ScalarField`s.
|
||||
|
||||
> Note: As of 2019-10-01, `uniform`s cannot be assigned values in the stencil definition headers. Instead, one should load the appropriate values during runtime using the `acLoadScalarUniform` and related functions.
|
||||
> **Note:** As of 2019-10-01, `uniform`s cannot be assigned values in the stencil definition headers. Instead, one should load the appropriate values during runtime using the `acLoadScalarUniform` and related functions.
|
||||
|
||||
|
||||
## Standard Libraries
|
||||
|
2
doxyfile
2
doxyfile
@@ -811,7 +811,7 @@ RECURSIVE = YES
|
||||
# Note that relative paths are relative to the directory from which doxygen is
|
||||
# run.
|
||||
|
||||
EXCLUDE =
|
||||
EXCLUDE = analysis 3rdparty samples scripts src/standalone src/core/kernels/deprecated
|
||||
|
||||
# The EXCLUDE_SYMLINKS tag can be used to select whether or not files or
|
||||
# directories that are symbolic links (a Unix file system feature) are excluded
|
||||
|
@@ -14,7 +14,7 @@ find_package(CUDA REQUIRED)
|
||||
set(CMAKE_CUDA_FLAGS "-gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 --restrict")
|
||||
|
||||
# Compile kernels
|
||||
add_library(astaroth_kernels STATIC kernels/boundconds.cu kernels/integration.cu kernels/reductions.cu)
|
||||
add_library(astaroth_kernels STATIC kernels/boundconds.cu kernels/integration.cu kernels/reductions.cu kernels/packing.cu)
|
||||
target_include_directories(astaroth_kernels PRIVATE .)
|
||||
target_compile_features(astaroth_kernels PRIVATE cxx_std_11)
|
||||
set_target_properties(astaroth_kernels PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
|
||||
@@ -23,7 +23,7 @@ add_dependencies(astaroth_kernels dsl_headers)
|
||||
# Compile core
|
||||
add_library(astaroth_core STATIC astaroth.cc node.cc device.cc)
|
||||
target_include_directories(astaroth_core PRIVATE . "${CUDA_INCLUDE_DIRS}")
|
||||
target_link_libraries(astaroth_core m astaroth_kernels cudart)
|
||||
target_link_libraries(astaroth_core m astaroth_kernels ${CUDA_LIBRARIES})
|
||||
target_compile_definitions(astaroth_core PRIVATE AC_USE_CUDA_RUNTIME_API)
|
||||
set_target_properties(astaroth_core PROPERTIES POSITION_INDEPENDENT_CODE ON)
|
||||
add_dependencies(astaroth_kernels dsl_headers)
|
||||
|
1665
src/core/device.cc
1665
src/core/device.cc
File diff suppressed because it is too large
Load Diff
116
src/core/kernels/packing.cu
Normal file
116
src/core/kernels/packing.cu
Normal file
@@ -0,0 +1,116 @@
|
||||
/*
|
||||
Copyright (C) 2014-2020, Johannes Pekkilae, Miikka Vaeisalae.
|
||||
|
||||
This file is part of Astaroth.
|
||||
|
||||
Astaroth is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
Astaroth is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with Astaroth. If not, see <http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
/**
|
||||
* @file
|
||||
* \brief Brief info.
|
||||
*
|
||||
* Detailed info.
|
||||
*
|
||||
*/
|
||||
#include "packing.cuh"
|
||||
|
||||
#include "src/core/errchk.h"
|
||||
|
||||
__global__ void
|
||||
kernel_pack_data(const VertexBufferArray vba, const int3 vba_start, PackedData packed)
|
||||
{
|
||||
const int i_packed = threadIdx.x + blockIdx.x * blockDim.x;
|
||||
const int j_packed = threadIdx.y + blockIdx.y * blockDim.y;
|
||||
const int k_packed = threadIdx.z + blockIdx.z * blockDim.z;
|
||||
|
||||
// If within the start-end range (this allows threadblock dims that are not
|
||||
// divisible by end - start)
|
||||
if (i_packed >= packed.dims.x || //
|
||||
j_packed >= packed.dims.y || //
|
||||
k_packed >= packed.dims.z) {
|
||||
return;
|
||||
}
|
||||
|
||||
const int i_unpacked = i_packed + vba_start.x;
|
||||
const int j_unpacked = j_packed + vba_start.y;
|
||||
const int k_unpacked = k_packed + vba_start.z;
|
||||
|
||||
const int unpacked_idx = DEVICE_VTXBUF_IDX(i_unpacked, j_unpacked, k_unpacked);
|
||||
const int packed_idx = i_packed + //
|
||||
j_packed * packed.dims.x + //
|
||||
k_packed * packed.dims.x * packed.dims.y;
|
||||
|
||||
const size_t vtxbuf_offset = packed.dims.x * packed.dims.y * packed.dims.z;
|
||||
for (int i = 0; i < NUM_VTXBUF_HANDLES; ++i)
|
||||
packed.data[packed_idx + i * vtxbuf_offset] = vba.in[i][unpacked_idx];
|
||||
}
|
||||
|
||||
__global__ void
|
||||
kernel_unpack_data(const PackedData packed, const int3 vba_start, VertexBufferArray vba)
|
||||
{
|
||||
const int i_packed = threadIdx.x + blockIdx.x * blockDim.x;
|
||||
const int j_packed = threadIdx.y + blockIdx.y * blockDim.y;
|
||||
const int k_packed = threadIdx.z + blockIdx.z * blockDim.z;
|
||||
|
||||
// If within the start-end range (this allows threadblock dims that are not
|
||||
// divisible by end - start)
|
||||
if (i_packed >= packed.dims.x || //
|
||||
j_packed >= packed.dims.y || //
|
||||
k_packed >= packed.dims.z) {
|
||||
return;
|
||||
}
|
||||
|
||||
const int i_unpacked = i_packed + vba_start.x;
|
||||
const int j_unpacked = j_packed + vba_start.y;
|
||||
const int k_unpacked = k_packed + vba_start.z;
|
||||
|
||||
const int unpacked_idx = DEVICE_VTXBUF_IDX(i_unpacked, j_unpacked, k_unpacked);
|
||||
const int packed_idx = i_packed + //
|
||||
j_packed * packed.dims.x + //
|
||||
k_packed * packed.dims.x * packed.dims.y;
|
||||
|
||||
const size_t vtxbuf_offset = packed.dims.x * packed.dims.y * packed.dims.z;
|
||||
for (int i = 0; i < NUM_VTXBUF_HANDLES; ++i)
|
||||
vba.in[i][unpacked_idx] = packed.data[packed_idx + i * vtxbuf_offset];
|
||||
}
|
||||
|
||||
AcResult
|
||||
acKernelPackData(const cudaStream_t stream, const VertexBufferArray vba, const int3 vba_start,
|
||||
PackedData packed)
|
||||
{
|
||||
const dim3 tpb(32, 8, 1);
|
||||
const dim3 bpg((unsigned int)ceil(packed.dims.x / (float)tpb.x),
|
||||
(unsigned int)ceil(packed.dims.y / (float)tpb.y),
|
||||
(unsigned int)ceil(packed.dims.z / (float)tpb.z));
|
||||
|
||||
kernel_pack_data<<<bpg, tpb, 0, stream>>>(vba, vba_start, packed);
|
||||
ERRCHK_CUDA_KERNEL();
|
||||
|
||||
return AC_SUCCESS;
|
||||
}
|
||||
|
||||
AcResult
|
||||
acKernelUnpackData(const cudaStream_t stream, const PackedData packed, const int3 vba_start,
|
||||
VertexBufferArray vba)
|
||||
{
|
||||
const dim3 tpb(32, 8, 1);
|
||||
const dim3 bpg((unsigned int)ceil(packed.dims.x / (float)tpb.x),
|
||||
(unsigned int)ceil(packed.dims.y / (float)tpb.y),
|
||||
(unsigned int)ceil(packed.dims.z / (float)tpb.z));
|
||||
|
||||
kernel_unpack_data<<<bpg, tpb, 0, stream>>>(packed, vba_start, vba);
|
||||
ERRCHK_CUDA_KERNEL();
|
||||
return AC_SUCCESS;
|
||||
}
|
40
src/core/kernels/packing.cuh
Normal file
40
src/core/kernels/packing.cuh
Normal file
@@ -0,0 +1,40 @@
|
||||
/*
|
||||
Copyright (C) 2014-2020, Johannes Pekkilae, Miikka Vaeisalae.
|
||||
|
||||
This file is part of Astaroth.
|
||||
|
||||
Astaroth is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
Astaroth is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with Astaroth. If not, see <http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
/**
|
||||
* @file
|
||||
* \brief Brief info.
|
||||
*
|
||||
* Detailed info.
|
||||
*
|
||||
*/
|
||||
#pragma once
|
||||
#include "astaroth.h"
|
||||
#include "common.cuh"
|
||||
|
||||
typedef struct {
|
||||
int3 dims;
|
||||
AcReal* data;
|
||||
} PackedData;
|
||||
|
||||
AcResult acKernelPackData(const cudaStream_t stream, const VertexBufferArray vba,
|
||||
const int3 vba_start, PackedData packed);
|
||||
|
||||
AcResult acKernelUnpackData(const cudaStream_t stream, const PackedData packed,
|
||||
const int3 vba_start, VertexBufferArray vba);
|
@@ -101,6 +101,11 @@ operator+(const int3& a, const int3& b)
|
||||
return (int3){a.x + b.x, a.y + b.y, a.z + b.z};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE int3 operator*(const int3& a, const int3& b)
|
||||
{
|
||||
return (int3){a.x * b.x, a.y * b.y, a.z * b.z};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE void
|
||||
operator+=(AcReal3& lhs, const AcReal3& rhs)
|
||||
{
|
||||
|
Reference in New Issue
Block a user