# Introduction and background
Astaroth is a collection of tools for utilizing multiple graphics processing units (GPUs)
efficiently in three-dimensional stencil computations. This document specifies the Astaroth
application-programming interface (API) and domain-specific language (DSL).
Astaroth has been designed for the demands in computational sciences, where large stencils are
often used to attain sufficient accuracy. The majority of previous work focuses on stencil
computations with low-order stencils for which several efficient algorithms have been proposed,
whereas work on high-order stencils is more limited. In addition, in computational physics multiple
fields interact with each other, such as the velocity and magnetic fields of electrically
conducting fluids. Such computations are especially challenging to solve efficiently because of the
problem's relatively low operational intensity and the small caches provided by GPUs. Efficient
methods for computations with several coupled fields and large stencils have not been addressed
sufficiently in prior work.
With Astaroth, we have taken inspiration from image processing and graphics pipelines which rely on
holding intermediate data in caches for the duration of computations, and extended the idea to work
efficiently also with large three-dimensional stencils and an arbitrary number of coupled fields.
As programming GPUs efficiently is relatively verbose and requires deep knowledge of the underlying
hardware and execution model, we have created a high-level domain-specific language for expressing
a wide range of tasks in computational sciences and provide a source-to-source compiler for
translating stencil problems expressed in our language into efficient CUDA kernels.
The kernels generated from the Astaroth DSL are embedded in the Astaroth Core library, which is
usable via the Astaroth API. While the Astaroth library is written in C++/CUDA, the API conforms to
the C99 standard.
# Publications
The foundational work was done in (Väisälä, Pekkilä, 2017) and the library, API and DSL described
in this document were introduced in (Pekkilä, 2019). We kindly ask the users of Astaroth to cite
these publications in their work.
> J. Pekkilä, Astaroth: A Library for Stencil Computations on Graphics Processing Units. Master's thesis, Aalto University School of Science, Espoo, Finland, 2019.
@@ -45,7 +65,12 @@ The foundational work was done in (Väisälä, Pekkilä, 2017) and the library,
# Astaroth API
The Astaroth application-programming interface (API) provides the means for controlling execution of
user-defined and built-in functions on multiple graphics processing units. Functions in the API are
prefixed with lower case ```ac```, while structures and data types are prefixed with capitalized
```Ac```. Compile-time constants, such as definitions and enumerations, have the prefix ```AC_```.
All of the API functions return an AcResult value indicating either success or failure. The return
codes are
```C
typedef enum {
AC_SUCCESS = 0,
    AC_FAILURE = 1
} AcResult;
```
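
For example, return codes can be checked after every call. A minimal sketch (`acDeviceCreate` and
`acDeviceDestroy` are introduced in Section **Devices**, and `info` stands for a previously
configured `AcMeshInfo`):

```C
#include <stdio.h>

AcResult
run_on_device(const AcMeshInfo info)
{
    Device device;
    if (acDeviceCreate(0, info, &device) != AC_SUCCESS) {
        fprintf(stderr, "Failed to create device 0\n");
        return AC_FAILURE;
    }
    /* ... queue work on the device ... */
    return acDeviceDestroy(device);
}
```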
The API is divided into layers which differ in the level of control provided over the execution.
There are two primary layers:
* Device layer
> Functions start with acDevice*.
* Node layer
> Functions start with acNode*.

There are also several helper functions defined in `include/astaroth_defines.h`.
## List of Astaroth API functions
Here's a non-exhaustive list of Astaroth API functions. For more info and an up-to-date list, see
the corresponding header files `include/astaroth_defines.h`, `include/astaroth.h`,
`include/astaroth_node.h` and `include/astaroth_device.h`.
### Initialization, quitting and helper functions
### Stream synchronization
All library functions that take a `Stream` as a parameter are asynchronous. When calling these
functions, control returns immediately back to the host even if the called device function has not
yet completed. Therefore special care must be taken in order to ensure proper synchronization.
Synchronization is done using `Stream` primitives, defined as
```C
typedef enum { STREAM_DEFAULT, STREAM_0, ..., STREAM_16, NUM_STREAMS } Stream;
```
functions are queued in different streams, then these functions may execute in parallel. For
additional control over streams, there is a barrier synchronization function which blocks execution
until all functions in the specified streams have completed. The Astaroth API provides barrier
synchronization with functions `acDeviceSynchronize` and `acNodeSynchronize`. All streams can be
synchronized at once by passing the alias `STREAM_ALL` to the synchronization function.
Usage of streams is demonstrated with the following example.
```C
funcA(STREAM_0);
funcB(STREAM_0); // Blocks until funcA has completed
funcC(STREAM_1); // May execute in parallel with funcB
barrierSynchronizeStream(STREAM_ALL); // Blocks until functions in all streams have completed
funcD(STREAM_2); // Starts when barrierSynchronizeStream() returns
```
### Data synchronization
Stream synchronization works in the same fashion on the node and device layers. However, on the
node layer one has to take into account that a portion of the mesh is shared between devices, and
ensure that this shared data is always kept up to date.
In stencil computations, the mesh is surrounded by a halo, where data is only used for updating
grid points inside the computational domain. On the node layer, the halo portions shared by
neighboring devices are communicated with the following functions.

```C
AcResult acNodeSynchronizeMesh(const Node node, const Stream stream);
AcResult acNodeSynchronizeVertexBuffer(const Node node, const Stream stream,
                                       const VertexBufferHandle vtxbuf_handle);
```
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid
points outside the computational domain which are used only by a single device. The mesh is
distributed to multiple devices by blocking along the z axis. If there are *n* devices and the
z-dimension of the computational domain is *nz*, then each device is assigned *nz / n*
two-dimensional planes. For example with two devices, the data block that has to be up to date
ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
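
A usage sketch (`VTXBUF_LNRHO` is a hypothetical vertex buffer handle; the actual handles are
generated from the field identifiers declared in the DSL sources):

```C
/* Exchange the halo regions shared between neighboring devices. The local
 * halos are assumed to be up to date, as required by the note above. */
acNodeSynchronizeVertexBuffer(node, STREAM_DEFAULT, VTXBUF_LNRHO); /* one field */
acNodeSynchronizeMesh(node, STREAM_DEFAULT);                       /* all fields */
acNodeSynchronize(node, STREAM_ALL); /* block until the exchange has completed */
```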
### Input and output buffers
read-only in user-specified compute kernels, which allows us to read them via the texture cache
optimized for spatially local memory accesses. The results of compute kernels are written into the
output buffers.
Since we allow the user to operate on subsets of the computational domain in user-specified
kernels, we have no way of knowing when the output buffers are complete and can be swapped.
Therefore the user must explicitly state when the input and output buffers should be swapped. This
is done via the API calls
```C
AcResult acDeviceSwapBuffers(const Device device);
AcResult acNodeSwapBuffers(const Node node);
```
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the
complete result is available in the input buffer when the function has completed. User-specified
kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped
only after calling user-specified kernels.
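
The following sketch shows the intended pattern, assuming a user-specified kernel named `solve`
and its generated launcher `acDeviceKernel_solve` (see the discussion of user-specified kernels
below; `start` and `end` bound the region to update):

```C
for (int step = 0; step < num_steps; ++step) {
    /* The user-specified kernel writes its result to the output buffers... */
    acDeviceKernel_solve(device, STREAM_DEFAULT, start, end);
    /* ...so the buffers are swapped once after each launch. */
    acDeviceSwapBuffers(device);
}
```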
## Devices
`Device` is a handle to some single device and is used in device layer functions to specify which
device should execute the function. A `Device` is created and destroyed with the following
interface functions.
```C
AcResult acDeviceCreate(const int device_id, const AcMeshInfo device_config, Device* device);
AcResult acDeviceDestroy(Device device);
## Nodes
`Node` is a handle to some compute node which consists of multiple devices. The `Node` handle is
used to specify which node the node layer functions should operate in. A node is created and
destroyed with the following interface functions.
```C
AcResult acNodeCreate(const int id, const AcMeshInfo node_config, Node* node);
AcResult acNodeDestroy(Node node);
```
The function acNodeCreate calls acDeviceCreate for all devices that are visible from the current
process. After a node has been created, the devices in it can be retrieved with the function
```C
AcResult acNodeQueryDeviceConfiguration(const Node node, DeviceConfiguration* config);
```
where `DeviceConfiguration` is defined as

```C
typedef struct {
    ...
} DeviceConfiguration;
```
See Section **Decomposition** for discussion about `Grid`.
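
A usage sketch combining the calls above (`info` stands for a previously configured `AcMeshInfo`):

```C
Node node;
acNodeCreate(0, info, &node);

DeviceConfiguration config;
acNodeQueryDeviceConfiguration(node, &config);
/* ... inspect the devices and the grid decomposition in config ... */

acNodeDestroy(node);
```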
## Meshes
Meshes are the primary structures for passing information to the library and kernels. The `Grid`
structure referenced above is defined as

```C
typedef struct {
    ...
} Grid;
```
As briefly discussed in the section Data synchronization, a `Mesh` is distributed to multiple
devices by blocking the data along the *z*-axis. Given the mesh dimensions *(mx, my, mz)*, its
computational domain *(nx, ny, nz)* and *n* devices, each device is assigned a mesh of size
*(mx, my, 2 * NGHOST + nz/n)* and a computational domain of size *(nx, ny, nz/n)*. For example,
with *nz = 128*, *NGHOST = 3* and *n = 4* devices, each device holds a mesh of *(mx, my, 38)*
vertices and a computational domain of *(nx, ny, 32)*.
Let *i* be the device id. The portion of the halos shared by neighboring devices is then
*(0, 0, i * nz/n)* - *(mx, my, 2 * NGHOST + i * nz/n)*. The functions
`acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` communicate these shared areas among
the devices in the node.
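
The blocking scheme can be summarized with a small helper function. A sketch, assuming *nz* is
divisible by the number of devices as in the decomposition described above:

```C
typedef struct {
    int z_start, z_end; /* computational-domain planes [z_start, z_end) */
} ZRange;

/* The z-range of the computational domain assigned to device i out of n. */
static ZRange
device_z_range(const int i, const int nz, const int n)
{
    const ZRange range = {i * (nz / n), (i + 1) * (nz / n)};
    return range;
}
```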
## Integration, reductions and boundary conditions
The library provides the following functions for integration, reductions and computing periodic
boundary conditions.
```C
AcResult acDeviceIntegrateSubstep(const Device device, const Stream stream, const int step_number,
const int3 start, const int3 end, const AcReal dt);
    ...
```

User-specified kernels are launched with generated functions of the form

```C
AcResult acDeviceKernel_##identifier(const Device device, const Stream stream,
                                     const int3 start, const int3 end);
```
Where `##identifier` is replaced with the name of the user-specified kernel. For example, a device
function `Kernel solve()` can be called with `acDeviceKernel_solve()` via the API.
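
For example, launching `solve` over the full computational domain might look as follows (a sketch;
the bounds follow the mesh layout described in Section **Data synchronization**):

```C
const int3 start = (int3){NGHOST, NGHOST, NGHOST};
const int3 end   = (int3){NGHOST + nx, NGHOST + ny, NGHOST + nz};
acDeviceKernel_solve(device, STREAM_DEFAULT, start, end);
acDeviceSwapBuffers(device); /* make the result visible in the input buffers */
```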
# Astaroth Domain-Specific Language
We designed the Astaroth Domain-specific Language (DSL) for expressing stencil computations in a high-level language that can be translated into efficient GPU kernels. The benefits of creating a DSL are two-fold. First, scientists using the language can focus on developing solvers and mathematical models using an easy-to-use language, while still achieving performance close to handwritten code. Second, procedures written in the DSL are decoupled from implementation, which allows us to extend the DSL compiler, say, to generate optimized code for several hardware generations without the users having to modify existing DSL sources.
## Overview
The syntax of the Astaroth DSL is an extended subset of C-like languages. The programming model is
based on stream processing, or dataflow programming, where a chain of functions is executed on
streams of data. A kernel is a small GPU program that defines the operations performed on a number
of data streams. In our case, each data stream corresponds to a single vertex in the mesh, similar
to how vertex shaders operate in graphics shading languages.
With the Astaroth DSL, we have borrowed the idea of graphics and image processing pipelines and
applied it to performing three-dimensional stencil computations cache efficiently. The Astaroth
DSL comprises three closely related languages, which correspond to distinct stages in the stencil
pipeline shown in the following figure.
![Figure: Stencil pipeline.](./doc/Astaroth_API_specification_and_user_manual/stencil_pipeline.svg "Stencil Pipeline")
| Stage | File ending | Description |
|--------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Stencil assembly | .sas | Defines the shape of the stencils and functions to be preprocessed before entering the stencil processing stage. Reading from input arrays is only possible during this stage. |
| Stencil process | .sps | The functions executed on streams of data are defined here. Contains kernels, which are essentially main functions of GPU programs. |
| Stencil definition | .sdh | All field identifiers and constant memory symbols are defined in this file. |
| Any                | .h          | Optional header files which can be included in any other file. |
Compilation of the DSL files is integrated into `CMakeLists.txt` provided with the library and
dependencies are recompiled if needed when calling `make`. All DSL files should reside in the same
directory and there should be only one `.sas`, `.sps` and `.sdh` file. There may be any number of
optional `.h` files. When configuring the project, the user should pass the path to the DSL
directory as a cmake option like so: ```cmake -DDSL_MODULE_DIR="some/user/dir" ..```.
## Data types
In addition to the basic data types in C/C++/CUDA, such as `int` and `int3`, we provide the following data types with the DSL.
| Data type | Description | C/C++/CUDA equivalent |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| Scalar | 32- or 64-bit floating-point number | float or double |
| Vector | A tuple of three 32- or 64-bit floating-point numbers | float3 or double3 |
| Complex | A tuple of two 32- or 64-bit floating-point numbers. The real part is stored in member .x, while the imaginary component is in .y. Basic operations, such as multiplication, are defined as built-in functions. | std::complex<float> or std::complex<double> |
| Matrix      | A tuple of three Vectors, stored in column-major order: for example, Matrix[i][j] is the component on row i, column j. (TODO recheck specs.) | float3[3] or double3[3] |
| ScalarArray | A one-dimensional array of Scalars stored in device memory. Given mesh dimensions (mx, my, mz), consists of max(mx, max(my, mz)) elements. | float[] or double[] |
| ScalarField | An abstraction of a three-dimensional scalar field stored in device memory. Is implemented as a handle to a one-dimensional Scalar array consisting of input and output segments. The data is stored linearly in order i + j * mx + k * mx * my, given some vertex index (i, j, k) and a mesh consisting of (mx, my, mz) vertices. | float[2][] or double[2][] |
| VectorField | An abstraction of a three-dimensional vector field stored in device memory. Is implemented as a tuple of three ScalarField handles. | Three distinct float[2][] or double[2][] arrays for each component. Stored as a structure of arrays. |
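
As a reference for the `ScalarField` storage order, the linear index of a vertex can be computed
as in the following sketch:

```C
#include <stddef.h>

/* Linear index of vertex (i, j, k) in a mesh of (mx, my, mz) vertices,
 * following the storage order i + j * mx + k * mx * my given above. */
static size_t
vertex_index(const int i, const int j, const int k, const int mx, const int my)
{
    return (size_t)i + (size_t)j * mx + (size_t)k * mx * my;
}
```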
## Built-in variables and functions
## Control flow
Runtime constants are as fast as compile-time constants as long as

1. they are not placed in tight loops that could be unrolled, especially loops that include global
memory accesses, and
2. they are not multiplied with each other.

Under these conditions, runtime constants are safe and efficient to use as switches.
## Uniforms
Uniforms are device constants that are loaded at runtime.
## Kernels
Kernels read from the input (`in`) buffers and write their results to the output (`out`) buffers, as described in Section **Input and output buffers**.
## Preprocessed functions
Preprocessed functions are evaluated during the stencil assembly stage, which is the only stage where reading the input fields is possible.