Updated documentation and made it work with Doxygen. Now the doc/doxygen/index.html generated with it looks quite good and contains lots of useful and up-to-date information about Astaroth.

jpekkila
2020-01-13 21:11:04 +02:00
parent a6cf5a8b79
commit d51d48071f
6 changed files with 112 additions and 68 deletions


@@ -1,4 +1,7 @@
# Astaroth specification and user manual
Astaroth Specification and User Manual
============
# Astaroth Specification and User Manual
Copyright (C) 2014-2019, Johannes Pekkila, Miikka Vaisala.
@@ -20,7 +23,7 @@ Copyright (C) 2014-2019, Johannes Pekkila, Miikka Vaisala.
along with Astaroth. If not, see <http://www.gnu.org/licenses/>.
# Introduction and background
# Introduction and Background
Astaroth is a collection of tools for utilizing multiple graphics processing units (GPUs)
efficiently in three-dimensional stencil computations. This document specifies the Astaroth
@@ -67,8 +70,8 @@ to these publications in their work.
The Astaroth application programming interface (API) provides the means for controlling execution of
user-defined and built-in functions on multiple graphics processing units. Functions in the API are
prefixed with lower case ```ac```, while structures and data types are prefixed with capitalized
```Ac```. Compile-time constants, such as definitions and enumerations, have the prefix ```AC_```.
prefixed with lower case `ac`, while structures and data types are prefixed with capitalized
`Ac`. Compile-time constants, such as definitions and enumerations, have the prefix `AC_`.
All of the API functions return an AcResult value indicating either success or failure. The return
codes are
```C
@@ -103,13 +106,13 @@ Finally, a third layer is provided for convenience and backwards compatibility.
There are also several helper functions defined in `include/astaroth_defines.h`, which can be used for, say, determining the size or performing index calculations within the simulation domain.
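For instance, a minimal sketch (assuming an `AcMeshInfo` called `info` that the caller has already configured, and that `NGHOST` is defined by the headers) could use the helpers listed further below:
```C
#include <stdio.h>
#include "astaroth_defines.h"

static void
print_mesh_extents(const AcMeshInfo info)
{
    // Size of the computational domain in bytes and the linear index of the
    // first interior vertex, computed with the helpers from astaroth_defines.h.
    const size_t domain_bytes = acVertexBufferCompdomainSizeBytes(info);
    const size_t idx          = acVertexBufferIdx(NGHOST, NGHOST, NGHOST, info);
    printf("domain: %zu bytes, first interior vertex at index %zu\n", domain_bytes, idx);
}
```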
## List of Astaroth API functions
## List of Astaroth API Functions
Here is a non-exhaustive list of Astaroth API functions. For more information and an up-to-date list, see
the corresponding header files: `include/astaroth_defines.h`, `include/astaroth.h`,
`include/astaroth_node.h` and `include/astaroth_device.h`.
### Initialization, quitting and helper functions
### Initialization, Quitting and Helper Functions
Device layer.
```C
@@ -137,7 +140,7 @@ size_t acVertexBufferCompdomainSizeBytes(const AcMeshInfo info);
size_t acVertexBufferIdx(const int i, const int j, const int k, const AcMeshInfo info);
```
### Loading and storing
### Loading and Storing
Loading meshes and vertex buffers to device memory.
```C
@@ -245,7 +248,7 @@ AcResult acNodeReduceVec(const Node node, const Stream stream_type, const Reduct
const VertexBufferHandle vtxbuf2, AcReal* result);
```
### Stream synchronization
### Stream Synchronization
All library functions that take a `Stream` as a parameter are asynchronous. When calling these
functions, control returns immediately back to the host even if the called device function has not
@@ -273,7 +276,7 @@ barrierSynchronizeStream(STREAM_ALL); // Blocks until functions in all streams h
funcD(STREAM_2); // Is started when command returns from synchronizeStream()
```
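To make the ordering guarantees concrete, here is a short sketch in the same pseudocode style as the listing above (`funcA` and `funcB` are placeholders, not API functions):
```C
funcA(STREAM_1);                      // Returns immediately; work is queued on STREAM_1
funcB(STREAM_2);                      // May execute concurrently with funcA
synchronizeStream(STREAM_1);          // Blocks until everything queued on STREAM_1 has completed
// Results produced by funcA are now safe to use; funcB may still be running
barrierSynchronizeStream(STREAM_ALL); // Blocks until all streams are idle
```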
### Data synchronization
### Data Synchronization
Stream synchronization works in the same fashion on the node and device layers. However, on the node
layer, one has to take into account that a portion of the mesh is shared between devices and that the
@@ -291,14 +294,9 @@ AcResult acNodeSynchronizeVertexBuffer(const Node node, const Stream stream,
```
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid
points outside the computational domain which are used only by a single device. The mesh is
distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-
dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-
dimensional planes. For example with two devices, the data block that has to be up to date ranges
from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
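As a usage sketch (the trailing parameters of `acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` are assumed from the truncated signatures above; `vtxbuf_handle` is a placeholder `VertexBufferHandle`):
```C
// Once the local halos are up to date (see the note above), communicate the
// shared halo areas between the devices in the node.
acNodeSynchronizeVertexBuffer(node, STREAM_1, vtxbuf_handle); // a single vertex buffer
acNodeSynchronizeMesh(node, STREAM_1);                        // all vertex buffers
```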
### Input and output buffers
### Input and Output Buffers
The mesh is duplicated to input and output buffers for performance reasons. The input buffers are
read-only in user-specified compute kernels, which allows us to read them via the texture cache
@@ -313,10 +311,7 @@ is done via the API calls
AcResult acDeviceSwapBuffers(const Device device);
AcResult acNodeSwapBuffers(const Node node);
```
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the
complete result is available in the input buffer when the function has completed. User-specified
kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped
only after calling user-specified kernels.
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
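The intended call pattern is therefore roughly the following sketch, where `solve_substep` stands for an arbitrary user-specified kernel launch and is not an API function:
```C
// A user-specified kernel reads the input buffers and writes to the output
// buffers; the buffers are swapped only after such a kernel has completed.
solve_substep(device, STREAM_1); // hypothetical user-specified kernel launch
acDeviceSwapBuffers(device);     // the output buffers become the new input buffers
```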
## Devices
@@ -420,7 +415,7 @@ Let *i* be the device id. The portion of the halos shared by neighboring devices
`acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` communicate these shared areas among
the devices in the node.
## Integration, reductions and boundary conditions
## Integration, Reductions and Boundary Conditions
The library provides the following functions for integration, reductions and computing periodic
boundary conditions.
@@ -487,18 +482,18 @@ pipeline shown in the following figure.
| Stage | File ending | Description |
|--------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Stencil assembly | .sas | Defines the shape of the stencils and functions to be preprocessed before entering the stencil processing stage. Reading from input arrays is only possible during this stage. |
| Stencil process | .sps | The functions executed on streams of data are defined here. Contains kernels, which are essentially main functions of GPU programs. |
| Stencil definition | .sdh | All field identifiers and constant memory symbols are defined in this file. |
| Any | .h | Optional header files which can be included in any other file. |
| Stencil assembly | .ac | Defines the shape of the stencils and functions to be preprocessed before entering the stencil processing stage. Reading from input arrays is only possible during this stage. |
| Stencil process | .ac | The functions executed on streams of data are defined here. Contains kernels, which are essentially main functions of GPU programs. |
| Stencil definition | .ac | All field identifiers and constant memory symbols are defined in this file. |
| Any | .h | Optional header files which can be included in any file. |
Compilation of the DSL files is integrated into the `CMakeLists.txt` provided with the library and
dependencies are recompiled if needed when calling `make`. All DSL files should reside in the same
directory and there should be only one `.sas`, `.sps` and `.sdh` file. There may be any number of
directory and there should be only one `.ac` file. There may be any number of
optional `.h` files. When configuring the project, the user should pass the path to the DSL
directory as a cmake option like so: ```cmake -DDSL_MODULE_DIR="some/user/dir" ..```.
## Data types
## Data Types
In addition to basic datatypes in C/C++/CUDA, such as int and int3, we provide the following datatypes with the DSL.
@@ -517,13 +512,13 @@ In addition to basic datatypes in C/C++/CUDA, such as int and int3, we provide t
`Scalars` are 32-bit floating-point numbers by default. Double precision can be turned on by setting the CMake option `DOUBLE_PRECISION=ON`.
All real number literals are converted automatically to the correct precision. In cases where explicit control is needed, the precision can be declared by appending the `f` or `d` postfix to the real number. For example,
```C
1.0 // The same precision as Scalar/AcReal
1.0f // Explicit float
1.0d // Explicit double
1.0 // The same precision as Scalar/AcReal
1.0f // Explicit float
1.0d // Explicit double
(1.0f * 1.0d) // 1.0f is implicitly cast to double and the multiplication is done in double precision.
```
## Control flow
## Control Flow
Conditional statements are expressed with the `if-else` construct. Unlike in C and C++, we require
that the scope of the `if-else` statement is explicitly declared using braces `{` and `}` in order
@@ -566,19 +561,21 @@ The following built-in variables are available in `Kernel`s.
| globalVertexIdx | Holds the global index of the currently processed vertex. If there is only a single device, then vertexIdx is the same as globalVertexIdx. Otherwise globalVertexIdx is offset accordingly. |
| globalGridN | Holds the dimensions of the computational domain. |
## Preprocessed functions
## Preprocessed Functions
The type qualifier `Preprocessed` indicates which functions should be evaluated immediately when
entering a `Kernel` function. The return values of `Preprocessed` functions are cached and calling
these functions during the stencil processing stage is essentially free. As main memory is
significantly slower than on-chip memories and registers, declaring reading-heavy functions as
`Preprocessed` is critical for obtaining good performance in stencil codes.
`Preprocessed` functions may only be defined in stencil assembly files.
`Preprocessed` is critical for obtaining good performance in stencil codes.
The built-in variables `vertexIdx`, `globalVertexIdx` and `globalGridN` are available in all
`Preprocessed` functions.
## Device Functions
The type qualifier `Device` indicates which functions can be called from `Kernel` functions or other `Device` functions.
## Uniforms
`Uniform`s are global device variables which stay constant for the duration of a kernel launch.
@@ -603,17 +600,23 @@ Instead, one should load the appropriate values during runtime using the `acLoad
related functions.
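As a sketch of the intended workflow (the exact `acLoad`-prefixed function is not shown in this excerpt, so the name, parameters and the `AC_dt` handle below are placeholders only):
```C
// Load a runtime value into a uniform before launching kernels that read it.
acLoadScalarUniform(device, STREAM_1, AC_dt, (AcReal)1e-3); // placeholder name and signature
```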
## Standard libraries
## Standard Libraries
> Not implemented
The following table lists the standard libraries currently available.
## Performance considerations
| Library | Description |
|------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| stdderiv.h | Contains functions for computing 2nd, 4th, 6th and 8th order derivatives (configured by defining `STENCIL_ORDER` before including `stdderiv.h`) |
Astaroth DSL libraries can be included in the same way as C/C++ headers. For example, `#include <stdderiv.h>`.
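For example, a DSL file could select the derivative order and then pull in the library as follows (the particular order value is only illustrative):
```C
// Configure the stencil order before including the standard derivative library.
#define STENCIL_ORDER (6)
#include <stdderiv.h>
```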
## Performance Considerations
Uniforms are as fast as compile-time constants as long as
0. The halting condition of a tight loop does not depend on an uniform or a variable, as this would prevent unrolling of the loop during compile-time.
0. Uniforms are not multiplied with each other. The result should be stored in an auxiliary uniform instead. For example, the result of `nx * ny` should be stored in a new `uniform nxy`
0. At least 32 neighboring streams in the x-axis access the same `uniform`. That is, the vertices at vertexIdx.x = i... i + 32 should access the same `uniform` where i is a multiple of 32.
1. The halting condition of a tight loop does not depend on a uniform or a variable, as this would prevent unrolling of the loop at compile time.
2. Uniforms are not multiplied with each other. The result should be stored in an auxiliary uniform instead. For example, the result of `nx * ny` should be stored in a new `uniform nxy` (see the sketch after this list).
3. At least 32 neighboring streams along the x axis access the same `uniform`. That is, the vertices at `vertexIdx.x = i, ..., i + 32` should access the same `uniform`, where `i` is a multiple of 32.
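A sketch of consideration 2 (the declaration syntax is assumed; `nxy` is the auxiliary uniform mentioned above):
```C
// Instead of evaluating nx * ny inside a kernel for every vertex, store the
// product in an auxiliary uniform that the host loads once at runtime.
uniform int nxy; // holds nx * ny, loaded via the acLoad-related functions
```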