Updated documentation and made it work with Doxygen. Now the doc/doxygen/index.html generated with it looks quite good and contains lots of useful and up-to-date information about Astaroth.

jpekkila
2020-01-13 21:11:04 +02:00
parent a6cf5a8b79
commit d51d48071f
6 changed files with 112 additions and 68 deletions


@@ -1,4 +1,7 @@
# Astaroth specification and user manual
Astaroth Specification and User Manual
============
# Astaroth Specification and User Manual
Copyright (C) 2014-2019, Johannes Pekkila, Miikka Vaisala.
@@ -20,7 +23,7 @@ Copyright (C) 2014-2019, Johannes Pekkila, Miikka Vaisala.
along with Astaroth. If not, see <http://www.gnu.org/licenses/>.
# Introduction and background
# Introduction and Background
Astaroth is a collection of tools for utilizing multiple graphics processing units (GPUs)
efficiently in three-dimensional stencil computations. This document specifies the Astaroth
@@ -67,8 +70,8 @@ to these publications in their work.
The Astaroth application programming interface (API) provides the means for controlling execution of
user-defined and built-in functions on multiple graphics processing units. Functions in the API are
prefixed with lower case ```ac```, while structures and data types are prefixed with capitalized
```Ac```. Compile-time constants, such as definitions and enumerations, have the prefix ```AC_```.
prefixed with lower case `ac`, while structures and data types are prefixed with capitalized
`Ac`. Compile-time constants, such as definitions and enumerations, have the prefix `AC_`.
All of the API functions return an AcResult value indicating either success or failure. The return
codes are
```C
@@ -103,13 +106,13 @@ Finally, a third layer is provided for convenience and backwards compatibility.
There are also several helper functions defined in `include/astaroth_defines.h`, which can be used for, say, determining the size or performing index calculations within the simulation domain.
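For instance, a minimal sketch (assuming an `AcMeshInfo` called `info` that the caller has already configured, and that `NGHOST` is defined by the headers) could use the helpers listed further below:
```C
#include <stdio.h>
#include "astaroth_defines.h"

static void
print_mesh_extents(const AcMeshInfo info)
{
    // Size of the computational domain in bytes and the linear index of the
    // first interior vertex, computed with the helpers from astaroth_defines.h.
    const size_t domain_bytes = acVertexBufferCompdomainSizeBytes(info);
    const size_t idx          = acVertexBufferIdx(NGHOST, NGHOST, NGHOST, info);
    printf("domain: %zu bytes, first interior vertex at index %zu\n", domain_bytes, idx);
}
```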
## List of Astaroth API functions
## List of Astaroth API Functions
Here is a non-exhaustive list of Astaroth API functions. For more information and an up-to-date list, see
the corresponding header files: `include/astaroth_defines.h`, `include/astaroth.h`,
`include/astaroth_node.h` and `include/astaroth_device.h`.
### Initialization, quitting and helper functions
### Initialization, Quitting and Helper Functions
Device layer.
```C
@@ -137,7 +140,7 @@ size_t acVertexBufferCompdomainSizeBytes(const AcMeshInfo info);
size_t acVertexBufferIdx(const int i, const int j, const int k, const AcMeshInfo info);
```
### Loading and storing
### Loading and Storing
Loading meshes and vertex buffers to device memory.
```C
@@ -245,7 +248,7 @@ AcResult acNodeReduceVec(const Node node, const Stream stream_type, const Reduct
const VertexBufferHandle vtxbuf2, AcReal* result);
```
### Stream synchronization
### Stream Synchronization
All library functions that take a `Stream` as a parameter are asynchronous. When calling these
functions, control returns immediately back to the host even if the called device function has not
@@ -273,7 +276,7 @@ barrierSynchronizeStream(STREAM_ALL); // Blocks until functions in all streams h
funcD(STREAM_2); // Is started when command returns from synchronizeStream()
```
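To make the ordering guarantees concrete, here is a short sketch in the same pseudocode style as the listing above (`funcA` and `funcB` are placeholders, not API functions):
```C
funcA(STREAM_1);                      // Returns immediately; work is queued on STREAM_1
funcB(STREAM_2);                      // May execute concurrently with funcA
synchronizeStream(STREAM_1);          // Blocks until everything queued on STREAM_1 has completed
// Results produced by funcA are now safe to use; funcB may still be running
barrierSynchronizeStream(STREAM_ALL); // Blocks until all streams are idle
```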
### Data synchronization
### Data Synchronization
Stream synchronization works in the same fashion on the node and device layers. However, on the node
layer, one has to take into account that a portion of the mesh is shared between devices and that the
@@ -291,14 +294,9 @@ AcResult acNodeSynchronizeVertexBuffer(const Node node, const Stream stream,
```
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid
points outside the computational domain which are used only by a single device. The mesh is
distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-
dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-
dimensional planes. For example with two devices, the data block that has to be up to date ranges
from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*
> **NOTE**: Local halos must be up to date before synchronizing the data. Local halos are the grid points outside the computational domain which are used only by a single device. The mesh is distributed to multiple devices by blocking along the z axis. If there are *n* devices and the z-dimension of the computational domain is *nz*, then each device is assigned *nz / n* two-dimensional planes. For example with two devices, the data block that has to be up to date ranges from *(0, 0, nz)* to *(mx, my, nz + 2 * NGHOST)*.
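As a usage sketch (the trailing parameters of `acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` are assumed from the truncated signatures above; `vtxbuf_handle` is a placeholder `VertexBufferHandle`):
```C
// Once the local halos are up to date (see the note above), communicate the
// shared halo areas between the devices in the node.
acNodeSynchronizeVertexBuffer(node, STREAM_1, vtxbuf_handle); // a single vertex buffer
acNodeSynchronizeMesh(node, STREAM_1);                        // all vertex buffers
```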
### Input and output buffers
### Input and Output Buffers
The mesh is duplicated to input and output buffers for performance reasons. The input buffers are
read-only in user-specified compute kernels, which allows us to read them via the texture cache
@@ -313,10 +311,7 @@ is done via the API calls
AcResult acDeviceSwapBuffers(const Device device);
AcResult acNodeSwapBuffers(const Node node);
```
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the
complete result is available in the input buffer when the function has completed. User-specified
kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped
only after calling user-specified kernels.
> **NOTE**: All functions provided with the API operate on input buffers and ensure that the complete result is available in the input buffer when the function has completed. User-specified kernels are exceptions and write the result to output buffers. Therefore buffers have to be swapped only after calling user-specified kernels.
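The intended call pattern is therefore roughly the following sketch, where `solve_substep` stands for an arbitrary user-specified kernel launch and is not an API function:
```C
// A user-specified kernel reads the input buffers and writes to the output
// buffers; the buffers are swapped only after such a kernel has completed.
solve_substep(device, STREAM_1); // hypothetical user-specified kernel launch
acDeviceSwapBuffers(device);     // the output buffers become the new input buffers
```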
## Devices
@@ -420,7 +415,7 @@ Let *i* be the device id. The portion of the halos shared by neighboring devices
`acNodeSynchronizeVertexBuffer` and `acNodeSynchronizeMesh` communicate these shared areas among
the devices in the node.
## Integration, reductions and boundary conditions
## Integration, Reductions and Boundary Conditions
The library provides the following functions for integration, reductions and computing periodic
boundary conditions.
@@ -487,18 +482,18 @@ pipeline shown in the following figure.
| Stage | File ending | Description |
|--------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Stencil assembly | .sas | Defines the shape of the stencils and functions to be preprocessed before entering the stencil processing stage. Reading from input arrays is only possible during this stage. |
| Stencil process | .sps | The functions executed on streams of data are defined here. Contains kernels, which are essentially main functions of GPU programs. |
| Stencil definition | .sdh | All field identifiers and constant memory symbols are defined in this file. |
| Any | .h | Optional header files which can be included in any other file. |
| Stencil assembly | .ac | Defines the shape of the stencils and functions to be preprocessed before entering the stencil processing stage. Reading from input arrays is only possible during this stage. |
| Stencil process | .ac | The functions executed on streams of data are defined here. Contains kernels, which are essentially main functions of GPU programs. |
| Stencil definition | .ac | All field identifiers and constant memory symbols are defined in this file. |
| Any | .h | Optional header files which can be included in any file. |
Compilation of the DSL files is integrated into the `CMakeLists.txt` provided with the library and
dependencies are recompiled if needed when calling `make`. All DSL files should reside in the same
directory and there should be only one `.sas`, `.sps` and `.sdh` file. There may be any number of
directory and there should be only one `.ac` file. There may be any number of
optional `.h` files. When configuring the project, the user should pass the path to the DSL
directory as a cmake option like so: ```cmake -DDSL_MODULE_DIR="some/user/dir" ..```.
## Data types
## Data Types
In addition to basic datatypes in C/C++/CUDA, such as int and int3, we provide the following datatypes with the DSL.
@@ -517,13 +512,13 @@ In addition to basic datatypes in C/C++/CUDA, such as int and int3, we provide t
`Scalars` are 32-bit floating-point numbers by default. Double precision can be turned on by setting the CMake option `DOUBLE_PRECISION=ON`.
All real number literals are converted automatically to the correct precision. In cases where explicit control is needed, the precision can be declared by appending the `f` or `d` postfix to the real number. For example,
```C
1.0 // The same precision as Scalar/AcReal
1.0f // Explicit float
1.0d // Explicit double
1.0 // The same precision as Scalar/AcReal
1.0f // Explicit float
1.0d // Explicit double
(1.0f * 1.0d) // 1.0f is implicitly cast to double and the multiplication is done in double precision.
```
## Control flow
## Control Flow
Conditional statements are expressed with the `if-else` construct. Unlike in C and C++, we require
that the scope of the `if-else` statement is explicitly declared using braces `{` and `}` in order
@@ -566,19 +561,21 @@ The following built-in variables are available in `Kernel`s.
| globalVertexIdx | Holds the global index of the currently processed vertex. If there is only a single device, then vertexIdx is the same as globalVertexIdx. Otherwise globalVertexIdx is offset accordingly. |
| globalGridN | Holds the dimensions of the computational domain. |
## Preprocessed functions
## Preprocessed Functions
The type qualifier `Preprocessed` indicates which functions should be evaluated immediately when
entering a `Kernel` function. The return values of `Preprocessed` functions are cached and calling
these functions during the stencil processing stage is essentially free. As main memory is
significantly slower than on-chip memories and registers, declaring reading-heavy functions as
`Preprocessed` is critical for obtaining good performance in stencil codes.
`Preprocessed` functions may only be defined in stencil assembly files.
`Preprocessed` is critical for obtaining good performance in stencil codes.
The built-in variables `vertexIdx`, `globalVertexIdx` and `globalGridN` are available in all
`Preprocessed` functions.
## Device Functions
The type qualifier `Device` indicates which functions can be called from `Kernel` functions or other `Device` functions.
## Uniforms
`Uniform`s are global device variables which stay constant for the duration of a kernel launch.
@@ -603,17 +600,23 @@ Instead, one should load the appropriate values during runtime using the `acLoad
related functions.
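As a sketch of the intended workflow (the exact `acLoad`-prefixed function is not shown in this excerpt, so the name, parameters and the `AC_dt` handle below are placeholders only):
```C
// Load a runtime value into a uniform before launching kernels that read it.
acLoadScalarUniform(device, STREAM_1, AC_dt, (AcReal)1e-3); // placeholder name and signature
```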
## Standard libraries
## Standard Libraries
> Not implemented
The following table lists the standard libraries currently available.
## Performance considerations
| Library | Description |
|------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| stdderiv.h | Contains functions for computing 2nd, 4th, 6th and 8th order derivatives (configured by defining `STENCIL_ORDER` before including `stdderiv.h`) |
Astaroth DSL libraries can be included in the same way as C/C++ headers. For example, `#include <stdderiv.h>`.
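For example, a DSL file could select the derivative order and then pull in the library as follows (the particular order value is only illustrative):
```C
// Configure the stencil order before including the standard derivative library.
#define STENCIL_ORDER (6)
#include <stdderiv.h>
```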
## Performance Considerations
Uniforms are as fast as compile-time constants as long as
0. The halting condition of a tight loop does not depend on an uniform or a variable, as this would prevent unrolling of the loop during compile-time.
0. Uniforms are not multiplied with each other. The result should be stored in an auxiliary uniform instead. For example, the result of `nx * ny` should be stored in a new `uniform nxy`
0. At least 32 neighboring streams in the x-axis access the same `uniform`. That is, the vertices at vertexIdx.x = i... i + 32 should access the same `uniform` where i is a multiple of 32.
1. The halting condition of a tight loop does not depend on a uniform or a variable, as this would prevent unrolling of the loop at compile time.
2. Uniforms are not multiplied with each other. The result should be stored in an auxiliary uniform instead. For example, the result of `nx * ny` should be stored in a new `uniform nxy` (see the sketch after this list).
3. At least 32 neighboring streams along the x axis access the same `uniform`. That is, the vertices at `vertexIdx.x = i, ..., i + 32` should access the same `uniform`, where `i` is a multiple of 32.
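A sketch of consideration 2 (the declaration syntax is assumed; `nxy` is the auxiliary uniform mentioned above):
```C
// Instead of evaluating nx * ny inside a kernel for every vertex, store the
// product in an auxiliary uniform that the host loads once at runtime.
uniform int nxy; // holds nx * ny, loaded via the acLoad-related functions
```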