latest results to MPI Pack post

Carl Pearson
2020-10-09 16:48:36 -06:00
parent 12460a7f5a
commit b9807976ba
3 changed files with 61 additions and 43 deletions

Binary file not shown (before: 47 KiB, after: 35 KiB).

Binary file not shown (before: 39 KiB, after: 37 KiB).

@@ -1,6 +1,5 @@
+++
title = "MPI_Pack performance on GPUs"
subtitle = ""
title = "Improving MPI_Pack performance in CUDA-aware MPI"
date = 2020-10-06T00:00:00
lastmod = 2020-10-06T00:00:00
draft = false
@@ -41,11 +40,11 @@ categories = []
+++
*this post is still a work in progress*
# Abstract
Roll your own MPI packing and unpacking on GPUs for a **10,000x** speedup.
I am working on an MPI wrapper library that integrates these improvements (and other research results) into existing MPI code without any source changes.
Please contact me if you are interested in trying it out.
# Background
@@ -185,46 +184,68 @@ Each contiguous row is a block, and we provide the library function with an arra
**kernel**
The final comparison is a custom GPU kernel that can directly read from the source memory and compute the propery offsets in the device buffer, consistent with `MPI_Pack`.
The final comparison is a custom GPU kernel that can directly read from the source memory and compute the proper offsets in the device buffer, consistent with `MPI_Pack`.
A key component of the performance of the kernel is the chosen block and grid dimensions.
The block is sized to distribute a chosen number of threads across the X, Y, and Z dimensions, filling X, then Y, then Z until the available threads are exhausted.
The grid dimensions are then chosen to tile the blocks across the entire 3D region to be packed.
Without such intelligent sizing, each thread would execute many loop iterations, drastically reducing performance.
Further improvement is gained by specializing the kernel call according to `blockLength`.
Groups of sequential X-threads collaborate to load the `blockLength`-long chunks.
If `blockLength` is evenly divisible by a larger word size, that word size is used.
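As a rough sketch of how that launch configuration and specialization might be chosen (not the benchmark's actual code; the helper names and the 512-thread budget are assumptions for illustration):

```c++
#include <cuda_runtime.h>
#include <algorithm>

// Pick the largest word size (in bytes) that evenly divides the block length,
// so each X-thread can move one aligned word instead of single bytes.
static unsigned pack_word_size(unsigned blockLength) {
  if (blockLength % 8 == 0) return 8;
  if (blockLength % 4 == 0) return 4;
  if (blockLength % 2 == 0) return 2;
  return 1;
}

// Fill a per-block thread budget along X, then Y, then Z, and tile the grid
// so the whole region is covered even when a dimension exceeds the block.
static void pack_launch_dims(unsigned blockLength, // bytes per contiguous block
                             unsigned count0,      // blocks per group
                             unsigned count1,      // number of block groups
                             unsigned wordSize,    // from pack_word_size()
                             dim3 &dimBlock, dim3 &dimGrid) {
  const unsigned budget = 512; // threads per block (illustrative choice)
  const unsigned words = std::max(1u, blockLength / wordSize); // X work items
  const unsigned x = std::min(words, budget);
  const unsigned y = std::max(1u, std::min(count0, budget / x));
  const unsigned z = std::max(1u, std::min(count1, budget / (x * y)));
  dimBlock = dim3(x, y, z);
  dimGrid = dim3((words + x - 1) / x, (count0 + y - 1) / y, (count1 + z - 1) / z);
}
```

The largest divisor (8, 4, 2, or 1 bytes) then selects the corresponding `pack_bytes_n<N>` specialization shown below.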
```c++
__device__ void grid_pack(void *__restrict__ dst, const cudaPitchedPtr src,
const Dim3 srcPos, // logical offset into the 3D region, in elements
const Dim3 srcExtent, // logical extent of the 3D region to pack, in elements
const size_t elemSize // size of the element in bytes
template <unsigned N>
__global__ static void pack_bytes_n(
void *__restrict__ outbuf, int position, const void *__restrict__ inbuf,
const int incount,
unsigned blockLength, // block length (B)
unsigned count0, // count of inner blocks in a group
unsigned stride0, // stride (B) between start of inner blocks in group
unsigned count1, // number of block groups
unsigned stride1 // stride (B) between start of block groups
) {
const char *__restrict__ sp = static_cast<char *>(src.ptr);
assert(blockLength % N == 0); // N should evenly divide block length
const unsigned int tz = blockDim.z * blockIdx.z + threadIdx.z;
const unsigned int ty = blockDim.y * blockIdx.y + threadIdx.y;
const unsigned int tx = blockDim.x * blockIdx.x + threadIdx.x;
for (unsigned int zo = tz; zo < srcExtent.z; zo += blockDim.z * gridDim.z) {
unsigned int zi = zo + srcPos.z;
for (unsigned int yo = ty; yo < srcExtent.y; yo += blockDim.y * gridDim.y) {
unsigned int yi = yo + srcPos.y;
for (unsigned int xo = tx; xo < srcExtent.x; xo += blockDim.x * gridDim.x) {
unsigned int xi = xo + srcPos.x;
char *__restrict__ op = reinterpret_cast<char *>(outbuf);
const char *__restrict__ ip = reinterpret_cast<const char *>(inbuf);
// logical offset of packed output
const size_t oi = zo * srcExtent.y * srcExtent.x + yo * srcExtent.x + xo;
// printf("[xo, yo, zo]->oi = [%u, %u, %u]->%lu\n", xo, yo, zo, oi);
// byte offset of input
const size_t bi = zi * src.ysize * src.pitch + yi * src.pitch + xi * elemSize;
// printf("[xi, yi, zi]->bi = [%u, %u, %u]->%lu\n", xi, yi, zi, bi);
if (1 == elemSize) {
char v = *reinterpret_cast<const char *>(sp + bi);
reinterpret_cast<char *>(dst)[oi] = v;
} else if (4 == elemSize) {
uint32_t v = *reinterpret_cast<const uint32_t *>(sp + bi);
reinterpret_cast<uint32_t *>(dst)[oi] = v;
} else if (8 == elemSize) {
uint64_t v = *reinterpret_cast<const uint64_t *>(sp + bi);
reinterpret_cast<uint64_t *>(dst)[oi] = v;
} else {
char *pDst = reinterpret_cast<char *>(dst);
memcpy(&pDst[oi * elemSize], sp + bi, elemSize);
for (int i = 0; i < incount; ++i) {
char *__restrict__ dst = op + position + i * count1 * count0 * blockLength;
const char *__restrict__ src = ip + i * stride1 * count1 * stride0 * count0;
for (unsigned z = tz; z < count1; z += gridDim.z * blockDim.z) {
for (unsigned y = ty; y < count0; y += gridDim.y * blockDim.y) {
for (unsigned x = tx; x < blockLength / N;
x += gridDim.x * blockDim.x) {
unsigned bo = z * count0 * blockLength + y * blockLength + x * N;
unsigned bi = z * stride1 + y * stride0 + x * N;
// printf("%u -> %u\n", bi, bo);
if (N == 1) {
dst[bo] = src[bi];
} else if (N == 2) {
uint16_t *__restrict__ d = reinterpret_cast<uint16_t *>(dst + bo);
const uint16_t *__restrict__ s =
reinterpret_cast<const uint16_t *>(src + bi);
*d = *s;
} else if (N == 4) {
uint32_t *__restrict__ d = reinterpret_cast<uint32_t *>(dst + bo);
const uint32_t *__restrict__ s =
reinterpret_cast<const uint32_t *>(src + bi);
*d = *s;
} else if (N == 8) {
uint64_t *__restrict__ d = reinterpret_cast<uint64_t *>(dst + bo);
const uint64_t *__restrict__ s =
reinterpret_cast<const uint64_t *>(src + bi);
*d = *s;
}
}
}
}
@@ -237,7 +258,7 @@ __device__ void grid_pack(void *__restrict__ dst, const cudaPitchedPtr src,
Each operation is executed 5 times, and the trimean is reported.
For MPI_Pack, the measured time begins when MPI_Pack is called and ends after a subsequent `cudaDeviceSynchronize`.
This is necessary because a CUDA-aware MPI_Pack may be asynchronous with respect to the CPU: the implementation may insert future CUDA operations into the same stream to ensure their ordering.
For the GPU kernel, the measured time begins at the kernel invocation and ends after a subsequent `cudaDeviceSynchronize`.
The GPU time includes any time required to set up the kernel launch.
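A minimal sketch of that measurement loop (the function names and trimean helper are illustrative, not the benchmark's actual code):

```c++
#include <mpi.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <chrono>
#include <vector>

// Time one MPI_Pack of a device-resident datatype. The cudaDeviceSynchronize
// is part of the measured region because a CUDA-aware MPI_Pack may return
// before the GPU work it enqueued has completed.
static double time_pack(const void *inbuf, int incount, MPI_Datatype ty,
                        void *outbuf, int outsize) {
  int position = 0;
  auto start = std::chrono::steady_clock::now();
  MPI_Pack(inbuf, incount, ty, outbuf, outsize, &position, MPI_COMM_WORLD);
  cudaDeviceSynchronize();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Tukey's trimean of five samples: (Q1 + 2*median + Q3) / 4, taking the
// second and fourth sorted values as the quartiles.
static double trimean5(std::vector<double> v) {
  std::sort(v.begin(), v.end());
  return (v[1] + 2.0 * v[2] + v[3]) / 4.0;
}
```

The GPU kernel is timed the same way, with the kernel launch in place of the `MPI_Pack` call.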
The following table describes the evaluation system.
@@ -279,14 +300,14 @@ And for openMPI-4.0.5:
# Discussion
In most cases, the GPU kernel is **100x-10,000x** faster than OpenMPI and **30x-3,000x** faster than mvapich.
It matches the underlying MPI_Pack performance when the region to be packed is already contiguous.
The two images below are taken from the Nsight Systems profiling tool.
It shows a timeline of GPU activity alongside the CUDA API calls that produced it.
For mvapich, each contiguous block is transferred using a call to cudaMemcpyAsync.
This means that one cudaMemcpyAsync call is generated for each row.
As we can see, that actuall call takes much more time than the corresponding GPU activity, causing the slow performance.
For mvapich, each contiguous block is transferred using a call to `cudaMemcpyAsync`.
This means that one `cudaMemcpyAsync` call is generated for each `blockLength` chunk.
As we can see, that actual call takes much more time than the corresponding GPU activity, causing the slow performance.
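That pattern would look roughly like the sketch below; this is what the timeline suggests, not mvapich's actual source, and the function name and parameters are illustrative.

```c++
#include <cuda_runtime.h>

// Pack a 2D strided region by issuing one cudaMemcpyAsync per contiguous
// blockLength-byte chunk: count0 * count1 small copies, each carrying far
// more CPU launch overhead than GPU copy time.
static void pack_with_memcpys(char *outbuf, const char *inbuf,
                              unsigned blockLength, unsigned count0,
                              unsigned stride0, unsigned count1,
                              unsigned stride1, cudaStream_t stream) {
  for (unsigned z = 0; z < count1; ++z) {
    for (unsigned y = 0; y < count0; ++y) {
      cudaMemcpyAsync(outbuf + (z * count0 + y) * blockLength,
                      inbuf + z * stride1 + y * stride0,
                      blockLength, cudaMemcpyDefault, stream);
    }
  }
}
```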
**strided mvapich**
@@ -297,9 +318,7 @@ The OpenMPI situation is much worse - for some reason, the CPU is forced to sync
**strided openmpi**
![](openmpi-strided.png)
When the nature of the copy causes the source memory to be contiguous, there is a single cudaMemcpyAsync call, which is much faster than the GPU kernel.
(5 copies are shown due to 5 measurement iterations).
Here, we can also see that OpenMPI's unnecessary synchronizations degrade performance by causing the GPU to be idle.
When the nature of the copy causes the source memory to be contiguous, there is a single cudaMemcpyAsync call, which makes as effective use of the GPU bandwidth as the GPU kernel does.
**contiguous mvapich**
![](mvapich-contiguous.png)
@@ -310,7 +329,6 @@ Here, we can also see that OpenMPI's unnecessary synchronizations degrade perfor
The GPU kernel performance is limited by inefficient use of DRAM bandwidth.
For example, when loading a single byte separated by hundreds of bytes, most of each 128B cache line is wasted.
For the contiguous 1 MiB transfers, where the kernel is slower, performance is limited by each 32-thread warp collaboratively loading 32B.
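As a rough bound, if 1-byte elements are separated by more than a cache line, each 128B line read from DRAM delivers only 1 useful byte, so at most about 1/128 (under 1%) of the achieved DRAM bandwidth ends up in the packed buffer; 4-byte elements raise that ceiling to only 1/32.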
# Conclusion