Files
hugo-cwpearson/content/post/20201006-mpi-pack/index.md
Carl Pearson 803c7d5a7b .
2020-10-06 17:35:47 -06:00

12 KiB

+++ title = "MPI_Pack performance on GPUs" subtitle = "" date = 2020-10-06T00:00:00 lastmod = 2020-10-06T00:00:00 draft = false

Authors. Comma separated list, e.g. ["Bob Smith", "David Jones"].

authors = ["Carl Pearson"]

tags = ["GPU", "MPI", "CUDA", "MPI_Pack"]

summary = ""

Projects (optional).

Associate this post with one or more of your projects.

Simply enter your project's folder or file name without extension.

E.g. projects = ["deep-learning"] references

content/project/deep-learning/index.md.

Otherwise, set projects = [].

projects = ["stencil"]

Featured image

To use, add an image named featured.jpg/png to your project's folder.

[image]

Caption (optional)

caption = ""

Focal point (optional)

Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight

focal_point = "Center"

Show image only in page previews?

preview_only = true

categories = []

Set captions for image gallery.

+++

this post is still a work in progress

Abstract

Roll your own MPI packing and unpacking on GPUs for a 10,000x speedup.

Background

MPI_Pack and MPI_Unpack are commonly-used MPI library functions for dealing with non-contiguous data. Through the MPI_Type* class of function, datatypes may be recursively described in terms of the basic MPI datatypes (MPI_FLOAT, etc), and then MPI functions may be called on those types, removing the burden of indexing, offsets, and alignment from the user.

This is a fundamental operation in MPI, where we often prefer to send one larger message rather than many small messages. Instead of MPI_Sending each value, we first pack separate values into a contiguous memory space to form larger messages.

The case examined here is using MPI's types to describe a 3D object to be packed into a single contiguous buffer.

The total source allocation is 1 GiB in which a 1024x1024x1024 cube of 1-byte elements is stored in XYZ order (adjacent elements in memory are adjacent in X).

When a subset of that region is packed, this introduces strided memory reads. For example, a (100, 200, 300) region would be represented in memory as 100 contiguous bytes, separated by 924 (1024-100) bytes before the next group of 100 contiguous bytes begins. This pattern would be repeated 200 times to form a single XY plane of the cube. The entire cube is made up of 300 of those planes, each starting 1024*1024 bytes from the start of the previous plane.

We examine 4 different ways of describing the 3D structure to MPI, and also include a custom GPU kernel.

The test code can be found at https://github.com/cwpearson/stencil in bin/bench_mpi_pack.cu.

v_hv_hv (vector-hvector-hvector)

Here, we use MPI_Type_vector to describe the contiguous bytes in a row in a new type rowType. MPI_Type_create_hvector is applied to that, to describe the strided rows that make up a plane in planeType, and then again the strided planes that make up a cube in fullType.

  MPI_Datatype rowType = {};
  MPI_Datatype planeType = {};
  MPI_Datatype fullType = {};
  {
    {
      {
        // number of blocks
        int count = ce.width;
        // number of elements in each block
        int blocklength = 1;
        // number of elements between the start of each block
        const int stride = 1;
        MPI_Type_vector(count, blocklength, stride, MPI_BYTE, &rowType);
        MPI_Type_commit(&rowType);
      }
      int count = ce.height;
      int blocklength = 1;
      // bytes between start of each block
      const int stride = ae.width;
      MPI_Type_create_hvector(count, blocklength, stride, rowType, &planeType);
      MPI_Type_commit(&planeType);
    }
    int count = ce.depth;
    int blocklength = 1;
    // bytes between start of each block
    const int stride = ae.width * ae.height;
    MPI_Type_create_hvector(count, blocklength, stride, planeType, &fullType);
    MPI_Type_commit(&fullType);
  }

v_hv (vector-hvector)

Here, we remove one hvector from the hierarchy by using MPI_Type_vector to directly describe the strided blocks of bytes that make up a plane. MPI_Type_create_hvector is again used to represent the cube.

  MPI_Datatype planeType = {};
  MPI_Datatype fullType = {};
  {
    {
      // number of blocks
      int count = ce.height;
      // number of elements in each block
      int blocklength = ce.width;
      // elements between start of each block
      const int stride = ae.width;
      MPI_Type_vector(count, blocklength, stride, MPI_BYTE, &planeType);
      MPI_Type_commit(&planeType);
    }
    int count = ce.depth;
    int blocklength = 1;
    // bytes between start of each block
    const int stride = ae.width * ae.height;
    MPI_Type_create_hvector(count, blocklength, stride, planeType, &fullType);
    MPI_Type_commit(&fullType);
  }

hi (hindexed)

Here, we use MPI_Type_create_hindexed to describe the cube. Each contiguous row is a block, and we provide the library function with an array of block lengths (all the same) and block offsets.

  MPI_Datatype fullType = {};
  // z*y rows
  const int count = copyExt.z * copyExt.y;

  // byte offset of each row
  MPI_Aint *const displacements = new MPI_Aint[count];
  for (int64_t z = 0; z < copyExt.z; ++z) {
    for (int64_t y = 0; y < copyExt.y; ++y) {
      MPI_Aint bo = z * allocExt.y * allocExt.x + y * allocExt.x;
      // std::cout << bo << "\n";
      displacements[z * copyExt.y + y] = bo;
    }
  }
  // each row is the same length
  int *const blocklengths = new int[count];
  for (int i = 0; i < count; ++i) {
    blocklengths[i] = copyExt.x;
  }

  MPI_Type_create_hindexed(count, blocklengths, displacements, MPI_BYTE, &fullType);
  MPI_Type_commit(&fullType);

hib (hindexed_block)

MPI_Type_create_hindexed_block is the same as MPI_Type_create_hindexed except each block length must be a constant.

  MPI_Datatype fullType = {};
  // z*y rows
  const int count = copyExt.z * copyExt.y;
  const int blocklength = copyExt.x;

  // byte offset of each row
  MPI_Aint *const displacements = new MPI_Aint[count];
  for (int64_t z = 0; z < copyExt.z; ++z) {
    for (int64_t y = 0; y < copyExt.y; ++y) {
      MPI_Aint bo = z * allocExt.y * allocExt.x + y * allocExt.x;
      // std::cout << bo << "\n";
      displacements[z * copyExt.y + y] = bo;
    }
  }

  MPI_Type_create_hindexed_block(count, blocklength, displacements, MPI_BYTE, &fullType);
  MPI_Type_commit(&fullType);
  return fullType;

kernel

The final comparison is a custom GPU kernel that can directly read from the source memory and compute the propery offsets in the device buffer, consistent with MPI_Pack.

__device__ void grid_pack(void *__restrict__ dst, const cudaPitchedPtr src,
                          const Dim3 srcPos,    // logical offset into the 3D region, in elements
                          const Dim3 srcExtent, // logical extent of the 3D region to pack, in elements
                          const size_t elemSize // size of the element in bytes
) {

  const char *__restrict__ sp = static_cast<char *>(src.ptr);

  const unsigned int tz = blockDim.z * blockIdx.z + threadIdx.z;
  const unsigned int ty = blockDim.y * blockIdx.y + threadIdx.y;
  const unsigned int tx = blockDim.x * blockIdx.x + threadIdx.x;

  for (unsigned int zo = tz; zo < srcExtent.z; zo += blockDim.z * gridDim.z) {
    unsigned int zi = zo + srcPos.z;
    for (unsigned int yo = ty; yo < srcExtent.y; yo += blockDim.y * gridDim.y) {
      unsigned int yi = yo + srcPos.y;
      for (unsigned int xo = tx; xo < srcExtent.x; xo += blockDim.x * gridDim.x) {
        unsigned int xi = xo + srcPos.x;

        // logical offset of packed output
        const size_t oi = zo * srcExtent.y * srcExtent.x + yo * srcExtent.x + xo;
        // printf("[xo, yo, zo]->oi = [%u, %u, %u]->%lu\n", xo, yo, zo, oi);
        // byte offset of input
        const size_t bi = zi * src.ysize * src.pitch + yi * src.pitch + xi * elemSize;
        // printf("[xi, yi, zi]->bi = [%u, %u, %u]->%lu\n", xi, yi, zi, bi);
        if (1 == elemSize) {
          char v = *reinterpret_cast<const char *>(sp + bi);
          reinterpret_cast<char *>(dst)[oi] = v;
        } else if (4 == elemSize) {
          uint32_t v = *reinterpret_cast<const uint32_t *>(sp + bi);
          reinterpret_cast<uint32_t *>(dst)[oi] = v;
        } else if (8 == elemSize) {
          uint64_t v = *reinterpret_cast<const uint64_t *>(sp + bi);
          reinterpret_cast<uint64_t *>(dst)[oi] = v;
        } else {
          char *pDst = reinterpret_cast<char *>(dst);
          memcpy(&pDst[oi * elemSize], sp + bi, elemSize);
        }
      }
    }
  }
}

Method

Each operation is executed 5 times, and the trimean is reported. For MPI_Pack, the measured time begins when MPI_Pack is called and ends after a subsequent cudaDeviceSynchronize. This is necessary as the CUDA-aware MPI_Pack may be asynchronous w.r.t. the CPU, as the implementation may insert future CUDA operations into the same stream to ensure their ordering. For the GPU kernel, the measured time begins at the kernel invocation and ends after a subsequent cudaDeviceSynchronize.

The following table describes the evaluation system

Component Value
OS Linux 5.8.0-2-amd64
CUDA 11.1.74
Nvidia Driver 455.23.05
gcc 10.2.0
CPU Ryzen 7 3700X
GPU Nvidia GTX 1070

The GPU is also driving 2 4k 60Hz displays used as the desktop during these tests.

The following MPI implementations were built with gcc and the following configuration options:

Implementation configure flags
OpenMPI 4.0.5 --with-cuda=/usr/local/cuda
mvapich 2.3.4 --enable-cuda --with-cuda=/usr/local/cuda --enable-fortran=no --disable-mcast
mpich 3.4a3 failed to configure

I was unable to compile mpich 3.4a3 with GPU support. Hopefully I can revisit that soon.

Results

In both results, the "configuration" refers to the X/Y/Z size of the region copied out of a 1024x1024x1024 byte allocation. The total packing size is kept constant at 1MiB.

First, the results for mvapich-2.3.4:

And for for openMPI-4.0.5:

Discussion

In most cases, the GPU kernel is 100-10,000x faster than OpenMPI and 30x-3000x faster than mvapich.

The two images below are taken from the Nsight Systems profiling tool. It creates a timeline on which GPU activity and CUDA calls are shown. For mvapich, each contiguous block is transferred using a call to cudaMemcpyAsync. This means that one cudaMemcpyAsync call is generated for each row. As we can see, that actuall call takes much more time than the corresponding GPU activity, causing the slow performance.

strided mvapich

The OpenMPI situation is much worse - for some reason, the CPU is forced to synchronize with the device after each copy, decreasing the performance even further.

strided openmpi

When the nature of the copy causes the source memory to be contiguous, there is a single cudaMemcpyAsync call, which is much faster than the GPU kernel. (5 copies are shown due to 5 measurement iterations). Here, we can also see that OpenMPI's unnecessary synchronizations degrade performance by causing the GPU to be idle.

contiguous mvapich

contiguous openmpi

The GPU kernel performance is limited by inefficient use of DRAM bandwidth. For example, when loading a single byte separated by hundreds of bytes, most of each 128B cache line is wasted. For the contiguous 1 MiB transfers, where the kernel is slower, performance is limited by each 32-thread warp collaboartively loading 32B.

Conclusion

Despite this being an open problem for years, it seems there is much work to do on fast handling of MPI datatypes on GPUs. Even relatively simple cases such as strided tensors behave very poortly.

MPI implementors should probably analyze the datatypes at MPI_Type_commit to determine if there is a high-performance way to handle pack and unpack. Future uses of that datatype should use the appropriate fast path, dropping back to cudaMemcpyAsync when no better option is available.