diff --git a/content/_index.md b/content/_index.md
index fd01abc..592f4df 100644
--- a/content/_index.md
+++ b/content/_index.md
@@ -1,5 +1,6 @@
-Carl Pearson is a Ph.D candidate in the Electrical and Computer Engineering department at the University of Illinois at Urbana-Champaign and a member of the [IMPACT Research Group](http://impact.crhc.illinois.edu/) led by Wen-Mei Hwu.
+Carl Pearson is a Postdoctoral Appointee at Sandia National Labs.
+He works on GPU communication for distributed linear algebra and the acceleration of sparse matrix multiplication.
 
-He works on multi-GPU communication and scaling as part of the joint UIUC / IBM C3SR cognitive computing systems research center. The focus of these activities is to apply tools and techniques developed in the IMPACT group to improve the performance of real-world applications.
+He received his Ph.D. in Electrical and Computer Engineering from the University of Illinois in 2021, and his B.S. in Engineering from Harvey Mudd College.
 
 ---
\ No newline at end of file
diff --git a/content/publication/20210121_pearson_arxiv/index.md b/content/publication/20210121_pearson_arxiv/index.md
deleted file mode 100644
index dabeae0..0000000
--- a/content/publication/20210121_pearson_arxiv/index.md
+++ /dev/null
@@ -1,19 +0,0 @@
-+++
-title = "[preprint] TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes"
-date = 2021-01-21T00:00:00 # Schedule page publish date.
-draft = false
-
-math = false
-
-tags = ["stencil", "mpi"]
-+++
-
-**Carl Pearson, Kun Wu, I-Hsin Chung, Jinjun Xiong, Wen-Mei Hwu**
-
-*arxiv preprint*
-
-MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to offer cripplingly poor GPU performance when manipulating derived datatypes on GPUs. This work presents a new MPI library, TEMPI, to address this issue. TEMPI first introduces a common datatype to represent equivalent MPI derived datatypes. TEMPI can be used as an interposed library on existing MPI deployments without system or application changes. Furthermore, this work presents a performance model of GPU derived datatype handling, demonstrating that previously preferred "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 1000x in a 3D halo exchange at 192 ranks.
-
-* [pdf](/pdf/20210121_pearson_arxiv.pdf)
-* [github](https://github.com/cwpearson/tempi)
-* [arxiv](https://arxiv.org/abs/2012.14363)
\ No newline at end of file
diff --git a/content/publication/20210420_pearson_phd/index.md b/content/publication/20210420_pearson_phd/index.md
new file mode 100644
index 0000000..15ef20b
--- /dev/null
+++ b/content/publication/20210420_pearson_phd/index.md
@@ -0,0 +1,32 @@
++++
+title = "[Ph.D. Dissertation] Movement and Placement of Non-Contiguous Data In Distributed GPU Computing"
+date = 2021-04-20T00:00:00 # Schedule page publish date.
+draft = false
+
+math = false
+
+tags = ["stencil", "mpi"]
++++
+
+**Carl Pearson**
+
+*Ph.D. Dissertation*
+
+A steady increase in accelerator performance has driven demand for faster interconnects to avert the memory bandwidth wall.
+This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer.
+Data transfer performance on these systems is now impacted by many factors, including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement.
+
+This work finds that empirical communication measurements can be used to automatically schedule and execute intra- and inter-node communication in a modern heterogeneous system, providing "hand-tuned" performance without the need for complex or error-prone communication development at the application level.
+Empirical measurements are provided by a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios.
+These benchmarks are the first comprehensive evaluation of all GPU communication primitives.
+For communication-heavy applications, optimally using communication capabilities is challenging and essential for performance.
+Two different approaches are examined.
+The first is a high-level 3D stencil communication library, which can automatically create a static communication plan based on the stencil and system parameters.
+This library is able to reduce the iteration time of a state-of-the-art stencil code by 1.45x at 3072 GPUs and 512 nodes.
+The second is a more general MPI interposer library, with novel non-contiguous data handling and runtime implementation selection for MPI communication primitives.
+A portable pure-MPI halo exchange is brought to within half the speed of the stencil-specific library, supported by a five-order-of-magnitude improvement in MPI communication latency for non-contiguous data.
+
+* [pdf](/pdf/20210420_pearson_phd.pdf)
+* [Comm|Scope (github)](https://github.com/c3sr/comm_scope)
+* [Stencil (github)](https://github.com/cwpearson/stencil)
+* [TEMPI (github)](https://github.com/cwpearson/tempi)
\ No newline at end of file
diff --git a/content/publication/20210621_pearson_hpdc/index.md b/content/publication/20210621_pearson_hpdc/index.md
new file mode 100644
index 0000000..9451dce
--- /dev/null
+++ b/content/publication/20210621_pearson_hpdc/index.md
@@ -0,0 +1,19 @@
++++
+title = "[HPDC] TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes"
+date = 2021-04-23T00:00:00 # Schedule page publish date.
+draft = false
+
+math = false
+
+tags = ["stencil", "mpi"]
++++
+
+**Carl Pearson, Kun Wu, I-Hsin Chung, Jinjun Xiong, Wen-Mei Hwu**
+
+To be presented June 21-25 at the *2021 ACM Symposium on High-Performance Parallel and Distributed Computing*
+
+MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized and generic handling in prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements, and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI. TEMPI can be used with existing MPI deployments without system or application changes. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 917x in a 3D halo exchange with 3072 processes.
+
+* [pdf](/pdf/20210621_pearson_hpdc.pdf)
+* [github](https://github.com/cwpearson/tempi)
+* [arxiv](https://arxiv.org/abs/2012.14363)
\ No newline at end of file
diff --git a/static/pdf/20210420_pearson_phd.pdf b/static/pdf/20210420_pearson_phd.pdf
new file mode 100644
index 0000000..23631b0
Binary files /dev/null and b/static/pdf/20210420_pearson_phd.pdf differ
diff --git a/static/pdf/20210121_pearson_arxiv.pdf b/static/pdf/20210621_pearson_hpdc.pdf
similarity index 66%
rename from static/pdf/20210121_pearson_arxiv.pdf
rename to static/pdf/20210621_pearson_hpdc.pdf
index 5ae3749..8dd4222 100644
Binary files a/static/pdf/20210121_pearson_arxiv.pdf and b/static/pdf/20210621_pearson_hpdc.pdf differ
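
The abstracts in this diff center on MPI derived datatypes that describe non-contiguous GPU data, and on the cost of MPI_Pack and MPI_Send over such datatypes. As a point of reference, the sketch below shows what that call pattern looks like in application code: a strided face of a GPU-resident grid is described with MPI_Type_vector, packed into a contiguous device buffer, and sent as MPI_PACKED bytes. This is only an illustration, not code from TEMPI or the dissertation; the grid extents, neighbor ranks, and tag are invented, and it assumes a CUDA-aware MPI that accepts device pointers in MPI_Pack and MPI_Send.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Assumed local grid extent; values are arbitrary for the sketch. */
  const int nx = 256, ny = 256, nz = 256;
  double *grid = NULL;
  cudaMalloc((void **)&grid, sizeof(double) * nx * ny * nz);

  /* One x == 0 face of the grid: ny*nz doubles, each nx doubles apart.
     This is a simple case of the non-contiguous data the abstracts discuss;
     real halos nest several strided types like this one. */
  MPI_Datatype face;
  MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, &face);
  MPI_Type_commit(&face);

  /* Pack the face into a contiguous device buffer, then send the packed
     bytes. MPI_Pack/MPI_Send on device pointers is the path whose
     performance the papers measure. */
  int packed_bytes = 0;
  MPI_Pack_size(1, face, MPI_COMM_WORLD, &packed_bytes);
  char *staging = NULL;
  cudaMalloc((void **)&staging, (size_t)packed_bytes);

  int position = 0;
  MPI_Pack(grid, 1, face, staging, packed_bytes, &position, MPI_COMM_WORLD);

  if (size >= 2) {
    if (rank == 0) {
      MPI_Send(staging, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      MPI_Recv(staging, packed_bytes, MPI_PACKED, 0, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      /* A real halo exchange would MPI_Unpack into its own ghost region. */
    }
  }

  MPI_Type_free(&face);
  cudaFree(staging);
  cudaFree(grid);
  MPI_Finalize();
  return 0;
}
```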
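Both TEMPI abstracts also describe it as an interposed library that sits between the application and the deployed MPI without system or application changes. The diff does not spell out the mechanism, but the standard way a library interposes on MPI calls is the MPI profiling interface: the library exports its own MPI_* symbols and forwards whatever it does not handle to the underlying implementation through the PMPI_* entry points. The sketch below illustrates only that general pattern; `use_custom_gpu_path` is an invented placeholder, not a TEMPI function, and TEMPI's actual dispatch is documented in the paper and repository linked above.

```c
#include <mpi.h>

/* Invented placeholder: decide whether this send should take a custom
   GPU-aware path (for example, a faster pack of a non-contiguous datatype)
   instead of going straight to the underlying MPI. */
static int use_custom_gpu_path(const void *buf, MPI_Datatype datatype) {
  (void)buf;
  (void)datatype;
  return 0; /* always fall through in this sketch */
}

/* Built as a shared library and loaded ahead of the MPI implementation
   (for example via LD_PRELOAD), this definition of MPI_Send shadows the
   implementation's own. Work it does not intercept is forwarded through
   the profiling entry point PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm) {
  if (use_custom_gpu_path(buf, datatype)) {
    /* ...pack the non-contiguous data on the GPU with a faster strategy,
       then send the packed bytes through PMPI_Send... */
  }
  return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```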