publications, experiment with experience list

This commit is contained in:
Carl Pearson
2021-01-27 18:02:45 -07:00
parent 3a685bf1a6
commit 74c5687a80
18 changed files with 95 additions and 578 deletions

View File

@@ -42,18 +42,12 @@ I also created a set of resources on using Nvidia's Nsight Compute and Nsight Systems
See the [Github repository](https://github.com/cwpearson/nvidia-performance-tools) to get started.
## Industory Experience
## Industry Experience
#### Treasurer, University YMCA
* August 2019 - April 2020
* Community member of the board of governors, serving as the chair of the budget committee, the Treasurer, and on the Bailey Scholarship steering committee.
<!-- [[experience]]
title = "Treasurer"
company = "University YMCA"
company_url = ""
location = "Urbana, IL"
date_start = "2019-08-01"
date_end = "2020-04-01"
description = """
Community member of the board of governors, serving as the chair of the budget committee, the Treasurer, and on the Bailey Scholarship steering committee.
"""
[[experience]]
title = "Research Intern"

View File

@@ -11,7 +11,7 @@ draft = false
**Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Wen-mei Hwu**
In *Proceedings of International Workshop on Manycore Embedded Systems.*
In *Proceedings of International Workshop on Manycore Embedded Systems, 2016*
Many-core accelerators, e.g. GPUs, are widely used for accelerating general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, cache hierarchy is introduced to GPU architecture to capture input data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy-efficiency. We propose an adaptive cache management policy specifically for many-core accelerators. The tag array of L2 cache is enhanced with extra bits to track memory access history, and thus the locality information is captured and provided to L1 cache as heuristics to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. As a result, the system performance is improved by 31% on average for cache sensitive benchmarks, compared to the baseline GPU architecture.
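The core mechanism, history bits in the L2 tag array steering L1 bypass and insertion, can be sketched in a few lines. This is an illustrative simulation of the idea, not the paper's exact policy; the counter width and decision rule here are assumptions.

```cpp
#include <cstdint>
#include <unordered_map>

// Per-line saturating reuse counters, as an L2 tag array extended with
// history bits might keep; they drive the L1 bypass decision for lines
// previously observed to be evicted without reuse.
struct ReuseHistory {
  std::unordered_map<uint64_t, uint8_t> counter;  // keyed by line address

  void record_hit(uint64_t line) {
    uint8_t &c = counter[line];
    if (c < 3) ++c;  // 2-bit saturating counter
  }
  void record_eviction_without_reuse(uint64_t line) {
    uint8_t &c = counter[line];
    if (c > 0) --c;
  }
  // Bypass L1 insertion only for lines already seen to go unused.
  bool should_bypass(uint64_t line) const {
    auto it = counter.find(line);
    return it != counter.end() && it->second == 0;
  }
};
```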

View File

@@ -2,7 +2,7 @@
draft = false
date = "2017-06-21"
title = "[CME] Scalable Parallel DBIM Solutions of Inverse-Scattering Problems"
title = "[CEM] Scalable Parallel DBIM Solutions of Inverse-Scattering Problems"
math = false
publication = "Computing and Electromagnetics International Workshop (CEM), 2017"

View File

@@ -1,67 +1,14 @@
+++
title = "Rebooting the Data Access Hierarchy of Computing Systems"
title = "[CEM] Rebooting the Data Access Hierarchy of Computing Systems"
date = 2017-11-18
draft = false
# Publication name and optional abbreviated version.
publication = "In *Computing and Electromagnetics International Workshop*."
publication_short = "In *CEM*"
# Is this a selected publication? (true/false)
selected = false
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = []
# Slides (optional).
# Associate this publication with Markdown slides.
# Simply enter your slide deck's filename without extension.
# E.g. `slides = "example-slides"` references
# `content/slides/example-slides.md`.
# Otherwise, set `slides = ""`.
slides = ""
# Tags (optional).
# Set `tags = []` for no tags, or use the form `tags = ["A Tag", "Another Tag"]` for one or more tags.
tags = []
# Links (optional).
url_pdf = "pdf/20170621_hwu_cem.pdf"
url_preprint = ""
url_code = ""
url_dataset = ""
url_project = ""
url_slides = ""
url_video = ""
url_poster = ""
url_source = ""
# Custom links (optional).
# Uncomment line below to enable. For multiple links, use the form `[{...}, {...}, {...}]`.
# url_custom = [{name = "Custom Link", url = "http://example.org"}]
# Digital Object Identifier (DOI)
doi = ""
# Does this page contain LaTeX math? (true/false)
math = false
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = "Image credit: [**Unsplash**](https://unsplash.com/photos/jdD8gXaTZsc)"
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
authors = ["Wen-mei Hwu", "Izzat El Hajj", "Simon Garcia de Gonzalo", "Carl Pearson", "Nam Sung Kim", "Deming Chen", "Jinjun Xiong", "Zehra Sura"]
**Wen-mei Hwu, Izzat El Hajj, Simon Garcia de Gonzalo, Carl Pearson, Nam Sung Kim, Deming Chen, Jinjun Xiong, Zehra Sura**
In *Computing and Electromagnetics International Workshop 2017*.
In this paper, we present our view of massively-parallel heterogeneous computing for solving large scientific problems. We start by observing that computing has been the primary driver of major innovations since the beginning of the 21st century. We argue that this is the fruit of decades of progress in computing methods, technology, and systems. A high-level analysis on out-scaling and up-scaling on large supercomputers is given through a time-domain wave-scattering simulation example. The importance of heterogeneous node architectures for good up-scaling is highlighted. A case for low-complexity algorithms is made for continued scale-out towards exascale systems.
* [pdf](/pdf/20170621_hwu_cem.pdf)

View File

@@ -1,67 +1,13 @@
+++
title = "A Fast and Massively-Parallel Solver for Multiple-Scattering Tomographic Image Reconstruction"
title = "[IPDPS] A Fast and Massively-Parallel Solver for Multiple-Scattering Tomographic Image Reconstruction"
date = 2018-05-21
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Mert Hidayetoglu", "Carl Pearson", "Izzat El Hajj", "Levent Gurel", "Weng Cho Chew", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference proceedings
# 2 = Journal
# 3 = Work in progress
# 4 = Technical report
# 5 = Book
# 6 = Book chapter
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "In *2018 IEEE International Parallel and Distributed Processing Symposium*"
publication_short = "In *IPDPS*"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = true
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = true
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = ["app_studies"]
# Links (optional)
url_pdf = "pdf/20180521_hidayetoglu_ipdps.pdf"
url_preprint = ""
url_code = ""
url_dataset = ""
url_project = ""
url_slides = ""
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Mert Hidayetoglu, Carl Pearson, Izzat El Hajj, Levent Gurel, Weng Cho Chew, Wen-Mei Hwu**
In *2018 IEEE International Parallel and Distributed Processing Symposium*
We present a massively-parallel solver for large Helmholtz-type inverse scattering problems. The solver employs the distorted Born iterative method for capturing the multiple-scattering phenomena in image reconstructions. This method requires many full-wave forward-scattering solutions in each iteration, constituting the main performance bottleneck with its high computational complexity. As a remedy, we use the multilevel fast multipole algorithm (MLFMA). The solver scales among computing nodes using a two-dimensional parallelization strategy that distributes illuminations in one dimension, and MLFMA sub-trees in the other dimension. Multi-core CPUs and GPUs are used to provide per-node speedup. We demonstrate a 76% efficiency when scaling from 64 GPUs to 4,096 GPUs. The paper provides reconstruction of a 204.8λ×204.8λ image (4M unknowns) executed on 4,096 GPUs in near-real time (almost 2 minutes). To the best of our knowledge, this is the largest full-wave inverse scattering solution to date, in terms of both image size and computational resources.
* [pdf](/pdf/20180521_hidayetoglu_ipdps.pdf)
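A minimal sketch of the two-dimensional parallelization strategy: ranks form a grid, with one communicator spanning illuminations and another spanning MLFMA sub-trees. The grid shape, and that it divides the rank count, are assumptions for illustration.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n_illum = 8;             // assumed illumination groups
  const int tree = rank / n_illum;   // which MLFMA sub-tree this rank holds
  const int illum = rank % n_illum;  // which illuminations this rank handles

  // Ranks holding the same sub-tree communicate across illuminations...
  MPI_Comm illum_comm;
  MPI_Comm_split(MPI_COMM_WORLD, tree, illum, &illum_comm);
  // ...and ranks handling the same illuminations communicate across sub-trees.
  MPI_Comm tree_comm;
  MPI_Comm_split(MPI_COMM_WORLD, illum, tree, &tree_comm);

  // Forward solves proceed independently per illumination, while MLFMA
  // tree traversals communicate over tree_comm.

  MPI_Comm_free(&illum_comm);
  MPI_Comm_free(&tree_comm);
  MPI_Finalize();
  return 0;
}
```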

View File

@@ -18,7 +18,7 @@ tags = ["scope"]
**Carl Pearson**
*M.S. Thesis*
*M.S. Thesis, May 2018*
With the end of Dennard scaling, high-performance computing increasingly relies on heterogeneous systems with specialized hardware to improve application performance. This trend has driven up the complexity of high-performance software development, as developers must manage multiple programming systems and develop system-tuned code to utilize specialized hardware. In addition, it has exacerbated existing challenges of data placement as the specialized hardware often has local memories to fuel its computational demands. In addition to using appropriate software resources to target application computation at the best hardware for the job, application developers now must manage data movement and placement within their application, which also must be specifically tuned to the target system. Instead of relying on the application developer to have specialized knowledge of system characteristics and specialized expertise in multiple programming systems, this work proposes a heterogeneous system communication library that automatically chooses data location and data movement for high-performance application development and execution on heterogeneous systems. This work presents the foundational components of that library: a systematic approach for characterization of system communication links and application communication demands.

View File

@@ -1,68 +1,15 @@
+++
title = "NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems"
title = "[IWOPH] NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems"
date = 2018-06-28
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Carl Pearson", "I-Hsin Chung", "Zehra Sura", "Jinjun Xiong", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "International Workshop on OpenPower in HPC"
publication_short = "IWOPH 2018"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = false
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = ["scope"]
# Links (optional)
url_pdf = "pdf/20180628-iwoph.pdf"
url_preprint = ""
url_code = ""
url_dataset = ""
url_project = ""
url_slides = "pdf/20180628-iwoph-slides.pdf"
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Carl Pearson, I-Hsin Chung, Zehra Sura, Jinjun Xiong, Wen-Mei Hwu**
In *International Workshop on OpenPower in HPC (IWOPH) 2018*
High-performance computing increasingly relies on heterogeneous systems with specialized hardware accelerators to improve application performance. For example, NVIDIA's CUDA programming system and general-purpose GPUs have emerged as a widespread accelerator in HPC systems. This trend has exacerbated challenges of data placement as accelerators often have fast local memories to fuel their computational demands, but slower interconnects to feed those memories. Crucially, real-world data-transfer performance is strongly influenced not just by the underlying hardware, but by the capabilities of the programming systems. Understanding how application performance is affected by the logical communication exposed through abstractions, as well as the underlying system topology, is crucial for developing high-performance applications and architectures. This report presents initial data-transfer microbenchmark results from two POWER-based systems obtained during work towards developing an automated system performance characterization tool.
* [pdf](/pdf/20180628-iwoph.pdf)
* [slides](/pdf/20180628-iwoph-slides.pdf)
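A stripped-down example of the kind of data-transfer measurement the report describes: timing a pinned host-to-device copy with CUDA events. A real run would also pin the CPU socket (e.g. with numactl) to expose the NUMA effects studied here; error checking is omitted.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const size_t bytes = 1ull << 28;  // 256 MiB
  void *host = nullptr, *dev = nullptr;
  cudaMallocHost(&host, bytes);     // pinned host allocation
  cudaMalloc(&dev, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // warm-up
  cudaEventRecord(start);
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("H2D: %.1f GB/s\n", bytes / ms / 1e6);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(dev);
  cudaFreeHost(host);
  return 0;
}
```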

View File

@@ -1,70 +1,16 @@
+++
title = "SCOPE: C3SR Systems Characterization and Benchmarking Framework"
title = "[tech report] SCOPE: C3SR Systems Characterization and Benchmarking Framework"
date = 2018-09-18
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Carl Pearson", "Abdul Dakkak", "Cheng Li", "Sarah Hashash", "Jinjun Xiong", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["4"]
# Publication name and optional abbreviated version.
publication = "arXiv preprint"
publication_short = "arXiv preprint"
# Abstract and optional shortened version.
abstract = "This report presents the design of the Scope infrastructure for extensible and portable benchmarking. Improvements in high-performance computing systems rely on coordination across different levels of system abstraction. Developing and defining accurate performance measurements is necessary at all levels of the system hierarchy, and should be as accessible as possible to developers with different backgrounds. The Scope project aims to lower the barrier to entry for developing performance benchmarks by providing a software architecture that allows benchmarks to be developed independently, by providing useful C/C++ abstractions and utilities, and by providing a Python package for generating publication-quality plots of resulting measurements."
abstract_short = ""
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = false
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = ["scope"]
# Links (optional)
url_pdf = "pdf/20180918_pearson_arxiv.pdf"
url_preprint = "https://arxiv.org/abs/1809.08311"
url_code = ""
url_dataset = ""
url_project = ""
url_slides = ""
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
tags = ["scope"]
+++
**Carl Pearson, Abdul Dakkak, Cheng Li, Sarah Hashash, Jinjun Xiong, Wen-Mei Hwu**
*arXiv preprint*
This report presents the design of the Scope infrastructure for extensible and portable benchmarking. Improvements in high-performance computing systems rely on coordination across different levels of system abstraction. Developing and defining accurate performance measurements is necessary at all levels of the system hierarchy, and should be as accessible as possible to developers with different backgrounds. The Scope project aims to lower the barrier to entry for developing performance benchmarks by providing a software architecture that allows benchmarks to be developed independently, by providing useful C/C++ abstractions and utilities, and by providing a Python package for generating publication-quality plots of resulting measurements.
* [pdf](/pdf/20180918_pearson_arxiv.pdf)
* [preprint](https://arxiv.org/abs/1809.08311)
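A sketch of the independent-benchmark registration style the report describes, written against the Google Benchmark API that SCOPE's C/C++ utilities build on (an assumption here; the benchmark itself is a made-up example, not one of SCOPE's).

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <vector>

static void BM_memcpy(benchmark::State &state) {
  const size_t n = static_cast<size_t>(state.range(0));
  std::vector<char> src(n, 1), dst(n);
  for (auto _ : state) {
    std::memcpy(dst.data(), src.data(), n);
    benchmark::DoNotOptimize(dst.data());
  }
  // Report throughput so plots can show bytes/sec across sizes.
  state.SetBytesProcessed(int64_t(state.iterations()) * int64_t(n));
}
// Each benchmark registers itself, so suites can be developed independently.
BENCHMARK(BM_memcpy)->Range(1 << 10, 1 << 26);
BENCHMARK_MAIN();
```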

View File

@@ -2,65 +2,14 @@
title = "Collaborative (CPU+ GPU) Algorithms for Triangle Counting and Truss Decomposition"
date = 2018-09-25
draft = false
tags = ["pangolin"]
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Vikram S. Mailthody", "Ketan Date", "Zaid Qureshi", "Carl Pearson", "Rakesh Nagi", "Jinjun Xiong", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "In *2018 IEEE High Performance extreme Computing Conference*"
publication_short = "In *HPEC*"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = true
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = ["graph_library"]
# Links (optional)
url_pdf = "pdf/20180925_mailthody_iwoph.pdf"
url_preprint = ""
url_code = ""
url_dataset = ""
url_project = ""
url_slides = ""
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Vikram S. Mailthody, Ketan Date, Zaid Qureshi, Carl Pearson, Rakesh Nagi, Jinjun Xiong, Wen-Mei Hwu**
In *2018 IEEE High Performance Extreme Computing Conference*
In this paper, we present an update to our previous submission from Graph Challenge 2017. This work describes and evaluates new software algorithm optimizations undertaken for our 2018 submission on Collaborative CPU+GPU Algorithms for Triangle Counting and Truss Decomposition. First, we describe four major optimizations for triangle counting which improved performance by up to 117x over our prior submission. Additionally, we show that our triangle-counting algorithm is on average 151.7x faster than NVIDIA's NVGraph library (max 476x) for SNAP datasets. Second, we propose a novel parallel k-truss decomposition algorithm that is time-efficient and is up to 13.9x faster than our previous submission. Third, we evaluate the effect of generational hardware improvements between the IBM “Minsky” (POWER8, P100, NVLink 1.0) and “Newell” (POWER9, V100, NVLink 2.0) platforms. Lastly, the software optimizations presented in this work and the hardware improvements in the Newell platform enable analytics and discovery on large graphs with millions of nodes and billions of edges in less than a minute. In sum, the new algorithmic implementations are significantly faster and can handle much larger “big” graphs.
* [pdf](/pdf/20180925_mailthody_iwoph.pdf)
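For reference, the counting core that such submissions optimize: for each edge (u, v), intersect the sorted adjacency lists of u and v. This is only the sequential skeleton; the paper's contribution is mapping it efficiently onto collaborating CPUs and GPUs.

```cpp
#include <cstdint>
#include <vector>

// CSR graph: row_ptr/col_idx, each row's neighbor list sorted ascending.
uint64_t count_triangles(const std::vector<int> &row_ptr,
                         const std::vector<int> &col_idx) {
  uint64_t count = 0;
  const int n = (int)row_ptr.size() - 1;
  for (int u = 0; u < n; ++u) {
    for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
      const int v = col_idx[e];
      if (v <= u) continue;                // visit each edge once as (u, v)
      int a = row_ptr[u], b = row_ptr[v];  // two-pointer merge intersection
      while (a < row_ptr[u + 1] && b < row_ptr[v + 1]) {
        if (col_idx[a] == col_idx[b]) { ++count; ++a; ++b; }
        else if (col_idx[a] < col_idx[b]) ++a;
        else ++b;
      }
    }
  }
  return count / 3;  // each triangle is discovered once per its three edges
}
```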

View File

@@ -9,7 +9,7 @@ tags = ["scope"]
**Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I-Hsin Chung, Jinjun Xiong, Wen-Mei Hwu**
*2019 ACM/SPEC International Conference on Performance Engineering*
In *2019 ACM/SPEC International Conference on Performance Engineering*
Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominating modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement.
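The transfer modalities named above, side by side in CUDA host code. Which is fastest is exactly what the paper argues cannot be assumed in advance; the size is a placeholder and error checking is omitted.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
  const size_t bytes = 1 << 20;
  float *d = nullptr, *m = nullptr, *z = nullptr;

  // 1. Explicit: host malloc plus a cudaMemcpy into device memory.
  float *h = (float *)malloc(bytes);
  cudaMalloc((void **)&d, bytes);
  cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

  // 2. Unified memory: one pointer usable on CPU and GPU; pages migrate
  // on demand under driver policy.
  cudaMallocManaged((void **)&m, bytes);

  // 3. Zero-copy: pinned host memory mapped into the GPU address space;
  // the GPU reads it over the interconnect instead of copying it.
  cudaHostAlloc((void **)&z, bytes, cudaHostAllocMapped);

  cudaFree(d);
  cudaFree(m);
  cudaFreeHost(z);
  free(h);
  return 0;
}
```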

View File

@@ -1,71 +1,18 @@
+++
title = "Update on k-truss Decomposition on GPU"
date = 2019-08-22T00:00:00 # Schedule page publish date.
title = "[HPEC] Update on k-truss Decomposition on GPU"
date = 2019-08-22 # Schedule page publish date.
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Mohammad Almasri", "Omer Anjum", "Carl Pearson", "Vikram S. Mailthody", "Zaid Qureshi", "Rakesh Nagi", "Jinjun Xiong", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "2019 IEEE High Performance Extreme Computing Conference"
publication_short = "In *HPEC'19*"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = false
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = []
# Links (optional)
url_pdf = "pdf/2019_almasri_hpec.pdf"
url_preprint = ""
url_code = ""
url_dataset = ""
url_project = ""
url_slides = "pdf/2019_almasri_hpec_slides.pdf"
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Mohammad Almasri, Omer Anjum, Carl Pearson, Vikram S. Mailthody, Zaid Qureshi, Rakesh Nagi, Jinjun Xiong, Wen-Mei Hwu**
In *2019 IEEE High Performance Extreme Computing Conference*
In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge 2018.
For the single-GPU k-truss implementation, we propose multiple algorithmic optimizations that significantly improve performance by up to 35.2x (6.9x on average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each GPU handles a different 'k' value.
Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x (78.8x on average). In the case when only the edges with the maximal k-truss are sought, incrementing the 'k' value in each iteration is inefficient, particularly for graphs with a large maximum k-truss.
Thus, we propose a binary search for the 'k' value to find the maximal k-truss. The binary search approach on a single GPU is up to 101.5x (24.3x on average) faster than our 2018 k-truss submission.
Lastly, we show that the proposed binary search finds the maximum k-truss for the "Twitter" graph dataset, which has 2.8 billion bidirectional edges, in just 16 minutes on a single V100 GPU.
* [pdf](/pdf/2019_almasri_hpec.pdf)
* [slides](/pdf/2019_almasri_hpec_slides.pdf)
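The binary-search idea in isolation: given a predicate that reports whether a k-truss is non-empty (the expensive GPU computation), the maximal k is found in logarithmically many probes instead of one increment per k. The predicate is a stand-in, not the paper's kernel.

```cpp
#include <functional>

// truss_nonempty(k) stands in for the expensive GPU k-truss computation.
// Precondition: truss_nonempty(k_lo) holds and k_hi bounds the answer.
int max_ktruss(int k_lo, int k_hi,
               const std::function<bool(int)> &truss_nonempty) {
  while (k_lo < k_hi) {
    const int k = k_lo + (k_hi - k_lo + 1) / 2;  // bias up for progress
    if (truss_nonempty(k))
      k_lo = k;      // a k-truss exists: the maximum is at least k
    else
      k_hi = k - 1;  // no k-truss: the maximum is below k
  }
  return k_lo;
}
```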

View File

@@ -1,66 +1,13 @@
+++
title = "Accelerating Sparse Deep Neural Networks on FPGAs"
title = "[HPEC] Accelerating Sparse Deep Neural Networks on FPGAs"
date = 2019-09-26T00:00:00 # Schedule page publish date.
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Sitao Huang", "Carl Pearson", "Rakesh Nagi", "Jinjun Xiong", "Deming Chen", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "2019 IEEE High Performance Extreme Computing Conference"
publication_short = "In *HPEC'19*"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = true
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = []
# Links (optional)
url_pdf = "pdf/2019_huang_hpec.pdf"
url_preprint = ""
url_code = ""
url_dataset = ""
url_project = ""
url_slides = ""
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Sitao Huang, Carl Pearson, Rakesh Nagi, Jinjun Xiong, Deming Chen, Wen-Mei Hwu**
In *2019 IEEE High Performance Extreme Computing Conference*
Deep neural networks (DNNs) have been widely adopted in many domains, including computer vision, natural language processing, and medical care. Recent research reveals that sparsity in DNN parameters can be exploited to reduce inference computational complexity and improve network quality. However, sparsity also introduces irregularity and extra complexity in data processing, which make the accelerator design challenging. This work presents the design and implementation of a highly flexible sparse DNN inference accelerator on FPGA. Our proposed inference engine can be easily configured to be used in both mobile computing and high-performance computing scenarios. Evaluation shows our proposed inference engine effectively accelerates sparse DNNs and outperforms a CPU solution by up to 4.7x in terms of energy efficiency.
* [pdf](/pdf/2019_huang_hpec.pdf)
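As reference semantics for what the accelerator computes per layer: y = ReLU(Wx + b) with the sparse weights W stored in compressed sparse row (CSR) form. A host-side sketch only; the paper's engine implements this dataflow in FPGA hardware, and the bias/activation details here are simplifications.

```cpp
#include <algorithm>
#include <vector>

// W in CSR form (row_ptr, col, val); bias applied per output, then ReLU.
std::vector<float> sparse_layer(const std::vector<int> &row_ptr,
                                const std::vector<int> &col,
                                const std::vector<float> &val,
                                const std::vector<float> &x, float bias) {
  const int rows = (int)row_ptr.size() - 1;
  std::vector<float> y(rows);
  for (int r = 0; r < rows; ++r) {
    float acc = bias;
    for (int e = row_ptr[r]; e < row_ptr[r + 1]; ++e)
      acc += val[e] * x[col[e]];  // only stored (nonzero) weights touched
    y[r] = std::max(0.0f, acc);   // ReLU
  }
  return y;
}
```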

View File

@@ -3,9 +3,6 @@ title = "[HPEC] Update on Triangle Counting on GPU"
date = 2019-08-22T00:00:00 # Schedule page publish date.
draft = false
tags = ["applications"]
+++

View File

@@ -1,73 +1,24 @@
+++
title = "Node-Aware Stencil Communication on Heterogeneous Supercomputers"
date = 2020-03-09T00:00:00 # Schedule page publish date.
title = "[iWAPT] Node-Aware Stencil Communication on Heterogeneous Supercomputers"
date = 2020-03-09 # Schedule page publish date.
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Carl Pearson", "Mert Hidayetoglu", "Mohammad Almasri", "Omer Anjum", "I-Hsin Chung", "Jinjun Xiong", "Wen-Mei Hwu"]
projects = ["stencil"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "2020 IEEE International Workshop on Automatic Performance Tuning"
publication_short = "In *iWAPT'20*"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = true
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = ["stencil_library"]
# Links (optional)
url_pdf = "pdf/20200522_pearson_iwapt.pdf"
url_preprint = ""
url_code = "https://github.com/cwpearson/stencil"
url_dataset = ""
url_project = ""
url_slides = "pdf/20200522_pearson_iwapt_slides.pdf"
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Carl Pearson, Mert Hidayetoglu, Mohammad Almasri, Omer Anjum, I-Hsin Chung, Jinjun Xiong, Wen-Mei Hwu**
In *2020 IEEE International Workshop on Automatic Performance Tuning (iWAPT)*
High-performance distributed computing systems increasingly feature nodes that have multiple CPU sockets and multiple GPUs.
The communication bandwidth between these components is non-uniform.
Furthermore, these systems can expose different communication capabilities between these components.
For communication-heavy applications, optimally using these capabilities is challenging and essential for performance.
Bespoke codes with optimized communication may be non-portable across run-time/software/hardware configurations, and existing stencil frameworks neglect optimized communication.
This work presents node-aware approaches for automatic data placement and communication implementation for 3D stencil codes on multi-GPU nodes with non-homogeneous communication performance and capabilities.
Benchmarking results on the Summit system show that placement choices can yield a 20% improvement in single-node exchange, and that communication specialization can yield a further 6x improvement in single-node exchange time and a 16% improvement at 1536 GPUs.
* [pdf](/pdf/20200522_pearson_iwapt.pdf)
* [code](https://github.com/cwpearson/stencil)
* [slides](/pdf/20200522_pearson_iwapt_slides.pdf)
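The node-aware starting point can be sketched with standard MPI: discover which ranks share a node, then route exchanges between co-located GPUs through a faster intra-node path. The peer-copy comment is illustrative; the library specializes more cases than this.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  MPI_Comm node_comm;
  // Group ranks that can share memory, i.e. ranks on the same node.
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);
  // node_rank/node_size identify each rank's intra-node neighbors, so a
  // halo exchange can use a peer copy (e.g. cudaMemcpyPeerAsync) within
  // the node and MPI_Isend/MPI_Irecv across nodes.
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```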

View File

@@ -1,69 +1,13 @@
+++
title = "At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation"
date = 2020-09-23T00:00:00 # Schedule page publish date.
title = "[HPEC] At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation"
date = 2020-09-23 # Schedule page publish date.
draft = false
# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Mert Hidayetoglu", "Carl Pearson", "Vikram Sharma Mailthody", "Eiman Ebrahimi", "Jinjun Xiong", "Rakesh Nagi", "Wen-Mei Hwu"]
# Publication type.
# Legend:
# 0 = Uncategorized
# 1 = Conference paper
# 2 = Journal article
# 3 = Manuscript
# 4 = Report
# 5 = Book
# 6 = Book section
publication_types = ["1"]
# Publication name and optional abbreviated version.
publication = "2020 IEEE High Performance Extreme Compute Conference"
publication_short = "In *HPEC'20*"
# Does this page contain LaTeX math? (true/false)
math = false
# Does this page require source code highlighting? (true/false)
highlight = false
# Featured image thumbnail (optional)
image_preview = ""
# Is this a selected publication? (true/false)
selected = true
# Projects (optional).
# Associate this publication with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["deep-learning"]` references
# `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects = [""]
# Links (optional)
url_pdf = "pdf/20200923_hidayetoglu_hpec.pdf"
url_preprint = ""
url_code = "https://github.com/merthidayetoglu/sparse-DNN"
url_dataset = ""
url_project = ""
url_slides = "pdf/20200923_hidayetoglu_hpec_slides.pdf"
url_video = ""
url_poster = ""
url_source = ""
# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
[image]
# Caption (optional)
caption = ""
# Focal point (optional)
# Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
focal_point = ""
+++
**Mert Hidayetoglu, Carl Pearson, Vikram Sharma Mailthody, Eiman Ebrahimi, Jinjun Xiong, Rakesh Nagi, Wen-Mei Hwu**
In *2020 IEEE High Performance Extreme Computing Conference*
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020.
Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many neural networks beyond the capacity of available accelerators.
Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks.
@@ -73,4 +17,8 @@ The optimized kernels reuse input feature maps from the shared memory and sparse
For multi-GPU parallelism, our SpDNN implementation duplicates weights and statically partitions the feature maps across GPUs.
Results for the challenge benchmarks show that the proposed kernel design and multi-GPU parallelization achieve up to 180 TeraEdges per second inference throughput.
These results are up to 4.3x faster for a single GPU and an order of magnitude faster at full scale than those of the champion of the 2019 Sparse Deep Neural Network Graph Challenge for the same generation of NVIDIA V100 GPUs.
Using the same implementation1, we also show single-GPU throughput on NVIDIA A100 is 2.37x fasterthan V100
Using the same implementation, we also show single-GPU throughput on NVIDIA A100 is 2.37x faster than V100.
* [pdf](/pdf/20200923_hidayetoglu_hpec.pdf)
* [code](https://github.com/merthidayetoglu/sparse-DNN)
* [slides](/pdf/20200923_hidayetoglu_hpec_slides.pdf)
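A sketch of the multi-GPU strategy described above: weights are replicated on every GPU and the batch of feature maps is split statically, so inference proceeds with no inter-GPU communication until the final gather. The function and names are illustrative; kernel launches and error checking are omitted.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

void scatter_batch(int n_gpus, size_t batch, size_t bytes_per_sample,
                   const char *input) {
  const size_t per_gpu = (batch + n_gpus - 1) / n_gpus;  // static split
  for (int g = 0; g < n_gpus; ++g) {
    const size_t begin = g * per_gpu;
    if (begin >= batch) break;
    const size_t count = std::min(per_gpu, batch - begin);
    cudaSetDevice(g);
    char *d = nullptr;
    cudaMalloc((void **)&d, count * bytes_per_sample);
    cudaMemcpyAsync(d, input + begin * bytes_per_sample,
                    count * bytes_per_sample, cudaMemcpyHostToDevice);
    // Weights were already replicated to every device; launch the SpDNN
    // layer kernels on this slice with no inter-GPU traffic.
  }
}
```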

View File

@@ -1,10 +1,8 @@
+++
title = "[preprint] Fast CUDA-Aware MPI Datatypes without Platform Support (preprint)"
date = 2020-01-03T00:00:00 # Schedule page publish date.
date = 2021-01-21T00:00:00 # Schedule page publish date.
draft = false
# Does this page contain LaTeX math? (true/false)
math = false
tags = ["stencil, mpi"]
@@ -16,5 +14,5 @@ tags = ["stencil", "mpi"]
MPI Derived Datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. These implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to offer cripplingly poor GPU performance when manipulating derived datatypes on GPUs. This work presents an approach to integrating fast derived datatype handling into existing MPI deployments through an interposed library. This library can be used regardless of MPI deployment and without modifying application code. Furthermore, this work presents a performance model of GPU derived datatype handling, demonstrating that "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 724,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 20,000x in a 3D halo exchange.
* [pdf](/pdf/20201229_pearson_arxiv.pdf)
* [pdf](/pdf/20210121_pearson_arxiv.pdf)
* [code](https://github.com/cwpearson/tempi)
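A sketch of how an interposed library like this slots in without platform support: it exports the MPI symbol itself, handles the GPU fast path, and forwards everything else to the underlying implementation, here through the standard PMPI profiling interface (an assumed mechanism; the fast-path test is a placeholder). Loading via LD_PRELOAD, or linking the library first, requires no application or MPI changes.

```cpp
#include <mpi.h>

// Placeholder: detect datatypes worth the specialized GPU packing path.
static bool gpu_fast_path(MPI_Datatype datatype) {
  (void)datatype;
  return false;
}

// Exported with the same symbol as the MPI function, this definition
// shadows the implementation's; unhandled cases forward through PMPI.
extern "C" int MPI_Pack(const void *inbuf, int incount, MPI_Datatype datatype,
                        void *outbuf, int outsize, int *position,
                        MPI_Comm comm) {
  if (gpu_fast_path(datatype)) {
    // ...pack with a specialized GPU kernel and update *position...
  }
  return PMPI_Pack(inbuf, incount, datatype, outbuf, outsize, position, comm);
}
```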