final revisions

Carl Pearson
2017-05-18 08:33:56 -07:00
parent f0ac93a60a
commit 91069b54fa


@@ -16,15 +16,14 @@
 \usepackage{verbatim}
-\title{Evaluating MLFMM Performance for 2-D VIE Problems on Multiple Architectures}
+\title{Evaluating MLFMM Performance for 2-D VIE Problems on Multiple-GPU Architectures}
 \author{
 {Carl Pearson{\small $^{1}$}, Mert Hidayeto\u{g}lu{\small $^{1}$}, Wei Ren{\small $^{2}$}, Levent G\"{u}rel{\small $^{1}$}, and Wen-Mei Hwu{\small $^{1}$} }
 \vspace{1.6mm}\\
 \fontsize{10}{10}\selectfont\itshape
 $~^{1}$Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA\\
 $~^{2}$Department of Physics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA\\
-%\fontsize{9}{9}\upshape \texttt{\{pearson, hidayet2, weiren2, lgurel, w-hwu\}}@illinois.edu}
-\fontsize{9}{9}\upshape pearson@illinois.edu}
+\fontsize{9}{9}\upshape \{pearson, hidayet2, weiren2, lgurel, w-hwu\}@illinois.edu}
 \begin{document}
 \maketitle
@@ -42,13 +41,13 @@ The MLFMM is evaluated on current- and next-generation GPU-accelerated supercomp
-MLFMM computes pairwise interactions between pixels in the scattering problem by hierarchically clustering pixels into a spatial quad-tree. In the nearfield phase, nearby pixel interactions are computed within the lowest level of the MLFMM tree. The aggregation and disaggregation phases propagate interactions up and down the tree, and the translation phase propagates long-range interactions within a level. In this way, $\mathcal{O}(N)$ work for $N^2$ interactions is achieved for $N$ pixels~\cite{chew01}.
+The multilevel fast multipole method (MLFMM) computes pairwise interactions between pixels in the scattering problem by hierarchically clustering pixels into a spatial quad-tree. In the nearfield phase, nearby pixel interactions are computed within the lowest level of the MLFMM tree. The aggregation and disaggregation phases propagate interactions up and down the tree, and the translation phase propagates long-range interactions within each level. In this way, $\mathcal{O}(N)$ work for $N^2$ interactions is achieved for $N$ pixels~\cite{chew01}.
 Even with algorithmic speedup, a high-performance parallel MLFMM is needed to take advantage of high-performance computing resources.
 This work presents how a GPU-accelerated MLFMM effectively scales from current to next-generation computers.
 In order to achieve an efficient implementation on graphics processing units (GPUs), these four MLFMM phases are formulated as matrix multiplications.
 Common operators are pre-computed, moved to the GPU, and reused as needed to avoid host-device data transfer.
-The MLFMM tree structure is partitioned among message passing interface (MPI) processes where each process employs a single GPU for performing partial multiplications.
+The MLFMM tree structure is partitioned among message passing interface (MPI) processes, where each process employs a single GPU for performing partial multiplications.
 During the MLFMM multiplications, data is transferred between GPUs through their owning MPI processes by moving the data from GPUs to central processing units (CPUs), CPUs to CPUs through MPI, and then from CPUs to GPUs.
 To hide this communication cost, MPI communication is overlapped with GPU kernels.
 This strategy completely hides the communication cost and provides 96\% MPI parallelization efficiency on up to 16 GPUs.
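The overlap strategy described in this hunk can be illustrated with a minimal sketch (it is not the paper's implementation): the staged GPU-to-CPU, CPU-to-CPU (MPI), and CPU-to-GPU transfers run on one CUDA stream while independent MLFMM work proceeds on another. The kernel name translate_local, the buffer names, the one-GPU-per-rank layout, and the use of pinned host staging buffers are all assumptions made for this example.

// Minimal sketch only (not the paper's code): hypothetical buffer and kernel names,
// assuming one GPU per MPI rank and pinned (cudaHostAlloc) host staging buffers.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Hypothetical stand-in for the per-level MLFMM work that does not depend on the halo data.
__global__ void translate_local(cuFloatComplex *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = cuCmulf(x[i], make_cuFloatComplex(2.0f, 0.0f));
}

void exchange_and_compute(cuFloatComplex *d_halo, cuFloatComplex *h_send,
                          cuFloatComplex *h_recv, cuFloatComplex *d_local,
                          int nHalo, int nLocal, int peer, MPI_Comm comm) {
  cudaStream_t comm_stream, comp_stream;
  cudaStreamCreate(&comm_stream);
  cudaStreamCreate(&comp_stream);

  // Stage outgoing boundary data: GPU -> CPU on the communication stream.
  cudaMemcpyAsync(h_send, d_halo, nHalo * sizeof(cuFloatComplex),
                  cudaMemcpyDeviceToHost, comm_stream);

  // Launch the local (interior) work on a separate stream so it overlaps
  // with the staging copy and the MPI exchange below.
  translate_local<<<(nLocal + 255) / 256, 256, 0, comp_stream>>>(d_local, nLocal);

  // CPU <-> CPU exchange with the neighboring rank once staging has finished.
  cudaStreamSynchronize(comm_stream);
  MPI_Request reqs[2];
  MPI_Irecv(h_recv, 2 * nHalo, MPI_FLOAT, peer, 0, comm, &reqs[0]);
  MPI_Isend(h_send, 2 * nHalo, MPI_FLOAT, peer, 0, comm, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  // CPU -> GPU for the received halo, then wait for both streams.
  cudaMemcpyAsync(d_halo, h_recv, nHalo * sizeof(cuFloatComplex),
                  cudaMemcpyHostToDevice, comm_stream);
  cudaStreamSynchronize(comm_stream);
  cudaStreamSynchronize(comp_stream);
  cudaStreamDestroy(comm_stream);
  cudaStreamDestroy(comp_stream);
}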
@@ -100,7 +99,7 @@ It also has four NVIDIA P100 GPUs with 16 GB of RAM each.
 \caption{
 Amount of application time spent in MLFMM for a 32-thread CPU run on an XE node (left) and a 160-thread run on S822LC (right).
 MLFMM is the dominant application component even with CPU parallelization.
-As the number of pixels grow larger, MLFMM time further increases as a proportion of application time.
+As the number of pixels grows larger, MLFMM time further increases as a proportion of application time.
 }
 \label{fig:app_breakdown}
 \end{figure}
@@ -160,7 +159,7 @@ A 16-GPU MPI execution is not shown, as only one S822LC was available for evalua
 Both XE and S822LC achieve more CPU speedup than they have floating-point units (17x with 32 threads on 16 units for XE, 26x with 160 threads on 20 units for S822LC).
 When floating-point units are oversubscribed, they are more fully utilized.
-The CUDA implementations leverage well-understood techniques for optimizing matrix operations, including hybrid shared-memory and register tiling, and thread coarsening\cite{hwu11}
+The CUDA implementations leverage well-understood techniques for optimizing matrix operations, including hybrid shared-memory and register tiling, and thread coarsening~\cite{hwu11}.
 In both systems, using a GPU for MLFMM provides substantial speedup (additional 3.1x on XE/XK, 9.2x on S822LC) over fully utilizing the CPUs.
 This speedup justifies the considerable time invested in a CUDA implementation.
 Furthermore, nearly linear scaling when using multiple GPUs is also achieved thanks to overlapping all required MPI communication with GPU computation.
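The tiling and coarsening techniques cited here are standard; a minimal CUDA sketch of a shared-memory-tiled multiplication with register accumulation and thread coarsening, not the authors' kernels, might look as follows. It assumes square, row-major, single-precision matrices whose dimension is a multiple of the tile width, and the kernel and macro names are illustrative only.

// Illustrative sketch only (not the authors' kernels): shared-memory-tiled,
// thread-coarsened C = A * B. Each thread keeps COARSEN partial sums in
// registers and produces COARSEN output columns, so fewer threads do more work.
#define TILE 16
#define COARSEN 4

__global__ void matmul_tiled_coarsened(const float *A, const float *B, float *C, int N) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE * COARSEN];

  int row = blockIdx.y * TILE + threadIdx.y;
  int colBase = blockIdx.x * TILE * COARSEN + threadIdx.x;

  float acc[COARSEN] = {0.0f};  // register tile of partial sums

  for (int t = 0; t < N / TILE; ++t) {
    // Each thread loads one element of the A tile and COARSEN elements of the B tile.
    As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
    for (int c = 0; c < COARSEN; ++c)
      Bs[threadIdx.y][threadIdx.x + c * TILE] =
          B[(t * TILE + threadIdx.y) * N + colBase + c * TILE];
    __syncthreads();

    // Multiply the shared tiles, accumulating in registers.
    for (int k = 0; k < TILE; ++k) {
      float a = As[threadIdx.y][k];
      for (int c = 0; c < COARSEN; ++c)
        acc[c] += a * Bs[k][threadIdx.x + c * TILE];
    }
    __syncthreads();
  }

  for (int c = 0; c < COARSEN; ++c)
    C[row * N + colBase + c * TILE] = acc[c];
}

// Launch sketch: dim3 block(TILE, TILE); dim3 grid(N / (TILE * COARSEN), N / TILE);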
@@ -186,20 +185,20 @@ On modern GPUs, this speedup justifies the significant CUDA time investment.
 \section*{Acknowledgment}
+This work was supported by the NVIDIA GPU Center of Excellence, the NCSA Petascale Improvement Discovery Program, and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR).
 \bibliographystyle{IEEEtran}
 \begin{thebibliography}{99}
 \bibitem{chew01}
 W. C. Chew, et al.,
-\textit{Fast and efficient algorithms in computational electromagnetics}
-Artech House, Inc.,
+\textit{Fast and efficient algorithms in computational electromagnetics}.
+Artech House,
 2001
 \bibitem{hwu11}
 W. Hwu,
-\textit{GPU Computing Gems Emerald Edition}
+\textit{GPU Computing Gems Emerald Edition}.
 Elsevier,
 2011