Merge sharelatex-2017-05-15-0146 into master
main.tex
@@ -19,20 +19,21 @@
|
|||||||
%\title{Solving Problems Involving Inhomogeneous Media with MLFMM on GPU Clusters}
|
%\title{Solving Problems Involving Inhomogeneous Media with MLFMM on GPU Clusters}
|
||||||
\title{Evaluating MLFMM for Large Scattering Problems on Multiple GPUs}
|
\title{Evaluating MLFMM for Large Scattering Problems on Multiple GPUs}
|
||||||
\author{
|
\author{
|
||||||
{Carl Pearson{\small $^{1}$}, Mert Hidayetoglu{\small $^{1}$}, and
|
{Carl Pearson{\small $^{1}$}, Mert Hidayetoglu{\small $^{1}$}, Wei Ren{\small $^{1}$}, and Wen-Mei Hwu{\small $^{1}$} }
|
||||||
Wen-Mei Hwu{\small $^{1}$} }
|
|
||||||
\vspace{1.6mm}\\
|
\vspace{1.6mm}\\
|
||||||
\fontsize{10}{10}\selectfont\itshape
|
\fontsize{10}{10}\selectfont\itshape
|
||||||
$~^{1}$University of Illinois Urbana-Champaign Electrical and Computer Engineering, Urbana, 61801, USA\\
|
$~^{1}$Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, 61801, USA\\
|
||||||
$~^{2}$Second Affiliation, City, Postal Code, Country\\
|
$~^{2}$Second Affiliation, City, Postal Code, Country\\
|
||||||
\fontsize{9}{9}\upshape \texttt{\{pearson, hidayet2, w-hwu\}}@illinois.edu}
|
\fontsize{9}{9}\upshape \texttt{\{pearson, hidayet2, weiren2, w-hwu\}}@illinois.edu}
|
||||||
\begin{document}
|
\begin{document}
|
||||||
\maketitle
|
\maketitle
|
||||||
|
|
||||||
\begin{abstract}
|
\begin{abstract}
|
||||||
The multilevel fast multiple method (MLFMM) is a key tool for efficiently solving large scattering problems govered by the Hemholtz equation.
|
The multilevel fast multipole method (MLFMM) is a key tool for efficiently solving large scattering problems governed by the Helmholtz equation.
|
||||||
Highly inhomogeneous media prevents converting the problem into a surface-scattering problem via equivalence principle, and therefore the problem is solved using the corresponding volume integral equation.
|
To support highly inhomogeneous media, the problems are solved using volume integral equations rather than being converted into corresponding surface-scattering problems through the equivalence principle.
|
||||||
We evaluate an efficient implementation of MLFMM for such two-dimensional volumetric scattering problems on high-performance GPU-accelerated supercomputing nodes, where up to 969x speedup is achieved over single-thread CPU execution using 4 NVIDIA P100 GPUs .
|
The MLFMM implementation for two-dimensional volumetric scattering problems is realized through matrix operations optimized with shared-memory tiling, register tiling, and thread coarsening.
|
||||||
|
MPI communications are overlapped with GPU kernels to achieve high multi-node parallel efficiency.
|
||||||
|
The MLFMM is evaluated on current- and next-generation GPU-accelerated supercomputing nodes, where up to a $969\times$ speedup is achieved over single-thread CPU execution using four NVIDIA P100 graphics processing units.
|
||||||
|
|
||||||
\end{abstract}
|
\end{abstract}
|
||||||
|
|
||||||
@@ -41,13 +42,16 @@ We evaluate an efficient implementation of MLFMM for such two-dimensional volume
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
In order to achieve an efficient implementation on graphics processing units (GPUs), the MLFMM operations are formulated as matrix-matrix multiplications.
|
MLFMM computes pairwise interactions between pixels in the scattering problem by hierarchically clustering pixels into a spatial quad-tree. In a ``nearfield'' phase, interactions between nearby pixels are computed directly at the lowest level of the tree. ``Aggregation'' and ``disaggregation'' phases propagate interactions up and down the tree, and a ``translation'' phase propagates long-range interactions within a level. In this way, the $N^2$ pairwise interactions among $N$ pixels are computed with $\mathcal{O}(N)$ work~\cite{rokhlin93}.
|
||||||
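To make the traversal order of the four phases concrete, the following is a minimal structural sketch in C++; the level type and phase names (\texttt{nearfield}, \texttt{aggregate}, \texttt{translate}, \texttt{disaggregate}) are hypothetical stand-ins, not the implementation evaluated in this paper.

\begin{verbatim}
#include <vector>

// One level of the spatial quad-tree; cluster data elided for brevity.
struct Level { /* cluster centers, expansions, ... */ };

// Empty stand-ins for the four MLFMM phases.
void nearfield(Level&) {}                          // direct nearby-pixel interactions (P2P)
void aggregate(Level& child, Level& parent) {}     // push interactions up the tree (P2M/M2M)
void translate(Level&) {}                          // long-range interactions within a level
void disaggregate(Level& parent, Level& child) {}  // push interactions down the tree (L2L/L2P)

// tree[0] is the root; tree.back() holds the leaf-level pixels.
void mlfmm_pass(std::vector<Level>& tree) {
  nearfield(tree.back());                        // independent of the tree sweeps
  for (int l = (int)tree.size() - 1; l > 0; --l) // upward sweep
    aggregate(tree[l], tree[l - 1]);
  for (Level& level : tree)                      // translation at every level
    translate(level);
  for (int l = 1; l < (int)tree.size(); ++l)     // downward sweep
    disaggregate(tree[l - 1], tree[l]);
}
\end{verbatim}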
To avoid host-device data transfer, common operators are pre-computed, moved to the GPU, and reused as needed.
|
Even with this algorithmic speedup, a high-performance parallel MLFMM implementation is needed to take full advantage of high-performance computing resources.
|
||||||
Large matrices are partitioned among message passing interface (MPI) processes and each process employs a single GPU for performing partial multiplications.
|
This work presents how a GPU-accelerated MLFMM scales effectively from current- to next-generation computers.
|
||||||
During the MLFMM multiplications, data is transferred between GPUs through their owning MPI processes bye moving the data from GPUs to central processing units (CPUs), CPUs to CPUs through MPI, and then from CPUs to GPUs.
|
|
||||||
To hide this communication cost, MPI communication is overlapped with a long-running GPU kernel through a reordering of the MLFMM operations.
|
|
||||||
This strategy completely hides the communication cost and provides $96$\%, MPI parallelization efficiency on up to 16 GPUs.
|
|
||||||
|
|
||||||
|
To achieve an efficient implementation on graphics processing units (GPUs), these four MLFMM phases are formulated as matrix multiplications.
|
||||||
|
Common operators are pre-computed, moved to the GPU, and reused as needed to avoid host-device data transfer.
|
||||||
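A minimal sketch of this pre-compute-once, reuse-everywhere pattern is shown below; it assumes, purely for illustration, a dense single-precision complex operator rather than the actual operator layout.

\begin{verbatim}
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <vector>

// Copy a precomputed operator to the GPU once. The returned device pointer is
// reused by every MLFMM iteration, so no host-device transfer is needed
// inside the solver loop; it is released with cudaFree() after the last use.
cuFloatComplex* upload_operator(const std::vector<cuFloatComplex>& host_op) {
  cuFloatComplex* dev_op = nullptr;
  const size_t bytes = host_op.size() * sizeof(cuFloatComplex);
  cudaMalloc(&dev_op, bytes);
  cudaMemcpy(dev_op, host_op.data(), bytes, cudaMemcpyHostToDevice);
  return dev_op;
}
\end{verbatim}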
|
Large matrices are partitioned among Message Passing Interface (MPI) processes, and each process employs a single GPU to perform its partial multiplications.
|
||||||
|
During the MLFMM multiplications, data is transferred between GPUs through their owning MPI processes by moving the data from GPUs to central processing units (CPUs), from CPU to CPU through MPI, and then from CPUs back to GPUs.
|
||||||
|
To hide this communication cost, MPI communication is overlapped with GPU kernels.
|
||||||
|
This strategy completely hides the communication cost and provides $96$\% MPI parallelization efficiency on up to 16 GPUs.
|
||||||
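The sketch below illustrates this overlap pattern; the kernel, buffers, and single-peer exchange are hypothetical simplifications rather than the communication pattern of the actual implementation, and the nearfield stream is assumed to have been created with the \texttt{cudaStreamNonBlocking} flag.

\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder for the long-running nearfield (P2P) kernel.
__global__ void nearfield_kernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];  // stand-in for the real nearfield work
}

// Because `stream` is non-blocking, the synchronous copies below (issued on
// the default stream) do not wait for the kernel, so the exchange proceeds
// while the kernel runs.
void overlapped_step(const float* d_pix, float* d_near, int n,
                     float* d_agg, float* h_send, float* h_recv, int m,
                     int peer, MPI_Comm comm, cudaStream_t stream) {
  // 1. Launch the nearfield kernel asynchronously.
  nearfield_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_pix, d_near, n);

  // 2. Meanwhile: aggregation results GPU -> CPU, CPU -> CPU over MPI,
  //    and the remote results CPU -> GPU.
  cudaMemcpy(h_send, d_agg, m * sizeof(float), cudaMemcpyDeviceToHost);
  MPI_Sendrecv(h_send, m, MPI_FLOAT, peer, 0,
               h_recv, m, MPI_FLOAT, peer, 0, comm, MPI_STATUS_IGNORE);
  cudaMemcpy(d_agg, h_recv, m * sizeof(float), cudaMemcpyHostToDevice);

  // 3. Translation may start only after both the kernel and the exchange finish.
  cudaStreamSynchronize(stream);
}
\end{verbatim}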
|
|
||||||
|
|
||||||
\section{MLFMM Performance Results}
|
\section{MLFMM Performance Results}
|
||||||
@@ -75,24 +79,18 @@ This section presents an analysis of the performance of the MLFMM algorithm on d
|
|||||||
%\end{tabular}
|
%\end{tabular}
|
||||||
%\end{table}
|
%\end{table}
|
||||||
|
|
||||||
The performance of MLFMM is evaluated on three different computing systems: Blue Waters XE nodes, Blue Waters XK nodes, and an IBM S822LC.
|
The performance of MLFMM is evaluated on three systems: XE and XK nodes from the Blue Waters supercomputer~\cite{ncsa}, and an IBM S822LC.
|
||||||
The Blue Waters XE and XK nodes are two different kinds of computing nodes available on the Blue Waters supercomputer~\cite{ncsa}.
|
|
||||||
Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and $32$~GB of RAM.
|
Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and $32$~GB of RAM.
|
||||||
The XK node replaces one of these CPUs with an NVIDIA K20X GPU with the Kepler architecture and $6$~GB of RAM.
|
The XK node replaces one of these CPUs with an NVIDIA K20X GPU and $6$~GB of RAM.
|
||||||
The K20x is connected to the Operton 6276 with PCIe.
|
% The K20X is connected to the Opteron 6276 with PCIe.
|
||||||
These XE and XK nodes are representative of the compute capabilities of current-generation clusters and supercomputers.
|
These XE and XK nodes are representative of the compute capabilities of current-generation clusters and supercomputers.
|
||||||
The IBM S822LC represents a next-generation accelerator-heavy supercomputing node.
|
The IBM S822LC represents a next-generation accelerator-heavy supercomputing node.
|
||||||
It has two IBM Power8 CPUs with ten floating-point units, support for 80 executing threads, and $256$~GB of RAM.
|
It has two IBM Power8 CPUs, each with ten floating-point units, support for 80 executing threads, and $256$~GB of RAM.
|
||||||
In addition, each Minsky machine has four NVIDIA P100 GPUs Pascal-architecture GPUs with $16$~GB of RAM.
|
It also has four NVIDIA P100 GPUs with $16$~GB of RAM each.
|
||||||
The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
|
% The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
|
||||||
|
|
||||||
\subsection{MLFMM Contribution to Application Time}
|
\subsection{MLFMM Contribution to Application Time}
|
||||||
|
|
||||||
The MLFMM realization of matrix-vector multiplications forms the core computational kernel of the application, and its performance dominates that of the full inverse solver.
|
|
||||||
Fig.~\ref{fig:app_breakdown} shows the amount of time the full inverse-solver application spends on MFLMM in two parallelized CPU executions.
|
|
||||||
MLFMM is responsible for 72\% of the execution time on a single XE node and 83\% of time on S822LC \textit{after} full CPU parallelization.
|
|
||||||
This proportion grows arbitrarily close to $1.0$ as the scattering problems become larger or more challenging, justifying further targeted acceleration of MLFMM.
|
|
||||||
|
|
||||||
\begin{figure}[t]
|
\begin{figure}[t]
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\begin{tabular}{c}
|
\begin{tabular}{c}
|
||||||
@@ -107,6 +105,31 @@ This proportion grows arbitrarily close to $1.0$ as the scattering problems beco
|
|||||||
\label{fig:app_breakdown}
|
\label{fig:app_breakdown}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
|
As shown in Fig.~\ref{fig:app_breakdown}, MLFMM forms the core computational kernel of the application, and its performance dominates that of the full inverse solver in CPU-parallelized execution on XE and S822LC ($72$\% and $83$\%, respectively).
|
||||||
|
This proportion approaches $100$\% as the scattering problems become larger or more challenging, justifying further targeted acceleration of MLFMM.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
\subsection{MPI Communication Overlap}
|
||||||
|
|
||||||
|
Fig.~\ref{fig:mpi_overlap} shows a timeline of a particular MLFMM execution with 16 GPUs on 16 XK nodes.
|
||||||
|
This configuration is the worst case for overlapping computation and communication: the MLFMM tree structure is divided amongst many GPUs, creating the shortest kernel times and the most required communication.
|
||||||
|
Even in this scenario, the required communication time is substantially shorter than the long-running ``P2P'' nearfield kernel with which it is overlapped.
|
||||||
|
This method will hide the full communication cost even for faster GPUs or slower inter-process communication.
|
||||||
|
|
||||||
|
\begin{figure}[htbp]
|
||||||
|
\begin{center}
|
||||||
|
\begin{tabular}{c}
|
||||||
|
\mbox{\psfig{figure=figures/mpi_comm.png,width=8cm}}
|
||||||
|
\end{tabular}
|
||||||
|
\end{center}
|
||||||
|
\caption{
|
||||||
|
Representation of MPI communication overlap during GPU-accelerated MLFMM on 16 XK nodes.
|
||||||
|
The communication during nearfield computations provides the results of aggregation on other GPUs to the translation on each GPU.
|
||||||
|
}
|
||||||
|
\label{fig:mpi_overlap}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
\subsection{MLFMM Performance}
|
\subsection{MLFMM Performance}
|
||||||
@@ -138,61 +161,26 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
|
|||||||
|
|
||||||
|
|
||||||
Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ with $32$ threads on $16$ units for XE, $26\times$ with $160$ threads on $20$ units for S822LC).
|
Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ with $32$ threads on $16$ units for XE, $26\times$ with $160$ threads on $20$ units for S822LC).
|
||||||
When more threads than units are created, each unit is more fully-utilized than it would be under one-to-one
|
When floating-point units are oversubscribed, they are more fully utilized.
|
||||||
thread-to-unit conditions.
|
|
||||||
|
|
||||||
|
The CUDA implementations leverage well-understood techniques for optimizing matrix operations, including hybrid shared-memory and register tiling, and thread coarsening.
|
||||||
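As a generic illustration of these three techniques (a real-valued square matrix multiply with arbitrary tile sizes, not one of the MLFMM operator kernels), each thread block below loads one shared-memory tile of $A$ and reuses it across several tiles of $B$, while each thread keeps several partial sums in registers and produces several outputs.

\begin{verbatim}
#define TILE 16      // shared-memory tile width
#define COARSEN 4    // outputs per thread (thread coarsening)

// C = A * B for N x N row-major matrices; N is assumed to be a multiple of
// TILE * COARSEN. Launch with blockDim = (TILE, TILE) and
// gridDim = (N / (TILE * COARSEN), N / TILE).
__global__ void matmul_tiled_coarsened(const float* A, const float* B,
                                       float* C, int N) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];

  int row  = blockIdx.y * TILE + threadIdx.y;
  int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;

  float acc[COARSEN];  // register tile of partial sums
  for (int c = 0; c < COARSEN; ++c) acc[c] = 0.0f;

  for (int t = 0; t < N / TILE; ++t) {
    // One tile of A is loaded once and reused for COARSEN tiles of B.
    As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
    for (int c = 0; c < COARSEN; ++c) {
      Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col0 + c * TILE];
      __syncthreads();
      for (int k = 0; k < TILE; ++k)
        acc[c] += As[threadIdx.y][k] * Bs[k][threadIdx.x];
      __syncthreads();
    }
  }
  for (int c = 0; c < COARSEN; ++c)
    C[row * N + col0 + c * TILE] = acc[c];
}
\end{verbatim}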
In both systems, using a GPU for MLFMM provides substantial speedup (additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
|
In both systems, using a GPU for MLFMM provides substantial speedup (additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
|
||||||
In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
|
This speedup justifies the considerable time invested in a CUDA implementation.
|
||||||
Furthermore, nearly linear scaling when using multiple GPUs is also achieved thanks to overlapping all required MPI communication with GPU computation, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes, and $969\times$ when using $4$ GPUs on S822LC.
|
Furthermore, nearly linear scaling across multiple GPUs is achieved by overlapping all required MPI communication with GPU computation.
|
||||||
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
|
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
|
||||||
|
|
||||||
Despite the 5-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
|
Despite the 5-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
|
||||||
This reflects the current slow pace of single-threaded CPU performance improvement.
|
This reflects the slow pace of single-threaded CPU performance improvement.
|
||||||
On the other hand, the P100 GPU in S822LC provides $4.4\times$ speedup over the K20x in XK.
|
On the other hand, the P100 GPU in S822LC provides $4.4\times$ speedup over the K20x in XK.
|
||||||
On a per-node basis the four GPUs in S822LC provide $17.9\times$ speedup over the single GPU in XK.
|
On a per-node basis the four GPUs in S822LC provide $17.9\times$ speedup over the single GPU in XK.
|
||||||
|
|
||||||
\subsection{MPI Communication Overlap}
|
|
||||||
|
|
||||||
\begin{figure}[htbp]
|
|
||||||
\begin{center}
|
|
||||||
\begin{tabular}{c}
|
|
||||||
\mbox{\psfig{figure=figures/mpi_comm.png,width=8cm}}
|
|
||||||
\end{tabular}
|
|
||||||
\end{center}
|
|
||||||
\caption{Representation of MPI communication overlap during a particular MLFMM.}
|
|
||||||
\label{fig:kernel_breakdown}
|
|
||||||
\end{figure}
|
|
||||||
|
|
||||||
\subsection{Computation Kernel Breakdown}
|
|
||||||
|
|
||||||
Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spent in computational kernels.
|
|
||||||
\texttt{P2P} is the ``particle-to-particle'' or nearfield exchanges.
|
|
||||||
\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
|
|
||||||
\texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
|
|
||||||
\texttt{M2M} is the translations.
|
|
||||||
|
|
||||||
\begin{figure}[htbp]
|
|
||||||
\begin{center}
|
|
||||||
\begin{tabular}{c}
|
|
||||||
\mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
|
|
||||||
\end{tabular}
|
|
||||||
\end{center}
|
|
||||||
\caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
|
|
||||||
\label{fig:kernel_breakdown}
|
|
||||||
\end{figure}
|
|
||||||
|
|
||||||
Since MLFMM is realized as matrix operations, the CUDA implementations leverage well-understood techniques for matrix-matrix and matrix-vector multiplication, including
|
|
||||||
hybrid shared-memory and register tiling, and thread coarsening.
|
|
||||||
The \texttt{P2P} nearfield kernel is the majority of the MLFMM execution time.
|
The \texttt{P2P} nearfield kernel is the majority of the MLFMM execution time.
|
||||||
The MPI communication is hidden-behind this long-running kernel.
|
The average kernel-execution speedup moving from K20x to P100 is $5.3\times$, and the \texttt{L2L} disaggregation kernel speedup is the largest, at $8\times$.
|
||||||
The average GPU kernel speedup on four GPU moving from XK to S822LC is $5.3\times$, but the \texttt{L2L} kernel speedup is the largest, at $8\times$.
|
On both K20x and P100, this kernel's performance is limited by the amount of CUDA shared memory it requires.
|
||||||
On both XK and S822LC, this kernel's performance is limited by the amount of CUDA shared memory it requires.
|
|
||||||
In S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per thread-block rather than the $48$~KB on XK, which allows more thread-blocks to run concurrently and provide the disproportionate speedup on that machine.
|
In S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per thread-block rather than the $48$~KB on XK, which allows more thread-blocks to run concurrently and provide the disproportionate speedup on that machine.
|
||||||
|
|
||||||
|
|
||||||
% the following vfill coarsely balances the columns on the last page
|
|
||||||
\vfill \pagebreak
|
|
||||||
|
|
||||||
\section{Conclusions}
|
\section{Conclusions}
|
||||||
This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
|
This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
|
||||||
MLFMM is realized as matrix operations.
|
MLFMM is realized as matrix operations.
|
||||||
@@ -201,10 +189,19 @@ On modern GPUs, this speedup justifies the significant CUDA time investment.
|
|||||||
|
|
||||||
|
|
||||||
\section*{Acknowledgment}
|
\section*{Acknowledgment}
|
||||||
%Acknowledgments should be here.
|
Here are some acknowledgements.
|
||||||
|
Here are some more acknowledgements.
|
||||||
|
This work was funded by grants blah blah blah / blah blah and grant 123465778909 from blah blah blah.
|
||||||
|
|
||||||
\bibliographystyle{IEEEtran}
|
\bibliographystyle{IEEEtran}
|
||||||
\begin{thebibliography}{99}
|
\begin{thebibliography}{99}
|
||||||
|
|
||||||
|
\bibitem{rokhlin93}
|
||||||
|
V. Rokhlin,
|
||||||
|
``Diagonal forms of translation operators for the Helmholtz equation in three dimensions,''
|
||||||
|
\textit{Applied and Computational Harmonic Analysis},
|
||||||
|
vol.~1, no.~1, pp.~82--93, 1993.
|
||||||
|
|
||||||
\bibitem{ncsa}
|
\bibitem{ncsa}
|
||||||
National Center for Supercomputing Applications,
|
National Center for Supercomputing Applications,
|
||||||
``System Summary,''
|
``System Summary,''
|
||||||
@@ -212,6 +209,31 @@ National Center for Supercomputing Applications,
|
|||||||
Available: https://bluewaters.ncsa.illinois.edu/hardware-summary.
|
Available: https://bluewaters.ncsa.illinois.edu/hardware-summary.
|
||||||
[Accessed: 8-May-2017].
|
[Accessed: 8-May-2017].
|
||||||
|
|
||||||
|
% the following vfill coarsely balances the columns on the last page
|
||||||
|
\vfill \pagebreak
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
%\subsection{Computation Kernel Breakdown}
|
||||||
|
|
||||||
|
%Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spent in computational kernels.
|
||||||
|
%\texttt{P2P} is the ``particle-to-particle'' or nearfield exchanges.
|
||||||
|
%\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
|
||||||
|
%\texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
|
||||||
|
%\texttt{M2M} is the translations.
|
||||||
|
|
||||||
|
%\begin{figure}[htbp]
|
||||||
|
%\begin{center}
|
||||||
|
%\begin{tabular}{c}
|
||||||
|
%\mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
|
||||||
|
%\end{tabular}
|
||||||
|
%\end{center}
|
||||||
|
% \caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
|
||||||
|
% \label{fig:kernel_breakdown}
|
||||||
|
%\end{figure}
|
||||||
|
|
||||||
|
|
||||||
%\bibitem{journal} A.~Author, B.~Author, and C.~Author,
|
%\bibitem{journal} A.~Author, B.~Author, and C.~Author,
|
||||||
%``Publication title,'' {\it Journal Title}, vol.~0, no.~0,
|
%``Publication title,'' {\it Journal Title}, vol.~0, no.~0,
|
||||||
%pp.~00--00, Month~Year.
|
%pp.~00--00, Month~Year.
|
||||||
|