Updates from ShareLaTeX
main.tex
@@ -15,7 +15,8 @@
\usepackage{todonotes}
\usepackage{verbatim}
\title{Solving Problems Involving Inhomogeneous Media with MLFMA on GPU Clusters}
%\title{Solving Problems Involving Inhomogeneous Media with MLFMM on GPU Clusters}
\title{Evaluating MLFMM for Large Scattering Problems on Multiple GPUs}
\author{
{Carl Pearson{\small $^{1}$}, Mert Hidayetoglu{\small $^{1}$}, and
Wen-Mei Hwu{\small $^{1}$} }
@@ -28,24 +29,29 @@ $~^{2}$Second Affiliation, City, Postal Code, Country\\
\maketitle
\begin{abstract}
The multilevel fast multipole method (MLFMM) is a key tool for efficiently solving large scattering problems.
The multilevel fast multipole method (MLFMM) is a key tool for efficiently solving large scattering problems governed by the Helmholtz equation.
Highly inhomogeneous media prevent converting the problem into a surface-scattering problem via the equivalence principle, and therefore we solve the corresponding volume integral equation.
We evaluate an efficient implementation of MLFMM for such two-dimensional volumetric scattering problems on high-performance GPU-accelerated supercomputing nodes.
This class of problems is commonly encountered in imaging and inverse-scattering applications.
We evaluate an efficient implementation of MLFMM for such two-dimensional volumetric scattering problems on high-performance GPU-accelerated supercomputing nodes, where a speedup of up to $969\times$ is achieved over single-threaded CPU execution.
\end{abstract}
\section{Introduction}
\label{sec:introduction}
In order to achieve an efficient implementation on graphics processing units (GPUs), the MLFMM operations are formulated as matrix-matrix multiplications.
To avoid host-device data transfer, common operators are pre-computed, moved to the GPU, and reused as needed.
Large matrices are partitioned among message passing interface (MPI) processes, and each process employs a single GPU for performing partial multiplications.
During the MLFMM multiplications, data is transferred between GPUs through their owning MPI processes by moving the data from GPUs to central processing units (CPUs), from CPUs to CPUs through MPI, and then from CPUs to GPUs.
To hide this communication cost, MPI communication is overlapped with a long-running GPU kernel through a reordering of the MLFMM operations.
This strategy completely hides the communication cost and provides $96$\% MPI parallelization efficiency on up to 16 GPUs.
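The staged GPU--CPU--MPI--CPU--GPU exchange described above could look roughly like the sketch below; the function name, buffer names, and the use of blocking \texttt{MPI\_Sendrecv} and \texttt{cudaMemcpy} calls are our illustrative simplifications, not the actual code.
\begin{verbatim}
// One staged exchange between two MPI ranks, each owning one GPU:
// device-to-host copy, CPU-to-CPU exchange through MPI, then
// host-to-device copy. Names and blocking calls are illustrative.
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_staged(const float* d_send, float* d_recv,
                     float* h_send, float* h_recv,  // host staging buffers
                     int count, int peer, MPI_Comm comm)
{
    size_t bytes = count * sizeof(float);

    // GPU -> CPU
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);

    // CPU -> CPU through MPI
    MPI_Sendrecv(h_send, count, MPI_FLOAT, peer, 0,
                 h_recv, count, MPI_FLOAT, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    // CPU -> GPU
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}
\end{verbatim}
Packing many such small exchanges into one large buffer before the MPI call amortizes the per-message latency, which is the buffer-merging optimization described in the next paragraph.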
In order to achieve an efficient implementation on multiple graphics processing units (GPUs), we formulate the MLFMM operations as matrix-matrix multiplications, where the large matrices are partitioned among message passing interface (MPI) processes. Each process employs a single GPU to perform the corresponding partial multiplications. The implementation can employ up to 16 GPUs. During the MLFMM multiplications, the GPUs communicate through MPI to receive the required data from one another. These communications are costly because they involve moving data from the GPUs to central processing units (CPUs), between CPUs through MPI (as in the traditional CPU implementation), and then from the CPUs back to the GPUs. To minimize this cost, we reduce the amount of data to be transferred and merge small MPI buffers into large ones. Furthermore, we overlap the communications with the GPU computations through a reordering of the MLFMM operations. This strategy completely hides the communication overhead and provides good ($94$\%) MPI parallelization efficiency.
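A minimal sketch of this overlap is shown below; the kernel name \texttt{p2p\_nearfield}, its trivial body, the launch configuration, and the buffer names are placeholders of ours rather than the actual implementation.
\begin{verbatim}
// Overlapping MPI communication with the long-running nearfield
// kernel: the kernel is launched asynchronously in a stream, the
// host performs the MPI exchange while it runs, and the stream is
// synchronized before the dependent far-field stages.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void p2p_nearfield(const float2* x, float2* y, int n)
{
    // Placeholder body standing in for the dense near-field block
    // multiplication; the real kernel is tiled and coarsened.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i];
}

void mlfmm_multiply_overlapped(const float2* d_x, float2* d_y, int n,
                               float* h_send, float* h_recv, int count,
                               int peer, MPI_Comm comm, cudaStream_t stream)
{
    // Launch the dominant P2P kernel asynchronously.
    p2p_nearfield<<<(n + 255) / 256, 256, 0, stream>>>(d_x, d_y, n);

    // While the GPU works, exchange the data needed by later stages.
    MPI_Request reqs[2];
    MPI_Irecv(h_recv, count, MPI_FLOAT, peer, 0, comm, &reqs[0]);
    MPI_Isend(h_send, count, MPI_FLOAT, peer, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Stages that consume h_recv run only after both the kernel
    // and the communication have completed.
    cudaStreamSynchronize(stream);
}
\end{verbatim}
Launching the kernel asynchronously before posting the MPI calls lets the host-side exchange proceed while the GPU is busy; correctness only requires synchronizing before the stages that consume the received data.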
\section{Inverse-Scattering Formulation and Application Architecture}
\section{MLFMM Contribution to Application Time}
\label{sec:application}
Fig.~\ref{fig:app_breakdown} shows the amount of time the full inverse-solver application spends on MLFMM in two parallelized CPU executions.
``XE (32T)'' corresponds to a 32-thread OpenMP parallel run on a single XE node, and S822LC corresponds to a 160-thread OpenMP parallel run on the S822LC node.
Non-MLFMM operations are a minority of the time, and become an even smaller proportion of the time as the object reconstructions grow larger.
The MLFMM execution parameters are described in Section \ref{sec:results}.
MLFMM is the dominant application component, responsible for 72\% of the execution time on a single XE node and 83\% of the time on S822LC \textit{after} full CPU parallelization.
This proportion grows arbitrarily close to $1.0$ as the scattering problems become larger or more challenging, justifying further targeted acceleration of MLFMM.
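To make this limiting behavior explicit (in our own notation, not the paper's), let $t_{\mathrm{MLFMM}}$ and $t_{\mathrm{other}}$ denote the time spent inside and outside MLFMM; the MLFMM fraction is
\[
f = \frac{t_{\mathrm{MLFMM}}}{t_{\mathrm{MLFMM}} + t_{\mathrm{other}}}
  = \frac{1}{1 + t_{\mathrm{other}}/t_{\mathrm{MLFMM}}},
\]
which approaches $1$ whenever $t_{\mathrm{MLFMM}}$ grows faster than $t_{\mathrm{other}}$; the measured $83$\% on S822LC corresponds to $t_{\mathrm{MLFMM}}/t_{\mathrm{other}} \approx 4.9$.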
\begin{figure}[ht]
\begin{center}
@@ -55,15 +61,16 @@ Non-MLFMM operations are a minority of the time, and become an even smaller prop
\end{center}
\caption{
Amount of application time spent in MLFMM for two different execution environments.
MLFMM is the dominant component even with CPU parallelization on a single node.
XE (32T) corresponds to a 32-thread OpenMP parallel run on a single XE node, and S822LC corresponds to a 160-thread OpenMP parallel run on the S822LC node.
MLFMM is the dominant application component even with CPU parallelization.
As object reconstructions grow larger or more challenging, MLFMM time further increases as a proportion of application time.
}
\label{fig:app_breakdown}
\end{figure}
\section{MLFMM Results}
\section{MLFMM Performance Results}
\label{sec:results}
As described in Section \ref{sec:application} and shown in Fig. \ref{fig:app_breakdown}, the MLFMM realization of matrix-vector multiplications forms the core computational kernel of the application, and its performance dominates that of the full inverse solver.
This section presents an analysis of the performance of the MLFMM algorithm in three different environments.
@@ -159,7 +166,7 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spe
\label{fig:kernel_breakdown}
\end{figure}
Since MLFMM is realized as dense matrix operations, the CUDA implementations leverage well-understood techniques for dense matrix-matrix and matrix-vector multiplication, including
Since MLFMM is realized as matrix operations, the CUDA implementations leverage well-understood techniques for matrix-matrix and matrix-vector multiplication, including
hybrid shared-memory and register tiling, and thread coarsening.
The \texttt{P2P} nearfield kernel is the majority of the MLFMM execution time.
The MPI communication is hidden behind this long-running kernel.
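As a generic illustration of these techniques (not the production \texttt{P2P} kernel; the kernel name, tile sizes, and real-valued data are our simplifying assumptions), a hybrid shared-memory/register-tiled matrix-matrix multiply with thread coarsening can be sketched as follows.
\begin{verbatim}
// Illustrative tiled multiply C = A * B (row-major, real-valued)
// combining shared-memory tiling with a per-thread register tile:
// each thread computes COARSEN output columns, reusing each loaded
// element of A across COARSEN accumulators held in registers.
#define TILE    16
#define COARSEN  4   // outputs per thread along the column dimension

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE * COARSEN];

    int row  = blockIdx.y * TILE + threadIdx.y;
    int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;

    float acc[COARSEN] = {0.0f};  // register tile

    for (int t = 0; t < K; t += TILE) {
        // Stage one tile of A and a coarsened tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        for (int c = 0; c < COARSEN; ++c) {
            int col = col0 + c * TILE;
            Bs[threadIdx.y][threadIdx.x + c * TILE] =
                (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        }
        __syncthreads();

        // Accumulate partial products in registers.
        for (int k = 0; k < TILE; ++k) {
            float a = As[threadIdx.y][k];
            for (int c = 0; c < COARSEN; ++c)
                acc[c] += a * Bs[k][threadIdx.x + c * TILE];
        }
        __syncthreads();
    }

    if (row < M)
        for (int c = 0; c < COARSEN; ++c) {
            int col = col0 + c * TILE;
            if (col < N) C[row * N + col] = acc[c];
        }
}
\end{verbatim}
A launch such as \texttt{gemm\_tiled<<<dim3((N+63)/64,(M+15)/16), dim3(16,16)>>>(dA,dB,dC,M,N,K)} covers the output matrix; the coarsening factor of four lets each staged element of \texttt{A} feed four register accumulators before it is reloaded.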
@@ -247,8 +254,8 @@ In S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory p
\section{Conclusions}
This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
MLFMM is realized as dense matrix operations.
Significant CPU speedup on both systems is achieved with OpenMP, and it is further eclipsed by CUDA implementations that take advantage of well-understood dense matrix optimization techniques, reaching a speedup of up to $969\times$ over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
MLFMM is realized as matrix operations.
Significant CPU speedup on both systems is achieved with OpenMP, and it is further eclipsed by CUDA implementations that take advantage of well-understood matrix optimization techniques, reaching a speedup of up to $969\times$ over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
On modern GPUs, this speedup justifies the significant CUDA time investment.