Updates from ShareLaTeX
main.tex
@@ -32,7 +32,7 @@ The multilevel fast multipole method (MLFMM) is a key tool for efficiently solving
 The problems are solved using volume integral equations, rather than being converted into a corresponding surface-scattering problem through the equivalence principle, in order to support highly inhomogeneous media.
 The MLFMM implementation for two-dimensional volumetric scattering problems is realized through matrix operations optimized with shared-memory tiling, register tiling, and thread coarsening.
 MPI communications are overlapped with GPU kernels to achieve high multi-node parallel efficiency.
-The MLFMM is evaluated on current- and next-generation GPU-accelerated supercomputing nodes, where up to 969x speedup is achieved over single-thread CPU execution using 4 NVIDIA P100 graphics processing units.
+The MLFMM is evaluated on current- and next-generation GPU-accelerated supercomputing nodes, where up to 969x speedup is achieved over sequential CPU execution using 4 NVIDIA P100 graphics processing units.
 
 \end{abstract}
 
@@ -60,23 +60,7 @@ This section presents an analysis of the performance of the MLFMM algorithm on d
 
 \subsection{Evaluation Environments}
 
-%\begin{table}{}
-%\centering \caption{Evaluation Systems} \label{tab:systems}
-%\begin{tabular}{|c|c|c|c|}
-%\hline & \textbf{XK Node} & \textbf{XE Node} & \textbf{S822LC} \\
-%\hline
-%\hline \textbf{CPU 1} & AMD Opteron 6276 & AMD Opteron 6276 & IBM Power8 \\
-%\hline \textbf{CPU 2} & -- & AMD Opteron 6276 & IBM Power8 \\
-%\hline
-%\hline \textbf{GPU 1} & \makecell{K20X \\ (6 GB RAM)} & -- & P100 (16 GB RAM) \\
-%\hline \textbf{GPU 2} & -- & -- & P100 (16 GB RAM) \\
-%\hline \textbf{GPU 3} & -- & -- & P100 (16 GB RAM) \\
-%\hline \textbf{GPU 4} & -- & -- & P100 (16 GB RAM) \\
-%\hline \textbf{RAM} & 32 GB & 64 GB & 512 GB \\
-%\hline \makecell{\textbf{CPU-GPU} \\ \textbf{Bus}} & PCIe & -- & NVLink \\
-%\hline
-%\end{tabular}
-%\end{table}
 
 
 The performance of MLFMM is evaluated on three systems: XE and XK nodes from the Blue Waters supercomputer~\cite{ncsa}, and an IBM S822LC.
 Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and 32 GB of RAM.
@@ -159,9 +143,8 @@ A 16-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
 Both XE and S822LC achieve more CPU speedup than they have floating-point units (17x with 32 threads on 16 units for XE, 26x with 160 threads on 20 units for S822LC).
 When the floating-point units are oversubscribed, they are more fully utilized.
 
-The CUDA implementations leverage well-understood techniques for optimizing matrix operations, including hybrid shared-memory and register tiling, and thread coarsening~\cite{hwu11}
+The CUDA implementations leverage hybrid shared-memory and register tiling, and thread coarsening~\cite{hwu11}.
 In both systems, using a GPU for MLFMM provides substantial speedup (an additional 3.1x on XE/XK, 9.2x on S822LC) over fully utilizing the CPUs.
 This speedup justifies the considerable time invested in a CUDA implementation.
 Furthermore, nearly linear scaling when using multiple GPUs is also achieved, thanks to overlapping all required MPI communication with GPU computation.
 This corresponds to a reduction in execution time from approximately 33 seconds to 40 milliseconds on XK nodes, and from 28 seconds to 29 milliseconds on S822LC.
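
[Editor's note: the paper's kernels are not listed in this diff. As a rough illustration only, the following is a minimal, hypothetical CUDA sketch of how the three cited optimizations combine in a dense single-precision matrix multiply; it is not the paper's MLFMM kernel, and TILE and TM are illustrative sizes. Each block stages tiles of A and B in shared memory, and each thread is coarsened to compute TM outputs held in registers.

#define TILE 16  // shared-memory tile width
#define TM   4   // outputs per thread: register tiling / thread coarsening

// Each 16x16 thread block computes a (TILE*TM) x TILE tile of C = A * B.
// For brevity, n is assumed to be a multiple of TILE*TM.
// Launch: dim3 grid(n / TILE, n / (TILE * TM)), block(TILE, TILE).
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE * TM][TILE];  // staged tile of A
    __shared__ float Bs[TILE][TILE];       // staged tile of B

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int col  = blockIdx.x * TILE + tx;
    const int row0 = blockIdx.y * TILE * TM + ty * TM;  // first of TM rows

    float acc[TM] = {0.0f};  // per-thread register accumulators

    for (int k0 = 0; k0 < n; k0 += TILE) {
        // Cooperative staging: each thread loads TM elements of A and one of B.
        for (int m = 0; m < TM; ++m)
            As[ty * TM + m][tx] = A[(row0 + m) * n + k0 + tx];
        Bs[ty][tx] = B[(k0 + ty) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            float b = Bs[k][tx];  // one shared-memory read feeds TM outputs
            for (int m = 0; m < TM; ++m)
                acc[m] += As[ty * TM + m][k] * b;
        }
        __syncthreads();
    }

    for (int m = 0; m < TM; ++m)
        C[(row0 + m) * n + col] = acc[m];
}

Reusing one staged Bs value across all TM register accumulators is the payoff of register tiling, and TM is the coarsening factor: fewer, heavier threads in exchange for more data reuse per thread.]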
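[Editor's note: the communication/computation overlap described above follows a common pattern: independent kernel work is issued on one CUDA stream while boundary data moves through a second stream and a non-blocking MPI exchange. The sketch below is a minimal, hypothetical rendering of that pattern, not the paper's pipeline; interior_kernel, halo, and peer are placeholder names, and h_send/h_recv are assumed to be pinned via cudaMallocHost so the copies are truly asynchronous.

#include <mpi.h>
#include <cuda_runtime.h>

// Stand-in for the independent interior work that hides the exchange.
__global__ void interior_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// One pipeline step: interior computation on the compute stream overlaps
// the boundary copy and the MPI exchange driven from the host.
void exchange_and_compute(float* d_data, float* h_send, float* h_recv,
                          int n, int halo, int peer,
                          cudaStream_t compute, cudaStream_t copy) {
    MPI_Request reqs[2];

    // 1. Start the device-to-host copy of boundary data on the copy stream.
    cudaMemcpyAsync(h_send, d_data, halo * sizeof(float),
                    cudaMemcpyDeviceToHost, copy);

    // 2. Interior work proceeds concurrently on the compute stream.
    interior_kernel<<<(n + 255) / 256, 256, 0, compute>>>(d_data + halo, n);

    // 3. Once the copy is done, exchange halos; the kernel is still running.
    cudaStreamSynchronize(copy);
    MPI_Isend(h_send, halo, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recv, halo, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // 4. Upload the received halo, then join both streams before the
    //    next, dependent stage.
    cudaMemcpyAsync(d_data, h_recv, halo * sizeof(float),
                    cudaMemcpyHostToDevice, copy);
    cudaStreamSynchronize(copy);
    cudaStreamSynchronize(compute);
}

A CUDA-aware MPI installation could pass device pointers to MPI_Isend/MPI_Irecv directly and skip the staging copies; the host-staged form shown here is the more portable variant.]
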
@@ -170,22 +153,20 @@ This reflects the slow pace of single-threaded CPU performance improvement.
 On the other hand, the P100 GPU in S822LC provides a 4.4x speedup over the K20X in XK.
 On a per-node basis, the four GPUs in S822LC provide a 17.9x speedup over the single GPU in XK.
 
 
 The nearfield kernel accounts for the majority of the MLFMM execution time.
 The average kernel-execution speedup moving from the K20X to the P100 is 5.3x, and the disaggregation kernel speedup is the largest, at 8x.
 On both the K20X and the P100, this kernel's performance is limited by the amount of CUDA shared memory it requires.
 In S822LC, the newer Pascal GPU architecture provides 64 KB of shared memory per thread-block rather than the 48 KB on XK, which allows more thread-blocks to run concurrently and provides the disproportionate speedup on that machine.
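
[Editor's note: the occupancy effect described in the preceding paragraph can be checked directly with the CUDA occupancy API. This probe is a hypothetical illustration; smem_heavy_kernel is a stand-in, not the paper's disaggregation kernel. As the per-block shared-memory request grows, fewer blocks remain resident per SM, so a larger shared-memory capacity recovers concurrency.

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; only its resource usage matters to the occupancy query.
__global__ void smem_heavy_kernel(float* x) { if (x) x[threadIdx.x] = 0.0f; }

int main() {
    // How many 256-thread blocks remain resident per SM as the kernel's
    // dynamic shared-memory request grows?
    for (size_t smem = 8 * 1024; smem <= 48 * 1024; smem += 8 * 1024) {
        int blocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, smem_heavy_kernel, 256, smem);
        printf("%2zu KiB shared memory -> %d resident blocks per SM\n",
               smem / 1024, blocks);
    }
    return 0;
}]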
 
 
 \section{Conclusions}
 This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
 MLFMM is realized as matrix operations to achieve excellent performance.
-Significant CPU speedup on both systems is achieved with OpenMP, and further eclipsed by CUDA implementations that take advantage of well-understood matrix optimization techniques, up to a speedup of 969x over single-threaded CPU execution on S822LC, bringing execution times from seconds to milliseconds even for large problems.
-On modern GPUs, this speedup justifies the significant CUDA time investment.
+Significant CPU speedup on both systems is achieved with OpenMP, and further eclipsed by CUDA implementations that take advantage of well-understood matrix optimization techniques.
+A speedup of 969x over single-threaded CPU execution is achieved on S822LC, bringing execution times from seconds to milliseconds even for large problems.
+This speedup justifies the significant CUDA time investment.
 
 
 \section*{Acknowledgments}
-This work was supported by the NVIDIA GPU Center of Excellence, the NCSA Petascale Improvement Discovery Program, and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR)- a research collaboration as part of the IBM Cognitive Horizon Network.
+This work was supported by the NVIDIA GPU Center of Excellence, the NCSA Petascale Improvement Discovery Program, and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR).
 
 \bibliographystyle{IEEEtran}
 \begin{thebibliography}{99}
@@ -213,7 +194,23 @@ Available: https://bluewaters.ncsa.illinois.edu/hardware-summary.
 \vfill \pagebreak
 
 
 
+%\begin{table}{}
+%\centering \caption{Evaluation Systems} \label{tab:systems}
+%\begin{tabular}{|c|c|c|c|}
+%\hline & \textbf{XK Node} & \textbf{XE Node} & \textbf{S822LC} \\
+%\hline
+%\hline \textbf{CPU 1} & AMD Opteron 6276 & AMD Opteron 6276 & IBM Power8 \\
+%\hline \textbf{CPU 2} & -- & AMD Opteron 6276 & IBM Power8 \\
+%\hline
+%\hline \textbf{GPU 1} & \makecell{K20X \\ (6 GB RAM)} & -- & P100 (16 GB RAM) \\
+%\hline \textbf{GPU 2} & -- & -- & P100 (16 GB RAM) \\
+%\hline \textbf{GPU 3} & -- & -- & P100 (16 GB RAM) \\
+%\hline \textbf{GPU 4} & -- & -- & P100 (16 GB RAM) \\
+%\hline \textbf{RAM} & 32 GB & 64 GB & 512 GB \\
+%\hline \makecell{\textbf{CPU-GPU} \\ \textbf{Bus}} & PCIe & -- & NVLink \\
+%\hline
+%\end{tabular}
+%\end{table}
 
 %\subsection{Computation Kernel Breakdown}
 