This commit is contained in:
Carl Pearson
2017-05-08 07:54:23 -07:00
parent 22fdb16fda
commit 9542232927


@@ -88,8 +88,8 @@ This section presents an analysis of the performance of the MLFMM algorithm in t
%\end{tabular} %\end{tabular}
%\end{table} %\end{table}
The performance of MLFMM is evaluated in three different computing systems: Blue Waters XE nodes, Blue Waters XK nodes, and an IBM S822LC. The performance of MLFMM is evaluated on three different computing systems: Blue Waters XE nodes, Blue Waters XK nodes, and an IBM S822LC.
The Blue Waters XE and XK nodes are two different kinds of computing nodes available on the Blue Waters supercomputer. The Blue Waters XE and XK nodes are two different kinds of computing nodes available on the Blue Waters supercomputer\cite{ncsa}.
Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and $32$~GB of RAM. Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and $32$~GB of RAM.
The XK node replaces one of these CPUs with an NVIDIA K20X GPU with the Kepler architecture and $6$~GB of RAM. The XK node replaces one of these CPUs with an NVIDIA K20X GPU with the Kepler architecture and $6$~GB of RAM.
The K20X is connected to the Opteron 6276 with PCIe. The K20X is connected to the Opteron 6276 with PCIe.
@@ -104,7 +104,10 @@ The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
All evaluations are done on a problem with these parameters. \todo{get from mert} All evaluations are done on a problem with these parameters. \todo{get from mert}
Fig.~\ref{fig:mlfmm_bw} shows the MLFMM performance scaling on various Blue Waters configurations. Fig.~\ref{fig:mlfmm_bw} shows MLFMM performance scaling on various Blue Waters configurations.
``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
``1 GPU'' is a GPU-accelerated execution on an XK node.
``4 GPU'' and ``16 GPU'' are GPU-accelerated multi-node executions using $4$ and $16$ XK nodes with MPI communication.
\begin{figure}[htbp] \begin{figure}[htbp]
\begin{center} \begin{center}
@@ -113,13 +116,18 @@ Fig.~\ref{fig:mlfmm_bw} shows the MLFMM performance scaling on various Blue Wate
\end{tabular} \end{tabular}
\end{center} \end{center}
\caption{ \caption{
BW. MLFMM execution time and speedup over single-threaded execution on Blue Waters XE and XK nodes.
Dark bars represent execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
} }
\label{fig:mlfmm_bw} \label{fig:mlfmm_bw}
\end{figure} \end{figure}
Fig.~\ref{fig:mlfmm_minsky} shows the MLFMM performance scaling for various S822LC configurations. Fig.~\ref{fig:mlfmm_minsky} shows the MLFMM performance scaling for various S822LC configurations.
``160T'' is a $160$-thread OpenMP execution on a single S822LC node.
``1 GPU'' and ``4 GPU'' are GPU-accelerated single-node executions using $1$ and $4$ GPUs in S822LC.
Even though only one S822LC node is used, $4$ MPI ranks are executed on that single node.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
\begin{figure}[htbp] \begin{figure}[htbp]
\begin{center} \begin{center}
@@ -128,12 +136,26 @@ Fig.~\ref{fig:mlfmm_minsky} shows the MLFMM performance scaling for various S822
\end{tabular} \end{tabular}
\end{center} \end{center}
\caption{ \caption{
S822LC. MLFMM execution time and speedup over single-threaded execution on the S822LC.
Dark bars represent execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
} }
\label{fig:mlfmm_minsky} \label{fig:mlfmm_minsky}
\end{figure} \end{figure}
Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ at $32$ threads on $16$ units for XE, $26\times$ at $160$ threads on $20$ units for S822LC).
When more threads than units are created, each unit is more fully utilized than it would be
with one thread per unit.
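As a rough check using only the rounded figures quoted above, the CPU speedup obtained per floating-point unit is
\[
  \frac{17}{16} \approx 1.06 \quad \mbox{(XE)}, \qquad \frac{26}{20} = 1.3 \quad \mbox{(S822LC)},
\]
so both ratios exceed one, which is what oversubscribing the units with threads would be expected to produce.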
In both systems, using a GPU for MLFMM provides substantial speedup (additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
On current-generation GPUs such as the P100 in the S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, overlapping all required MPI communication with GPU computation yields nearly linear scaling across multiple GPUs, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes.\todo{s822lc numbers}
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds.
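These two figures are mutually consistent: dividing the approximate ``1T'' execution time by the reported speedup gives
\[
  \frac{33\,\mathrm{s}}{794} \approx 0.042\,\mathrm{s},
\]
i.e., roughly $40$~ms.
As a further estimate (not a measured value), assume the single-GPU speedup over ``1T'' on XK is the product of the rounded factors quoted above, $17 \times 3.1 \approx 53$.
Under that assumption, the $16$-GPU result corresponds to a scaling efficiency of about
\[
  \frac{794}{16 \times 53} \approx 0.94 .
\]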
Despite the five-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
The corresponding single-GPU speedup in S822LC over XK is $4.4\times$.
On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
\subsection{Computation Kernel Breakdown} \subsection{Computation Kernel Breakdown}
@@ -162,7 +184,7 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
%The papers are expected to be two-pages long. %The papers are expected to be two-pages long.
\section{Text Format} %\section{Text Format}
%Page size is A4, which is 210 mm (8.27 in) wide and 297 mm %Page size is A4, which is 210 mm (8.27 in) wide and 297 mm
%(11.69 in) long. The margins are as follows: %(11.69 in) long. The margins are as follows:
%\begin{itemize} %\begin{itemize}
@@ -222,7 +244,7 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
\section{References} %\section{References}
%The heading of the references section is %The heading of the references section is
%not be numbered and all reference items are in 8~pt font. %not be numbered and all reference items are in 8~pt font.
%References are required to be in IEEE style. Please refer to the %References are required to be in IEEE style. Please refer to the
@@ -237,6 +259,9 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
%This template uses IEEE style and provides necessary information %This template uses IEEE style and provides necessary information
%to prepare papers for CEM'17 Workshop. Thank you for your %to prepare papers for CEM'17 Workshop. Thank you for your
%contributions. %contributions.
Significant CPU speedup from OpenMP.
On modern accelerators, speedup justifies CUDA investment.
Parallelism responsible for making the problem solvable in useful timescales.
\section*{Acknowledgment} \section*{Acknowledgment}
@@ -244,18 +269,30 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
\bibliographystyle{IEEEtran} \bibliographystyle{IEEEtran}
\begin{thebibliography}{99} \begin{thebibliography}{99}
\bibitem{journal} A.~Author, B.~Author, and C.~Author, \bibitem{ncsa}
``Publication title,'' {\it Journal Title}, vol.~0, no.~0, National Center for Supercomputing Applications,
pp.~00--00, Month~Year. ``System Summary,''
\bibitem{book1} A.~Author, B.~Author, and C.~Author, [online]
{\it Book Title}. Location: Publisher,~Year. Available: https://bluewaters.ncsa.illinois.edu/hardware-summary.
\bibitem{book2} A.~Author, B.~Author, and C.~Author, [Accessed: 8-May-2017].
``Chapter title,'' in {\it Book Title}, A.~Editor,~Ed. Location:
Publisher,~Year,~Chap.~0. %\bibitem{journal} A.~Author, B.~Author, and C.~Author,
\bibitem{conf1} A.~Author, B.~Author, and C.~Author, ``Paper %``Publication title,'' {\it Journal Title}, vol.~0, no.~0,
title,'' in {\it Proc. Conference Title}, vol.~0, Year, pp.~0--0. %pp.~00--00, Month~Year.
\bibitem{conf2} A.~Author, B.~Author, and C.~Author, ``Paper
title,'' {\it Conference Title}, Location, Country, Month~Year. %\bibitem{book1} A.~Author, B.~Author, and C.~Author,
%{\it Book Title}. Location: Publisher,~Year.
%\bibitem{book2} A.~Author, B.~Author, and C.~Author,
%``Chapter title,'' in {\it Book Title}, A.~Editor,~Ed. Location:
%Publisher,~Year,~Chap.~0.
%\bibitem{conf1} A.~Author, B.~Author, and C.~Author, ``Paper
%title,'' in {\it Proc. Conference Title}, vol.~0, Year, pp.~0--0.
%\bibitem{conf2} A.~Author, B.~Author, and C.~Author, ``Paper
%title,'' {\it Conference Title}, Location, Country, Month~Year.
\end{thebibliography} \end{thebibliography}
\end{document} \end{document}