This section presents an analysis of the performance of the MLFMM algorithm in three different computing systems.
The performance of MLFMM is evaluated on three different computing systems: Blue Waters XE nodes, Blue Waters XK nodes, and an IBM S822LC.
The Blue Waters XE and XK nodes are two different kinds of computing nodes available on the Blue Waters supercomputer\cite{ncsa}.
Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for $16$ executing threads, and $32$~GB of RAM.
The XK node replaces one of these CPUs with an NVIDIA K20X GPU with the Kepler architecture and $6$~GB of RAM.
The K20X is connected to the Opteron 6276 via PCIe.

The S822LC is a two-socket system with two IBM Power8 CPUs and four NVIDIA P100 GPUs.
The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.

All evaluations are done on a problem with these parameters. \todo{get from mert}

Fig.~\ref{fig:mlfmm_bw} shows MLFMM performance scaling on various Blue Waters configurations.
``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
``1 GPU'' is a GPU-accelerated execution on an XK node.
``4 GPU'' and ``16 GPU'' are GPU-accelerated multi-node executions using $4$ and $16$ XK nodes with MPI communication.
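A minimal sketch of how such executions can bind MPI ranks to GPUs is shown below; it illustrates the general one-rank-per-device pattern and is not the MLFMM source.
\begin{verbatim}
// Illustrative rank-to-GPU binding (not the MLFMM
// source): each MPI rank claims one device, whether
// the ranks span XK nodes or share one S822LC.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, numDevices;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&numDevices);
  cudaSetDevice(rank % numDevices); // one GPU per rank
  // ... per-rank MLFMM work on the selected GPU ...
  MPI_Finalize();
  return 0;
}
\end{verbatim}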
\begin{figure}[htbp]
\begin{center}
% figure graphics omitted
\end{center}
\caption{
MLFMM execution time and speedup over single-threaded execution on Blue Waters XE and XK nodes.
Dark bars represent execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_bw}
\end{figure}
Fig.~\ref{fig:mlfmm_minsky} shows the MLFMM performance scaling for various S822LC configurations.
``160T'' is a $160$-thread OpenMP execution on the S822LC.
``1 GPU'' and ``4 GPU'' are GPU-accelerated single-node executions using $1$ and $4$ GPUs in S822LC.
Even though only one S822LC node is used, $4$ MPI ranks are executed on that single node.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
\begin{figure}[htbp]
\begin{center}
% figure graphics omitted
\end{center}
\caption{
MLFMM execution time and speedup over single-threaded execution on the S822LC.
Dark bars represent execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_minsky}
\end{figure}
Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ at $32$ threads on $16$ units for XE, $26\times$ at $160$ threads on $20$ units for S822LC).
When more threads than units are created, each unit is more fully utilized than it would be under one-to-one thread-to-unit conditions.
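As an illustration of this oversubscription (a minimal sketch of a ``32T''-style run, not the MLFMM code), more OpenMP threads are simply requested than there are floating-point units:
\begin{verbatim}
// Illustrative oversubscription (not the MLFMM code):
// 32 threads share 16 FP units, so one thread can
// issue FP work while another waits on memory.
#include <omp.h>
#include <stdio.h>

int main(void) {
  double sum = 0.0;
  omp_set_num_threads(32); // the "32T" configuration
  #pragma omp parallel for reduction(+ : sum)
  for (long i = 1; i <= 100000000L; ++i)
    sum += 1.0 / (double)i; // FP work per iteration
  printf("sum = %f\n", sum);
  return 0;
}
\end{verbatim}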
In both systems, using a GPU for MLFMM provides substantial speedup (an additional $3.1\times$ on XE/XK, $9.2\times$ on the S822LC) over fully utilizing the CPUs.
On current-generation GPUs like the P100 in the S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling across multiple GPUs is achieved by overlapping all required MPI communication with GPU computation, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes.\todo{s822lc numbers}
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds.
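The communication/computation overlap can be sketched as follows; the skeleton is hypothetical (the kernel \texttt{localWork}, the buffers, and \texttt{peer} are stand-ins of our own, not the MLFMM implementation): nonblocking MPI transfers are posted first, the GPU works on local data while messages are in flight, and both are awaited before results are combined.
\begin{verbatim}
// Hypothetical overlap skeleton (not the actual MLFMM
// pipeline): post nonblocking MPI, launch GPU work,
// then wait on both.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void localWork(double *d, int n) { // stand-in
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0;
}

void step(double *sendBuf, double *recvBuf,
          double *dLocal, int n, int peer) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  MPI_Request reqs[2];
  MPI_Irecv(recvBuf, n, MPI_DOUBLE, peer, 0,
            MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(sendBuf, n, MPI_DOUBLE, peer, 0,
            MPI_COMM_WORLD, &reqs[1]);
  localWork<<<(n + 255) / 256, 256, 0, stream>>>(dLocal, n);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); // comm done
  cudaStreamSynchronize(stream);             // compute done
  cudaStreamDestroy(stream);
}
\end{verbatim}
With data that originates on the device, a \texttt{cudaMemcpyAsync} on the same stream (or a CUDA-aware MPI) would precede the \texttt{MPI\_Isend}.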
Despite the five-year gap between the deployments of Blue Waters and the S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on the S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
The corresponding single-GPU speedup of the S822LC over the XK node is $4.4\times$.
On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
\subsection{Computation Kernel Breakdown}
Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spent in each kernel.
OpenMP provides significant CPU speedup.
On modern accelerators, the GPU speedup justifies the investment in a CUDA implementation.
This parallelism is responsible for making the problem solvable on useful timescales.
\section*{Acknowledgment}
\bibliographystyle{IEEEtran}
\begin{thebibliography}{99}
\bibitem{ncsa} National Center for Supercomputing Applications, ``System Summary,'' [Online]. Available: https://bluewaters.ncsa.illinois.edu/hardware-summary. [Accessed: 8-May-2017].
|
\end{thebibliography}
\end{document}