Merge sharelatex-2017-05-08-2106 into master

Carl Pearson
2017-05-08 14:06:31 -07:00
committed by GitHub


@@ -121,8 +121,8 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
-Dark bars represents execution time (left axis).
-Light bars show speedup normalized to the ``1T'' execution (right axis).
+Light bars represent execution time (left axis).
+Dark bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}
@@ -133,13 +133,13 @@ When more threads than units are created, each unit is more fully-utilized than
thread-to-unit conditions.
In both systems, using a GPU for MLFMM provides substantial speedup (additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
-In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time investmeed in a CUDA implementation.
+In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling when using multiple GPUs is also achieved thanks to overlapping all required MPI communication with GPU computation, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes, and $969\times$ when using $4$ GPUs on S822LC.
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
Despite the 5-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
-The corresponding single-GPU speedup in S822LC over XK $4.4\times$.
+The corresponding single-GPU speedup in S822LC over XK is $4.4\times$.
On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
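As a quick consistency check on the figures in the hunk above (an editor's illustration, using only the rounded times and speedups quoted in the text), the end-to-end speedups are simply the ratio of the ``1T'' time to the multi-GPU time:

\[
\frac{t_{\text{1T}}}{t_{\text{GPU}}} \approx \frac{28\ \mathrm{s}}{29\ \mathrm{ms}} \approx 966\times \quad (\text{S822LC; reported } 969\times),
\qquad
\frac{33\ \mathrm{s}}{40\ \mathrm{ms}} \approx 825\times \quad (\text{XK; reported } 794\times).
\]

The small discrepancies are consistent with the rounding of the quoted execution times.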
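The overlap of MPI communication with GPU computation credited above for the near-linear scaling follows a standard CUDA pattern. The sketch below is an editor's illustration, not the paper's MLFMM code: the kernel local_work, the HALO size, and the ring-neighbor exchange are all hypothetical stand-ins. It shows independent device work running in one stream while boundary data is exchanged with nonblocking MPI on the host.

// Minimal sketch of MPI/GPU overlap, assuming a hypothetical kernel and
// exchange pattern; this is NOT the paper's MLFMM implementation.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void local_work(float *data, int n) {
  // Stand-in for MLFMM work that needs no remote data.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int N = 1 << 20, HALO = 1 << 10;          // hypothetical sizes
  float *d_data, *h_send, *h_recv;
  cudaMalloc(&d_data, N * sizeof(float));
  cudaMemset(d_data, 0, N * sizeof(float));
  cudaMallocHost(&h_send, HALO * sizeof(float));  // pinned, so async copies overlap
  cudaMallocHost(&h_recv, HALO * sizeof(float));

  cudaStream_t compute, comm;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&comm);

  // 1. Stage boundary data to the host on the communication stream.
  cudaMemcpyAsync(h_send, d_data, HALO * sizeof(float),
                  cudaMemcpyDeviceToHost, comm);
  cudaStreamSynchronize(comm);  // h_send must be ready before MPI uses it

  // 2. Start the nonblocking exchange, then immediately launch the
  //    independent kernel; MPI progresses on the host while the GPU works.
  MPI_Request reqs[2];
  int peer = (rank + 1) % size;  // hypothetical ring partner
  MPI_Irecv(h_recv, HALO, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(h_send, HALO, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
  local_work<<<(N + 255) / 256, 256, 0, compute>>>(d_data, N);

  // 3. Finish the exchange, wait for the kernel, then push received
  //    data back to the device.
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  cudaStreamSynchronize(compute);
  cudaMemcpyAsync(d_data, h_recv, HALO * sizeof(float),
                  cudaMemcpyHostToDevice, comm);
  cudaDeviceSynchronize();

  cudaFree(d_data);
  cudaFreeHost(h_send);
  cudaFreeHost(h_recv);
  cudaStreamDestroy(compute);
  cudaStreamDestroy(comm);
  MPI_Finalize();
  return 0;
}

So long as the exchange completes before the independent kernel does, communication adds essentially no wall-clock time, which is what makes the near-linear multi-GPU scaling reported above possible.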
\subsection{Computation Kernel Breakdown}
@@ -160,7 +160,7 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spe
\label{fig:kernel_breakdown}
\end{figure}
+The \texttt{L2L} kernels exhibit the
%This document is a template for authors preparing papers for the
%CEM'17 Computing and Electromagnetics Workshop in Barcelona, Spain.