Merge sharelatex-2017-05-08-2106 into master

This commit is contained in:
Carl Pearson
2017-05-08 14:06:31 -07:00
committed by GitHub

View File

@@ -121,8 +121,8 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
Dark bars represents execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
Light bars represent execution time (left axis).
Dark bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}
@@ -133,13 +133,13 @@ When more threads than units are created, each unit is more fully-utilized than
thread-to-unit conditions.
In both systems, using a GPU for MLFMM provides substantial speedup (additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time investmeed in a CUDA implementation.
In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling when using multiple GPUs is also achieved thanks to overlapping all required MPI communication with GPU computation, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes, and $969\times$ when using $4$ GPUs on S822LC.
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
Despite the 5-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
The corresponding single-GPU speedup in S822LC over XK $4.4\times$.
The corresponding single-GPU speedup in S822LC over XK is $4.4\times$.
On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
\subsection{Computation Kernel Breakdown}
@@ -160,7 +160,7 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
\label{fig:kernel_breakdown}
\end{figure}
The \texttt{L2L} kernels exhibit the
%This document is a template for authors preparing papers for the
%CEM'17 Computing and Electromagnetics Workshop in Barcelona, Spain.