Merge sharelatex-2017-05-08-2137 into master

This commit is contained in:
Carl Pearson
2017-05-08 14:37:25 -07:00
committed by GitHub

@@ -104,12 +104,7 @@ The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
All evaluations are done on a problem with these parameters. \todo{get from mert}
Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations. On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
On S822LC, the 4 MPI ranks run on a single machine to utilize the $4$ GPUs.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
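For concreteness, a minimal sketch (an assumption for illustration, not the authors' setup code) of binding each MPI rank on a multi-GPU node such as S822LC to its own GPU:
\begin{verbatim}
// Illustrative sketch (assumption, not the MLFMM source): bind each MPI
// rank on a node to a distinct GPU, as in the 4-rank / 4-GPU S822LC runs.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int num_gpus = 0;
  cudaGetDeviceCount(&num_gpus);
  cudaSetDevice(rank % num_gpus);   // rank i uses GPU i on a 4-GPU node

  int dev = -1;
  cudaGetDevice(&dev);
  printf("rank %d -> GPU %d of %d\n", rank, dev, num_gpus);

  // ... per-rank MLFMM work would run here ...
  MPI_Finalize();
  return 0;
}
\end{verbatim}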
@@ -121,6 +116,10 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
1T and 32T are single-threaded and $32$-thread OpenMP executions on a single XE node.
160T is a $160$-thread OpenMP execution on S822LC.
1 GPU is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
4 GPU and 16 GPU are GPU-accelerated executions with a corresponding number of MPI ranks.
Light bars represent execution time (left axis).
Dark bars show speedup normalized to the ``1T'' execution (right axis).
}
@@ -146,7 +145,7 @@ On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9
Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spent in computational kernels.
\texttt{P2P} is the ``particle-to-particle'' or nearfield exchange.
\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
\texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
\texttt{M2L} performs the translations.
@@ -156,11 +155,17 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
\mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
\end{tabular}
\end{center}
\caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
\label{fig:kernel_breakdown}
\end{figure}
Since MLFMM is realized as dense matrix operations, the CUDA implementations leverage well-understood techniques for dense matrix-matrix and matrix-vector multiplication, including
hybrid shared-memory and register tiling, and thread coarsening.
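As an illustration of these techniques (a sketch under the assumption of a square, tile-aligned dense problem; it is not the MLFMM kernel itself), a $C = AB$ CUDA kernel combining shared-memory tiling with register tiling and thread coarsening:
\begin{verbatim}
// Illustrative sketch (assumption, not the MLFMM kernels): dense C = A*B
// with shared-memory tiling, register tiling, and thread coarsening.
// N is assumed to be a multiple of TILE.
#define TILE 32
#define COARSEN 4                       // outputs per thread (register tile)

__global__ void gemm_tiled_coarsened(const float *A, const float *B,
                                      float *C, int N) {
  __shared__ float As[TILE][TILE];      // staged tile of A
  __shared__ float Bs[TILE][TILE];      // staged tile of B

  const int col  = blockIdx.x * TILE + threadIdx.x;
  const int row0 = blockIdx.y * TILE + threadIdx.y * COARSEN;
  float acc[COARSEN] = {0.f, 0.f, 0.f, 0.f};

  for (int t = 0; t < N / TILE; ++t) {
    // Each thread cooperatively loads COARSEN elements of each tile.
    for (int c = 0; c < COARSEN; ++c) {
      int r = threadIdx.y * COARSEN + c;
      As[r][threadIdx.x] =
          A[(blockIdx.y * TILE + r) * N + t * TILE + threadIdx.x];
      Bs[r][threadIdx.x] = B[(t * TILE + r) * N + col];
    }
    __syncthreads();

    // Accumulate COARSEN partial dot products in registers per thread.
    for (int k = 0; k < TILE; ++k) {
      float b = Bs[k][threadIdx.x];
      for (int c = 0; c < COARSEN; ++c)
        acc[c] += As[threadIdx.y * COARSEN + c][k] * b;
    }
    __syncthreads();
  }

  for (int c = 0; c < COARSEN; ++c)
    C[(row0 + c) * N + col] = acc[c];
}
\end{verbatim}
A launch with \texttt{dim3 block(TILE, TILE/COARSEN)} and \texttt{dim3 grid(N/TILE, N/TILE)} gives each thread \texttt{COARSEN} output elements, so staged tiles are reused out of registers rather than re-read from shared or global memory.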
The \texttt{P2P} nearfield kernel accounts for the majority of the MLFMM execution time.
The MPI communication is hidden behind this long-running kernel.
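A minimal sketch of this overlap pattern (an assumption for illustration; the kernel name and buffers are hypothetical):
\begin{verbatim}
// Illustrative sketch (assumption): hide the MPI nearfield exchange behind
// the long-running P2P kernel by launching the kernel asynchronously and
// posting non-blocking MPI calls from the host.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void p2p_kernel(const float *in, float *out, int n) {
  // Placeholder body standing in for the real nearfield computation.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

void p2p_with_overlap(const float *d_in, float *d_out, int n,
                      float *sendbuf, float *recvbuf, int count,
                      int peer, cudaStream_t stream) {
  // The launch returns immediately; the GPU keeps computing ...
  p2p_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);

  // ... while the host drives the MPI exchange concurrently.
  MPI_Request reqs[2];
  MPI_Irecv(recvbuf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(sendbuf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  // The communication cost is hidden whenever the kernel outlives it.
  cudaStreamSynchronize(stream);
}
\end{verbatim}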
The average GPU kernel speedup on four GPUs moving from XK to S822LC is $5.3\times$, but the \texttt{L2L} kernel speedup is the largest, at $8\times$.
On both XK and S822LC, this kernel's performance is limited by the amount of CUDA shared memory it requires.
In S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per streaming multiprocessor rather than the $48$~KB on XK, which allows more thread-blocks to run concurrently and yields the disproportionately large speedup on that machine.
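This occupancy effect can be checked directly; a small sketch (assumed kernel and shared-memory footprint, for illustration only) that queries how many thread-blocks of a shared-memory-heavy kernel can be resident per multiprocessor:
\begin{verbatim}
// Illustrative sketch (assumption): per-block shared memory bounds how many
// thread-blocks are resident per SM, the effect described for L2L above.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void l2l_like_kernel(float *x) {
  extern __shared__ float smem[];           // dynamic shared memory
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  smem[threadIdx.x] = x[i];
  __syncthreads();
  x[i] = smem[threadIdx.x];
}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  const size_t smem_per_block = 16 * 1024;  // hypothetical kernel footprint
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, l2l_like_kernel, /*blockSize=*/256, smem_per_block);

  printf("shared memory per SM: %zu KB\n",
         prop.sharedMemPerMultiprocessor / 1024);
  printf("resident blocks per SM at %zu KB/block: %d\n",
         smem_per_block / 1024, blocks_per_sm);
  return 0;
}
\end{verbatim}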
%This document is a template for authors preparing papers for the
%CEM'17 Computing and Electromagnetics Workshop in Barcelona, Spain.
@@ -241,12 +246,10 @@ The \texttt{L2L} kernels exhibit the
\vfill \pagebreak
\section{Conclusions}
This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
MLFMM is realized as dense matrix operations.
Significant CPU speedup on both systems is achieved with OpenMP and is further eclipsed by CUDA implementations that exploit well-understood dense-matrix optimization techniques, reaching up to a $969\times$ speedup over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
On modern GPUs, this speedup justifies the significant investment in CUDA development.
\section*{Acknowledgment}