Merge branch 'master' of github.com:cwpearson/cem17
main.tex
@@ -104,12 +104,7 @@ The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
All evaluations are done on a problem with these parameters. \todo{get from mert}
Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations. On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
On S822LC, the 4 MPI ranks run on a single machine to utilize the $4$ GPUs.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
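For context on how such multi-rank GPU runs are typically configured (a minimal sketch, not the harness used in this work), each MPI rank can be bound to one of the node's GPUs before any CUDA work is issued; with four ranks on S822LC this assigns one rank per P100, while a single rank per XK node selects that node's only GPU.
{\small
\begin{verbatim}
// Sketch: bind each MPI rank to one GPU on a multi-GPU node.
// Illustrative only; the rank-to-GPU mapping used for the paper's
// measurements may differ.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int numGpus = 0;
  cudaGetDeviceCount(&numGpus);
  // e.g., 4 ranks on one S822LC map onto its 4 P100s;
  // a single rank on an XK node selects that node's only GPU.
  cudaSetDevice(rank % numGpus);

  int dev = -1;
  cudaGetDevice(&dev);
  printf("rank %d bound to GPU %d of %d\n", rank, dev, numGpus);

  MPI_Finalize();
  return 0;
}
\end{verbatim}
}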
@@ -121,6 +116,10 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
1T and 32T are single-threaded and $32$-thread OpenMP executions, respectively, on a single XE node.
160T is a $160$-thread OpenMP execution on S822LC.
1 GPU is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
4 GPU and 16 GPU are GPU-accelerated executions with a corresponding number of MPI ranks.
Light bars represent execution time (left axis).
Dark bars show speedup normalized to the ``1T'' execution (right axis).
}
@@ -146,7 +145,7 @@ On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9
Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spent in computational kernels.
\texttt{P2P} comprises the ``particle-to-particle'' or nearfield exchanges.
\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
\texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
\texttt{M2L} comprises the translations.
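To make the roles of these kernels concrete, the host-side skeleton below (stub functions with hypothetical names, not the authors' implementation) traces the order in which they are invoked during one MLFMM matrix-vector product: upward aggregation, translation, downward disaggregation, and the nearfield exchange.
{\small
\begin{verbatim}
// Structural sketch of one MLFMM matrix-vector product, showing the order in
// which the kernels above are invoked. Stub functions with hypothetical
// names; not the authors' implementation.
#include <cstdio>

static void P2M(int level) { printf("P2M  at leaf level %d\n", level); }
static void M2M(int level) { printf("M2M  %d -> %d\n", level, level - 1); }
static void M2L(int level) { printf("M2L  at level %d\n", level); }
static void L2L(int level) { printf("L2L  %d -> %d\n", level, level + 1); }
static void L2P(int level) { printf("L2P  at leaf level %d\n", level); }
static void P2P()          { printf("P2P  nearfield exchange\n"); }

int main() {
  const int leaf = 4;  // leaf level of the tree (illustrative value)
  P2M(leaf);                                   // lowest-level aggregation
  for (int l = leaf; l > 2; --l)  M2M(l);      // higher-level aggregations
  for (int l = 2; l <= leaf; ++l) M2L(l);      // translations
  for (int l = 2; l < leaf; ++l)  L2L(l);      // higher-level disaggregations
  L2P(leaf);                                   // lowest-level disaggregation
  P2P();                                       // nearfield, dominates run time
  return 0;
}
\end{verbatim}
}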
@@ -156,11 +155,17 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
\mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
\end{tabular}
\end{center}
\caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
\label{fig:kernel_breakdown}
\end{figure}
The \texttt{L2L} kernels exhibit the largest speedup from moving to the newer GPU architecture, as quantified below.
Since MLFMM is realized as dense matrix operations, the CUDA implementations leverage well-understood techniques for dense matrix-matrix and matrix-vector multiplication, including hybrid shared-memory and register tiling, and thread coarsening.
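As an illustration of these techniques (a minimal sketch under simplifying assumptions, not one of the kernels evaluated in this paper), the following CUDA kernel combines shared-memory tiling with register tiling and a thread-coarsening factor of four, so that each thread produces four output elements of a dense matrix-matrix product.
{\small
\begin{verbatim}
// Sketch: shared-memory tiling + register tiling + thread coarsening for
// C (M x N) = A (M x K) * B (K x N), row-major. Illustrative only: sizes
// are assumed to be multiples of the tile widths, so bounds checks are
// omitted.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

#define TILE 16
#define COARSEN 4  // each thread computes COARSEN output elements

__global__ void gemmTiledCoarsened(const float *A, const float *B, float *C,
                                   int M, int N, int K) {
  __shared__ float As[TILE][TILE];            // staged tile of A
  __shared__ float Bs[TILE][TILE * COARSEN];  // staged tiles of B

  int row  = blockIdx.y * TILE + threadIdx.y;
  int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;

  float acc[COARSEN] = {0.f};  // register tile: COARSEN partial sums

  for (int t = 0; t < K; t += TILE) {
    // stage one A tile and COARSEN B tiles in shared memory
    As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
    for (int c = 0; c < COARSEN; ++c)
      Bs[threadIdx.y][threadIdx.x + c * TILE] =
          B[(t + threadIdx.y) * N + col0 + c * TILE];
    __syncthreads();

    for (int k = 0; k < TILE; ++k) {
      float a = As[threadIdx.y][k];  // reused for all COARSEN outputs
      for (int c = 0; c < COARSEN; ++c)
        acc[c] += a * Bs[k][threadIdx.x + c * TILE];
    }
    __syncthreads();
  }

  for (int c = 0; c < COARSEN; ++c)
    C[row * N + col0 + c * TILE] = acc[c];
}

int main() {
  const int M = 256, N = 256, K = 256;
  std::vector<float> hA(M * K, 1.0f), hB(K * N, 1.0f), hC(M * N);
  float *dA, *dB, *dC;
  cudaMalloc(&dA, hA.size() * sizeof(float));
  cudaMalloc(&dB, hB.size() * sizeof(float));
  cudaMalloc(&dC, hC.size() * sizeof(float));
  cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

  dim3 block(TILE, TILE);
  dim3 grid(N / (TILE * COARSEN), M / TILE);
  gemmTiledCoarsened<<<grid, block>>>(dA, dB, dC, M, N, K);
  cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
  printf("C[0] = %.0f (expected %d for all-ones inputs)\n", hC[0], K);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}
\end{verbatim}
}
Coarsening reuses each staged value of A across several outputs, trading a modest increase in register pressure for fewer shared-memory loads.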
The \texttt{P2P} nearfield kernel accounts for the majority of the MLFMM execution time.
The MPI communication is hidden behind this long-running kernel.
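A minimal sketch of this overlap pattern follows (hypothetical kernel and buffer names, not the implementation evaluated here): the nearfield kernel is launched asynchronously into a stream, the host posts non-blocking MPI exchanges while the GPU works, and the stream is synchronized only after the communication completes.
{\small
\begin{verbatim}
// Sketch of compute/communication overlap around the long-running P2P
// (nearfield) kernel. Kernel body and buffers are placeholders.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void p2pNearfield(const float *x, float *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = 2.0f * x[i];  // stand-in for the dense nearfield work
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1 << 20;
  float *dx, *dy;
  cudaMalloc(&dx, n * sizeof(float));
  cudaMalloc(&dy, n * sizeof(float));
  cudaMemset(dx, 0, n * sizeof(float));

  std::vector<float> sendBuf(1024, 1.0f), recvBuf(1024);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // 1. Launch the dominant P2P kernel; control returns to the host at once.
  p2pNearfield<<<(n + 255) / 256, 256, 0, stream>>>(dx, dy, n);

  // 2. Exchange data with a neighboring rank while the GPU works.
  int peer = (rank + 1) % size;
  MPI_Request reqs[2];
  MPI_Isend(sendBuf.data(), (int)sendBuf.size(), MPI_FLOAT, peer, 0,
            MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recvBuf.data(), (int)recvBuf.size(), MPI_FLOAT, peer, 0,
            MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  // 3. Only now wait for the kernel: communication is hidden behind it.
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(dx);
  cudaFree(dy);
  MPI_Finalize();
  return 0;
}
\end{verbatim}
}
Because the kernel launch is asynchronous, the MPI traffic adds little wall-clock time as long as it finishes before the nearfield kernel does.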
The average GPU kernel speedup on four GPUs when moving from XK to S822LC is $5.3\times$, but the \texttt{L2L} kernel speedup is the largest, at $8\times$.
On both XK and S822LC, this kernel's performance is limited by the amount of CUDA shared memory it requires.
On S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per thread-block rather than the $48$~KB on XK, which allows more thread-blocks to run concurrently and yields the disproportionate speedup on that machine.
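This occupancy effect can be inspected directly: the CUDA runtime exposes both the per-block shared memory limit and the per-SM capacity that bounds how many shared-memory-heavy thread-blocks can be resident at once. The small query program below is a sketch; the $16$~KB per-block figure is an assumed example, not the paper's \texttt{L2L} footprint.
{\small
\begin{verbatim}
// Query the shared memory limits that govern how many thread-blocks of a
// shared-memory-heavy kernel (such as L2L) can be resident per SM.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int dev = 0;
  cudaSetDevice(dev);

  int smemPerBlock = 0, smemPerSM = 0;
  cudaDeviceGetAttribute(&smemPerBlock, cudaDevAttrMaxSharedMemoryPerBlock, dev);
  cudaDeviceGetAttribute(&smemPerSM,
                         cudaDevAttrMaxSharedMemoryPerMultiprocessor, dev);

  const int kernelSmem = 16 * 1024;  // hypothetical per-block usage
  printf("max shared memory per block: %d KB\n", smemPerBlock / 1024);
  printf("shared memory per SM:        %d KB\n", smemPerSM / 1024);
  printf("blocks resident per SM (shared-memory limit only): %d\n",
         smemPerSM / kernelSmem);
  return 0;
}
\end{verbatim}
}
On a device with more shared memory per SM, the same per-block footprint permits more resident blocks, which is consistent with the \texttt{L2L} behavior observed above.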
@@ -241,12 +246,10 @@ The \texttt{L2L} kernels exhibit the
\vfill \pagebreak
\section{Conclusions}
This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
MLFMM is realized as dense matrix operations.
Significant CPU speedup is achieved on both systems with OpenMP, and it is further eclipsed by CUDA implementations that take advantage of well-understood dense matrix optimization techniques.
These implementations reach speedups of up to $969\times$ over single-threaded CPU execution on S822LC, bringing execution times from seconds to milliseconds even for large problems.
On modern GPUs, this speedup justifies the significant CUDA time investment.
\section*{Acknowledgment}