diff --git a/main.tex b/main.tex
index 86d5f7d..2184efe 100644
--- a/main.tex
+++ b/main.tex
@@ -104,12 +104,7 @@ The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
 All evaluations are done on a problem with these parameters.
 \todo{get from mert}
-Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations.
-``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
-``160T'' is a $160$-thread OpenMP executions on S822LC.
-``1 GPU'' is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
-``4 GPU'' and ``16 GPU'' are GPU-accelerated executions with a corresponding number of MPI ranks.
-On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
+Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations. On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
 On S822LC, the 4 MPI ranks run on a single machine to utilize the $4$ GPUs.
 A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
@@ -121,6 +116,10 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
 \end{center}
 \caption{
 MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
+ ``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
+ ``160T'' is a $160$-thread OpenMP execution on S822LC.
+ ``1 GPU'' is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
+ ``4 GPU'' and ``16 GPU'' are GPU-accelerated executions with the corresponding number of MPI ranks.
 Light bars represent execution time (left axis).
 Dark bars show speedup normalized to the ``1T'' execution (right axis).
 }
@@ -146,7 +145,7 @@ On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9
 Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spent in computational kernels.
 \texttt{P2P} is the ``particle-to-particle'' or nearfield exchanges.
-\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
+\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
 \texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
 \texttt{M2M} is the translations.
@@ -156,11 +155,17 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
 \mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
 \end{tabular}
 \end{center}
- \caption{Normalized breakdown of the computation time across different MLFMM kernels in different exection environments.}
+ \caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
 \label{fig:kernel_breakdown}
 \end{figure}
-The \texttt{L2L} kernels exhibit the
+Since MLFMM is realized as dense matrix operations, the CUDA implementations leverage well-understood techniques for dense matrix-matrix and matrix-vector multiplication, including
+hybrid shared-memory and register tiling, and thread coarsening.
+The \texttt{P2P} nearfield kernel accounts for the majority of the MLFMM execution time.
+The MPI communication is hidden behind this long-running kernel.
+The average GPU kernel speedup when moving from four XK GPUs to the four S822LC GPUs is $5.3\times$, with the \texttt{L2L} kernel showing the largest speedup, at $8\times$.
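+As an illustration of these techniques, the listing below is a minimal, generic CUDA sketch of a tiled dense matrix-matrix multiply, not the production MLFMM kernels: operand tiles are staged in shared memory, each thread accumulates its partial sums in a small register tile, and each thread is coarsened to produce several output elements.
+The kernel name, tile width, and coarsening factor are illustrative, and the matrix dimension is assumed to be a multiple of the tile footprint.
+\begin{verbatim}
+#define TILE 16     // shared-memory tile width
+#define COARSEN 4   // output elements computed per thread
+
+// Sketch only: C = A * B for square, row-major n x n matrices,
+// with n a multiple of TILE*COARSEN.
+// Launch: dim3 block(TILE, TILE);
+//         dim3 grid(n/(TILE*COARSEN), n/TILE);
+__global__ void gemm_tiled(const float *A, const float *B,
+                           float *C, int n) {
+  __shared__ float Asub[TILE][TILE];  // staged tile of A
+  __shared__ float Bsub[TILE][TILE];  // staged tile of B
+
+  int row  = blockIdx.y * TILE + threadIdx.y;
+  int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;
+
+  float acc[COARSEN] = {0.0f};        // register tile of C
+
+  for (int t = 0; t < n / TILE; ++t) {
+    // Each A tile is loaded once and reused for all COARSEN B tiles.
+    Asub[threadIdx.y][threadIdx.x] =
+        A[row * n + t * TILE + threadIdx.x];
+    for (int c = 0; c < COARSEN; ++c) {
+      Bsub[threadIdx.y][threadIdx.x] =
+          B[(t * TILE + threadIdx.y) * n + col0 + c * TILE];
+      __syncthreads();
+      for (int k = 0; k < TILE; ++k)
+        acc[c] += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
+      __syncthreads();
+    }
+  }
+  for (int c = 0; c < COARSEN; ++c)
+    C[row * n + col0 + c * TILE] = acc[c];
+}
+\end{verbatim}
+The per-block shared-memory footprint of such staged tiles is one of the factors that determines how many thread-blocks can be resident on a streaming multiprocessor at once.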
+On both XK and S822LC, the \texttt{L2L} kernel's performance is limited by the amount of CUDA shared memory it requires.
+On S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per streaming multiprocessor rather than the $48$~KB on XK, which allows more thread-blocks to run concurrently and produces the disproportionate speedup on that machine.
 %This document is a template for authors preparing papers for the
 %CEM'17 Computing and Electromagnetics Workshop in Barcelona, Spain.
@@ -241,12 +246,10 @@ The \texttt{L2L} kernels exhibit the
 \vfill
 \pagebreak
 \section{Conclusions}
-%This template uses IEEE style and provides necessary information
-%to prepare papers for CEM'17 Workshop. Thank you for your
-%contributions.
-Significant CPU speedup from OpenMP.
-On modern accelerations, speedup justifies CUDA investment.
-Parallelism responsible for making the problem solvable in useful timescales.
+This paper presents MLFMM performance results on three types of compute nodes: Blue Waters XE nodes, Blue Waters XK nodes, and an IBM S822LC.
+MLFMM is realized as dense matrix operations.
+Significant CPU speedup is achieved with OpenMP on both Blue Waters and S822LC, and is eclipsed in turn by CUDA implementations that take advantage of well-understood dense matrix optimization techniques, reaching a speedup of $969\times$ over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
+On modern GPUs, this speedup justifies the significant investment in CUDA development.
 \section*{Acknowledgment}