Mert's fixes
main.tex
@@ -42,13 +42,13 @@ The MLFMM is evaluated on current- and next-generation GPU-accelerated supercomp
MLFMM computes pairwise interactions between pixels in the scattering problem by hierarchically clustering pixels into a spatial quad-tree. In the nearfield phase, nearby pixel interactions are computed within the lowest level of the MLFMM tree. The aggregation and disaggregation phases propagate interactions up and down the tree, and the translation phase propagates long-range interactions within a level. In this way, $\mathcal{O}(N)$ work for $N^2$ interactions is achieved for $N$ pixels~\cite{rokhlin93, chew}.
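The phase ordering of a single MLFMM matrix-vector product can be summarized with the host-side sketch below; the kernel names, launch parameters, and data layout are illustrative placeholders and not the identifiers used in our implementation.
\begin{verbatim}
#include <cuda_runtime.h>

// Placeholder kernels; real kernels operate on per-level cluster
// expansions resident in GPU memory (bodies elided).
__global__ void nearfield(float2* y, const float2* x, int n) {}
__global__ void aggregate(float2* expn, int level) {}
__global__ void translate(float2* expn, int level) {}
__global__ void disaggregate(float2* y, const float2* expn, int level) {}

// One MLFMM matrix-vector product over an L-level quad-tree:
// nearfield at the leaves, aggregation up the tree, translation
// within each level, disaggregation back down to the pixels.
void mlfmm_matvec(float2* d_y, const float2* d_x, float2* d_exp,
                  int nLeafPixels, int nLevels) {
  const int block = 256;
  const int grid  = (nLeafPixels + block - 1) / block;
  nearfield<<<grid, block>>>(d_y, d_x, nLeafPixels);
  for (int l = nLevels - 1; l >= 2; --l)      // leaves toward root
    aggregate<<<grid, block>>>(d_exp, l);
  for (int l = 2; l < nLevels; ++l)           // long-range, per level
    translate<<<grid, block>>>(d_exp, l);
  for (int l = 2; l < nLevels; ++l)           // root toward leaves
    disaggregate<<<grid, block>>>(d_y, d_exp, l);
  cudaDeviceSynchronize();
}
\end{verbatim}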
Even with this algorithmic speedup, a high-performance parallel MLFMM implementation is needed to take full advantage of high-performance computing resources.
This work shows how a GPU-accelerated MLFMM scales effectively from current- to next-generation computers.
In order to achieve an efficient implementation on graphics processing units (GPUs), these four MLFMM phases are formulated as matrix multiplications.
Common operators are pre-computed, moved to the GPU, and reused as needed to avoid host-device data transfer.
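The reuse pattern is illustrated by the sketch below, in which a single precomputed operator (with an assumed size and interface) stands in for the per-level operator sets: it is copied to the GPU once during setup and then applied in every MLFMM iteration as a cuBLAS complex matrix multiplication, with no further host-device traffic for the operator.
\begin{verbatim}
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Setup: precompute one (m x m) complex operator on the host and
// move it to the GPU once.
void setup_operator(cuComplex** d_T, int m) {
  std::vector<cuComplex> h_T(size_t(m) * m, make_cuComplex(0.f, 0.f));
  // ... fill h_T with the precomputed operator entries ...
  cudaMalloc((void**)d_T, sizeof(cuComplex) * m * m);
  cudaMemcpy(*d_T, h_T.data(), sizeof(cuComplex) * m * m,
             cudaMemcpyHostToDevice);
}

// Every iteration: apply the GPU-resident operator to a batch of
// cluster expansions already on the device, as a single
// (m x m) by (m x nClusters) complex matrix multiplication.
void apply_operator(cublasHandle_t h, const cuComplex* d_T,
                    const cuComplex* d_in, cuComplex* d_out,
                    int m, int nClusters) {
  const cuComplex one  = make_cuComplex(1.f, 0.f);
  const cuComplex zero = make_cuComplex(0.f, 0.f);
  cublasCgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, nClusters, m,
              &one, d_T, m, d_in, m, &zero, d_out, m);
}
\end{verbatim}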
The MLFMM tree structure is partitioned among message passing interface (MPI) processes, and each process employs a single GPU to perform its partial multiplications.
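This one-GPU-per-process binding can be expressed as in the sketch below; the modulo device selection is an illustrative convention, and the partitioning of the tree itself is omitted.
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Bind each MPI process to one GPU. On an XK node there is a single
  // GPU per rank; on S822LC, four ranks map onto the four P100s.
  int nDevices = 1;
  cudaGetDeviceCount(&nDevices);
  cudaSetDevice(rank % nDevices);

  // ... each rank builds and multiplies its partition of the MLFMM tree ...

  MPI_Finalize();
  return 0;
}
\end{verbatim}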
During the MLFMM multiplications, data is transferred between GPUs through their owning MPI processes by moving the data from GPUs to central processing units (CPUs), CPUs to CPUs through MPI, and then from CPUs to GPUs.
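The staged path is sketched below for a single pairwise exchange; the peer rank, buffer names, and real-valued payload are placeholders for the application's actual exchange pattern.
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

// Staged exchange between two GPUs owned by different MPI processes:
// GPU -> CPU copy, CPU -> CPU transfer through MPI, CPU -> GPU copy.
void staged_exchange(const float* d_send, float* d_recv,
                     float* h_send, float* h_recv,
                     int count, int peer) {
  cudaMemcpy(h_send, d_send, count * sizeof(float),
             cudaMemcpyDeviceToHost);
  MPI_Sendrecv(h_send, count, MPI_FLOAT, peer, 0,
               h_recv, count, MPI_FLOAT, peer, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  cudaMemcpy(d_recv, h_recv, count * sizeof(float),
             cudaMemcpyHostToDevice);
}
\end{verbatim}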
To hide this communication cost, MPI communication is overlapped with GPU kernels.
This strategy completely hides the communication cost and provides $96$\% MPI parallelization efficiency on up to 16 GPUs.
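A minimal sketch of this overlap is shown below, assuming two CUDA streams, pinned host staging buffers, and an illustrative kernel that needs no remote data; the actual implementation overlaps the MLFMM kernels with the exchanges they do not depend on.
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void local_work(float* d_local, int n) {}  // placeholder

// While independent kernels run on the compute stream, boundary data is
// drained to the host on a copy stream and exchanged with nonblocking MPI.
void overlapped_step(float* d_local, int nLocal,
                     float* d_remote, float* h_send, float* h_recv,
                     int nRemote, int peer,
                     cudaStream_t compute, cudaStream_t copy) {
  local_work<<<(nLocal + 255) / 256, 256, 0, compute>>>(d_local, nLocal);

  cudaMemcpyAsync(h_send, d_remote, nRemote * sizeof(float),
                  cudaMemcpyDeviceToHost, copy);
  cudaStreamSynchronize(copy);            // staging buffer is now valid
  MPI_Request reqs[2];
  MPI_Irecv(h_recv, nRemote, MPI_FLOAT, peer, 1, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(h_send, nRemote, MPI_FLOAT, peer, 1, MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  cudaMemcpyAsync(d_remote, h_recv, nRemote * sizeof(float),
                  cudaMemcpyHostToDevice, copy);

  // Kernels that consume the received data wait on both streams.
  cudaStreamSynchronize(copy);
  cudaStreamSynchronize(compute);
}
\end{verbatim}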
@@ -80,13 +80,13 @@ This section presents an analysis of the performance of the MLFMM algorithm on d
%\end{table}
The performance of MLFMM is evaluated on three systems: XE and XK nodes from the Blue Waters supercomputer~\cite{ncsa}, and an IBM S822LC.
Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and $32$~GB of RAM.
The XK node replaces one of these CPUs with an NVIDIA K20X GPU and $6$~GB of RAM.
% The K20x is connected to the Opteron 6276 with PCIe.
These XE and XK nodes are representative of the compute capabilities of current-generation clusters and supercomputers.
The IBM S822LC represents a next-generation accelerator-heavy supercomputing node.
It has two IBM Power8 CPUs, each with ten floating-point units, support for 80 executing threads, and $256$~GB of RAM.
It also has four NVIDIA P100 GPUs with $16$~GB of RAM each.
% The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
\subsection{MLFMM Contribution to Application Time}
@@ -100,13 +100,12 @@ It also has four NVIDIA P100 GPUs with $16$~GB of RAM each.
\caption{
Amount of application time spent in MLFMM for a 32-thread CPU run on an XE node (left) and a 160-thread run on S822LC (right).
MLFMM is the dominant application component even with CPU parallelization.
As the number of pixels grows, MLFMM time further increases as a proportion of application time.
}
\label{fig:app_breakdown}
\end{figure}
As shown in Fig.~\ref{fig:app_breakdown}, MLFMM forms the core computational kernel of the application, and its performance dominates that of the full numerical solver in CPU-parallelized execution on XE and S822LC ($72$\% and $83$\%, respectively), justifying further targeted acceleration of MLFMM.
@@ -134,12 +133,10 @@ This method will hide the full communication cost even for faster GPUs or slower
\subsection{MLFMM Performance}
All evaluations are done on a problem with 16 million pixels.
Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations. On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
On S822LC, the 4 MPI ranks run on a single machine to utilize the $4$ GPUs.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
\begin{figure}[htbp]
\begin{center}
@@ -148,57 +145,61 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
\end{tabular}
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and IBM S822LC (b).
1T and 32T are single-threaded and $32$-thread OpenMP executions on a single XE node.
160T is a $160$-thread OpenMP execution on S822LC.
1 GPU is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
4 GPU and 16 GPU are GPU-accelerated executions with a corresponding number of MPI ranks.
Light bars represent execution time (left axis).
Dark bars show speedup normalized to the single-threaded execution on the respective system (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}
Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ with $32$ threads on $16$ units for XE, $26\times$ with $160$ threads on $20$ units for S822LC).
When the floating-point units are oversubscribed with more threads than units, stalls in one thread can be hidden by work from another, so the units are kept more fully utilized.
The CUDA implementations leverage well-understood techniques for optimizing matrix operations, including hybrid shared-memory and register tiling, and thread coarsening~\cite{hwu}.
In both systems, using a GPU for MLFMM provides substantial speedup (an additional $3.1\times$ on XE/XK and $9.2\times$ on S822LC) over fully utilizing the CPUs.
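As a generic illustration of these techniques (not one of our MLFMM kernels), the kernel below stages input tiles through shared memory, accumulates partial sums in registers, and coarsens each thread to produce several output columns so that every value loaded into shared memory is reused multiple times.
\begin{verbatim}
#define TILE 16
#define COARSEN 4

// C = A * B with A (M x K), B (K x N), row-major single precision.
// Launch with dim3 block(TILE, TILE) and
// dim3 grid((N + TILE*COARSEN - 1)/(TILE*COARSEN), (M + TILE - 1)/TILE).
__global__ void gemm_tiled_coarsened(const float* A, const float* B,
                                     float* C, int M, int N, int K) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE * COARSEN];

  const int row  = blockIdx.y * TILE + threadIdx.y;
  const int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;
  float acc[COARSEN] = {0.f};              // register tile of outputs

  for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
    // Cooperatively stage one tile of A and COARSEN tiles of B.
    const int aCol = t * TILE + threadIdx.x;
    As[threadIdx.y][threadIdx.x] =
        (row < M && aCol < K) ? A[row * K + aCol] : 0.f;
    for (int c = 0; c < COARSEN; ++c) {
      const int bRow = t * TILE + threadIdx.y;
      const int bCol = col0 + c * TILE;
      Bs[threadIdx.y][threadIdx.x + c * TILE] =
          (bRow < K && bCol < N) ? B[bRow * N + bCol] : 0.f;
    }
    __syncthreads();

    // Each shared-memory value of As is reused for COARSEN outputs.
    for (int k = 0; k < TILE; ++k)
      for (int c = 0; c < COARSEN; ++c)
        acc[c] += As[threadIdx.y][k] * Bs[k][threadIdx.x + c * TILE];
    __syncthreads();
  }

  for (int c = 0; c < COARSEN; ++c) {
    const int col = col0 + c * TILE;
    if (row < M && col < N) C[row * N + col] = acc[c];
  }
}
\end{verbatim}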
This speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling across multiple GPUs is achieved by overlapping all required MPI communication with GPU computation.
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
Despite the 5-year gap between deployment of the Blue Waters and IBM S822LC systems, the baseline single-threaded execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the slow pace of single-threaded CPU performance improvement.
On the other hand, the P100 GPU in S822LC provides a $4.4\times$ speedup over the K20x in XK.
On a per-node basis, the four GPUs in S822LC provide a $17.9\times$ speedup over the single GPU in an XK node.
The nearfield kernel accounts for the majority of the MLFMM execution time.
The average kernel-execution speedup moving from K20x to P100 is $5.3\times$, and the disaggregation kernel speedup is the largest, at $8\times$.
On both K20x and P100, this kernel's performance is limited by the amount of CUDA shared memory it requires.
In S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per streaming multiprocessor rather than the $48$~KB available on the K20x, which allows more thread-blocks to run concurrently and provides the disproportionate speedup on that machine.
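The effect of per-block shared-memory demand on residency can be checked with the CUDA occupancy API, as sketched below; the block size and the 20~KB-per-block figure are illustrative rather than measurements of our disaggregation kernel.
\begin{verbatim}
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smem_heavy_kernel(float* out) {
  extern __shared__ float smem[];          // dynamic shared memory
  smem[threadIdx.x] = threadIdx.x;
  __syncthreads();
  if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}

int main() {
  const int    blockSize    = 256;
  const size_t smemPerBlock = 20 * 1024;   // illustrative: 20 KB per block

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocksPerSM, smem_heavy_kernel, blockSize, smemPerBlock);

  // Shared memory alone caps residency at floor(capacity / 20 KB):
  // 2 blocks per SM with 48 KB (K20x-class) vs. 3 with 64 KB (P100-class);
  // the runtime result also reflects register and thread limits.
  printf("%zu B shared per SM -> %d resident blocks per SM\n",
         (size_t)prop.sharedMemPerMultiprocessor, blocksPerSM);
  return 0;
}
\end{verbatim}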
\section{Conclusions}
This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
MLFMM is realized as matrix operations for high performance.
Significant CPU speedup on both systems is achieved with OpenMP, and is further eclipsed by CUDA implementations that take advantage of well-understood matrix optimization techniques, reaching a speedup of up to $969\times$ over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
On modern GPUs, this speedup justifies the significant time invested in the CUDA implementation.
\section*{Acknowledgment}
Here are some acknowledgements.
Here are some more acknowledgements.
This work was funded by grants blah blah blah / blah blah and grant 123465778909 from blah blah blah.
\bibliographystyle{IEEEtran}
\begin{thebibliography}{99}
\bibitem{rokhlin93}
V. Rokhlin, ``Diagonal forms of translation operators for the Helmholtz equation in three dimensions,'' \textit{Applied and Computational Harmonic Analysis}, vol. 1, no. 1, pp. 82--93, 1993.

\bibitem{chew}
W. C. Chew, J.-M. Jin, E. Michielssen, and J. Song, \textit{Fast and Efficient Algorithms in Computational Electromagnetics}. Artech House, 2001.

\bibitem{hwu}
W.-m. W. Hwu, Ed., \textit{GPU Computing Gems}. Morgan Kaufmann, 2011.