% IEEE Paper Template for A4 Page Size (V1)
% Sample Conference Paper using IEEE LaTeX style file for A4 pagesize.
% Copyright (C) 2006 Causal Productions Pty Ltd.
% Permission is granted to distribute and revise this file provided that
% this header remains intact.
%
% This style file is produced for CEM'17 Computational Electromagnetics Workshop
% Modified from a file indicated above.
\documentclass[10pt,conference,a4paper]{IEEEtran}
\usepackage{times,amsmath,epsfig}
\usepackage{makecell}
\usepackage{todonotes}
\usepackage{verbatim}

%\title{Solving Problems Involving Inhomogeneous Media with MLFMM on GPU Clusters}
\title{Evaluating MLFMM for Large Scattering Problems on Multiple GPUs}

\author{
{Carl Pearson{\small $^{1}$}, Mert Hidayetoglu{\small $^{1}$}, Wei Ren{\small $^{1}$}, and Wen-Mei Hwu{\small $^{1}$} }
\vspace{1.6mm}\\
\fontsize{10}{10}\selectfont\itshape
$~^{1}$Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, 61801, USA\\
\fontsize{9}{9}\upshape
\texttt{\{pearson, hidayet2, weiren2, w-hwu\}}@illinois.edu}

\begin{document}
\maketitle

\begin{abstract}
The multilevel fast multipole method (MLFMM) is a key tool for efficiently solving large scattering problems governed by the Helmholtz equation.
To support highly inhomogeneous media, the problems are formulated with volume integral equations rather than being converted into equivalent surface-scattering problems through the equivalence principle.
The MLFMM implementation for two-dimensional volumetric scattering problems is realized through matrix operations optimized with shared-memory tiling, register tiling, and thread coarsening.
MPI communications are overlapped with GPU kernels to achieve high multi-node parallel efficiency.
The MLFMM is evaluated on current- and next-generation GPU-accelerated supercomputing nodes, where a speedup of up to $969\times$ over single-threaded CPU execution is achieved using four NVIDIA P100 graphics processing units.
\end{abstract}

\section{Introduction}
\label{sec:introduction}

MLFMM computes pairwise interactions between pixels in the scattering problem by hierarchically clustering the pixels into a spatial quad-tree.
In a ``nearfield'' phase, nearby pixel interactions are computed directly at the lowest level of the tree.
``Aggregation'' and ``disaggregation'' phases propagate interactions up and down the tree, and a ``translation'' phase propagates long-range interactions within each level.
In this way, the $N^2$ interactions among $N$ pixels are computed with $\mathcal{O}(N)$ work~\cite{rokhlin93}.
Even with this algorithmic speedup, a high-performance parallel MLFMM is needed to take full advantage of high-performance computing resources.
This work presents how a GPU-accelerated MLFMM scales effectively from current- to next-generation computers.

To achieve an efficient implementation on graphics processing units (GPUs), the four MLFMM phases are formulated as matrix multiplications.
Common operators are pre-computed, moved to the GPU, and reused as needed to avoid host--device data transfer.
Large matrices are partitioned among message passing interface (MPI) processes, and each process employs a single GPU to perform its partial multiplications.
During the MLFMM multiplications, data is transferred between GPUs through their owning MPI processes: from the GPUs to the central processing units (CPUs), between CPUs through MPI, and then from the CPUs back to the GPUs.
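As an illustration, the following is a minimal sketch of this staged exchange for a single peer rank; the function, buffer, and parameter names are assumptions for exposition and not the implementation's actual interface.

{\footnotesize
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Hypothetical sketch: move n complex values from this
// GPU to a peer GPU through the owning MPI processes.
void exchange_with_peer(cuFloatComplex *d_send,
                        cuFloatComplex *d_recv,
                        cuFloatComplex *h_send,
                        cuFloatComplex *h_recv,
                        size_t n, int peer, MPI_Comm comm) {
  size_t bytes = n * sizeof(cuFloatComplex);
  // GPU -> CPU
  cudaMemcpy(h_send, d_send, bytes,
             cudaMemcpyDeviceToHost);
  // CPU -> CPU through MPI (2 floats per complex value)
  MPI_Sendrecv(h_send, (int)(2 * n), MPI_FLOAT, peer, 0,
               h_recv, (int)(2 * n), MPI_FLOAT, peer, 0,
               comm, MPI_STATUS_IGNORE);
  // CPU -> GPU
  cudaMemcpy(d_recv, h_recv, bytes,
             cudaMemcpyHostToDevice);
}
\end{verbatim}
}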
To hide the cost of this staged communication, the MPI transfers are overlapped with GPU kernels.
This strategy completely hides the communication cost and provides $96$\% MPI parallel efficiency on up to 16 GPUs.

\section{MLFMM Performance Results}
\label{sec:results}

This section presents an analysis of the performance of the MLFMM algorithm on different computing systems.

\subsection{Evaluation Environments}

%\begin{table}
%\centering
%\caption{Evaluation Systems}
%\label{tab:systems}
%\begin{tabular}{|c|c|c|c|}
%\hline
% & \textbf{XK Node} & \textbf{XE Node} & \textbf{S822LC} \\
%\hline
%\hline
%\textbf{CPU 1} & AMD Opteron 6276 & AMD Opteron 6276 & IBM Power8 \\
%\hline
%\textbf{CPU 2} & -- & AMD Opteron 6276 & IBM Power8 \\
%\hline
%\hline
%\textbf{GPU 1} & \makecell{K20X \\ (6 GB RAM)} & -- & P100 (16 GB RAM) \\
%\hline
%\textbf{GPU 2} & -- & -- & P100 (16 GB RAM) \\
%\hline
%\textbf{GPU 3} & -- & -- & P100 (16 GB RAM) \\
%\hline
%\textbf{GPU 4} & -- & -- & P100 (16 GB RAM) \\
%\hline
%\textbf{RAM} & 32 GB & 64 GB & 512 GB \\
%\hline
%\makecell{\textbf{CPU--GPU} \\ \textbf{Bus}} & PCIe & -- & NVLink \\
%\hline
%\end{tabular}
%\end{table}

The performance of MLFMM is evaluated on three systems: XE and XK nodes from the Blue Waters supercomputer~\cite{ncsa}, and an IBM S822LC.
Each Blue Waters node is a two-socket system: the XE node has two AMD Opteron 6276 CPUs, each with eight floating-point units, hardware support for 16 executing threads, and $32$~GB of RAM.
The XK node replaces one of these CPUs with an NVIDIA K20X GPU, which has $6$~GB of RAM.
% The K20X is connected to the Opteron 6276 with PCIe.
These XE and XK nodes are representative of the compute capabilities of current-generation clusters and supercomputers.
The IBM S822LC represents a next-generation accelerator-heavy supercomputing node.
It has two IBM Power8 CPUs, each with ten floating-point units, support for 80 executing threads, and $256$~GB of RAM.
It also has four NVIDIA P100 GPUs with $16$~GB of RAM each.
% The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.

\subsection{MLFMM Contribution to Application Time}

\begin{figure}[t]
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=figures/cpu_matvec.pdf,width=8cm}}
\end{tabular}
\end{center}
\caption{
Proportion of application time spent in MLFMM for a 32-thread CPU run on an XE node (left) and a 160-thread run on S822LC (right).
MLFMM is the dominant application component even with CPU parallelization.
As object reconstructions grow larger or more challenging, MLFMM time further increases as a proportion of application time.
}
\label{fig:app_breakdown}
\end{figure}

As shown in Fig.~\ref{fig:app_breakdown}, MLFMM forms the core computational kernel of the application, and its performance dominates that of the full inverse solver in CPU-parallelized execution on XE and S822LC ($72$\% and $83$\%, respectively).
This proportion approaches $100$\% as the scattering problems become larger or more challenging, justifying further targeted acceleration of MLFMM.

\subsection{MPI Communication Overlap}

Fig.~\ref{fig:mpi_overlap} shows a timeline of an MLFMM execution with 16 GPUs on 16 XK nodes.
This configuration is the worst case for overlapping computation and communication: the MLFMM tree structure is divided among many GPUs, creating the shortest kernel times and the most required communication.
Even in this scenario, the required communication time is substantially smaller than that of the long-running ``P2P'' nearfield kernel with which it is overlapped.
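The structure of this overlap can be sketched as follows; the kernel, buffer, and parameter names are hypothetical stand-ins for the implementation's actual ones. The nearfield kernel is launched asynchronously on a CUDA stream, the host performs a non-blocking MPI exchange while the kernel runs, and both must complete before the translation phase proceeds.

{\footnotesize
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder nearfield kernel (the real P2P kernel applies
// precomputed nearfield operators resident on the GPU).
__global__ void p2p_nearfield(const float *op, const float *x,
                              float *y, int n) {
  // ... nearfield partial products computed here ...
}

// Overlap sketch: the kernel runs on a CUDA stream while the
// host exchanges aggregation results with a peer rank.
void nearfield_with_overlap(const float *d_op,
                            const float *d_x, float *d_y,
                            int n, float *h_send,
                            float *h_recv, int count,
                            int peer, MPI_Comm comm,
                            cudaStream_t stream) {
  // Kernel launch is asynchronous; control returns at once.
  p2p_nearfield<<<(n + 255) / 256, 256, 0, stream>>>(
      d_op, d_x, d_y, n);

  // Host-side MPI exchange proceeds while the kernel runs.
  MPI_Request reqs[2];
  MPI_Irecv(h_recv, count, MPI_FLOAT, peer, 0, comm,
            &reqs[0]);
  MPI_Isend(h_send, count, MPI_FLOAT, peer, 0, comm,
            &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  // Both must finish before translation is launched.
  cudaStreamSynchronize(stream);
}
\end{verbatim}
}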
Because of this margin, the overlap will continue to hide the full communication cost even with somewhat faster GPUs or slower inter-process communication.

\begin{figure}[htbp]
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=figures/mpi_comm.png,width=8cm}}
\end{tabular}
\end{center}
\caption{
Representation of MPI communication overlap during GPU-accelerated MLFMM on 16 XK nodes.
The communication during the nearfield computations delivers the aggregation results from other GPUs to the translation phase on each GPU.
}
\label{fig:mpi_overlap}
\end{figure}

\subsection{MLFMM Performance}

All evaluations are performed on a problem with these parameters. \todo{get from mert}
Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations.
On XK nodes (Fig.~\ref{fig:mlfmm_performance}(a)), each node runs a single MPI rank.
On S822LC, four MPI ranks run on a single machine to utilize its four GPUs.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.

\begin{figure}[htbp]
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=figures/mlfmm.pdf,width=8cm}}
\end{tabular}
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
1T and 32T are single-threaded and $32$-thread OpenMP executions on a single XE node.
160T is a $160$-thread OpenMP execution on S822LC.
1 GPU is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
4 GPU and 16 GPU are GPU-accelerated executions with a corresponding number of MPI ranks.
Light bars represent execution time (left axis).
Dark bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}

Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ with $32$ threads on $16$ units for XE, and $26\times$ with $160$ threads on $20$ units for S822LC): oversubscribing the floating-point units keeps them more fully utilized.
The CUDA implementations leverage well-understood techniques for optimizing matrix operations, including hybrid shared-memory and register tiling, and thread coarsening; a generic sketch of this kernel pattern is given at the end of this section.
On both systems, using a GPU for MLFMM provides substantial speedup over fully utilizing the CPUs (an additional $3.1\times$ on XE/XK and $9.2\times$ on S822LC), justifying the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling across multiple GPUs is achieved thanks to overlapping all required MPI communication with GPU computation.
Overall, the execution time is reduced from approximately $33$ seconds to $40$ milliseconds on XK nodes, and from $28$ seconds to $29$ milliseconds on S822LC.

Despite the five-year gap between the deployments of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node, reflecting the slow pace of single-threaded CPU performance improvement.
On the other hand, the P100 GPU in S822LC provides a $4.4\times$ speedup over the K20X in XK, and on a per-node basis the four GPUs in S822LC provide a $17.9\times$ speedup over the single GPU in XK.
The \texttt{P2P} nearfield kernel accounts for the majority of the MLFMM execution time.
The average kernel-execution speedup moving from K20X to P100 is $5.3\times$, and the \texttt{L2L} disaggregation kernel speedup is the largest, at $8\times$.
On both the K20X and P100, this kernel's performance is limited by the amount of CUDA shared memory it requires.
The newer Pascal GPU architecture in S822LC provides $64$~KB of shared memory per streaming multiprocessor rather than the $48$~KB available on the XK node's K20X, which allows more thread blocks to run concurrently and produces the disproportionately large speedup on that machine.
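As a generic illustration of the optimization pattern named above, the following sketch shows a real-valued matrix-multiplication kernel with shared-memory tiling, register tiling, and thread coarsening; the tile sizes and the real-valued data type are illustrative assumptions and not those of the implementation's actual complex-valued MLFMM kernels.

{\footnotesize
\begin{verbatim}
#define TILE 16     // shared-memory tile width
#define COARSEN 4   // outputs per thread (register tiling)

// Launch with dim3 block(TILE, TILE) and
// dim3 grid((N + TILE*COARSEN - 1) / (TILE*COARSEN),
//           (M + TILE - 1) / TILE).
__global__ void gemm_tiled(const float *A, const float *B,
                           float *C, int M, int N, int K) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE * COARSEN];

  int row  = blockIdx.y * TILE + threadIdx.y;
  int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;
  float acc[COARSEN] = {0.0f};  // register tile

  for (int t = 0; t < K; t += TILE) {
    // Cooperatively stage one A tile and COARSEN B tiles.
    As[threadIdx.y][threadIdx.x] =
        (row < M && t + threadIdx.x < K)
            ? A[row * K + t + threadIdx.x] : 0.0f;
    for (int c = 0; c < COARSEN; ++c) {
      int col = col0 + c * TILE;
      Bs[threadIdx.y][threadIdx.x + c * TILE] =
          (t + threadIdx.y < K && col < N)
              ? B[(t + threadIdx.y) * N + col] : 0.0f;
    }
    __syncthreads();
    // Each thread accumulates COARSEN outputs in registers,
    // reusing the shared-memory tiles across those outputs.
    for (int k = 0; k < TILE; ++k)
      for (int c = 0; c < COARSEN; ++c)
        acc[c] += As[threadIdx.y][k]
                * Bs[k][threadIdx.x + c * TILE];
    __syncthreads();
  }

  for (int c = 0; c < COARSEN; ++c) {
    int col = col0 + c * TILE;
    if (row < M && col < N) C[row * N + col] = acc[c];
  }
}
\end{verbatim}
}

Each thread block stages operand tiles in shared memory, and each thread keeps \texttt{COARSEN} partial results in registers, so every staged tile is reused across several output elements.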
\section{Conclusions}

This paper presents MLFMM performance results on three types of computer systems: Blue Waters XE and XK nodes, and an IBM S822LC.
MLFMM is realized as matrix operations.
Significant CPU speedup is achieved on both systems with OpenMP, and it is further eclipsed by CUDA implementations that take advantage of well-understood matrix optimization techniques, reaching a speedup of up to $969\times$ over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
On modern GPUs, this speedup justifies the significant investment in a CUDA implementation.

\section*{Acknowledgment}

This work was funded by grants blah blah blah / blah blah and grant 123465778909 from blah blah blah.

\bibliographystyle{IEEEtran}
\begin{thebibliography}{99}

\bibitem{rokhlin93} V. Rokhlin, ``Diagonal forms of translation operators for the Helmholtz equation in three dimensions,'' \textit{Applied and Computational Harmonic Analysis}, vol. 1, no. 1, pp. 82--93, 1993.

\bibitem{ncsa} National Center for Supercomputing Applications, ``System Summary,'' [Online]. Available: https://bluewaters.ncsa.illinois.edu/hardware-summary. [Accessed: 8-May-2017].

% the following vfill coarsely balances the columns on the last page
\vfill
\pagebreak

%\subsection{Computation Kernel Breakdown}
%Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spent in computational kernels.
%\texttt{P2P} is the ``particle-to-particle'' or nearfield exchanges.
%\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
%\texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
%\texttt{M2L} is the translations.
%\begin{figure}[htbp]
%\begin{center}
%\begin{tabular}{c}
%\mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
%\end{tabular}
%\end{center}
% \caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
% \label{fig:kernel_breakdown}
%\end{figure}

\end{thebibliography}
\end{document}