diff --git a/main.tex b/main.tex
index 36f352d..cb0b293 100644
--- a/main.tex
+++ b/main.tex
@@ -39,6 +39,8 @@ We evaluate an efficient implementation of MLFMM for such two-dimensional volume
 
 \section{Introduction}
 \label{sec:introduction}
+
+
 In order to achieve an efficient implementation on graphics processing units (GPUs), the MLFMM operations are formulated as matrix-matrix multiplications.
 To avoid host-device data transfer, common operators are pre-computed, moved to the GPU, and reused as needed.
 Large matrices are partitioned among message passing interface (MPI) processes and each process employs a single GPU for performing partial multiplications.
@@ -149,6 +151,12 @@ This reflects the current slow pace of single-threaded CPU performance improveme
 The corresponding single-GPU speedup in S822LC over XK is $4.4\times$.
 On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
 
+\subsection{MPI Communication Overlap}
+
+\tikzstyle{int}=[draw, fill=blue!20, minimum size=2em]
+\tikzstyle{init} = [pin edge={to-,thin,black}]
+
+
 \subsection{Computation Kernel Breakdown}
 
 Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spent in computational kernels.
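
Note on the Introduction hunk above: the design it describes (MLFMM operations cast as matrix-matrix multiplications, a common operator uploaded to the GPU once and reused, one GPU per MPI process computing a partial product) can be illustrated with a minimal sketch. This is not the paper's code; the matrix sizes, the column-block partitioning, and every identifier below are illustrative assumptions.

// Sketch: each MPI rank drives one GPU and repeatedly applies a
// pre-computed operator to its partition of the source data via cuBLAS.
// All sizes and names are hypothetical.
#include <mpi.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);                 // one GPU per MPI process

  // Hypothetical sizes; assume nprocs divides the total column count.
  const int M = 512, K = 512, N = 512 / nprocs;
  std::vector<float> hOp(M * K, 1.0f), hSrc(K * N, 1.0f), hDst(M * N);

  float *dOp, *dSrc, *dDst;
  cudaMalloc(&dOp, sizeof(float) * M * K);
  cudaMalloc(&dSrc, sizeof(float) * K * N);
  cudaMalloc(&dDst, sizeof(float) * M * N);

  // "Pre-computed, moved to the GPU, and reused": upload the common
  // operator once, outside the iteration loop.
  cudaMemcpy(dOp, hOp.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;

  for (int iter = 0; iter < 10; ++iter) {     // e.g., solver iterations
    cudaMemcpy(dSrc, hSrc.data(), sizeof(float) * K * N,
               cudaMemcpyHostToDevice);
    // Partial multiplication on this rank's column block: dDst = dOp * dSrc
    // (column-major, as cuBLAS expects).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dOp, M, dSrc, K, &beta, dDst, M);
    cudaMemcpy(hDst.data(), dDst, sizeof(float) * M * N,
               cudaMemcpyDeviceToHost);
  }

  cublasDestroy(handle);
  cudaFree(dOp); cudaFree(dSrc); cudaFree(dDst);
  MPI_Finalize();
  return 0;
}

Keeping the operator resident on the device across iterations is what makes the per-iteration cost a single GEMM plus the (much smaller) source/result transfers; the sketch would be compiled with nvcc against MPI and cuBLAS (e.g., with mpicxx as the host compiler).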
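Note on the new "MPI Communication Overlap" subsection stub: the heading (and the TikZ styles, presumably for a timeline figure) point at the standard pattern of overlapping nonblocking MPI exchange with GPU computation. The sketch below illustrates that generic pattern only; it is not the authors' implementation, and the ring exchange, block size, and all names are assumptions.

// Sketch: post nonblocking sends/receives for remote partition data,
// run the local partial multiplication on the GPU in the meantime,
// then wait on both before consuming the results.
#include <mpi.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int n = 256;                          // illustrative block size
  std::vector<float> hA(n * n, 1.0f), sendBuf(n * n, 1.0f), recvBuf(n * n);
  int right = (rank + 1) % nprocs;
  int left  = (rank + nprocs - 1) % nprocs;

  // 1. Start a nonblocking ring exchange of partition data.
  MPI_Request reqs[2];
  MPI_Irecv(recvBuf.data(), n * n, MPI_FLOAT, left, 0,
            MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(sendBuf.data(), n * n, MPI_FLOAT, right, 0,
            MPI_COMM_WORLD, &reqs[1]);

  // 2. Meanwhile, run the local partial multiplication on the GPU;
  //    the GEMM launch is asynchronous with respect to the host.
  float *dA, *dB, *dC;
  cudaMalloc(&dA, sizeof(float) * n * n);
  cudaMalloc(&dB, sizeof(float) * n * n);
  cudaMalloc(&dC, sizeof(float) * n * n);
  cudaMemcpy(dA, hA.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, sendBuf.data(), sizeof(float) * n * n,
             cudaMemcpyHostToDevice);
  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);

  // 3. Complete both the exchange and the GPU work before using either.
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  cudaDeviceSynchronize();

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  MPI_Finalize();
  return 0;
}

The point of the pattern is that the communication latency in step 1 is hidden behind the GEMM in step 2, so the measurable cost is roughly max(communication, computation) rather than their sum.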