Updates from ShareLaTeX

Carl Pearson
2017-05-08 13:20:59 -07:00
parent 3100202962
commit e473030b8c


The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
All evaluations are done on a problem with these parameters. \todo{get from mert}
Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations.
``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
``160T'' is a $160$-thread OpenMP execution on S822LC.
``1 GPU'' is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
``4 GPU'' and ``16 GPU'' are GPU-accelerated executions with a corresponding number of MPI ranks.
On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
On S822LC, the $4$ MPI ranks run on a single machine to utilize the $4$ GPUs.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
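
One simple way to realize this rank-to-GPU mapping (a minimal sketch assuming binding by node-local rank; the actual MLFMM launcher may differ) is:
\begin{verbatim}
// Sketch: bind each MPI rank to one GPU by node-local rank
// (e.g., 4 ranks sharing the 4 GPUs of one S822LC, or 1 rank per XK node).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Rank index among the ranks running on the same node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);

  // One rank per GPU.
  int num_gpus;
  cudaGetDeviceCount(&num_gpus);
  cudaSetDevice(local_rank % num_gpus);

  /* ... GPU-accelerated MLFMM work for this rank ... */

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
\end{verbatim}
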
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=figures/mlfmm.pdf,width=8cm}}
\end{tabular}
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
Dark bars represent execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}
Both XE and S822LC achieve a CPU speedup larger than their number of floating-point units ($17\times$ at $32$ threads on $16$ units for XE, $26\times$ at $160$ threads on $20$ units for S822LC).
When more threads than units are created, each unit is more fully utilized than it would be with a one-to-one thread-to-unit mapping.
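Normalizing the measured speedups by the number of floating-point units makes this over-subscription benefit explicit:
\[
\frac{17}{16} \approx 1.06 \quad \mbox{(XE, speedup per unit)}, \qquad
\frac{26}{20} = 1.30 \quad \mbox{(S822LC, speedup per unit)},
\]
i.e., each unit delivers more throughput than the single-threaded baseline obtains from its one unit.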
In both systems, using a GPU for MLFMM provides substantial speedup (an additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
For current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling is achieved when using multiple GPUs, thanks to overlapping all required MPI communication with GPU computation, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes, and $969\times$ when using $4$ GPUs on S822LC.
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and from $28$ seconds to $29$ milliseconds on S822LC.
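The communication--computation overlap follows, in outline, the familiar idiom of non-blocking MPI combined with independent GPU work; the sketch below is illustrative only (placeholder buffers, a simple ring exchange, and dummy kernels rather than the actual MLFMM communication structure):
\begin{verbatim}
// Sketch of overlapping an MPI exchange with GPU computation.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void local_work(double *x, int n) {  // stands in for MLFMM kernels
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1 << 20;
  double *send = (double *)calloc(n, sizeof(double));
  double *recv = (double *)calloc(n, sizeof(double));
  double *d_buf;
  cudaMalloc(&d_buf, n * sizeof(double));

  // Post the non-blocking exchange (here a simple ring between ranks).
  int right = (rank + 1) % size, left = (rank - 1 + size) % size;
  MPI_Request reqs[2];
  MPI_Irecv(recv, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(send, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

  // While the messages are in flight, compute on data already on the GPU.
  local_work<<<(n + 255) / 256, 256>>>(d_buf, n);

  // Only now wait for communication, then use the received data.
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  cudaMemcpy(d_buf, recv, n * sizeof(double), cudaMemcpyHostToDevice);
  local_work<<<(n + 255) / 256, 256>>>(d_buf, n);
  cudaDeviceSynchronize();

  cudaFree(d_buf);
  free(send);
  free(recv);
  MPI_Finalize();
  return 0;
}
\end{verbatim}
The key point is that the wait on communication is not reached until after the independent GPU work has been launched, so message transfer proceeds concurrently with kernel execution.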
Despite the five-year gap between the deployments of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
The corresponding single-GPU speedup of S822LC over an XK node is $4.4\times$.
On a per-node basis (``1 GPU'' on XK, ``4 GPU'' on S822LC), the speedup is $17.9\times$.
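These per-GPU and per-node figures are consistent with the near-linear multi-GPU scaling noted above:
\[
4.4\times \;(\mbox{per GPU}) \;\times\; 4 \;(\mbox{GPUs per S822LC node}) \;\approx\; 17.6\times,
\]
which is close to the measured $17.9\times$ per-node speedup.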