Merge branch 'master' of github.com:cwpearson/cem17

This commit is contained in:
Carl Pearson
2017-05-08 14:06:12 -07:00


@@ -104,55 +104,40 @@ The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
All evaluations are done on a problem with these parameters. \todo{get from mert}
Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations.
``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
``160T'' is a $160$-thread OpenMP execution on S822LC.
``1 GPU'' is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
``4 GPU'' and ``16 GPU'' are GPU-accelerated executions with a corresponding number of MPI ranks.
On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
On S822LC, the 4 MPI ranks run on a single machine to utilize the $4$ GPUs.
A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
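As an illustration of this rank-to-GPU mapping (a minimal sketch using standard MPI-3 and CUDA runtime calls, not necessarily the code used in the implementation; error handling is elided), each MPI rank can discover its node-local rank via \texttt{MPI\_Comm\_split\_type} and bind to the matching device:
\begin{verbatim}
// Sketch: bind each MPI rank to the GPU matching its
// node-local rank (standard MPI-3 + CUDA runtime calls).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Ranks that share a node receive a contiguous local numbering.
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                      0, MPI_INFO_NULL, &local);
  int localRank, numGpus;
  MPI_Comm_rank(local, &localRank);
  cudaGetDeviceCount(&numGpus);

  // S822LC: 4 ranks on one node map to its 4 GPUs.
  // XK: one rank per node, and numGpus == 1.
  cudaSetDevice(localRank % numGpus);

  // ... MLFMM work for this rank's GPU ...

  MPI_Comm_free(&local);
  MPI_Finalize();
  return 0;
}
\end{verbatim}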
\begin{figure}[htbp]
\begin{center}
\begin{tabular}{c}
\mbox{\psfig{figure=figures/mlfmm.pdf,width=8cm}}
\end{tabular}
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
Dark bars represent execution time (left axis).
Light bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}
Both XE and S822LC achieve more CPU speedup than they have floating-point units ($17\times$ at $32$ threads on $16$ units for XE, $26\times$ at $160$ threads on $20$ units for S822LC).
When more threads than units are created, each unit is more fully utilized than it would be under one-to-one thread-to-unit conditions.
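Concretely, dividing the observed speedup by the number of floating-point units gives the effective per-unit throughput relative to the single-threaded baseline:
\[
\frac{17}{16} \approx 1.06 \quad \mbox{(XE)}, \qquad \frac{26}{20} = 1.30 \quad \mbox{(S822LC)},
\]
so oversubscription lets each unit deliver more work per unit time than a single thread can extract from it.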
In both systems, using a GPU for MLFMM provides substantial speedup (an additional $3.1\times$ on XE/XK and $9.2\times$ on S822LC) over fully utilizing the CPUs.
For current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, overlapping all required MPI communication with GPU computation yields nearly linear scaling across multiple GPUs: a total speedup over ``1T'' of $794\times$ with $16$ GPUs on $16$ XK nodes and $969\times$ with $4$ GPUs on S822LC.
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
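The communication-computation overlap can be sketched with two CUDA streams (a simplified illustration rather than the actual MLFMM code; \texttt{localWork} and \texttt{remoteWork} are placeholder kernels, and the host buffers are assumed pinned via \texttt{cudaMallocHost}):
\begin{verbatim}
// Sketch: overlap the MPI boundary exchange with independent
// GPU work. All API calls are standard CUDA runtime / MPI.
void exchangeAndCompute(float *d_local, float *d_send,
                        float *d_recv, float *h_send,
                        float *h_recv, int n, int nbr) {
  cudaStream_t compute, comm;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&comm);

  // 1. Work that needs no remote data runs on the compute stream.
  localWork<<<256, 128, 0, compute>>>(d_local);

  // 2. Concurrently, stage boundary data to the pinned host
  //    buffer and exchange it with the neighbor rank.
  cudaMemcpyAsync(h_send, d_send, n * sizeof(float),
                  cudaMemcpyDeviceToHost, comm);
  cudaStreamSynchronize(comm);  // h_send is now valid
  MPI_Request reqs[2];
  MPI_Irecv(h_recv, n, MPI_FLOAT, nbr, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(h_send, n, MPI_FLOAT, nbr, 0, MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  // 3. Move received data back and run the dependent kernel.
  cudaMemcpyAsync(d_recv, h_recv, n * sizeof(float),
                  cudaMemcpyHostToDevice, comm);
  remoteWork<<<256, 128, 0, comm>>>(d_recv);

  cudaDeviceSynchronize();
  cudaStreamDestroy(compute);
  cudaStreamDestroy(comm);
}
\end{verbatim}
While the exchange proceeds on the \texttt{comm} stream and the host waits inside \texttt{MPI\_Waitall}, the \texttt{localWork} kernel keeps the GPU busy, which is what hides the communication cost at scale.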
Despite the 5-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
The corresponding single-GPU speedup in S822LC over XK is $4.4\times$.
On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
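This per-node figure is consistent with the nearly linear intra-node scaling noted above: with four GPUs per S822LC node, one would expect roughly
\[
4.4\times \cdot\, 4 = 17.6\times,
\]
close to the measured $17.9\times$.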