Merge sharelatex-2017-05-08-2106 into master

Carl Pearson
2017-05-08 14:06:31 -07:00
committed by GitHub


@@ -121,8 +121,8 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
\end{center}
\caption{
MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
-Dark bars represents execution time (left axis).
-Light bars show speedup normalized to the ``1T'' execution (right axis).
+Light bars represent execution time (left axis).
+Dark bars show speedup normalized to the ``1T'' execution (right axis).
}
\label{fig:mlfmm_performance}
\end{figure}
@@ -133,13 +133,13 @@ When more threads than units are created, each unit is more fully-utilized than
thread-to-unit conditions.
In both systems, using a GPU for MLFMM provides substantial speedup (additional $3.1\times$ on XE/XK, $9.2\times$ on S822LC) over fully utilizing the CPUs.
-In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time investmeed in a CUDA implementation.
+In current-generation GPUs like the P100 in S822LC, this speedup justifies the considerable time invested in a CUDA implementation.
Furthermore, nearly linear scaling when using multiple GPUs is also achieved thanks to overlapping all required MPI communication with GPU computation, for a total speedup of $794\times$ over ``1T'' when using $16$ GPUs on $16$ XK nodes, and $969\times$ when using $4$ GPUs on S822LC.
This corresponds to a reduction in execution time from approximately $33$ seconds to $40$ milliseconds on XK nodes, and $28$ seconds to $29$ milliseconds on S822LC.
Despite the 5-year gap between deployment of Blue Waters and S822LC, the baseline ``1T'' execution is only $1.2\times$ faster on S822LC than on an XE node.
This reflects the current slow pace of single-threaded CPU performance improvement in the industry.
-The corresponding single-GPU speedup in S822LC over XK $4.4\times$.
+The corresponding single-GPU speedup in S822LC over XK is $4.4\times$.
On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9\times$.
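As a quick consistency check on the figures in the hunk above (an editor's illustration, using only the rounded times and speedups quoted in the text), the end-to-end speedups are simply the ratio of the ``1T'' time to the multi-GPU time:

\[
\frac{t_{\text{1T}}}{t_{\text{GPU}}} \approx \frac{28\ \mathrm{s}}{29\ \mathrm{ms}} \approx 966\times \quad (\text{S822LC; reported } 969\times),
\qquad
\frac{33\ \mathrm{s}}{40\ \mathrm{ms}} \approx 825\times \quad (\text{XK; reported } 794\times).
\]

The small discrepancies are consistent with the rounding of the quoted execution times.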
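The overlap of MPI communication with GPU computation credited above for the near-linear scaling follows a standard CUDA pattern. The sketch below is an editor's illustration, not the paper's MLFMM code: the kernel local_work, the HALO size, and the ring-neighbor exchange are all hypothetical stand-ins. It shows independent device work running in one stream while boundary data is exchanged with nonblocking MPI on the host.

// Minimal sketch of MPI/GPU overlap, assuming a hypothetical kernel and
// exchange pattern; this is NOT the paper's MLFMM implementation.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void local_work(float *data, int n) {
  // Stand-in for MLFMM work that needs no remote data.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int N = 1 << 20, HALO = 1 << 10;          // hypothetical sizes
  float *d_data, *h_send, *h_recv;
  cudaMalloc(&d_data, N * sizeof(float));
  cudaMemset(d_data, 0, N * sizeof(float));
  cudaMallocHost(&h_send, HALO * sizeof(float));  // pinned, so async copies overlap
  cudaMallocHost(&h_recv, HALO * sizeof(float));

  cudaStream_t compute, comm;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&comm);

  // 1. Stage boundary data to the host on the communication stream.
  cudaMemcpyAsync(h_send, d_data, HALO * sizeof(float),
                  cudaMemcpyDeviceToHost, comm);
  cudaStreamSynchronize(comm);  // h_send must be ready before MPI uses it

  // 2. Start the nonblocking exchange, then immediately launch the
  //    independent kernel; MPI progresses on the host while the GPU works.
  MPI_Request reqs[2];
  int peer = (rank + 1) % size;  // hypothetical ring partner
  MPI_Irecv(h_recv, HALO, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(h_send, HALO, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
  local_work<<<(N + 255) / 256, 256, 0, compute>>>(d_data, N);

  // 3. Finish the exchange, wait for the kernel, then push received
  //    data back to the device.
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  cudaStreamSynchronize(compute);
  cudaMemcpyAsync(d_data, h_recv, HALO * sizeof(float),
                  cudaMemcpyHostToDevice, comm);
  cudaDeviceSynchronize();

  cudaFree(d_data);
  cudaFreeHost(h_send);
  cudaFreeHost(h_recv);
  cudaStreamDestroy(compute);
  cudaStreamDestroy(comm);
  MPI_Finalize();
  return 0;
}

So long as the exchange completes before the independent kernel does, communication adds essentially no wall-clock time, which is what makes the near-linear multi-GPU scaling reported above possible.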
\subsection{Computation Kernel Breakdown}
@@ -160,7 +160,7 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of MLFMM execution time spe
\label{fig:kernel_breakdown}
\end{figure}
+The \texttt{L2L} kernels exhibit the
%This document is a template for authors preparing papers for the
%CEM'17 Computing and Electromagnetics Workshop in Barcelona, Spain.