diff --git a/main.tex b/main.tex
index 86d5f7d..2184efe 100644
--- a/main.tex
+++ b/main.tex
@@ -104,12 +104,7 @@ The P100s are connected to the Power8 CPUs via $80$~GB/s NVLink connections.
 All evaluations are done on a problem with these parameters.
 \todo{get from mert}
-Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations.
-``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
-``160T'' is a $160$-thread OpenMP executions on S822LC.
-``1 GPU'' is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
-``4 GPU'' and ``16 GPU'' are GPU-accelerated executions with a corresponding number of MPI ranks.
-On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
+Fig.~\ref{fig:mlfmm_performance} shows MLFMM performance scaling on various Blue Waters and S822LC configurations. On XK nodes (Fig.~\ref{fig:mlfmm_performance}~(a)), each node runs a single MPI rank.
 On S822LC, the 4 MPI ranks run on a single machine to utilize the $4$ GPUs.
 A $16$-GPU MPI execution is not shown, as only one S822LC was available for evaluation.
@@ -121,6 +116,10 @@ A $16$-GPU MPI execution is not shown, as only one S822LC was available for eval
 \end{center}
 \caption{
 MLFMM execution times and speedup over single-threaded execution on Blue Waters XE and XK nodes (a) and S822LC (b).
+ ``1T'' and ``32T'' are single-threaded and $32$-thread OpenMP executions on a single XE node.
+ ``160T'' is a $160$-thread OpenMP execution on S822LC.
+ ``1 GPU'' is a GPU-accelerated execution on a single XK node or using one GPU on S822LC.
+ ``4 GPU'' and ``16 GPU'' are GPU-accelerated executions with the corresponding number of MPI ranks.
 Light bars represent execution time (left axis).
 Dark bars show speedup normalized to the ``1T'' execution (right axis).
 }
@@ -146,7 +145,7 @@ On a per-node basis (``1 GPU'' in XK, ``4 GPU'' in S822LC), the speedup is $17.9
 Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spent in computational kernels.
 \texttt{P2P} is the ``particle-to-particle'' or nearfield exchanges.
-\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
+\texttt{P2M} and \texttt{M2M} are the lowest-level and higher-level aggregations, respectively.
 \texttt{L2L} and \texttt{L2P} are the higher-level and lowest-level disaggregations, respectively.
 \texttt{M2M} is the translations.
@@ -156,11 +155,17 @@ Fig.~\ref{fig:kernel_breakdown} shows the amount of of MLFMM execution time spe
 \mbox{\psfig{figure=figures/kernels.pdf,width=8cm}}
 \end{tabular}
 \end{center}
- \caption{Normalized breakdown of the computation time across different MLFMM kernels in different exection environments.}
+ \caption{Normalized breakdown of the computation time across different MLFMM kernels in different execution environments.}
 \label{fig:kernel_breakdown}
 \end{figure}
-The \texttt{L2L} kernels exhibit the
+Since MLFMM is realized as dense matrix operations, the CUDA implementations leverage well-understood techniques for dense matrix-matrix and matrix-vector multiplication, including
+hybrid shared-memory and register tiling, and thread coarsening.
+The \texttt{P2P} nearfield kernel accounts for the majority of the MLFMM execution time.
+The MPI communication is hidden behind this long-running kernel.
+The average GPU kernel speedup when moving from four XK GPUs to the four S822LC GPUs is $5.3\times$, with the \texttt{L2L} kernel showing the largest speedup, at $8\times$.
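+As an illustration of these techniques, the listing below is a minimal, generic CUDA sketch of a tiled dense matrix-matrix multiply, not the production MLFMM kernels: operand tiles are staged in shared memory, each thread accumulates its partial sums in a small register tile, and each thread is coarsened to produce several output elements.
+The kernel name, tile width, and coarsening factor are illustrative, and the matrix dimension is assumed to be a multiple of the tile footprint.
+\begin{verbatim}
+#define TILE 16     // shared-memory tile width
+#define COARSEN 4   // output elements computed per thread
+
+// Sketch only: C = A * B for square, row-major n x n matrices,
+// with n a multiple of TILE*COARSEN.
+// Launch: dim3 block(TILE, TILE);
+//         dim3 grid(n/(TILE*COARSEN), n/TILE);
+__global__ void gemm_tiled(const float *A, const float *B,
+                           float *C, int n) {
+  __shared__ float Asub[TILE][TILE];  // staged tile of A
+  __shared__ float Bsub[TILE][TILE];  // staged tile of B
+
+  int row  = blockIdx.y * TILE + threadIdx.y;
+  int col0 = blockIdx.x * TILE * COARSEN + threadIdx.x;
+
+  float acc[COARSEN] = {0.0f};        // register tile of C
+
+  for (int t = 0; t < n / TILE; ++t) {
+    // Each A tile is loaded once and reused for all COARSEN B tiles.
+    Asub[threadIdx.y][threadIdx.x] =
+        A[row * n + t * TILE + threadIdx.x];
+    for (int c = 0; c < COARSEN; ++c) {
+      Bsub[threadIdx.y][threadIdx.x] =
+          B[(t * TILE + threadIdx.y) * n + col0 + c * TILE];
+      __syncthreads();
+      for (int k = 0; k < TILE; ++k)
+        acc[c] += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
+      __syncthreads();
+    }
+  }
+  for (int c = 0; c < COARSEN; ++c)
+    C[row * n + col0 + c * TILE] = acc[c];
+}
+\end{verbatim}
+The per-block shared-memory footprint of such staged tiles is one of the factors that determines how many thread-blocks can be resident on a streaming multiprocessor at once.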
+On both XK and S822LC, the \texttt{L2L} kernel's performance is limited by the amount of CUDA shared memory it requires.
+On S822LC, the newer Pascal GPU architecture provides $64$~KB of shared memory per streaming multiprocessor rather than the $48$~KB on XK, which allows more thread-blocks to run concurrently and produces the disproportionate speedup on that machine.
 %This document is a template for authors preparing papers for the
 %CEM'17 Computing and Electromagnetics Workshop in Barcelona, Spain.
@@ -241,12 +246,10 @@ The \texttt{L2L} kernels exhibit the
 \vfill
 \pagebreak
 \section{Conclusions}
-%This template uses IEEE style and provides necessary information
-%to prepare papers for CEM'17 Workshop. Thank you for your
-%contributions.
-Significant CPU speedup from OpenMP.
-On modern accelerations, speedup justifies CUDA investment.
-Parallelism responsible for making the problem solvable in useful timescales.
+This paper presents MLFMM performance results on three types of compute nodes: Blue Waters XE nodes, Blue Waters XK nodes, and an IBM S822LC.
+MLFMM is realized as dense matrix operations.
+Significant CPU speedup is achieved with OpenMP on both Blue Waters and S822LC, and is eclipsed in turn by CUDA implementations that take advantage of well-understood dense matrix optimization techniques, reaching a speedup of $969\times$ over single-threaded CPU execution on S822LC and bringing execution times from seconds to milliseconds even for large problems.
+On modern GPUs, this speedup justifies the significant investment in CUDA development.
 \section*{Acknowledgment}