fix acks

2017-05-18 09:33:50 -07:00
parent 8d32532f8c
commit 41e214500c
6 changed files with 2 additions and 1 deletions
--- a/figures/cpu_matvec.pdf
+++ b/figures/cpu_matvec.pdf
--- a/figures/kernels.pdf
+++ b/figures/kernels.pdf
--- a/figures/mlfmm.pdf
+++ b/figures/mlfmm.pdf
--- a/figures/mlfmm_bw.pdf
+++ b/figures/mlfmm_bw.pdf
--- a/figures/mlfmm_minsky.pdf
+++ b/figures/mlfmm_minsky.pdf
--- a/main.tex
+++ b/main.tex
@@ -153,6 +153,7 @@ This reflects the slow pace of single-threaded CPU performance improvement.
 On the other hand, the P100 GPU in S822LC provides a 4.4x speedup over the K20x in XK.
 On a per-node basis, the four GPUs in S822LC provide a 17.9x speedup over the single GPU in XK.

+The nearfield kernel consumes approximately 60\% of the MLFMM time.
 The average kernel-execution speedup moving from K20x to P100 is 5.3x, and the disaggregation kernel speedup is the largest, at 8x.
 On both K20x and P100, this kernel's performance is limited by the amount of CUDA shared memory it requires. 
 In S822LC, the newer Pascal GPU architecture provides 64 KB of shared memory per streaming multiprocessor rather than the 48 KB on XK, which allows more thread-blocks to run concurrently and provides the disproportionate speedup on that machine.
@@ -166,7 +167,7 @@ This speedup justifies the significant CUDA time investment.


 \section*{Acknowledgments}
-This work was supported by the NVIDIA GPU Center of Excellence, the NCSA Petascale Improvement Discovery Program, and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR).
+This work was supported by the NVIDIA GPU Center of Excellence and the NCSA Petascale Improvement Discovery Program (PAID).

 \bibliographystyle{IEEEtran}
 \begin{thebibliography}{99}