diff --git a/figures/cpu_matvec.pdf b/figures/cpu_matvec.pdf
index 1d17335..6117160 100644
Binary files a/figures/cpu_matvec.pdf and b/figures/cpu_matvec.pdf differ
diff --git a/figures/kernels.pdf b/figures/kernels.pdf
index 394a59e..370aaf0 100644
Binary files a/figures/kernels.pdf and b/figures/kernels.pdf differ
diff --git a/figures/mlfmm.pdf b/figures/mlfmm.pdf
index 2dc6d1a..c3b0f1d 100644
Binary files a/figures/mlfmm.pdf and b/figures/mlfmm.pdf differ
diff --git a/figures/mlfmm_bw.pdf b/figures/mlfmm_bw.pdf
index cee36aa..5658d9a 100644
Binary files a/figures/mlfmm_bw.pdf and b/figures/mlfmm_bw.pdf differ
diff --git a/figures/mlfmm_minsky.pdf b/figures/mlfmm_minsky.pdf
index 31738af..fb5e01c 100644
Binary files a/figures/mlfmm_minsky.pdf and b/figures/mlfmm_minsky.pdf differ
diff --git a/main.tex b/main.tex
index 465c47a..cd78471 100644
--- a/main.tex
+++ b/main.tex
@@ -153,6 +153,7 @@
 This reflects the slow pace of single-threaded CPU performance improvement.
 On the other hand, the P100 GPU in S822LC provides 4.4x speedup over the K20x in XK.
 On a per-node basis the four GPUs in S822LC provide 17.9 speedup over the single GPU in XK.
+The nearfield kernel consumes approximately 60\% of the MLFMM time.
 The average kernel-execution speedup moving from K20x to P100 is 5.3x, and the disaggregation kernel speedup is the largest, at 8x.
 On both K20x and P100, this kernel's performance is limited by the amount of CUDA shared memory it requires.
 In S822LC, the newer Pascal GPU architecture provides 64 KB of shared memory per thread-block rather than the 48 KB on XK, which allows more thread-blocks to run concurrently and provide the disproportionate speedup on that machine.
@@ -166,7 +167,7 @@
 This speedup justifies the significant CUDA time investment.
 
 \section*{Acknowledgments}
-This work was supported by the NVIDIA GPU Center of Excellence, the NCSA Petascale Improvement Discovery Program, and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR).
+This work was supported by the NVIDIA GPU Center of Excellence and the NCSA Petascale Improvement Discovery Program (PAID).
 \bibliographystyle{IEEEtran}
 \begin{thebibliography}{99}