fix acks
Binary file not shown.
3
main.tex
@@ -153,6 +153,7 @@ This reflects the slow pace of single-threaded CPU performance improvement.
On the other hand, the P100 GPU in S822LC provides a 4.4x speedup over the K20x in XK.
On a per-node basis, the four GPUs in S822LC provide a 17.9x speedup over the single GPU in XK.
The nearfield kernel consumes approximately 60\% of the MLFMM time.
The average kernel-execution speedup moving from K20x to P100 is 5.3x, and the disaggregation kernel speedup is the largest, at 8x.
On both K20x and P100, this kernel's performance is limited by the amount of CUDA shared memory it requires.
In S822LC, the newer Pascal GPU architecture provides 64 KB of shared memory per thread-block rather than the 48 KB on XK, which allows more thread-blocks to run concurrently and accounts for the disproportionate speedup on that machine.
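As a rough illustration of the shared-memory argument, the number of thread-blocks that can be resident at once is bounded by how many per-block allocations fit in the available shared memory.
The symbols $S_{\mathrm{avail}}$ and $S_{\mathrm{block}}$ and the 20 KB per-block footprint used here are assumptions chosen only for illustration, not measured values from this work, and other limits such as registers or thread counts may bind first.
\[
B_{\max} = \left\lfloor \frac{S_{\mathrm{avail}}}{S_{\mathrm{block}}} \right\rfloor,
\qquad
\left\lfloor \frac{48}{20} \right\rfloor = 2 \ \mbox{(K20x)},
\qquad
\left\lfloor \frac{64}{20} \right\rfloor = 3 \ \mbox{(P100)}.
\]
Under that assumed footprint, the larger capacity admits one more concurrent thread-block, a 1.5x increase in concurrency from shared memory alone.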
@@ -166,7 +167,7 @@ This speedup justifies the significant CUDA time investment.
\section*{Acknowledgments}
This work was supported by the NVIDIA GPU Center of Excellence, the NCSA Petascale Improvement Discovery Program, and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR).
This work was supported by the NVIDIA GPU Center of Excellence and the NCSA Petascale Improvement Discovery Program (PAID).
\bibliographystyle{IEEEtran}
\begin{thebibliography}{99}