Sabrou-mal (サブロウ丸)

Mainly programming and mathematics

Ways to visualize collective communication

Revisiting the old to learn the new

Fast Multi-GPU collectives with NCCL

"Fast Multi-GPU collectives with NCCL | NVIDIA Technical Blog." NVIDIA Technical Blog, 21 Aug. 2022, developer.nvidia.com/blog/fast-multi-gpu-collectives-nccl.

https://developer-blogs.nvidia.com/wp-content/uploads/2016/04/image01.png

https://developer-blogs.nvidia.com/wp-content/uploads/2016/04/image00.png

https://developer-blogs.nvidia.com/wp-content/uploads/2016/04/image04.png

NCCL: Collective Operations

"Collective Operations — NCCL 2.18.1 documentation." 6 May 2023, docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html.

AllReduce

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/_images/allreduce.png
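As a rough companion to the figure: AllReduce combines the equally sized buffers of all ranks element-wise (a sum here) and leaves the identical result on every rank. The sketch below only simulates that semantics with Python lists. `all_reduce` is a hypothetical helper written for this note, not the NCCL API, and each inner list stands in for one rank's buffer.

```python
def all_reduce(buffers):
    """Simulate a sum AllReduce over per-rank buffers (hypothetical helper)."""
    # Element-wise sum across ranks.
    total = [sum(column) for column in zip(*buffers)]
    # Every rank receives its own copy of the reduced vector.
    return [list(total) for _ in buffers]


allreduce_result = all_reduce([[1, 2], [10, 20], [100, 200]])
print(allreduce_result)  # [[111, 222], [111, 222], [111, 222]]
```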

Broadcast

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/_images/broadcast.png
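Broadcast copies the root rank's buffer to every rank. A minimal simulation of that behavior, assuming lists as stand-ins for device buffers (`broadcast` is a hypothetical name, not an NCCL call):

```python
def broadcast(buffers, root):
    """Simulate Broadcast: all ranks end up with a copy of the root's buffer."""
    return [list(buffers[root]) for _ in buffers]


broadcast_result = broadcast([[0, 0], [7, 8], [0, 0]], root=1)
print(broadcast_result)  # [[7, 8], [7, 8], [7, 8]]
```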

Reduce

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/_images/reduce.png
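Reduce performs the same element-wise reduction as AllReduce, but only the root rank receives the result. In the sketch below, `None` marks ranks whose receive buffer is left untouched; that marker and the name `reduce_to_root` are assumptions made for illustration.

```python
def reduce_to_root(buffers, root):
    """Simulate a sum Reduce: only the root rank holds the reduced vector."""
    total = [sum(column) for column in zip(*buffers)]
    # Non-root ranks get None as a stand-in for "receive buffer not written".
    return [total if rank == root else None for rank in range(len(buffers))]


reduce_result = reduce_to_root([[1, 2], [10, 20], [100, 200]], root=0)
print(reduce_result)  # [[111, 222], None, None]
```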

AllGather

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/_images/allgather.png
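AllGather collects N values from each of k ranks into an output of k*N values, ordered by rank index, and gives that output to every rank. A small list-based simulation (hypothetical helper, not the NCCL API):

```python
def all_gather(buffers):
    """Simulate AllGather: concatenate per-rank buffers in rank order."""
    gathered = [value for buffer in buffers for value in buffer]
    # Every rank receives the full concatenation.
    return [list(gathered) for _ in buffers]


allgather_result = all_gather([[1, 2], [3, 4], [5, 6]])
print(allgather_result[0])  # [1, 2, 3, 4, 5, 6]
```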

ReduceScatter

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/_images/reducescatter.png
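ReduceScatter reduces like Reduce, but the result is split into equal blocks and each rank keeps the block that matches its rank index. The sketch below assumes the buffer length is a multiple of the number of ranks. `reduce_scatter` is a hypothetical helper for illustration only.

```python
def reduce_scatter(buffers):
    """Simulate a sum ReduceScatter: rank r keeps block r of the reduced vector."""
    ranks = len(buffers)
    total = [sum(column) for column in zip(*buffers)]
    block = len(total) // ranks  # assumes len(total) is divisible by ranks
    return [total[r * block:(r + 1) * block] for r in range(ranks)]


reducescatter_result = reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
print(reducescatter_result)  # [[11, 22], [33, 44]]
```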

Collective communication: theory, practice, and experience

Chan, Ernie, et al. "Collective communication: theory, practice, and experience." Concurrency and Computation: Practice and Experience 19.13 (2007): 1749-1783.

https://www.cs.utexas.edu/~pingali/CSE392/2011sp/lectures/Conc_Comp.pdf

Figure 5. Minimum-spanning tree algorithm for reduce [Screenshot 2023-05-24 15 32 34]
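To make the idea in this figure concrete, here is a sequential simulation of a minimum-spanning-tree style reduce, based on my reading of the recursive formulation: split the rank range in two, reduce each half to a local root, then send one message from the half without the root to the root. The function name, the message log, and the use of scalar values per rank are all assumptions for illustration, and details may differ from the paper's pseudocode.

```python
def mst_reduce(values, left, right, root, log):
    """Sum values[left..right] (inclusive) into values[root], in place."""
    if left == right:
        return
    mid = (left + right) // 2
    # The partner is the local root of the half that does not contain the root.
    partner = right if root <= mid else left
    if root <= mid:
        mst_reduce(values, left, mid, root, log)
        mst_reduce(values, mid + 1, right, partner, log)
    else:
        mst_reduce(values, left, mid, partner, log)
        mst_reduce(values, mid + 1, right, root, log)
    # One message: the partner sends its partial sum to the root.
    values[root] += values[partner]
    log.append((partner, root))


mst_values = [1, 2, 3, 4, 5]
mst_log = []
mst_reduce(mst_values, 0, len(mst_values) - 1, 0, mst_log)
print(mst_values[0])  # 15
print(mst_log)        # [(1, 0), (2, 0), (3, 4), (4, 0)]
```

The log holds p - 1 messages for p ranks. In a real run the two halves proceed in parallel, so the number of communication rounds grows with log p rather than p.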

Figure 9. Recursive-doubling algorithm for reduce-scatter [Screenshot 2023-05-24 15 35 08]
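A pairwise-exchange reduce-scatter can be sketched as follows. This is the scheme often called recursive halving: at each step a rank pairs with the rank whose index differs in one bit, keeps the half of its active range that contains its own final block, and adds the partner's data for that half. I have not checked it line by line against the figure, so treat the function name, the power-of-two restriction, and the step order as assumptions.

```python
def recursive_halving_reduce_scatter(buffers):
    """Simulate a sum reduce-scatter by pairwise exchange (power-of-two ranks)."""
    p = len(buffers)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two number of ranks"
    n = len(buffers[0]) // p            # block length each rank owns at the end
    data = [list(b) for b in buffers]
    lo, hi = [0] * p, [p] * p           # active block range [lo, hi) per rank
    dist = p // 2
    while dist >= 1:
        new = [list(d) for d in data]   # write to a copy so partners read old data
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            # Keep the half of the active range that contains this rank's block.
            keep = (lo[r], mid) if r < partner else (mid, hi[r])
            for i in range(keep[0] * n, keep[1] * n):
                new[r][i] = data[r][i] + data[partner][i]
            lo[r], hi[r] = keep
        data = new
        dist //= 2
    return [data[r][r * n:(r + 1) * n] for r in range(p)]


halving_result = recursive_halving_reduce_scatter([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
])
print(halving_result)  # [[1111], [2222], [3333], [4444]]
```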

Figure 14. Bidirectional exchange algorithm for allreduce [Screenshot 2023-05-24 15 38 19]
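The exchange pattern in this figure can be simulated in a few lines, assuming the simplest variant: in each step every rank swaps its full vector with a partner at distance 1, 2, 4, ... and adds what it receives. The name `bde_allreduce` and the power-of-two restriction are assumptions of this sketch, not details taken from the paper.

```python
def bde_allreduce(buffers):
    """Simulate a sum allreduce by pairwise full-vector exchange."""
    p = len(buffers)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two number of ranks"
    data = [list(b) for b in buffers]
    dist = 1
    while dist < p:
        # Rank r exchanges with rank r XOR dist; both add the partner's vector.
        data = [
            [a + b for a, b in zip(data[r], data[r ^ dist])]
            for r in range(p)
        ]
        dist *= 2
    return data


bde_result = bde_allreduce([[1, 1], [2, 2], [3, 3], [4, 4]])
print(bde_result)  # [[10, 10], [10, 10], [10, 10], [10, 10]]
```

After log2(p) steps every rank holds the full sum, at the cost of sending the whole vector in every step.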

A Generalization of the Allreduce Operation

Kolmakov, Dmitry, and Xuecang Zhang. "A generalization of the allreduce operation." arXiv preprint arXiv:2004.09362 (2020).

[Screenshot 2023-05-24 17 55 32]

[Screenshot 2023-05-24 17 56 07]

TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Shah, Aashaka, et al. "TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches." 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 2023.

[Screenshot 2023-05-24 15 55 22] [Screenshot 2023-05-24 15 56 10]

A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce

Wang, Guozheng, et al. "A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce." Data Science and Engineering (2023): 1-12.

[Screenshot 2023-05-24 18 01 22]

Recent Improvements of MPI Communication for DDLS

Kim, Hyejin. "Recent Improvements of MPI Communication for DDLS - Hyejin Kim - Medium." Medium, 6 Jan. 2022, hk3342.medium.com/recent-improvements-of-mpi-communication-74e3c4a1ccb4.

[Screenshot 2023-05-24 18 04 58]

Optimization of Collective Communication Operations in MPICH

Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. "Optimization of collective communication operations in MPICH." The International Journal of High Performance Computing Applications 19.1 (2005): 49-66.

[Screenshot 2023-05-24 18 06 53] [Screenshot 2023-05-24 18 06 47]
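One of the algorithms this paper covers is recursive doubling for allgather, where the amount of data each rank holds doubles in every step. The sketch below simulates that pattern for a power-of-two number of ranks. It is not necessarily what the screenshots above depict, and the function name and dictionary bookkeeping are assumptions for illustration.

```python
def recursive_doubling_allgather(blocks):
    """Simulate allgather by recursive doubling (power-of-two ranks)."""
    p = len(blocks)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two number of ranks"
    # have[r] maps a source rank to the block that rank r currently holds.
    have = [{r: blocks[r]} for r in range(p)]
    dist = 1
    while dist < p:
        # Rank r and rank r XOR dist swap everything they have gathered so far.
        have = [{**have[r], **have[r ^ dist]} for r in range(p)]
        dist *= 2
    # Order each rank's blocks by source rank.
    return [[held[src] for src in range(p)] for held in have]


doubling_result = recursive_doubling_allgather(["a", "b", "c", "d"])
print(doubling_result[2])  # ['a', 'b', 'c', 'd']
```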

Sparse allreduce: Efficient scalable communication for power-law data

Zhao, Huasha, and John Canny. "Sparse allreduce: Efficient scalable communication for power-law data." arXiv preprint arXiv:1312.3020 (2013).

[Screenshot 2023-05-25 7 37 11]