Survey: On Optimizing the Communication of Model Parallelism

distributed deep learning, paper survey

@article{zhuang2023optimizing, title={On optimizing the communication of model parallelism}, author={Zhuang, Yonghao and Zheng, Lianmin and Li, Zhuohan and Xing, Eric and Ho, Qirong and Gonzalez, Joseph and Stoica, Ion and Zhang, Hao and Z…

2023-07-10

Survey: Synthesizing Optimal Collective Algorithms (2021)

distributed deep learning, paper survey

Cai, Zixian, et al. "Synthesizing optimal collective algorithms." Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2021. What is it? Details. Paper metadata. Hardware for AI: DGX-1, Gigabyte Z52 …

2023-06-12

Survey: Blink: Fast and Generic Collectives for Distributed ML (2020)

distributed deep learning

Wang, Guanhua, et al. "Blink: Fast and generic collectives for distributed ml." Proceedings of Machine Learning and Systems 2 (2020): 172-186. Microsoft + intern? [paper] Overview. What is it? Collective communication algorithms suited to a given topology…

2023-05-15

Survey: Topoopt: Co-optimizing network topology and parallelization strategy for distributed training jobs (2022)

distributed deep learning, paper survey, deep learning, optimization

Wang, Weiyang, et al. "Topoopt: Co-optimizing network topology and parallelization strategy for distributed training jobs." arXiv preprint arXiv:2202.00433 (2022). [paper] Overview. What is it? An analysis of distributed DNN training jobs at Meta, and that…

#TOPOOPT

2023-01-02

Survey: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

paper survey, deep learning, distributed deep learning

@article{zheng2022alpa, title={Alpa: Automating Inter-and Intra-Operator Parallelism for Distributed Deep Learning}, author={Zheng, Lianmin and Li, Zhuohan and Zhang, Hao and Zhuang, Yonghao and Chen, Zhifeng and Huang, Yanping and Wang, Y…

2022-10-17

Internal Design of DistributedDataParallel (pytorch)

pytorch, distributed deep learning, DistributedDataParallel, python

A Japanese translation of https://pytorch.org/docs/stable/notes/ddp.html#internal-design, plus bonus footnotes. The pytorch version is v1.12. Internal Design: This section shows how torch.nn.parallel.DistributedDataParallel works by diving into the details of each step in one iteration…

#Pytorch #DistributedDataParallel

2022-10-11

DistributedDataParallel (pytorch) Sample Code

pytorch, python, distributed deep learning, DistributedDataParallel

This article presents sample code for DistributedDataParallel and checks what communication takes place during execution. Reference: Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.13.0+cu117 documentation pytorch DistributedDataPara…

#Python #Pytorch #DistributedDataParallel

2022-09-19

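Reading the Megatron-LM Source Code

distributed deep learning

A parallelized implementation of Transformer-based language models, proposed by NVIDIA. The survey article is here ↓↓↓ The GitHub repository contains code for Data Preprocessing, Pretraining, and Evaluation and Tasks. Pretraining…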

#Megatron-LM

2022-08-03

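Building MLBench

distributed deep learning

Building MLBench (https://mlbench.github.io) took a fair amount of trial and error, so I am recording the process here. Final procedure. Commands to run. List of errors: error: blis/cy.c: No such file or directory Failed to build matplotlib Starting control-plane Error docker.errors.APIEr…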

2022-07-06

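Using the MPI Backend for Distributed Processing in PyTorch

python, pytorch, distributed processing, MPI, distributed deep learning

torch.distributed provides APIs for distributed processing, such as point-to-point communication and collective communication. This makes it possible to customize fine-grained behavior. As of pytorch 1.13, MPI, GLOO, and NCCL can be selected as the communication backend. What is available with each backend…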

#Pytorch #MPI #DistributedDataParallel #TORCH_DISTRIBUTED_DEBUG

2022-06-30

Survey: Automatic Graph Partitioning for Very Large-scale Deep Learning

paper survey, distributed deep learning

Tanaka, Masahiro, et al. "Automatic graph partitioning for very large-scale deep learning." 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2021. @inproceedings{tanaka2021automatic, title={Automatic gra…

2022-06-28

Survey: Supporting Very Large Models using Automatic Dataflow Graph Partitioning

paper survey, distributed deep learning

Wang, Minjie, Chien-chin Huang, and Jinyang Li. "Supporting very large models using automatic dataflow graph partitioning." Proceedings of the Fourteenth EuroSys Conference 2019. 2019. @inproceedings{wang2019supporting, title={Supporting v…

2022-06-08

Survey: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

paper survey, distributed deep learning

@article{shoeybi2019megatron, title={Megatron-lm: Training multi-billion parameter language models using model parallelism}, author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanza…

#Megatron-LM

2022-05-17

Survey: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

distributed deep learning, paper survey

https://dl.acm.org/doi/10.1145/3458817.3476209 paper: @inproceedings{10.1145/3458817.3476209, author = {Narayanan, Deepak and Shoeybi, Mohammad and Casper, Jared and LeGresley, Patrick and Patwary, Mostofa and Korthikanti, Vijay and Vainbr…

2022-05-13

Survey: Mesh-tensorflow: Deep learning for supercomputers

distributed deep learning, paper survey

@article{shazeer2018mesh, title={Mesh-tensorflow: Deep learning for supercomputers}, author={Shazeer, Noam and Cheng, Youlong and Parmar, Niki and Tran, Dustin and Vaswani, Ashish and Koanantakool, Penporn and Hawkins, Peter and Lee, Hyouk…

2022-05-12

Survey: PipeDream: Generalized Pipeline Parallelism for DNN Training

distributed deep learning, paper survey

https://dl.acm.org/doi/abs/10.1145/3341301.3359646?casa_token=L-sKQKrRoE4AAAAA%3AYKo9NPdnPyG6IouMN5jfTHTCYFAGORDxen32GKAteeSG-ROhqx_OX-hVOfuyHiVBXLLJH0RPujhFPEk @inproceedings{narayanan2019pipedream, title={PipeDream: generalized pipeline …

2022-05-10

Survey: Gpipe: Efficient training of giant neural networks using pipeline parallelism

distributed deep learning, paper survey

@article{huang2019gpipe, title={Gpipe: Efficient training of giant neural networks using pipeline parallelism}, author={Huang, Yanping and Cheng, Youlong and Bapna, Ankur and Firat, Orhan and Chen, Dehao and Chen, Mia and Lee, HyoukJoong a…

2022-05-09

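Survey: Distributed Deep Learning

distributed deep learning, paper survey

In deep learning, ever-larger training datasets and models have become the latest trend. To reduce training time, attempts are being made to train models across multiple machines, and this is becoming a field in its own right under names such as distributed deep learning…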

Sabrou-mal サブロウ丸

Mainly programming and mathematics
