Survey: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

@inproceedings{rajbhandari2020zero,
  title={Zero: Memory optimizations toward training trillion parameter models},
  author={Rajbhandari, Samyam and Rasley, Jeff and Ruwase, Olatunji and He, Yuxiong},
  booktitle={SC20: International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages={1--16},
  year={2020},
  organization={IEEE}
}

paper: https://arxiv.org/pdf/1910.02054.pdf

Microsoft develops the DeepSpeed library based on a series of papers whose titles begin with "ZeRO".

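Background

Models keep growing, to the point where a model can no longer be loaded onto a single server for training.

What is it?

An analysis showing that most of the memory used during training goes to the information needed to update the parameters (gradients and optimizer states), while the model parameters themselves take a comparatively small share.
A proposal for a data-parallel-based training method with low memory usage, obtained by distributing that parameter-update information across devices.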
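
To make the breakdown concrete, here is a rough sketch of the per-parameter accounting the paper uses for mixed-precision Adam: 2 bytes for the fp16 parameter, 2 bytes for the fp16 gradient, and 12 bytes of fp32 optimizer state. The function name and the 1.5B example size are illustrative choices, and activations and temporary buffers are left out.

```python
def model_state_bytes(num_params):
    """Bytes of model state under mixed-precision Adam (activations excluded)."""
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    # fp32 master copy of the weights + fp32 momentum + fp32 variance
    optimizer_states = (4 + 4 + 4) * num_params
    return {"params": fp16_params, "grads": fp16_grads, "optimizer": optimizer_states}


breakdown = model_state_bytes(1_500_000_000)  # example: a 1.5B-parameter model
total = sum(breakdown.values())
for name, size in breakdown.items():
    print(f"{name:>9}: {size / 1e9:5.1f} GB ({100 * size / total:.1f}%)")
```

Under these assumptions the parameters are one eighth of the model state, which is the observation the method builds on.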

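What is impressive compared with prior work?

It proposes a set of techniques aimed squarely at reducing memory usage.
It explains the memory-usage profiling results carefully, which is helpful.
Because it (reportedly) avoids the communication overhead of model parallelism (MP), huge models can be trained even on low-end compute nodes that lack the very fast intra-node interconnects, such as NVLink or NVSwitch, that MP needs to run efficiently.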

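What is the key to the technique?

The paper explains that most of the memory used during training is for parameter updates.
To distribute the parameter-update information:
- Each device is assigned a non-overlapping part of the model and updates only the parameters in its assigned part.
- Each device only has to keep the information needed to update the parameters it is responsible for, which saves memory.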
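
The partitioning idea in the two bullets above can be sketched as a single-process simulation. This is not DeepSpeed code: the function names are hypothetical, SGD with momentum stands in for Adam to keep it short, and the collectives (reduce-scatter, all-gather) are imitated with plain Python loops.

```python
def partition(n_items, n_ranks):
    """Split indices 0..n_items-1 into contiguous, non-overlapping shards."""
    base, extra = divmod(n_items, n_ranks)
    bounds, start = [], 0
    for rank in range(n_ranks):
        end = start + base + (1 if rank < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def sharded_momentum_step(params, per_rank_grads, momentum_shards, lr=0.1, beta=0.9):
    """One simulated step in which every rank updates only the shard it owns."""
    n_ranks = len(per_rank_grads)
    new_params = list(params)
    for rank, (start, end) in enumerate(partition(len(params), n_ranks)):
        for i in range(start, end):
            # Reduce-scatter stand-in: the owner gets the averaged gradient of its shard.
            grad = sum(grads[i] for grads in per_rank_grads) / n_ranks
            # The momentum buffer for index i exists only on the owning rank.
            j = i - start
            momentum_shards[rank][j] = beta * momentum_shards[rank][j] + grad
            new_params[i] = params[i] - lr * momentum_shards[rank][j]
    # All-gather stand-in: the merged list is what every rank sees afterwards.
    return new_params


params = [1.0, 2.0, 3.0, 4.0, 5.0]
per_rank_grads = [[0.1] * 5, [0.3] * 5]  # local gradients from two ranks
bounds = partition(len(params), 2)
momentum_shards = [[0.0] * (end - start) for start, end in bounds]
params = sharded_momentum_step(params, per_rank_grads, momentum_shards)
print(bounds, [len(shard) for shard in momentum_shards])
```

Each rank's optimizer state covers only its own slice, so the per-device optimizer memory shrinks roughly in proportion to the number of ranks.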
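To reduce memory usage further, each device also stores only its assigned part of the model parameters.
- The owning device broadcasts its parameters to the other devices, so devices that do not hold those parameters can still run the forward pass.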
All of this increases communication volume, but the paper claims it is at most 1.5x that of plain data parallelism.
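
The 1.5x figure follows from counting collectives, as in the paper's communication analysis. The sketch below counts per-device traffic in units of parameters; the function name and the use of "stage 0" as a label for plain data parallelism are conventions chosen here for illustration.

```python
def comm_volume_per_device(num_params, stage):
    """Elements each device moves per training step, counted in parameters."""
    reduce_scatter_grads = num_params
    all_gather_params = num_params
    if stage in (0, 1, 2):
        # Stage 0 (plain data parallelism): all-reduce = reduce-scatter + all-gather.
        # Stages 1-2: reduce-scatter the gradients, then all-gather updated parameters.
        return reduce_scatter_grads + all_gather_params
    if stage == 3:
        # Parameters are gathered once for forward and once more for backward,
        # and the gradients are still reduce-scattered.
        return 2 * all_gather_params + reduce_scatter_grads
    raise ValueError(f"unknown stage: {stage}")


psi = 1_000_000
ratio = comm_volume_per_device(psi, 3) / comm_volume_per_device(psi, 0)
print(ratio)
```

Only the stage that also partitions the parameters pays extra; partitioning optimizer states and gradients alone keeps the volume equal to plain data parallelism.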
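Lower memory usage is valuable because it allows larger models to be trained and, for example, leaves room to keep the activations from the forward pass.
Memory fragmentation is also addressed: contiguous memory is pre-allocated (in the paper, for long-lived tensors such as activation checkpoints and gradients) so that it does not interleave with short-lived allocations.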

How was it validated?

Theoretical evaluation
With 1024 32 GB V100 GPUs, running stage 3 satisfies the memory requirement for training a GPT-based model with 1T (one trillion) parameters.
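
The arithmetic behind this estimate can be checked with the per-device formulas the paper gives for each stage. This is a sketch: the function name is made up here, `k=12` assumes mixed-precision Adam, stage 0 denotes the unpartitioned baseline, and only model states are counted (no activations or buffers).

```python
def per_device_model_state_bytes(num_params, num_devices, stage, k=12):
    """Model-state bytes per device for each ZeRO partitioning stage."""
    if stage == 0:  # baseline data parallelism: everything replicated
        return (2 + 2 + k) * num_params
    if stage == 1:  # optimizer states partitioned
        return (2 + 2) * num_params + k * num_params / num_devices
    if stage == 2:  # optimizer states and gradients partitioned
        return 2 * num_params + (2 + k) * num_params / num_devices
    if stage == 3:  # optimizer states, gradients and parameters partitioned
        return (2 + 2 + k) * num_params / num_devices
    raise ValueError(f"unknown stage: {stage}")


one_trillion = 10**12
for stage in range(4):
    gb = per_device_model_state_bytes(one_trillion, 1024, stage) / 1e9
    print(f"stage {stage}: {gb:,.1f} GB per device")
```

With stage 3 the model states come to about 15.6 GB per device, which leaves roughly half of a 32 GB V100 for activations and buffers.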

Combined with model parallelism, the trainable model size can be pushed somewhat further.

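Memory usage
- Theoretical evaluation: models with a trillion or more parameters can be trained on 1024 GPUs.
- With ZeRO-100B, models with up to 13B parameters were trained on 128 GPUs at an average throughput above 40 TFlops per GPU.
  - Without ZeRO, the largest model trainable with data parallelism alone has 1.4B parameters, and throughput stays below 20 TFlops per GPU.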

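In February 2020, the 17-billion-parameter Turing-NLG model was trained and achieved SOTA language-model perplexity on WikiText-103.
- Until then, the 8.3-billion-parameter model trained with Megatron-LM had been SOTA.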

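Points for discussion

Collective communication increases considerably, since memory is saved by trading it for more communication.
In the communication analysis, the communication volume of the Megatron baseline is estimated by simply adding intra-node and inter-node traffic, ignoring the large difference in bandwidth between the two.
The baseline is torch's DistributedDataParallel, which makes for a weak comparison.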

What to read next

Model parallelism
- (2019, 406 cited)Megatron-LM: Shoeybi, Mohammad, et al. "Megatron-lm: Training multi-billion parameter language models using model parallelism." arXiv preprint arXiv:1909.08053 (2019).
- (2018, 208 cited)Shazeer, Noam, et al. "Mesh-tensorflow: Deep learning for supercomputers." Advances in neural information processing systems 31 (2018).
- (2019, 79 cited)Wang, Minjie, Chien-chin Huang, and Jinyang Li. "Supporting very large models using automatic dataflow graph partitioning." Proceedings of the Fourteenth EuroSys Conference 2019. 2019.
Pipeline parallelism
- (2019, 716 cited)GPipe: Huang, Yanping, et al. "Gpipe: Efficient training of giant neural networks using pipeline parallelism." Advances in neural information processing systems 32 (2019).
- (2019, 225 cited)PipeDream: Narayanan, Deepak, et al. "PipeDream: generalized pipeline parallelism for DNN training." Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.
Reducing activation memory
- (2018, 107 cited)Jain, Animesh, et al. "Gist: Efficient data encoding for deep neural network training." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
- (2016, 407 cited)Activation Checkpointing: Chen, Tianqi, et al. "Training deep nets with sublinear memory cost." arXiv preprint arXiv:1604.06174 (2016).
- (2020, 58 cited)Jain, Paras, et al. "Checkmate: Breaking the memory wall with optimal tensor rematerialization." Proceedings of Machine Learning and Systems 2 (2020): 497-511.
- (2018, 158 cited)Live Analysis: Wang, Linnan, et al. "Superneurons: Dynamic GPU memory management for training deep neural networks." Proceedings of the 23rd ACM SIGPLAN symposium on principles and practice of parallel programming. 2018.
CPU offload
- (2020, 14 cited)Pudipeddi, Bharadwaj, et al. "Training large neural networks with constant memory using a new execution algorithm." arXiv preprint arXiv:2002.05645 (2020).
- (2016, 298 cited)Rhu, Minsoo, et al. "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design." 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.
- (2018, 24 cited)Le, Tung D., et al. "Tflms: Large model support in tensorflow by graph rewriting." arXiv preprint arXiv:1807.02037 (2018).
Training optimizers
- (2019, 366 cited)You, Yang, et al. "Large batch optimization for deep learning: Training bert in 76 minutes." arXiv preprint arXiv:1904.00962 (2019).