Sabrou-mal サブロウ丸

Mainly programming and mathematics

Survey: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

https://dl.acm.org/doi/10.1145/3458817.3476209

  • paper:
@inproceedings{10.1145/3458817.3476209,
  author = {Narayanan, Deepak and Shoeybi, Mohammad and Casper, Jared and LeGresley, Patrick and Patwary, Mostofa and Korthikanti, Vijay and Vainbrand, Dmitri and Kashinkunti, Prethvi and Bernauer, Julie and Catanzaro, Bryan and Phanishayee, Amar and Zaharia, Matei},
  title = {Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM},
  year = {2021},
  isbn = {9781450384421},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3458817.3476209},
  doi = {10.1145/3458817.3476209},
  abstract = {Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  articleno = {58},
  numpages = {15},
  location = {St. Louis, Missouri},
  series = {SC '21}
}

Background

  • Language models keep getting larger, so in some cases the model no longer fits even on a multi-GPU server
  • Even when it does fit, training requires so much computation that it takes an extremely long time

  • Intra-layer parallelism (e.g., mesh-transformer-style tensor parallelism) runs efficiently only up to models of roughly 20 billion parameters (on an NVIDIA DGX A100 server)

    • all-reduce becomes much slower once it has to cross server boundaries
    • when the matrix multiplications become too fine-grained, GPU utilization drops
  • Pipeline parallelism requires synchronization across devices (pipeline flushes) to preserve training semantics, and the resulting idle time is substantial
    • GPipe reportedly spends about 50% of its execution time on pipeline flushes

What is it?

  • Develops and evaluates an interleaved 1F1B pipeline schedule based on PipeDream's pipelining
  • Evaluates training throughput when data parallelism, tensor model parallelism, and pipeline model parallelism are combined (see the sketch after this list)
  • The first author is the same as the first author of PipeDream
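
As a concrete picture of what "combining" means here, the paper describes a configuration by the triple (p, t, d) of pipeline-, tensor-, and data-parallel degrees, with p * t * d = n GPUs. Below is a minimal sketch that just enumerates the valid triples for a given GPU count; the helper name is mine, not from the paper or its codebase.

# Enumerate ways to split n GPUs into pipeline (p) x tensor (t) x data (d)
# parallel degrees, following the paper's notation n = p * t * d.
# The function name is hypothetical, for illustration only.
from itertools import product

def parallel_configs(n_gpus: int):
    configs = []
    for p, t in product(range(1, n_gpus + 1), repeat=2):
        if n_gpus % (p * t) == 0:
            configs.append((p, t, n_gpus // (p * t)))
    return configs

if __name__ == "__main__":
    for p, t, d in parallel_configs(16):
        print(f"pipeline={p:2d}  tensor={t:2d}  data={d:2d}")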

How does it improve on prior work?

  • The proposed interleaved pipeline schedule improves throughput by 10%+
  • The memory footprint stays comparable to existing approaches
  • A GPT model with 1 trillion parameters becomes trainable in roughly 3 months

What is the key technique?

  • Each device holds two non-adjacent blocks of contiguous layers (called model chunks in the paper)
    • e.g., instead of holding layers 1-4, a device holds layers 1-2 and layers 9-10
    • This lets each device hand a micro-batch off to the next device sooner, reducing idle time
    • As a result the pipeline bubble (the idle period within each pipeline cycle) is shortened (see the sketch after this list)
    • The price is increased communication volume (mitigated by spreading the inter-server traffic and by a scatter/gather optimization that aggregates data over fast NVLink within a server)
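
The effect on the bubble can be made concrete with the paper's analysis: with p pipeline stages, m micro-batches per batch, and v model chunks per device, the ideal bubble-time fraction is (p - 1) / (v * m), so interleaving with v = 2 halves it. A minimal sketch; the example values of p and m are mine.

def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    # Bubble-time fraction from the paper's analysis: (p - 1) / (v * m).
    # p: pipeline stages, m: micro-batches per batch,
    # v: model chunks per device (v = 1 is the non-interleaved 1F1B schedule).
    return (p - 1) / (v * m)

if __name__ == "__main__":
    p, m = 8, 32
    print(f"1F1B (v=1)        : {bubble_fraction(p, m, v=1):.3f}")  # 0.219
    print(f"interleaved (v=2) : {bubble_fraction(p, m, v=2):.3f}")  # 0.109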

How was it validated?

  • Implemented in PyTorch
  • From the measured GPT training throughput, they estimate that a GPT model with 1 trillion parameters could be trained in roughly 3 months (see the sketch after this list)
  • Comparing runs with and without interleaving confirms the throughput improvement
    • (although the gap narrows as the batch size grows)
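
The 3-month figure can be roughly reproduced with the paper's approximation for end-to-end training time, time ≈ 8TP / (nX), where T is the number of training tokens, P the parameter count, n the number of GPUs, and X the achieved per-GPU throughput. A sketch using the numbers from the abstract (502 petaFLOP/s aggregate on 3072 GPUs, i.e. about 163 teraFLOP/s per GPU) and a token budget of roughly 450 billion (the figure used in the paper's estimate):

def training_days(tokens: float, params: float, n_gpus: int, flops_per_gpu: float) -> float:
    # End-to-end time ~ 8 * T * P / (n * X); the factor of 8 accounts for the
    # forward pass, backward pass, and activation recomputation per token and parameter.
    total_flops = 8 * tokens * params
    seconds = total_flops / (n_gpus * flops_per_gpu)
    return seconds / 86_400  # seconds per day

if __name__ == "__main__":
    days = training_days(tokens=450e9, params=1e12, n_gpus=3072, flops_per_gpu=163e12)
    print(f"~{days:.0f} days")  # ~83 days, i.e. roughly 3 months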

Any discussion points?

  • Model partitioning and the assignment of compute resources are done by hand
  • As in PipeDream, weight updates and the backward pass get out of step, so (so-called) gradient staleness can arise and hurt training

What to read next

Citation counts are from Google Scholar as of May 13, 2022.

  • Pipeline-related

    • (2021, 15 cited) TeraPipe: Li, Zhuohan, et al. "TeraPipe: Token-level pipeline parallelism for training large-scale language models." International Conference on Machine Learning. PMLR, 2021.
    • (2021, 4 cited) PipeTransformer: He, Chaoyang, et al. "PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models." International Conference on Machine Learning. PMLR, 2021.
    • (2020, 41 cited) HetPipe: Park, Jay H., et al. "HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism." 2020 USENIX Annual Technical Conference (USENIX ATC 20). 2020: 307-321.
    • (2021, 39 cited) PipeDream-2BW: Narayanan, Deepak, et al. "Memory-Efficient Pipeline-Parallel DNN Training." International Conference on Machine Learning. PMLR, 2021: 7937-7947.
    • (2021, 29 cited) PipeMare: Yang, Bowen, et al. "PipeMare: Asynchronous pipeline parallel DNN training." Proceedings of Machine Learning and Systems 3 (2021): 269-296.
    • (2021, 7 cited) Kosson, Atli, et al. "Pipelined backpropagation at scale: training large models without batches." Proceedings of Machine Learning and Systems 3 (2021): 479-501.
    • (2019, 225 cited) PipeDream: Narayanan, Deepak, et al. "PipeDream: generalized pipeline parallelism for DNN training." Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.
    • DeepSpeed: Extreme-Scale Model Training for Everyone. https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/.
  • ZeRO

    • (2020, 186 cited) Rajbhandari, Samyam, et al. "ZeRO: Memory optimizations toward training trillion parameter models." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.
    • (2021, 38 cited) Rajbhandari, Samyam, et al. "ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021.
  • Automatic model partitioning

    • (2019, 237 cited) FlexFlow: Jia, Zhihao, Matei Zaharia, and Alex Aiken. "Beyond Data and Model Parallelism for Deep Neural Networks." Proceedings of Machine Learning and Systems 1 (2019): 1-13.
    • (2019, 225 cited) PipeDream: Narayanan, Deepak, et al. "PipeDream: generalized pipeline parallelism for DNN training." Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019.
    • (2020, 21 cited) Tarnawski, Jakub M., et al. "Efficient algorithms for device placement of DNN graph operators." Advances in Neural Information Processing Systems 33 (2020): 15451-15463.
  • Implementation: https://github.com/nvidia/megatron-lm