サブロウ丸

Mainly programming and math

Using the MPI backend for distributed processing in PyTorch

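torch.distributed provides point-to-point communication and collective communication as APIs for distributed processing, so fairly fine-grained behavior can be customized.

As of PyTorch 1.13, MPI, GLOO, and NCCL can be specified as the communication backend. The communication functions available with each backend are summarized in the official documentation.

This article walks through the build steps needed to use the MPI backend with torch.distributed.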

PyTorch installed with pip cannot use the MPI backend

Yes, the section title reads like "Kishibe Rohan Never Sleeps." The official sample code for distributed execution with torch.distributed is shown below. It uses gloo as the backend.

#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """ Distributed function to be implemented later. """
    pass

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

For an explanation of this code, see the following blog post: Torch.distributed 使い方 | でい tech blog

Now let's change the backend in this code from gloo to mpi.

 if __name__ == "__main__":
     size = 2
     processes = []
+    backend = "mpi"
     for rank in range(size):
-        p = Process(target=init_process, args=(rank, size, run))
+        p = Process(target=init_process, args=(rank, size, run, backend))
         p.start()
         processes.append(p)

Running it (Python 3.9.12, torch 1.13) produces:

Traceback (most recent call last):
  File "/home/hoge/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/hoge/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hoge/mlbench-benchmarks/tmp/sample.py", line 15, in init_process
    dist.init_process_group(backend, rank=rank, world_size=size)
  File "/home/hoge/.pyenv/versions/3.9.12_mlbench/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 486, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "/home/hoge/.pyenv/versions/3.9.12_mlbench/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 566, in _new_process_group_helper
    raise RuntimeError(
RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.

As the RuntimeError says, using the MPI backend apparently requires building PyTorch from source on a machine that has MPI installed. Fair enough...

Incidentally, this code also needs small changes depending on whether GLOO or MPI is used as the backend. The MPI version of the code appears later in this article, under "Running" > "Script file".

build pytorch (1.13.0a0+gitc6c207f) with MPI

So let's build it with openmpi. The environment is Ubuntu (Ubuntu 20.04.3 LTS (GNU/Linux 5.11.0-34-generic x86_64)).

The page linked here is extremely helpful (a godsend).

Preparation

Installing openmpi

apt-get install libopenmpi-dev automake flex

Also add the following to ~/.bashrc:

export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

The following packages also had to be installed.

python -m pip install numpy pyyaml typing_extensions
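
Before starting a long build it can help to confirm that the pieces from the preparation steps are actually visible. The following is a minimal sketch, not part of the original procedure: the function name check_build_prerequisites is made up for illustration, and it only checks that the MPI compiler wrappers and nvcc are on PATH and that the three Python modules can be imported.

```python
import importlib.util
import shutil


def check_build_prerequisites():
    """Report which tools and Python modules needed for the build are visible."""
    # Executables: the MPI compiler wrappers (from libopenmpi-dev) and nvcc (from CUDA).
    tools = {
        name: shutil.which(name) is not None
        for name in ("mpicc", "mpicxx", "nvcc")
    }
    # Import names: the pyyaml package is imported as "yaml".
    modules = {
        name: importlib.util.find_spec(name) is not None
        for name in ("numpy", "yaml", "typing_extensions")
    }
    return {"tools": tools, "modules": modules}


if __name__ == "__main__":
    report = check_build_prerequisites()
    for group, items in report.items():
        for name, found in items.items():
            print(f"{group:8s} {name:18s} {'ok' if found else 'MISSING'}")
```

If nvcc is reported as missing, the PATH setting from ~/.bashrc has probably not been loaded into the current shell yet.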

build

wget https://github.com/pytorch/pytorch/releases/download/v1.10.2/pytorch-v1.10.2.tar.gz
tar -xzf pytorch-v1.10.2.tar.gz
cd pytorch-v1.10.2

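This time I used v1.10.2.

I also modified CMakeLists.txt as follows so that the -Werror=cast-function-type option is not added at compile time. (I did this because I hit an error; some compilers may be fine without it. It is a quick hack, so there may well be a better fix.)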

$ git diff CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 698c56a46a..72e4c64e63 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -884,10 +884,10 @@ if(NOT MSVC)
   if(HAS_WERROR_FORMAT)
     string(APPEND CMAKE_CXX_FLAGS " -Werror=format")
   endif()
-  check_cxx_compiler_flag("-Werror=cast-function-type" HAS_WERROR_CAST_FUNCTION_TYPE)
-  if(HAS_WERROR_CAST_FUNCTION_TYPE)
-    string(APPEND CMAKE_CXX_FLAGS " -Werror=cast-function-type")
-  endif()
+  # check_cxx_compiler_flag("-Werror=cast-function-type" HAS_WERROR_CAST_FUNCTION_TYPE)
+  # if(HAS_WERROR_CAST_FUNCTION_TYPE)
+  #   string(APPEND CMAKE_CXX_FLAGS " -Werror=cast-function-type")
+  # endif()
   check_cxx_compiler_flag("-Werror=sign-compare" HAS_WERROR_SIGN_COMPARE)
   # This doesn't work globally so we use the test on specific
   # target_compile_options

Then run:

python setup.py clean
USE_MPI=ON CMAKE_C_COMPILER=$(which mpicc) CMAKE_CXX_COMPILER=$(which mpicxx) python setup.py build develop

Setting MAX_JOBS controls the number of parallel compile jobs. (For example, MAX_JOBS=1 USE_MPI=1 CMAKE_C_COMPILER=.....)

In my case NumPy support stopped working for some reason, so I also added the USE_NUMPY=ON option.
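
The environment variables above can also be assembled in Python, which makes it easier to see exactly what is passed to setup.py. This is only a sketch under assumptions: mpi_build_env and BUILD_COMMAND are names invented here, and the build itself is left commented out so that nothing is compiled by accident.

```python
import os
import shutil


def mpi_build_env(max_jobs=None, use_numpy=True, base_env=None):
    """Return a copy of the environment with the build switches from the text set."""
    env = dict(os.environ if base_env is None else base_env)
    env["USE_MPI"] = "ON"
    # Same idea as $(which mpicc); fall back to the bare name if it is not on PATH.
    env["CMAKE_C_COMPILER"] = shutil.which("mpicc") or "mpicc"
    env["CMAKE_CXX_COMPILER"] = shutil.which("mpicxx") or "mpicxx"
    if use_numpy:
        env["USE_NUMPY"] = "ON"
    if max_jobs is not None:
        # Number of parallel compile jobs.
        env["MAX_JOBS"] = str(max_jobs)
    return env


BUILD_COMMAND = ["python", "setup.py", "build", "develop"]

if __name__ == "__main__":
    env = mpi_build_env(max_jobs=4)
    for key in ("USE_MPI", "USE_NUMPY", "MAX_JOBS",
                "CMAKE_C_COMPILER", "CMAKE_CXX_COMPILER"):
        print(f"{key}={env[key]}")
    print("command:", " ".join(BUILD_COMMAND))
    # To actually start the build from the PyTorch source directory:
    # import subprocess; subprocess.run(BUILD_COMMAND, env=env, check=True)
```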

If compilation fails with the error below, adding BUILD_SPLIT_CUDA=ON may help.

/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax

Using /home/hoge/.pyenv/versions/3.9.12/envs/3.9.12_mp/lib/python3.9/site-packages
Finished processing dependencies for torch==1.13.0a0+gitc6c207f
-------------------------------------------------------------------------
|                                                                       |
|    It is no longer necessary to use the 'build' or 'rebuild' targets  |
|                                                                       |
|    To install:                                                        |
|      $ python setup.py install                                        |
|    To develop locally:                                                |
|      $ python setup.py develop                                        |
|    To force cmake to re-generate native build files (off by default): |
|      $ python setup.py develop --cmake                                |
|                                                                       |
-------------------------------------------------------------------------

If this output appears, the build succeeded.
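
A quick way to confirm that the resulting build really includes MPI is to ask torch.distributed directly. The sketch below uses is_available(), is_mpi_available(), is_gloo_available() and is_nccl_available() from torch.distributed. The wrapper function backend_availability is a name chosen here for illustration, and it returns an empty dict when torch cannot be imported at all.

```python
def backend_availability():
    """Return {backend: bool} for the local torch build, or {} if torch is unusable."""
    try:
        import torch.distributed as dist
    except ImportError:
        return {}
    # The per-backend checks are only meaningful when the distributed package is built in.
    if not dist.is_available():
        return {}
    return {
        "mpi": dist.is_mpi_available(),
        "gloo": dist.is_gloo_available(),
        "nccl": dist.is_nccl_available(),
    }


if __name__ == "__main__":
    status = backend_availability()
    if not status:
        print("torch.distributed is not available in this environment")
    for name, ok in status.items():
        print(f"{name}: {'available' if ok else 'not built in'}")
```

With the pip-installed package, "mpi" is expected to show up as not built in, which matches the RuntimeError shown earlier.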

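Setup for TORCH_DISTRIBUTED_DEBUG

Setting TORCH_DISTRIBUTED_DEBUG at run time makes profiling information available, but it needs a bit of setup. First, the USE_GLOG option is required when building with setup.py (reference). Enabling USE_GLOG in turn requires the google glog package to be installed. On Ubuntu, use apt-get install libgoogle-glog-dev (reference).

If everything works, running with the environment variables TORCH_SHOW_CPP_STACKTRACES=1 TORCH_DISTRIBUTED_DEBUG=DETAIL should print statistics like the following.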

Avg forward compute time: 1384401 
 Avg backward compute time: 1568068 
Avg backward comm. time: 1363581 
 Avg backward comm/comp overlap time: 952692
 labels tensor([[-0.5482,  2.3198,  0.4690, -0.2132,  0.2052]])
I0824 06:45:48.918404 38077 logger.cpp:381] [Rank 0 / 2] [before iteration 2] Training ToyModel unused_parameter_size=0 
 Avg forward compute time: 1103382 
 Avg backward compute time: 3084313 
Avg backward comm. time: 3070821 
 Avg backward comm/comp overlap time: 2678932
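
To compare these numbers across iterations, the "Avg ... time" lines can be pulled out of the log with a small parser. This is a rough sketch that assumes the log format shown above. parse_ddp_stats is a hypothetical helper, and it treats a repeated statistic name as the start of a new iteration. The values are kept as plain integers without interpreting their unit.

```python
import re

# Matches e.g. "Avg backward comm. time: 1363581" and captures name and value.
STAT_PATTERN = re.compile(r"Avg (.+?) time:\s*(\d+)")


def parse_ddp_stats(log_text):
    """Collect the 'Avg ... time' values, one dict per logged iteration."""
    records = []
    current = {}
    for line in log_text.splitlines():
        match = STAT_PATTERN.search(line)
        if not match:
            continue  # skip tensors, glog headers, and other output
        name, value = match.group(1), int(match.group(2))
        if name in current:
            # The same statistic appearing again means a new iteration started.
            records.append(current)
            current = {}
        current[name] = value
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    sample = """Avg forward compute time: 1384401
 Avg backward compute time: 1568068
Avg backward comm. time: 1363581
 Avg backward comm/comp overlap time: 952692
 Avg forward compute time: 1103382
 Avg backward compute time: 3084313
"""
    for i, record in enumerate(parse_ddp_stats(sample)):
        print(i, record)
```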

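Running

Script file

With MPI, calls such as dist.init_process_group are written differently than with GLOO: the rank and world size come from the MPI runtime instead of being passed in explicitly. Launching changes as well: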

mpirun -n 4 python mpi_test.py

The number of parallel processes is passed as an argument to the mpirun command (4 in this example).

$ cat mpi_test.py 
import platform

import torch
import torch.distributed as dist


def get_random_tensor(size, dtype):
    tensor = torch.rand(size).to(dtype=dtype)
    return tensor


def run(rank, size):
    """ Simple collective communication (all-reduce). """
    for count in range(10):
        tensor = get_random_tensor(size=100, dtype=torch.float32)
        # Sum the tensors of all ranks; every rank receives the same result.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        print("Count", count, "Rank", rank, "has data", tensor[0])


if __name__ == "__main__":
    dist.init_process_group(backend="mpi")
    size = dist.get_world_size()
    rank = dist.get_rank()
    print("Rank", rank, "in", platform.uname())

    run(rank, size)
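
For readers new to collectives, what all_reduce with ReduceOp.SUM leaves on each rank can be illustrated without MPI at all. The sketch below is a plain-Python simulation of the semantics only, not how the MPI backend is implemented. simulate_all_reduce_sum is a made-up name and lists stand in for tensors.

```python
def simulate_all_reduce_sum(per_rank_data):
    """Return what every rank holds after an all-reduce with SUM.

    per_rank_data[r] is the list of values that rank r contributes.
    """
    length = len(per_rank_data[0])
    if any(len(values) != length for values in per_rank_data):
        raise ValueError("all ranks must contribute data of the same shape")
    # Element-wise sum across ranks ...
    total = [sum(column) for column in zip(*per_rank_data)]
    # ... and every rank receives its own copy of the result.
    return [list(total) for _ in per_rank_data]


if __name__ == "__main__":
    data = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
    for rank, values in enumerate(simulate_all_reduce_sum(data)):
        print("Rank", rank, "has data", values)
```

In mpi_test.py this means that, for a given count, every rank should print the same value for tensor[0], even though each rank started from different random numbers.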

Errors I ran into and how I fixed them

RuntimeError: Missing build dependency: Unable to import yaml.

--> Resolved with pip install pyyaml

Traceback (most recent call last):
  File "/home/hoge/sample_dd/pytorch/setup.py", line 944, in <module>
    build_deps()
  File "/home/hoge/sample_dd/pytorch/setup.py", line 398, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "/home/hoge/sample_dd/pytorch/setup.py", line 454, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))
RuntimeError: Missing build dependency: Unable to `import yaml`.
Please install it via `conda install pyyaml` or `pip install pyyaml`

CMake Error at cmake/public/cuda.cmake:47 (enable_language): No CMAKE_CUDA_COMPILER could be found.

--> Resolved by setting the environment variables LD_LIBRARY_PATH and PATH

export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

CMake Error at cmake/public/cuda.cmake:47 (enable_language):
  No CMAKE_CUDA_COMPILER could be found.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:43 (include)
  CMakeLists.txt:692 (include)

ModuleNotFoundError: No module named 'typing_extensions'

--> Resolved with pip install typing_extensions

Traceback (most recent call last):
  File "/home/hoge/.pyenv/versions/3.9.12/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hoge/.pyenv/versions/3.9.12/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hoge/sample_dd/pytorch/torchgen/gen.py", line 3, in <module>
    from typing_extensions import Literal
ModuleNotFoundError: No module named 'typing_extensions'
CMake Error at cmake/Codegen.cmake:193 (message):
  Failed to get generated_headers list
Call Stack (most recent call first):
  caffe2/CMakeLists.txt:2 (include)

make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/build.make:10909: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp.o] Error 1

--> Resolved by modifying CMakeLists.txt

$ git diff CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 698c56a46a..72e4c64e63 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -884,10 +884,10 @@ if(NOT MSVC)
   if(HAS_WERROR_FORMAT)
     string(APPEND CMAKE_CXX_FLAGS " -Werror=format")
   endif()
-  check_cxx_compiler_flag("-Werror=cast-function-type" HAS_WERROR_CAST_FUNCTION_TYPE)
-  if(HAS_WERROR_CAST_FUNCTION_TYPE)
-    string(APPEND CMAKE_CXX_FLAGS " -Werror=cast-function-type")
-  endif()
+  # check_cxx_compiler_flag("-Werror=cast-function-type" HAS_WERROR_CAST_FUNCTION_TYPE)
+  # if(HAS_WERROR_CAST_FUNCTION_TYPE)
+  #   string(APPEND CMAKE_CXX_FLAGS " -Werror=cast-function-type")
+  # endif()
   check_cxx_compiler_flag("-Werror=sign-compare" HAS_WERROR_SIGN_COMPARE)
   # This doesn't work globally so we use the test on specific
   # target_compile_options
---
In file included from /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/mpicxx.h:277,
                 from /usr/lib/x86_64-linux-gnu/openmpi/include/mpi.h:2868,
                 from /home/tateiwa/sample_dd/pytorch/torch/csrc/distributed/c10d/ProcessGroupMPI.hpp:20,
                 from /home/tateiwa/sample_dd/pytorch/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp:1:
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/op_inln.h: In member function ‘virtual void MPI::Op::Init(void (*)(const void*, void*, int, const MPI::Datatype&), bool)’:
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/op_inln.h:121:46: error: cast between incompatible function types from ‘void (*)(void*, void*, int*, ompi_datatype_t**, void (*)(void*, void*, int*, ompi_datatype_t**))’ to ‘void (*)(void*, void*, int*, ompi_datatype_t**)’ [-Werror=cast-function-type]
  121 |     (void)MPI_Op_create((MPI_User_function*) ompi_mpi_cxx_op_intercept,
      |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/ompi/mpi/cxx/op_inln.h:123:59: error: cast between incompatible function types from ‘void (*)(const void*, void*, int, const MPI::Datatype&)’ to ‘void (*)(void*, void*, int*, ompi_datatype_t**)’ [-Werror=cast-function-type]
  123 |     ompi_op_set_cxx_callback(mpi_op, (MPI_User_function*) func);
      |                                                           ^~~~
cc1plus: some warnings being treated as errors
make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/build.make:10909: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:8024: caffe2/CMakeFiles/torch_cpu.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

--> Resolved by adding the BUILD_SPLIT_CUDA=ON option when running setup.py

Error details

/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crti.o: In function `_init':
(.init+0x7): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o): In function `__sti____cudaRegisterAll()':
tmpxft_000080b4_00000000-6_reduce_scatter_0_9.compute_86.cudafe1.cpp:(.text.startup+0x11): relocation truncated to fit: R_X86_64_PC32 against symbol `__fatbinwrap_924f2f8d_21_reduce_scatter_0_9_cu_2e887396' defined in .nvFatBinSegment section in ../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o)
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x20): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x34): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x7c): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0xa8): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0xd4): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x100): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x114): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x128): relocation truncated to fit: R_X86_64_PC32 against `.text'
../nccl/lib/libnccl_slim_static.a(reduce_scatter_sum_bf16.o):(.eh_frame+0x13c): additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
collect2: error: ld returned 1 exit status
caffe2/CMakeFiles/torch_cuda.dir/build.make:6724: recipe for target 'lib/libtorch_cuda.so' failed
make[2]: *** [lib/libtorch_cuda.so] Error 1
CMakeFiles/Makefile2:5953: recipe for target 'caffe2/CMakeFiles/torch_cuda.dir/all' failed
make[1]: *** [caffe2/CMakeFiles/torch_cuda.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2

Fix: add BUILD_SPLIT_CUDA=ON when running setup.py.

BUILD_SPLIT_CUDA=ON USE_GLOG=ON USE_NUMPY=ON USE_CUDA=ON USE_MPI=ON CMAKE_C_COMPILER=$(which mpicc) CMAKE_CXX_COMPILER=$(which mpicxx) python setup.py build develop install build install

References