Sabrou-mal サブロウ丸

Mainly programming and mathematics

Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) ' .. pid 175547: Test failure common.cu:891

I hit the following error while running nccl-tests.

tateiwa@snail01:/data/nccl-tests$ NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 175547 on    snail01 device  0 [0x08] NVIDIA A100-PCIE-40GB
#  Rank  1 Group  0 Pid 175547 on    snail01 device  1 [0x09] NVIDIA A100-PCIE-40GB
snail01:175547:175547 [0] NCCL INFO Bootstrap : Using ibs4:192.168.100.217<0>
snail01:175547:175547 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.19.4+cuda12.4

snail01:175547:175850 [0] enqueue.cc:43 NCCL WARN Cuda failure 'named symbol not found'

snail01:175547:175851 [1] enqueue.cc:43 NCCL WARN Cuda failure 'named symbol not found'

... (omitted) ...

snail01:175547:175850 [0] enqueue.cc:43 NCCL WARN Cuda failure 'named symbol not found'

snail01:175547:175851 [1] enqueue.cc:54 NCCL WARN Cuda failure 'named symbol not found'

snail01:175547:175850 [0] enqueue.cc:54 NCCL WARN Cuda failure 'named symbol not found'
snail01:175547:175851 [1] NCCL INFO init.cc:1364 -> 1
snail01:175547:175850 [0] NCCL INFO init.cc:1364 -> 1
snail01:175547:175851 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
snail01:175547:175850 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
snail01:175547:175547 [1] NCCL INFO group.cc:418 -> 1
snail01:175547:175547 [1] NCCL INFO group.cc:95 -> 1
snail01:175547:175547 [1] NCCL INFO init.cc:1728 -> 1
snail01: Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. snail01 pid 175547: Test failure common.cu:891
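
Most of this log is noise; the lines that matter are the `NCCL WARN` ones. As a rough sketch (the function name `summarize_nccl_warnings` is my own and not part of NCCL or nccl-tests), a small filter can collapse the repeated warnings into one line per source location with a count. It assumes the warnings have the same `file:line NCCL WARN message` shape as in the log above.

```shell
# Summarize NCCL WARN lines from a log read on stdin.
# Prints one line per distinct "file:line message", prefixed by its count.
# Assumes lines shaped like:
#   host:pid:tid [rank] enqueue.cc:43 NCCL WARN Cuda failure '...'
summarize_nccl_warnings() {
  grep 'NCCL WARN' \
    | sed -E 's/^.* ([^ ]+:[0-9]+) NCCL WARN (.*)$/\1 \2/' \
    | sort \
    | uniq -c \
    | sed -E 's/^ +//'
}
```

Usage would be along the lines of `NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2 2>&1 | summarize_nccl_warnings`. For the run above this boils the output down to the same `Cuda failure 'named symbol not found'` message at `enqueue.cc:43` and `enqueue.cc:54`.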

Environment

  • OS: Ubuntu 20.04.6 LTS x86_64
  • NCCL: 2.19.4+cuda12.4
  • nccl-tests: commit 9d26b8422ba76c098df996b96e13b8ddf3a71165 (HEAD -> master, origin/master, origin/HEAD)

Solution

Use the latest version of NCCL. The version I installed:

  • NCCL
    • version: v2.22.3-1
    • commit: 178b6b759074597777ce13438efb0e0ba625e429 (HEAD -> master, tag: v2.22.3-1, origin/master, origin/HEAD)
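
# NCCL install
git clone https://github.com/NVIDIA/nccl.git
cd nccl
# Optional: pin to the version listed above instead of the current master
# git checkout v2.22.3-1
make -j src.build
sudo make install

# nccl-tests build (rebuild from scratch so it links against the new NCCL)
cd /data/nccl-tests   # the nccl-tests checkout used in the session above
make clean
make MPI=1 -j         # MPI=1 requires an MPI installation
NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2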
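
After rebuilding, it is worth confirming that the test binary really picks up the new library rather than the old 2.19.4 one. The following is only a sketch: `nccl_version_from_log` and `check_nccl_version` are hypothetical helper names of mine, and they assume the newer NCCL still prints a `NCCL version ...` line in the same format as the log above.

```shell
# Extract the version string from the first "NCCL version ..." line on stdin.
nccl_version_from_log() {
  sed -n -E 's/^.*NCCL version ([^ ]+).*$/\1/p' | head -n 1
}

# Succeed only if the version found in the log (stdin) starts with $1.
check_nccl_version() {
  expected="$1"
  actual="$(nccl_version_from_log)"
  echo "NCCL version in log: ${actual:-not found}"
  [ -n "$actual" ] || return 1
  case "$actual" in
    "$expected"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2 2>&1 | check_nccl_version 2.22.3` should return success once the new build is in use. If it still reports the old version, `ldd ./build/all_reduce_perf | grep nccl` shows which `libnccl` the binary resolves to.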