The following error occurred while running nccl-tests.

    tateiwa@snail01:/data/nccl-tests$ NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2
    # nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    # Rank 0 Group 0 Pid 175547 on snail01 device 0 [0x08] NVIDIA A100-PCIE-40GB
    # Rank 1 Group 0 Pid 175547 on snail01 device 1 [0x09] NVIDIA A100-PCIE-40GB
    snail01:175547:175547 [0] NCCL INFO Bootstrap : Using ibs4:192.168.100.217<0>
    snail01:175547:175547 [0] NCCL INFO cudaDriverVersion 12060
    NCCL version 2.19.4+cuda12.4
    snail01:175547:175850 [0] enqueue.cc:43 NCCL WARN Cuda failure 'named symbol not found'
    snail01:175547:175851 [1] enqueue.cc:43 NCCL WARN Cuda failure 'named symbol not found'
    ... (omitted) ...
    snail01:175547:175850 [0] enqueue.cc:43 NCCL WARN Cuda failure 'named symbol not found'
    snail01:175547:175851 [1] enqueue.cc:54 NCCL WARN Cuda failure 'named symbol not found'
    snail01:175547:175850 [0] enqueue.cc:54 NCCL WARN Cuda failure 'named symbol not found'
    snail01:175547:175851 [1] NCCL INFO init.cc:1364 -> 1
    snail01:175547:175850 [0] NCCL INFO init.cc:1364 -> 1
    snail01:175547:175851 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
    snail01:175547:175850 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
    snail01:175547:175547 [1] NCCL INFO group.cc:418 -> 1
    snail01:175547:175547 [1] NCCL INFO group.cc:95 -> 1
    snail01:175547:175547 [1] NCCL INFO init.cc:1728 -> 1
    snail01: Test NCCL failure common.cu:1005 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
     .. snail01 pid 175547: Test failure common.cu:891

Environment
- OS: Ubuntu 20.04.6 LTS x86_64
- NCCL: 2.19.4+cuda12.4
- nccl-tests: commit 9d26b8422ba76c098df996b96e13b8ddf3a71165 (HEAD -> master, origin/master, origin/HEAD)
Solution
Use the latest version of NCCL.
- NCCL
  - version: v2.22.3-1
  - commit: 178b6b759074597777ce13438efb0e0ba625e429 (HEAD -> master, tag: v2.22.3-1, origin/master, origin/HEAD)

    # NCCL install
    git clone https://github.com/NVIDIA/nccl.git
    cd nccl
    make -j src.build
    sudo make install

    # nccl-tests build
    cd nccl-tests
    make clean
    make MPI=1 -j
    NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2
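
After rebuilding, it may be worth confirming that the rerun actually picked up the new NCCL rather than the old library. The failing log contains a line of the form `NCCL version 2.19.4+cuda12.4`, so a small filter over the `NCCL_DEBUG=INFO` output can pull the version out. The helper name `nccl_version_from_log` is hypothetical (it is not part of NCCL or nccl-tests), and the sketch assumes the version line keeps that format.

```shell
# Hypothetical helper (not part of NCCL or nccl-tests):
# read an NCCL_DEBUG=INFO log on stdin and print the first NCCL version found.
# Assumes the log contains a line like "NCCL version 2.19.4+cuda12.4".
nccl_version_from_log() {
    sed -n 's/^.*NCCL version \([0-9][0-9.]*\).*$/\1/p' | head -n 1
}
```

Possible usage: `NCCL_DEBUG=INFO ./build/all_reduce_perf -g 2 2>&1 | nccl_version_from_log`. If this still prints the old version, nccl-tests is probably still linked against the previous NCCL build.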