NVIDIA SHARP; Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -22; Local Port validation failed. error
Running reduce-scatter with NVIDIA SHARP fails with the following error.
[snail03:1:40413 unique id 139630260094914][2025-04-22 13:45:47] DEBUG collect_ports_data: found valid device (device mlx5_2 port 1) in at index 0
[snail02][Apr 22 13:45:48 344400][SR ][45620][error] - no AM service record found(SA query)
[snail02][Apr 22 13:45:48 376295][RDMA_SR][45620][error] - Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -22
[snail02][Apr 22 13:45:48 376381][RDMA_SR][45620][error] - Error occured during connection event handle
[snail02][Apr 22 13:45:51 379511][RDMA_SR][45620][error] - poll failed due to poll_timeout=3000.000000, stop
[snail02][Apr 22 13:45:51 379626][RDMA_SR][45620][error] - Poll failed
[snail02][Apr 22 13:45:51 379686][RDMA_SR][45620][error] - Failed to connect
[snail02][Apr 22 13:45:51 379803][RDMA_SR][45620][error] - rdma_resolve_addr failed with error: -1
[snail02][Apr 22 13:45:51 379891][RDMA_SR][45620][error] - rdma_resolve_addr failed with error: -1
[snail02][Apr 22 13:45:51 379958][SR ][45620][error] - unable to query AM service record(AM query)
[snail02][Apr 22 13:45:51 380013][GENERAL][45620][error] - Could not query AM address, error: -52
[snail02][Apr 22 13:45:51 380066][GENERAL][45620][error] - unable to connect to AM
[snail02][Apr 22 13:45:51 380120][GENERAL][45620][warn ] - SHARPD_OP_CREATE_JOB failed with status: 52
[snail02:0:45620 unique id 139630260094914][2025-04-22 13:45:51] ERROR No Aggregation Manager (sharp_am) detected in sharp_create_job.
Checking the service status on the node that runs sharp_am (SHARP Aggregation Manager) shows that the service has failed.
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.8.0
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/sharp_am.service.d
└─Service.conf
Active: failed (Result: exit-code) since Wed 2024-09-04 16:35:46 JST; 23min ago
Process: 1406820 ExecStart=/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O $CONF (code=exited, status=255/EXCEPTION)
Main PID: 1406820 (code=exited, status=255/EXCEPTION)
Sep 04 16:35:22 snail01 sharp_am[1406820]: Sharp AM pid: 1406820
Sep 04 16:35:22 snail01 sharp_am[1406820]: Command line: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O -/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/conf/sharp_am.cfg
Sep 04 16:35:24 snail01 sharp_am[1406820]: Built 1 trees.
Sep 04 16:35:44 snail01 sharp_am[1406820]: Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.
Sep 04 16:35:44 snail01 sharp_am[1406820]: signal 15 received from pid: 1406820, process: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Received a graceful termination signal - Stopping sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Shutting down SHARP Aggregation Manager
Sep 04 16:35:46 snail01 sharp_am[1406820]: sharp_am exit. Return code: -1
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Main process exited, code=exited, status=255/EXCEPTION
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Failed with result 'exit-code'.
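The line that matters in this output is "Local Port validation failed": the master SM sits at LID 35, while the port sharp_am bound to has LID 2. As a minimal sketch, the two LIDs can be pulled out of the journal text as shown here. The regular expression is modelled only on the single line shown above, so the message format is an assumption and not a documented sharp_am interface; parse_lid_mismatch is a hypothetical helper name.

```python
import re

# Pattern modelled on the one log line shown above; the format is an
# assumption, not a documented sharp_am interface.
_LID_RE = re.compile(r"SM LID:\s*(\d+)\s+local LID:\s*(\d+)")


def parse_lid_mismatch(log_text):
    """Return (sm_lid, local_lid) from the first matching line, or None."""
    match = _LID_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "Sep 04 16:35:44 snail01 sharp_am[1406820]: Local Port validation failed. "
    "error: FabricProvider must bind to port with master SM "
    "(SM LID:35 local LID:2. Exiting."
)

result = parse_lid_mismatch(line)
if result is not None:
    sm_lid, local_lid = result
    if sm_lid != local_lid:
        print(f"master SM is at LID {sm_lid}, but sharp_am bound to LID {local_lid}")
```

If the two numbers differ, the master SM is running on a different port than the one sharp_am is using, which is the situation the fix below addresses.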
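Solution
It appears that when multiple subnet managers are running on other servers in the same network, sharp_am cannot bind to the correct LID. In the status output above, sharp_am insists on binding to the port that hosts the master SM (SM LID 35), but its local port has LID 2, so it exits.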
- First, stop opensm on every node: systemctl stop opensm
- Run the ibdiagnet command and confirm that the Master SM entry reads No Master SM.
- Start opensm and sharp_am on the management node: systemctl restart opensm sharp_am
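The three steps above can be sketched as a small planning script. This is only an illustration under stated assumptions: it assumes root-capable ssh access to every node, uses the host names from the logs above as example values, and takes the "No Master SM" wording from the step above (the exact ibdiagnet report text may differ between versions). recovery_plan and no_master_sm are hypothetical helper names. The script only prints the commands; it does not execute them.

```python
def recovery_plan(nodes, manager):
    """Return the commands for the three steps above, in order."""
    # Step 1: stop opensm on every node (ssh access is an assumption).
    plan = [["ssh", node, "systemctl", "stop", "opensm"] for node in nodes]
    # Step 2: run ibdiagnet and inspect its report by hand or with no_master_sm().
    plan.append(["ssh", manager, "ibdiagnet"])
    # Step 3: start opensm and sharp_am on the management node only.
    plan.append(["ssh", manager, "systemctl", "restart", "opensm", "sharp_am"])
    return plan


def no_master_sm(ibdiagnet_output):
    """True when the report contains the 'No Master SM' text from step 2."""
    return "No Master SM" in ibdiagnet_output


# Example host names taken from the logs above; adjust to the real cluster.
nodes = ["snail01", "snail02", "snail03"]
for cmd in recovery_plan(nodes, manager="snail01"):
    print(" ".join(cmd))
```

Keeping the plan as data makes it easy to review the commands before running them by hand, which matters here because stopping opensm everywhere briefly leaves the fabric without a subnet manager.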
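Notes
systemctl ... The command for controlling and managing the system manager (systemd) on Linux. It starts and manages services, and systemctl status also shows a service's recent log lines.
ibdiagnet ... Automatically discovers the InfiniBand network topology, collects error statistics for each port, and runs performance-related diagnostics. The name stands for InfiniBand diagnostic network.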