Sabrou-mal (サブロウ丸)

Mainly programming and mathematics

NVIDIA SHARP: "Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -22" and "Local Port validation failed" errors

Running reduce-scatter with NVIDIA SHARP fails with the following errors.

[snail03:1:40413 unique id 139630260094914][2025-04-22 13:45:47] DEBUG collect_ports_data: found valid device (device mlx5_2 port 1) in at index 0

[snail02][Apr 22 13:45:48 344400][SR     ][45620][error] - no AM service record found(SA query)
[snail02][Apr 22 13:45:48 376295][RDMA_SR][45620][error] - Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR,  error: -22
[snail02][Apr 22 13:45:48 376381][RDMA_SR][45620][error] - Error occured during connection event handle
[snail02][Apr 22 13:45:51 379511][RDMA_SR][45620][error] - poll failed due to poll_timeout=3000.000000, stop
[snail02][Apr 22 13:45:51 379626][RDMA_SR][45620][error] - Poll failed
[snail02][Apr 22 13:45:51 379686][RDMA_SR][45620][error] - Failed to connect
[snail02][Apr 22 13:45:51 379803][RDMA_SR][45620][error] - rdma_resolve_addr failed with error: -1
[snail02][Apr 22 13:45:51 379891][RDMA_SR][45620][error] - rdma_resolve_addr failed with error: -1
[snail02][Apr 22 13:45:51 379958][SR     ][45620][error] - unable to query AM service record(AM query)
[snail02][Apr 22 13:45:51 380013][GENERAL][45620][error] - Could not query AM address, error: -52
[snail02][Apr 22 13:45:51 380066][GENERAL][45620][error] - unable to connect to AM
[snail02][Apr 22 13:45:51 380120][GENERAL][45620][warn ] - SHARPD_OP_CREATE_JOB failed with status: 52
[snail02:0:45620 unique id 139630260094914][2025-04-22 13:45:51] ERROR No Aggregation Manager (sharp_am) detected in sharp_create_job.
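When a log cascades like this, it can help to pull out the first error line, since the later ones tend to be consequences of it. The following is a minimal sketch, assuming the bracketed lines follow the layout `[host][timestamp][component][pid][level] - message` (inferred only from the lines shown above, not from any SHARP documentation). The names `parse_line` and `first_error` are hypothetical helpers.

```python
import re

# Layout assumed from the bracketed lines above:
# [host][timestamp][component][pid][level] - message
LINE_RE = re.compile(
    r"^\[(?P<host>[^\]]+)\]\[(?P<time>[^\]]+)\]\[(?P<component>[^\]]+)\]"
    r"\[(?P<pid>\d+)\]\[(?P<level>[^\]]+)\] - (?P<message>.*)$"
)


def parse_line(line):
    """Split one bracketed log line into fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # Component and level are padded with spaces in the log ("SR     ", "warn ").
    fields["component"] = fields["component"].strip()
    fields["level"] = fields["level"].strip()
    return fields


def first_error(lines):
    """Return the parsed record of the first line whose level is 'error'."""
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec["level"] == "error":
            return rec
    return None


sample = [
    "[snail02][Apr 22 13:45:48 344400][SR     ][45620][error] - no AM service record found(SA query)",
    "[snail02][Apr 22 13:45:48 376295][RDMA_SR][45620][error] - Error event recieved: event: RDMA_CM_EVENT_ROUTE_ERROR,  error: -22",
    "[snail02][Apr 22 13:45:51 380120][GENERAL][45620][warn ] - SHARPD_OP_CREATE_JOB failed with status: 52",
]

rec = first_error(sample)
print(rec["component"], "-", rec["message"])
```

On the log above, the first error is the SA query that finds no AM service record, which suggests looking at sharp_am itself rather than at the client.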

Checking the service status on the node that runs sharp_am (the SHARP Aggregation Manager) shows that the service has failed.

● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.8.0
     Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/sharp_am.service.d
             └─Service.conf
     Active: failed (Result: exit-code) since Wed 2024-09-04 16:35:46 JST; 23min ago
    Process: 1406820 ExecStart=/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O $CONF (code=exited, status=255/EXCEPTION)
   Main PID: 1406820 (code=exited, status=255/EXCEPTION)

Sep 04 16:35:22 snail01 sharp_am[1406820]: Sharp AM pid: 1406820
Sep 04 16:35:22 snail01 sharp_am[1406820]: Command line: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am -O -/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/conf/sharp_am.cfg
Sep 04 16:35:24 snail01 sharp_am[1406820]: Built 1 trees.
Sep 04 16:35:44 snail01 sharp_am[1406820]: Local Port validation failed. error: FabricProvider must bind to port with master SM (SM LID:35 local LID:2. Exiting.
Sep 04 16:35:44 snail01 sharp_am[1406820]: signal 15 received from pid: 1406820, process: /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Received a graceful termination signal - Stopping sharp_am
Sep 04 16:35:44 snail01 sharp_am[1406820]: Shutting down SHARP Aggregation Manager
Sep 04 16:35:46 snail01 sharp_am[1406820]: sharp_am exit. Return code: -1
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Main process exited, code=exited, status=255/EXCEPTION
Sep 04 16:35:46 snail01 systemd[1]: sharp_am.service: Failed with result 'exit-code'.
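The decisive line in this output is the "Local Port validation failed" message, which carries two LIDs. As a small sketch, assuming the message always has the form `SM LID:<n> local LID:<m>` as it does here, the two values can be extracted and compared. `parse_lid_mismatch` is a hypothetical helper name.

```python
import re

# Pattern assumed from the single sharp_am message shown above.
LID_PATTERN = re.compile(r"SM LID:\s*(\d+)\s+local LID:\s*(\d+)")


def parse_lid_mismatch(line):
    """Return (sm_lid, local_lid) from a validation-failure line, or None."""
    m = LID_PATTERN.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


line = (
    "Local Port validation failed. error: FabricProvider must bind to "
    "port with master SM (SM LID:35 local LID:2. Exiting."
)

sm_lid, local_lid = parse_lid_mismatch(line)
# Different values mean the master SM is not on the port sharp_am bound to.
print(sm_lid, local_lid, sm_lid != local_lid)
```

Here the master SM is at LID 35 while the local port has LID 2, so sharp_am is not running on the port that hosts the master SM.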

Solution

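The cause appears to be that subnet managers (SMs) were running on several servers in the same network, which breaks the binding to the LID. The sharp_am log above reports the master SM at LID 35 while the local port is LID 2, so the master SM was on a different node from sharp_am, and sharp_am refuses to start unless it binds to the port where the master SM runs.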

  1. First, stop opensm on every node: systemctl stop opensm
  2. Run the ibdiagnet command and confirm that the Master SM entry reads "No Master SM".
  3. On the management node, start opensm and sharp_am: systemctl restart opensm sharp_am
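The three steps above can be written down as an ordered plan before anything is run. This is only a sketch: it builds the command list and prints it without executing anything. The node list is an assumption (the host names are the ones that appear in the logs), running ibdiagnet on the management node is also an assumption, and `recovery_plan` is a hypothetical helper.

```python
def recovery_plan(all_nodes, manager_node):
    """Return the steps as (host, argv) pairs, in order, without running them."""
    if manager_node not in all_nodes:
        raise ValueError("manager_node must be one of all_nodes")
    plan = []
    # Step 1: stop opensm everywhere so that no master SM remains.
    for node in all_nodes:
        plan.append((node, ["systemctl", "stop", "opensm"]))
    # Step 2: run ibdiagnet (assumed here to be run on the management node)
    # and check by eye that it reports "No Master SM".
    plan.append((manager_node, ["ibdiagnet"]))
    # Step 3: bring up opensm and sharp_am together on the management node.
    plan.append((manager_node, ["systemctl", "restart", "opensm", "sharp_am"]))
    return plan


# Host names taken from the logs; the actual node list is an assumption.
plan = recovery_plan(["snail01", "snail02", "snail03"], "snail01")
for host, argv in plan:
    print(f"{host}: {' '.join(argv)}")
```

Keeping the plan as data makes the ordering explicit: opensm is restarted on a single node only after it has been stopped on all of them.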

Notes

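systemctl ... A command for controlling and managing the system manager (systemd) on Linux. It can start and manage services, handle logging, and so on.

ibdiagnet ... Automatically discovers the InfiniBand network topology, collects error statistics for each port, and runs performance-related diagnostics. The name comes from "InfiniBand diagnostic network".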