__**INFINIBAND**__:
\\
{{ :network_stuff:infiniband_technical_guide_for_network_engineers.pdf |}}
\\
From [[https://en.wikipedia.org/wiki/InfiniBand|wikipedia]]: InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. [..] InfiniBand provides remote direct memory access (RDMA) capabilities for low CPU overhead. More info [[https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pd|here]].
  * RDMA provides direct access from the memory of one computer to the memory of another without involving either computer's operating system. This technology enables high-throughput, low-latency networking with low CPU utilization.
    * Mellanox provides RDMA via the OFED package.
  * **LID**: Local Identifier. All devices in a subnet have a LID; routing between different subnets is done on the basis of a **Global Identifier (GID)**. (See the command sketch after this list.)
  * **GID**: another identifier, used to route BETWEEN SUBNETS. It contains a Subnet Prefix and a GUID (Globally Unique Identifier).
  * NSD (Network Shared Disks): in our context, the NSD servers connect to the storage via the Mellanox switch and share the NSDs with the clients, creating a sort of distributed logical disk (a bit like the HyperFlex technology). In particular, in our setup the servers don't share their local disks; they expose the DDN's disks.
  * **SM (Subnet Manager)**: performs the tasks required by the InfiniBand specification for initializing InfiniBand hardware. Exactly one SM must be running in each InfiniBand subnet. It is run by the OpenSM daemon, which can run both on the switches and on the servers.
    * The SM master is the node actually acting as SM. The node with the highest priority [0-15] wins the election.
    * In our setup, the servers all have priority 14 while the switch has priority 15.
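
A minimal sketch for inspecting these identifiers from a server, assuming the standard infiniband-diags tools shipped with OFED are installed:

  ibstat    # per-port state, LID and SM LID of the local HCA(s)
  ibaddr    # GID and LID of the local port
  sminfo    # queries the master SM: its LID, GUID, priority and state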
      
Useful server-side IB commands:
  ibstatus   # server InfiniBand (IB) interface status + HCA model
  # to ping, run one instance on the server (LID 20) and one on the client (LID 10): even ping makes RDMA calls
  ibping -S -L 10
  ibping -L 20 -c 10 -n 3
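
To find the LIDs and GUIDs to target, a hedged sketch using the same infiniband-diags toolset:

  ibhosts          # list the HCAs (hosts) seen on the fabric, with their GUIDs
  ibswitches       # list the switches on the fabric
  ibnetdiscover    # walk the whole topology, printing the LID of every port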
  
  
  
Tshoot commands (switch side):
  show ib smnodes    # smquery can also be useful for information about the SM itself
  show ib smnode nyzsfsll51 sm-state
  show guids   # so we can identify the MACs
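
Roughly the same information can be gathered from a server; a sketch assuming the Mellanox/NVIDIA OFED tools are installed:

  sminfo       # which node is currently the master SM, and with what priority
  ibdiagnet    # full fabric sweep; writes error, counter and topology reports under /var/tmp/ibdiagnet2 by default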
  
----
===== Simple InfiniBand Troubleshooting Case (1): Degraded Link Width =====
  
**Symptom:**
MPI (Message Passing Interface) jobs between two GPU nodes are limited to ~12 Gb/s, while other nodes on the HDR100 fabric reach 100 Gb/s.

**Checks:**
  * `ibstat` on the slow node shows:
    Port 1: State: Active
            Physical state: LinkUp
            Rate: 25 Gb/sec
            Width: 1X
  
    → The link has fallen back to 1 lane × 25 Gb/s instead of 4 lanes × 25 Gb/s.

  * `iblinkinfo -r` confirms that only **lane 1** is active and reports symbol errors on that port.

  * `perfquery -x` shows increasing **SymbolErrorCounter** and **LinkErrorRecoveryCounter** values on the affected switch port (see the counter sketch below).
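
A minimal sketch for reading and, after the repair, clearing those counters; the LID (20) and port number (1) are placeholder values:

  perfquery -x 20 1    # extended counters for LID 20, port 1
  perfquery -R 20 1    # read and then reset the counters, so fresh errors stand out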
 + 
**Fix:**
Replaced the QSFP56 cable. `ibstat` now reports **Rate 100 Gb/s, Width 4X**, and a bandwidth test (`ib_write_bw`, sketched below) confirms ~97 Gb/s.
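
A sketch of that bandwidth test, using the perftest package; the hostname node-a is a placeholder:

  ib_write_bw           # on node A: wait for a client connection
  ib_write_bw node-a    # on node B: run the test against node A and report the average bandwidth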

**Root Cause:**
One faulty cable lane forced the link to renegotiate at 1X. The basic InfiniBand tools (`ibstat`, `iblinkinfo`, `perfquery`) allowed a quick diagnosis.


===== Simple InfiniBand Troubleshooting Case (2): Subnet Manager Misconfiguration =====

**Symptom:**
Some nodes fail to establish MPI communication or exhibit long startup times; `ibstat` shows ports stuck in "Initializing" or "Down".

**Checks:**
  * On affected nodes, `ibstat` shows:
    Port 1: State: Down
            Physical state: Polling
            Rate: 100 Gb/sec (Expected)
    → The port sees light but is not becoming Active.

  * `ibv_devinfo` shows the device present but no active port.

  * No errors were found in hardware or cabling.
  
  * On a working node, `sminfo` reports a different Subnet Manager than the one expected:

  SM lid 1, lmc 0, smsl 0, priority 5, state: master
  smguid 0x... "server-X"
  
**Fix:**
Two Subnet Managers (SMs) were active with equal priority, causing instability. Disabled the `opensm` instance on the unintended node; after restarting `opensm` on the correct node, all ports transitioned to "Active".
  
**Root Cause:**
Multiple Subnet Managers with conflicting roles caused port initialization to stall or flap. Ensuring a single master SM with the correct priority resolved the issue (see the sketch below).
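
A minimal sketch of enforcing a single master SM, assuming `opensm` runs under systemd; which node plays which role is a placeholder:

  systemctl stop opensm && systemctl disable opensm    # on the unintended node
  opensm -p 15    # on the intended node: raise its SM priority (0-15, highest wins); sm_priority in /etc/opensm/opensm.conf is the persistent equivalent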
  
  