This shows you the differences between two versions of the page.
| network_stuff:infiniband [2022/05/20 11:48] – jotasandoku | network_stuff:infiniband [2025/10/03 16:04] (current) – jotasandoku | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| __**INFINIBAND**__: | __**INFINIBAND**__: | ||
| + | \\ | ||
| + | {{ : | ||
| \\ | \\ | ||
| From [[https:// | From [[https:// | ||
| Line 10: | Line 12: | ||
| * RDMA provides access to the memory from one computer to the memory of another computer without involving either computer’s operating system. This technology enables high-throughput and low-latency networking with low CPU utilization. | * RDMA provides access to the memory from one computer to the memory of another computer without involving either computer’s operating system. This technology enables high-throughput and low-latency networking with low CPU utilization. | ||
| * Mellanox provides RDMA via the OFED package | * Mellanox provides RDMA via the OFED package | ||
| - | * lid : local identifier (All devices in a subnet have a Local Identifier (LID)). Routing between different subnets is done on the basis of a Global Identifier (GID) | + | * **LID** |
| - | * NSD (Network Shared Disks): In our context, NSD is the server that connects to the storage via the Mellanox switch. The servers share the NSD's to the clients, creating some sort of distributed logical disk (a bit like the hyperflex technology). | + | * GID: another identifier, used to route BETWEEN SUBNETS. It contains a subnet prefix and a GUID (Globally Unique Identifier). |
| - | * SM (Subnet Manager): | + | * NSD (Network Shared Disks): In our context, NSD is the server that connects to the storage via the Mellanox switch. The servers share the NSD's to the clients, creating some sort of distributed logical disk (a bit like the hyperflex technology). In our setup in particular, the servers don't share their local disks; they expose the DDN's disks. |
| + | | ||
| * SM master is the node truly acting as SM. The node with the highest priority [0-15] wins. | * SM master is the node truly acting as SM. The node with the highest priority [0-15] wins. | ||
| * In our setup, servers all have priority 14 while switch has priority 15. | * In our setup, servers all have priority 14 while switch has priority 15. | ||
| Line 29: | Line 32: | ||
| show guids ! To see the switch GUIDs (like the switch's main MAC address) | show guids ! To see the switch GUIDs (like the switch's main MAC address) | ||
| fae sminfo | fae sminfo | ||
| + | fae ibnetdiscover | ||
| | | ||
| Useful server side ib commands | Useful server side ib commands | ||
| - | ibstatus | + | ibstatus |
| - | + | | |
| - | + | ||
| - | ---- | + | |
| - | Terms:\\ | + | |
| Line 47: | Line 47: | ||
| Tshoot commands: | Tshoot commands: | ||
| - | show ib smnodes | + | show ib smnodes |
| show ib smnode nyzsfsll51 sm-state | show ib smnode nyzsfsll51 sm-state | ||
| show guids # so we can identify the macs | show guids # so we can identify the macs | ||
| Line 68: | Line 68: | ||
| ---- | ---- | ||
| - | UPGRADE PROCEDURE: | ||
| - | [[https:// | ||
| - | (The device has two partitions. We can install the new OS in the ' | ||
| - | If not really a backup one, is the ' | ||
| - | * Configuration backup. | + | MELLANOX UPGRADE |
| - | * Better via UI: Setup → Configuration → Configuration files (Active configuration file + Binary configuration file) | + | First scp the image to any of the Linux servers, preferably one in the same region as the switch. |
| - | * If we needed | + | \\ |
| - | | + | Then do the following on the switch: |
| - | | + | |
| - | | + | |
| - | | + | image fetch scp://myusername:mypassword@servername/ |
| - | + | image install | |
| - | enable | + | |
| - | image fetch scp://[username:password@IP/image] | + | |
| - | image install | + | |
| image boot next | image boot next | ||
| + | configuration write | ||
| reload | reload | ||
| - | | + | |
| - | + | The upgrade itself (after the reload) takes 3-4 minutes. | |
| + | |||
| + | |||
| + | |||
| + | To downgrade [[https:// | ||
| ---- | ---- | ||
| + | ===== Simple InfiniBand Troubleshooting Case (1) ===== | ||
| + | |||
| + | **Symptom:** | ||
| + | MPI (Message Passing Interface) jobs between two GPU nodes are limited to ~12 Gb/s, while other nodes on the HDR100 fabric reach 100 Gb/s. | ||
| + | |||
| + | **Checks:** | ||
| + | * `ibstat` on the slow node shows: | ||
| + | |||
| + | Port 1: State: Active | ||
| + | Physical state: LinkUp | ||
| + | Rate: 25 Gb/sec | ||
| + | Width: 1X | ||
| + | |||
| + | → Link has fallen back to 1 lane × 25 Gb/s instead of 4 lanes × 25 Gb/s. | ||
| + | |||
| + | * `iblinkinfo -r` confirms only **lane 1** is active and reports symbol errors on that port. | ||
| + | |||
| + | * `perfquery -x` shows increasing **SymbolErrorCounter** and **LinkErrorRecoveryCounter** on the affected switch port. | ||
| + | |||
| + | **Fix:** | ||
| + | Replaced the QSFP56 cable. `ibstat` now reports **Rate 100 Gb/s, Width 4X**. Bandwidth test (`ib_write_bw`) confirms ~97 Gb/s performance. | ||
| + | |||
| + | **Root Cause:** | ||
| + | One faulty cable lane caused fallback to 1X. Basic InfiniBand tools helped quickly identify and resolve the issue. | ||
| + | |||
| + | |||
| + | ===== Simple InfiniBand Troubleshooting Case (2) Subnet Manager Misconfiguration ===== | ||
| + | |||
| + | **Symptom:** | ||
| + | Some nodes fail to establish MPI communication or exhibit long startup times. `ibstat` shows ports stuck in " | ||
| + | |||
| + | **Checks:** | ||
| + | * On affected nodes, `ibstat` shows: | ||
| + | Port 1: State: Down | ||
| + | Physical state: Polling | ||
| + | Rate: 100 Gb/sec (Expected) | ||
| + | → Port sees light but is not becoming Active. | ||
| + | |||
| + | * `ibv_devinfo` shows device present but no active port. | ||
| + | |||
| + | * No errors found in hardware or cabling. | ||
| + | |||
| + | * On a working node, `sminfo` reports a different Subnet Manager than the one expected: | ||
| + | |||
| + | SM lid 1, lmc 0, smsl 0, priority 5, state: master | ||
| + | smguid 0x... " | ||
| + | |||
| + | **Fix:** | ||
| + | Two Subnet Managers (SMs) were active with equal priority, causing instability. Disabled one SM (`opensm`) on the unintended node. After restarting `opensm` on the correct node, all ports transitioned to " | ||
| - | Downgrade software: | + | **Root Cause:** |
| - | [[https:// | + | Multiple Subnet Managers with conflicting roles caused port initialisation to stall or flap. Ensuring a single master SM with correct priority resolved the issue. |
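
The priority rule described on this page (highest SM priority in the range 0-15 wins, and two SMs sharing the top priority is the situation from troubleshooting case 2) can be sketched as a small sanity check. This is only an illustration, not a replacement for `sminfo` or `show ib smnodes`: the `SubnetManager` class, the function names and the node labels are all hypothetical, and the sketch deliberately does not model how a real fabric breaks a priority tie.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SubnetManager:
    """One SM candidate. Both fields are illustrative, not tool output."""
    name: str      # hypothetical label, e.g. a hostname
    priority: int  # SM priority, 0-15 (highest wins, per the notes above)

    def __post_init__(self) -> None:
        if not 0 <= self.priority <= 15:
            raise ValueError(f"SM priority must be 0-15, got {self.priority}")


def top_priority_candidates(sms: List[SubnetManager]) -> List[SubnetManager]:
    """Return every SM that holds the highest priority in the list."""
    if not sms:
        return []
    best = max(sm.priority for sm in sms)
    return [sm for sm in sms if sm.priority == best]


def expected_master(sms: List[SubnetManager]) -> Optional[SubnetManager]:
    """Return the single highest-priority SM, or None when the top priority
    is shared (or the list is empty) and the configuration needs review."""
    candidates = top_priority_candidates(sms)
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    # Mirrors the setup in these notes: switch at 15, servers at 14.
    fabric = [
        SubnetManager("switch", 15),
        SubnetManager("server-a", 14),
        SubnetManager("server-b", 14),
    ]
    master = expected_master(fabric)
    print("expected master:", master.name if master else "AMBIGUOUS")

    # Mirrors case 2: two SMs with equal priority.
    tied = [SubnetManager("node-a", 5), SubnetManager("node-b", 5)]
    if expected_master(tied) is None:
        names = ", ".join(sm.name for sm in top_priority_candidates(tied))
        print("priority conflict between:", names)
```

Returning `None` on a tie rather than picking a winner is a deliberate choice here: the point of the check is to flag the configuration for a human to fix, as was done in case 2 by disabling the unintended `opensm` instance.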