INFINIBAND: InfiniBand is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its ultra-low latency, high throughput, and lossless performance. It plays a crucial role in enabling efficient communication between nodes, especially for distributed machine learning tasks where large amounts of data need to be shared rapidly across many devices.

Key characteristics of InfiniBand in AI networking:

  • Low Latency: InfiniBand offers latencies as low as 1-2 microseconds, which is critical for AI workloads that require predictable performance across nodes.
  • High Bandwidth: Supports bandwidths of up to 400 Gb/s per link (InfiniBand NDR; HDR links run at 200 Gb/s), allowing the transfer of the massive datasets needed in AI model training.
  • Lossless Transmission: InfiniBand is lossless by design, using credit-based link-level flow control so packets are not dropped under congestion. This is essential for AI workloads that cannot tolerate retransmissions (e.g., when training deep learning models).
  • Remote Direct Memory Access (RDMA): One of the most important features of InfiniBand, RDMA allows direct memory-to-memory transfers between nodes without involving the CPU. This is crucial in reducing CPU overhead and accelerating data transfers, making it ideal for AI training where rapid data sharing is required between nodes.
  • Self-Healing: InfiniBand has built-in self-healing capabilities, which means that in the event of a failure or congestion in a link, it can reroute traffic dynamically to ensure continuous operation.
  • Queue Pair Communication: InfiniBand uses Queue Pairs (QP), consisting of a send queue and a receive queue, for managing communication between nodes.
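As a quick sanity check on what these bandwidth figures mean in practice, the transfer time for a bulk exchange can be estimated with simple arithmetic (the 100 GB payload below is an arbitrary, illustrative number, and protocol overhead is ignored):

```shell
# Rough transfer-time estimate for a 100 GB gradient/parameter
# exchange at a 400 Gb/s line rate, ignoring protocol overhead.
awk 'BEGIN {
  payload_gbit = 100 * 8    # 100 GB expressed in gigabits
  rate_gbps    = 400        # 400 Gb/s line rate per link
  printf "100 GB at 400 Gb/s: %.1f s\n", payload_gbit / rate_gbps
}'
# -> 100 GB at 400 Gb/s: 2.0 s
```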

Key operations managed by InfiniBand Verbs (the API for data transfer operations):

  • Send/Receive: For transmitting and receiving data.
  • RDMA Read/Write: To access remote memory directly.
  • Atomic Operations: Used for updating remote memory with atomicity, ensuring no race conditions in distributed systems.

Common InfiniBand verbs include:

  • ibv_post_send: This verb is used to post a send request to a Queue Pair (QP). It initiates the process of sending data from the local queue to a remote queue.
  • ibv_post_recv: This verb posts a receive request to a Queue Pair (QP). It prepares the local queue to receive incoming data from a remote queue.
  • ibv_reg_mr: This verb registers a memory region (MR) for RDMA access. It allows the application to specify a memory buffer that can be accessed directly by the InfiniBand hardware for data transfer operations.
  • ibv_modify_qp: This verb modifies the state of a Queue Pair (QP). It is used to transition the QP through various states, such as initiating a connection or resetting the QP.

InfiniBand is often deployed in AI training clusters where large-scale model training requires seamless, high-speed communication between GPUs across different servers. This makes it a popular choice in supercomputing environments and AI data centers.



From Wikipedia: InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. InfiniBand provides remote direct memory access (RDMA) capabilities for low CPU overhead.
The key here is to understand that InfiniBand is designed so that servers and storage talk directly, roughly speaking, from memory region to memory region. They bypass the classical network stack, which allows much faster rates, comparable to an internal memory bus. They do this via remote direct memory access (RDMA).
We can run routing protocols over InfiniBand but, in our case, the setup is very simple: Mellanox InfiniBand switches create a high-performance fabric between a cluster of servers and the DDN storage (controllers, in InfiniBand jargon).

Terms:

  • RDMA provides access to the memory from one computer to the memory of another computer without involving either computer’s operating system. This technology enables high-throughput and low-latency networking with low CPU utilization.
    • Mellanox provides RDMA via the OFED package
  • LID: Local Identifier. Every device in a subnet has a Local Identifier (LID); routing between different subnets is done on the basis of a Global Identifier (GID).
  • GID: Global Identifier, used to route BETWEEN subnets. It contains a subnet prefix and a GUID (Globally Unique Identifier).
  • NSD (Network Shared Disks): In our context, an NSD server is the server that connects to the storage via the Mellanox switch. The servers share the NSDs with the clients, creating a sort of distributed logical disk (a bit like the HyperFlex technology). In particular, in our setup the servers don't share their local disks; they expose the DDN's disks.
  • SM (Subnet Manager): Performs the tasks required by the InfiniBand specification to initialize InfiniBand hardware. One SM must be running in each InfiniBand subnet. It is run by the OpenSM daemon, which can run both on the switches and on the servers.
    • The SM master is the node actually acting as SM. The node with the highest priority [0-15] wins the election.
    • In our setup, the servers all have priority 14 while the switch has priority 15.
  • MAD: InfiniBand Management Datagrams. They use RMPP (Reliable Multi-Packet Transaction Protocol).
  • SRP: Discovers and connects to InfiniBand SCSI RDMA Protocol (SRP) targets in an IB fabric.
  • sysimgguid: system image GUID, identifying the system (chassis) as a whole.
  • caguid: channel adapter (HCA, i.e. the NIC) GUID.
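The GID structure described above (subnet prefix + GUID) can be shown with a quick split. The GID value below is made up for illustration:

```shell
# A GID is 128 bits: the upper 64 bits are the subnet prefix,
# the lower 64 bits are the port GUID. Example value is illustrative.
gid="fe80:0000:0000:0000:0002:c903:00ab:cdef"
echo "subnet prefix: $(echo "$gid" | cut -d: -f1-4)"
echo "guid:          $(echo "$gid" | cut -d: -f5-8)"
# -> subnet prefix: fe80:0000:0000:0000
# -> guid:          0002:c903:00ab:cdef
```

The default subnet prefix fe80::/64 is why a single unrouted IB fabric only ever needs LIDs for forwarding.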

Useful MLNX-OS commands:

> sh interfaces | i state
! Below is under privilege exec mode
show run
show interfaces ib status  ! note is under config mode
show guids  ! To see the switch group identifier (like the switch main mac address)
fae sminfo  ! To show who is acting as SM master. Note it can be a server or the switch itself. If the latter, 'show guids' and 'fae sminfo' return the same value
fae ibnetdiscover   ! Show everything connected to the switches (servers and storage controllers) in a verbose format

Useful server side ib commands

ibstatus  # server infiniband (ib) interfaces status + HCA model
ibping -S -L 10          # one node runs as the server (here LID 10)
ibping -L 20 -c 10 -n 3  # the other runs as the client
# we need one end in server mode and one in client mode,
# because even ibping makes RDMA calls

SUBNET MANAGER (opensm)
The InfiniBand subnet manager works in two planes:

  • SM-config: configuration sync. It happens over the mgmt network and relates to configuration and user management.
  • smnode-OpenSM: the cluster master (SM master). opensm is a software entity that must run in order to initialize the InfiniBand hardware (at least one per InfiniBand subnet).
    • The SM keeps forwarding state, hands out link identifiers (LIDs, the L2 identifiers) and calculates routes (the latter doesn't matter much in our setup).
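A minimal sketch of how the master election can be steered via opensm's `sm_priority` config option. The file normally lives at /etc/opensm/opensm.conf; we write a local copy here so the example runs without root:

```shell
# Fragment of an opensm.conf controlling the SM master election.
# Using a local file instead of /etc/opensm/opensm.conf for the demo.
conf=./opensm.conf
cat > "$conf" <<'EOF'
# SM priority, 0-15; the highest priority wins the master election
sm_priority 14
EOF

# Raise it to 15 so this node wins the election
sed -i 's/^sm_priority .*/sm_priority 15/' "$conf"
grep '^sm_priority' "$conf"
# -> sm_priority 15
```

After editing the real config, opensm must be restarted for the new priority to take effect.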

Tshoot commands:

show ib smnodes    # also smquery can be useful for information about the SM itself.
show ib smnode nyzsfsll51 sm-state
show guids   # so we can identify the macs
fae sminfo
fae ibnetdiscover  # this is 'scanning' all fabric and gives us a 'topology' of all elements found. a bit like lldp in ethernet.
show ib smnode  NYZSFSLL02 ha-role    # shows the current sm ha mode
If we look at the command prompt of the two switches we see:
server1 [serversmname: standby] #
server2 [serversmname: master] #
^^ master/standby here refers to the SM-config master node (the one coordinating the configuration sync-up). It does NOT refer to the smnode-OpenSM cluster master.
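To confirm who really is the OpenSM cluster master, the sminfo output can be parsed. The line below is a captured-style sample (the GUID and activity count are made up); on a live fabric you would pipe the real `sminfo` output in instead:

```shell
# Sample sminfo line; values are illustrative, not from a real fabric.
sample='sminfo: sm lid 1 sm guid 0xb8599f0300ab1234, activity count 360551 priority 15 state 3 SMINFO_MASTER'

echo "$sample" | awk '{
  for (i = 1; i <= NF; i++) {
    if ($i == "lid")      printf "SM LID: %s\n", $(i+1)
    if ($i == "priority") printf "SM priority: %s\n", $(i+1)
  }
  if (/SMINFO_MASTER/) print "state: master"
}'
# -> SM LID: 1
# -> SM priority: 15
# -> state: master
```

With the LID in hand, `show guids` (or ibnetdiscover) tells you whether that LID belongs to a server or the switch.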

Initial setup:
https://docs.mellanox.com/display/MLNXOSv381000/Getting+Started


MELLANOX UPGRADE
First scp the image to any of the Linux servers, preferably one in the same region as the switch.
Then do the following on the switch (this is an example):

conf t
image delete XXX  // delete old images, if any exist
image fetch scp://myusername:mypassword@servername/lxhome/santosja/firmware/image-X86_64-3.6.6162.img
image install image-X86_64-3.6.6162.img
image boot next
configuration write
reload

The upgrade itself (after the reload) takes 3-4 minutes.

To downgrade https://docs.mellanox.com/display/MLNXOSv391906/Downgrading+OS+Software


Simple InfiniBand Troubleshooting Case (1)

Symptom: MPI (Message Passing Interface) jobs between two GPU nodes are limited to ~12 Gb/s, while other nodes on the HDR100 fabric reach 100 Gb/s.

Checks:

  • `ibstat` on the slow node shows:
  Port 1: State: Active  
          Physical state: LinkUp  
          Rate: 25 Gb/sec  
          Width: 1X
  → Link has fallen back to 1 lane × 25 Gb/s instead of 4 lanes × 25 Gb/s.
  • `iblinkinfo -r` confirms the link is running at 1X and reports symbol errors on that port.
  • `perfquery -x` shows increasing SymbolErrorCounter and LinkErrorRecoveryCounter on the affected switch port.
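The rate check in the steps above can be scripted. Here we feed in a captured-style ibstat fragment; on the real node you would pipe `ibstat` output in directly:

```shell
# Sample ibstat fragment from a degraded port (values illustrative).
sample='Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 25'

# Flag any port whose rate has fallen below the expected HDR100 line rate.
echo "$sample" | awk '/Rate:/ {
  if ($2 + 0 < 100)
    print "WARN: degraded link rate " $2 " Gb/s (expected 100 on HDR100)"
}'
# -> WARN: degraded link rate 25 Gb/s (expected 100 on HDR100)
```

Running this across all nodes (e.g. via pdsh) is a cheap way to catch lane fallbacks before users notice slow MPI jobs.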

Fix: Replaced the QSFP56 cable. `ibstat` now reports Rate 100 Gb/s, Width 4X. Bandwidth test (`ib_write_bw`) confirms ~97 Gb/s performance.

Root Cause: One faulty cable lane caused fallback to 1X. Basic InfiniBand tools helped quickly identify and resolve the issue.

Simple InfiniBand Troubleshooting Case (2): Subnet Manager Misconfiguration

Symptom: Some nodes fail to establish MPI communication or exhibit long startup times. `ibstat` shows ports stuck in “Initializing” or “Down”.

Checks:

  • On affected nodes, `ibstat` shows:
  Port 1: State: Down
          Physical state: Polling
          Rate: 100 Gb/sec (Expected)
  → Port sees light but is not becoming Active.
  • `ibv_devinfo` shows device present but no active port.
  • No errors found in hardware or cabling.
  • On a working node, `sminfo` reports a different Subnet Manager than the one expected:
SM lid 1, lmc 0, smsl 0, priority 5, state: master
smguid 0x... "server-X"

Fix: Two Subnet Managers (SMs) were active with equal priority, causing instability. Disabled one SM (`opensm`) on the unintended node. After restarting `opensm` on the correct node, all ports transitioned to “Active”.

Root Cause: Multiple Subnet Managers with conflicting roles caused port initialisation to stall or flap. Ensuring a single master SM with correct priority resolved the issue.

network_stuff/machine_learning/networking/ai/infiniband.txt · Last modified: by jotasandoku