INFINIBAND:
From Wikipedia: "InfiniBand (IB) is a computer networking communications standard used in high-performance computing that features very high throughput and very low latency. [...] InfiniBand provides remote direct memory access (RDMA) capabilities for low CPU overhead."
The key here is to understand that InfiniBand is designed so that servers and storage talk directly, roughly speaking, from one memory region to another. They bypass the classical network stack, which allows much higher rates, comparable to an internal memory bus. They do this via remote direct memory access (RDMA).
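A quick way to see this memory-to-memory behaviour in practice is an RDMA bandwidth test between two hosts. A minimal sketch, assuming the perftest package is installed on both nodes; the HCA device name (mlx5_0) and the host name are placeholders:
```
# on the "server" node: start a listener (ib_write_bw is part of the perftest package)
ib_write_bw -d mlx5_0 --report_gbits

# on the "client" node: run RDMA writes against the server's memory
# (mlx5_0 and server-node are placeholders; adjust to your environment)
ib_write_bw -d mlx5_0 --report_gbits server-node
```
The reported bandwidth should be close to the link rate, with very little CPU load on either side.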
We can run routing protocols over InfiniBand, but in our case the setup is very simple: Mellanox InfiniBand switches create a high-performance fabric between a cluster of servers and the DDN storage (the controllers, in InfiniBand jargon).
Terms:
Useful MLNX-OS commands:
```
sh interfaces | i state       ! the commands below are under privileged exec mode
show run
show interfaces ib status     ! note: this one is under config mode
show guids                    ! to see the switch group identifier (like the switch main MAC address)
fae sminfo                    ! to show who is acting as SM master. Note it can be a server or the switch itself;
                              ! if the latter, 'show guids' and 'fae sminfo' return the same value
fae ibnetdiscover             ! shows everything connected to the switches (servers and storage controllers) in a verbose format
```
Useful server-side IB commands:
```
ibstatus    # server InfiniBand (IB) interface status
```
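Besides ibstatus, a few other commands are commonly available from the infiniband-diags and libibverbs-utils packages (package names may vary per distribution); a short, non-exhaustive list:
```
ibstat          # per-HCA port state, rate, width and GUIDs
ibv_devinfo     # verbs-level view of the HCAs
sminfo          # LID/GUID of the node currently acting as subnet manager
ibnetdiscover   # scan the fabric topology from the server side
iblinkinfo      # per-link state, speed and width for the whole fabric (may need root)
perfquery       # port counters (symbol errors, link downs, etc.)
```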
SUBNET MANAGER (opensm)
The InfiniBand subnet manager works in two planes: the SM-config plane (which smnode is master/standby for coordinating the configuration sync-up between the switches) and the OpenSM plane itself (which instance is actually elected master and manages the fabric).
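From a Linux server, a quick way to check which node is actually acting as SM master and whether opensm is running locally; a minimal sketch (the opensm service name is distro/packaging dependent):
```
sminfo                    # LID, GUID, priority and state of the currently active subnet manager
systemctl status opensm   # is opensm running on this server? (service name may differ)
```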
Tshoot commands:
```
show ib smnodes
show ib smnode nyzsfsll51 sm-state
show guids                          # so we can identify the MACs (GUIDs)
fae sminfo
fae ibnetdiscover                   # 'scans' the whole fabric and gives us a topology of all elements found, a bit like LLDP in Ethernet
show ib smnode NYZSFSLL02 ha-role   # shows the current SM HA mode
```
If we look at the command prompt of the two switches we see:
```
server1 [serversmname: standby] #
server2 [serversmname: master] #
```
The master/standby here refers to the SM-config master node (the one just coordinating the configuration sync-up). It does not refer to the OpenSM cluster master smnode.
Initial setup:
https://docs.mellanox.com/display/MLNXOSv381000/Getting+Started
MELLANOX UPGRADE
First, scp the image to one of the Linux servers, preferably one in the same region as the switch.
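For example (the username, server name and path below are placeholders taken from the fetch command further down):
```
# copy the MLNX-OS image to a Linux server the switch can reach via scp
scp image-X86_64-3.6.6162.img myusername@servername:/lxhome/santosja/firmware/
```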
Then do the following on the switch (this is an example):
```
conf t
image delete XXX      ! delete old images, if any exist
image fetch scp://myusername:mypassword@servername/lxhome/santosja/firmware/image-X86_64-3.6.6162.img
image install image-X86_64-3.6.6162.img
image boot next
configuration write
reload
```
The upgrade itself (after the reload) takes 3-4 minutes.
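A quick sanity check after the reload, assuming standard MLNX-OS show commands:
```
show version    ! confirm the switch booted the new release
show images     ! list installed images and which one boots next
```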
To downgrade, see https://docs.mellanox.com/display/MLNXOSv391906/Downgrading+OS+Software
TROUBLESHOOTING CASE: DEGRADED HDR100 LINK
Symptom: MPI (Message Passing Interface) jobs between two GPU nodes are limited to ~12 Gb/s, while other nodes on the HDR100 fabric reach 100 Gb/s.
Checks: `ibstat` on the affected node shows:
```
Port 1: State: Active
Physical state: LinkUp
Rate: 25 Gb/sec
Width: 1X
```
→ Link has fallen back to 1 lane × 25 Gb/s instead of 4 lanes × 25 Gb/s.
Fix: Replaced the QSFP56 cable. `ibstat` now reports Rate 100 Gb/s, Width 4X. Bandwidth test (`ib_write_bw`) confirms ~97 Gb/s performance.
Root Cause: One faulty cable lane caused fallback to 1X. Basic InfiniBand tools helped quickly identify and resolve the issue.
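To catch the same kind of problem fabric-wide before users notice it, something like the following can help (infiniband-diags and ibdiagnet, assuming they are installed):
```
# list every fabric link with its width and speed, then look for links that trained below 4X
# (the grep is a rough filter; adjust it to the exact iblinkinfo output on your system)
iblinkinfo | grep -E "1X|2X"

# optional: full fabric diagnostic, which also reports width/speed degradation
ibdiagnet
```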