=== InfiniBand ===
**InfiniBand** is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its **ultra-low latency**, **high throughput**, and **lossless performance**. It plays a crucial role in enabling efficient communication between nodes, especially for distributed machine-learning tasks where large amounts of data must be shared rapidly across many devices. See also: [[https://camarreal.dedyn.io/doku.php?id=network_stuff:machine_learning:networking:ai:infiniband]]

Key characteristics of **InfiniBand** in AI networking:
  * **Low Latency**: InfiniBand offers **latencies as low as 1-2 microseconds**, which is critical for AI workloads that require predictable performance across nodes.
  * **High Bandwidth**: Supports **bandwidths of up to 400 Gbps** per link (with InfiniBand NDR), allowing the transfer of the massive datasets needed in AI model training.
  * **Lossless Transmission**: InfiniBand is inherently lossless, ensuring no packet loss during communication, which is essential for **AI workloads that cannot tolerate retransmissions** (e.g., when training deep learning models).
  * **Remote Direct Memory Access (RDMA)**: One of the most important features of InfiniBand, RDMA allows **direct memory-to-memory transfers** between nodes **without involving the CPU**. This reduces CPU overhead and accelerates data transfers, making it ideal for AI training where rapid data sharing is required between nodes.
  * **Self-Healing**: InfiniBand has **built-in self-healing capabilities**: in the event of a link failure or congestion, it can reroute traffic dynamically to ensure continuous operation.
  * **Queue Pair Communication**: InfiniBand uses **Queue Pairs (QP)**, each consisting of a send queue and a receive queue, for managing communication between nodes.

Key operations managed by **InfiniBand Verbs** (the API for data transfer operations):
  * **Send/Receive**: For transmitting and receiving data.
  * **RDMA Read/Write**: To access remote memory directly.
  * **Atomic Operations**: Used for updating remote memory with atomicity, ensuring no race conditions in distributed systems.

Common InfiniBand verbs include:
  * ''ibv_post_send'': Posts a send work request to a Queue Pair (QP), initiating the transfer of data from the local queue to a remote queue.
  * ''ibv_post_recv'': Posts a receive work request to a QP, preparing a local buffer to receive incoming data from a remote queue.
  * ''ibv_reg_mr'': Registers a memory region (MR) for RDMA access, letting the application specify a memory buffer that the InfiniBand hardware can access directly for data transfer operations.
  * ''ibv_modify_qp'': Modifies the state of a QP; it is used to transition the QP through its states, for example when establishing a connection or resetting the QP.
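The lifecycle implied by these verbs — post receive buffers first, then post sends that land in them — can be mimicked with a toy Python model of a queue pair. This is purely illustrative (plain Python lists stand in for hardware queues); it is not the libibverbs C API:

```python
# Toy model of Queue Pair (QP) send/receive semantics -- illustrative only,
# NOT the real libibverbs API.
from collections import deque

class QueuePair:
    def __init__(self):
        self.send_queue = deque()   # work requests waiting to be sent
        self.recv_queue = deque()   # pre-posted receive buffers
        self.completions = []       # completed work requests ("completion queue")

    def post_recv(self, buffer):
        """Like ibv_post_recv: make a buffer available for incoming data."""
        self.recv_queue.append(buffer)

    def post_send(self, data, remote_qp):
        """Like ibv_post_send with a SEND opcode: data is delivered into a
        buffer the remote side posted in advance (no buffer -> RNR error
        in real InfiniBand)."""
        if not remote_qp.recv_queue:
            raise RuntimeError("receiver not ready (RNR): no posted receive")
        buffer = remote_qp.recv_queue.popleft()
        buffer[:len(data)] = data                       # data lands remotely
        remote_qp.completions.append(("RECV", len(data)))
        self.completions.append(("SEND", len(data)))

# Two-sided send/receive handshake:
server, client = QueuePair(), QueuePair()
buf = bytearray(16)
server.post_recv(buf)                # receiver posts a buffer first
client.post_send(b"hello", server)   # then the sender transmits
```

Note the ordering: if the receive is not posted before the send arrives, real hardware raises a receiver-not-ready (RNR) NAK, which the toy model mirrors with an exception.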

InfiniBand is often deployed in **AI training clusters** where large-scale model training requires seamless, high-speed communication between GPUs across different servers. This makes it a popular choice in supercomputing environments and AI data centers.
=== RoCEv2 ===
[[https://netdevconf.info/0x19/docs/netdev-0x19-paper18-talk-slides/netdev-0x19-AI-networking-RoCE-and-netdev.pdf]]
\\
  * **Compatibility with AI Workloads**: Like InfiniBand, **RoCE** supports high-speed, low-latency, and lossless communication, making it ideal for distributed AI workloads such as training deep learning models across multiple GPUs or nodes.
  * __QP (Queue Pair)__: a fundamental concept representing an RDMA connection, consisting of a send queue and a receive queue.
  * __InfiniBand Base Transport Header (IB BTH)__: a key component of RoCEv2 packets. It is the same BTH as in native InfiniBand, but is now carried inside IP/UDP (destination UDP port 4791). Its main fields include:
    * **OpCode:** defines the type of operation being performed (e.g., SEND, RDMA WRITE, RDMA READ, ACK).
    * **Solicited Event (SE):** requests an event notification at the receiver.
    * **Pad Count:** number of bytes used to pad the payload to a 4-byte boundary.
    * **Transport Header Version (TVer):** indicates the version of the transport headers in use.
    * **Partition Key (P_Key):** identifies the partition associated with the packet.
    * **Destination Queue Pair (QP):** identifies the destination queue pair that receives the packet.
    * **Acknowledgment Request (A):** asks the receiver to generate an acknowledgment.
    * **Packet Sequence Number (PSN):** used to order packets and detect loss.
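As a rough sketch, the 12-byte BTH can be packed and parsed with Python's ''struct'' module. The field widths follow my reading of the IBTA layout and the opcode value is an assumption, so treat this as illustrative rather than authoritative:

```python
import struct

def pack_bth(opcode, se, mig, pad, tver, pkey, dest_qp, ack_req, psn):
    """Pack a 12-byte Base Transport Header (big-endian, assumed layout)."""
    byte1 = (se << 7) | (mig << 6) | (pad << 4) | tver     # SE|M|PadCnt|TVer
    return (struct.pack(">BBH", opcode, byte1, pkey)       # OpCode, flags, P_Key
            + b"\x00" + dest_qp.to_bytes(3, "big")         # reserved, Dest QP (24 bits)
            + bytes([ack_req << 7]) + psn.to_bytes(3, "big"))  # A|rsvd, PSN (24 bits)

def unpack_bth(data):
    """Recover the fields most relevant when debugging RoCEv2 captures."""
    opcode, byte1, pkey = struct.unpack(">BBH", data[:4])
    return {"opcode": opcode,
            "pkey": pkey,
            "dest_qp": int.from_bytes(data[5:8], "big"),
            "ack_req": data[8] >> 7,
            "psn": int.from_bytes(data[9:12], "big")}

# 0x04 is assumed here to be the RC "Send Only" opcode.
bth = pack_bth(opcode=0x04, se=0, mig=0, pad=0, tver=0,
               pkey=0xFFFF, dest_qp=0x17, ack_req=1, psn=1000)
```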
\\
Packet structure:
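The packet structure amounts to the InfiniBand transport payload riding inside a routable UDP/IP/Ethernet stack. A small sketch of the layering and byte offsets (untagged IPv4 case; the payload size is an arbitrary example):

```python
# Sketch of the RoCEv2 packet layering. Header sizes are the standard ones;
# UDP destination port 4791 is the IANA-assigned port for RoCEv2.
ROCEV2_UDP_PORT = 4791

layers = [
    ("Ethernet", 14),   # L2 header (FCS not counted here)
    ("IPv4",     20),   # routable IP -- what makes RoCEv2 "routable RoCE"
    ("UDP",       8),   # destination port 4791
    ("IB BTH",   12),   # InfiniBand Base Transport Header (OpCode, Dest QP, PSN)
    ("payload", 1024),  # example RDMA payload size
    ("ICRC",      4),   # invariant CRC over the IB transport headers + payload
]

offsets, cursor = {}, 0
for name, size in layers:
    offsets[name] = cursor      # byte offset where this layer starts
    cursor += size

print(f"BTH starts at byte {offsets['IB BTH']}, total frame {cursor} bytes")
```

Because everything from the BTH inward is unchanged relative to native InfiniBand, ordinary IP routers can forward RoCEv2 traffic while RDMA-aware NICs still interpret the transport headers.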
| |
| |
=== Congestion control in RoCEv2 ===
  * **DCQCN** (Data Center Quantized Congestion Notification)
  * **PFC** (Priority Flow Control)
  * **ECN** (Explicit Congestion Notification)
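Of these, DCQCN is the end-to-end scheme: switches ECN-mark packets, the receiver returns Congestion Notification Packets (CNPs), and the sender cuts its rate. A minimal sketch of the sender-side reaction, following the update rules described in the original DCQCN paper (fast recovery and additive increase are omitted):

```python
# Minimal sketch of the DCQCN sender-side rate cut -- illustrative only.
def on_cnp(rate, alpha, g=1/256):
    """React to a CNP: remember the target rate, cut the current rate,
    and grow the congestion estimate alpha (g is the EWMA gain)."""
    target = rate                    # Rt <- Rc (used later for recovery)
    rate = rate * (1 - alpha / 2)    # Rc <- Rc * (1 - alpha/2), multiplicative cut
    alpha = (1 - g) * alpha + g      # alpha <- (1-g)*alpha + g
    return rate, target, alpha

rate, alpha = 40000.0, 0.5           # e.g. a 40 Gb/s flow, mid-range alpha
rate, target, alpha = on_cnp(rate, alpha)
```

PFC then acts as the per-hop safety net (pausing a priority class before buffers overflow), while ECN provides the marking signal DCQCN consumes.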

=== Ultra Ethernet ===
**Ultra Ethernet** is an evolving concept that builds on **RoCE** to create even more robust, low-latency, and lossless Ethernet environments. Companies like **Nvidia** and **Arista** are leading the charge with **Ultra Ethernet** to create an optimized Ethernet fabric for AI workloads, where predictable, lossless communication is key.
| |
**Link:** [[https://ultraethernet.org/ultra-ethernet-specification-update/|Ultra Ethernet Specification]]