This shows you the differences between two versions of the page.
| network_stuff:machine_learning:networking [2025/07/15 17:53] – jotasandoku | network_stuff:machine_learning:networking [2025/07/15 21:08] (current) – jotasandoku | ||
|---|---|---|---|
| Line 52: | Line 52: | ||
| ==== LAN PROTOCOLS IN AI NETWORKING ==== | ==== LAN PROTOCOLS IN AI NETWORKING ==== | ||
| - | === NVIDIA | + | === InfiniBand === |
| **InfiniBand** is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its **ultra-low latency**, **high throughput**, | **InfiniBand** is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its **ultra-low latency**, **high throughput**, | ||
| Line 79: | Line 79: | ||
| === ROCEV2 === | === ROCEV2 === | ||
| + | [[ https:// | ||
| + | \\ | ||
| **RoCE (RDMA over Converged Ethernet)** is a technology that brings the benefits of **RDMA** to Ethernet networks, enabling the same low-latency, | **RoCE (RDMA over Converged Ethernet)** is a technology that brings the benefits of **RDMA** to Ethernet networks, enabling the same low-latency, | ||
| Key aspects of **RoCE**: | Key aspects of **RoCE**: | ||
| - | * **RDMA on Ethernet**: **RoCE** allows RDMA to operate over Ethernet, enabling efficient memory-to-memory data transfers between servers **without involving the CPU**, reducing latency and offloading the CPU from handling the bulk of the data movement. | ||
| - | * **RoCE v1 and RoCE v2**: | ||
| - | * **RoCE v1** operates at **Layer 2** and is confined within the same Ethernet broadcast domain, meaning it cannot be routed across subnets. | ||
| * **RoCE v2** operates at **Layer 3 (IP level)**, making it routable across Layer 3 networks, allowing it to scale across larger environments. This flexibility makes **RoCE v2** a better choice for AI and data center networks where communication across subnets is required. | * **RoCE v2** operates at **Layer 3 (IP level)**, making it routable across Layer 3 networks, allowing it to scale across larger environments. This flexibility makes **RoCE v2** a better choice for AI and data center networks where communication across subnets is required. | ||
| * **Lossless Ethernet**: RoCE relies on **lossless Ethernet** technologies (such as **Priority Flow Control (PFC)** and **Enhanced Transmission Selection (ETS)**) to prevent packet loss, ensuring the high reliability that AI workloads need. | * **Lossless Ethernet**: RoCE relies on **lossless Ethernet** technologies (such as **Priority Flow Control (PFC)** and **Enhanced Transmission Selection (ETS)**) to prevent packet loss, ensuring the high reliability that AI workloads need. | ||
| * **Congestion Control**: **Explicit Congestion Notification (ECN)** is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down the transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without the need for retransmissions. | * **Congestion Control**: **Explicit Congestion Notification (ECN)** is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down the transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without the need for retransmissions. | ||
| - | |||
| - | Benefits of **RoCE** in AI networking: | ||
| - | * **Low-Latency Ethernet**: RoCE delivers **latency as low as 1-2 microseconds**, | ||
| - | * **Cost-Effective**: | ||
| * **Interoperability**: | * **Interoperability**: | ||
| * **Compatibility with AI Workloads**: | * **Compatibility with AI Workloads**: | ||
| + | * __QP (Queue Pair)__: the fundamental object representing one end of an RDMA connection. It consists of a send queue and a receive queue. | ||
| + | * __BTH (Base Transport Header)__: a key component of RoCEv2 packets, carrying essential information such as: | ||
| + | * Packet Sequence Number (PSN), QP Number, and acknowledgment request bits. | ||
| + | \\ | ||
| + | Packet structure: | ||
| + | Ethernet Header → IP Header → UDP Header → RoCE Packet (BTH + Payload) | ||
| + | The Base Transport Header (BTH) is a key component of the InfiniBand transport layer. It contains essential information for delivering messages in InfiniBand or RDMA over Converged Ethernet (RoCE). | ||
| + | |||
| + | \\ | ||
| + | * OpCode: Specifies the operation type (e.g., RDMA read, write, send, atomic). | ||
| + | * Solicited Event Indicator (SE): Asks the responder to raise a completion event when the message arrives. | ||
| + | * Migration State (M): Communicates the automatic path migration state of the connection. | ||
| + | * P_Key: Identifies the partition the packet belongs to. | ||
| + | * Destination QP: Specifies the target Queue Pair for the message. | ||
| + | * Packet Sequence Number (PSN): Ensures ordered delivery and detects packet loss. | ||
| + | * Acknowledgment Request (A): Signals if an acknowledgment is needed for reliable transport. | ||
| + | * Resync Request (R): Handles retransmissions in reliable modes. | ||
| + | |||
| - | RoCE is increasingly being adopted in **AI training clusters**, where the flexibility | + | == RDMA VERBS == |
| - | == ROCE VERBS == | + | The verbs are the **same** for both InfiniBand and RoCEv2. | ||
| - | **TODO** | + | * ibv_alloc_pd: |
| + | * ibv_reg_mr: Registers a memory region for RDMA operations. | ||
| + | * ibv_create_cq: | ||
| + | * ibv_create_qp: | ||
| + | * ibv_modify_qp: | ||
| + | * ibv_post_send: | ||
| + | * ibv_post_recv: | ||
| + | * ibv_poll_cq: | ||
| + | | ||
| + | | ||
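
The BTH fields listed in the revision above can be made concrete with a small packing sketch. This is an illustration only, not how a NIC or the verbs library builds packets: the `BTH` class, its field names, and the default P_Key of `0xFFFF` are hypothetical choices made for this example. The 12-byte layout follows the InfiniBand transport header as generally documented (OpCode, SE/M/PadCnt/TVer, P_Key, reserved byte, 24-bit destination QP, A bit, 24-bit PSN), and UDP port 4791 is the registered RoCEv2 destination port. The opcode value `0x0A` is assumed here to be an RC "RDMA WRITE Only" and should be checked against the specification before being relied on.

```python
import struct
from dataclasses import dataclass

ROCEV2_UDP_PORT = 4791  # UDP destination port that identifies RoCEv2 traffic


@dataclass
class BTH:
    """Hypothetical helper modelling the 12-byte Base Transport Header."""
    opcode: int               # operation type (send, RDMA write, ...)
    dest_qp: int              # 24-bit destination Queue Pair number
    psn: int                  # 24-bit Packet Sequence Number
    pkey: int = 0xFFFF        # partition key (assumed default for this sketch)
    solicited: bool = False   # SE bit
    migreq: bool = False      # M bit
    pad_count: int = 0        # 2-bit pad count
    tver: int = 0             # 4-bit transport header version
    ack_req: bool = False     # A bit

    def pack(self) -> bytes:
        # Byte 1 carries SE, M, PadCnt and TVer.
        flags = (
            (self.solicited << 7)
            | (self.migreq << 6)
            | ((self.pad_count & 0x3) << 4)
            | (self.tver & 0xF)
        )
        # Top byte of this word is reserved, low 24 bits are the destination QP.
        qp_word = self.dest_qp & 0xFFFFFF
        # Top bit is A, next 7 bits are reserved, low 24 bits are the PSN.
        psn_word = (self.ack_req << 31) | (self.psn & 0xFFFFFF)
        return struct.pack(
            "!BBHII", self.opcode & 0xFF, flags, self.pkey & 0xFFFF, qp_word, psn_word
        )

    @classmethod
    def unpack(cls, data: bytes) -> "BTH":
        opcode, flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", data[:12])
        return cls(
            opcode=opcode,
            dest_qp=qp_word & 0xFFFFFF,
            psn=psn_word & 0xFFFFFF,
            pkey=pkey,
            solicited=bool(flags & 0x80),
            migreq=bool(flags & 0x40),
            pad_count=(flags >> 4) & 0x3,
            tver=flags & 0xF,
            ack_req=bool(psn_word >> 31),
        )


if __name__ == "__main__":
    header = BTH(opcode=0x0A, dest_qp=0x12, psn=100, ack_req=True)
    raw = header.pack()
    print(len(raw), raw.hex())
    print(BTH.unpack(raw))
```

In a real RoCEv2 frame these 12 bytes would sit directly after the UDP header, followed by any extended transport headers and the payload.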