==== LAN PROTOCOLS IN AI NETWORKING ====
  
=== InfiniBand ===
**InfiniBand** is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its **ultra-low latency**, **high throughput**, and **lossless performance**. It plays a crucial role in enabling efficient communication between nodes, especially for distributed machine learning tasks where large amounts of data need to be shared rapidly across many devices.
  
  
=== RoCEv2 ===
[[ https://netdevconf.info/0x19/docs/netdev-0x19-paper18-talk-slides/netdev-0x19-AI-networking-RoCE-and-netdev.pdf ]]
\\
**RoCE (RDMA over Converged Ethernet)** is a technology that brings the benefits of **RDMA** to Ethernet networks, enabling the same low-latency, high-throughput, and lossless data transfer characteristics as InfiniBand but using standard **Ethernet infrastructure**.
  
Key aspects of **RoCE**:
  * **RDMA on Ethernet**: **RoCE** allows RDMA to operate over Ethernet, enabling efficient memory-to-memory data transfers between servers **without involving the CPU**, reducing latency and offloading the bulk of the data movement from the CPU.
  * **RoCE v1 and RoCE v2**:
    * **RoCE v1** operates at **Layer 2** and is confined to a single Ethernet broadcast domain, meaning it cannot be routed across subnets.
    * **RoCE v2** operates at **Layer 3 (IP level)**, making it routable across Layer 3 networks, allowing it to scale across larger environments. This flexibility makes **RoCE v2** a better choice for AI and data center networks where communication across subnets is required.
  * **Lossless Ethernet**: RoCE relies on **lossless Ethernet** technologies (such as **Priority Flow Control (PFC)** and **Enhanced Transmission Selection (ETS)**) to prevent packet loss, ensuring the high reliability that AI workloads need.
  * **Congestion Control**: **Explicit Congestion Notification (ECN)** is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down the transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without the need for retransmissions.
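The ECN feedback loop described above can be sketched as a toy simulation. This is an illustrative model only (not DCQCN or any real RoCE congestion-control algorithm, and the threshold and rate constants are invented for the example): a switch marks packets once its queue exceeds a threshold, and the sender cuts its rate on marked feedback and ramps up additively otherwise.

```python
# Toy model of ECN-based congestion control (illustrative only; real RoCE
# deployments use hardware ECN marking with algorithms such as DCQCN).

MARK_THRESHOLD = 20   # queue depth (packets) at which the switch marks ECN
LINK_DRAIN = 10       # packets the switch forwards per time step

def switch_step(queue_depth, arriving):
    """Queue arriving packets, drain the link, and report ECN marking."""
    queue_depth = max(0, queue_depth + arriving - LINK_DRAIN)
    ecn_marked = queue_depth > MARK_THRESHOLD
    return queue_depth, ecn_marked

def sender_step(rate, ecn_marked):
    """Multiplicative decrease on ECN feedback, additive increase otherwise."""
    return max(1, rate // 2) if ecn_marked else rate + 1

def simulate(steps=50, initial_rate=30):
    """Run the sender/switch loop and record the sending rate over time."""
    rate, queue = initial_rate, 0
    rates = []
    for _ in range(steps):
        queue, marked = switch_step(queue, rate)
        rate = sender_step(rate, marked)
        rates.append(rate)
    return rates

rates = simulate()
```

The point of the sketch is the control loop itself: the sender never sees the queue directly, only the ECN mark, yet its rate stays bounded instead of overrunning the switch and forcing drops.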
Benefits of **RoCE** in AI networking:
  * **Low-Latency Ethernet**: RoCE delivers **latency as low as 1-2 microseconds**, close to the performance of InfiniBand, but using Ethernet infrastructure.
  * **Cost-Effective**: By leveraging **Ethernet**, RoCE can often be more **cost-effective** than building a dedicated InfiniBand fabric, especially when large-scale Ethernet infrastructure already exists.
  * **Interoperability**: Since RoCE is based on Ethernet, it can be more easily integrated with existing Ethernet-based infrastructure while still providing RDMA-like performance.
  * **Compatibility with AI Workloads**: Like InfiniBand, **RoCE** supports high-speed, low-latency, and lossless communication, making it ideal for distributed AI workloads such as training deep learning models across multiple GPUs or nodes.
  * __QP (Queue Pair)__: the fundamental unit of an RDMA connection, consisting of a send queue and a receive queue.
  * __BTH (Base Transport Header)__: a key component of every RoCEv2 packet, carrying essential information such as:
    * the Packet Sequence Number (PSN), QP Number, and acknowledgment-request bits.
\\
Packet structure:
  Ethernet Header → IP Header → UDP Header → RoCE Packet (BTH + Payload)
The Base Transport Header (BTH) is a key component of the InfiniBand transport layer. It contains the information needed to deliver messages in InfiniBand or RDMA over Converged Ethernet (RoCE). Its fields include:
  * OpCode: Specifies the operation type (e.g., RDMA read, write, send, atomic).
  * Solicited Event Indicator (SE): Indicates if a completion event is required.
  * Migration State (M): Tracks the path-migration state of the connection.
  * P_Key: Identifies the partition the packet belongs to.
  * Destination QP: Specifies the target Queue Pair for the message.
  * Packet Sequence Number (PSN): Ensures ordered delivery and detects packet loss.
  * Acknowledgment Request (A): Signals if an acknowledgment is needed for reliable transport.
  * Resync Request (R): Handles retransmissions in reliable modes.
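Packing and parsing the 12-byte BTH in code makes the field layout concrete. The byte offsets below follow the InfiniBand BTH layout (opcode, then the SE/M/pad/version byte, P_Key, a reserved byte, the 24-bit Destination QP, the acknowledgment-request bit, and the 24-bit PSN); in RoCEv2 this header rides inside UDP with well-known destination port 4791. Treat this as an illustrative sketch, not a production parser.

```python
import struct

# Illustrative pack/parse of the 12-byte Base Transport Header (BTH).
# Byte layout: opcode(1) | SE/M/pad/tver(1) | P_Key(2) | reserved(1) |
#              Destination QP(3) | ack-req + reserved(1) | PSN(3)

def pack_bth(opcode, se, pkey, dest_qp, ack_req, psn):
    """Build a BTH; M, pad count, and transport version are left at 0."""
    byte1 = (se & 1) << 7          # Solicited Event bit
    byte8 = (ack_req & 1) << 7     # Acknowledgment Request bit
    return (struct.pack("!BBHB", opcode, byte1, pkey, 0)
            + dest_qp.to_bytes(3, "big")
            + bytes([byte8])
            + psn.to_bytes(3, "big"))

def parse_bth(data):
    """Decode the fields back out of a 12-byte BTH."""
    opcode, byte1, pkey, _resv = struct.unpack("!BBHB", data[:5])
    return {
        "opcode":  opcode,
        "se":      byte1 >> 7,
        "pkey":    pkey,
        "dest_qp": int.from_bytes(data[5:8], "big"),
        "ack_req": data[8] >> 7,
        "psn":     int.from_bytes(data[9:12], "big"),
    }

# Round-trip example: an RC SEND-only packet (opcode 0x04) to QP 0x1234,
# using the default partition key 0xFFFF.
hdr = pack_bth(opcode=0x04, se=1, pkey=0xFFFF, dest_qp=0x1234,
               ack_req=1, psn=100)
fields = parse_bth(hdr)
```

Note how the Destination QP and PSN are 24-bit values, which is why they are handled with `to_bytes`/`from_bytes` rather than a standard `struct` format character.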
  
RoCE is increasingly being adopted in **AI training clusters**, where the flexibility of Ethernet is needed but the high performance of RDMA is still crucial.

== RDMA VERBS ==
The verbs are the **same** for both InfiniBand and RoCEv2:
  * ibv_alloc_pd: Allocates a Protection Domain for resources.
  * ibv_reg_mr: Registers a memory region for RDMA operations.
  * ibv_create_cq: Creates a Completion Queue to track work completions.
  * ibv_create_qp: Creates a Queue Pair for sending and receiving data.
  * ibv_modify_qp: Changes the state or properties of a Queue Pair.
  * ibv_post_send: Posts a send work request to the send queue.
  * ibv_post_recv: Posts a receive work request to the receive queue.
  * ibv_poll_cq: Polls a Completion Queue for completed work requests.
  * ibv_query_device: Retrieves attributes of an RDMA device.
  * ibv_get_device_list: Lists available RDMA devices.
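To show how ibv_modify_qp drives a Queue Pair through its lifecycle, here is a toy Python model of the reliable-connection (RC) state sequence RESET → INIT → RTR (Ready to Receive) → RTS (Ready to Send). Real code would use the libibverbs C API (or pyverbs) against actual RDMA hardware; this conceptual sketch only enforces the legal transition order and the rule that sends require RTS.

```python
# Toy model of the RC Queue Pair state machine driven by ibv_modify_qp.
# Illustrative only: real QPs live in libibverbs/hardware, not Python.

LEGAL_TRANSITIONS = {
    "RESET": {"INIT"},
    "INIT":  {"RTR"},            # Ready to Receive
    "RTR":   {"RTS"},            # Ready to Send
    "RTS":   {"ERROR", "RESET"},
}

class QueuePair:
    """A Queue Pair: a send queue and a receive queue plus a state."""

    def __init__(self):
        self.state = "RESET"
        self.send_queue = []
        self.recv_queue = []

    def modify(self, new_state):
        """Mimics ibv_modify_qp: only legal transitions succeed."""
        if new_state not in LEGAL_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def post_send(self, work_request):
        """Mimics ibv_post_send: posting a send requires the RTS state."""
        if self.state != "RTS":
            raise RuntimeError("QP must be in RTS to post sends")
        self.send_queue.append(work_request)

# Typical bring-up: walk the QP to RTS, then post a send work request.
qp = QueuePair()
for state in ("INIT", "RTR", "RTS"):
    qp.modify(state)
qp.post_send({"opcode": "SEND", "payload": b"hello"})
```

The asymmetry in the state machine mirrors real QP bring-up: a QP can receive before it can send, which is why RTR comes before RTS when two peers connect.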
  
  
network_stuff/machine_learning/networking.1752601988.txt.gz · Last modified: by jotasandoku