**Link:** [[https://github.com/NVIDIA/nccl|NCCL GitHub Repository]]
  
  * **AllReduce** (synchronizes gradients): Each GPU computes local gradients, aggregates them by summing element-wise across GPUs, then redistributes the averaged result back to every GPU for synchronized gradient updates (a minimal C sketch follows this list).
  * **Reduce**: Multiple GPUs send locally computed values (e.g., loss, metrics) to one GPU, which aggregates them (usually summing or averaging), keeping the final combined result locally for centralized analysis or logging.
  * **Broadcast**: A single GPU distributes identical data (such as updated model parameters or initial hyperparameters) to all GPUs, ensuring consistent information before computation begins.
  * **Gather**: Each GPU produces distinct data (e.g., inference predictions), which is sent to and collected by a single GPU, assembling it into a single dataset or tensor.
  * **AllGather** (distributes tensors, e.g., datasets): Every GPU shares its unique local data (embeddings, partial outputs) with all other GPUs; all GPUs concatenate this data identically, so each GPU ends up holding the complete combined dataset.
  * **ReduceScatter**: GPUs collaboratively aggregate their local data via summation, then partition and distribute distinct segments of this combined result to individual GPUs for subsequent parallel computations.
  * **Scatter**: One GPU partitions a large tensor or dataset into distinct segments, distributing each segment to different GPUs to enable parallel processing of unique subsets.
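A minimal single-process sketch of **AllReduce** using the NCCL C API (see the NCCL link above). ''ncclCommInitAll'', ''ncclGroupStart''/''ncclGroupEnd'', and ''ncclAllReduce'' are real NCCL calls; the buffer size and the 8-GPU array bound are illustrative assumptions, and error checking is omitted:

<code c>
/* AllReduce: every GPU contributes a buffer; after the call, every GPU
 * holds the element-wise sum (frameworks divide by world size to average). */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                      /* illustrative bound */

    ncclComm_t comms[8];
    float *sendbuf[8], *recvbuf[8];
    cudaStream_t streams[8];
    const size_t count = 1 << 20;                /* 1M floats per GPU ("gradients") */

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ndev, NULL);          /* one communicator per GPU */

    ncclGroupStart();                            /* batch one call per GPU */
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);       /* wait for the collective */
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
</code>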
  
  
==== LAN PROTOCOLS IN AI NETWORKING ====
  
=== InfiniBand ===
**InfiniBand** is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its **ultra-low latency**, **high throughput**, and **lossless performance**. It plays a crucial role in enabling efficient communication between nodes, especially for distributed machine learning tasks where large amounts of data need to be shared rapidly across many devices.
  
InfiniBand is often deployed in **AI training clusters** where large-scale model training requires seamless, high-speed communication between GPUs across different servers. This makes it a popular choice in supercomputing environments and AI data centers.
  
=== RoCEv2 ===
[[https://netdevconf.info/0x19/docs/netdev-0x19-paper18-talk-slides/netdev-0x19-AI-networking-RoCE-and-netdev.pdf|Netdev 0x19: AI networking, RoCE and netdev (slides)]]
\\
**RoCE (RDMA over Converged Ethernet)** is a technology that brings the benefits of **RDMA** to Ethernet networks, enabling the same low-latency, high-throughput, and lossless data transfer characteristics as InfiniBand but using standard **Ethernet infrastructure**.
  
Key aspects of **RoCE**:
  * **RoCE v2 vs. RoCE v1**: **RoCE v2** operates at **Layer 3 (IP level)**, making it routable across Layer 3 networks and allowing it to scale across larger environments; **RoCE v1** is confined to a single Layer 2 broadcast domain. This flexibility makes **RoCE v2** the better choice for AI and data center networks where communication across subnets is required.
  * **Lossless Ethernet**: RoCE relies on **lossless Ethernet** technologies (such as **Priority Flow Control (PFC)** and **Enhanced Transmission Selection (ETS)**) to prevent packet loss, ensuring the high reliability that AI workloads need.
  * **Congestion Control**: **Explicit Congestion Notification (ECN)** is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down the transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without the need for retransmissions.
  * **Interoperability**: Since RoCE is based on Ethernet, it can be more easily integrated with existing Ethernet-based infrastructure while still providing RDMA-like performance.
  * **Compatibility with AI Workloads**: Like InfiniBand, **RoCE** supports high-speed, low-latency, and lossless communication, making it ideal for distributed AI workloads such as training deep learning models across multiple GPUs or nodes.
  * __QP (Queue Pair)__: the fundamental object representing an RDMA connection; it consists of a send queue and a receive queue.
  * __BTH (Base Transport Header)__: a key component of RoCEv2 packets, carrying essential information such as the Packet Sequence Number (PSN), the QP number, and acknowledgment request bits.
\\
Packet structure (RoCEv2 encapsulates the InfiniBand transport packet in UDP, destination port 4791):
  Ethernet Header → IP Header → UDP Header → RoCE Packet (BTH + Payload)
The Base Transport Header (BTH) is a key component of the InfiniBand transport layer. It contains the essential information for delivering messages in InfiniBand or RDMA over Converged Ethernet (RoCE). Its fields include (a C struct sketch follows the list):
  * OpCode: Specifies the operation type (e.g., RDMA read, write, send, atomic).
  * Solicited Event Indicator (SE): Indicates if a completion event is required.
  * Migration State (M): Manages Queue Pair (QP) state transitions.
  * P_Key: Identifies the partition the packet belongs to.
  * Destination QP: Specifies the target Queue Pair for the message.
  * Packet Sequence Number (PSN): Ensures ordered delivery and detects packet loss.
  * Acknowledgment Request (A): Signals if an acknowledgment is needed for reliable transport.
  * Resync Request (R): Handles retransmissions in reliable modes.
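A hedged C sketch of the 12-byte BTH wire layout implied by this field list. The packing follows the InfiniBand Architecture Specification; the accessor helpers are illustrative, not part of any library:

<code c>
/* Sketch of the 12-byte Base Transport Header as it appears on the wire.
 * Multi-bit fields share bytes, so C bitfields are avoided (their layout
 * is compiler-defined); reserved bits are folded into the adjacent words. */
#include <stdint.h>
#include <arpa/inet.h>

struct bth {
    uint8_t  opcode;   /* operation type: RDMA read/write/send/atomic */
    uint8_t  flags;    /* SE (bit 7), M (bit 6), pad count, header version */
    uint16_t pkey;     /* P_Key: partition the packet belongs to */
    uint32_t dqpn;     /* top byte reserved; low 24 bits: destination QP */
    uint32_t apsn;     /* bit 31: AckReq (A); low 24 bits: PSN */
};

/* Illustrative accessors; all BTH fields are big-endian on the wire. */
static inline uint32_t bth_dest_qp(const struct bth *b) {
    return ntohl(b->dqpn) & 0x00FFFFFF;
}
static inline uint32_t bth_psn(const struct bth *b) {
    return ntohl(b->apsn) & 0x00FFFFFF;
}
static inline int bth_ack_req(const struct bth *b) {
    return (ntohl(b->apsn) >> 31) & 1;
}
</code>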
== RDMA VERBS ==
The verbs API is the **same** for both InfiniBand and RoCEv2; a minimal setup sketch follows the list.
  * ''ibv_alloc_pd'': Allocates a Protection Domain for resources.
  * ''ibv_reg_mr'': Registers a memory region for RDMA operations.
  * ''ibv_create_cq'': Creates a Completion Queue to track work completions.
  * ''ibv_create_qp'': Creates a Queue Pair for sending and receiving data.
  * ''ibv_modify_qp'': Changes the state or properties of a Queue Pair.
  * ''ibv_post_send'': Posts a send work request to the send queue.
  * ''ibv_post_recv'': Posts a receive work request to the receive queue.
  * ''ibv_poll_cq'': Polls a Completion Queue for completed work requests.
  * ''ibv_query_device'': Retrieves attributes of an RDMA device.
  * ''ibv_get_device_list'': Lists available RDMA devices.
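A hedged C sketch of the typical setup sequence built from these verbs. All calls shown are standard libibverbs functions; the QP state transitions via ''ibv_modify_qp'' and the out-of-band exchange of QP numbers with the peer are omitted for brevity:

<code c>
/* Minimal libibverbs resource setup: device -> PD -> MR -> CQ -> QP.
 * Compile with: gcc demo.c -libverbs (assumes rdma-core is installed). */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);  /* enumerate RDMA NICs */
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                 /* protection domain */

    char *buf = calloc(1, 4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,          /* pin + register memory */
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,                             /* reliable connection */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("local QP number: %u\n", qp->qp_num);           /* exchanged out of band */

    /* Teardown in reverse order of creation. */
    ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
    free(buf);
    return 0;
}
</code>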
  
  
=== Ultra Ethernet ===
**Ultra Ethernet** is an evolving concept that builds on **RoCE** to create even more robust, low-latency, and lossless Ethernet environments. Companies like **Nvidia** and **Arista** are leading the charge with **Ultra Ethernet** to create an optimized Ethernet fabric for AI workloads, where predictable, lossless communication is key.
  
**Link:** [[https://ultraethernet.org/ultra-ethernet-specification-update/|Ultra Ethernet Specification]]