[[https://
===== NETWORKING FOR AI (AI NETWORKING) =====

==== Key Requirements for AI Networking ====
  * **HPC** workloads require **ultra-low latency** with minimal jitter, even at low throughput.
  * **AI workloads** demand **predictable low latency** at **high throughput** (often near 100% utilization).
  * **RPCs (Remote Procedure Calls)** and **memory-to-memory access** are crucial, with zero tolerance for packet loss or retransmissions.
  * Example: a single **XPU** (GPU, TPU, etc.) can generate flows of ~400 Gbps.
| - | |||
| - | ==== AI Backend Networking Layers ==== | ||
| - | AI backend networks consist of several layers: | ||
| - | * **RAIL** (Scale-up network – NVLink): Provides high-speed, low-latency memory interconnects between GPUs within the same node. See [[https:// | ||
| - | * **RACK** (Scale-up network – InfiniBand): | ||
| - | * **LEAF**: Connects servers within the rack. | ||
| - | * **SPINE**: Connects multiple leaf switches. | ||
| - | * **SUPERSPINE**: | ||
| - | |||
| - | ==== NCCL Library (NVIDIA Collective Communications Library) ==== | ||
| - | Intro [[https:// | ||
| - | |||
| - | ====== NCCL Collective Operations (ML-Centric) ====== | ||
| - | |||
| - | These operations enable multiple GPUs to communicate efficiently during distributed training, such as gradient synchronization, | ||
| - | |||
| - | The **NVIDIA Collective Communications Library (NCCL)** is essential in scaling AI workloads across multiple GPUs and nodes. | ||
| - | * It provides communication routines such as: | ||
| - | * **All-Reduce**, | ||
| - | * NCCL abstracts away the complexity of multi-GPU and multi-node communication, | ||
| - | * NCCL helps frameworks like **TensorFlow** and **PyTorch** efficiently distribute tasks across GPUs, ensuring **high bandwidth** and **low latency**. | ||
| - | |||
| - | **Link:** [[https:// | ||
| - | |||
| - | * **AllReduce (syncs. gradients**: | ||
| - | * Reduce: Multiple GPUs send locally computed values (e.g., loss, metrics) to one GPU, which aggregates them (usually summing or averaging), keeping the final combined result locally for centralized analysis or logging. | ||
| - | * **Broadcast**: | ||
| - | * Gather: Each GPU produces distinct data (e.g., inference predictions), | ||
| - | * **AllGather (distributes tensors (eg: datasets))**: | ||
| - | * ReduceScatter: | ||
| - | * Scatter: One GPU partitions a large tensor or dataset into distinct segments, distributing each segment to different GPUs to enable parallel processing of unique subsets. | ||
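As a rough illustration of the semantics only (not of NCCL's actual ring/tree algorithms or API), the collectives above can be sketched with plain Python lists standing in for per-GPU tensors:

```python
# Toy model of NCCL collective semantics: each inner list is one "GPU's" tensor.
def all_reduce(tensors):
    # Element-wise sum across GPUs; every GPU receives the full result.
    reduced = [sum(vals) for vals in zip(*tensors)]
    return [list(reduced) for _ in tensors]

def reduce_scatter(tensors):
    # Element-wise sum, then GPU i keeps only chunk i of the result.
    reduced = [sum(vals) for vals in zip(*tensors)]
    chunk = len(reduced) // len(tensors)
    return [reduced[i * chunk:(i + 1) * chunk] for i in range(len(tensors))]

def all_gather(chunks):
    # Concatenate every GPU's chunk; every GPU receives the concatenation.
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]      # 2 "GPUs", 4-element gradients
print(all_reduce(grads))                      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(grads))                  # [[11, 22], [33, 44]]
print(all_gather(reduce_scatter(grads)))      # all-reduce = reduce-scatter + all-gather
```

The last line shows why ReduceScatter is called the building block of a ring all-reduce: composing it with AllGather reproduces AllReduce.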
| - | |||
| - | |||
| - | |||
| - | ==== CUDA (Compute Unified Device Architecture) ==== | ||
| - | Proprietary[2] parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs. CUDA was created by Nvidia in 2006.[3] When it was first introduced, the name was an acronym for Compute Unified Device Architecture, | ||
| - | |||
| - | ==== NVLink ==== | ||
| - | Is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).[1] | ||
| - | |||
| - | |||
| - | |||
| - | ==== LAN PROTOCOLS IN AI NETWORKING ==== | ||
| - | |||
| - | === NVIDIA InfiniBand === | ||
| - | **InfiniBand** is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its **ultra-low latency**, **high throughput**, | ||
| - | |||
| - | Key characteristics of **InfiniBand** in AI networking: | ||
| - | |||
| - | * **Low Latency**: InfiniBand offers **latencies as low as 1-2 microseconds**, | ||
| - | * **High Bandwidth**: | ||
| - | * **Lossless Transmission**: | ||
| - | * **Remote Direct Memory Access (RDMA)**: One of the most important features of InfiniBand, RDMA allows **direct memory-to-memory transfers** between nodes **without involving the CPU**. This is crucial in reducing CPU overhead and accelerating data transfers, making it ideal for AI training where rapid data sharing is required between nodes. | ||
| - | * **Self-Healing**: | ||
| - | * **Queue Pair Communication**: | ||
| - | | ||
| - | Key operations managed by **InfiniBand Verbs** (the API for data transfer operations): | ||
| - | * **Send/ | ||
| - | * **RDMA Read/ | ||
| - | * **Atomic Operations**: | ||
| - | | ||
| - | Common InfiniBand verbs include: | ||
| - | * '' | ||
| - | * '' | ||
| - | * '' | ||
| - | * '' | ||
| - | |||
| - | |||
| - | InfiniBand is often deployed in **AI training clusters** where large-scale model training requires seamless, high-speed communication between GPUs across different servers. This makes it a popular choice in supercomputing environments and AI data centers. | ||
| - | |||
| - | === ROCEV2 === | ||
| - | **RoCE (RDMA over Converged Ethernet)** is a technology that brings the benefits of **RDMA** to Ethernet networks, enabling the same low-latency, | ||
| - | |||
| - | Key aspects of **RoCE**: | ||
| - | * **RDMA on Ethernet**: **RoCE** allows RDMA to operate over Ethernet, enabling efficient memory-to-memory data transfers between servers **without involving the CPU**, reducing latency and offloading the CPU from handling the bulk of the data movement. | ||
| - | * **RoCE v1 and RoCE v2**: | ||
| - | * **RoCE v1** operates at **Layer 2** and is confined within the same Ethernet broadcast domain, meaning it cannot be routed across subnets. | ||
| - | * **RoCE v2** operates at **Layer 3 (IP level)**, making it routable across Layer 3 networks, allowing it to scale across larger environments. This flexibility makes **RoCE v2** a better choice for AI and data center networks where communication across subnets is required. | ||
| - | * **Lossless Ethernet**: RoCE relies on **lossless Ethernet** technologies (such as **Priority Flow Control (PFC)** and **Enhanced Transmission Selection (ETS)**) to prevent packet loss, ensuring the high reliability that AI workloads need. | ||
| - | * **Congestion Control**: **Explicit Congestion Notification (ECN)** is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down the transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without the need for retransmissions. | ||
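A DCQCN-style reaction to ECN marks can be sketched as follows. The constants and function name are illustrative only, not values from the RoCE or DCQCN specifications:

```python
# Toy DCQCN-flavored sender: multiplicative decrease on ECN feedback,
# gradual additive recovery toward line rate otherwise.
def adjust_rate(rate_gbps, ecn_marked, line_rate_gbps=400.0,
                alpha=0.5, recovery_step_gbps=10.0):
    if ecn_marked:
        # Congestion signalled: cut the sending rate multiplicatively.
        return rate_gbps * (1 - alpha / 2)
    # No congestion seen: recover additively, capped at line rate.
    return min(line_rate_gbps, rate_gbps + recovery_step_gbps)

rate = 400.0
rate = adjust_rate(rate, ecn_marked=True)    # sharp cut to 300.0 on a mark
rate = adjust_rate(rate, ecn_marked=False)   # slow recovery to 310.0
print(rate)
```

The asymmetry (fast decrease, slow increase) is what keeps queues short without retransmissions: senders back off before switch buffers overflow, then probe their way back up.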
| - | |||
| - | Benefits of **RoCE** in AI networking: | ||
| - | * **Low-Latency Ethernet**: RoCE delivers **latency as low as 1-2 microseconds**, | ||
| - | * **Cost-Effective**: | ||
| - | * **Interoperability**: | ||
| - | * **Compatibility with AI Workloads**: | ||
| - | |||
| - | RoCE is increasingly being adopted in **AI training clusters**, where the flexibility of Ethernet is needed, but the high performance of RDMA is still crucial. | ||
| - | == ROCE VERBS == | ||
| - | **TODO** | ||
| - | |||
| - | |||
| - | === Ultra Ethernet === is an evolving concept that builds on **RoCE** to create even more robust, low-latency, | ||
| - | |||
| - | **Link:** [[https:// | ||
| - | |||
| - | --- | ||
| - | |||
| - | Together, **InfiniBand** and **RoCE** provide the backbone for high-performance AI networking, offering low-latency, | ||
| - | |||
| - | |||
| - | ==== Google Aquila ==== | ||
| - | * **Google Aquila** is an AI infrastructure platform optimized for large-scale AI model training, integrating high-speed networking, compute, and storage to handle demanding AI workloads efficiently. | ||
| - | |||
| - | **Link:** [[https:// | ||
| - | |||
| - | ===== Additional Thoughts ===== | ||
| - | The shift to AI networking introduces new challenges but builds on the fundamental principles of **latency**, | ||
| - | |||
| - | AI networking requires you to: | ||
| - | * Understand **verbs (RDMA)** for efficient memory transfers. | ||
| - | * Focus on **predictable performance** in latency and bandwidth for AI workloads. | ||
| - | |||
| - | By expanding your knowledge in RDMA technologies like **InfiniBand** and **RoCE**, you'll bridge the gap between traditional networking and the demands of modern AI infrastructures. | ||
| - | |||
| - | ====== Networking for AI Training and HPC ====== | ||
| - | |||
| - | This document elaborates on key networking concepts, technologies, | ||
| - | |||
| - | ===== Training Operations ===== | ||
| - | |||
| - | * **All-Reduce Operation**: | ||
| - | - In distributed AI training, models are trained across multiple nodes (e.g., GPUs, CPUs) to leverage parallel computation. | ||
| - | - During this process, intermediate results from each node must be aggregated and shared with all other nodes. | ||
| - | - The ' | ||
| - | - This operation is fundamental for synchronizing model updates, particularly in gradient descent optimization algorithms used in deep learning. | ||
| - | - Inefficient implementation of all-reduce can lead to bottlenecks, | ||
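To see why implementation efficiency matters, consider the bandwidth-optimal ring algorithm: for N nodes and S bytes of gradients, each node sends 2(N-1)/N × S bytes in total, which stays below 2S no matter how many nodes join. A quick back-of-the-envelope check (payload size is an illustrative example):

```python
# Per-node traffic of a ring all-reduce on N nodes for a payload of S bytes:
# the reduce-scatter and all-gather phases each move (N-1)/N * S per node,
# so the total is 2 * (N-1)/N * S -- nearly independent of node count.
def ring_allreduce_bytes_per_node(payload_bytes, num_nodes):
    return 2 * (num_nodes - 1) / num_nodes * payload_bytes

gradients = 1_000_000_000  # e.g., 1 GB of fp32 gradients (~250M parameters)
for n in (2, 8, 64):
    gb = ring_allreduce_bytes_per_node(gradients, n) / 1e9
    print(f"{n:3d} nodes -> {gb:.3f} GB per node")
```

A naive all-to-one-to-all scheme, by contrast, forces (N-1) × S bytes through a single node, which is exactly the bottleneck the ring avoids.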
| - | |||
| - | ===== Queue Management ===== | ||
| - | |||
| - | * **Impact of Packet Drops**: | ||
| - | - In a distributed training environment, | ||
| - | - When a packet is dropped, all nodes must slow down to maintain synchronization. This involves: | ||
| - | - Retransmitting the lost packet. | ||
| - | - Re-establishing consistency across nodes. | ||
| - | - This process introduces latency and delays the completion of training epochs. | ||
| - | - **Why Queue Management Matters**: | ||
| - | - Effective queue management helps minimize packet drops by ensuring optimal data flow through network devices. | ||
| - | - Techniques such as congestion control, flow prioritization, | ||
| - | |||
| - | |||
| - | |||
| - | * **MPI (Message Passing Interface)**: | ||
| - | - A standardized library for message-passing used in distributed computing environments. | ||
| - | - Provides functions for sending, receiving, and synchronizing data between multiple nodes. | ||
| - | - Commonly used for coordinating processes in HPC and AI training. | ||
| - | - Strengths: | ||
| - | - Scalability: | ||
| - | - Flexibility: | ||
| - | |||
| - | * **RDMA (Remote Direct Memory Access)**: | ||
| - | - Enables direct memory access between GPUs (or CPUs) across a network without involving the CPU. | ||
| - | - Significantly reduces latency and CPU overhead during data transfers. | ||
| - | - Critical for high-speed, low-latency communication in distributed training setups. | ||
| - | - **Applications**: | ||
| - | - GPU-to-GPU communication in AI training. | ||
| - | - Real-time data transfers in HPC simulations. | ||
| - | |||
| - | |||
| - | ===== Flow Control ===== | ||
| - | * credit-based flow control mechanism to ensure that data is transmitted efficiently and without packet loss [TODO] | ||
| - | * Queue Pairs (QPs) and Flow Control: InfiniBand uses queue pairs (QPs) to manage communication between devices. Each QP consists of a send queue and a receive queue, which are used to post send and receive operations [TODO] | ||
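Pending the TODOs above, the credit-based idea can be sketched as a toy model (the class and numbers are illustrative, not part of the InfiniBand specification):

```python
# Toy credit-based flow control: the receiver grants credits equal to its
# free buffer slots; the sender may transmit only while it holds credits,
# so nothing is ever sent into a full buffer and nothing is dropped.
from collections import deque

class Receiver:
    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots
        self.buffered = deque()

    def grant_credits(self):
        # Advertise all currently free slots as credits.
        granted, self.free_slots = self.free_slots, 0
        return granted

    def accept(self, packet):
        self.buffered.append(packet)

    def drain(self, n):
        # The application consumes n packets, freeing slots for new credits.
        for _ in range(min(n, len(self.buffered))):
            self.buffered.popleft()
            self.free_slots += 1

def send(receiver, packets, credits):
    sent = 0
    for pkt in packets:
        if credits == 0:      # out of credits: the sender waits, never drops
            break
        receiver.accept(pkt)
        credits -= 1
        sent += 1
    return sent

rx = Receiver(buffer_slots=4)
credits = rx.grant_credits()
print(send(rx, list(range(10)), credits))   # only 4 of 10 packets go out
rx.drain(2)
print(rx.grant_credits())                   # 2 fresh credits after draining
```

Loss is prevented by construction: the sender stalls instead of overrunning the receiver, which is why InfiniBand links are lossless without retransmission machinery.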
| - | |||
| - | ===== Advanced Networking Technologies ===== | ||
| - | |||
| - | * **Direct GPU-to-GPU Communication**: | ||
| - | - Traditional communication methods rely on NICs (Network Interface Cards), which introduce latency and reduce efficiency. | ||
| - | - Modern systems aim to bypass NICs by enabling direct communication between GPUs. | ||
| - | - **Advantages**: | ||
| - | - Reduced data transfer latency. | ||
| - | - Higher throughput due to fewer intermediate processing steps. | ||
| - | |||
| - | * **PCIBus to PCIBus Communication**: | ||
| - | - Eliminates the need for traditional NICs by directly connecting PCI buses between systems. | ||
| - | - Allows for faster data transfer rates and lower latency compared to network-based communication. | ||
| - | - Commonly used in high-performance systems for AI and HPC workloads. | ||
| - | |||
| - | * **NvLink (NVIDIA Bus)**: | ||
| - | - A proprietary high-speed interconnect technology developed by NVIDIA. | ||
| - | - Provides significantly faster communication between GPUs compared to PCIe. | ||
| - | - Designed to optimize AI workloads, especially for models requiring frequent data synchronization. | ||
| - | - **Key Features**: | ||
| - | - High bandwidth: Up to 300 GB/s for certain configurations. | ||
| - | - Reduced latency: Enables GPUs to share memory directly, creating a larger effective memory pool. | ||
| - | - Scalability: | ||
| - | |||
| - | ===== Practical Implications ===== | ||
| - | |||
| - | - **Synchronization Across Nodes**: | ||
| - | - Ensuring synchronized operations across nodes is critical to avoid inconsistencies and redundant computations. | ||
| - | - Technologies like all-reduce and MPI play a central role in maintaining synchronization. | ||
| - | |||
| - | - **Minimizing Latency**: | ||
| - | - Advanced interconnects like RDMA and NvLink significantly reduce the latency associated with data transfers, improving overall system efficiency. | ||
| - | |||
| - | - **Efficient Resource Utilization**: | ||
| - | - By bypassing traditional bottlenecks like NICs, systems can achieve higher performance with the same hardware. | ||
| - | - Queue management ensures that resources are used effectively, | ||
| - | |||
| - | ===== Summary ===== | ||
| - | |||
| - | Efficient networking in AI training and HPC requires the integration of robust operations like all-reduce, advanced libraries such as MPI and RDMA, and modern technologies like NvLink. By addressing challenges like packet drops and latency, these solutions enable scalable, high-performance distributed systems that drive advancements in AI and computational science. | ||