NETWORKING FOR AI (AI NETWORKING - AI/ML DATACENTER)
Key Requirements for AI Networking
HPC workloads require ultra-low latency with minimal jitter, even at low throughput.
AI workloads demand predictable low latency at high throughput (often near 100% utilization).
RPCs (Remote Procedure Calls) and memory-to-memory access are crucial, with zero tolerance for packet loss or retransmissions.
AI Backend Networking Layers
AI backend networks consist of several layers:
RAIL (Scale-up network – NVLink): Provides high-speed, low-latency memory interconnects between GPUs within the same node. See rail_networks.
RACK (Scale-out network – InfiniBand): Enables ultra-low-latency communication between nodes within a rack, with RDMA support.
LEAF: Connects servers within the rack.
SPINE: Connects multiple leaf switches.
SUPERSPINE: Interconnects multiple spine switches for large-scale AI clusters.
NCCL Library (NVIDIA Collective Communications Library)
NCCL Collective Operations (ML-Centric)
These operations enable multiple GPUs to communicate efficiently during distributed training, such as gradient synchronization, weight updates, or sharded computations.
AllReduce
All GPUs compute partial results (often gradients) locally, then collectively combine (sum or average) them. Each GPU ends up with the complete combined result. Similar to all workers jointly solving a puzzle and each keeping the final complete picture.
Reduce
Multiple GPUs send their data to a single GPU, which aggregates (sums or averages) the inputs. This central GPU alone holds the final result—like multiple researchers contributing data to one person compiling the final report.
Broadcast
A single GPU sends identical data (e.g., model parameters or initial settings) to all other GPUs simultaneously. This ensures every GPU starts from the same information—similar to an instructor sending identical instructions to each student.
Gather
A single GPU collects different, distinct data from every GPU. Each GPU contributes unique data, which the central GPU assembles into one piece. Like a coordinator collecting different chapters from multiple authors into a single book.
The NVIDIA Collective Communications Library (NCCL) is essential in scaling AI workloads across multiple GPUs and nodes.
It provides communication routines such as AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter; a minimal AllReduce example is sketched below.
NCCL abstracts away the complexity of multi-GPU and multi-node communication, supporting both intra-node (via PCIe/NVLink) and inter-node (via InfiniBand/RoCE) communication.
NCCL helps frameworks like TensorFlow and PyTorch efficiently distribute tasks across GPUs, ensuring high bandwidth and low latency.
Link: NCCL GitHub Repository
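The following is a minimal sketch of driving NCCL directly from C, assuming a single process manages all visible GPUs via ncclCommInitAll. Buffer sizes are arbitrary, and error checking, buffer initialization, and cudaFree cleanup are omitted for brevity; this is not how any particular framework wires up NCCL, only an illustration of the AllReduce call itself.

```c
/* Minimal single-process NCCL AllReduce sketch (one thread driving all
 * visible GPUs). Illustrative build line: nvcc allreduce.c -lnccl */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    ncclComm_t   *comms   = malloc(ndev * sizeof(ncclComm_t));
    cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
    float       **sendbuf = malloc(ndev * sizeof(float *));
    float       **recvbuf = malloc(ndev * sizeof(float *));
    const size_t  count   = 1 << 20;          /* elements per GPU */

    /* Allocate per-GPU buffers and streams (send buffers left
     * uninitialized here; a real job would fill them with gradients). */
    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* One communicator per GPU, all owned by this process. */
    ncclCommInitAll(comms, ndev, NULL);

    /* Sum the send buffers across all GPUs; every GPU gets the result. */
    ncclGroupStart();
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    /* Wait for completion, then tear down the communicators. */
    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    free(comms); free(streams); free(sendbuf); free(recvbuf);
    return 0;
}
```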
CUDA (Compute Unified Device Architecture)
CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA was created by Nvidia in 2006. When it was first introduced, the name was an acronym for Compute Unified Device Architecture, but Nvidia later dropped the common use of the acronym and now rarely expands it.
NVLink
NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).
LAN PROTOCOLS IN AI NETWORKING
NVIDIA InfiniBand
InfiniBand is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its ultra-low latency, high throughput, and lossless performance. It plays a crucial role in enabling efficient communication between nodes, especially for distributed machine learning tasks where large amounts of data need to be shared rapidly across many devices.
Key characteristics of InfiniBand in AI networking:
Low Latency: InfiniBand offers latencies as low as 1-2 microseconds, which is critical for AI workloads that require predictable performance across nodes.
High Bandwidth: Supports bandwidths of up to 400 Gb/s per link with InfiniBand NDR (200 Gb/s with HDR), allowing the transfer of massive datasets needed in AI model training.
Lossless Transmission: InfiniBand is inherently lossless, ensuring that there is no packet loss during communication, which is essential for AI workloads that cannot tolerate retransmissions (e.g., when training deep learning models).
Remote Direct Memory Access (RDMA): One of the most important features of InfiniBand, RDMA allows direct memory-to-memory transfers between nodes without involving the CPU. This is crucial in reducing CPU overhead and accelerating data transfers, making it ideal for AI training where rapid data sharing is required between nodes.
Self-Healing: InfiniBand has built-in self-healing capabilities, which means that in the event of a failure or congestion in a link, it can reroute traffic dynamically to ensure continuous operation.
Queue Pair Communication: InfiniBand uses Queue Pairs (QP), consisting of a send queue and a receive queue, for managing communication between nodes.
Key operations managed by InfiniBand Verbs (the API for data transfer operations):
Send/Receive: For transmitting and receiving data.
RDMA Read/Write: To access remote memory directly.
Atomic Operations: Used for updating remote memory with atomicity, ensuring no race conditions in distributed systems.
Common InfiniBand verbs include the following (a short usage sketch follows this list):
ibv_post_send: This verb is used to post a send request to a Queue Pair (QP). It initiates the process of sending data from the local queue to a remote queue.
ibv_post_recv: This verb posts a receive request to a Queue Pair (QP). It prepares the local queue to receive incoming data from a remote queue.
ibv_reg_mr: This verb registers a memory region (MR) for RDMA access. It allows the application to specify a memory buffer that can be accessed directly by the InfiniBand hardware for data transfer operations.
ibv_modify_qp: This verb modifies the state of a Queue Pair (QP). It is used to transition the QP through various states, such as initiating a connection or resetting the QP.
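The sketch below ties several of these verbs together, assuming a protection domain and an already-connected queue pair are available; device discovery, CQ/QP creation, the ibv_modify_qp state transitions, the out-of-band exchange of the remote address and rkey, and ibv_dereg_mr cleanup are all omitted. The helper name post_rdma_write is hypothetical, not part of the verbs API.

```c
/* Hedged sketch: register a buffer and post a one-sided RDMA WRITE on an
 * already-connected queue pair. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                    void *buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* ibv_reg_mr: make the local buffer visible to the HCA. Add
     * IBV_ACCESS_REMOTE_WRITE/READ if this region should also be a
     * target for the remote side. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* Scatter/gather entry describing the local buffer. */
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* Work request: write the buffer directly into remote memory,
     * bypassing the remote CPU entirely. */
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* learned out of band   */
    wr.wr.rdma.rkey        = rkey;

    /* ibv_post_send: hand the work request to the QP's send queue.
     * The completion must later be reaped from the CQ with ibv_poll_cq(). */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```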
InfiniBand is often deployed in AI training clusters where large-scale model training requires seamless, high-speed communication between GPUs across different servers. This makes it a popular choice in supercomputing environments and AI data centers.
RDMA over Converged Ethernet (RoCE)
RoCE (RDMA over Converged Ethernet) is a technology that brings the benefits of RDMA to Ethernet networks, enabling the same low-latency, high-throughput, and lossless data transfer characteristics as InfiniBand but using standard Ethernet infrastructure.
Key aspects of RoCE:
RDMA on Ethernet: RoCE allows RDMA to operate over Ethernet, enabling efficient memory-to-memory data transfers between servers without involving the CPU, reducing latency and offloading the CPU from handling the bulk of the data movement.
RoCE v1 and RoCE v2:
RoCE v1 operates at Layer 2 and is confined within the same Ethernet broadcast domain, meaning it cannot be routed across subnets.
RoCE v2 operates at Layer 3 (IP level), making it routable across Layer 3 networks, allowing it to scale across larger environments. This flexibility makes RoCE v2 a better choice for AI and data center networks where communication across subnets is required.
Lossless Ethernet: RoCE relies on lossless Ethernet technologies (such as Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS)) to prevent packet loss, ensuring the high reliability that AI workloads need.
Congestion Control: Explicit Congestion Notification (ECN) is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down its transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without retransmissions (a toy simulation of this feedback loop is sketched below).
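To make the feedback loop concrete, here is a schematic, self-contained simulation of ECN-style rate control. It is not the DCQCN algorithm implemented by real RoCE NICs; the drain model, thresholds, and constants are invented purely for illustration of "mark when the queue builds, slow down when marked".

```c
/* Schematic simulation of ECN-driven rate control: when the switch queue
 * exceeds a marking threshold, the sender cuts its rate; otherwise it
 * slowly recovers. Toy model only; all numbers are illustrative. */
#include <stdio.h>

int main(void) {
    double rate_gbps  = 100.0;   /* current sending rate                  */
    double link_gbps  = 100.0;   /* bottleneck link capacity              */
    double queue_pkts = 0.0;     /* switch queue occupancy (1 Gbps ~ 1 pkt
                                    per tick in this toy model)           */
    const double ecn_threshold = 50.0;   /* mark packets above this depth */

    for (int t = 0; t < 20; t++) {
        /* Queue grows when the arrival rate exceeds the drain rate. */
        queue_pkts += (rate_gbps - link_gbps * 0.9);
        if (queue_pkts < 0) queue_pkts = 0;

        int ecn_marked = queue_pkts > ecn_threshold;
        if (ecn_marked)
            rate_gbps *= 0.8;    /* multiplicative decrease on congestion */
        else
            rate_gbps += 2.0;    /* additive recovery                     */
        if (rate_gbps > link_gbps) rate_gbps = link_gbps;

        printf("t=%2d queue=%6.1f pkts  ecn=%d  rate=%6.1f Gbps\n",
               t, queue_pkts, ecn_marked, rate_gbps);
    }
    return 0;
}
```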
Benefits of RoCE in AI networking:
Low-Latency Ethernet: RoCE delivers latency as low as 1-2 microseconds, close to the performance of InfiniBand, but using Ethernet infrastructure.
Cost-Effective: By leveraging Ethernet, RoCE can often be more cost-effective than building a dedicated InfiniBand fabric, especially when large-scale Ethernet infrastructure already exists.
Interoperability: Since RoCE is based on Ethernet, it can be more easily integrated with existing Ethernet-based infrastructure while still providing RDMA-like performance.
Compatibility with AI Workloads: Like InfiniBand, RoCE supports high-speed, low-latency, and lossless communication, making it ideal for distributed AI workloads such as training deep learning models across multiple GPUs or nodes.
RoCE is increasingly being adopted in AI training clusters, where the flexibility of Ethernet is needed, but the high performance of RDMA is still crucial.
Ultra Ethernet is an evolving standard that builds on the ideas behind RoCE to create even more robust, low-latency, and lossless Ethernet fabrics. The Ultra Ethernet Consortium, whose members include Arista, Broadcom, Cisco, AMD, Intel, and others, is developing this optimized Ethernet stack for AI workloads, where predictable, lossless communication is key.
Link: Ultra Ethernet Specification
—
Together, InfiniBand and RoCE provide the backbone for high-performance AI networking, offering low-latency, high-throughput, and lossless data movement—essential for scaling deep learning models across multi-node architectures.
Google Aquila
Google Aquila is an experimental low-latency datacenter network fabric from Google. It combines a custom cell-based Layer 2 protocol (GNet) with tightly integrated NIC/switch hardware to deliver predictable, microsecond-scale latency for tightly coupled workloads such as distributed ML training and HPC.
Link: Google Aquila Platform
Additional Thoughts
The shift to AI networking introduces new challenges but builds on the fundamental principles of latency, throughput, and lossless data transmission.
AI networking requires you to apply those principles under much stricter constraints: sustained near-line-rate throughput, predictable latency, and zero packet loss.
By expanding your knowledge of RDMA technologies like InfiniBand and RoCE, you'll bridge the gap between traditional networking and the demands of modern AI infrastructures.
Networking for AI Training and HPC
This document elaborates on key networking concepts, technologies, and strategies critical for AI training and high-performance computing (HPC). These insights are based on a podcast focused on cutting-edge networking practices in these domains.
Training Operations
All-Reduce Operation:
In distributed AI training, models are trained across multiple nodes (e.g., GPUs, CPUs) to leverage parallel computation.
During this process, intermediate results from each node must be aggregated and shared with all other nodes.
The 'all-reduce' operation combines all distributed results into one, ensuring that each node operates with the same data.
This operation is fundamental for synchronizing model updates, particularly in gradient descent optimization algorithms used in deep learning.
Inefficient implementation of all-reduce can lead to bottlenecks, significantly impacting training speed (a toy ring all-reduce simulation is sketched below).
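The following toy simulation illustrates the data movement of a ring all-reduce (the pattern commonly used by collective libraries such as NCCL). Workers and chunks live in plain arrays, one value stands in for each chunk, and communication and synchronization costs of a real cluster are not modeled; it only demonstrates that after reduce-scatter plus all-gather every worker holds the same summed result.

```c
/* Toy ring all-reduce: N simulated workers each hold a "gradient" vector
 * of N chunks (one value per chunk for readability). */
#include <stdio.h>

#define N 4   /* number of simulated workers (and chunks) */

int main(void) {
    double data[N][N];      /* data[r][c] = worker r's copy of chunk c */
    double sent[N];

    /* Give each worker a distinct gradient so the result is easy to check. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            data[r][c] = (double)(r + 1) * (c + 1);

    /* Phase 1: reduce-scatter. After N-1 steps, worker r holds the fully
     * summed chunk (r + 1) % N. */
    for (int s = 0; s < N - 1; s++) {
        for (int r = 0; r < N; r++)                       /* snapshot sends */
            sent[r] = data[r][(r - s + N) % N];
        for (int r = 0; r < N; r++)                       /* receive + add  */
            data[r][(r - s - 1 + 2 * N) % N] += sent[(r - 1 + N) % N];
    }

    /* Phase 2: all-gather. The fully reduced chunks rotate around the ring
     * until every worker has every chunk. */
    for (int s = 0; s < N - 1; s++) {
        for (int r = 0; r < N; r++)
            sent[r] = data[r][(r + 1 - s + N) % N];
        for (int r = 0; r < N; r++)
            data[r][(r - s + N) % N] = sent[(r - 1 + N) % N];
    }

    /* Every worker now holds the same summed vector. */
    for (int r = 0; r < N; r++) {
        printf("worker %d:", r);
        for (int c = 0; c < N; c++)
            printf(" %6.1f", data[r][c]);
        printf("\n");
    }
    return 0;
}
```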
Queue Management
Impact of Packet Drops:
In a distributed training environment, the loss of a single packet can disrupt the entire operation.
When a packet is dropped, all nodes must slow down to maintain synchronization. This involves:
Retransmitting the lost packet.
Re-establishing consistency across nodes.
This process introduces latency and delays the completion of training epochs.
Why Queue Management Matters:
Effective queue management helps minimize packet drops by ensuring optimal data flow through network devices.
Techniques such as congestion control, flow prioritization, and buffer optimization play a crucial role in maintaining smooth communication between nodes.
Flow Control
InfiniBand uses a credit-based flow control mechanism to ensure that data is transmitted efficiently and without packet loss: the receiver advertises buffer credits to the sender, and the sender transmits only while it holds credits, so the receiver's buffers can never be overrun.
Queue Pairs (QPs) and Flow Control: InfiniBand uses queue pairs (QPs) to manage communication between devices. Each QP consists of a send queue and a receive queue, which are used to post send and receive work requests; completions are reported through associated completion queues. A small simulation of the credit mechanism is sketched below.
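Here is a toy model of that credit mechanism, with all rates and buffer sizes invented for illustration. The key invariant is that queued packets plus outstanding credits never exceed the receiver's buffer, which is what guarantees zero drops.

```c
/* Toy model of credit-based (link-level) flow control: the receiver
 * advertises buffer credits, the sender transmits only while it holds
 * credits, and credits are returned as the receiver drains its buffer. */
#include <stdio.h>

int main(void) {
    const int rx_cap = 8;    /* receiver buffer capacity                  */
    int credits      = 8;    /* credits advertised by the receiver        */
    int rx_buffer    = 0;    /* packets currently queued at the receiver  */
    int to_send      = 40;   /* packets the sender wants to transmit      */
    int sent = 0, dropped = 0;

    for (int tick = 0; to_send > 0; tick++) {
        /* Sender: transmit up to 3 packets per tick, but only with credits.
         * Without credits the sender simply waits; on plain Ethernet this
         * is where a packet would have been dropped instead. */
        for (int i = 0; i < 3 && to_send > 0; i++) {
            if (credits > 0) {
                credits--; rx_buffer++; sent++; to_send--;
            }
        }
        if (rx_buffer > rx_cap)     /* never happens: credits prevent it */
            dropped++;

        /* Receiver: drain 2 packets per tick and return credits. */
        for (int i = 0; i < 2 && rx_buffer > 0; i++) {
            rx_buffer--; credits++;
        }

        printf("tick=%2d sent=%2d queued=%d credits=%d remaining=%2d\n",
               tick, sent, rx_buffer, credits, to_send);
    }
    printf("dropped=%d (always zero with credit-based flow control)\n",
           dropped);
    return 0;
}
```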
Advanced Networking Technologies
NVLink (NVIDIA interconnect):
A proprietary high-speed interconnect technology developed by NVIDIA.
Provides significantly faster communication between GPUs compared to PCIe.
Designed to optimize AI workloads, especially for models requiring frequent data synchronization.
Key Features:
High bandwidth: Up to 300 GB/s of aggregate GPU-to-GPU bandwidth on earlier configurations (e.g., V100); newer generations reach 600-900 GB/s per GPU.
Reduced latency: Enables GPUs to share memory directly, creating a larger effective memory pool.
Scalability: Supports multiple GPUs in a single system (a peer-to-peer access check is sketched below).
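A small check, using only the CUDA runtime API, of which GPU pairs in a node can access each other's memory directly. It does not distinguish whether a given path is NVLink or PCIe peer-to-peer; that can be inspected separately with `nvidia-smi topo -m`.

```c
/* Report which GPU pairs in this node support direct peer-to-peer access
 * (over NVLink or PCIe P2P), using the CUDA runtime API. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    printf("GPUs found: %d\n", ndev);

    for (int i = 0; i < ndev; i++) {
        for (int j = 0; j < ndev; j++) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n",
                   i, j, can_access ? "yes" : "no");
        }
    }
    return 0;
}
```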
Practical Implications
Synchronization Across Nodes:
Ensuring synchronized operations across nodes is critical to avoid inconsistencies and redundant computations.
Technologies like all-reduce and MPI play a central role in maintaining synchronization.
Minimizing Latency:
Advanced interconnects like RDMA and NVLink significantly reduce the latency associated with data transfers, improving overall system efficiency.
Efficient Resource Utilization:
By bypassing traditional bottlenecks such as the host CPU and kernel networking stack (e.g., with RDMA), systems can achieve higher performance with the same hardware.
Queue management ensures that resources are used effectively, avoiding delays due to congestion.
Summary
Efficient networking in AI training and HPC requires the integration of robust operations like all-reduce, advanced libraries such as MPI and NCCL, RDMA transports, and modern interconnects like NVLink. By addressing challenges like packet drops and latency, these solutions enable scalable, high-performance distributed systems that drive advancements in AI and computational science.