ML ; network-for-ML-workload

NETWORKING FOR AI (AI NETWORKING - AI/ML DATACENTER)

Key Requirements for AI Networking

  • HPC workloads require ultra-low latency with minimal jitter, even at low throughput.
  • AI workloads demand predictable low latency at high throughput (often near 100% utilization).
  • RPCs (Remote Procedure Calls) and memory-to-memory access are crucial, with zero tolerance for packet losses or retransmissions.
    • Example: A single XPU (GPU, TPU, etc.) could generate flows of ~400 Gbps.

AI Backend Networking Layers

AI backend networks consist of several layers:

  • RAIL (Scale-up network – NVLink): Provides high-speed, low-latency memory interconnects between GPUs within the same node. See rail_networks
  • RACK (Scale-up network – InfiniBand): Enables ultra-low-latency communication between nodes within a rack, with RDMA support.
  • LEAF: Connects servers within the rack.
  • SPINE: Connects multiple leaf switches.
  • SUPERSPINE: Interconnects multiple spine switches for large-scale AI clusters.
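
To give a feel for how these layers are dimensioned, here is a toy, illustrative calculation (not a reference design): assuming a hypothetical rack of 32 XPUs each driving ~400 Gbps (the per-XPU figure quoted above) and 800 Gbps spine-facing ports, it computes how many uplinks a leaf needs to stay non-blocking.

<code c>
#include <stdio.h>

/* Toy sizing of a non-blocking leaf switch for an AI rack.
 * All parameters are assumptions for illustration, not a reference design. */
int main(void) {
    const int    xpus_per_rack = 32;      /* assumed XPUs attached to one leaf */
    const double gbps_per_xpu  = 400.0;   /* per-XPU flow rate quoted above    */
    const double uplink_gbps   = 800.0;   /* assumed spine-facing port speed   */

    double downlink_capacity = xpus_per_rack * gbps_per_xpu;   /* toward servers */
    int uplinks_needed = (int)((downlink_capacity + uplink_gbps - 1.0) / uplink_gbps);

    printf("Downlink capacity: %.0f Gbps\n", downlink_capacity);
    printf("800G uplinks needed for a 1:1 (non-blocking) leaf: %d\n", uplinks_needed);
    return 0;
}
</code>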

NCCL Library (NVIDIA Collective Communications Library)

NCCL (the NVIDIA Collective Communications Library) implements topology-aware collective communication primitives for multi-GPU and multi-node training. The sections below summarize its main operations and the role they play in AI networking.

NCCL Collective Operations (ML-Centric)

These operations enable multiple GPUs to communicate efficiently during distributed training, such as gradient synchronization, weight updates, or sharded computations.

The NVIDIA Collective Communications Library (NCCL) is essential in scaling AI workloads across multiple GPUs and nodes.

  • It provides communication routines such as:
    • All-Reduce, Broadcast, All-Gather, and Reduce.
  • NCCL abstracts away the complexity of multi-GPU and multi-node communication, supporting both intra-node (via PCIe/NVLink) and inter-node (via InfiniBand/RoCE) communication.
  • NCCL helps frameworks like TensorFlow and PyTorch efficiently distribute tasks across GPUs, ensuring high bandwidth and low latency.

Link: NCCL GitHub Repository

  • AllReduce: Each GPU computes local gradients, aggregates them by summing element-wise across GPUs, then redistributes the averaged result back to every GPU for synchronized gradient updates.
  • Reduce: Multiple GPUs send locally computed values (e.g., loss, metrics) to one GPU, which aggregates them (usually summing or averaging), keeping the final combined result locally for centralized analysis or logging.
  • Broadcast: A single GPU distributes identical data (such as updated model parameters or initial hyperparameters) to all GPUs, ensuring consistent information before computation begins.
  • Gather: Each GPU produces distinct data (e.g., inference predictions), which are sent to and collected by a single GPU, assembling them into a single dataset or tensor.
  • AllGather: Every GPU shares its unique local data (embeddings, partial outputs) with all other GPUs; all GPUs concatenate this data identically, resulting in each GPU holding the complete combined data set.
  • ReduceScatter: GPUs collaboratively aggregate their local data via summation, then partition and distribute distinct segments of this combined result to individual GPUs for subsequent parallel computations.
  • Scatter: One GPU partitions a large tensor or dataset into distinct segments, distributing each segment to different GPUs to enable parallel processing of unique subsets.
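
As a concrete sketch of the AllReduce semantics above, the single-process example below uses NCCL's C API (ncclCommInitAll, ncclAllReduce) to sum a buffer across all local GPUs; the buffer size and the use of every visible GPU are assumptions, and error handling is omitted. Note that AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter are NCCL's built-in collectives, while Gather and Scatter are usually composed from NCCL point-to-point send/receive calls.

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* Minimal single-process AllReduce across all local GPUs (sketch, no error checks). */
int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    ncclComm_t   *comms   = (ncclComm_t *)malloc(ndev * sizeof(ncclComm_t));
    float       **sendbuf = (float **)malloc(ndev * sizeof(float *));
    float       **recvbuf = (float **)malloc(ndev * sizeof(float *));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));
    const size_t  count   = 1024;   /* elements per GPU (arbitrary size) */

    ncclCommInitAll(comms, ndev, NULL);   /* one communicator per local GPU */

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 0, count * sizeof(float));   /* placeholder "gradients" */
        cudaStreamCreate(&streams[i]);
    }

    /* Group the per-GPU calls so NCCL can launch the collective as one operation. */
    ncclGroupStart();
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("AllReduce (sum) completed on %d GPUs\n", ndev);
    return 0;
}
</code>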

CUDA (Compute Unified Device Architecture)

CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA was created by Nvidia in 2006. When it was first introduced, the name was an acronym for Compute Unified Device Architecture, but Nvidia later dropped the common use of the acronym and now rarely expands it.
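
For context, here is a minimal CUDA C sketch: a vector-add kernel launched on the GPU, which is the kind of general-purpose computation CUDA was created to enable. Sizes and launch parameters are arbitrary choices for the illustration, and error checking is omitted.

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Element-wise vector addition on the GPU (minimal sketch, no error checking). */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;            /* 1M elements (arbitrary size) */
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 256 threads per block; enough blocks to cover n elements. */
    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f (expected 3.0)\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
</code>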

NVLink

NVLink is a wire-based, serial, multi-lane, near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).
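
Whether two GPUs in a node can talk to each other directly (over NVLink when present, otherwise PCIe peer-to-peer) can be probed from the CUDA runtime. The sketch below checks and enables peer access between devices 0 and 1, which are assumed to exist; error handling is omitted.

<code c>
#include <stdio.h>
#include <cuda_runtime.h>

/* Check and enable direct peer-to-peer access between GPU 0 and GPU 1
 * (carried over NVLink when the GPUs are linked, otherwise PCIe). */
int main(void) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   /* can device 0 access device 1's memory? */
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    /* flags must be 0 */
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        printf("Peer access enabled between GPU 0 and GPU 1\n");
    } else {
        printf("No direct peer access; transfers will be staged through host memory\n");
    }
    return 0;
}
</code>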

LAN PROTOCOLS IN AI NETWORKING

NVIDIA InfiniBand

InfiniBand is a key technology for AI workloads, widely used in high-performance computing (HPC) and AI clusters for its ultra-low latency, high throughput, and lossless performance. It plays a crucial role in enabling efficient communication between nodes, especially for distributed machine learning tasks where large amounts of data need to be shared rapidly across many devices.

Key characteristics of InfiniBand in AI networking:

  • Low Latency: InfiniBand offers latencies as low as 1-2 microseconds, which is critical for AI workloads that require predictable performance across nodes.
  • High Bandwidth: Supports bandwidths of up to 400 Gbps per link (with InfiniBand NDR; HDR provides 200 Gbps per link), allowing the transfer of massive datasets needed in AI model training.
  • Lossless Transmission: InfiniBand is inherently lossless, ensuring that there is no packet loss during communication, which is essential for AI workloads that cannot tolerate retransmissions (e.g., when training deep learning models).
  • Remote Direct Memory Access (RDMA): One of the most important features of InfiniBand, RDMA allows direct memory-to-memory transfers between nodes without involving the CPU. This is crucial in reducing CPU overhead and accelerating data transfers, making it ideal for AI training where rapid data sharing is required between nodes.
  • Self-Healing: InfiniBand has built-in self-healing capabilities, which means that in the event of a failure or congestion in a link, it can reroute traffic dynamically to ensure continuous operation.
  • Queue Pair Communication: InfiniBand uses Queue Pairs (QP), consisting of a send queue and a receive queue, for managing communication between nodes.

Key operations managed by InfiniBand Verbs (the API for data transfer operations):

  • Send/Receive: For transmitting and receiving data.
  • RDMA Read/Write: To access remote memory directly.
  • Atomic Operations: Used for updating remote memory with atomicity, ensuring no race conditions in distributed systems.

Common InfiniBand verbs include:

  • ibv_post_send: This verb is used to post a send request to a Queue Pair (QP). It initiates the process of sending data from the local queue to a remote queue.
  • ibv_post_recv: This verb posts a receive request to a Queue Pair (QP). It prepares the local queue to receive incoming data from a remote queue.
  • ibv_reg_mr: This verb registers a memory region (MR) for RDMA access. It allows the application to specify a memory buffer that can be accessed directly by the InfiniBand hardware for data transfer operations.
  • ibv_modify_qp: This verb modifies the state of a Queue Pair (QP). It is used to transition the QP through various states, such as initiating a connection or resetting the QP.
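
To make these verbs concrete, the fragment below registers a memory region and posts a single send work request. It is only a sketch of the calls involved: it assumes a protection domain pd and an already-connected queue pair qp, and it omits device discovery, CQ/QP creation, the QP state transitions performed with ibv_modify_qp, and error handling for brevity.

<code c>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Register a buffer and post one SEND on an already-connected QP (sketch only).
 * 'pd' and 'qp' are assumed to have been created and connected elsewhere. */
int post_one_send(struct ibv_pd *pd, struct ibv_qp *qp, size_t len) {
    void *buf = malloc(len);
    if (!buf) return -1;
    memset(buf, 0, len);

    /* ibv_reg_mr: make the buffer accessible to the HCA for data transfers. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;         /* two-sided send (matched by a posted recv) */
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion on the send CQ       */

    /* ibv_post_send: hand the work request to the send queue of the QP. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
</code>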

InfiniBand is often deployed in AI training clusters where large-scale model training requires seamless, high-speed communication between GPUs across different servers. This makes it a popular choice in supercomputing environments and AI data centers.

RDMA over Converged Ethernet (RoCE)

RoCE (RDMA over Converged Ethernet) is a technology that brings the benefits of RDMA to Ethernet networks, enabling the same low-latency, high-throughput, and lossless data transfer characteristics as InfiniBand but using standard Ethernet infrastructure.

Key aspects of RoCE:

  • RDMA on Ethernet: RoCE allows RDMA to operate over Ethernet, enabling efficient memory-to-memory data transfers between servers without involving the CPU, reducing latency and offloading the CPU from handling the bulk of the data movement.
  • RoCE v1 and RoCE v2:
    • RoCE v1 operates at Layer 2 and is confined within the same Ethernet broadcast domain, meaning it cannot be routed across subnets.
    • RoCE v2 operates at Layer 3 (IP level), making it routable across Layer 3 networks, allowing it to scale across larger environments. This flexibility makes RoCE v2 a better choice for AI and data center networks where communication across subnets is required.
  • Lossless Ethernet: RoCE relies on lossless Ethernet technologies (such as Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS)) to prevent packet loss, ensuring the high reliability that AI workloads need.
  • Congestion Control: Explicit Congestion Notification (ECN) is used in RoCE to manage traffic congestion. When congestion occurs, ECN signals the sender to slow down the transmission rate, preventing packet loss. This makes RoCE suitable for AI workloads that require predictable performance without the need for retransmissions.
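
The effect of ECN-driven congestion control can be pictured with the toy loop below. It is loosely modeled on the DCQCN-style behavior commonly deployed with RoCE v2 (a rate cut when ECN-marked congestion feedback arrives, gradual recovery otherwise); the constants and update rule are illustrative assumptions, not the actual specification.

<code c>
#include <stdio.h>

/* Toy ECN-driven rate control, loosely inspired by DCQCN-style behavior.
 * Constants and update rules are illustrative, not the real algorithm. */
int main(void) {
    double rate  = 400.0;              /* current sending rate in Gbps (assumed line rate) */
    double alpha = 1.0;                /* estimate of congestion severity, in [0,1]        */
    const double g     = 1.0 / 16.0;   /* gain for the alpha moving average                */
    const double raise = 5.0;          /* additive increase per congestion-free step       */

    /* Simulated feedback per interval: 1 = congestion notification received, 0 = none. */
    int cnp[10] = {0, 0, 1, 1, 0, 0, 0, 1, 0, 0};

    for (int t = 0; t < 10; t++) {
        if (cnp[t]) {
            rate  = rate * (1.0 - alpha / 2.0);   /* multiplicative decrease */
            alpha = (1.0 - g) * alpha + g;        /* congestion observed     */
        } else {
            alpha = (1.0 - g) * alpha;            /* congestion fading       */
            rate += raise;                        /* cautious recovery       */
            if (rate > 400.0) rate = 400.0;
        }
        printf("t=%d  ecn=%d  rate=%6.1f Gbps  alpha=%.3f\n", t, cnp[t], rate, alpha);
    }
    return 0;
}
</code>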

Benefits of RoCE in AI networking:

  • Low-Latency Ethernet: RoCE delivers latency as low as 1-2 microseconds, close to the performance of InfiniBand, but using Ethernet infrastructure.
  • Cost-Effective: By leveraging Ethernet, RoCE can often be more cost-effective than building a dedicated InfiniBand fabric, especially when large-scale Ethernet infrastructure already exists.
  • Interoperability: Since RoCE is based on Ethernet, it can be more easily integrated with existing Ethernet-based infrastructure while still providing RDMA-like performance.
  • Compatibility with AI Workloads: Like InfiniBand, RoCE supports high-speed, low-latency, and lossless communication, making it ideal for distributed AI workloads such as training deep learning models across multiple GPUs or nodes.

RoCE is increasingly being adopted in AI training clusters, where the flexibility of Ethernet is needed, but the high performance of RDMA is still crucial.

Ultra Ethernet is an evolving specification from the Ultra Ethernet Consortium (whose members include Arista, Broadcom, Cisco, AMD, Intel, HPE, Meta, and Microsoft) that builds on lessons learned from RoCE to create even more robust, low-latency, and lossless Ethernet fabrics optimized for AI workloads, where predictable, lossless communication is key.

Link: Ultra Ethernet Specification

Together, InfiniBand and RoCE provide the backbone for high-performance AI networking, offering low-latency, high-throughput, and lossless data movement—essential for scaling deep learning models across multi-node architectures.

Google Aquila

  • Google Aquila is an experimental low-latency datacenter fabric developed by Google. It tightly integrates custom NIC/switch hardware with a lightweight Layer-2 protocol to provide RDMA-like, microsecond-scale communication at scale, targeting tightly coupled workloads such as large-scale ML training and HPC.

Link: Google Aquila Platform

Additional Thoughts

The shift to AI networking introduces new challenges but builds on the fundamental principles of latency, throughput, and lossless data transmission.

AI networking requires you to:

  • Understand verbs (RDMA) for efficient memory transfers.
  • Focus on predictable performance in latency and bandwidth for AI workloads.

By expanding your knowledge in RDMA technologies like InfiniBand and RoCE, you'll bridge the gap between traditional networking and the demands of modern AI infrastructures.

Networking for AI Training and HPC

This document elaborates on key networking concepts, technologies, and strategies critical for AI training and high-performance computing (HPC). These insights are based on a podcast focused on cutting-edge networking practices in these domains.

Training Operations

  • All-Reduce Operation:
    1. In distributed AI training, models are trained across multiple nodes (e.g., GPUs, CPUs) to leverage parallel computation.
    2. During this process, intermediate results from each node must be aggregated and shared with all other nodes.
    3. The 'all-reduce' operation combines all distributed results into one, ensuring that each node operates with the same data.
    4. This operation is fundamental for synchronizing model updates, particularly in gradient descent optimization algorithms used in deep learning.
    5. Inefficient implementation of all-reduce can lead to bottlenecks, significantly impacting training speed.
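
A minimal sketch of this step using MPI (discussed further below): each rank holds a small "gradient" vector, MPI_Allreduce sums it element-wise across all ranks, and every rank then divides by the world size to obtain the same averaged update. Buffer size and contents are illustrative.

<code c>
#include <stdio.h>
#include <mpi.h>

/* Gradient averaging with MPI_Allreduce (minimal sketch).
 * Run with e.g.: mpirun -np 4 ./allreduce_demo */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank's locally computed "gradients" (illustrative values). */
    double local[3] = {0.1 * rank, 0.2 * rank, 0.3 * rank};
    double global[3];

    /* Element-wise sum across all ranks; every rank receives the same result. */
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Average so that every rank applies an identical, synchronized update. */
    for (int i = 0; i < 3; i++) global[i] /= size;

    if (rank == 0)
        printf("averaged gradient: %.3f %.3f %.3f\n", global[0], global[1], global[2]);

    MPI_Finalize();
    return 0;
}
</code>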

Queue Management

  • Impact of Packet Drops:
    1. In a distributed training environment, the loss of a single packet can disrupt the entire operation.
    2. When a packet is dropped, all nodes must slow down to maintain synchronization. This involves:
      1. Retransmitting the lost packet.
      2. Re-establishing consistency across nodes.
    3. This process introduces latency and delays the completion of training epochs.
    4. Why Queue Management Matters:
      1. Effective queue management helps minimize packet drops by ensuring optimal data flow through network devices.
      2. Techniques such as congestion control, flow prioritization, and buffer optimization play a crucial role in maintaining smooth communication between nodes.
  • MPI (Message Passing Interface):
    1. A standardized library for message-passing used in distributed computing environments.
    2. Provides functions for sending, receiving, and synchronizing data between multiple nodes.
    3. Commonly used for coordinating processes in HPC and AI training.
    4. Strengths:
      1. Scalability: Handles thousands of nodes efficiently.
      2. Flexibility: Supports various communication patterns (e.g., point-to-point, broadcast).
  • RDMA (Remote Direct Memory Access):
    1. Enables direct memory access between GPUs (or CPUs) across a network without involving the CPU.
    2. Significantly reduces latency and CPU overhead during data transfers.
    3. Critical for high-speed, low-latency communication in distributed training setups.
    4. Applications:
      1. GPU-to-GPU communication in AI training.
      2. Real-time data transfers in HPC simulations.

Flow Control

  • Credit-Based Flow Control: InfiniBand uses link-level, credit-based flow control to ensure that data is transmitted efficiently and without packet loss. The receiver advertises credits that correspond to its available buffer space, and the sender transmits only when it holds credits, so frames are never dropped due to buffer overflow.
  • Queue Pairs (QPs) and Flow Control: InfiniBand uses queue pairs (QPs) to manage communication between devices. Each QP consists of a send queue and a receive queue, to which applications post send and receive work requests (via verbs such as ibv_post_send and ibv_post_recv); completions are reported back through completion queues.

Advanced Networking Technologies

  • Direct GPU-to-GPU Communication:
    1. Traditional communication methods rely on NICs (Network Interface Cards), which introduce latency and reduce efficiency.
    2. Modern systems aim to bypass NICs by enabling direct communication between GPUs.
    3. Advantages:
      1. Reduced data transfer latency.
      2. Higher throughput due to fewer intermediate processing steps.
  • PCIBus to PCIBus Communication:
    1. Eliminates the need for traditional NICs by directly connecting PCI buses between systems.
    2. Allows for faster data transfer rates and lower latency compared to network-based communication.
    3. Commonly used in high-performance systems for AI and HPC workloads.
  • NVLink (NVIDIA interconnect):
    1. A proprietary high-speed interconnect technology developed by NVIDIA.
    2. Provides significantly faster communication between GPUs compared to PCIe.
    3. Designed to optimize AI workloads, especially for models requiring frequent data synchronization.
    4. Key Features:
      1. High bandwidth: Hundreds of GB/s of aggregate per-GPU bandwidth (e.g., 300 GB/s on V100, 600 GB/s on A100, 900 GB/s on H100).
      2. Reduced latency: Enables GPUs to share memory directly, creating a larger effective memory pool.
      3. Scalability: Supports multiple GPUs in a single system.
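
As a small illustration of direct GPU-to-GPU transfers (carried over NVLink where available, otherwise PCIe), the sketch below copies a buffer from device 0 to device 1 with cudaMemcpyPeer. It assumes at least two GPUs are present and omits error handling.

<code c>
#include <stdio.h>
#include <cuda_runtime.h>

/* Copy a buffer directly from GPU 0 to GPU 1 (sketch; assumes >= 2 GPUs). */
int main(void) {
    const size_t bytes = 64 << 20;   /* 64 MiB (arbitrary size) */
    float *src, *dst;

    cudaSetDevice(0);
    cudaMalloc((void **)&src, bytes);
    cudaMemset(src, 0, bytes);

    cudaSetDevice(1);
    cudaMalloc((void **)&dst, bytes);

    /* Device-to-device copy, routed over NVLink when the GPUs are linked,
     * otherwise over PCIe (possibly staged through host memory). */
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    printf("Copied %zu bytes from GPU 0 to GPU 1\n", bytes);

    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(1); cudaFree(dst);
    return 0;
}
</code>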

Practical Implications

  1. Synchronization Across Nodes:
    1. Ensuring synchronized operations across nodes is critical to avoid inconsistencies and redundant computations.
    2. Technologies like all-reduce and MPI play a central role in maintaining synchronization.
  2. Minimizing Latency:
    1. Advanced interconnects like RDMA and NVLink significantly reduce the latency associated with data transfers, improving overall system efficiency.
  3. Efficient Resource Utilization:
    1. By bypassing traditional bottlenecks like NICs, systems can achieve higher performance with the same hardware.
    2. Queue management ensures that resources are used effectively, avoiding delays due to congestion.

Summary

Efficient networking in AI training and HPC requires the integration of robust collective operations like all-reduce, communication libraries such as MPI and NCCL, RDMA-capable fabrics such as InfiniBand and RoCE, and high-speed interconnects like NVLink. By addressing challenges like packet drops and latency, these solutions enable scalable, high-performance distributed systems that drive advancements in AI and computational science.
