====== notes ======
NVLink is the protocol itself: a low-level, hardware-accelerated, point-to-point interconnect designed specifically for GPU-to-GPU and GPU-to-CPU communication.
\\
Protocol stack: no TCP/IP, Ethernet, or InfiniBand. NVLink uses a proprietary packet protocol with minimal overhead.
\\
Addressing: physical memory addresses (not IP/MAC). Supports GPU memory coherence and direct peer access via GPUDirect RDMA.
\\
How NVSwitch works: path setup is driven by the GPUs, which request paths via control packets; NVSwitch then establishes dedicated circuits (circuit-switched) for sustained transfers...

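From software, none of this packet and circuit machinery is exposed; CUDA simply presents peer GPU memory behind ordinary device pointers. The sketch below is an illustration only, assuming at least two peer-capable GPUs, with an arbitrary buffer size and device indices; it performs a direct GPU-to-GPU copy, which travels over NVLink whenever the hardware provides it.

<code cpp>
// Minimal sketch: direct GPU-to-GPU copy that uses NVLink when peer access exists.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { std::printf("need at least 2 GPUs\n"); return 0; }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    std::printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

    const size_t bytes = 64 << 20;       // 64 MiB, arbitrary
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    if (can01) cudaDeviceEnablePeerAccess(1, 0);   // map GPU1 memory into GPU0's context

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);   // map GPU0 memory into GPU1's context

    // The copy is expressed with device pointers only - no sockets, IPs, or MAC addresses.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
</code>
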
====== NVIDIA NVLink: History and Technical Architecture Report ======
  
  
The technology's seamless integration into NVIDIA's software ecosystem, combined with its hardware advantages, establishes NVLink as a key differentiator in the competitive landscape of AI acceleration and high-performance computing. Future developments in optical interconnects, chip-to-chip integration, and emerging application domains will likely drive continued evolution of this critical interconnect technology.


----

====== NVIDIA NVLink 4: Architecture, Topologies & Use Cases (H100 Focus) ======
**Last Updated:** 2025/07/07 \\
**Author:** DeepSeek-R1

===== 1. NVLink 4 Core Architecture (H100) =====
**Physical Layer** (see the NVML sketch below for enumerating these links in software)
  * **Links per GPU**: 18 links × 50 GB/s bidirectional per link → **900 GB/s** aggregate bandwidth
  * **Signaling**: PAM4 modulation, two differential pairs per direction per link at ~100 Gb/s each (versus four pairs of 50 Gb/s NRZ signaling in NVLink 3)
  * **Power Efficiency**: 1.3 pJ/bit (roughly 40% better than NVLink 3)
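
As a quick way to confirm the link count from software, here is a small host-side sketch using NVML (the management library that ships with the driver). It is a sketch only: error handling is minimal, and it simply counts links that report an active state; on an H100 SXM part it should print 18 per GPU.

<code cpp>
// Count active NVLink links per GPU via NVML.
// Typical build: nvcc count_nvlinks.cpp -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        std::printf("NVML init failed\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = {0};
        nvmlDeviceGetName(dev, name, NVML_DEVICE_NAME_BUFFER_SIZE);

        // NVML_NVLINK_MAX_LINKS is the upper bound the header allows;
        // unpopulated link slots return an error or a disabled state.
        unsigned int active = 0;
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t state;
            if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS &&
                state == NVML_FEATURE_ENABLED) {
                ++active;
            }
        }
        std::printf("GPU %u (%s): %u active NVLink links\n", i, name, active);
    }

    nvmlShutdown();
    return 0;
}
</code>

The command-line equivalent is ''nvidia-smi nvlink --status'', which reports per-link state and speed.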

**Key Innovations**
  * **Transformer Engine Integration**:
    * Dynamic FP8/FP16 precision switching during transfers
    * FP8 payloads are half the size of FP16, roughly halving communication volume for LLM training
  * **DPX Instructions**:
    * Hardware acceleration of dynamic-programming algorithms (e.g., Smith-Waterman in genomics, route optimization)

===== 2. Intra-Node Topology (NVLink + NVSwitch) =====
**8-GPU DGX H100 Configuration** (4 of the 8 GPUs shown; each GPU's 18 links terminate in a shared NVSwitch fabric rather than in direct GPU-to-GPU connections)
<code>
┌──────┐   ┌──────┐
│ GPU1 ├─┬─┤ GPU2 │
└──┬───┘ │ └──┬───┘
   │NVLink│   │
┌──┴───┐ │ ┌──┴───┐
│ GPU3 ├─┼─┤ GPU4 │
└──┬───┘ │ └──┬───┘
   ├─── NVSwitch ───┤
</code>

**NVSwitch Specs (third-generation NVSwitch, paired with NVLink 4)** (a peer-to-peer bandwidth sketch follows the table)
^ Parameter ^ Value ^
| Ports     | 64 NVLink 4 ports |
| Bandwidth | 3.2 TB/s aggregate (64 ports × 50 GB/s) |
| Latency   | < 500 ns hop-to-hop |
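
A rough way to sanity-check the fabric is to time peer-to-peer copies with CUDA events. The sketch below is illustrative only: device indices 0 and 1, a 1 GiB payload, and the default stream are arbitrary choices, and a single copy stream will not necessarily saturate every link, but on an NVSwitch-connected pair the result should land far above what a PCIe-only path delivers.

<code cpp>
// Rough unidirectional GPU0 -> GPU1 copy bandwidth over the peer path.
// Error checking omitted for brevity; assumes both GPUs are peer-capable.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;   // 1 GiB payload (arbitrary)
    const int reps = 10;
    const int src = 0, dst = 1;        // assumed device indices

    void *a = nullptr, *b = nullptr;
    cudaSetDevice(dst);
    cudaMalloc(&b, bytes);
    cudaDeviceEnablePeerAccess(src, 0);   // allow dst to map src's memory
    cudaSetDevice(src);
    cudaMalloc(&a, bytes);
    cudaDeviceEnablePeerAccess(dst, 0);   // allow src to map dst's memory

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaMemcpyPeerAsync(b, dst, a, src, bytes, 0);   // warm-up
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(b, dst, a, src, bytes, 0);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    std::printf("GPU%d -> GPU%d: %.1f GB/s\n",
                src, dst, (double)reps * bytes / (ms / 1e3) / 1e9);

    cudaFree(a);
    cudaSetDevice(dst);
    cudaFree(b);
    return 0;
}
</code>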

===== 3. Inter-Node Topology (NVLink Switch System) =====
**Multi-Node Scaling**
  * **NVLink Switch System (H100 generation)**: external NVLink switches extend the NVLink 4 fabric beyond a single node, to NVLink domains of up to **256 GPUs**
  * **NVLink Switch (NVLink 5 generation)**: 144 ports, 14.4 TB/s non-blocking, supporting domains of up to **576 GPUs** (72 nodes of 8 GPUs)
  * **Topology**: Fat-tree with SHARP in-network reduction (a minimal NCCL sketch follows the fabric diagram below)

**Hybrid Fabric (DGX SuperPOD)**
<code>
[Node1-GPUs]─[NVLink Switch]─[Node2-GPUs]
      │                  │
[NDR InfiniBand Spine (400 Gb/s)]
      │
[Storage/CPU Traffic]
</code>
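
Applications rarely program the NVLink or InfiniBand layers directly; collectives go through NCCL, which routes traffic over NVLink/NVSwitch inside a node and over InfiniBand or the NVLink Switch System between nodes, using in-network (SHARP-style) reduction where the fabric supports it. The sketch below is a minimal single-process example, assuming one process drives all visible GPUs; a real multi-node job would use ''ncclCommInitRank'' with an out-of-band bootstrap (e.g., MPI) instead of ''ncclCommInitAll''.

<code cpp>
// Single-process all-reduce across all visible GPUs with NCCL.
// Typical build: nvcc allreduce.cu -lnccl
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    std::vector<int> devs(nDev);
    for (int i = 0; i < nDev; ++i) devs[i] = i;

    std::vector<ncclComm_t> comms(nDev);
    ncclCommInitAll(comms.data(), nDev, devs.data());   // one communicator per GPU

    const size_t count = 1 << 24;                        // 16M floats per GPU (arbitrary)
    std::vector<float*> send(nDev), recv(nDev);
    std::vector<cudaStream_t> streams(nDev);
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&send[i], count * sizeof(float));
        cudaMalloc(&recv[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(send[i]);
        cudaFree(recv[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    std::printf("all-reduce of %zu floats across %d GPUs done\n", count, nDev);
    return 0;
}
</code>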

===== 4. Protocol Stack =====
**Communication Layers**
^ Layer     ^ Functionality ^
| Physical  | PAM4 signaling (~100 Gb/s per differential pair) |
| Link      | Flow control, error correction |
| Transport | SHARP in-network reduction |
| Software  | Magnum IO, CUDA 12+ APIs |

**Key Protocols**
  * **TMA (Tensor Memory Accelerator)**: asynchronous bulk tensor transfers between global and shared memory, including SM-to-SM copies within a thread block cluster (sketch below)
  * **Address Space Isolation**: hardware partitioning for multi-tenant security
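
TMA itself is reached through low-level Hopper interfaces (CUTLASS, PTX bulk-copy instructions, or libraries built on them), so the sketch below only illustrates the general asynchronous-copy model that TMA accelerates, using the portable ''cooperative_groups::memcpy_async'' API; the kernel name, tile size, and the trivial reduction are invented for the example.

<code cpp>
// Sketch: stage a tile into shared memory with an asynchronous bulk copy,
// then consume it. Launch e.g. tile_sum<<<(n + TILE - 1) / TILE, 256>>>(in, out, n)
// with out zero-initialized beforehand.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 1024;   // hypothetical tile size

__global__ void tile_sum(float* in, float* out, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    int base = blockIdx.x * TILE;
    int count = min(TILE, n - base);
    if (count <= 0) return;

    // One asynchronous bulk copy (global -> shared), issued cooperatively by the block.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
    cg::wait(block);   // block until the staged data has landed in shared memory

    // Trivial consumer: block-strided accumulation into out[blockIdx.x].
    float local = 0.f;
    for (int i = threadIdx.x; i < count; i += blockDim.x) local += tile[i];
    atomicAdd(&out[blockIdx.x], local);
}
</code>

The point of the pattern is that the block issues one bulk copy into shared memory and only synchronizes when the data is actually needed, leaving the copy free to overlap with other work.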

===== 5. Use Cases & Performance =====
**LLM Training**
  * GPT-3 175B: up to 4× faster training vs A100
  * Megatron-Turing NLG 530B inference: up to 30× lower latency vs A100

**HPC Workloads**
  * 3D-FFT: 7× throughput in multi-node clusters
  * Smith-Waterman: 7× acceleration via DPX instructions

**Generative AI**
  * RAG pipelines: 5× faster vector DB retrieval
  * H100 NVL (dual-GPU): 188 GB HBM3 for 70B-parameter inference

===== 6. Limitations =====
  * **Thermal**: up to ~10.2 kW per DGX H100 node, pushing facilities toward liquid or high-density air cooling
  * **Cost**: a DGX H100 system runs roughly $238K
  * **Scalability**: up to 256 GPUs per NVLink 4 domain (576 arrives only with the NVLink 5 generation)

===== 7. Future Roadmap =====
  * **Grace Hopper Superchip**: CPU-GPU memory coherence over a 900 GB/s NVLink-C2C link
  * **NVLink 5**: 1.8 TB/s per GPU in Blackwell GPUs
  * **Optical I/O**: mitigating electrical reach and power constraints

===== Appendix: Comparison Table =====
^ Generation   ^ Bandwidth/GPU ^ Max GPUs per NVLink Domain     ^ Key Feature ^
| NVLink 3     | 600 GB/s      | 8 (intra-node NVSwitch)        | Ampere support |
| **NVLink 4** | **900 GB/s**  | **256** (NVLink Switch System) | **Transformer Engine** |
| NVLink 5     | 1.8 TB/s      | 576                            | Blackwell integration |

{{tag>nvidia nvlink h100 gpu}}