====== notes ======
NVLink is the protocol itself: a low-level, hardware-accelerated GPU-to-GPU interconnect.
\\
Addressing: physical memory addresses (not IP/MAC). Supports GPU memory coherence via GPUDirect RDMA.
\\
How NVSwitch works: path setup. GPUs request paths via control packets; NVSwitch establishes dedicated circuits (circuit-switched) for sustained transfers...

====== NVIDIA NVLink: History and Technical Architecture Report ======
| + | ---- | ||
| + | |||
| + | ### NVIDIA NVLink 4: Architecture, | ||
| + | **Last Updated:** 2025/ | ||
| + | **Author:** DeepSeek-R1 | ||
| + | |||
===== 1. NVLink 4 Core Architecture (H100) =====
**Physical Layer**
  * **Links per GPU**: 18 links × 50 GB/s each → **900 GB/s** aggregate bidirectional bandwidth
  * **Signaling**:
  * **Power Efficiency**:

**Key Innovations**
  * **Transformer Engine Integration**:
    - Dynamic FP8/FP16 precision switching during transfers
    - Reduces LLM training bandwidth by 50%
  * **DPX Instructions**:
    - Optimized data exchange for genomics/

===== 2. Intra-Node Topology (NVLink + NVSwitch) =====
**8-GPU DGX H100 Configuration**
<code>
┌──────┐   ┌──────┐
│ GPU1 ├─┬─┤ GPU2 │
└──┬───┘ │ └──┬───┘
   │     │    │
┌──┴───┐ │ ┌──┴───┐
│ GPU3 ├─┼─┤ GPU4 │
└──┬───┘ │ └──┬───┘
   │     │    │
</code>

**NVSwitch 4 Specs**
^ Parameter ^ Value ^
| Ports | 64 NVLink ports |
| Bandwidth | |
| Latency | |

===== 3. Inter-Node Topology (NVLink Switch System) =====
**Multi-Node Scaling**
  * **NVLink Switch 5**:
    - 144 ports, 14.4 Tb/s non-blocking
    - Supports **576 GPUs** (72 nodes)
  * **Topology**:

**Hybrid Fabric (DGX SuperPOD)**
<code>
[Node1-GPUs]─[NVLink Switch]─[Node2-GPUs]
      │                           │
      [NDR InfiniBand Spine (400Gb/s)]
                  │
              [Storage/
</code>

===== 4. Protocol Stack =====
**Communication Layers**
^ Layer ^ Functionality ^
| Physical | |
| Link | Flow control, error correction |
| Transport | |
| Software | |

**Key Protocols**
  * **TMA (Tensor Memory Accelerator)**: asynchronous bulk tensor transfers between global and shared memory
  * **Address Space Isolation**:

===== 5. Use Cases & Performance =====
**LLM Training**
  * GPT-3 175B: 4× faster training vs A100
  * Megatron 530B: 30× lower latency

**HPC Workloads**
  * 3D-FFT: 7× throughput in multi-node clusters
  * Smith-Waterman:

**Generative AI**
  * RAG Pipelines: 5× faster vector DB retrieval
  * H100 NVL (Dual-GPU): 188 GB HBM3 for 70B-parameter inference

===== 6. Limitations =====
  * **Thermal**:
  * **Cost**: DGX H100 system ~$238K
  * **Scalability**:

===== 7. Future Roadmap =====
  * **Grace Hopper Superchip**: NVLink-C2C coherent CPU-to-GPU link (900 GB/s)
  * **NVLink 5**: 1.8 TB/s in Blackwell GPUs
  * **Optical I/O**: Mitigating electrical constraints

===== Appendix: Comparison Table =====
^ Generation ^ Bandwidth/GPU ^ Max GPUs ^ Key Feature ^
| NVLink 3 | 600 GB/s | 8 | Ampere support |
| **NVLink 4** | **900 GB/s** | 256 | Hopper support |
| NVLink 5 | 1.8 TB/s | 576+ | Blackwell integration |

===== References =====
  * NVIDIA H100 Architecture Whitepaper (2023)
  * DGX SuperPOD Reference Design (2024)
  * Hopper TMA Programming Guide (2025)

----