6.2. GPU-NVMe Latency Comparison
Note: These figures represent typical ranges from various benchmarks and can vary significantly based on hardware generation, workload patterns, and specific implementations.
| Configuration | Typical One-Way Latency | Round-Trip Latency | Key Characteristics |
|---|---|---|---|
| Local NVMe (PCIe) - Standard | 30–100 μs | 60–200 μs | CPU-mediated, kernel stack involved |
| Local NVMe (PCIe) - GPU Direct Storage | 20–60 μs | 40–120 μs | GPU↔SSD direct, CPU bypass |
| InfiniBand - NVMe-oF/RDMA | 60–170 μs | 120–340 μs | CPU bypass, lossless fabric, sub-10 μs network |
| InfiniBand - TCP/IP (no RDMA) | 200–500 μs | 400–1000 μs | CPU overhead, interrupt handling, context switches |
| Ethernet - RoCE v2 (RDMA) | 70–200 μs | 140–400 μs | Similar to IB RDMA, requires lossless Ethernet |
| Ethernet - TCP/IP (no RDMA) | 300–800 μs | 600–1600 μs | Highest overhead, best-effort network |