6.2. GPU-NVMe Latency Comparison

Note: These figures represent typical ranges from various benchmarks and can vary significantly based on hardware generation, workload patterns, and specific implementations.

| Configuration | Typical One-Way Latency | Round-Trip Latency | Key Characteristics |
|---|---|---|---|
| Local NVMe (PCIe), standard I/O path | 30–100 μs | 60–200 μs | CPU-mediated; kernel storage stack involved |
| Local NVMe (PCIe), GPUDirect Storage | 20–60 μs | 40–120 μs | Direct GPU↔SSD DMA; CPU bypass |
| InfiniBand, NVMe-oF over RDMA | 60–170 μs | 120–340 μs | CPU bypass; lossless fabric; sub-10 μs network latency |
| InfiniBand, TCP/IP (no RDMA) | 200–500 μs | 400–1000 μs | CPU overhead; interrupt handling; context switches |
| Ethernet, RoCE v2 (RDMA) | 70–200 μs | 140–400 μs | Similar to InfiniBand RDMA; requires lossless Ethernet |
| Ethernet, TCP/IP (no RDMA) | 300–800 μs | 600–1600 μs | Highest overhead; best-effort network |