
PCIe, USB, and NVLink: Why Your GPU Cares About the Speed of Its Commute

Scott Morrison · November 15, 2025
Tags: PCIe, USB, NVLink, RDMA, GPU interconnects, bandwidth, data transfer, hardware architecture, high performance computing, AI infrastructure
Every byte traveling through your computer takes a bus, and not all buses are created equal. While USB and PCIe have been dutifully ferrying data for decades, NVLink showed up like a private jet for GPUs that refuse to wait in traffic, and for certain RDMA workloads, that makes all the difference.

Let's talk about the roads inside your computer. Not the metaphorical information superhighway, but the actual physical buses that move data between components. Because it turns out that in the world of high-performance computing, the speed limit matters just as much as the speed of your processors.

USB: The Universal Serial Bottleneck

USB started as a noble idea: one connector to rule them all. No more PS/2 ports, parallel ports, and serial ports living in chaos. Just plug it in and it works. The "Universal" in USB wasn't kidding around.

USB 2.0 gave us 480 Mbps (megabits per second), which seemed fast in 2000 when we were connecting keyboards and mice. That's about 60 MB/s in practical terms, enough for your external hard drive to feel sluggish but functional.

USB 3.0 (later rebranded as USB 3.1 Gen 1, because naming things is hard) jumped to 5 Gbps, or roughly 500 MB/s. Suddenly external SSDs made sense.

USB 3.1 Gen 2 doubled that to 10 Gbps. Then came USB 3.2, which can hit 20 Gbps if you're using both lanes. The naming got so confusing that we collectively agreed to pretend it makes sense and move on.

USB4 brings us to 40 Gbps, matching Thunderbolt 3 and finally putting USB in the same conversation as professional interconnects. That's about 5 GB/s, which is respectable.
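The bits-to-bytes conversion behind those numbers is worth making explicit, since line rates are quoted in bits per second while file transfers are felt in bytes. A minimal sketch; the 8b/10b figure applies to USB 3.0 Gen 1 links, and real-world throughput is lower still once protocol overhead is counted:

```python
def usb_throughput_mb_s(line_rate_mbps: float, bits_per_byte: float = 8) -> float:
    """Approximate throughput in MB/s from a line rate in Mbps.

    USB 3.0's 8b/10b encoding spends 10 line bits per data byte,
    so pass bits_per_byte=10 for Gen 1 links. Protocol overhead
    (packet framing, flow control) reduces this further in practice.
    """
    return line_rate_mbps / bits_per_byte

# USB 2.0: raw bits-to-bytes conversion
print(usb_throughput_mb_s(480))        # → 60.0
# USB 3.0 (Gen 1): 5 Gbps line rate with 8b/10b encoding
print(usb_throughput_mb_s(5000, 10))   # → 500.0
```

This is why 5 Gbps lands at "roughly 500 MB/s" rather than the naive 625 MB/s: the encoding eats a fifth of the line rate.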

But here's the thing: USB is designed for external peripherals. It's a protocol built for flexibility, plug-and-play compatibility, and not electrocuting users who plug things in backwards. Those are admirable goals, but they come with overhead. For internal, high-bandwidth communication between major system components, we need something faster.

PCIe: The Highway System Inside Your Computer

PCI Express is the backbone of modern computer architecture. It's how your CPU talks to your GPU, your NVMe drives, your network cards, and basically everything else that needs serious bandwidth.

PCIe works with lanes, typically x1, x4, x8, or x16 configurations. Think of lanes like highway lanes: more lanes means more total throughput, assuming each lane maintains its speed.

Let's look at the evolution per lane:

PCIe 1.0 (2003): 250 MB/s per lane. A x16 slot could move 4 GB/s. Revolutionary for its time, quaint by today's standards.

PCIe 2.0 (2007): 500 MB/s per lane, 8 GB/s for x16. Your GPU could finally stretch its legs a bit.

PCIe 3.0 (2010): 985 MB/s per lane, nearly 16 GB/s for x16. This is where things got interesting for graphics and started looking attractive for storage.

PCIe 4.0 (2017): 1.97 GB/s per lane, about 32 GB/s for x16. High-end GPUs stopped being quite so bandwidth-starved.

PCIe 5.0 (2019): 3.94 GB/s per lane, roughly 63 GB/s for x16. Now we're talking speeds that matter for AI workloads.

PCIe 6.0 (2022): 7.88 GB/s per lane, approximately 126 GB/s for x16. Still rolling out, but the specs are impressive.

For most applications, PCIe 3.0 or 4.0 is plenty. Your gaming GPU rarely saturates PCIe 3.0 x16. Your NVMe SSD on PCIe 4.0 x4 can hit 7 GB/s, which is fast enough that you'll blame something else when your application is slow.
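The per-lane figures above compose by simple multiplication, which makes slot-width math easy to sanity-check. A quick sketch using the per-lane numbers quoted in this section (actual throughput depends on encoding and protocol overhead):

```python
# Approximate usable bandwidth per lane, in GB/s, by PCIe generation
# (figures as quoted above; real transfers achieve somewhat less).
PCIE_GB_S_PER_LANE = {
    1.0: 0.25,
    2.0: 0.50,
    3.0: 0.985,
    4.0: 1.97,
    5.0: 3.94,
    6.0: 7.88,
}

def pcie_bandwidth(gen: float, lanes: int) -> float:
    """Total one-direction bandwidth in GB/s for a given slot width."""
    return PCIE_GB_S_PER_LANE[gen] * lanes

print(f"PCIe 4.0 x4 (NVMe SSD): {pcie_bandwidth(4.0, 4):.1f} GB/s")
print(f"PCIe 5.0 x16 (GPU):     {pcie_bandwidth(5.0, 16):.1f} GB/s")
```

The x4 result (~7.9 GB/s) is exactly why top-end PCIe 4.0 NVMe drives plateau around 7 GB/s: they are sitting at the ceiling of their slot.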

Why These Numbers Actually Matter

Here's where we separate the "that's a nice spec" from the "this fundamentally changes what's possible."

For consumer applications, USB and PCIe speeds have outpaced most needs. Your external drive on USB 3.2? Fast enough. Your GPU on PCIe 4.0? Not the bottleneck in your gaming performance.

But in three domains, interconnect bandwidth becomes critical:

Data-intensive AI training: Modern AI models have billions of parameters. Training them requires moving massive amounts of data between GPUs, between GPUs and memory, and between GPUs and storage. When you're training GPT-scale models, every GB/s matters.

High-performance storage: NVMe SSDs can saturate even PCIe 4.0 x4 connections. Data center applications with dozens of drives need the bandwidth that PCIe 5.0 provides.

GPU-to-GPU communication: This is where things get really interesting. When multiple GPUs need to work together on a problem, they need to exchange data constantly. PCIe works, but it's not ideal for this use case.
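To make "every GB/s matters" concrete, here is a back-of-envelope sketch of the traffic one data-parallel training step generates. It assumes a ring all-reduce, which moves roughly 2·(N−1)/N times the gradient payload per GPU, with fp16 gradients; the 70-billion-parameter count is a hypothetical example, not a specific model:

```python
def allreduce_bytes_per_gpu(param_count: float, bytes_per_param: int = 2,
                            num_gpus: int = 8) -> float:
    """Approximate bytes each GPU transfers in one ring all-reduce.

    Ring all-reduce moves about 2*(N-1)/N times the payload per GPU.
    bytes_per_param=2 assumes fp16/bf16 gradients.
    """
    payload = param_count * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * payload

# A hypothetical 70-billion-parameter model, fp16 gradients, 8 GPUs:
traffic_gb = allreduce_bytes_per_gpu(70e9) / 1e9
print(f"~{traffic_gb:.0f} GB moved per GPU per synchronization step")
```

Hundreds of gigabytes per GPU per step, repeated thousands of times per training run: divide by your interconnect bandwidth and the synchronization time either disappears into the compute or dominates it.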

Enter NVLink: NVIDIA's Private Highway

NVIDIA looked at PCIe and said, "What if we just... didn't?" NVLink is NVIDIA's proprietary high-speed interconnect specifically designed for GPU-to-GPU and GPU-to-CPU communication.

NVLink 1.0 (2016, with Pascal GPUs): 20 GB/s per direction per link, or 40 GB/s bidirectional. The P100's four links provide 160 GB/s of aggregate bidirectional bandwidth.

NVLink 2.0 (2017, with Volta GPUs): 25 GB/s per direction per link (50 GB/s bidirectional). The V100's six links total 300 GB/s.

NVLink 3.0 (2020, with Ampere): the same 50 GB/s per link, but the A100 doubled the link count to twelve for 600 GB/s of total bandwidth.

NVLink 4.0 (2022, with Hopper): eighteen links on the H100 bring the aggregate to 900 GB/s.

Compare that to PCIe 5.0's 63 GB/s for x16. NVLink isn't just faster; for GPU-to-GPU traffic it delivers more than an order of magnitude more bandwidth.
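That gap translates directly into transfer time. A rough comparison, using the peak figures quoted in this article (real transfers achieve less) and a hypothetical 80 GB payload, roughly the weights and optimizer state of a large model:

```python
def transfer_seconds(payload_bytes: float, gb_per_s: float) -> float:
    """Time to move a payload at a given link bandwidth (peak, idealized)."""
    return payload_bytes / (gb_per_s * 1e9)

payload = 80e9  # hypothetical 80 GB of model weights / optimizer state

for name, bw in [("PCIe 4.0 x16", 31.5),
                 ("PCIe 5.0 x16", 63.0),
                 ("NVLink 4 (H100)", 900.0)]:
    print(f"{name}: {transfer_seconds(payload, bw):.2f} s")
```

Over two and a half seconds on PCIe 4.0 versus under a tenth of a second on NVLink 4. When that transfer happens once, nobody cares; when it happens every training step, it is the whole ballgame.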

The RDMA Connection

Here's where NVLink becomes critical for a specific class of applications: RDMA (Remote Direct Memory Access). In traditional networking, moving data between machines involves the CPU, the operating system, memory copies, and various layers of protocol processing. RDMA bypasses all of that, allowing direct memory-to-memory transfers between machines.

For high-performance computing and AI training clusters, RDMA over InfiniBand or RoCE (RDMA over Converged Ethernet) is standard. But there's a catch: even with RDMA's efficiency, you still need to get data from your GPU's memory to your network interface card. If your GPUs can communicate at 900 GB/s via NVLink but then bottleneck at PCIe speeds getting to the network card, you've got a problem.

NVIDIA's solution is GPUDirect RDMA, which allows network adapters to directly access GPU memory, bypassing the CPU entirely. Combined with NVLink, this enables scenarios where GPUs in multi-node clusters can communicate with minimal latency and maximum throughput.

In practice, this means:

Distributed AI training can keep dozens or hundreds of GPUs synchronized with minimal communication overhead. When you're training models with trillions of parameters across multiple servers, the interconnect becomes as important as the compute.

High-performance simulations that distribute workload across multiple GPUs can exchange boundary data without waiting for the CPU to orchestrate transfers.

Real-time data processing pipelines can stream data through multiple GPU processing stages at speeds that would choke traditional PCIe connections.

The Architecture of Speed

The real genius of NVLink isn't just raw bandwidth; it's the architecture. NVLink creates a cache-coherent, high-bandwidth mesh between GPUs. Multiple GPUs can share a unified memory address space, allowing them to directly access each other's memory without explicit copying.

This is transformative for certain workloads. Instead of carefully partitioning your problem to fit in each GPU's memory and then orchestrating data transfers, you can treat multiple GPUs as a single large GPU with shared memory. The programming model simplifies, performance improves, and problems that didn't fit before suddenly become tractable.

For RDMA applications, NVLink enables GPU-centric architectures where the GPUs drive the computation and communication, with the CPU coordinating rather than mediating. In traditional architectures, the CPU copies data from GPU to system memory to network card. With NVLink and GPUDirect RDMA, the GPU sends data directly to the network, cutting latency and increasing throughput.
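The staged-versus-direct distinction can be sketched as a toy cost model. The numbers below are illustrative placeholders (a single PCIe 4.0 x16 link at roughly 31.5 GB/s), not measurements; the point is simply that the CPU-mediated path traverses the bus twice, once into system memory and once back out to the NIC:

```python
def staged_path_seconds(payload_bytes: float, pcie_gb_s: float = 31.5) -> float:
    """GPU -> system RAM -> NIC: two traversals of the PCIe link."""
    return 2 * payload_bytes / (pcie_gb_s * 1e9)

def gpudirect_path_seconds(payload_bytes: float, pcie_gb_s: float = 31.5) -> float:
    """GPU -> NIC directly: one traversal, no bounce buffer in system RAM."""
    return payload_bytes / (pcie_gb_s * 1e9)

msg = 1e9  # a 1 GB message
print(f"staged:    {staged_path_seconds(msg) * 1000:.0f} ms")
print(f"GPUDirect: {gpudirect_path_seconds(msg) * 1000:.0f} ms")
```

This model ignores latency, CPU scheduling, and memory-bus contention, all of which penalize the staged path further; the 2x bandwidth cost is the floor, not the ceiling.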

The Proprietary Problem

Now for the uncomfortable part: NVLink is NVIDIA-only, and mostly high-end NVIDIA-only. You're not getting NVLink in your gaming GPU. It appears in data center cards like the A100, H100, B100, B200, and B300, plus certain professional GPUs. And it only connects NVIDIA GPUs to other NVIDIA GPUs or to specific CPUs, such as IBM POWER processors.

This creates lock-in. Once you've built an architecture around NVLink's bandwidth and programming model, switching to AMD or Intel GPUs means reengineering your entire interconnect strategy. AMD has its own answer, Infinity Fabric, but it's not compatible. Intel is working on alternatives.

The open-standards alternative is CXL (Compute Express Link), which promises cache-coherent device connectivity over PCIe. CXL 3.0 can theoretically hit impressive bandwidth numbers, but as of 2025, it's still early days. The ecosystem hasn't converged around it the way high-performance computing has converged around NVLink for NVIDIA-based systems.

When Do You Actually Need This?

Let's be practical. If you're running inference on a single GPU, PCIe 4.0 is fine. If you're training small models on a couple of GPUs, PCIe is still fine. If you're building a web service, USB speeds for your backup drives are fine.

You need NVLink-class interconnects when:

You're training large models across multiple GPUs where gradient synchronization becomes a bottleneck. The largest language models wouldn't be practical without high-speed GPU interconnects.

You're running simulations that partition across GPUs with significant boundary data exchange. Computational fluid dynamics, molecular dynamics, and climate modeling all benefit enormously from fast GPU-to-GPU communication.

You're building inference clusters where models are too large for a single GPU and need to be sharded across multiple GPUs with low-latency communication.

You're doing GPU-accelerated analytics on datasets that exceed single-GPU memory and require data shuffling between GPUs.
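The sharding case lends itself to quick arithmetic: how many GPUs do you need just to hold the model? This sketch estimates a lower bound; the 1.2 overhead factor (for activations, KV cache, and framework bookkeeping) and the 80 GB per-GPU memory figure are assumptions for illustration, and real deployments vary widely:

```python
import math

def min_gpus_for_model(param_count: float, bytes_per_param: int = 2,
                       gpu_memory_gb: float = 80,
                       activation_overhead: float = 1.2) -> int:
    """Rough lower bound on GPUs needed to hold a sharded model.

    bytes_per_param=2 assumes fp16/bf16 weights; activation_overhead
    is a hypothetical fudge factor, not a measured constant.
    """
    needed_gb = param_count * bytes_per_param * activation_overhead / 1e9
    return math.ceil(needed_gb / gpu_memory_gb)

# A hypothetical 405-billion-parameter model in fp16 on 80 GB GPUs:
print(min_gpus_for_model(405e9))
```

More than a dozen GPUs before the first token is generated, and every one of them exchanging activations with its neighbors on every forward pass. That cross-GPU chatter is exactly the traffic NVLink exists to carry.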

For everyone else, PCIe is the mature, vendor-neutral, perfectly adequate solution that's been reliably moving data around computers for two decades.

The Future of Fast

The interconnect wars continue. PCIe 6.0 is arriving, CXL is maturing, and proprietary solutions like NVLink keep pushing boundaries. USB keeps getting faster and more confusing to name. The fundamental tension remains: open standards provide compatibility and longevity, proprietary solutions push performance boundaries.

For RDMA applications in particular, the combination of high-speed interconnects like NVLink, GPUDirect capabilities, and RDMA-capable networks creates a perfect storm of performance. Data can flow from GPU memory to remote GPU memory with minimal CPU involvement and maximal bandwidth. It's elegant, it's fast, and it's exactly what distributed AI training needs.

But it's also complex, expensive, and vendor-specific. As usual in high-performance computing, you pay for performance with money, complexity, and flexibility.

The Speed You Need vs The Speed You Want

Here's the thing about interconnect speeds: like RAM, disk space, and CPU cores, faster is always nice but not always necessary. A well-designed application on PCIe 4.0 will outperform a poorly designed application on NVLink every time.

Before you invest in the fastest interconnects, profile your application. Is GPU communication actually your bottleneck? Or are you compute-bound, memory-bound, or bottlenecked by algorithmic complexity? The answer might be humbling, but it'll save you from buying hardware you don't need.
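That profiling question can be framed with the classic roofline model: compare your kernel's arithmetic intensity (FLOPs per byte moved) to the machine's balance point (peak FLOPs per byte of bandwidth). A minimal sketch, with hypothetical hardware numbers standing in for a real accelerator:

```python
def bottleneck(flops: float, bytes_moved: float,
               peak_tflops: float, peak_gb_s: float) -> str:
    """Classify a kernel as compute- or bandwidth-bound (roofline model).

    If arithmetic intensity (FLOPs per byte) falls below the machine
    balance (peak FLOPs per byte of bandwidth), the memory system or
    interconnect is the limiter, not the ALUs.
    """
    intensity = flops / bytes_moved
    machine_balance = (peak_tflops * 1e12) / (peak_gb_s * 1e9)
    return "bandwidth-bound" if intensity < machine_balance else "compute-bound"

# Hypothetical streaming op: 2 FLOPs per 4-byte element, read once.
print(bottleneck(flops=2e9, bytes_moved=4e9,
                 peak_tflops=60, peak_gb_s=2000))
```

If your kernels come out bandwidth-bound, faster interconnects may genuinely help; if they come out compute-bound, a bigger pipe just means your data arrives early and waits.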

That said, when you do need the bandwidth, truly need it because you've optimized everything else and the interconnect is genuinely limiting your performance, technologies like NVLink deliver. They're not marketing hype, they're legitimate solutions to real problems faced by applications pushing the boundaries of what's computationally possible.

The trick, as always, is knowing which problem you're actually trying to solve.