RDMA: When CPUs Decided They're Too Important to Move Your Data
Remote Direct Memory Access is the networking technology that says "the CPU has better things to do than babysit data transfers." Instead of your processor spending cycles copying data from NIC to memory to application buffers through multiple layers of kernel stack, RDMA lets the network adapter directly read and write memory while your CPU does actual work. It's the difference between making your CEO personally deliver every package versus hiring a competent shipping department. In high-performance computing, AI training, and low-latency trading, RDMA isn't optional, it's the only way to keep your expensive hardware from spending all its time moving data instead of processing it.
Let's trace how we got here, from Cray's revolutionary shared-memory supercomputers to modern GPU clusters training trillion-parameter models, and why making RDMA work across today's heterogeneous memory hierarchies is harder than anyone expected.
The Cray Origins: When Supercomputers Learned to Share
The story of RDMA starts with Seymour Cray and the fundamental problem of building supercomputers: how do you make multiple processors work together efficiently? In the 1970s and 1980s, this was an existential question for high-performance computing.
The Cray-1 and Early Vector Processing
The Cray-1 (1976) was a single-processor vector machine with incredibly fast memory for its time. Let's talk actual numbers:
Memory capacity: 1 to 8 megabytes (yes, megabytes). This sounds laughable today, but in 1976 this was enormous. Memory bandwidth: 320 megabytes per second. The Cray-1's memory system could sustain reads at this rate, which was extraordinary for the era. Memory technology: semiconductor (bipolar) memory rather than magnetic core, with extremely wide data paths (64 bits per cycle, no waiting). Clock speed: 80 MHz, incredibly fast for 1976 when most computers ran at single-digit MHz.
The Cray-1 didn't need RDMA because there was only one processor. All memory was local, and the vector units could sustain operations at memory bandwidth. But as computational demands grew, the industry needed multiprocessor systems, and that's where things got interesting.
Cray X-MP: Shared Memory Across Processors
The Cray X-MP (1982) introduced multi-processor shared memory architecture. Up to four processors could access the same memory space directly. This wasn't RDMA in the modern sense (it was a shared bus architecture), but it established the principle: let hardware move data between processors without software intervention.
Memory capacity: 8 to 128 megabytes, shared across all processors. Memory bandwidth: 1.3 gigabytes per second aggregate across all processors, an incredible 4x improvement. Processors: 2 to 4 processors, each running at 105 to 117 MHz. Interconnect: Custom crossbar switch allowing any processor to access any memory bank, average memory access time around 60 nanoseconds.
The key insight: synchronization and data movement are expensive. If processors must constantly interrupt each other to coordinate memory access, you lose performance. Cray's solution was hardware-managed shared memory with minimal synchronization overhead. The processors could read and write shared memory locations, and hardware arbitration of the memory banks kept everything consistent (Cray's vector machines famously had no data caches to get in the way).
Cray T3D: Distributed Shared Memory
The Cray T3D (1993) took this concept further with distributed shared memory across physically separate nodes. Each node had its own memory, but the system created the illusion of a single shared address space. A processor on node 1 could access memory on node 32 by simply reading or writing to an address, the hardware handled the network transfer transparently.
Memory per node: 64 to 256 megabytes of local DRAM per node. Total system memory: tens of gigabytes in large installations. Scale: 32 to 2,048 processing elements (two per node). Interconnect: 3D torus network with 300 megabytes per second bidirectional bandwidth per link. Memory access latency: local memory in 60-100 nanoseconds, remote memory access in 1-3 microseconds depending on distance in the torus. Processors: DEC Alpha 21064 running at 150 MHz per processing element.
This is the conceptual ancestor of modern RDMA. The T3D used a custom interconnect (3D torus topology) with hardware support for remote memory operations. Software requested data from a remote address, hardware moved it across the network without OS involvement, minimal latency and zero CPU overhead for the transfer itself.
Here's what made it revolutionary: the T3D's network interface could directly DMA from remote memory into local memory. A processor would issue a load instruction for a remote address, the hardware would recognize this as a remote memory operation, send a request packet across the network, the remote node's hardware would read from its local memory and send a response packet, and the requesting processor would receive the data, all without any software (including the OS) intervening.
The Lessons That Shaped RDMA
Cray's architectures taught the industry several critical lessons:
- CPU overhead kills scaling: If every data transfer requires CPU intervention, you can't scale to thousands of processors.
- Memory latency matters more than bandwidth: A high-bandwidth connection with high latency performs worse than moderate bandwidth with low latency for many workloads.
- Hardware should handle the mechanics: Complex protocol processing, flow control, and error handling belong in hardware, not software.
- Zero-copy is essential: Copying data multiple times (kernel to user space, user space to NIC) wastes memory bandwidth and adds latency.
These principles directly influenced InfiniBand's design in the late 1990s and remain the foundation of modern RDMA.
RDMA Fundamentals: How It Actually Works
Modern RDMA comes in several flavors (InfiniBand, RoCE, iWARP), but the core concept is consistent: let the network adapter directly access application memory without kernel or CPU involvement.
Traditional Networking: Death By A Thousand Copies
Before we dive into RDMA, let's understand exactly what makes traditional networking so expensive. Here's what happens when you send data over TCP/IP:
- Application writes data to a buffer in userspace memory
- Application calls send() or write(), triggering a system call
- Context switch to kernel mode: The CPU switches from running your application code to running kernel code. This involves saving all your application's register state (instruction pointer, stack pointer, general-purpose registers), loading the kernel's register state, and switching memory address space mappings. Cost: 50-100 nanoseconds on modern CPUs, but this stalls all forward progress of your application.
- Kernel copies data from user buffer to kernel socket buffer (first copy)
- TCP stack processes the data: adds headers, calculates checksums, handles segmentation
- Kernel copies data from socket buffer to NIC DMA buffer (second copy)
- NIC receives DMA request and transmits the data
- NIC generates interrupt when transmission completes: The NIC signals the CPU by asserting an interrupt line. The CPU immediately stops whatever it's doing (even if it's running critical code), saves its state, jumps to the interrupt handler in the kernel, processes the interrupt (marking packets as sent, freeing buffers), and returns to what it was doing. Cost: 500-2000 nanoseconds including the handler execution.
- Context switch back to userspace: Eventually control returns to your application, requiring another 50-100 nanosecond context switch.
On the receiving side, reverse all of that with more copies and context switches.
What's a context switch: Modern CPUs have different privilege levels. Userspace applications run at a lower privilege (ring 3 on x86) where they can't access hardware directly or other applications' memory. The kernel runs at higher privilege (ring 0) where it can do anything. Switching between these requires saving the current CPU state (all registers, instruction pointer, flags), changing privilege level, switching to a different address space (so the kernel can access kernel memory), and loading the new state. This takes dozens to hundreds of CPU cycles, during which no useful work happens.
What's an interrupt: Hardware devices (like network cards) need to notify the CPU when something happens (packet arrived, transmission completed, error occurred). They do this by asserting an interrupt line. The CPU checks for interrupts between every instruction (or at specific checkpoints). When an interrupt fires, the CPU immediately stops executing the current code, saves minimal state, and jumps to an interrupt handler (kernel code that deals with the interrupt). After handling, it returns to what it was doing. This happens thousands to millions of times per second on a busy system, and each interrupt costs 500-2000 nanoseconds of pure overhead.
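To make the contrast concrete, here's what that entire journey looks like from the application's side: one innocent-looking loop. This is a minimal illustrative TCP sender (the function name and arguments are mine, not from any particular codebase); every send() call below pays for the syscall, the copies, and eventually the completion interrupt described above.

// Minimal TCP sender: each send() triggers a syscall, a copy into the kernel
// socket buffer, TCP processing, a second copy toward the NIC, and interrupts.
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_buffer(const char *ip, uint16_t port, const void *buf, size_t len) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);   // syscall + kernel copy, every single time
        if (n < 0) { close(fd); return -1; }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}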
The costs are brutal:
- CPU cycles: 20-30% of CPU time spent on networking for high-throughput applications. That's one-quarter of your expensive CPU doing data movement instead of actual work.
- Latency: Multiple microseconds added by context switches and kernel processing. In the time it takes to context switch twice, light travels tens of meters. Your data is waiting for software bureaucracy.
- Memory bandwidth: Multiple copies consume precious memory bandwidth. If you're sending 10 GB/s over the network, you're actually moving 20-30 GB/s through memory due to all the copies.
- Cache pollution: Data copied through CPU cache evicts useful application data, which we'll explain in detail next.
Cache Pollution: Why Copying Through the CPU Is Worse Than You Think
Modern CPUs have a memory hierarchy: L1 cache (32-64 KB, ~4 cycles to access), L2 cache (256 KB-1 MB, ~12 cycles), L3 cache (8-64 MB, ~40 cycles, shared across cores), and main memory (gigabytes, ~200 cycles). This hierarchy exists because fast memory is expensive and slow memory is cheap.
When your application is running, the CPU tries to keep frequently accessed data in cache. Let's say you're doing scientific computation:
// Your hot loop processes a 1 MB array repeatedly
for (int iteration = 0; iteration < 1000000; iteration++) {
for (int i = 0; i < array_size; i++) {
result[i] = complex_calculation(data[i]);
}
}
After the first iteration, your data[] and result[] arrays are in L3 cache (and the working set you're actively processing is in L1/L2). Each subsequent iteration runs fast because cache hits are ~40 cycles instead of ~200 cycles for main memory access. Your 1 million iterations run 5x faster due to cache.
Now, while your computation is running, the network receives a packet. The kernel copies this 1500-byte packet from the NIC buffer through L1/L2/L3 cache to kernel memory, then from kernel memory through cache again to user space. That 1500 bytes of network data just evicted 1500 bytes of your carefully cached computation data.
Do this for 10,000 packets per second (a modest network load), and you're evicting 15 MB per second of useful cache data to make room for transient network buffers that you'll never reuse. Your computation loop, which was cache-hot and running at 5x speed, now suffers cache misses and runs at maybe 3x speed. You just lost 40% of your compute performance to network-induced cache pollution.
Real-world example: A database server with a working set that fits in L3 cache (queries hit cache, not disk). Network traffic for queries and results constantly evicts cached database pages. Query latency increases by 2-3x during high network activity, not because the CPU is busy with networking, but because cache pollution forces more main memory accesses. The database is spending CPU cycles waiting for memory instead of processing queries.
Another example: High-frequency trading where microseconds matter. Your trading algorithm's hot path (price updates, signal generation, order submission logic) fits in 32 KB L1 cache. But market data packets arriving at 100,000 per second evict L1 cache constantly. Your algorithm's execution time becomes variable, sometimes fast (cache hit), sometimes slow (cache miss), making latency unpredictable. In trading, unpredictable latency is worse than consistently slow latency.
RDMA solves this by never touching the CPU cache. The RDMA adapter DMAs data directly from network to memory (or memory to network), bypassing the CPU entirely. Your application's cache footprint remains undisturbed. The computation keeps running at full speed even under heavy network load.
RDMA: The Kernel-Free Path
RDMA eliminates all of this:
- Application registers memory region with RDMA adapter (one-time setup)
- Application posts send/receive work requests to adapter's queue
- RDMA adapter reads data directly from registered memory
- Adapter handles protocol processing in hardware
- Adapter transmits data
- Remote adapter writes data directly to remote registered memory
- Remote application gets notification (completion queue event)
Zero copies. Zero context switches. Zero CPU involvement during transfer. The RDMA adapter is a fully autonomous DMA engine that speaks networking protocols.
The Verbs API: RDMA's Programming Interface
RDMA is programmed using "verbs," operations that post work requests to the adapter:
RDMA Write: Write data to remote memory without remote CPU notification. The remote side doesn't even know the write happened (one-sided operation).
RDMA Read: Read data from remote memory without remote CPU involvement. Again, one-sided.
Send/Receive: Traditional message passing. The remote side must post a receive buffer before the send arrives (two-sided operation).
Atomic operations: Compare-and-swap, fetch-and-add, etc., executed atomically on remote memory. Critical for lock-free distributed algorithms.
The beauty of one-sided operations (RDMA Read/Write) is that the remote CPU is completely uninvolved. You can read from or write to remote memory as if it were local, just slower due to network latency. This is the Cray T3D's distributed shared memory model, implemented on commodity hardware.
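Here's roughly what a one-sided RDMA Write looks like through libibverbs. This is a sketch of the data path only: it assumes the queue pair qp and completion queue cq are already connected, that the local buffer was registered (mr), and that remote_addr and rkey were exchanged out of band, all placeholders for the setup the snippet omits.

// Sketch: post a one-sided RDMA Write and wait for its completion.
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq,
                    void *local_buf, uint32_t len, struct ibv_mr *mr,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   // must lie inside the registered region
        .length = len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {0};
    struct ibv_send_wr *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;   // one-sided: the remote CPU never runs
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   // ask the NIC for a completion entry
    wr.wr.rdma.remote_addr = remote_addr;         // learned out of band
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))          // hand the work request to the adapter
        return -1;

    struct ibv_wc wc;
    int n;
    do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);   // busy-poll the completion queue
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}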
RDMA's Impact on Linux Networking
RDMA required fundamental changes to Linux's networking architecture, which was built around sockets and the kernel network stack.
The Verbs Interface in Linux
The Linux kernel gained RDMA support through the InfiniBand Verbs API, merged during the 2.6 kernel series (the InfiniBand subsystem landed in 2.6.11, released in 2005). This created a parallel networking stack:
Traditional path: Sockets → TCP/IP stack → NIC driver
RDMA path: Verbs → RDMA core → HCA/RNIC driver
Applications using RDMA bypass the socket API entirely, using verbs to post work requests directly to hardware queues. The kernel's role is minimal: memory registration, queue pair management, but not data path processing.
Memory Registration: The Permission System
For security and correctness, RDMA requires explicit memory registration. You can't just tell the NIC to read arbitrary memory addresses, that would allow any application to read other applications' memory (or kernel memory).
But before we explain memory registration, we need to understand how modern operating systems manage memory.
Understanding Memory Pages
Modern operating systems don't manage memory byte-by-byte. Instead, memory is divided into fixed-size chunks called pages. On x86-64 systems, the standard page size is 4 kilobytes (4096 bytes). Some systems support huge pages (2 MB or 1 GB) for special cases.
When your application allocates memory (malloc, new, etc.), the operating system gives you virtual addresses. These aren't real physical addresses, they're made up. Your application might think it has memory at address 0x7ffff7a00000, but that's not where the data actually lives in physical RAM.
The CPU's Memory Management Unit (MMU) translates virtual addresses to physical addresses using page tables. These page tables map virtual page to physical page. For example:
Virtual address 0x7ffff7a00000 (page 0x7ffff7a00) → Physical address 0x0012f000 (page 0x0012f)
Virtual address 0x7ffff7a01000 (page 0x7ffff7a01) → Physical address 0x000be000 (page 0x000be)
Notice that consecutive virtual pages (0x7ffff7a00, 0x7ffff7a01) map to non-consecutive physical pages (0x0012f, 0x000be). Your memory, from your application's perspective, is one contiguous block. In physical RAM, it's scattered fragments.
Why does this matter for RDMA? Because network hardware (the RDMA adapter) can't use virtual addresses. It doesn't have access to your application's page tables. It needs physical addresses. And here's the problem: those physical addresses can change.
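You can see the page granularity from userspace even though the physical side stays hidden. A tiny, self-contained illustration (the 1 MB size is arbitrary):

// A malloc'd buffer spans many pages; the kernel knows where each one lives
// in physical RAM, the application never does.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);          // typically 4096 on x86-64
    size_t len = 1 << 20;                       // 1 MB buffer
    unsigned char *buf = malloc(len);

    printf("page size: %ld bytes\n", page);
    printf("buffer spans ~%zu pages starting at virtual address %p\n",
           len / (size_t)page, (void *)buf);
    // The physical pages behind those virtual pages are scattered, and
    // (without pinning) the kernel is free to move or swap them at any time.
    free(buf);
    return 0;
}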
Why Physical Pages Must Be Pinned
Operating systems have several reasons to move pages around in physical memory:
Swapping: If physical RAM is full, the OS moves some pages to disk (swap space) to free up RAM for other processes. Your virtual address stays the same, but the physical page might now be on disk instead of in RAM.
Memory compaction: To create large contiguous physical memory regions, the OS sometimes moves pages around to reduce fragmentation.
NUMA optimization: On multi-socket systems, the OS might move pages closer to the CPU that's accessing them most frequently.
Imagine the OS swaps your page to disk, and the RDMA adapter (which was told "read from physical address 0x0012f000") tries to read it. That physical address now contains someone else's data, or nothing at all. The RDMA adapter would read garbage, or cause a hardware error, or worse, read another application's secrets.
Pinning solves this. When you register memory for RDMA, the OS marks those pages as "pinned" or "locked." This prevents the OS from:
- Swapping the pages to disk
- Moving the pages to different physical addresses
- Reusing the physical pages for other purposes
The pages stay at fixed physical addresses for as long as they're registered. The RDMA adapter can safely use those physical addresses, knowing they won't change.
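You can get a feel for pinning without any RDMA hardware: mlock() asks the kernel to lock pages into RAM, which is roughly the OS half of what ibv_reg_mr does (registration additionally programs the adapter's address-translation tables). A sketch, keeping in mind that locked memory is capped by RLIMIT_MEMLOCK:

// Pinning without RDMA: mlock() prevents these pages from being swapped out.
// Memory registration does this plus tells the adapter about the physical pages.
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 16 << 20;                 // 16 MB
    void *buf = malloc(len);

    if (mlock(buf, len) != 0) {            // fails if RLIMIT_MEMLOCK is too low
        perror("mlock");
        return 1;
    }
    printf("16 MB locked in RAM; the kernel won't swap or migrate it\n");

    munlock(buf, len);                     // release the pin
    free(buf);
    return 0;
}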
The Cost of Pinning
Memory registration does two things:
- Pins physical pages: Prevents the kernel from swapping or moving pages, as explained above.
- Creates protection domain: Establishes access permissions. Only authorized queue pairs can access registered memory regions. The RDMA adapter maintains a table of "this physical address range is registered to this application with these permissions."
Memory registration is expensive because it requires:
- Page table walks: The kernel must traverse page tables to find physical addresses for all pages in the region. For a 1 GB registration, that's 262,144 pages to look up (1 GB / 4 KB per page).
- TLB shootdowns: The Translation Lookaside Buffer (TLB) is the CPU's cache of virtual-to-physical mappings. Pinning pages might require invalidating TLB entries across all CPUs, which requires inter-processor interrupts and stalls.
- Reference count updates: The kernel must update reference counts for every page to mark them as pinned.
- RDMA adapter registration: The RDMA adapter's hardware must be told about these physical addresses and permissions, which might involve PCI transactions and hardware table updates.
For a 1 GB region, registration might take 10-100 milliseconds. This is why RDMA applications register large regions upfront and reuse them, rather than registering and deregistering frequently.
The Dynamic Allocation Problem
Normal applications use dynamic memory allocation freely. You malloc() when you need memory, free() when you're done. The language runtime or standard library handles the details. This works because virtual memory makes allocation cheap, the OS can give you virtual addresses instantly and allocate physical pages lazily (only when you actually access the memory).
RDMA breaks this model. Consider this code:
// Normal application: works fine
void* buffer = malloc(1024 * 1024); // 1 MB buffer
send_over_network(socket, buffer, 1024 * 1024);
free(buffer);

// RDMA application: doesn't work
void* buffer = malloc(1024 * 1024);
// Problem: this buffer isn't registered for RDMA
// Can't post RDMA send operation with this address
rdma_post_send(qp, buffer, 1024 * 1024); // ERROR
free(buffer);
To use RDMA, you need:
// Pre-register a large memory pool
void* memory_pool = malloc(1024 * 1024 * 1024); // 1 GB pool
struct ibv_mr* mr = ibv_reg_mr(pd, memory_pool, 1024 * 1024 * 1024, flags); // This takes 10-100ms

// Now allocate from the pool
void* buffer = allocate_from_pool(memory_pool, 1024 * 1024);
rdma_post_send(qp, buffer, 1024 * 1024); // Works, buffer is in registered region
return_to_pool(memory_pool, buffer);

// Much later, when shutting down:
ibv_dereg_mr(mr);
free(memory_pool);
This creates several problems:
Pre-allocation pressure: You must decide upfront how much memory to register. Register too little and you run out. Register too much and you waste RAM (pinned pages can't be swapped, so you're reducing the system's effective memory).
Custom allocators: Your application needs a custom memory allocator that manages the registered pool. This allocator must be thread-safe, handle fragmentation, and deal with allocation failures. You're reimplementing malloc, which is harder than it sounds (a minimal sketch of the idea follows this list).
Can't use standard libraries: Many libraries allocate memory internally. If you pass an RDMA buffer to a library that does realloc() or returns a newly allocated buffer, that memory isn't registered. You need RDMA-aware libraries or wrapper code that copies between registered and unregistered memory.
Memory pressure conflicts: The OS wants to use all available memory (caching files, buffers, allocations). Your RDMA application wants to pin gigabytes of memory. On a system with 128 GB RAM and 100 GB of pinned RDMA memory, only 28 GB is available for everything else. This can cause other applications to swap or fail allocations.
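To make the custom-allocator problem concrete, here's the simplest possible version: a bump allocator carved out of one registered region. The struct and function names are mine; a real implementation also needs thread safety, per-allocation freeing, and fragmentation handling.

// Minimal bump allocator over a single RDMA-registered region (illustrative only).
#include <stddef.h>

struct rdma_pool {
    char  *base;     // start of the registered region (ibv_reg_mr'd elsewhere)
    size_t size;     // total registered bytes
    size_t used;     // bytes handed out so far
};

// Hand out a chunk from the registered region; every pointer returned here is
// automatically covered by the region's lkey/rkey.
static void *pool_alloc(struct rdma_pool *p, size_t n) {
    n = (n + 63) & ~(size_t)63;            // round up to 64-byte alignment
    if (p->used + n > p->size)
        return NULL;                       // pool exhausted: should have registered more upfront
    void *ptr = p->base + p->used;
    p->used += n;
    return ptr;
}

// A bump allocator can only free everything at once.
static void pool_reset(struct rdma_pool *p) { p->used = 0; }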
Some modern solutions help:
On-demand Paging (ODP): Some RDMA adapters support ODP, which allows RDMA operations on non-pinned memory. The adapter takes page faults and asks the kernel to pin pages on-demand. This reduces pre-allocation pressure but adds latency to the first access of each page.
Registration cache: Middleware layers (MPI implementations, NCCL) maintain caches of registered memory regions. When you allocate memory, they check if it's in a previously-registered region. This hides some registration cost but doesn't solve the fundamental problem.
Huge pages: Using 2 MB or 1 GB huge pages instead of 4 KB pages reduces registration overhead (fewer pages to process) and TLB pressure. But huge pages have their own allocation complexities.
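For the huge-page option, the usual trick is to back the region with 2 MB pages before registering it, so a 1 GB registration covers 512 pages instead of 262,144. A sketch, assuming the system has huge pages reserved (for example via /proc/sys/vm/nr_hugepages):

// Allocate an RDMA buffer backed by 2 MB huge pages (assumes huge pages are reserved).
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   // 1 GB
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");       // most likely: no huge pages reserved
        return 1;
    }
    // Registering this region with ibv_reg_mr() now walks 512 pages, not 262,144.
    printf("1 GB huge-page buffer at %p\n", buf);
    munmap(buf, len);
    return 0;
}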
The fundamental tension remains: RDMA wants stable, pinned physical memory, while modern OSes want flexible, virtual, pageable memory. RDMA performance requires giving up some OS flexibility.
libibverbs: Userspace RDMA
The libibverbs library provides userspace access to RDMA functionality. Applications link against libibverbs and can post verbs operations from userspace with zero kernel involvement (after initial setup).
This is kernel bypass done right: the kernel establishes permissions and sets up hardware queues, then gets out of the way. Userspace code directly manipulates hardware queues, which the RDMA adapter continuously polls.
RDMA CM: Connection Management
While data transfer is kernel-free, connection establishment still needs coordination. The RDMA CM (Connection Manager) provides a socket-like API for establishing RDMA connections.
It handles:
- Service ID resolution (which host provides this service?)
- Connection negotiation (parameters, queue sizes)
- Path selection (in multi-path fabrics)
- Disconnection and error handling
Once connections are established, RDMA CM stays out of the data path.
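The librdmacm flow looks socket-like on purpose. Here's a condensed sketch of the active (client) side using the synchronous helpers, with error handling trimmed and connection parameters left at conservative defaults:

// Condensed RDMA CM client-side setup using the synchronous librdmacm helpers.
#include <rdma/rdma_cma.h>
#include <string.h>

int connect_rdma(const char *server, const char *port, struct rdma_cm_id **out_id)
{
    struct rdma_addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;                 // reliable connected QPs

    if (rdma_getaddrinfo((char *)server, (char *)port, &hints, &res))
        return -1;                                     // service/address resolution failed

    struct ibv_qp_init_attr qp_attr;
    memset(&qp_attr, 0, sizeof(qp_attr));
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 16;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;

    struct rdma_cm_id *id = NULL;
    int err = rdma_create_ep(&id, res, NULL, &qp_attr); // creates the id and its QP
    rdma_freeaddrinfo(res);
    if (err)
        return -1;

    struct rdma_conn_param param;
    memset(&param, 0, sizeof(param));
    param.initiator_depth = param.responder_resources = 1;

    if (rdma_connect(id, &param)) {                    // negotiate with the passive side
        rdma_destroy_ep(id);
        return -1;
    }
    *out_id = id;                                      // data path: post verbs to id->qp
    return 0;
}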
The Memory Hierarchy Problem: HBM, DRAM, NVMe, and Heterogeneity
Modern systems have complex memory hierarchies, and RDMA must work across all of them. This is harder than it sounds.
The Three-Tier Memory System
HBM (High Bandwidth Memory): On-package memory directly attached to GPUs. Bandwidth: 2-3 TB/s. Capacity: 80-120 GB per GPU. Latency: ~100 nanoseconds. This is where active computation happens.
DRAM (System Memory): Traditional DDR memory attached to CPUs. Bandwidth: 200-400 GB/s per socket. Capacity: 512 GB to several TB. Latency: ~60 nanoseconds. This is staging and coordination memory.
NVMe/Persistent Memory: SSD storage, optionally with persistent memory features. Bandwidth: 5-15 GB/s per device. Capacity: TB to PB scale. Latency: microseconds. This is the data source and training checkpoints.
Training a large AI model requires moving data through this hierarchy constantly:
- Model parameters in HBM
- Gradients aggregated across GPUs (HBM → network → HBM)
- Optimizer state sometimes spilled to DRAM
- Training data streamed from NVMe
- Checkpoints written to NVMe
The GPUDirect Problem
Traditional networking requires data to flow through CPU memory:
GPU HBM → PCIe → CPU DRAM → PCIe → NIC → Network
This is terrible for performance:
- Two PCIe transfers per network operation
- CPU memory bandwidth consumed
- CPU cycles wasted orchestrating transfers
- Latency added by each hop
GPUDirect RDMA solves this by allowing the NIC to directly access GPU memory:
GPU HBM → PCIe → NIC → Network (CPU uninvolved)
The RDMA adapter can read from or write to GPU memory directly. The CPU sets up the transfer (posts RDMA work request with GPU memory address), then the hardware handles everything.
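The punchline in code: with GPUDirect RDMA available (a capable NIC plus NVIDIA's peer-memory kernel module loaded), the pointer returned by cudaMalloc can be registered with the same verb used for host memory. A hedged sketch, the helper name is mine:

// Sketch: register GPU memory for RDMA (requires a GPUDirect-capable NIC and
// the nvidia-peermem module; otherwise ibv_reg_mr on a device pointer fails).
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len, void **gpu_ptr_out)
{
    void *gpu_ptr = NULL;
    if (cudaMalloc(&gpu_ptr, len) != cudaSuccess)      // allocate HBM on the GPU
        return NULL;

    // Same verb as for host memory: the peer-memory module lets the driver
    // resolve the device pages behind this pointer and pin them for the NIC.
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_ptr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        cudaFree(gpu_ptr);
        return NULL;
    }
    *gpu_ptr_out = gpu_ptr;    // RDMA work requests can now target this address directly
    return mr;
}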
This is revolutionary for AI training. All-reduce operations (aggregating gradients across GPUs) used to require:
- GPU copies gradients to CPU memory
- CPU network stack sends to other nodes
- Remote CPU receives
- Remote CPU copies to GPU
With GPUDirect RDMA:
- GPU tells NIC "send this HBM address to that host"
- Done
The Coherency Nightmare
Here's where it gets messy. These memory types aren't cache coherent with each other. Let's define what that means.
What Is Cache Coherence?
Cache coherence is the guarantee that all observers see a consistent view of memory. In a system with multiple CPUs, each CPU has its own caches (L1, L2, sometimes L3). If CPU 0 writes to address 0x1000 and CPU 1 reads from 0x1000, CPU 1 should see the value CPU 0 wrote, not a stale cached value.
Modern CPUs implement cache coherence protocols (like MESI or MOESI) in hardware. When CPU 0 writes to a cached location:
- CPU 0's cache marks the line as "Modified"
- CPU 0 broadcasts an invalidation message to all other CPUs
- Other CPUs invalidate their cached copies of that line
- When CPU 1 reads that address, it gets a cache miss, requests the line from CPU 0
- CPU 0 sends its modified line, and both CPUs now have consistent copies
This happens automatically in hardware, invisibly to software. It's why you can write multithreaded programs without thinking about cache consistency (though you still need locks/atomics for synchronization).
But this coherence only works within the CPU's coherence domain. It doesn't extend to:
- GPU HBM: GPUs have their own memory and cache hierarchy, completely separate from the CPU's
- Device memory: Network adapters, NVMe controllers, other PCIe devices can't participate in CPU cache coherence
- Remote memory: Memory on another machine, accessed via RDMA
This creates visibility problems:
- CPU caches aren't coherent with GPU HBM
- GPU HBM isn't coherent with NVMe device buffers
- RDMA operations bypass CPU caches entirely
- Remote writes via RDMA are invisible to the local CPU's caches
This means explicit synchronization is required:
CPU to GPU: CPU must flush writes to memory before GPU can see them. GPU must flush writes before CPU can see them. CUDA streams and events handle this with explicit synchronization points.
Example:
// CPU prepares data
for (int i = 0; i < size; i++) {
host_buffer[i] = compute_something(i);
}
// WRONG: GPU might see stale data
cudaMemcpy(device_buffer, host_buffer, size, cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(device_buffer);

// CORRECT: Explicit synchronization ensures visibility
__sync_synchronize(); // Memory barrier
cudaMemcpy(device_buffer, host_buffer, size, cudaMemcpyHostToDevice);
cudaDeviceSynchronize(); // Wait for copy to complete
kernel<<<blocks, threads>>>(device_buffer);
GPU to NIC (via RDMA): GPU must complete all writes to HBM before RDMA operation can safely read. This requires GPU-side synchronization primitives (CUDA events, streams).
Example:
// GPU kernel writes results to HBM
kernel<<<blocks, threads, 0, stream>>>(gpu_buffer);

// WRONG: RDMA might read before GPU writes complete
rdma_post_send(qp, gpu_buffer, size);

// CORRECT: Wait for GPU to finish
cudaStreamSynchronize(stream); // Ensure kernel completed
rdma_post_send(qp, gpu_buffer, size);
NIC to GPU (RDMA write to GPU memory): RDMA write to GPU memory is invisible to GPU caches. GPU must be notified and invalidate relevant cache lines before reading.
Example:
// RDMA writes data into GPU memory
rdma_wait_completion(qp); // Wait for RDMA write to complete

// WRONG: GPU reads might hit stale cache
kernel<<<blocks, threads>>>(gpu_buffer);

// CORRECT: Invalidate GPU caches or use volatile loads
cudaDeviceSynchronize();
kernel<<<blocks, threads>>>(gpu_buffer); // GPU now sees RDMA data
Getting this synchronization wrong causes silent data corruption, the most fun kind of debugging. "It worked yesterday" probably means you got lucky with timing, not that your code is correct. The bug manifests as occasional wrong results, which are nearly impossible to reproduce in a debugger (because debugging changes timing).
How Coherency Problems Are Solved in Practice
The good news: you don't usually write this synchronization code yourself. Frameworks and libraries handle most of it.
NCCL (NVIDIA Collective Communications Library): Handles all GPU-to-GPU RDMA synchronization internally. When you call ncclAllReduce(), NCCL:
- Ensures GPU kernel completion before starting RDMA operations
- Uses CUDA events to synchronize between GPU work and network operations
- Handles GPUDirect RDMA setup and teardown
- Manages memory registration caching
Application code just calls the high-level NCCL API, all synchronization is automatic.
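In code, the whole gradient all-reduce is one call plus a stream synchronize; the GPUDirect setup, registration caching, and ordering all happen inside NCCL. A sketch for a single rank, assuming the ncclUniqueId was created by rank 0 and distributed out of band and that each rank owns one local GPU:

// Sketch: all-reduce one gradient buffer with NCCL (one process per GPU).
#include <cuda_runtime.h>
#include <nccl.h>

void allreduce_gradients(float *d_grads, size_t count,
                         int rank, int nranks, ncclUniqueId id)
{
    cudaSetDevice(rank);                       // assumes rank maps to the local GPU index

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank); // join the collective communicator

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Sum gradients in place across all ranks; NCCL picks ring or tree and
    // uses NVLink intra-node and RDMA inter-node underneath.
    ncclAllReduce(d_grads, d_grads, count, ncclFloat, ncclSum, comm, stream);

    cudaStreamSynchronize(stream);             // gradients are now globally summed
    cudaStreamDestroy(stream);
    ncclCommDestroy(comm);
}

In a real training loop you'd create the communicator once and reuse it every step; it's built and torn down inline here only to keep the sketch self-contained.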
CUDA-aware MPI: MPI implementations like OpenMPI, MPICH, and MVAPICH2 support passing GPU pointers directly to MPI operations. The MPI library handles:
- Detecting whether a pointer is host or device memory
- Synchronizing GPU streams before RDMA operations
- Using GPUDirect RDMA when available
- Falling back to CPU staging when GPUDirect isn't available
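The CUDA-aware MPI version is equally anticlimactic: hand MPI_Allreduce a device pointer and let the library work out GPUDirect versus staging. A sketch that assumes an MPI build with CUDA support:

// Sketch: all-reduce a GPU-resident buffer with a CUDA-aware MPI build.
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const size_t count = 1 << 20;                  // 1M floats of "gradients"
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, count * sizeof(float));
    cudaMemset(d_buf, 0, count * sizeof(float));

    // Device pointer passed directly: a CUDA-aware MPI detects it and uses
    // GPUDirect RDMA when the hardware and drivers allow it.
    MPI_Allreduce(MPI_IN_PLACE, d_buf, (int)count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}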
PyTorch and TensorFlow Distributed Training: High-level frameworks abstract RDMA completely. You write:
# PyTorch distributed training
model = torch.nn.parallel.DistributedDataParallel(model)
output = model(input)
loss.backward()  # Gradients automatically all-reduced via NCCL+RDMA
optimizer.step()
Behind the scenes, PyTorch uses NCCL, which uses RDMA, but you never see it. The framework handles all synchronization.
UCX (Unified Communication X): An open-source communication framework that provides a unified API across different transports (InfiniBand, RoCE, TCP, shared memory). UCX handles:
- Protocol selection (choosing the fastest available method)
- Memory registration caching
- GPUDirect RDMA support
- Automatic synchronization between host and device memory
Libraries like Horovod (distributed deep learning) and Dask (distributed computing) use UCX internally.
GDRCopy: A low-level library for copying between GPU and CPU memory with minimal overhead. It uses GPUDirect RDMA techniques but for local host-GPU transfers. Applications needing fine-grained control use GDRCopy, but most use higher-level libraries.
Best Practices for Application Developers:
- Use framework-provided APIs: Let PyTorch, TensorFlow, or MPI handle RDMA. Don't write raw RDMA code unless absolutely necessary.
- Assume nothing about timing: Always use explicit synchronization (cudaStreamSynchronize, CUDA events) before accessing data that might be in flight.
- Test on target hardware: Coherency bugs often only appear at scale or under specific timing conditions. Test with real RDMA hardware, not simulators.
- Use memory checkers: Tools like cuda-memcheck can detect some (not all) coherency violations.
- Monitor for data corruption: Include checksums or validation in your training/computation code to detect silent corruption.
Open Source Ecosystem:
The RDMA and GPU communication ecosystem is largely open source:
- NCCL: Open source (NVIDIA, BSD license)
- OpenMPI/MPICH: Open source HPC communication standards
- UCX: Open source (OpenUCX community)
- RDMA-Core: Linux kernel RDMA subsystem and userspace libraries
- PyTorch/TensorFlow: Open source ML frameworks with RDMA support
This means you can read the source to understand how synchronization works, contribute improvements, or debug issues. The community around these tools is active, and real-world bugs get fixed relatively quickly.
NVMe over Fabrics: Storage Joins the Party
NVMe over Fabrics (NVMe-oF) extends RDMA to storage access. Instead of accessing local SSDs, you can RDMA directly to remote NVMe devices.
The memory hierarchy now includes:
- Local GPU HBM
- Local CPU DRAM
- Remote GPU HBM (via RDMA)
- Remote CPU DRAM (via RDMA)
- Remote NVMe (via NVMe-oF RDMA)
An AI training pipeline might:
- Stream training data from remote NVMe via RDMA
- Decompress in CPU memory
- Transfer to GPU HBM for preprocessing
- Run training on GPU
- All-reduce gradients to other GPUs via RDMA
- Write checkpoints to remote NVMe via RDMA
Every transition point is a potential coherency bug, a performance bottleneck, or a configuration nightmare. Welcome to modern systems programming.
NVIDIA NCCL: Collective Communications for GPUs
NCCL (NVIDIA Collective Communications Library) is the glue that makes multi-GPU training work efficiently. It implements collective operations (all-reduce, broadcast, reduce-scatter, all-gather) optimized for NVIDIA GPUs.
What NCCL Does
Training neural networks requires aggregating gradients across all GPUs after each training step. Naive approaches are terrible:
Gather to one GPU: GPU 0 collects all gradients, sums them, sends results back. This creates a bottleneck at GPU 0 and doesn't scale beyond a few GPUs.
All-to-all communication: Every GPU sends to every other GPU. This is O(N²) messages, overwhelming the network.
NCCL implements efficient collective algorithms:
Ring all-reduce: GPUs arranged in a logical ring. Gradients are divided into chunks, which travel around the ring in a pipelined fashion. Each GPU reduces (sums) chunks as they pass through. Total data transferred: 2(N-1)/N × data size, which approaches 2× as N grows. This is bandwidth-optimal.
Here's how ring all-reduce works with 4 GPUs:
Initial state (each GPU has different data, shown as A0-A3, B0-B3, etc.):
GPU0: [A0, B0, C0, D0]
GPU1: [A1, B1, C1, D1]
GPU2: [A2, B2, C2, D2]
GPU3: [A3, B3, C3, D3]
Ring topology:
GPU0 → GPU1 → GPU2 → GPU3 → GPU0 (circular)
Phase 1: Reduce-scatter (send chunks around ring, reducing as we go)
Step 1: GPU0 sends D0 to GPU1, GPU1 sends C1 to GPU2, etc.
GPU1 now has D0+D1, GPU2 has C1+C2, etc.
Step 2: GPU1 sends D0+D1 to GPU2, GPU2 sends C1+C2 to GPU3, etc.
GPU2 now has D0+D1+D2, GPU3 has C1+C2+C3, etc.
Step 3: Each chunk reaches its final reduction destination
GPU3 has final sum of D: [D0+D1+D2+D3]
GPU0 has final sum of C: [C0+C1+C2+C3]
GPU1 has final sum of B: [B0+B1+B2+B3]
GPU2 has final sum of A: [A0+A1+A2+A3]
Phase 2: All-gather (propagate reduced values around ring)
Step 4-6: Send the fully-reduced chunks around until everyone has everything
Final state (all GPUs have complete reduced data):
GPU0: [A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
GPU1: [A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
GPU2: [A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
GPU3: [A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
Total communication: 2(N-1) steps, each GPU sends/receives one chunk
Bandwidth: All links used simultaneously (optimal)
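Plugging numbers into that 2(N-1)/N formula: all-reducing 1 GB of gradients across 8 GPUs moves 2 × 7/8 × 1 GB = 1.75 GB through each GPU's network link, and at 1,024 GPUs it's still only 2 × 1023/1024 × 1 GB ≈ 2 GB. Per-GPU traffic barely grows with cluster size, which is exactly what "bandwidth-optimal" means here.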
Tree all-reduce: GPUs arranged in a tree topology. Reduction flows up the tree, broadcast flows down. Lower latency than ring for small data, bandwidth-optimal for large data.
Tree topology for 8 GPUs (binary tree):
                 GPU0 (root)
                /           \
           GPU1              GPU2
          /    \            /    \
      GPU3      GPU4    GPU5      GPU6
                                      \
                                       GPU7
Reduce phase (data flows up to root):
Step 1: Leaf GPUs send to parents
GPU3 → GPU1, GPU4 → GPU1, GPU5 → GPU2, GPU6 → GPU2, GPU7 → GPU6
Step 2: GPU1 reduces (GPU1 + GPU3 + GPU4), sends to GPU0
GPU2 reduces (GPU2 + GPU5 + GPU6 + GPU7), sends to GPU0
Step 3: GPU0 computes final sum (GPU0 + GPU1_sum + GPU2_sum)
Broadcast phase (data flows down from root):
Step 4: GPU0 sends result to GPU1 and GPU2
Step 5: GPU1 sends to GPU3 and GPU4
GPU2 sends to GPU5 and GPU6
Step 6: GPU6 sends to GPU7
Latency: O(log N) steps (much faster than ring's O(N) for small messages)
Bandwidth: Not all links used simultaneously (root can become bottleneck)
Double-binary tree: Two trees for bidirectional bandwidth utilization. NCCL dynamically chooses the best algorithm based on message size and topology.
Two tree roots allow bidirectional flow, doubling effective bandwidth:
Tree 1 (root GPU0):                Tree 2 (root GPU4):

        GPU0                               GPU4
       /    \                             /    \
   GPU1      GPU2                     GPU5      GPU6
     |          |                       |          |
   GPU3      GPU7                     GPU3      GPU7
Data flows up Tree 1 to GPU0, down Tree 2 from GPU4 simultaneously
This avoids the single-root bottleneck of simple tree
NCCL's algorithm selection heuristic:
- Small messages (< 1 MB): Use tree (low latency matters more than bandwidth)
- Large messages (> 100 MB): Use ring (bandwidth efficiency matters)
- Medium messages: Hybrid approaches or based on topology
- Multi-node: Hierarchy matching physical topology (NVLink within node, InfiniBand between nodes)
NCCL and RDMA Integration
NCCL is RDMA-aware. When running on InfiniBand or RoCE networks with GPUDirect, NCCL:
- Uses RDMA verbs directly for inter-node communication
- Leverages GPUDirect RDMA to avoid CPU bouncing
- Pipelines transfers to overlap computation and communication
- Implements smart topology detection to use NVLink within nodes, RDMA between nodes
For a cluster with NVLink within nodes and InfiniBand between nodes, NCCL automatically creates a hybrid communication plan:
- Intra-node: Use NVLink at 900 GB/s (see my previous article on NVLink and PCIe)
- Inter-node: Use InfiniBand RDMA at 200-400 Gb/s per link (see my article on InfiniBand vs Ethernet)
The result is near-linear scaling to thousands of GPUs, which is why trillion-parameter models are trainable.
NCCL and Memory Hierarchy
NCCL deals with the memory hierarchy complexity through:
Registration caching: NCCL pre-registers GPU memory regions for RDMA to avoid registration overhead on every operation.
Pipelining: Break large transfers into chunks that fit in GPU HBM, overlap communication of one chunk with computation on another.
CUDA graphs integration: Capture NCCL operations as CUDA graphs for minimal launch overhead, critical when communication operations are frequent.
Topology awareness: NCCL queries the system topology (which GPUs connect via NVLink, which nodes connect via InfiniBand) and optimizes communication patterns accordingly.
NVIDIA IMEX: Import/Export for Multi-Tenant GPU Clusters
NVIDIA's IMEX (Import/Export) service is a newer development addressing a specific problem: how do you share GPUs across Kubernetes pods or containers while maintaining high-performance RDMA communication?
The Problem IMEX Solves
Traditional GPU sharing in Kubernetes isolates workloads. Each pod gets assigned GPUs, but pods can't directly communicate through RDMA because:
- RDMA requires registered memory with physical addresses
- Containers have isolated memory spaces
- GPUDirect RDMA typically requires processes to be in the same memory context
For multi-node training, you'd fall back to TCP/IP or sacrifice isolation by running everything in one pod (which breaks Kubernetes' resource management).
How IMEX Works
IMEX provides a mechanism to safely share RDMA resources across containers:
Export: A process exports a GPU memory region, creating a handle that can be safely shared across process boundaries.
Import: Another process (potentially in a different container) imports the handle, gaining RDMA access to the exported memory region.
Fabric registration: IMEX handles the complexity of registering imported memory with the RDMA fabric, maintaining security boundaries while enabling high-performance communication.
This allows:
- Each training worker runs in its own Kubernetes pod
- Workers communicate via RDMA/GPUDirect for maximum performance
- Kubernetes maintains resource isolation and orchestration
- NCCL works across pod boundaries as if GPUs were in the same process
IMEX and the Memory Hierarchy
IMEX essentially extends the memory hierarchy export/import abstraction across security domains. Instead of just dealing with HBM vs DRAM coherency, you're now dealing with:
- Container A's GPU HBM
- Container B's GPU HBM (different address space)
- RDMA adapter (shared physical resource)
- Kubernetes' security model (namespace isolation)
IMEX maintains security (Container A can't access Container B's memory unless explicitly exported) while enabling performance (exported regions use zero-copy RDMA).
AWS Neuron: Custom Silicon Meets Custom Networking
AWS went a different direction with their Trainium and Inferentia chips: custom AI accelerators with custom networking. This is interesting because it shows an alternative approach to the NVIDIA ecosystem.
AWS Neuron SDK
Neuron is AWS's SDK for their AI chips. It provides:
- Compiler (converts PyTorch/TensorFlow to Neuron-optimized code)
- Runtime (manages chip execution)
- Collective communications library (like NCCL but for Neuron)
- Integration with AWS infrastructure
The key difference from NVIDIA: AWS controls the entire stack (silicon, networking, software), enabling optimizations impossible with commodity hardware.
EFA: Elastic Fabric Adapter
EFA is AWS's custom network interface for HPC and ML workloads. It's in the same family as RoCE (RDMA over Converged Ethernet), but it rides on AWS's own Scalable Reliable Datagram (SRD) transport with AWS-specific extensions rather than standard RoCE.
EFA provides:
- RDMA functionality: Zero-copy, kernel-bypass communication
- Scalability: Works in multi-tenant cloud environment (unlike traditional RDMA which assumes dedicated fabrics)
- Integration with VPC: RDMA over AWS's virtual networking, maintaining cloud security model
- No configuration: AWS manages the underlying fabric complexity
The engineering challenge AWS solved: traditional RDMA assumes dedicated networks with manual configuration. EFA works in a shared, dynamically provisioned cloud environment while maintaining RDMA's performance benefits.
Neuron Collective Communications (NCC)
NCC is AWS's version of NCCL, optimized for Neuron chips and EFA networking. It implements the same collective operations (all-reduce, etc.) but tuned for AWS's specific hardware topology.
Key differences from NCCL:
- Topology-aware out of the box: AWS knows exactly how instances are connected (they built the datacenter), so NCC can optimize based on actual physical topology
- EFA integration: Direct support for EFA's RDMA capabilities without generic InfiniBand fallbacks
- Multi-tenancy aware: NCC handles the case where other customers' training jobs share the same physical switches
EFA Drivers and the Memory Hierarchy
EFA drivers deal with a similar memory hierarchy problem as GPUDirect, but in AWS's custom environment:
Neuron chip HBM: Where computation happens
System DRAM: Staging area
EBS/EFS storage: Persistent data via AWS block/file storage
EFA fabric: RDMA communication between instances
EFA drivers enable direct memory access between Neuron chips across instances, analogous to GPUDirect but for AWS's custom silicon. The complexity is similar: maintaining coherency across different memory types while avoiding CPU overhead.
Why AWS Built This
AWS could have just used NVIDIA GPUs with InfiniBand like everyone else. They built custom silicon and networking because:
- Cost: Custom ASICs optimized for inference and training can be cheaper per operation than general-purpose GPUs
- Integration: Controlling the full stack enables optimizations impossible with commodity hardware
- Differentiation: Unique capabilities that competitors can't easily replicate
- Economics: AWS runs at scale where custom silicon development costs amortize across millions of instances
Whether this strategy succeeds long-term depends on whether AWS's optimizations can keep pace with NVIDIA's rapid innovation. NVIDIA has the ecosystem, AWS has the cloud integration. Time will tell which matters more.
Beyond AI: Other RDMA Use Cases
High-Frequency Trading
Financial markets live and die by microseconds. RDMA's sub-microsecond latency is critical for:
- Market data distribution (tick data to thousands of trading algorithms)
- Order execution (sending orders to exchanges with minimal latency)
- Risk calculations (aggregating positions across servers in real-time)
Trading firms run InfiniBand fabrics with RDMA because the latency advantage directly translates to profit. A microsecond saved is money earned (or competitor beaten).
Distributed Databases
Modern distributed databases (Cassandra, ScyllaDB, FoundationDB with RDMA extensions) use RDMA to reduce replication latency and increase throughput.
Traditional replication:
- Primary writes to local disk
- Primary sends data over TCP to replicas
- Replicas write to their disks
- Latency: network RTT + multiple kernel overheads
RDMA replication:
- Primary writes to local disk
- Primary RDMAs data directly to replica memory
- Replicas write to disk
- Latency: network RTT only, no kernel overhead
The latency reduction improves write throughput and consistency guarantees. When you're replicating millions of writes per second, RDMA's efficiency matters.
Distributed Filesystems
Parallel filesystems (Lustre, BeeGFS, WekaFS) use RDMA for client-to-server and server-to-server communication. This enables:
- High aggregate bandwidth (hundreds of GB/s from thousands of clients)
- Low latency for metadata operations (creating files, opening files)
- Efficient lock management (distributed locks using RDMA atomics)
HPC centers depend on these filesystems to feed data to compute nodes. RDMA makes it possible to saturate thousands of compute nodes without the storage system becoming the bottleneck.
In-Memory Databases and Caches
Systems like Redis, Memcached (with RDMA extensions), and specialized in-memory databases use RDMA to achieve multi-million operations per second with single-digit microsecond latency.
RDMA Read operations are particularly powerful here: clients can read cached data without server CPU involvement. The server's CPUs handle writes and complex queries, reads are offloaded to RDMA hardware.
This dramatically increases cache hit throughput, which is critical when you're serving a million requests per second per server.
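On the wire, that offloaded read is just the earlier write example with the opcode flipped: the client posts an RDMA Read against the server's registered cache region, and the server's CPU never hears about it. Same assumptions as before (connected qp and cq, registered local buffer, remote address and rkey learned out of band):

// Sketch: client fetches a cached value with a one-sided RDMA Read.
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_read_value(struct ibv_qp *qp, struct ibv_cq *cq,
                    void *local_buf, uint32_t len, struct ibv_mr *mr,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {0};
    struct ibv_send_wr *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_READ;   // the server CPU never sees this request
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        // address of the cached entry on the server
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    struct ibv_wc wc;
    int n;
    do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);   // spin until the data lands in local_buf
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}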
Video Production and Rendering
Real-time video production (live sports broadcasts, movie production) uses RDMA for uncompressed video streaming between cameras, mixers, and editing stations.
Uncompressed 4K video at 60fps is about 12 Gbps per stream. With multiple cameras, effects processing, and editing stations all exchanging video in real-time, you need RDMA's bandwidth and latency guarantees. Traditional networking's variable latency would cause frame drops and synchronization issues.
The Dark Side: RDMA's Challenges
RDMA is powerful but comes with costs and complexity that limit adoption:
Security Implications
RDMA bypasses the kernel's security checks. Once memory is registered for RDMA, the remote side can read/write it without per-operation permission checks. This creates attack surfaces:
- Memory disclosure: Bugs in RDMA applications can leak data to remote hosts
- Memory corruption: Remote writes to wrong addresses corrupt application state
- Denial of service: RDMA operations can exhaust resources (memory, bandwidth)
Cloud providers struggle with RDMA multi-tenancy because traditional isolation mechanisms (kernel firewalls, memory protection) don't apply. AWS EFA's engineering effort was largely about making RDMA safe in multi-tenant environments.
Programming Complexity
RDMA programming is harder than sockets:
- Explicit memory registration and management
- Asynchronous completion model (work requests post now, complete later)
- Manual retry logic for reliable delivery
- Careful attention to memory barriers and synchronization
- Debugging is painful (hardware state, race conditions, timing-dependent bugs)
Most developers don't want to write RDMA code directly. They use libraries (NCCL, MPI) or frameworks (TensorFlow, PyTorch) that hide the complexity. This is why RDMA succeeds in specialized domains (HPC, AI) where performance justifies the investment, but hasn't replaced TCP/IP for general applications.
Deployment and Configuration
RDMA requires infrastructure that doesn't "just work" the way TCP does:
- Specialized NICs (not standard with servers)
- Fabric configuration (InfiniBand subnet managers, RoCE PFC/ECN tuning)
- Driver installation and updates across clusters
- Monitoring tools that understand RDMA metrics
- Expertise to debug problems
For organizations running 10-100 servers, this overhead often isn't worth it. For organizations running 1,000+ servers in HPC or AI training, it's essential.
The ROI Calculation
RDMA makes sense when:
- Latency below 10 microseconds matters
- CPU overhead of networking is a bottleneck
- Bandwidth requirements exceed 10 Gbps per node
- Workload does frequent small messages (HPC, trading)
- Scale justifies infrastructure investment
RDMA doesn't make sense when:
- TCP latency is acceptable (milliseconds OK)
- CPU cycles are abundant
- Bandwidth needs are moderate
- Simplicity and standard tooling matter more than performance
- Budget constrains specialized hardware
Most of the internet runs fine without RDMA. But for the workloads where it matters, nothing else comes close.
The Future: CXL and Memory Disaggregation
The memory hierarchy is getting even more complex with CXL (Compute Express Link), which promises cache-coherent memory sharing over PCIe.
What CXL Enables
CXL.mem allows CPUs to access remote memory (on other servers) as if it were local DRAM, with cache coherence. This is RDMA's concept taken further: not just explicit remote memory operations, but transparent remote memory that the CPU and OS use without knowing it's remote.
Potential use cases:
- Memory pooling: Disaggregate memory from compute, allocate dynamically based on workload
- Memory expansion: Add memory to servers without upgrading RAM
- Persistent memory: Shared persistent memory pools across servers
- AI training: Share optimizer state across nodes without explicit RDMA operations
CXL's RDMA Challenge
CXL promises transparent remote memory, but the physics of networks means latency is orders of magnitude higher than local DRAM. A load instruction that takes 60ns for local DRAM might take 1-5 microseconds for CXL memory over a network.
This creates a new tier in the memory hierarchy:
- L1/L2/L3 cache: nanoseconds
- Local DRAM: 60-100ns
- CXL-attached local memory: 100-300ns
- CXL-attached remote memory: 1-5 microseconds
- NVMe storage: 10-100 microseconds
Software must be aware of this hierarchy. Transparently accessing remote CXL memory will cause performance cliffs if not managed carefully. RDMA's explicit operations at least force developers to think about remote memory access patterns. CXL's transparency might hide performance problems until they become critical.
RDMA + CXL: Best of Both Worlds?
The future likely combines RDMA's explicit, high-performance operations with CXL's transparent, cache-coherent sharing:
- Use CXL for occasional access to shared memory pools (management data, coordination)
- Use RDMA for bulk transfers and latency-critical operations (AI training, HPC)
- Let the system automatically tier memory (frequently accessed in local DRAM, rarely accessed in CXL pools)
This maximizes flexibility while maintaining performance where it matters. But it also adds another layer of complexity to an already complex memory hierarchy.
Living With RDMA in 2025
RDMA has moved from niche supercomputing technology to mainstream infrastructure for AI and cloud computing. Every major AI training cluster uses RDMA (InfiniBand or custom equivalents). Cloud providers offer RDMA-capable instances. The technology is proven and essential for high-performance workloads.
But RDMA remains complex and specialized. Most applications don't need it and won't benefit from it. The traditional kernel network stack is good enough for the vast majority of networking needs, and it's getting better (see my article on DPDK and kernel bypass).
The memory hierarchy complexity isn't going away. HBM, DRAM, CXL memory, NVMe, we're adding layers not simplifying. RDMA helps manage this complexity by providing explicit, efficient operations. But the burden is on developers to use it correctly, and the barriers to entry remain high.
For the organizations building AI infrastructure, RDMA isn't optional. The performance advantages are too significant, and the ecosystem (NCCL, IMEX, EFA) has matured to the point where deployment is practical. The ROI is clear when you're training models that cost millions of dollars in compute.
For everyone else, RDMA is something to be aware of but probably not something to deploy unless you've proven that traditional networking is actually your bottleneck. Profile first, optimize second, and be honest about whether you need microsecond latency or if you just want the cool technology.
The Cray Legacy Lives On
From Cray's distributed shared memory supercomputers to today's AI training clusters with GPUDirect RDMA, the core insight remains: let hardware move data while software focuses on computation. RDMA isn't a new idea, it's the full realization of principles established in supercomputing decades ago.
What's changed is scale and accessibility. Cray systems cost tens of millions of dollars and required specialized expertise. Today, you can spin up RDMA-capable instances on AWS for dollars per hour, and frameworks like PyTorch handle the RDMA complexity behind the scenes.
The memory hierarchy has become vastly more complex, with HBM, multiple DRAM tiers, persistent memory, and now CXL adding new layers. RDMA provides the plumbing to make this complexity manageable, enabling applications to explicitly control data movement with minimal overhead.
But the fundamental trade-off remains: performance versus simplicity. RDMA delivers unmatched performance for the right workloads, but it demands careful engineering, specialized hardware, and expertise that most organizations lack. It's a tool for when you've exhausted simpler alternatives and performance is worth the complexity cost.
The future of networking isn't everyone using RDMA, it's RDMA powering the specialized infrastructure (AI training, HPC, high-frequency trading) while traditional networking serves everything else. Seymour Cray would probably approve, specialized tools for specialized problems, engineered for maximum performance when nothing else will do.
Now excuse me while I go debug why my RDMA application works on Mondays but fails on Tuesdays. It's probably a race condition involving cache coherency across GPU HBM, CPU DRAM, and the network fabric. Or maybe I just forgot to flush writes before posting the RDMA operation. Again. RDMA is powerful, but it never lets you forget you're playing with fire.