The NIC Has a Computer Inside It: SmartNICs, DPUs, and Offloading the Kernel

Scott Morrison • May 19, 2026


For thirty years the networking industry has been trying to decide how much work belongs on the NIC and how much belongs on the CPU. The answer, after two failed attempts and one correct one, turns out to be: most of it. This is the story of how a card that used to move packets became a full computer that the CPU reports to.

Your server has two computers in it. One is the CPU you bought, the one with the cores and the cache hierarchy and the operating system. The other one is inside the NIC, and it has been getting significantly more powerful every three years without anyone making a particularly big announcement about it.

This is not an accident. It is the result of a slow, occasionally ugly reckoning with a simple problem: at a certain network speed, using the CPU to process the network is the wrong answer. The CPU is expensive, general-purpose silicon. It has caches, a branch predictor, a memory subsystem, and a kernel that would prefer not to be interrupted several million times a second to move packets. When you are pushing 100 gigabits of traffic through a server, roughly 12.5 gigabytes of data per second, the overhead of the software networking stack on the host can eat 30 to 50 percent of your CPU cycles before your application gets to run. That is not a performance problem. That is a design failure.

The industry has been trying to fix it since the late 1990s. It got the answer wrong twice, for interesting reasons, before getting it right the third time in a way that quietly restructured how cloud computing works.

Let's talk about how a network card became a data center.

The First Attempt: TCP Offload Engines, and Why the Kernel Said No

The original problem statement was straightforward. TCP in the early 2000s was expensive. A 2.4 GHz Pentium 4, which at the time was a respectable server CPU, could spend more than 80 percent of its cycles on full-duplex gigabit TCP processing. The math was not complicated: you needed roughly one hertz of CPU capacity per bit per second of TCP throughput, which meant that 1 Gbps of TCP traffic required about 1 GHz of CPU just for the network stack. On a system with more than one process, that left very little for the applications you were actually running.
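The rule of thumb above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and the `hz_per_bps` parameter are illustrative, and the one-hertz-per-bit figure is the rough estimate quoted above, not a measured constant.

```python
def cpu_fraction_for_tcp(throughput_bps: float, cpu_hz: float,
                         hz_per_bps: float = 1.0) -> float:
    """Fraction of one CPU's cycles spent on TCP under the rule of thumb."""
    return (throughput_bps * hz_per_bps) / cpu_hz


# Full-duplex gigabit Ethernet carries 1 Gbps in each direction.
full_duplex_gige_bps = 2 * 1e9
pentium4_hz = 2.4e9

fraction = cpu_fraction_for_tcp(full_duplex_gige_bps, pentium4_hz)
print(f"{fraction:.0%} of the CPU goes to TCP processing")
```

Under these assumptions, full-duplex gigabit traffic on a 2.4 GHz core lands just above the 80 percent mark, which is consistent with the figure quoted for the Pentium 4.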

The proposed solution was the TCP Offload Engine, universally abbreviated TOE. The idea was to put the entire TCP/IP stack onto dedicated silicon on the NIC. Instead of the CPU handling connection state, acknowledgments, retransmissions, checksum calculations, and sliding window math, a specialized processor on the network card would do all of it. The host CPU would see a byte stream, neatly assembled, without having been involved in any of the protocol work.

TOE NICs shipped. They even worked, in limited circumstances. The problem was that the Linux kernel networking community refused to support them, and their rejection was principled rather than obstructionist.

The core issue was protocol ossification. If TCP processing lives in the kernel, and a bug is found in congestion control, you can patch the kernel. The patch gets distributed. The bug gets fixed everywhere. If TCP processing lives on a proprietary NIC with firmware written by a hardware vendor, patching the bug requires a firmware update from that vendor, coordinated across potentially millions of deployed cards, with all the driver compatibility landmines that come with it. Linus Torvalds and the kernel developers were not willing to cede ownership of protocol behavior to closed-source NIC firmware. The decision was made in 2005 to reject full TCP offload support from the mainline kernel. That rejection effectively killed TOE as a mass-market technology for general-purpose workloads.

What survived was partial offload: checksum calculation, TCP segmentation offload (TSO), large receive offload (LRO), and scatter-gather DMA. These are the features that push specific, stable, hardware-amenable operations onto the NIC without giving the NIC ownership of protocol state. Modern NICs still do all of this, and it is genuinely useful. But it is not offloading the networking stack. It is offloading arithmetic.
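To make "offloading arithmetic" concrete, here is a sketch of the ones' complement Internet checksum (the RFC 1071 algorithm used by IPv4, TCP, and UDP), which is the kind of stable, stateless computation that checksum offload moves into NIC hardware. The function name is illustrative; a real NIC computes this in silicon as the packet streams through, not in a loop like this.

```python
def internet_checksum(data: bytes) -> int:
    """Ones' complement sum of 16-bit words, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# A 20-byte IPv4 header with its checksum field (bytes 10-11) zeroed.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
print(hex(checksum))

# With the checksum written back into the header, the sum verifies to zero.
filled = header[:10] + checksum.to_bytes(2, "big") + header[12:]
print(internet_checksum(filled))
```

Nothing in this computation depends on connection state, which is why it could move to hardware without the NIC taking ownership of the protocol.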

The first attempt taught the industry something important. The problem is not just performance. The problem is who owns the software running on the offload device, and whether you can update it when something goes wrong.

The Second Attempt: RDMA NICs and the Kernel Bypass School

The second wave of NIC intelligence came from HPC rather than cloud computing, and it took a different philosophical approach. Instead of offloading the kernel stack onto the NIC, RDMA NICs bypassed the kernel entirely.

We covered RDMA in depth in an earlier paper. The short version is that Remote Direct Memory Access allows one machine to read or write the memory of another machine without involving either CPU. The NIC handles the memory transfer, the authentication, and the flow control. The kernel is not in the loop for data movement, only for control operations. This is not offloading the stack: it is declaring that the stack does not exist for this application, and replacing it with a hardware primitive.

RDMA NICs, particularly Mellanox (now NVIDIA) ConnectX adapters, became standard equipment in HPC clusters throughout the 2000s and 2010s. They solved the CPU consumption problem for specific workloads, principally MPI-based parallel computing and, later, distributed storage. The attack surface was much smaller because the NIC was not running a general-purpose protocol stack; it was running a fixed RDMA protocol with well-defined semantics.

The limitation of RDMA was that it required applications to be written against RDMA APIs, or to use middleware that translated conventional socket calls into RDMA operations. General-purpose workloads that did not have this integration got none of the benefits. A web server handling HTTP requests cannot trivially be converted to RDMA. A database replication protocol could, but only with significant engineering investment.

RDMA was the right answer for a specific vertical: high-performance computing and large-scale distributed storage. It was not a general solution for the problem of CPU overhead in networked servers. The broader question, how do you offload all the infrastructure work without requiring applications to change, remained unanswered.

The Third Attempt: The DPU, and Getting It Right

The third wave had the right insight: if you want to offload infrastructure work without requiring applications to be rewritten, you need a complete computer on the NIC, not a fixed-function accelerator and not a faster CPU alternative. You need a device that runs an independent operating system, has its own memory, handles its own network functions, and presents a clean interface to the host CPU. You need something that the host CPU cannot easily meddle with.

This concept has accumulated more names than it deserves. Depending on the vendor and the year, you might encounter SmartNIC, DPU (Data Processing Unit), IPU (Infrastructure Processing Unit), or the generic "intelligent NIC." The distinctions are real but narrower than the marketing suggests. A SmartNIC, in the traditional sense, offloads specific data-plane functions while leaving control-plane operations on the host CPU. A DPU goes further: it runs a full operating system on embedded ARM cores, can execute arbitrary software, and is logically independent of the host. The host CPU cannot see what is running on the DPU without going through a defined interface. A DPU is a tenant of the same PCIe slot, but it is not a subordinate.

The practical difference: a SmartNIC can accelerate VXLAN encapsulation in hardware. A DPU can run the entire Open vSwitch control plane on its ARM cores, handle all packet processing in hardware, and refuse to tell the host CPU anything about what it is doing unless the DPU's software decides to share that information.

That security boundary, the ability of the DPU to be trusted by the cloud provider without necessarily trusting the host CPU it is plugged into, turns out to be at least as important as the performance benefits.

AWS Nitro: The First DPU Deployed at Hyperscale

The most consequential early deployment of what we now call DPU technology was not sold as a product. It was built internally by AWS and deployed to power EC2.

The Nitro System has an origin story that reflects how the hyperscalers actually develop infrastructure. In 2013, AWS launched its C3 instance type. The C3 was the first instance to offload network processing to dedicated hardware, a custom card they called the Nitro Card for VPC. The next year, C4 offloaded EBS storage. These first-generation Nitro cards used commercial off-the-shelf silicon from an Israeli semiconductor company called Annapurna Labs. AWS was so impressed by Annapurna's work that they acquired the company in January 2015 for a reported $350 million. The person making a $16 billion bet on FPGA programmability for Intel's IPU strategy would later look at that number and have complicated feelings about it.

By 2017, with the C5 instance type, AWS had offloaded everything: network, storage, management, security, and control plane. The Nitro Hypervisor, a KVM derivative stripped down to its minimal functional core, runs on the host CPU but hands off all I/O virtualization to Nitro Cards connected via PCIe. The Nitro Security Chip, soldered to the server motherboard, establishes a hardware root of trust and mediates all firmware updates to non-volatile storage on the system. AWS personnel have no mechanism to access customer data on Nitro instances, not because of a policy constraint but because no such mechanism exists in the hardware.

Here is what the Nitro architecture buys you on top of the security argument. When you shift all I/O virtualization to dedicated hardware, the host CPU's only job is to run the hypervisor scheduler and the customer's virtual machine. AWS claims the performance improvement from the C4 to C5 generation was not primarily from CPU improvements. It was from the elimination of hypervisor overhead for I/O. The CPU cores that used to spend cycles processing EBS writes and VPC packet encapsulation now spend those cycles running customer workloads. AWS has claimed their network-optimized instances achieved up to four times better storage latency after moving to the full Nitro architecture. The Nitro Card for VPC transparently encrypts all traffic between EC2 instances using AES-256-GCM, with no measurable performance penalty, because the encryption happens in dedicated hardware engines on the card rather than in software on the host.

The Nitro architecture also enabled bare-metal EC2 instances. When the hypervisor is no longer responsible for I/O virtualization, you can offer instances where the customer's code runs directly on the host CPU with no hypervisor layer at all. The Nitro Cards still handle network and storage, but the customer gets access to the full instruction set, including hardware performance counters, without the overhead or opacity of a hypervisor between them and the silicon. This is not a niche use case: financial services firms running latency-sensitive workloads, HPC applications doing MPI over fabric, and database workloads that want to own their own memory management all have strong reasons to want bare metal.

The same Annapurna Labs team that built Nitro also built the Graviton processors, AWS's ARM-based CPUs, and the Trainium and Inferentia AI accelerators. The $350 million acquisition turned into the silicon strategy for an entire cloud.

Azure AccelNet: When the FPGA People Were Right

Microsoft took a different path to the same destination. Their SmartNIC program, published in detail in a 2018 NSDI paper by Daniel Firestone and colleagues, chose FPGAs over custom ASICs, and the reasoning is worth understanding.

Microsoft's networking team had to make a hardware choice: dedicated ASICs, which offered maximum efficiency but locked functionality at tape-out time; general-purpose embedded CPU cores, which were programmable but did not perform well enough on single network flows; or FPGAs, which offered a programmability/efficiency tradeoff that sat between the other two. They chose FPGAs, deploying them on every new Azure server from late 2015 onward as part of the AccelNet program. By the time the NSDI paper was published, they had over a million hosts running FPGA-based SmartNICs.

The FPGA argument turned out to be correct for Azure's specific situation. Microsoft's Virtual Filtering Platform (VFP), the host SDN software stack that implements VXLAN tunneling, ACLs, stateful NAT, QoS, and tenant isolation, was evolving rapidly. Writing those functions in FPGA logic and being able to reprogram deployed hardware as features changed was more valuable than the absolute efficiency advantage of a custom ASIC. The ability to update functionality in the field without replacing cards, combined with performance and efficiency far beyond what software running on the host CPU could achieve, justified the complexity.

The performance results from AccelNet were notable: sub-15 microsecond VM-to-VM TCP latencies and 32 Gbps throughput, which was faster than any competing public cloud at the time. The FPGA was doing all the packet processing for VFP, handling flow table lookups, applying network policy, and performing VXLAN encapsulation and decapsulation, entirely in hardware. The host CPU was not involved in the data path for established flows.

Microsoft's latest generation, branded Azure Boost, uses the Microsoft Azure Network Adapter (MANA), a 200G card built around an Altera Agilex FPGA and an ARM SoC. The card runs Microsoft Linux. MANA drivers are in the mainline Linux kernel. Azure Boost extends the offload pattern from networking to storage as well, presenting NVMe interfaces directly to VMs via hardware acceleration and achieving 12.5 GB/s data throughput and 650,000 IOPS from storage without involving the host CPU in the data path.

Azure and AWS got to roughly the same architectural place through different silicon choices. AWS bet on custom ASICs from the start and won on long-term efficiency and cost. Azure bet on FPGAs for programmability and won on agility during a period when their SDN stack was changing quickly. Both bets were defensible. Both resulted in hyperscalers where the NIC is doing substantially more work than the CPU for infrastructure functions.

NVIDIA BlueField: When the GPU Company Built a Data Center on a Card

NVIDIA's entry into DPUs came through its 2020 acquisition of Mellanox Technologies for $6.9 billion. Mellanox had been building high-performance network adapters for HPC clusters for two decades. Their ConnectX line was already the dominant RDMA NIC in AI training infrastructure. When NVIDIA bought Mellanox, they inherited not just the NIC business but the team that built it, and they had very specific plans for that team.

The BlueField DPU line, which predated the NVIDIA acquisition but was dramatically accelerated by it, combines a high-speed Mellanox ConnectX network interface with a cluster of ARM processing cores and dedicated hardware accelerators for cryptography, compression, and packet processing. BlueField-2 shipped with 8 ARM Cortex-A72 cores and 200 Gbps connectivity. BlueField-3 doubled the core count to 16 ARM Cortex-A78 cores, moved to 400 Gbps Ethernet or InfiniBand, and significantly expanded the hardware accelerator surface for cryptographic operations. NVIDIA's marketing claims that a single BlueField-3 DPU provides the equivalent of 300 CPU cores' worth of infrastructure services offloaded, which is a number that requires interpretation but is directionally correct for workloads that map well to what the DPU's fixed-function hardware accelerates.

BlueField-4, announced at NVIDIA GTC 2025, pushes bandwidth to 800 Gbps and delivers roughly six times the compute of BlueField-3 according to NVIDIA, with a unified control plane through the DOCA SDK that spans networking, storage, and security services. The explicit marketing target is AI factories, meaning the massive GPU clusters where LLMs and other large models are trained. This is not accidental.

DOCA (Data Center Infrastructure on a Chip Architecture) is the programming model for BlueField DPUs: an SDK that lets developers write software to run on the DPU's ARM cores, using the DPU's hardware accelerators for network processing, crypto, and compression. This is the answer to the TOE failure mode: instead of closed-source firmware implementing a fixed protocol, you have a programmable device running open-source-compatible software that you can audit, modify, and update. The programming model is different from the host CPU but not alien to it. You are writing software for ARM cores running Linux. The kernel developers' objection to black-box NIC firmware does not apply to an ARM CPU running Linux with an SDK and open drivers.

Intel IPU and AMD Pensando: The Rest of the Market

Intel arrived late and loudly. Their Infrastructure Processing Unit (IPU) program, originally focused on custom chips for specific hyperscalers, produced the Intel IPU E2100, which launched publicly in 2024. The E2100 is the first IPU Intel made available outside its initial hyperscaler customer (Google Cloud, which co-developed the "Mt Evans" design that became the E2100). It offers 200 Gbps connectivity, 16 ARM Neoverse N1 cores, and dedicated accelerators for IPSec, TLS, QUIC, NVMe, and compression. The IPU E2200, announced at Hot Chips 2025, brings 400 Gbps parity with BlueField-3.

Intel's IPU architecture is explicitly modeled on the Nitro-style separation of infrastructure and tenant workloads. The design goal is that the host CPU runs tenant applications, and the IPU runs all infrastructure services: virtual switching, network policy enforcement, storage virtualization, and security functions. The IPU is a separately administered compute domain that the tenant cannot see into.

AMD acquired Pensando Systems in 2022 for $1.9 billion and now sells its DPU technology under the AMD Pensando name. The Pensando Elba, built around 16 ARM Cortex-A72 cores and supporting dual 200 GbE with P4 programmability, became AMD's primary DPU offering for enterprise and cloud customers. Elba has stronger support in the VMware ecosystem than some alternatives, which matters considerably in enterprise deployments where VMware vSphere is the operational reality. The Pensando Pollara 400, announced in late 2024 and targeting AI networking specifically, is an Ultra Ethernet Consortium-compliant AI NIC rather than a full DPU, aimed at the market segment where you need intelligent congestion control and GPU-aware scheduling but not the full independent compute domain of a DPU.

Marvell plays in this space with its Octeon line of network processors, which run data plane applications in a DPU-like configuration, particularly in telco and edge contexts. Marvell has been in the NIC processor business longer than most of the newer entrants and has been particularly successful in 5G base station and edge applications where real-time packet processing requirements are strict.

What Actually Gets Offloaded

The practical value of all these devices depends on what workloads you are actually moving off the host CPU. The answer varies by deployment context, but there are consistent patterns.

VXLAN and overlay encapsulation. Virtual networks in cloud environments wrap tenant packets in UDP/VXLAN headers, adding the destination physical endpoint information before sending packets over the physical underlay network. At 100 Gbps, doing this in software on the host CPU is expensive. Doing it in fixed-function hardware on the DPU is essentially free. Every major DPU can act as a VXLAN Tunnel Endpoint (VTEP) in hardware, and this is typically the first thing cloud providers offload because the performance improvement is immediate and the complexity is low.
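For a sense of what the hardware does per packet, here is a minimal sketch of the 8-byte VXLAN header defined in RFC 7348. The function names are illustrative, and the outer Ethernet, IP, and UDP headers (UDP destination port 4789) that a real VTEP also prepends are left out to keep the example short.

```python
import struct
from typing import Tuple

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field carries a valid ID


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit network identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24). Word 2: VNI (24) + reserved (8).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header to a tenant's Ethernet frame."""
    return vxlan_header(vni) + inner_frame


def vxlan_decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Return (vni, inner_frame) from a VXLAN UDP payload."""
    flags_word, vni_word = struct.unpack("!II", payload[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8, payload[8:]


packet = vxlan_encapsulate(b"tenant-frame", vni=5000)
print(packet[:8].hex())
print(vxlan_decapsulate(packet))
```

The work is trivial per packet and identical for every packet, which is exactly the profile that suits fixed-function hardware and wastes a general-purpose core.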

Open vSwitch offload. OVS is the software virtual switch that implements SDN policy in most Linux-based cloud hypervisors. It is flexible, powerful, and, under load, a significant CPU consumer. DPUs can offload OVS flow tables to hardware, handling the common case (established flow, known policy) in the DPU's packet processing engines and only falling back to the software OVS implementation for new flows or policy exceptions. The first-packet exception path remains on the CPU, but the steady-state data path moves to hardware. Azure's entire AccelNet program is essentially this idea, scaled to a million servers.
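The fast-path and slow-path split can be modeled in a few lines. This is a toy sketch, not how OVS or any DPU actually implements it: `OffloadedSwitch`, `FlowKey`, and `example_policy` are hypothetical names, and a Python dictionary stands in for the hardware flow table.

```python
from typing import Callable, Dict, NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


class OffloadedSwitch:
    """Exact-match flow cache in front of a software slow path."""

    def __init__(self, slow_path: Callable[[FlowKey], str]) -> None:
        self.slow_path = slow_path
        self.hw_flow_table: Dict[FlowKey, str] = {}  # stand-in for hardware
        self.hw_hits = 0
        self.sw_misses = 0

    def forward(self, key: FlowKey) -> str:
        action = self.hw_flow_table.get(key)
        if action is not None:
            self.hw_hits += 1          # established flow: no software involved
            return action
        self.sw_misses += 1            # first packet: consult software policy
        action = self.slow_path(key)
        self.hw_flow_table[key] = action  # install the flow for later packets
        return action


def example_policy(key: FlowKey) -> str:
    """Hypothetical policy: allow HTTPS to a VM port, drop everything else."""
    return "output:vm2" if key.dst_port == 443 else "drop"


switch = OffloadedSwitch(example_policy)
flow = FlowKey("10.0.0.1", "10.0.0.2", 6, 40000, 443)
actions = [switch.forward(flow) for _ in range(1000)]
print(switch.sw_misses, switch.hw_hits)
```

In this model only the first packet of the flow reaches the policy function; the remaining packets are answered from the table, which is the behavior the offload is designed to produce.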

IPSec and TLS termination. Hardware crypto acceleration is the oldest form of NIC offload that actually worked without controversy, and modern DPUs take it much further. BlueField-3 includes dedicated AES-GCM accelerators that can encrypt and decrypt at line rate without using ARM core cycles. AWS Nitro Cards transparently encrypt all inter-instance VPC traffic using AES-256-GCM in dedicated hardware engines with keys stored in the SoC. The host CPU and the tenant's operating system have no visibility into the key material and no involvement in the packet-by-packet encryption. From the tenant's perspective, the network is encrypted; from the host CPU's perspective, the encryption does not exist as a workload.

Firewall and access control enforcement. Stateless ACLs and stateful firewall rules can be pushed into DPU packet processing engines and enforced at line rate before packets reach host memory. This is operationally important in multi-tenant environments because it means security policy is enforced by a domain that the tenant cannot influence, even with root access to their own VM. A compromised tenant VM cannot subvert network policy enforced on the DPU.
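As a rough illustration of a stateless ACL, the sketch below evaluates rules in order and returns the first match, falling back to a default deny. The `AclRule` type and `evaluate_acl` function are hypothetical; real DPUs compile rules like these into hardware match tables rather than walking a list.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AclRule:
    src: str                  # source prefix in CIDR notation
    dst_port: Optional[int]   # None matches any destination port
    action: str               # "allow" or "deny"


def evaluate_acl(rules: List[AclRule], src_ip: str, dst_port: int,
                 default: str = "deny") -> str:
    """Return the action of the first matching rule, else the default."""
    addr = ipaddress.ip_address(src_ip)
    for rule in rules:
        in_prefix = addr in ipaddress.ip_network(rule.src)
        port_ok = rule.dst_port is None or rule.dst_port == dst_port
        if in_prefix and port_ok:
            return rule.action
    return default


rules = [
    AclRule("10.0.0.0/8", 22, "deny"),     # block SSH from the tenant range
    AclRule("10.0.0.0/8", None, "allow"),  # allow everything else from it
]
print(evaluate_acl(rules, "10.1.2.3", 22))
print(evaluate_acl(rules, "10.1.2.3", 443))
print(evaluate_acl(rules, "192.168.1.1", 443))
```

The security property described above comes from where this logic runs, not from the logic itself: the rule list lives on the DPU, outside anything the tenant can modify.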

NVMe-over-Fabrics storage. Disaggregated storage architectures, where NVMe drives are accessed over a network fabric rather than being directly attached to the compute host, require the host to run an NVMe-oF client stack. Running this in software adds CPU overhead and latency. DPUs can run the NVMe-oF initiator in hardware, presenting a standard NVMe device interface to the host and handling all the fabric transport, flow control, and error recovery in the DPU. Azure Boost's storage offload is exactly this: the VM sees an NVMe device, and the DPU handles the underlying NVMe-oF protocol to disaggregated storage over the network.

RDMA and congestion control. For AI training clusters, RDMA processing is the core workload. DPUs handle RoCEv2 RDMA operations in hardware, with dedicated hardware queue pairs and memory regions. The congestion control algorithms, specifically DCQCN (Data Center Quantized Congestion Notification) and ECN processing, run in hardware on the DPU rather than in software on the host, which means the response to congestion signals happens in microseconds rather than the millisecond timescales of software-based congestion control. At the scale of a GPU cluster doing gradient synchronization across thousands of nodes, this response time difference matters enormously for training throughput.
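The sender-side logic of DCQCN is simple enough to sketch, which is part of why it fits in hardware. The model below follows the rate-decrease and recovery rules from the published DCQCN algorithm in simplified form: it merges the timer and byte-counter triggers into one method and omits the hyper-increase stage. The class name and the parameter values are illustrative assumptions, not any vendor's defaults.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point: one flow's sending rate."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256,
                 rate_ai_gbps: float = 0.04,
                 fast_recovery_steps: int = 5) -> None:
        self.line_rate = line_rate_gbps
        self.current_rate = line_rate_gbps  # rate the NIC is sending at
        self.target_rate = line_rate_gbps   # rate to recover toward
        self.alpha = 1.0                    # estimate of congestion severity
        self.g = g
        self.rate_ai = rate_ai_gbps
        self.fast_recovery_steps = fast_recovery_steps
        self.increase_steps = 0

    def on_cnp(self) -> None:
        """Congestion notification arrived: remember the rate, then cut it."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.increase_steps = 0

    def on_alpha_timer(self) -> None:
        """No notification in the last interval: decay the estimate."""
        self.alpha *= 1 - self.g

    def on_increase_timer(self) -> None:
        """Recover halfway to the target; later, probe for more bandwidth."""
        if self.increase_steps >= self.fast_recovery_steps:
            self.target_rate = min(self.line_rate,
                                   self.target_rate + self.rate_ai)
        self.current_rate = (self.target_rate + self.current_rate) / 2
        self.increase_steps += 1


sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()
rate_after_cut = sender.current_rate
sender.on_increase_timer()
rate_after_recovery = sender.current_rate
print(rate_after_cut, rate_after_recovery)
```

Each event handler is a handful of multiplies and adds on per-flow state, so running it next to the queue pair in hardware avoids a round trip through host software on every congestion signal.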

The AI Infrastructure Connection

The most urgent current driver of DPU adoption is not general cloud virtualization. It is GPU clusters.

A modern AI training cluster runs distributed training jobs across hundreds or thousands of NVIDIA H100 or B200 GPUs. Each GPU communicates with its peers over a fabric, either InfiniBand or high-speed Ethernet, continuously exchanging gradient updates during training. The communication volume is extreme: each GPU can generate hundreds of gigabits of traffic per second. The CPU in the server hosting those GPUs is not particularly important to the training computation. The GPU does the math. The CPU's main job is to not be in the way.

Here is the problem. If the CPU has to handle network stack processing for the GPU's fabric traffic, every cycle it spends adds latency, and that latency leaves the GPU idle waiting for data. GPUDirect RDMA, the NVIDIA technology that lets the NIC read and write GPU memory directly over PCIe without staging data in CPU-managed host buffers, addresses the data path. But the control plane for the RDMA connection (queue pair management, memory registration, and congestion control) still needs to happen somewhere. On a DPU, it happens on the DPU's ARM cores and hardware accelerators, not on the host CPU. The GPU trains. The CPU stays out of the way. The DPU handles the fabric.

BlueField-4's positioning as an AI factory DPU is not marketing opportunism. GPU clusters genuinely need the kind of infrastructure offload that DPUs provide, and the scale of AI infrastructure buildout has created enormous demand. The DPU SmartNIC market was valued at $1.11 billion in 2024 and is projected to grow at roughly 15 percent annually to $4.44 billion by 2034. Around half of cloud providers are now using DPUs in at least part of their infrastructure. Approximately 35 percent of AI training workloads now offload network functions to DPUs, a number that is growing fast.

The interesting technical wrinkle in AI infrastructure is that the NIC and the DPU are starting to serve different roles in the same cluster. The SmartNIC or DPU handles the infrastructure functions: management, policy, isolation, storage, encryption. A separate high-speed AI NIC, like NVIDIA's ConnectX-7, handles the actual GPU-to-GPU training traffic at maximum speed without being burdened by infrastructure work. On InfiniBand fabrics, that traffic can also use SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) for in-network all-reduce operations: the switches aggregate gradients from multiple nodes as packets pass through the network, reducing the total amount of data that has to be transmitted. That is moving compute onto the network fabric, which is a different thing from moving infrastructure work onto the DPU. Both are happening simultaneously.
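A toy model shows why in-network reduction cuts traffic. In this sketch an aggregation point sums the gradient vectors it receives and forwards a single result, so the link beyond it carries one message instead of one per worker. The function names are illustrative and this is not the SHARP protocol itself, only the arithmetic it performs.

```python
from typing import List


def aggregate_in_network(gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of the workers' gradient vectors."""
    return [sum(values) for values in zip(*gradients)]


def uplink_messages(num_workers: int, in_network: bool) -> int:
    """Messages crossing the link above the aggregation point."""
    return 1 if in_network else num_workers


worker_gradients = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
reduced = aggregate_in_network(worker_gradients)
print(reduced)
print(uplink_messages(len(worker_gradients), in_network=False),
      uplink_messages(len(worker_gradients), in_network=True))
```

With three workers the saving is small; with thousands of workers arranged under a tree of aggregation points, each level forwards one vector upward instead of all of its inputs.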

The Honest Accounting of What DPUs Cannot Do

DPUs solve real problems and they create real ones. The engineering costs are non-trivial.

Programming complexity. Writing software to run on a DPU is not the same as writing software for a server CPU. You are targeting ARM cores with access to different memory, different hardware accelerators, and a different bus architecture. NVIDIA's DOCA SDK is the best-developed programming environment in the space, but "best-developed" means "is now a fully-featured SDK after several years of effort," not "is as easy as writing software for the host." Organizations without dedicated DPU software engineers face a steep climb to move infrastructure work off the host.

The wimpy core problem. A BlueField-3's 16 ARM Cortex-A78 cores are capable of running real software, but they are not competition for a modern server's Xeon or EPYC CPUs. For workloads that map well to the DPU's hardware accelerators, this does not matter: the accelerators do the heavy lifting and the ARM cores handle exceptions. For workloads that require general-purpose CPU performance, the DPU's ARM cores are a bottleneck. Published research on BlueField-2 noted that it is easy to overwhelm the device: the flexibility comes with limits on how much general-purpose compute you can push onto it before performance degrades. RDMA performance over the DPU matched host CPU performance in some benchmarks, but TCP performance over the DPU was significantly worse than what an x86 host could achieve directly, because TCP processing does not map as cleanly to the DPU's hardware.

The off-path latency tax. DPUs come in two architectural flavors: on-path (also called bump-in-the-wire) and off-path. On-path DPUs sit between the network cable and the host, which means every packet passes through them. Off-path DPUs connect to the host via PCIe and have a separate network connection, which means moving data to and from the DPU's network interface involves a PCIe DMA transfer. That DMA transfer adds latency. For workloads where the DPU's hardware accelerators save more time than the DMA transfer costs, the net result is still a win. For latency-sensitive workloads where the DMA overhead is significant relative to the per-packet processing time, the math may not work in favor of DPU offload.
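The trade-off reduces to a single comparison, sketched below. The function name and all of the microsecond figures are hypothetical placeholders; real values have to be measured on the actual host, DPU, and workload.

```python
def offload_saving_us(host_processing_us: float,
                      dpu_processing_us: float,
                      pcie_dma_us: float) -> float:
    """Per-operation time saved by offloading; negative means a net loss."""
    return host_processing_us - (dpu_processing_us + pcie_dma_us)


# Hypothetical heavy operation: the accelerator saves far more than DMA costs.
heavy = offload_saving_us(host_processing_us=10.0,
                          dpu_processing_us=1.0,
                          pcie_dma_us=2.0)

# Hypothetical light operation: the DMA hop costs more than it saves.
light = offload_saving_us(host_processing_us=2.0,
                          dpu_processing_us=1.0,
                          pcie_dma_us=2.0)

print(heavy, light)
```

The point of writing it out is that the DMA cost is fixed per transfer while the accelerator's benefit scales with how expensive the operation was on the host, so cheap per-packet operations are the ones most likely to lose.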

Cost. A server-grade DPU like the BlueField-3 is not a $50 NIC. It is a complex SoC with ARM cores, hardware accelerators, DRAM, and a high-speed network interface, all on a half-height PCIe card. The power consumption is up to 150 watts, which is another 150 watts added to the rack budget multiplied by however many servers you are running. At hyperscaler scale, the amortization math works because the performance and efficiency gains from CPU offload return more value than the DPU's cost. At enterprise scale, the math is less obvious, and many enterprises are still running standard NICs.

Software ecosystem maturity, or the lack thereof. The DPU software ecosystem is not mature. The tooling for debugging software running on a DPU, monitoring what the DPU is doing, tracing packets through the DPU's pipeline, and correlating DPU behavior with host application performance is significantly less developed than the equivalent tooling for host software. DOCA is improving, but when something goes wrong in a DPU offload and you need to figure out why, the diagnostic tools are not yet at the same level as what you have for debugging host networking issues.

The Security Architecture That Changes Everything

The most underappreciated property of full DPUs as opposed to simple SmartNICs is the trust model they enable.

In a conventional cloud server, the hypervisor runs on the host CPU alongside everything else. The hypervisor trusts itself, which means it trusts the entire host environment, which means that any vulnerability in the hypervisor or the host OS represents a potential path to compromising tenant isolation. The hypervisor is software. Software has bugs.

In a Nitro-style architecture, the hypervisor is stripped to its minimal essential functions: partitioning CPU time and memory between VMs. Everything else (network packet processing, storage I/O, key management, firmware update validation) runs on the Nitro Cards, which are separately administered hardware. The Nitro Security Chip enforces a hardware root of trust. Updates to firmware on any component of the system must be cryptographically signed and are mediated by the Nitro Controller. AWS employees have no mechanism to access customer data on a Nitro instance because the hardware does not provide one. This is a fundamentally different security claim than "we have a policy that employees cannot access customer data." A policy can be violated. A hardware constraint cannot be violated through software.

The same principle applies to the BlueField's security model. Because the DPU runs a separate operating system on its own ARM cores, and because network policy is enforced in the DPU's hardware before packets reach host memory, a compromised host CPU cannot bypass network security policy. A tenant VM that obtains root access to its host cannot reach across to another tenant's traffic by manipulating the host networking stack, because the host networking stack is not processing the traffic. The DPU is processing it, and the DPU's software is not under the control of the tenant.

This is the argument that is changing enterprise architecture decisions more than performance. Zero trust network architectures, which assume that any node in the network might be compromised and require cryptographic proof of identity and policy enforcement at every hop, are much more practically implementable when the enforcement point is a hardware device that the workload cannot influence. You can enforce microsegmentation policies in the DPU's packet processing pipeline, at line rate, with cryptographic guarantees, without requiring the host operating system to participate in their enforcement.

The Uncomfortable Truth About the NIC Arms Race

Here is the part the vendor brochures skip over. The hyperscalers built their own DPUs and deployed them at scale specifically because the available commercial options were not performant enough, not programmable enough, or not trustworthy enough to run as a security boundary in a multi-tenant environment. AWS built Nitro because it needed something that did not exist. Microsoft built AccelNet on FPGAs because no ASIC was programmable enough for its evolving SDN stack. Google collaborated on the Intel Mount Evans / E2000 IPU. Meta has its own NIC infrastructure.

The commercial DPU market (BlueField, Pensando, IPU E2100) is primarily serving enterprises and smaller cloud operators who cannot justify the investment in completely custom silicon. The hyperscalers have, for the most part, made their own decisions and are not waiting for the commercial market to catch up. This means the commercial DPU ecosystem is developing under different requirements from the ones that actually drive the technology, which is an awkward dynamic for anyone trying to understand where the technology is heading by watching the product announcements.

The other awkward reality is that "offloading to the DPU" is not magic. If you move a badly designed workload from the host CPU to the DPU's ARM cores, you get the same badly designed workload running more slowly on a less powerful CPU. The DPU's value comes from hardware accelerators for specific functions, not from adding general-purpose CPU capacity to your server. Organizations that treat DPUs as "extra CPUs inside the NIC" are going to be disappointed. Organizations that identify specific workloads that map to fixed-function hardware acceleration (VXLAN encapsulation, AES-GCM crypto, NVMe-oF initiator, RoCEv2 queue pair processing, OVS flow table lookup) and build their offload strategy around those workloads will find genuine value.

The pattern that works in practice is what cloud providers have been doing: build a catalog of infrastructure functions that have stable, well-understood semantics, implement them in hardware accelerators, and handle the rest in software on the ARM cores. The accelerators do the high-throughput work. The ARM cores do exception handling, control plane updates, and anything that requires actual programmability. The host CPU does nothing infrastructure-related at all.
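The split between accelerators and ARM cores can be sketched as a fast path backed by a slow path. The toy model below is an assumption-laden illustration, not any vendor's SDK: `OffloadPipeline`, its counters, and the `policy` callback are hypothetical names, the dictionary stands in for a hardware flow table, and the `policy` function stands in for software running on the DPU's ARM cores.

```python
class OffloadPipeline:
    """Toy model of a fast path (flow table) with a slow path (software)."""

    def __init__(self, policy):
        self.policy = policy     # slow-path decision function ("ARM cores")
        self.flow_table = {}     # stands in for the hardware flow table
        self.fast_hits = 0       # packets handled without touching software
        self.slow_misses = 0     # packets punted to the slow path

    def process(self, flow_key):
        """Return the action for a packet identified by flow_key."""
        action = self.flow_table.get(flow_key)
        if action is not None:
            self.fast_hits += 1          # fast path: table hit
            return action
        self.slow_misses += 1            # exception: no entry yet
        action = self.policy(flow_key)   # software decides once...
        self.flow_table[flow_key] = action  # ...and installs the result
        return action


def example_policy(flow_key):
    # Hypothetical rule: forward HTTPS, drop everything else.
    _src, _dst, dst_port = flow_key
    return "forward" if dst_port == 443 else "drop"


pipeline = OffloadPipeline(example_policy)
for _ in range(3):
    pipeline.process(("10.0.0.5", "10.0.1.7", 443))
```

After the loop, only the first packet of the flow has reached the software path; the other two were answered from the table. That ratio is the whole economic argument: software sees one packet per flow, and the accelerator sees the rest.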

The Third Computer in Your Server

We are approaching the point where the conventional model of what a server is will need to be updated.

Historically: a server is a CPU with memory, some storage, and a NIC that moves packets to and from the network. The CPU makes all decisions.

Currently, in the hyperscaler model: a server is a CPU with memory and some storage; a DPU that runs the network and storage virtualization stack, manages encryption keys, enforces security policy, and presents clean interfaces to the CPU; and a network chip that handles the physical data transmission. The CPU makes application decisions. The DPU makes infrastructure decisions. The network chip moves bits.

In AI training clusters, there is a fourth device: an AI accelerator NIC that runs in-network compute, performing gradient aggregation or reduction as data passes through the fabric. The NIC is doing math. The DPU is doing policy. The CPU is doing coordination. The GPU is doing everything that actually matters.
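What "the NIC is doing math" means can be shown with a very small sketch. The function below models an aggregating device that sums gradient vectors from several workers and hands every worker the same result. It is a conceptual stand-in only: `in_network_allreduce` is a hypothetical name, and real in-network reduction operates on packetized chunks in hardware rather than on Python lists.

```python
def in_network_allreduce(worker_grads):
    """Sum equal-length gradient vectors element-wise and return one
    copy of the summed vector per worker, as an aggregating device in
    the fabric might do while the data passes through it."""
    if not worker_grads:
        return []
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all gradient vectors must have the same length")
    # zip(*...) groups the i-th element from every worker together.
    total = [sum(values) for values in zip(*worker_grads)]
    # Each worker receives its own copy of the reduced vector.
    return [list(total) for _ in worker_grads]


# Three hypothetical workers, each contributing a two-element gradient.
reduced = in_network_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Without aggregation in the fabric, the same sum requires the workers to exchange their vectors among themselves over several steps. With it, each worker sends its vector once and receives the reduced result once.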

Thirty years of arguing about how much to put on the NIC has produced the answer: most of the infrastructure stack. The kernel was right to reject the TOE NICs in 2005, because those NICs were fixed-function black boxes. The DPU is not a black box. It runs Linux. It has an SDK. You can write your own software for it. The kernel developers' concern about accountability and updateability is addressed by the fact that you can update the DPU's operating system the same way you update any other Linux system, with a signed package through an authenticated management channel. The answer to the protocol ossification problem turned out to be: put a general-purpose computer in the NIC, run open software on it, and use hardware accelerators for the parts where the performance requirements exceed what software can deliver.

This was always the correct answer. It just took thirty years, two failed attempts, and the economics of hyperscale cloud computing to make it happen.

The NIC has a computer inside it. The computer is the point.

Welcome to the data center infrastructure stack that has been quietly building itself into your PCIe slot for the last decade.
