Three Types of Silicon Walk Into a Router: Why Your Packets Need a CPU, an ASIC, and Maybe an FPGA to Get Anywhere
When you crack open a modern enterprise router (warranty be damned), you'll find something that looks suspiciously like overkill: a general-purpose CPU running at several gigahertz, one or more massive specialized ASICs (Application-Specific Integrated Circuits) that dwarf the CPU in size, possibly an FPGA (Field-Programmable Gate Array) lurking in there somewhere, and a bewildering array of memory chips with wildly different purposes and price tags. This isn't engineering excess or some vendor's scheme to empty your budget (well, not entirely). It's the inevitable result of physics, economics, and the brutal reality that moving packets at 400 gigabits per second while also running BGP is physically impossible with any single type of chip.
Understanding why requires starting at the bottom, with transistors and the fundamental physics of how fast electricity can actually move through silicon. There's an iron triangle in chip design: performance, flexibility, and cost. You can optimize for any two, but the universe won't let you have all three. CPUs choose flexibility and reasonable economics but sacrifice speed. ASICs go all-in on speed and efficiency but give up any ability to change their minds. FPGAs split the difference, offering reconfigurability at the cost of being slower than ASICs and more expensive per unit than mass-produced chips.
Let's start with a single transistor and work our way up to understanding why your switch needs three different kinds of silicon just to forward your cat videos at wire speed.
The Fundamental Building Block: Transistors and the Gates They Build
At the absolute foundation of every processor, regardless of whether it's a flexible CPU or a specialized ASIC, sits the humble transistor. Modern chips use MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors), and specifically CMOS (Complementary Metal-Oxide-Semiconductor) technology; cutting-edge switching ASICs are now built on 3nm process nodes. CMOS pairs two types of transistors, NMOS and PMOS, to create logic gates that sip power elegantly instead of guzzling it like earlier technologies.
NMOS and PMOS: The Yin and Yang of CMOS
Think of transistors as voltage-controlled switches. Apply the right voltage to the gate terminal, and you create an electric field that either allows or blocks current flow between the source and drain terminals.
NMOS (N-channel MOSFET): This transistor turns ON when you apply a high voltage (typically called logic 1 or VDD, around 0.7-1.0V in modern processes) to its gate. When the gate is high, current flows from drain to source. When the gate is low (0V, or ground), the transistor is OFF and blocks current. NMOS transistors are good at passing logic 0 (ground) and pulling outputs LOW.
PMOS (P-channel MOSFET): This is the opposite. PMOS turns ON when you apply a LOW voltage (ground) to its gate. When the gate is low, current flows. When the gate is high, the transistor is OFF. PMOS transistors are good at passing logic 1 (VDD) and pulling outputs HIGH.
The brilliance of CMOS is using both types together. By pairing PMOS and NMOS transistors, you create logic gates where one path is always OFF. This means current only flows briefly during switching (when both transistors might be partially on), not continuously. The result is dramatically lower power consumption compared to older technologies where current flowed constantly.
From Transistors to Logic Gates
Individual transistors don't compute anything useful. You need to combine them into logic gates, which are the basic building blocks that perform Boolean operations (AND, OR, NOT, etc.).
The NAND Gate: Your New Best Friend
The NAND gate is the fundamental building block of modern digital logic. Why NAND? Because it's "universal," you can build any other logic function from NAND gates alone. It's also one of the simplest gates to build efficiently in CMOS.
A NAND gate implements the function "NOT AND," which means its output is LOW (0) only when ALL inputs are HIGH (1). For any other input combination, the output is HIGH (1). Here's the truth table for a 2-input NAND:
Input A  Input B  Output
0        0        1
0        1        1
1        0        1
1        1        0   (only case where output is 0)
In CMOS, a 2-input NAND gate uses exactly 4 transistors: 2 PMOS in parallel (connecting output to VDD) and 2 NMOS in series (connecting output to ground). When both inputs are HIGH, both NMOS transistors turn ON, pulling the output LOW. For any other combination, at least one PMOS is ON and pulls the output HIGH.
Why does this matter? Because every digital function you can imagine (adding numbers, comparing values, storing data) is built from combinations of NAND gates and their cousins. An AND gate is just a NAND followed by an inverter (NOT gate). An OR gate can be built from NANDs using De Morgan's laws. Everything builds up from here.
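To make that universality concrete, here's a tiny Python sketch (purely illustrative, not how chips are actually modeled) that defines a 2-input NAND and derives NOT, AND, and OR from it using exactly the De Morgan construction described above:

```python
def nand(a: int, b: int) -> int:
    """2-input NAND: output is 0 only when both inputs are 1."""
    return 0 if (a == 1 and b == 1) else 1

def not_(a: int) -> int:
    # An inverter is just a NAND with both inputs tied together.
    return nand(a, a)

def and_(a: int, b: int) -> int:
    # AND is NAND followed by an inverter.
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    # De Morgan: A OR B == NAND(NOT A, NOT B).
    return nand(not_(a), not_(b))

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "NAND:", nand(a, b), "AND:", and_(a, b), "OR:", or_(a, b))
```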
The Inverter: Simplest but Essential
The inverter (NOT gate) is even simpler: just 2 transistors (1 PMOS, 1 NMOS). When the input is HIGH, NMOS turns ON and PMOS turns OFF, pulling output LOW. When input is LOW, PMOS turns ON and NMOS turns OFF, pulling output HIGH. Input high makes output low, input low makes output high. Despite its simplicity, inverters appear everywhere in chips because you constantly need to flip signals.
Capacitance: The Hidden Enemy of Speed
Here's where we hit our first physics problem. Every transistor, every wire, every connection on a chip has capacitance. Understanding capacitance is critical to understanding why some chips are faster than others.
What is Capacitance?
Capacitance is the ability of a structure to store electrical charge. Any two conductors separated by an insulator (like a transistor gate and the channel below it, separated by gate oxide) form a capacitor. The capacitance (measured in farads, though we're dealing with picofarads or femtofarads at chip scale) tells you how much charge must be moved to change the voltage.
The equation is simple but profound: Q = C × V, where Q is charge, C is capacitance, and V is voltage.
Why Capacitance Kills Speed
To switch a signal from 0V to 1V (say, 0.8V in a modern process), you must move charge Q = C × 0.8V into the capacitance. This charge has to come from somewhere, through transistors that can only supply a limited current. The time it takes is roughly t = C × V / I, where I is the current the driving transistor can supply.
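As a back-of-the-envelope check of that t = C × V / I relationship, here's a short Python sketch; the capacitance, voltage swing, and drive current are illustrative guesses, not numbers for any real process:

```python
# Rough switching-time estimate: t = C * V / I.
# All three values are illustrative assumptions, not process data.
C = 1e-15    # 1 femtofarad of load capacitance
V = 0.8      # 0.8 V voltage swing
I = 20e-6    # 20 microamps of drive current

t = C * V / I
print(f"Time to charge the node: {t * 1e12:.0f} ps")  # ~40 ps for these numbers
```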
Here's the brutal reality: capacitance is everywhere. The gate of a transistor has capacitance (between the gate metal and the silicon channel). Wires connecting gates have capacitance (to the substrate and to other wires). Even the drain and source terminals have capacitance. Every switching event requires moving charge into or out of these capacitances.
Moving Electrons Takes Time
When you apply voltage to a transistor gate, electrons don't instantly appear in the channel. They have to physically move through the silicon. The speed at which electrons move (drift velocity) is limited by their interactions with the silicon crystal lattice. They scatter off atoms, defects, and thermal vibrations.
In modern silicon, electron drift velocity saturates at around 10^7 cm/s (100,000 meters per second, or about 100 km/s) under high electric fields. This sounds fast until you realize it's only about 0.033% of the speed of light. Light in a vacuum travels at 3 × 10^8 m/s (300,000 km/s). So electrons in silicon drift about 3,000 times slower than light.
At 5nm and 3nm process nodes (marketing names at this point, not actual 5 or 3 nanometer feature dimensions), distances are incredibly small. An electron at saturation velocity crosses a 5nm gate in roughly 5 × 10^-9 m / 10^5 m/s = 50 femtoseconds. But signals often need to traverse dozens of gates and millimeters of wiring. A 1mm path at 10^5 m/s would take 10 nanoseconds to traverse (signals actually propagate through wires far faster than electrons drift, but once you add the charging of every wire's capacitance along the way, we're still talking multiple gate delays).
This is why chips are clocked in gigahertz (billions of cycles per second) not terahertz (trillions). Physics sets hard limits.
The Critical Path: Your Clock Speed Bottleneck
When a signal propagates through logic, it passes through multiple gate delays. A complex arithmetic operation might chain 20, 50, or 100 gates together. Each gate adds delay, typically 10-50 picoseconds in modern processes.
The critical path is the longest chain of gates any signal must traverse in one clock cycle. If your critical path takes 1 nanosecond, you can't clock faster than 1 GHz (1 / 1ns = 1 GHz). Add pipelining (breaking work into stages separated by registers), and you can increase throughput, but each individual operation still takes multiple cycles.
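The arithmetic is worth seeing once. In this hedged sketch, the per-gate delay, logic depth, and stage count are made-up but plausible values; the point is that maximum clock frequency is just the reciprocal of the critical-path delay:

```python
# Critical path -> maximum clock frequency (illustrative numbers).
gate_delay_ps = 20     # assume 20 ps per gate
logic_depth = 50       # gates on the longest path

critical_path_ns = gate_delay_ps * logic_depth / 1000
print(f"Unpipelined: {critical_path_ns:.1f} ns critical path "
      f"-> {1 / critical_path_ns:.1f} GHz max clock")

# Split the same logic into 5 balanced stages (ignoring register overhead).
stages = 5
per_stage_ns = critical_path_ns / stages
print(f"Pipelined x{stages}: {per_stage_ns:.2f} ns/stage "
      f"-> {1 / per_stage_ns:.1f} GHz max clock")
```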
This is why processor architects obsess over minimizing logic depth. Every gate in the critical path is an enemy of clock speed. And this is where the CPU vs ASIC difference starts to emerge.
CPUs: The Flexible Generalists (and Why They're Slow at Packet Forwarding)
Central Processing Units are designed to do anything. They fetch instructions from memory, decode what those instructions mean, execute them using general-purpose arithmetic units, and store results. This flexibility is incredible, you can run Python, compile C++, render video, or forward packets, all with the same hardware. But flexibility comes at a steep performance cost for specialized tasks.
The CPU Architecture: Fetch, Decode, Execute, Writeback
Modern CPUs follow the Von Neumann architecture (or Harvard architecture with separate instruction and data paths, but conceptually similar). The cycle is:
- Fetch: Grab the next instruction from memory (typically L1 instruction cache)
- Decode: Figure out what the instruction means (is it an ADD? A LOAD? A conditional branch?)
- Execute: Perform the operation (ALU for arithmetic, load/store units for memory, FPU for floating point)
- Writeback: Store the result in a register or memory
This fetch-decode-execute cycle is the fundamental bottleneck. Every operation, no matter how simple, requires these steps.
Pipelining: Doing Multiple Things at Once (Sort Of)
To improve throughput, CPUs pipeline instruction execution. While one instruction is executing, the next is decoding, and the one after that is being fetched. Modern high-performance CPUs have pipelines 15-20 stages deep (Intel, AMD) or even deeper in some specialized designs.
Imagine an assembly line: one station fetches instructions, another decodes, another executes, another writes back. Each clock cycle, every station completes its work and passes results to the next stage. This increases throughput (instructions completed per cycle) even though latency for any single instruction doesn't improve.
But pipelines have a fatal flaw: branches.
Branch Prediction: Guessing the Future
Code is full of conditional branches (if statements, loops). When the CPU encounters a branch, it doesn't know which path to take until the branch condition is evaluated, which might be several stages into the pipeline. If the CPU waits, the pipeline stalls and throughput collapses.
Solution: guess. Modern CPUs have incredibly sophisticated branch predictors that analyze instruction history, detect patterns, and predict which way branches will go. They're right 95-98% of the time in typical code. But when they're wrong, the entire pipeline must be flushed, wasting 15-20 cycles. At 3 GHz, that's 5-7 nanoseconds per misprediction, and network code is full of branches (protocol checks, ACL matching, queue management).
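To see what those misses cost on average, here's a quick sketch; the branch density, prediction accuracy, and flush penalty are illustrative assumptions, not measurements:

```python
# Average cost of branch mispredictions (illustrative assumptions).
accuracy = 0.97          # predictor is right 97% of the time
flush_penalty = 18       # cycles to refill the pipeline after a miss
branch_fraction = 0.2    # roughly 1 in 5 instructions is a branch

penalty_per_branch = (1 - accuracy) * flush_penalty
penalty_per_instruction = branch_fraction * penalty_per_branch
print(f"{penalty_per_branch:.2f} cycles lost per branch on average")
print(f"{penalty_per_instruction:.3f} extra cycles per instruction")
# With a base CPI of 1.0, that's roughly a 10% throughput hit from mispredictions alone.
```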
Superscalar Execution: Multiple Arithmetic Units
High-performance CPUs don't just have one execution unit. They have multiple ALUs (Arithmetic Logic Units), FPUs (Floating Point Units), and load/store units working in parallel.
ALUs (Arithmetic Logic Units): These handle integer arithmetic (add, subtract, AND, OR, XOR, shifts). A modern CPU might have 4-6 ALUs capable of executing simple integer operations in a single cycle. Complex operations (like division) take multiple cycles.
FPUs (Floating Point Units): These handle floating-point arithmetic (add, multiply, divide for numbers with decimal points). Floating point is much more complex than integer math, FP addition requires aligning exponents, adding mantissas, and normalizing results. Modern FPUs can still complete one operation per cycle through deep pipelining, but latency is higher. Network code rarely uses floating point (we're dealing with integers), so FPUs mostly sit idle in routers.
Load/Store Units: These handle memory operations, reading from or writing to cache/memory. Modern CPUs have 2-3 load/store units that can execute memory operations in parallel (assuming no dependencies or address conflicts).
The CPU analyzes instruction streams and, when it finds independent instructions (operations that don't depend on each other's results), issues them to multiple execution units simultaneously. This is Instruction-Level Parallelism (ILP).
Out-of-Order Execution: The Complexity Nobody Wants to Think About
Here's where CPUs get truly complicated. To maximize ILP and keep execution units busy, high-performance CPUs reorder instructions dynamically, executing later independent instructions while earlier dependent instructions wait for data.
The Problem: Consider this instruction sequence:
1. LOAD R1, [memory address]   // Slow, might miss cache
2. ADD  R2, R1, R3             // Depends on R1 from instruction 1
3. MUL  R4, R5, R6             // Independent, doesn't need R1
In a simple in-order CPU, instruction 3 must wait for instruction 2 to complete, which waits for instruction 1's slow memory load. The multiply unit sits idle even though it could work.
Out-of-Order Solution: The CPU tracks dependencies and executes instruction 3 while instructions 1-2 are still in flight. When instruction 1 completes, instruction 2 executes immediately. The result is higher throughput, all execution units stay busy.
The Complexity:
Register Renaming: The CPU has a limited number of architectural registers (x86 has 16 general-purpose registers, ARM has 31). But instructions might have false dependencies (two instructions using the same register for unrelated data). Register renaming maps architectural registers to a larger pool of physical registers (100-200 in modern designs), eliminating false dependencies.
Reservation Stations: When instructions are decoded, they're placed in reservation stations (queues waiting for operands). Each reservation station monitors when its operands become available and issues the instruction to an execution unit as soon as all operands are ready.
Reorder Buffer (ROB): To maintain the illusion of in-order execution (important for exceptions and precise interrupts), completed instructions are placed in a reorder buffer in program order. Results are only committed (made permanent) when all earlier instructions have completed. If an exception occurs (divide by zero, page fault), the CPU can discard all instructions after the exception and restart from there.
Load/Store Queues: Memory operations are particularly tricky. Loads and stores to the same address must execute in program order, but loads from different addresses can execute in parallel. The load/store queue tracks memory dependencies and enforces ordering.
All this complexity requires massive amounts of silicon. The execution core of a modern CPU might use millions of transistors just for dependency tracking, renaming, and reordering logic. These transistors don't compute anything, they're pure overhead for flexibility.
Cache Hierarchy: Hiding Memory Latency (Mostly)
Memory access is devastatingly slow compared to CPU speed. A load from DRAM might take 200-400 CPU cycles. At 3 GHz, that's 67-133 nanoseconds, during which the CPU could have executed thousands of instructions.
CPUs use multiple cache levels to hide this latency:
L1 Cache: Tiny (32-64 KB per core) but blazingly fast (4-5 cycle access, ~1.5ns). Split into instruction and data caches.
L2 Cache: Medium (256 KB - 1 MB per core) and still fast (12-15 cycles, ~4-5ns).
L3 Cache: Large (8-64 MB shared across cores) but slower (40-70 cycles, ~15-25ns).
DRAM: Huge (gigabytes) but agonizingly slow (200-400 cycles, ~70-130ns).
Cache hits are great. Cache misses are disasters. And routing table lookups don't fit in L1 cache, meaning every lookup risks an expensive miss.
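A useful way to quantify this is average memory access time (AMAT). The sketch below reuses the latencies quoted above; the hit-rate splits are assumptions chosen to contrast ordinary code with a lookup-heavy workload:

```python
# Average memory access time (AMAT) through the hierarchy.
# Latencies in ns roughly follow the figures above; hit fractions are assumptions.
l1, l2, l3, dram = 1.5, 4.5, 20.0, 100.0

def amat(f1, f2, f3):
    f_dram = 1 - (f1 + f2 + f3)          # whatever's left goes all the way to DRAM
    return f1 * l1 + f2 * l2 + f3 * l3 + f_dram * dram

print(f"Typical code:      {amat(0.90, 0.06, 0.03):.1f} ns")   # ~3 ns
print(f"Lookup-heavy code: {amat(0.10, 0.40, 0.40):.1f} ns")   # ~20 ns
```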
Why CPUs Are Terrible at Packet Processing
Consider forwarding a single packet through a router:
- Instruction Overhead: Even highly optimized code requires 50-100 instructions per packet (parse Ethernet and IP headers, lookup routing table, check ACLs, modify TTL and checksums, determine egress queue). At 3 GHz, even if every instruction completes in 1 cycle (they don't), that's 16-33 nanoseconds per packet.
- Cache Misses: Routing tables don't fit in L1 cache (32-64 KB). A lookup likely misses to L2 or L3, costing 20-70 cycles (7-25ns). If it misses to DRAM (400 cycles, 130ns), you're down to 7.7 million packets per second per core.
- Branch Mispredictions: Network code is branchy (protocol type checks, TTL checks, ACL matching). Even at 98% prediction accuracy, a packet's worth of branches still produces a misprediction every few packets; at roughly 20 cycles per pipeline flush, that averages out to several wasted cycles (a few nanoseconds) per packet.
- Serial Processing: A CPU core processes one (or a few) packets at a time. There's massive repetition (every packet runs the same code), but CPUs can't exploit it efficiently.
Let's calculate packet forwarding capacity:
Best case scenario: 3 GHz CPU, perfect IPC (instructions per cycle) of 1, zero cache misses, zero branch mispredictions, 50 instructions per packet.
50 instructions / 3 GHz = 16.7 nanoseconds per packet = 60 million packets per second (60 Mpps)
Realistic scenario: Cache misses add 30ns, branch mispredictions add 5ns, IPC is 0.7 (only 70% of cycles do useful work):
(16.7ns / 0.7 + 30ns + 5ns) = 58ns per packet = 17.2 Mpps
At 64-byte packets (minimum Ethernet frame size), 17.2 Mpps × 64 bytes × 8 bits/byte = 8.8 Gbps.
That's decent, but nowhere near 100 Gbps, let alone 400 Gbps or 800 Gbps.
Now let's calculate what wire speed means for a 102.4 Tbps switch:
102.4 Tbps = 102,400 Gbps (the previous generation topped out at 51.2 Tbps). At minimum packet size (64 bytes, ignoring the extra 20 bytes of preamble and inter-frame gap each frame carries on the wire):
64 bytes × 8 bits/byte = 512 bits per packet
102,400 Gbps / 512 bits per packet = 200 billion packets per second (200,000 Mpps)
If each CPU core can handle 17.2 Mpps, you'd need 200,000 / 17.2 ≈ 11,600 CPU cores running flat out, perfectly load-balanced, with zero overhead.
That's not happening. Even with 100 cores (wildly unrealistic in a router), you'd only hit 1.72 billion packets per second, under 1% of what's needed. CPUs fundamentally can't scale to wire-speed packet processing at modern link speeds.
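All of the arithmetic above fits in a few lines; this sketch simply reproduces the worked numbers from the text (the instruction count, penalties, and IPC are the same assumptions used above):

```python
# Reproduce the packet-forwarding arithmetic from the text.
clock_hz = 3e9
instructions_per_packet = 50

# Best case: one instruction per cycle, no stalls.
best_ns = instructions_per_packet / clock_hz * 1e9
print(f"Best case:  {best_ns:.1f} ns/packet -> {1e3 / best_ns:.0f} Mpps")

# Realistic case: IPC of 0.7, plus cache-miss and misprediction penalties.
real_ns = best_ns / 0.7 + 30 + 5
mpps = 1e3 / real_ns
gbps = mpps * 512 / 1e3                      # 64-byte packets = 512 bits
print(f"Realistic:  {real_ns:.0f} ns/packet -> {mpps:.1f} Mpps ({gbps:.1f} Gbps)")

# What wire speed demands of a 102.4 Tbps switch at 64-byte packets.
wire_pps = 102.4e12 / 512
print(f"Wire speed: {wire_pps / 1e9:.0f} Gpps "
      f"-> {wire_pps / (mpps * 1e6):,.0f} cores at {mpps:.1f} Mpps each")
```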
What CPUs Are Actually Good For
CPUs excel at:
- Control Plane Operations: Running routing protocols (BGP, OSPF, IS-IS), managing configuration, handling exceptions. These don't need wire speed, they need flexibility.
- Complex Policy: Deep packet inspection, application-aware routing, complex ACLs with thousands of rules and regex pattern matching.
- Flexibility: Software updates add new features instantly. New protocol support is a code change, not a hardware redesign.
- Exception Handling: Packets destined to the router itself (SSH, SNMP), TTL expired (ICMP time exceeded), or requiring special processing.
This is why every router still has a CPU. It's not for forwarding packets, it's for managing the equipment that does.
ASICs: Purpose-Built Silicon Where Every Transistor Has One Job
If you know exactly what computation you need to perform, and it won't change, you can design silicon specifically for that task. No instruction fetch, no decode, no general-purpose overhead. Just gates arranged in precisely the pattern needed for your function. The result is an ASIC, and for networking, these are the switching chips or merchant silicon that actually forward your packets.
How ASICs Are Fundamentally Different
An ASIC is a custom chip where logic is permanently etched into silicon during manufacturing. The transistors are arranged in exactly the pattern needed for packet forwarding. There's no instruction fetch, no decode, no branch prediction, because there are no instructions. The logic is directly implemented in gates.
For packet forwarding, a switching ASIC might contain:
Packet Parser: Dedicated combinational logic that identifies protocol headers (Ethernet, VLAN tags, IP, TCP/UDP) and extracts fields in parallel. This isn't software reading bytes one at a time, it's hardware examining many bytes simultaneously using gates specifically arranged to recognize header patterns.
Lookup Engine: Custom hardware for table lookups. Instead of software sequentially searching a routing table, ASICs use specialized search structures: hash tables with multiple parallel lookup paths, algorithmic tries (tree structures walking prefixes), or TCAMs (more on this later) that search thousands of entries in parallel.
Packet Modification: Dedicated logic for common modifications. Need to decrement TTL? There's a gate network specifically for subtracting one and updating the header checksum. Rewrite MAC addresses? Dedicated hardware. Add VLAN tags? Dedicated hardware. Each operation has its own custom gate arrangement.
Queueing and Scheduling: Hardware queues with sophisticated scheduling algorithms (weighted fair queueing, deficit round robin) implemented directly in gates, not software loops.
Forwarding Pipeline: All these stages operate as a pipeline. While one packet is being parsed, the previous packet is being looked up, and the one before that is being modified. Packets flow through continuously.
The Physics of Why ASICs Are Fast
ASICs achieve absurd performance because:
No Instruction Overhead: There's no fetch-decode-execute cycle. Gates directly implement the function. What would take 50 CPU instructions happens in a handful of gate delays.
Massive Parallelism: ASICs can have hundreds of parallel datapaths processing packets simultaneously. While a CPU has 4-6 ALUs, an ASIC might have 64 independent packet parsers operating in parallel, each handling one port.
Deep Pipelining: A 20-stage pipeline processing packets continuously can achieve throughput equal to the slowest stage. If each stage takes 1ns and stages are balanced, throughput is 1 billion packets per second per pipeline. Add 64 parallel pipelines (for 64 ports), and you get 64 billion packets per second.
Optimized Datapaths: Every transistor is placed for the specific task. There are no general-purpose registers sitting idle, no instruction decoders consuming power, no branch predictors. Critical paths are minimized through careful timing analysis and physical design.
Lower Power Per Operation: Because there's no instruction fetch, cache, or speculation, power per operation is dramatically lower. Modern switching ASICs deliver terabits of throughput at a few hundred watts. Achieving the same with CPUs would require kilowatts or megawatts.
RTL Design: Describing Hardware at the Register Level
How do you actually design an ASIC? You don't draw individual gates (there are billions of them). Instead, you write RTL (Register Transfer Level) code describing how data moves between registers and what operations occur.
What is RTL?
RTL is a hardware description language (HDL) representation of digital logic. The most common languages are Verilog and VHDL. RTL describes digital circuits at an abstraction level where you specify:
Registers: Storage elements (flip-flops) that hold state between clock cycles
Combinational Logic: Gates that compute outputs from inputs (without memory)
Transfers: How data moves from registers through logic to other registers on each clock cycle
For example, RTL describing a simple adder might look like:
always @(posedge clk) begin
if (reset)
sum <= 0;
else
sum <= a + b; // Register sum gets a + b on each clock
end
This doesn't specify how to build an adder (carry-lookahead vs ripple carry), just that you want to add two numbers and store the result in a register every clock cycle.
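Before synthesis, RTL like this is usually checked against a simple software reference ("golden") model. Purely as a sketch, and not the flow of any particular team, a Python behavioral model of the registered adder above might look like:

```python
# Hypothetical software reference model of the registered adder above.
# Real verification runs RTL simulation against models like this (plus far more).
class RegisteredAdder:
    def __init__(self, width: int = 32):
        self.mask = (1 << width) - 1
        self.sum = 0                       # contents of the 'sum' register

    def clock(self, a: int, b: int, reset: bool = False) -> int:
        """One rising clock edge: sum <= 0 on reset, else sum <= a + b."""
        self.sum = 0 if reset else (a + b) & self.mask   # wrap like fixed-width hardware
        return self.sum

model = RegisteredAdder()
assert model.clock(0, 0, reset=True) == 0
assert model.clock(3, 4) == 7
assert model.clock(0xFFFFFFFF, 1) == 0     # 32-bit overflow wraps to zero
```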
From RTL to Silicon:
- RTL Design: Engineers write RTL code describing packet parsing, lookup engines, queueing, etc. This is where the ASIC's functionality is defined.
- Synthesis: Tools convert RTL into a gate-level netlist (a description of which gates connect to which). The tool chooses implementations (adders, multiplexers, memory structures) from a library of standard cells.
- Place and Route: Tools physically place millions of gates on the silicon die and route wires between them, minimizing wire length and meeting timing constraints (ensuring signals propagate fast enough for the target clock frequency).
- Verification: Simulation and formal verification ensure the design actually does what the RTL describes. This takes months, ASIC bugs are catastrophically expensive.
- Tape-out: The final design is sent to a fab (like TSMC) for manufacturing. Masks are created, and wafers are processed through dozens of lithography, deposition, and etching steps.
- Testing and Packaging: Dies are tested, packaged, and shipped.
From concept to shipping silicon takes 2-3 years and tens of millions of dollars.
The ASIC Trade-off: Fast but Frozen
This performance comes with severe limitations:
Development Cost: A modern switching ASIC costs $50-100 million to develop (NRE, Non-Recurring Engineering cost, the one-time expense before you manufacture any chips). This includes design, verification, mask sets, and first silicon. You need to sell millions of units to amortize this.
NRE (Non-Recurring Engineering) cost is the upfront investment required to design and manufacture a product before you can sell a single unit. For ASICs, this includes paying engineers for 2-3 years, buying EDA (Electronic Design Automation) tool licenses, mask costs ($5-10 million for advanced nodes), and prototype fabrication. Once you've taped out, manufacturing additional chips is relatively cheap (a few hundred dollars per chip in volume), but you need massive volumes to justify the NRE.
Inflexibility: Once manufactured, the logic is permanent. You can't add new protocol support or change algorithms. If IPv7 is invented tomorrow, your IPv4/IPv6 ASIC can't handle it. Though given how long it's been since IPv6 was ratified in 1998 and most people are still clinging to IPv4 like it's a security blanket, I think we're safe from IPv7 for at least another quarter century.
Long Lead Times: From initial concept to shipping products is 3+ years. By the time your ASIC ships, requirements may have evolved, and you're stuck with decisions made years earlier.
This is why ASICs dominate only where requirements are stable and volumes are enormous: Ethernet switching (standards don't change quickly), IP routing (IPv4/IPv6 have been stable for decades), and specialized applications like cryptocurrency mining (Bitcoin's SHA-256 hasn't changed since 2009).
GPUs: ASICs for Parallel Computation
Graphics Processing Units deserve mention as a specific type of ASIC that found unexpected success beyond graphics. GPUs are massively parallel processors optimized for the same operation on many data elements simultaneously (SIMD, Single Instruction Multiple Data).
A modern GPU has thousands of simple cores (CUDA cores in NVIDIA terms, or stream processors for AMD). Each core is much simpler than a CPU core: no complex branch prediction, small cache, simple pipeline, but you get thousands of them running in lockstep.
For graphics, this is perfect. Every pixel can be processed independently using the same shader program. Transform a million vertices? Run the same transformation on 1000 cores simultaneously.
For networking, GPUs are less directly applicable because packets have dependencies and state (flow tables, connection tracking), but they've found specific uses:
Deep Packet Inspection: Pattern matching across packet payloads parallelizes well across thousands of cores
Cryptography: Encryption/decryption operations for thousands of concurrent flows
Machine Learning: Traffic classification, anomaly detection, and DDoS mitigation increasingly use ML models that run beautifully on GPUs
NVIDIA's BlueField DPUs (Data Processing Units) combine ARM CPUs, network ASICs, and GPU-like acceleration engines specifically for offloading network and storage processing from server CPUs.
GPUs highlight a key point: ASICs exist on a spectrum of specialization. A GPU is more flexible than a switching ASIC (you can program shaders) but less flexible than a CPU. It's all about choosing your point on the performance-flexibility curve.
FPGAs: Reconfigurable Hardware That's Neither Fish Nor Fowl
Field-Programmable Gate Arrays occupy the awkward middle ground between CPUs and ASICs. They're chips where the logic can be reconfigured after manufacturing, even in the field (hence "field-programmable"). Instead of hardwired gates, FPGAs contain configurable logic blocks that you can program to implement arbitrary logic functions.
FPGA Architecture: The Reconfigurable Fabric
An FPGA is essentially a massive array of configurable building blocks connected by programmable routing. The key components:
Logic Blocks (CLBs, Configurable Logic Blocks): Each block contains several key elements:
LUTs (Lookup Tables): The heart of FPGA logic. A LUT is a small memory (typically 64 bits for a 6-input LUT) that implements any logic function of its inputs. Think of it as a truth table stored in SRAM. For a 4-input LUT, you have 2^4 = 16 possible input combinations, and the LUT stores a 16-bit table defining the output for each combination. To make a 4-input AND gate, you'd program the LUT to output 1 only when all 4 inputs are 1 (entry 15 = 1, all others = 0).
By loading different values into the LUT memory, you can reconfigure it to act as any gate combination: AND, OR, XOR, NAND, multiplexer, or even simple arithmetic. Six-input LUTs (64-bit tables) can implement functions like (A & B) | (C & D & E & F) in a single LUT.
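Since a LUT is literally a truth table in memory, it's easy to model. This hedged Python sketch programs a 4-input LUT first as an AND gate (only entry 15 set, as described above), then reprograms the same structure as XOR just by changing the stored bits:

```python
# A 4-input LUT modeled as a 16-entry truth table packed into an integer.
class LUT4:
    def __init__(self, table_bits: int):
        self.table = table_bits                  # bit i = output for input pattern i

    def eval(self, a: int, b: int, c: int, d: int) -> int:
        index = (a << 3) | (b << 2) | (c << 1) | d
        return (self.table >> index) & 1

# Program as 4-input AND: only entry 15 (all inputs high) outputs 1.
and4 = LUT4(1 << 15)
assert and4.eval(1, 1, 1, 1) == 1
assert and4.eval(1, 1, 1, 0) == 0

# "Reconfigure" the same hardware as 4-input XOR by loading a different table.
xor_table = sum(1 << i for i in range(16) if bin(i).count("1") % 2 == 1)
xor4 = LUT4(xor_table)
assert xor4.eval(1, 0, 0, 0) == 1
assert xor4.eval(1, 1, 0, 0) == 0
```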
Flip-Flops: These are the memory elements in digital logic, storing one bit of state. A flip-flop captures the input signal on a clock edge (rising or falling) and holds that value until the next clock edge. Every register in an ASIC or CPU is built from flip-flops. In FPGAs, each CLB contains multiple flip-flops (typically 8-16) that can be used to store intermediate results, create registers, or build sequential logic.
Flip-flops are built from gates in a feedback configuration (usually using NAND or NOR gates), but in FPGAs they're provided as pre-built elements because they're so fundamental.
Multiplexers: A multiplexer (mux) is a digital switch that selects one of several inputs and passes it to the output based on a select signal. A 2:1 mux has two data inputs and one select input; if select=0, output=input0, if select=1, output=input1.
Multiplexers are critical in FPGAs because they're how you configure data paths. The programmable routing fabric is essentially a massive hierarchy of multiplexers that connect logic blocks together. By programming which mux inputs are selected, you create the connections your design needs.
Interconnect: A vast programmable routing network connects logic blocks. This is where FPGA magic and limitations both emerge. The interconnect consists of:
- Wire segments of various lengths (local, intermediate, global)
- Switch boxes with programmable multiplexers at intersections
- Connection blocks linking logic blocks to routing channels
Configuration data stored in SRAM determines which switches are closed, creating paths between logic blocks. This reconfigurability is the FPGA's superpower and its biggest performance bottleneck.
I/O Blocks: Configurable interfaces to the outside world supporting various voltage levels, signaling standards (LVDS, DDR, etc.), and protocols.
Specialized Hard Blocks: Modern FPGAs aren't purely configurable logic. They include hardened blocks for common functions:
- DSP Blocks: Fixed-function multipliers and accumulators for signal processing
- Block RAM: Embedded SRAM memory blocks (thousands to megabytes total)
- SerDes: High-speed serializer/deserializers for multi-gigabit links
- Embedded Processors: Some FPGAs include full ARM CPUs (Xilinx Zynq, Intel Agilex) on the same die
These hard blocks are much faster and more power-efficient than implementing the same functions in programmable logic.
Why FPGAs Are Slower Than ASICs (But Faster Than CPUs)
The configurability costs performance in several ways:
Routing Overhead: In an ASIC, wires connect gates directly via optimized metal layers. Signals travel minimal distances through carefully sized and buffered interconnect. In an FPGA, signals pass through programmable routing switches (multiplexers implemented with transistors and configuration SRAM). Each switch adds capacitance, resistance, and delay.
A signal crossing 10 routing switches might take 5-10x longer than a direct wire in an ASIC. This routing delay often dominates FPGA timing, not the LUT logic itself.
LUT Delays: A LUT is a small memory lookup, which is slower than dedicated combinational logic. A 6-input LUT (64-bit memory) takes longer to access than a 2-input NAND gate (4 transistors of hardwired logic). Additionally, building complex functions often requires cascading multiple LUTs, adding more delay.
Lower Clock Speeds: Due to routing delays and LUT overhead, FPGA designs typically run at 200-500 MHz, compared to 1-2 GHz for ASICs and 3-5 GHz for CPUs. Clock speed isn't everything (parallelism matters hugely), but it directly impacts latency. An operation that takes 5 clock cycles at 2 GHz (2.5ns) takes 10ns at 500 MHz.
Power Consumption: The programmable routing consumes significant power. Configuration SRAM leaks current. Routing switches have higher capacitance than direct wires. FPGAs use 2-5x more power than equivalent ASIC implementations of the same function.
Why FPGAs Still Crush CPUs for Specialized Tasks
Despite being slower than ASICs, FPGAs demolish CPUs for hardware-appropriate tasks:
Parallelism: Like ASICs, you can implement many parallel datapaths. Need 32 independent packet parsers? Configure 32 sets of logic blocks. Want 16 concurrent AES encryption engines? Build them. This parallelism is impossible in software without massive multi-core systems.
Custom Datapaths: You directly implement the logic you need. Parsing a custom protocol is a few hundred LUTs arranged to recognize your header format, running at hardware speeds. In software, it's a loop reading bytes, checking conditions, making decisions, all at instruction-fetch speeds.
Low Latency: No instruction fetch means sub-microsecond latencies are achievable for simple operations. CPUs struggle to get below 10-100 microseconds for complex tasks due to instruction overhead, cache misses, and scheduling.
Determinism: FPGAs don't have cache misses, branch mispredictions, interrupts, or operating system scheduling. Latency is deterministic: the same operation takes the same time every time, which is critical for real-time systems (industrial control, high-frequency trading, telecommunications).
The FPGA Economic Equation
FPGAs occupy a strange middle ground economically:
- Slower than ASICs: 2-10x lower clock speeds, 2-5x higher power consumption
- Faster than CPUs: 10-100x speedup for parallel operations, 10-1000x lower latency for streaming tasks
- More flexible than ASICs: Reconfigurable in seconds/minutes, not the years required for ASIC redesign
- Less flexible than CPUs: Requires hardware design skills (Verilog/VHDL, timing analysis, not just C/Python)
- Expensive per unit: Larger die sizes due to configuration overhead, lower volumes than ASICs
- Lower NRE than ASICs: No mask costs ($5-10M saved), but expensive EDA tool licenses ($100K+/year)
FPGAs make economic sense for:
- Low-to-medium volumes (1,000-100,000 units) where ASIC NRE isn't justified
- Evolving standards where hardware updates are needed (5G/6G infrastructure, rapidly changing protocols)
- Specialized applications where commercial ASICs don't exist (custom cryptography, proprietary protocols)
- ASIC prototyping: Before taping out a $50M ASIC, validate the design on FPGA
In networking, FPGAs find specific niches:
- High-frequency trading (sub-microsecond latency requirements, custom protocols)
- Network security appliances (custom DPI engines, pattern matching hardware)
- Telecom equipment (specialized protocol handling, custom framing)
- Research and development (experimenting with new protocols, SDN hardware accelerators)
The Physics of the Trade-off: Why You Really Can't Have It All
The fundamental trade-off between performance, flexibility, and cost isn't arbitrary engineering pessimism. It's rooted in immutable physics and brutal economics.
The Transistor Density Problem
Modern chips pack absurd numbers of transistors. The Apple M4 Max integrates an estimated 90-plus billion transistors on TSMC's N3E (3nm-class) process on a single die. NVIDIA's Blackwell B300 GPU packs 208 billion transistors across its dual-die design on TSMC's 4NP process (a 5nm-class node, not true 3nm). These are staggering numbers, but you can't use all of them simultaneously.
Power Density: Modern process nodes (5nm, 3nm) have such high transistor density that powering all transistors at once would literally melt the chip. Power density (watts per square millimeter) has reached levels where heat removal is the limiting factor, not transistor count. This is the "dark silicon" problem: large portions of the chip must be powered down at any given time.
Processors manage this by:
- Clock gating (turning off clock signals to idle blocks)
- Power gating (completely shutting down voltage to unused sections)
- Dynamic frequency scaling (slowing clock speed when load is light)
Interconnect Limits: At advanced nodes, wire resistance and capacitance dominate delay more than gate delay. Global interconnects (wires spanning the chip, tens of millimeters) can take 10-20 gate delays to traverse. This limits how tightly you can couple distant logic and forces designers to partition designs into localized blocks.
Heat Removal: Chips are fundamentally limited by thermal dissipation. High-performance CPUs hit 250W thermal design power (TDP). Higher power requires exotic cooling (vapor chambers, liquid cooling), and there are hard limits. ASICs spread power across larger die areas and lower clock speeds to stay within thermal budgets.
The Flexibility Overhead Tax
General-purpose CPUs pay massive overheads for flexibility. Consider a simple 32-bit addition:
In an ASIC, a 32-bit carry-lookahead adder might use on the order of a thousand transistors to compute A + B = C in a few gate delays.
In a CPU, the same addition requires:
- Instruction fetch: Thousands of transistors for I-cache, program counter, fetch logic (5-10K transistors)
- Instruction decode: Hundreds of transistors to identify instruction type, source/destination registers (1-2K transistors)
- Register file read: Thousands of transistors for register file and bypass networks (5-10K transistors)
- Execute: The actual adder (~1,000 transistors)
- Writeback: More register file logic (2-3K transistors)
- Out-of-order infrastructure: Reservation stations, reorder buffer, register renaming (50-100K transistors)
Total: 60,000-125,000 transistors supporting roughly a thousand transistors of actual arithmetic. That's around 99% overhead for flexibility.
ASICs eliminate this overhead but become locked to their function. FPGAs keep much of the overhead (LUTs are memories, routing is programmable muxes requiring SRAM configuration) but gain reconfigurability.
The Design Complexity Wall
As chips grow more complex, design and verification costs explode super-linearly:
CPU Cores: A modern high-performance core (Intel P-core, AMD Zen 5, ARM Neoverse V) requires 200-500 person-years to design and verify. Intel, AMD, and ARM collectively spend billions annually on CPU core development.
ASICs: A switching ASIC requires 100-200 person-years and $50-100M total investment. Only companies with massive markets can justify this (Broadcom ships millions of switching chips annually, amortizing NRE over huge volumes).
FPGAs: The FPGA itself (the reconfigurable fabric) requires billion-dollar investments by Xilinx/AMD and Intel/Altera. But using an FPGA for a specific application requires 5-20 person-years, dramatically lower than custom ASICs.
This cost structure determines market realities:
- CPUs: Designed by a handful of companies (Intel, AMD, ARM, Apple, some Chinese vendors)
- ASICs: Designed for high-volume markets (switching, mining, AI training, smartphone SoCs)
- FPGAs: Serve mid-volume and specialized applications where ASICs aren't economical
Why Network Hardware Uses All Three Types of Silicon
Now we can finally understand why opening a modern enterprise router reveals what looks like a small data center's worth of specialized chips. Each silicon type serves a distinct, irreplaceable purpose dictated by physics and economics.
The ASIC: Fast Path Data Plane (Where Your Packets Actually Go)
The ASIC, often called the "switching chip," "forwarding ASIC," or "merchant silicon," handles the wire-speed data plane. Every packet that can be forwarded using standard processing flows through the ASIC without ever bothering the CPU:
- L2 switching: MAC address lookup (hash tables with 32K-256K entries), VLAN processing (tag insertion/removal)
- L3 routing: Longest-prefix match in routing tables (algorithmic TCAM or hash-based structures), TTL decrement, IP checksum recalculation
- ACLs: Five-tuple rule matching (source/dest IP, source/dest port, protocol) using TCAMs
- QoS: Traffic shaping (token bucket algorithms), policing (rate limiting), queue management (weighted fair queueing, WRED)
- Encapsulation: VXLAN overlay tunneling, GRE tunneling, MPLS label push/pop operations
The ASIC's entire existence is justified by one brutal requirement: wire speed. For a 48-port 25 Gbps switch, that's 1.2 Tbps total throughput. At minimum packet size (64-byte frames plus 20 bytes of preamble and inter-frame gap on the wire), that's roughly 1.8 billion packets per second that must be parsed, looked up, modified, queued, and forwarded without dropping a single one (assuming no congestion).
Modern switching ASICs achieve this through:
Deep Pipelines: 12-20 stages processing packets continuously. While packet 20 is being parsed, packet 19 is being looked up, packet 18 is being modified, packet 17 is being queued, etc. The pipeline never empties as long as traffic flows.
Parallel Datapaths: A 48-port switch doesn't have one pipeline processing all packets sequentially. It has 48 ingress pipelines (one per port) and 48 egress pipelines, all operating in parallel. Each pipeline processes its port's traffic independently.
Packet Buffering: On-chip SRAM provides 60-120 MB of buffering to absorb traffic bursts. When traffic arrives faster than egress can transmit (a 48-to-1 incast pattern, for example), packets queue in buffers rather than being dropped immediately.
Integrated SerDes: Serializer/deserializer logic for high-speed links. Modern switching ASICs support 112G SerDes (112 Gbps per lane, roughly 56 GBaud PAM4) as the production standard, with 224G SerDes (approximately 112 GBaud PAM4) entering sampling and early production in 2025; specifications were finalized in October 2024, with silicon samples from Synopsys, Credo, and Alphawave. PAM4 (4-level pulse amplitude modulation) is the dominant modulation scheme for all links of 50 Gbps and above. Current configurations pair 100G lanes for 800G links (8×100G) and 200G lanes for 1.6T ports (8×200G). Forward Error Correction (FEC), specifically KP4 RS(544,514), is mandatory for all PAM4 links.
The FPGA: Specialized Fast Path Extensions (When Standard Hardware Won't Cut It)
Some routers and switches include FPGAs for functions that need hardware speeds but aren't common enough for ASIC vendors to integrate, or where vendors want competitive differentiation:
Custom Encapsulation: Non-standard protocols or proprietary encapsulation formats. If you're building a service provider network with a custom encapsulation format (maybe for legacy equipment interoperability), the ASIC doesn't support it, but an FPGA can.
Inline Encryption: IPSec or MACsec at line rate beyond what the switching ASIC's built-in crypto engine can handle. A 100 Gbps link fully encrypted with AES-256-GCM requires processing 100 Gbps of crypto operations, which might exceed the ASIC's capabilities, but an FPGA with dedicated AES cores can handle it.
Traffic Analysis: Custom monitoring, telemetry collection, or security functions. Maybe you want to extract specific header fields from every packet and stream them to an analytics system. The ASIC doesn't have this feature, but an FPGA can inspect packets in parallel and extract whatever you need.
Protocol Translation: Converting between incompatible standards or bridging legacy equipment to modern networks. Got some old SONET/SDH equipment that needs to talk to modern Ethernet? FPGA to the rescue.
FPGAs also enable vendor differentiation. While everyone can buy the same Broadcom Tomahawk ASIC, adding FPGA-based features creates competitive advantages. Cisco, Arista, and Juniper all use similar merchant silicon but differentiate through software and, occasionally, FPGA-based hardware acceleration.
The CPU: Control Plane and Exception Handler (The Brain That Doesn't Touch Most Packets)
The CPU handles everything the ASIC can't or shouldn't handle. It's not forwarding packets (that would be disastrous at wire speed), it's managing the equipment:
Control Plane Protocols:
- BGP: Maintaining TCP sessions with peers, processing route updates, running best-path selection algorithms, applying policy (AS-path filtering, communities, local preference), managing state for 900K+ Internet routes
- OSPF/IS-IS: Maintaining link-state databases, running Dijkstra's algorithm for shortest-path computations, handling multi-area designs, flooding LSAs
- LLDP/LACP: Link-layer discovery protocol for topology mapping, link aggregation control protocol for managing port channels
- SNMP/NetFlow/sFlow: Management protocols, flow telemetry collection and export
- SSH/HTTPS: Management interfaces, CLI, web GUI
Exception Processing:
- Packets destined to the router itself: Any packet with the router's IP as destination gets "punted" to the CPU via a special internal port
- ARP/ND: Address resolution for IPv4 and IPv6 neighbor discovery, learning MAC addresses for next-hop IPs
- ICMP: Generating time exceeded messages (traceroute), destination unreachable, echo reply (ping responses)
- Routing protocol packets: BGP, OSPF, EIGRP traffic destined to the router
- Deep inspection beyond ASIC capabilities: Complex pattern matching, application-layer processing
Configuration and Management:
- Operating System: Running Linux (Arista EOS, Cumulus, SONiC) or proprietary OS (Cisco IOS-XE, Juniper Junos)
- Configuration database: Candidate vs running configuration, commit/rollback functionality
- Logging and monitoring: Syslog, local logs, alerting on events
- Software updates: Downloading and installing new OS versions, patching
The CPU doesn't need wire-speed performance because it only sees exception traffic, typically <1% of total packets. But it needs sufficient compute power to run complex protocols and maintain state for millions of flows. Modern network equipment uses multi-core CPUs: 4-16 cores is typical, running at 2-3 GHz. These are often ARM-based (power efficiency, lower cost) or x86 (compatibility with existing software stacks).
Inside the Merchant Silicon: Broadcom, Marvell, and NVIDIA
Let's examine what's actually inside these switching ASICs from the major vendors and why your network runs on their silicon whether you like it or not.
Broadcom: The 800-Pound Gorilla of Switching
Broadcom's Trident, Tomahawk, and Jericho families absolutely dominate the data center switching market. Multiple analyst estimates put Broadcom at 70-80% market share in data center switching silicon. Their success comes from aggressive pricing, feature integration, and ruthless execution.
Tomahawk Series: The Speed Demon
The Tomahawk family targets high-density data center switching where throughput is king. Tomahawk 5 (2023) delivers 51.2 Tbps of switching capacity, supporting configurations like 64 ports of 800 Gbps or 128 ports of 400 Gbps, and Tomahawk 6 (shipping in 2025) doubles that to 102.4 Tbps.
As of December 2025, the Tomahawk family has evolved significantly:
Tomahawk 6 (BCM78910): Shipping since June 2025, Tomahawk 6 represents a major architectural shift to chiplet design. It delivers 102.4 Tbps—double the 51.2 Tbps of Tomahawk 5—on TSMC's 3nm process node. This is the first chiplet-based Ethernet switch ASIC from Broadcom. Key specifications:
• Throughput: 102.4 Tbps
• Process: 3nm TSMC (chiplet architecture)
• SerDes: 200G PAM4 technology
• Port configurations: 64×1.6TbE, 128×800GbE, or 512×200GbE
• Ultra Ethernet Consortium (UEC) compliant
• Co-packaged optics variant (TH6-Davisson) shipping October 2025
Tomahawk Ultra (BCM78920): Launched in July 2025, this variant targets HPC and AI scale-up applications with ultra-low latency of just 250 nanoseconds. It maintains 51.2 Tbps throughput on a 5nm process and achieves 77 billion packets per second at 64-byte line rate. It's pin-compatible with Tomahawk 5 for easier upgrades.
Tomahawk 5 (BCM78900): The previous generation (2023) delivers 51.2 Tbps on a monolithic 5nm die—the last monolithic design before Broadcom's transition to chiplets. It supports 64 ports of 800 Gbps or 128 ports of 400 Gbps using 512 lanes of 100G PAM4 SerDes. The chip includes 6 ARM processor cores for packet processing assistance.
Inside a Tomahawk ASIC, the packet processing pipeline has roughly 12-14 stages:
Stage 1-2: Ingress Parsing
Dedicated parsing logic examines the packet headers, identifying:
- Ethernet frame (destination MAC, source MAC, Ethertype)
- VLAN tags (802.1Q single tag, QinQ double tag)
- IP header (version, protocol, TTL, source/dest addresses)
- L4 headers (TCP/UDP ports, ICMP type/code)
- Optional headers (MPLS labels, VXLAN headers)
The parser is state-machine based and can handle various protocol combinations. It extracts metadata into a packet descriptor (a 128-256 byte structure containing all relevant header fields), while the actual packet data goes to buffers.
Stage 3: Ingress ACL/Policy Lookup
The descriptor is matched against ingress ACLs implemented in TCAM. Rules might match on:
- Source/destination IP addresses (with prefix masking)
- L4 ports (with range matching)
- Protocol type
- VLAN IDs
- TCP flags
ACL results (permit/deny, QoS marking, mirror to monitoring port) are attached to the descriptor. Multiple TCAM lookups can happen in parallel (one for L2 ACLs, another for L3 ACLs).
Stage 4: L2 Lookup
MAC address table lookup using a hash table with 32K-256K entries (configurable). For known unicast destinations, this returns the egress port number. For unknown unicast or broadcast/multicast, it returns a port bitmap (flood to multiple ports). Lookup takes 1-2 clock cycles (1-2 nanoseconds at 1 GHz ASIC clock).
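Functionally, this stage is an exact-match lookup keyed by VLAN and destination MAC. The toy Python below shows only the lookup semantics (real ASICs use multi-way hardware hash tables, not a dictionary); the table contents and port numbers are made up:

```python
# Toy L2 forwarding table: (vlan, dst_mac) -> egress port.
FLOOD = -1                                   # sentinel: flood within the VLAN

mac_table = {
    (10, "aa:bb:cc:dd:ee:01"): 7,
    (10, "aa:bb:cc:dd:ee:02"): 12,
}

def l2_lookup(vlan: int, dst_mac: str) -> int:
    """Return the egress port, or FLOOD for unknown unicast / broadcast."""
    return mac_table.get((vlan, dst_mac.lower()), FLOOD)

print(l2_lookup(10, "AA:BB:CC:DD:EE:01"))    # 7
print(l2_lookup(10, "aa:bb:cc:dd:ee:99"))    # -1 -> flood
```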
Stage 5-6: L3 Lookup (if routing)
If the packet is routed (destination MAC matches the router's MAC), perform longest-prefix match in the routing table:
- Hash-based lookup for exact matches (host routes like /32)
- Algorithmic TCAM for prefixes (using a hybrid approach that reduces TCAM requirements)
- Supports 128K-1M+ routes depending on configuration
Returns next-hop IP address and egress interface. Lookup takes 2-4 clock cycles (2-4 nanoseconds).
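The key difference from the L2 stage is longest-prefix match rather than exact match. A linear-scan Python sketch makes the semantics clear (hardware uses tries or algorithmic TCAM, never a scan like this); routes and next hops are invented for the example:

```python
import ipaddress

# Toy routing table: prefix -> (next_hop, egress_interface). Entries are made up.
routes = {
    ipaddress.ip_network("10.0.0.0/8"):  ("192.0.2.1",   "eth1"),
    ipaddress.ip_network("10.1.2.0/24"): ("192.0.2.3",   "eth3"),
    ipaddress.ip_network("0.0.0.0/0"):   ("192.0.2.254", "eth0"),   # default route
}

def lpm(dst: str):
    """Return the most specific matching (prefix, (next_hop, interface))."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, nexthop in routes.items():
        if addr in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, nexthop)
    return best

print(lpm("10.1.2.99"))   # most specific match: 10.1.2.0/24 -> eth3
print(lpm("10.9.9.9"))    # falls back to 10.0.0.0/8 -> eth1
print(lpm("8.8.8.8"))     # only the default route matches
```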
Stage 7: Next-Hop Resolution
Resolve next-hop IP to destination MAC address via ARP/ND table (another hash table). This determines the L2 header to write on the egress packet.
Stage 8: Egress Port Determination
Based on L2/L3 lookups, determine egress port(s). For multicast, this might be multiple ports (handled via replication logic that duplicates the packet descriptor).
Stage 9: Queueing
The packet descriptor (and pointer to buffered packet data in SRAM) is enqueued to the appropriate egress queue. Modern switches have 8-16 queues per port with different priorities and scheduling weights.
This is where congestion control happens:
- Queue depth monitoring: ECN (Explicit Congestion Notification) marking occurs when queues reach thresholds
- WRED (Weighted Random Early Detection): Probabilistically drops packets before queues are completely full
- PFC (Priority Flow Control): For lossless Ethernet (RoCE), sends pause frames to upstream switches when queues fill
Stage 10: Scheduling
A hierarchical scheduler selects packets from queues based on policy:
- Strict priority: Higher-priority queues drain first (voice/video traffic)
- Weighted Fair Queueing: Share bandwidth proportionally based on weights
- Deficit Round Robin: Ensures fairness across queues by tracking "deficits" (sketched in the code after this list)
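Deficit round robin is simple enough to show in a few lines. This is a hedged software sketch of the algorithm only (queue contents and quanta are arbitrary), not a model of the ASIC's scheduler:

```python
from collections import deque

# Minimal deficit round robin: each queue earns a quantum of bytes per round
# and may transmit while it has deficit remaining. Values are illustrative.
queues  = {"voice": deque([200, 200, 200]),
           "bulk":  deque([1500, 1500, 1500, 1500])}   # packet sizes in bytes
quantum = {"voice": 300, "bulk": 600}
deficit = {name: 0 for name in queues}

def drr_round():
    sent = []
    for name, q in queues.items():
        if not q:
            deficit[name] = 0            # idle queues don't bank credit
            continue
        deficit[name] += quantum[name]
        while q and q[0] <= deficit[name]:
            pkt = q.popleft()
            deficit[name] -= pkt
            sent.append((name, pkt))
    return sent

for rnd in range(3):
    print(f"round {rnd}: {drr_round()}")
```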
Stage 11: Egress ACLs
Another TCAM lookup on egress, potentially dropping packets, modifying QoS markings, or applying rate limiting.
Stage 12-13: Packet Modification
Dedicated hardware rewrites packet headers:
- Decrement TTL (for routed packets)
- Rewrite source/destination MAC addresses (for routed packets)
- Add/remove VLAN tags (access vs trunk ports)
- Add MPLS labels or VXLAN headers (overlay networking)
- Recompute IP header checksum
- Update L4 checksums if port translation occurred
Each modification type has specialized hardware. A TTL decrement unit subtracts one and updates the checksum using an incremental checksum update formula (avoiding full recalculation).
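The incremental update trick comes from RFC 1624 (HC' = ~(~HC + ~m + m'), in 16-bit one's-complement arithmetic). Here's a software rendering of the idea for the TTL case, cross-checked against a full recomputation; the header fields are invented for the example:

```python
import struct

def ip_checksum(header: bytes) -> int:
    """Full one's-complement checksum over a header whose checksum field is zero."""
    s = 0
    for (word,) in struct.iter_unpack("!H", header):
        s += word
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

def incremental_update(old_cksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624: HC' = ~(~HC + ~m + m') in 16-bit one's-complement arithmetic."""
    s = ((~old_cksum) & 0xFFFF) + ((~old_word) & 0xFFFF) + (new_word & 0xFFFF)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

def ipv4_header(ttl: int) -> bytes:
    # Minimal 20-byte IPv4 header with the checksum field zeroed (fields invented).
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0x1234, 0,
                       ttl, 6, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))

old_cksum = ip_checksum(ipv4_header(64))

# Decrement TTL: only the 16-bit word holding TTL|protocol changes.
fast = incremental_update(old_cksum, (64 << 8) | 6, (63 << 8) | 6)

assert fast == ip_checksum(ipv4_header(63))   # matches a full recomputation
print(hex(fast))
```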
Stage 14: Serialization
The modified packet is serialized by the egress SerDes and transmitted on the physical port at 25/50/100/400/800 Gbps.
Total latency through this pipeline: 300-800 nanoseconds depending on configuration, features enabled, packet size, and whether cut-through or store-and-forward mode is active.
Memory Architecture in Tomahawk:
- On-chip SRAM: 64-128 MB for packet buffers, MAC tables, routing tables, ACL tables, metadata
- External TCAM (optional): For massive ACL deployments (256K+ rules), external TCAM chips provide additional capacity at the cost of power consumption and board space
- Hash tables: For exact-match lookups (MAC addresses, exact-match routes, flow tables)
- Algorithmic TCAM: Hybrid approach using hash tables and small TCAMs to implement LPM (longest prefix match) for routing tables efficiently
HBM in Carrier/Routing ASICs:
While pure switching ASICs like Tomahawk use only on-chip SRAM, carrier and routing ASICs increasingly incorporate HBM (High Bandwidth Memory) for deep buffering:
• Broadcom Jericho 2: 8 GB HBM2 with 2.5D packaging
• Broadcom Jericho3-AI: 4× HBM3 stacks for deep packet buffering
• Broadcom Jericho 4 (August 2025, 51.2 Tbps on 3nm): HBM for enabling lossless RoCE over 100+ km distances
• Cisco Silicon One: Hybrid buffer architecture with HBM for congested flows
HBM enables carrier-class features like deep buffering for video streams, long-haul latency variation absorption, and lossless Ethernet over extended distances, features not required in pure data center switching applications.
Features Integrated:
- VXLAN/NVGRE: Overlay network encap/decap at line rate
- MPLS: Label switching, up to 8 label push/pop operations in the pipeline
- 802.1Q/QinQ: VLAN tagging, double tagging for service providers
- Traffic management: Per-port and per-queue shaping, policing (token bucket)
- ECN and WRED: Congestion signaling for TCP and RED-based dropping
- MACsec: Layer 2 encryption with 128/256-bit AES-GCM
- Telemetry: sFlow, NetFlow sampling, INT (In-band Network Telemetry) for latency and path tracking
Broadcom's dominance comes from offering 90% of features most customers need at aggressive prices enabled by massive production volumes. They sell millions of switching ASICs annually, amortizing the $50-100M NRE across enormous unit counts.
Trident Series: The Enterprise Workhorse
Trident targets campus and enterprise switching, trading some raw throughput for richer feature sets:
- More ACL capacity (larger TCAMs)
- IGMP snooping for efficient multicast (important in enterprise networks with video conferencing)
- Enhanced QoS (more queue levels, more granular policers)
- Lower power consumption (campus switches don't have data center power budgets)
Jericho Series: The Carrier Class Beast
Jericho is Broadcom's carrier/service provider line (acquired from Dune Networks). It focuses on requirements telcos care about:
- Deep buffering: Up to 1 GB on-chip for absorbing bursty traffic (carriers deal with video streams, long-haul latency variations)
- Hierarchical QoS: Multiple levels of scheduling (per-customer, per-service, per-flow)
- MPLS everything: Deep label stacks, sophisticated traffic engineering, VPLS/VPWS for L2/L3 VPNs
- Carrier Ethernet: MEF (Metro Ethernet Forum) compliance, 802.1ag/Y.1731 OAM (Operations, Administration, and Maintenance)
Marvell: The Challenger With a Niche
Marvell's switching silicon (Prestera, Teralynx families) competes with Broadcom but emphasizes different features and targets slightly different markets.
Prestera Family: Enterprise and Mid-Range
Marvell's Prestera chips target mid-range switching with strong enterprise features:
- Deep packet inspection: Hardware support for application-layer protocol recognition (useful for firewalls, QoS based on application)
- Inline security: Integrated MACsec and basic firewall functions without external security processors
- Power efficiency: Lower power per gigabit than Broadcom equivalents (critical for edge deployments where power budgets are tight)
- Precision timing: IEEE 1588 PTP (Precision Time Protocol) for sub-microsecond time synchronization (financial trading, telecom base stations, industrial automation)
Marvell's strategy is integration. Where Broadcom might require external TCAMs, security processors, or timing chips, Marvell integrates more functionality on-die, reducing BOM (Bill of Materials) cost and board complexity. This appeals to mid-tier vendors who want simpler designs.
Teralynx: Data Center Competition
Teralynx 10, shipping since July 2024, delivers 51.2 Tbps on a 5nm monolithic design, directly competing with Broadcom's Tomahawk 5.
Key differentiators:
- Industry-lowest latency: 500 nanoseconds (verified by Keysight testing)
- Radix of 512 (enables 40% fewer switches vs 256-radix designs)
- Power efficiency: 1 watt per 100 Gbps
- Supports 32×1.6T, 64×800G, 128×400G, or 512×100G ports
- Enhanced RoCE features for AI/ML workloads
- Adaptive routing with flow-level granularity
Marvell is targeting approximately 20% of the custom ASIC market by 2028, positioning itself as the strategic second source for hyperscalers seeking diversification from Broadcom's 70-80% dominance. Key advantages include lowest latency, clean-sheet programmable architecture, and end-to-end portfolio integration with optical DSPs (~80% market share in 800G optical DSPs).
Against Tomahawk 5 specifically, the differentiation comes down to:
- Lower latency: Optimized pipeline achieves 400-500ns port-to-port latency (vs 700-800ns for Tomahawk)
- AI/ML optimizations: Enhanced features for RoCE (RDMA over Converged Ethernet) critical for GPU clusters doing distributed AI training
- Adaptive routing: Load balancing across multiple equal-cost paths with flow-level granularity
- Enhanced telemetry: Deeper visibility into queue depths, latency histograms, drop reasons
NVIDIA (Mellanox Spectrum): The AI Networking Specialist
NVIDIA's Spectrum switching ASICs, inherited from the 2020 Mellanox acquisition, focus on high-performance computing and AI infrastructure. This isn't general-purpose networking; it's specialized for the unique demands of GPU-to-GPU communication.
Spectrum-6: Built for AI
The latest generation delivers 409.6 Tbps across 512 ports of 800 Gbps or 1024 ports of 400 Gbps. But raw throughput isn't the story; the specialized features are:
RDMA Optimization:
RoCE (RDMA over Converged Ethernet) allows GPUs to read/write each other's memory directly without involving CPUs. This requires lossless Ethernet, which means no packet drops due to congestion. Spectrum achieves this through:
- PFC (Priority Flow Control): When queues fill, send PAUSE frames upstream to stop traffic flow for specific priorities (leaving other traffic unaffected); see the sketch after this list
- ECN (Explicit Congestion Notification): Mark packets experiencing congestion, allowing endpoints to slow down before packet loss occurs
- Ultra-low latency buffering: Microsecond-scale buffer management to catch transient bursts without introducing latency
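As referenced in the PFC bullet above, here's a rough sketch of the xoff/xon decision a lossless-Ethernet ingress buffer makes per priority class. The thresholds and state layout are invented for illustration; real ASICs also account for per-priority headroom to absorb data already in flight when the PAUSE goes out.
#include <stdbool.h>
#include <stdint.h>
#define NUM_PRIORITIES 8
/* Hypothetical per-priority ingress buffer accounting for PFC (802.1Qbb). */
typedef struct {
    uint32_t bytes_used[NUM_PRIORITIES];  /* current occupancy per class      */
    uint32_t xoff_threshold;              /* above this: ask upstream to stop */
    uint32_t xon_threshold;               /* below this: let it resume        */
    bool     paused[NUM_PRIORITIES];
} pfc_state_t;
/* Called after every enqueue/dequeue. Returns a bitmap of priorities whose
   pause state changed, so the MAC can emit a PFC frame for just those classes. */
static uint8_t pfc_update(pfc_state_t *s)
{
    uint8_t changed = 0;
    for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
        if (!s->paused[prio] && s->bytes_used[prio] > s->xoff_threshold) {
            s->paused[prio] = true;        /* buffer filling: send XOFF */
            changed |= (uint8_t)(1u << prio);
        } else if (s->paused[prio] && s->bytes_used[prio] < s->xon_threshold) {
            s->paused[prio] = false;       /* buffer drained: send XON  */
            changed |= (uint8_t)(1u << prio);
        }
    }
    return changed;
}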
AI/ML-Specific Features:
- SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): In-network computation for collective operations used in distributed ML training. When 1000 GPUs need to perform an AllReduce (summing gradients across all GPUs), SHARP does partial aggregation in the network switches rather than forcing all traffic to a single GPU, dramatically reducing communication time. (Note: SHARP is implemented in NVIDIA's InfiniBand switches, not in Spectrum Ethernet switches, so it belongs to the broader NVIDIA networking stack rather than to Spectrum itself.)
- Adaptive routing: Load balances traffic across multiple equal-cost paths at per-flow granularity, avoiding hotspots that occur when all GPUs send to the same destination simultaneously.
- Telemetry for AI workloads: Visibility into GPU-to-GPU communication patterns, identifying stragglers (slow GPUs holding up collective operations), monitoring RoCE flow control events.
Low Latency Obsession:
The previous generation, Spectrum-4, achieves 600ns port-to-port latency. Why does this matter? In AI training, every collective operation (AllReduce, AllGather) synchronizes all GPUs. If one packet is delayed by 10 microseconds, all 1000 GPUs wait. Minimizing switch latency directly improves training speed.
Vertical Integration Advantage:
NVIDIA builds:
- GPUs (A100, H100, B100/B200/B300)
- DPUs (BlueField, combining ARM CPUs, network ASICs, and acceleration engines)
- Switches (Spectrum)
- Interconnects (NVLink for GPU-to-GPU, InfiniBand for cluster-scale)
- Software (CUDA, NCCL for collective operations, DOCA for DPU programming)
This vertical stack enables optimizations impossible for pure networking vendors. The switch can coordinate with GPUs and DPUs to minimize collective operation latency in ways that require intimate knowledge of the entire system.
What All These ASICs Share (Despite Marketing Claims)
Despite vendor differentiation efforts, all modern switching ASICs share core architectural principles because physics and economics force convergence:
- Pipeline Processing: 10-20 stages operating continuously, maximizing throughput
- Integrated SerDes: On-chip serializer/deserializers for 25G/50G/100G links eliminate external PHY chips
- Flexible Memory Allocation: Trade MAC table size for routing table size or ACL capacity based on deployment needs
- Hardware Telemetry: Flow monitoring, latency measurement, queue depth visibility
- On-Chip Buffering: 60-120 MB of SRAM for packet buffering, plus optional external DRAM for deep buffering
The differences are in details: latency optimization (600ns vs 800ns), specific feature sets (SHARP vs standard ECMP), power efficiency (watts per terabit), and pricing. But fundamentally, they all solve the same problem: move packets at wire speed through a pipeline optimized for the most common operations.
The Packet Path: Following Your Cat Video Through Silicon
Let's trace a packet's journey through a modern switch to see how CPU, ASIC, and memory work together. We'll follow a single IP packet arriving on port 1, destined for a host reachable via port 48.
Stage 1: Photons Become Bits
The packet arrives as light pulses on a fiber optic cable connected to an SFP28 transceiver (25 Gbps). The transceiver's photodetector converts light into electrical signals.
SerDes (Serializer/Deserializer) Magic:
The SerDes inside the ASIC receives the serial bit stream at 25 Gbps. A 64-byte minimum packet (512 bits) arrives in 512 bits / 25 Gbps = 20.48 nanoseconds. There's no buffering at this stage; if downstream processing can't keep up, packets are dropped.
The SerDes performs several critical functions:
- Clock recovery: Extract timing information from the signal (there's no separate clock wire, timing is embedded in the data using 64b/66b encoding)
- Equalization: Compensate for signal degradation over the fiber (high-frequency components attenuate more than low-frequency)
- FEC (Forward Error Correction): Use RS-FEC or similar to correct bit errors without retransmission (essential at 25 Gbps and above, where the raw link error rate is otherwise too high to meet the ~10^-12 corrected-error target)
The SerDes converts serial data into parallel data (typically 256 or 512 bits wide internally) for processing by the ASIC pipeline.
Stage 2: Ingress Processing Begins
The packet enters the ASIC's ingress pipeline, clocked at 1-2 GHz (much slower than the 25 Gbps line rate, but the pipeline is hundreds of bits wide, maintaining throughput).
Parsing (2-3 clock cycles):
Dedicated parsing logic examines the packet:
Byte 0-5:   Destination MAC = 00:1A:2B:3C:4D:5E
Byte 6-11:  Source MAC = 00:AA:BB:CC:DD:EE
Byte 12-13: Ethertype = 0x0800 (IPv4)
Byte 14:    IP Version = 4, Header Length = 5 (20 bytes)
Byte 15:    DSCP/ECN = 0x00
Byte 16-17: Total Length = 1500 bytes
Byte 18-19: Identification = 0x1234
Byte 20-21: Flags/Fragment Offset = 0x4000 (Don't Fragment)
Byte 22:    TTL = 64
Byte 23:    Protocol = 6 (TCP)
Byte 24-25: Header Checksum = 0xABCD
Byte 26-29: Source IP = 192.0.2.10
Byte 30-33: Destination IP = 203.0.113.50
The parser extracts these fields into a packet descriptor structure and stores it in SRAM. The full packet data is also written to ingress buffers (on-chip SRAM).
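The exact descriptor format is proprietary and vendor-specific, but conceptually it's a small fixed-size record carrying just the fields later pipeline stages need, while the payload stays in buffer SRAM. A hypothetical layout in C, with field names invented for illustration:
#include <stdint.h>
/* Hypothetical packet descriptor: what travels through the pipeline
   while the full packet body sits in ingress buffer SRAM. */
typedef struct {
    uint32_t buffer_handle;   /* where the packet body lives in SRAM      */
    uint16_t packet_len;      /* total length in bytes                    */
    uint16_t ingress_port;
    uint8_t  dst_mac[6];
    uint8_t  src_mac[6];
    uint16_t ethertype;       /* 0x0800 for IPv4                          */
    uint32_t src_ip;
    uint32_t dst_ip;
    uint8_t  ip_protocol;     /* 6 = TCP                                  */
    uint8_t  ttl;
    uint8_t  dscp;
    uint16_t l4_src_port;
    uint16_t l4_dst_port;
    /* Filled in by later stages: */
    uint16_t egress_port;
    uint8_t  egress_queue;
    uint8_t  flags;           /* bit 0: drop, bit 1: punt to CPU, bit 2: ECN mark */
} pkt_desc_t;
Keeping the descriptor small is deliberate: the pipeline passes tens of bytes of metadata between stages instead of copying kilobytes of payload around.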
Ingress ACL Lookup (1-2 clock cycles):
The descriptor is matched against ingress ACLs stored in TCAM. Let's say there's a rule:
Rule 100: permit tcp any 203.0.113.0/24 eq 443 (allow HTTPS to web servers)
The TCAM performs parallel matching: all entries compare simultaneously against the packet's 5-tuple (src IP, dst IP, protocol, src port, dst port). Multiple matches are possible; the highest-priority match wins. Result: permit, set DSCP to 46 (expedited forwarding).
Stage 3: L2 Lookup (Optional, Skipped if Routing)
If this were L2 switching, the ASIC would look up the destination MAC in the MAC table (hash table in SRAM). The hash function computes an index, looks up the entry, and returns the egress port.
But our destination MAC (00:1A:2B:3C:4D:5E) matches the router's MAC, so this is a routed packet. L2 lookup is skipped.
Stage 4: L3 Routing Lookup (2-4 clock cycles)
Now the critical operation: longest-prefix match for 203.0.113.50.
The ASIC's routing table uses a hybrid approach:
Hash Table for Host Routes: Exact-match /32 routes (and /64 for IPv6) use a hash table for O(1) lookup.
Algorithmic TCAM for Prefixes: Prefix routes use a clever structure:
- Most routing table entries are /24 and longer (more specific)
- Hash the prefix, look up in a table
- If collision (multiple prefixes hash to the same bucket), use a small TCAM for disambiguation
For 203.0.113.50, let's say the routing table has:
203.0.113.0/24 -> next-hop 10.0.1.1, egress port 48
The lookup returns: next-hop IP 10.0.1.1, egress interface port 48.
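Here's a minimal software model of the hash-then-TCAM lookup described above, with invented table structures. In the real ASIC the host-table probe and the TCAM search run in parallel hardware; the fallback loop below only stands in for the TCAM's parallel compare.
#include <stdbool.h>
#include <stdint.h>
typedef struct { uint32_t next_hop_ip; uint16_t egress_port; } route_result_t;
typedef struct { uint32_t dst_ip; route_result_t r; bool valid; } host_entry_t;                 /* exact-match /32 */
typedef struct { uint32_t prefix; uint32_t mask; route_result_t r; bool valid; } lpm_entry_t;   /* stands in for TCAM */
#define HOST_TABLE_SIZE 4096
#define LPM_TABLE_SIZE  1024
static host_entry_t host_table[HOST_TABLE_SIZE];
static lpm_entry_t  lpm_table[LPM_TABLE_SIZE];
static uint32_t hash32(uint32_t x) {   /* toy mixing hash, good enough for the sketch */
    x ^= x >> 16; x *= 0x7feb352dU; x ^= x >> 15; x *= 0x846ca68bU; x ^= x >> 16;
    return x;
}
/* Returns true and fills *out on a hit. */
static bool route_lookup(uint32_t dst_ip, route_result_t *out)
{
    /* 1. O(1) exact-match probe for a /32 host route. */
    host_entry_t *h = &host_table[hash32(dst_ip) % HOST_TABLE_SIZE];
    if (h->valid && h->dst_ip == dst_ip) { *out = h->r; return true; }
    /* 2. Longest-prefix match fallback (the TCAM's job, done here as a
          linear scan purely for illustration). */
    int best_len = -1;
    for (int i = 0; i < LPM_TABLE_SIZE; i++) {
        lpm_entry_t *e = &lpm_table[i];
        if (!e->valid || (dst_ip & e->mask) != e->prefix) continue;
        int len = __builtin_popcount(e->mask);   /* prefix length (GCC/Clang builtin) */
        if (len > best_len) { best_len = len; *out = e->r; }
    }
    return best_len >= 0;
}
For 203.0.113.50, the host-table probe misses and the 203.0.113.0/24 entry wins the prefix match, returning next-hop 10.0.1.1 and port 48.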
Stage 5: Next-Hop Resolution (1 clock cycle)
The ASIC looks up next-hop IP 10.0.1.1 in the ARP table (another hash table) and retrieves the destination MAC address: AA:BB:CC:DD:EE:FF.
If the ARP entry doesn't exist, the packet descriptor is marked for "punt to CPU." The ASIC forwards the packet to the CPU via an internal port, the CPU generates an ARP request, waits for a reply, updates the ARP table, and re-injects the original packet. This adds milliseconds of latency, but it only happens once per destination (until ARP times out).
Stage 6: Queueing (Variable Time)
The packet descriptor (containing egress port, rewrite information, QoS priority) is enqueued to the appropriate egress queue for port 48.
Modern switches have 8-16 queues per port:
- Queue 7: Network control (highest priority, for routing protocols)
- Queue 6: Voice (strict priority)
- Queue 5: Video (strict priority)
- Queue 4-0: Data traffic (weighted fair queueing)
Our packet (HTTPS traffic, DSCP 46) maps to queue 5. If queue 5 is empty, the packet proceeds immediately. If queue 5 is full (congestion), ECN marking occurs or the packet is dropped (depending on configuration).
This is where congestion happens. If port 48 is receiving 100 Gbps aggregate from multiple ports but can only transmit 25 Gbps, queues fill and packets wait (or drop).
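The mark-or-drop decision at enqueue time follows the WRED/ECN behavior described above. A simplified sketch, with thresholds and the probability ramp invented for illustration (real ASICs typically use an averaged queue depth and per-queue drop profiles):
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
typedef enum { ENQUEUE, ENQUEUE_ECN_MARKED, DROP } admit_t;
typedef struct {
    uint32_t depth_bytes;   /* current queue occupancy              */
    uint32_t min_thresh;    /* below this: always enqueue           */
    uint32_t max_thresh;    /* at or above this: always drop        */
} queue_t;
/* Simplified WRED/ECN admission: between the thresholds, mark (if the flow
   is ECN-capable) or drop with a probability that grows with queue depth. */
static admit_t admit_packet(queue_t *q, uint32_t pkt_bytes, bool ecn_capable)
{
    uint32_t fill = q->depth_bytes + pkt_bytes;
    if (fill <= q->min_thresh) { q->depth_bytes = fill; return ENQUEUE; }
    if (fill >= q->max_thresh) return DROP;            /* hard limit reached */
    uint32_t span = q->max_thresh - q->min_thresh;     /* linear ramp region */
    uint32_t over = fill - q->min_thresh;
    bool congested = ((uint32_t)rand() % span) < over; /* deeper queue = more likely */
    if (congested && !ecn_capable) return DROP;
    q->depth_bytes = fill;
    return congested ? ENQUEUE_ECN_MARKED : ENQUEUE;
}
Hardware does this per queue, per packet, with no random-number library and no branches visible to software; only the thresholds and profiles are configurable.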
Stage 7: Scheduling (Continuous)
The egress scheduler for port 48 runs continuously, selecting packets from queues based on policy:
while (port_48_transmitting) {
if (!queue_7.empty()) send(queue_7.dequeue()); // Strict priority
else if (!queue_6.empty()) send(queue_6.dequeue());
else if (!queue_5.empty()) send(queue_5.dequeue());
else {
// WFQ for queues 0-4 based on weights
send_from_queue_based_on_deficit_round_robin();
}
}
Our packet in queue 5 waits for any queue 7 or 6 traffic to clear, then gets scheduled for transmission.
Stage 8: Packet Modification (2-3 clock cycles)
Dedicated hardware rewrites the packet:
TTL Decrement:
- TTL: 64 -> 63
- IP Header Checksum: Recalculate using an incremental update (RFC 1624): new_checksum = ~(~old_checksum + ~old_word + new_word) in one's-complement arithmetic, where the word is the 16-bit header word containing the TTL (sketched at the end of this stage)
MAC Address Rewrite:
- Source MAC: 00:1A:2B:3C:4D:5E (router's MAC on port 48)
- Destination MAC: AA:BB:CC:DD:EE:FF (next-hop MAC from ARP)
If VXLAN encapsulation were required, another 50 bytes of headers would be added (outer Ethernet, IP, UDP, VXLAN). But for our simple routed packet, we're done.
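Back to the TTL rewrite: the incremental checksum update above is RFC 1624's formula, and it's small enough to sketch directly. The key detail is that the updated quantity is the 16-bit header word containing TTL and protocol, and all arithmetic is one's-complement with end-around carry.
#include <stdint.h>
/* One's-complement incremental checksum update per RFC 1624 (eqn. 3):
     HC' = ~(~HC + ~m + m')
   where m is the old 16-bit word and m' is the new one. */
static uint16_t incr_checksum(uint16_t old_cksum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_cksum;
    sum += (uint16_t)~old_word;
    sum += new_word;
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
/* Example: TTL 64 -> 63 with protocol TCP (6). The TTL/protocol word goes
   0x4006 -> 0x3F06, so: new_cksum = incr_checksum(old_cksum, 0x4006, 0x3F06); */
In silicon this is an adder and an inverter, which is why the TTL decrement costs essentially nothing compared to recomputing the checksum over the whole header.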
Stage 9: Serialization and Transmission
The modified packet is read from buffers, serialized by the egress SerDes, and transmitted on port 48 at 25 Gbps. The 1500-byte packet takes 1500 × 8 / 25 Gbps = 480 nanoseconds to transmit.
Total Latency:
Ingress processing (10ns) + Queueing (0-10ms depending on congestion) + Egress processing (10ns) + Serialization (480ns) = ~500ns with empty queues, or milliseconds with congestion.
Modern switches achieve <1 microsecond latency under ideal conditions, but queuing delay dominates under load.
What About the CPU?
The CPU never saw this packet. It was entirely handled by the ASIC pipeline. The CPU's involvement was limited to:
- Earlier (seconds/minutes ago): Running BGP, learning the route to 203.0.113.0/24, programming it into the ASIC's routing table
- Earlier (seconds ago): Learning next-hop 10.0.1.1's MAC address via ARP and programming it into the ASIC's ARP table
The CPU is running in parallel, maintaining BGP sessions, responding to SNMP queries, writing logs, but it's not involved in per-packet forwarding. This is why a slow CPU doesn't impact forwarding performance (until control plane overload affects routing convergence).
Memory Architecture: SRAM, DRAM, and TCAM (Or: Why Some Memory Costs More Than Gold)
Network hardware uses three very different memory technologies, each with distinct physics, performance characteristics, and cost profiles that vary by roughly 1000x.
SRAM: Fast, Expensive, and Hungry for Space
Static RAM is the fastest memory technology, used for on-chip buffers, small tables, and anything accessed every packet.
How SRAM Works:
Each SRAM cell uses 6 transistors arranged as cross-coupled inverters (two inverters connected in a feedback loop) with access transistors. The cell holds state indefinitely as long as power is applied. No refresh cycles are needed (unlike DRAM), so access is simple: assert the wordline (row selector), and data appears on bitlines immediately.
Physical Operation:
When you write to SRAM:
- Drive bitlines to the desired value (0V for logic 0, VDD for logic 1)
- Assert the wordline (turns on access transistors)
- Bitlines overpower the cross-coupled inverters, forcing them to the new state
- De-assert the wordline (access transistors close), and the inverters hold the state
When you read:
- Pre-charge both bitlines high (typically to VDD)
- Assert the wordline
- The cell slightly discharges one bitline (which one depends on the stored value), creating a small voltage differential
- Sense amplifiers detect the small voltage difference and amplify it to a full logic level
This can happen in <1ns in modern processes.
SRAM Characteristics:
- Speed: Single-cycle access at process frequencies (1ns at 1 GHz, sub-nanosecond at higher clocks)
- Density: Terrible (6 transistors per bit, plus routing overhead and sense amplifiers). The tens of megabytes of buffer SRAM on a switching ASIC consume tens of square millimeters of die area even at 7nm.
- Power: Leakage power (static power) is significant at advanced nodes. At 5nm/3nm, leakage through transistor gates (quantum tunneling) becomes non-trivial. Dynamic power (during switching) is moderate.
- Cost: Extremely expensive per bit in terms of die area. Die area costs $5-10 per square millimeter at advanced nodes.
SRAM in Network Hardware:
A modern switching ASIC has 64-128 MB of SRAM on-die for:
- Packet buffers: Ingress and egress queues, storing packets temporarily
- MAC address tables: Hash tables with 32K-256K entries (6 bytes per MAC + metadata)
- Routing tables: Small routing tables or algorithmic TCAM structures
- Packet descriptors: Metadata for packets in flight through the pipeline
At 5nm, 128 MB of SRAM requires billions of transistors (128 MB × 8 bits/byte × 6 transistors/bit = 6.4 billion transistors just for SRAM). This is a massive silicon investment, but essential for wire-speed access.
DRAM: Capacity Over Speed (The Workhorse of Computing)
Dynamic RAM trades speed for density, achieving roughly 4-8x higher density than SRAM at the cost of slower access and refresh overhead.
How DRAM Works:
Each DRAM cell is one transistor and one capacitor. The capacitor stores charge (logic 1) or not (logic 0). The transistor acts as an access switch.
Why "Dynamic"?
Capacitors leak charge. Within milliseconds, a charged capacitor (logic 1) will discharge enough to become ambiguous. DRAM controllers must periodically refresh every cell: read the value, sense it, and write it back. DDR4 requires every row to be refreshed within 64ms, which means the controller is constantly reading and rewriting memory even when the CPU isn't accessing it.
Physical Operation:
When you write to DRAM:
- Assert the row address (opens access transistors for an entire row, thousands of cells)
- The row's capacitors share charge with bitlines (destructive read, since capacitors discharge into bitlines)
- Sense amplifiers detect bitline voltages and amplify them
- Drive bitlines to the desired new value for the column being written
- Close the row (access transistors turn off), capacitors recharge to new values
When you read:
- Same as write steps 1-3 (reading is destructive)
- Restore the row by writing amplified values back (required since reading discharged capacitors)
- Close the row
This takes 20-40ns for the first access (row activation, column selection, data transfer), then subsequent accesses to the same row are faster (10-15ns) since the row is already open.
DRAM Characteristics:
- Speed: Much slower than SRAM (20-40ns first access latency, though burst modes achieve high bandwidth)
- Density: Excellent (1 transistor + 1 capacitor per bit, and capacitors are 3D structures stacked vertically, not planar)
- Power: Moderate (refresh cycles consume power continuously, plus dynamic power during access)
- Cost: Much cheaper per bit than SRAM ($5-10 per gigabyte vs thousands per gigabyte for SRAM)
DRAM in Network Hardware:
External DRAM (DDR4, DDR5) provides gigabytes of capacity for:
- Deep packet buffers: Carrier-class equipment with 1+ GB buffers for absorbing massive bursts (video streams, long-haul latency variations)
- CPU main memory: 4-32 GB for the control plane CPU
- Routing tables in software: The CPU maintains full routing tables in DRAM, then programs the most active routes into the ASIC's hardware tables
DRAM doesn't appear in the wire-speed data plane because access latency is too high. A 40ns DRAM access at 25 Gbps means 25 Gbps × 40ns = 1000 bits = 125 bytes have arrived during that access. You'd fall behind immediately.
TCAM: Parallel Search at Absurd Cost (The Technology That Makes ACLs Possible)
Ternary Content-Addressable Memory is the most exotic and expensive memory in network hardware, but it enables operations that are physically impossible with standard memory.
How Standard Memory Works:
Standard memory is addressed: "Give me the data at address 0x1234." You provide an address, and memory returns the data stored there. This is "location-based" access.
How CAM Works:
CAM (Content-Addressable Memory) reverses this: "Find me the address(es) containing this data." You provide data, and CAM returns the address(es) where it's stored. This is "content-based" access.
How TCAM Extends CAM:
TCAM adds a third state: "don't care" (denoted X). Each bit can be 0, 1, or X. This allows pattern matching with wildcards.
Example TCAM entries:
Entry 0: 192.0.2.0/24   -> 11000000.00000000.00000010.XXXXXXXX
Entry 1: 192.0.2.128/25 -> 11000000.00000000.00000010.1XXXXXXX
Entry 2: 192.0.0.0/8    -> 11000000.XXXXXXXX.XXXXXXXX.XXXXXXXX
When you search TCAM with an IP address like 192.0.2.150:
Search: 192.0.2.150 = 11000000.00000000.00000010.10010110
All three entries match (entries 0, 1, and 2), but entry 1 is the most specific (longest prefix). TCAM returns entry 1's address (and associated data, like next-hop information).
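In software terms, each TCAM entry is a value/mask pair, and the search checks every entry against the key "at once." The loop below is only a stand-in for that parallel compare plus priority encode; the selection rule here (most mask bits set, i.e., longest prefix) is one common policy, and some designs use lowest index instead.
#include <stdint.h>
typedef struct {
    uint32_t value;   /* bits that must match where the mask is 1   */
    uint32_t mask;    /* 1 = care, 0 = don't-care (the X bits)      */
    int      valid;
} tcam_entry_t;
/* Returns the index of the best (most specific) match, or -1 if none.
   Hardware evaluates every entry simultaneously; the loop only models
   that parallel compare plus the priority encoder. */
static int tcam_search(const tcam_entry_t *table, int n, uint32_t key)
{
    int best = -1, best_bits = -1;
    for (int i = 0; i < n; i++) {
        if (!table[i].valid) continue;
        if ((key & table[i].mask) != table[i].value) continue;
        int bits = __builtin_popcount(table[i].mask);   /* specificity (GCC/Clang builtin) */
        if (bits > best_bits) { best = i; best_bits = bits; }
    }
    return best;
}
/* With the three entries above, searching for 192.0.2.150 matches all of
   them, and the /25 entry (most mask bits set) wins. */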
Physical Implementation:
Each TCAM cell requires 16 transistors (compared to 6 for SRAM):
- 2 SRAM cells (12 transistors) to store the three states (00=0, 01=1, 11=X, 10=invalid)
- Comparison logic (4 transistors) to compare the search data against the stored value and X bits
When you search TCAM:
- The search data is broadcast to all entries simultaneously
- Every entry's comparison logic evaluates match/no-match in parallel
- A priority encoder selects the highest-priority match (typically longest prefix or lowest index)
- The result is the address of the matching entry
This parallel search is TCAM's defining feature: searching 1 million entries takes the same time as searching 10 entries (~1-2ns). The cost is enormous transistor count and power consumption.
TCAM Characteristics:
- Speed: Extremely fast (1-2ns lookup regardless of table size, since search is fully parallel)
- Density: Abysmal (16 transistors per bit, plus priority encoding logic)
- Power: Staggering (all entries compare simultaneously, charging/discharging massive capacitance)
- Cost: Astronomical per bit (10-100x SRAM cost, 10,000x DRAM cost)
A large external TCAM can consume 40-80 watts continuously. This is why large ACL tables (thousands of rules) require careful power and cooling design.
TCAM in Network Hardware:
Typical switching ASICs have 4-8K TCAM entries on-chip, enough for a few thousand ACL rules (complex rules often expand into multiple entries, for example when a port range has to be broken into several value/mask pairs). Larger tables require external TCAM chips from IDT, Broadcom, or Marvell.
External TCAMs provide 512K-2M entries but require:
- High-speed interface to the ASIC (10-20 Gbps)
- Significant board space
- Dedicated power delivery (40-80W)
- Active cooling
This is why network architects preach ACL hygiene: every rule costs power, money, and chip area.
Algorithmic TCAM (Reducing TCAM Requirements):
Modern ASICs use hybrid approaches to reduce TCAM dependence:
For routing tables:
- Hash tables handle exact-match routes (/32 for IPv4, /64 for IPv6), the most common entries
- Small TCAMs handle less-specific prefixes that cause hash collisions
- This reduces TCAM requirements by 10-20x
For ACLs:
- Common 5-tuple rules use hash-based matching where possible
- TCAMs handle wildcard-heavy rules (e.g., "any source, destination port 443")
Memory Hierarchy in Network Hardware (Revisited with Physics)
Network hardware uses memory in a carefully orchestrated hierarchy, balancing speed, capacity, and cost:
SRAM (on-chip):
- Size: 10-128 MB
- Speed: <1ns access
- Cost: $10-50 per megabyte (die area cost)
- Use: Packet buffers, frequently accessed tables (MAC, small routing tables)
TCAM (on-chip or external):
- Size: 4K-2M entries (512 KB - 256 MB)
- Speed: 1-2ns search (regardless of size)
- Cost: $100-1000 per megabyte
- Use: ACLs, routing tables (LPM)
DRAM (external):
- Size: 1-32 GB
- Speed: 20-40ns first access
- Cost: $0.005-0.01 per megabyte
- Use: Deep buffers, CPU memory
Flash/SSD:
- Size: 16-256 GB
- Speed: 10-100 microseconds
- Cost: $0.0001-0.001 per megabyte
- Use: OS images, configuration storage, logs
The art of network hardware design is placing the right data in the right memory tier, balancing wire-speed requirements against silicon budget constraints.
The Future: Programmable Data Planes and Disaggregation
Traditional ASICs have fixed pipelines. You can configure tables and policies, but you can't change what the pipeline fundamentally does. This is changing, slowly.
P4 (Programming Protocol-independent Packet Processors):
P4 is a domain-specific language for programming packet processing pipelines. Instead of hardcoded parsers and match-action stages, P4 lets you define:
- Custom Packet Parsers: Handle new protocols by specifying header formats and extraction logic
- Custom Match-Action Tables: Define new forwarding behaviors beyond standard L2/L3/ACL
- Custom Packet Modifications: Specify arbitrary header rewrites and insertions
Modern ASICs supporting P4 (Intel Tofino 1/2, Broadcom Trident 4, Cisco Silicon One, Marvell Teralynx) implement this through:
- Programmable Parser Engines: State machines with programmable transitions and extraction rules
- Flexible Match-Action Pipeline: Configurable table types (exact match, LPM, range), action primitives (add, subtract, rewrite), and table interconnections
- Microcode for Packet Modifications: Small processors executing sequences of primitive operations
P4 doesn't eliminate ASIC limitations (the hardware is still fixed, you're programming microcode and state machines within limits), but it dramatically increases flexibility. You can deploy new protocols without new silicon.
Critical P4 Update - Intel Tofino Discontinuation and PINS:
The P4 landscape has shifted significantly. Intel discontinued Tofino ASIC development in 2023 after Tofino 1 and Tofino 2, cancelling Tofino 3. In 2025, Intel open-sourced the P4 Studio software but has exited the switch ASIC market, remaining committed to P4 only for IPUs and SmartNICs.
Current P4-capable ASICs:
- Intel Tofino 1/2: Native P4 support (end-of-life for new development)
- Cisco Silicon One G200: Native P4 programmable
- Broadcom Trident 4/Jericho 2: Use NPL (Network Programming Language), Broadcom's proprietary alternative
- Broadcom (all ASICs): P4 via PINS/SAI mapping (not compiled to a bitstream, but exposes a P4Runtime interface)
- NVIDIA Spectrum: Not P4-native, uses FlexFlow architecture
PINS (P4 Integrated Network Stack) merged into the SONiC 202111 release. It provides a P4Runtime interface on top of SAI-compatible ASICs, so it works on fixed-function hardware, not just P4-native chips. It's a collaboration between ONF, Google, Microsoft, and Intel. While not true silicon-level programmability, it provides a standard interface for programming fixed-function pipelines.
Example P4 Use Cases:
- In-band Network Telemetry (INT): Add custom headers to packets with switch queue depth, latency, path information
- Custom load balancing: Implement novel ECMP hashing algorithms or flowlet-based load balancing
- Network function offload: Move simple firewall or NAT logic into switches
Disaggregation and Open Networking:
The rise of merchant silicon (Broadcom, Marvell, NVIDIA) has enabled disaggregation: separating hardware from software.
Traditional model:
- Cisco, Juniper, Arista sell integrated hardware + software
- Proprietary ASICs (Cisco, Juniper) or tightly coupled software (all vendors)
- Vendor lock-in
Disaggregated model:
- Buy "white box" switches (Edgecore, Dell, Celestica) with merchant silicon
- Run open-source network operating systems (SONiC, Cumulus, DENT) or commercial alternatives
- Mix and match hardware and software vendors
This works because:
- Merchant silicon (Broadcom) is good enough for 90% of use cases
- Standard APIs (SAI, Switch Abstraction Interface) decouple software from hardware
- Cloud providers (Microsoft, Facebook, Google) pioneered this for cost savings
The result is increased competition, lower costs, and faster innovation. But also complexity: you're now integrating components yourself.
The Uncomfortable Reality: Physics Always Wins
No amount of clever software, algorithm optimization, or wishful thinking can overcome fundamental physics. If your application needs:
- Wire-speed forwarding at 100+ Gbps: You need an ASIC. CPUs can't touch this (10-100x too slow), and FPGAs are marginal (2-5x too slow, 2-5x too power-hungry).
- Submicrosecond latency: You need custom hardware (ASIC or FPGA). CPUs take microseconds just for interrupt handling and context switching.
- Parallel processing of millions of packets per second: You need hardware pipelines. CPUs process serially (or with limited parallelism across cores).
Conversely, if you need:
- Complex protocol state machines: BGP, OSPF with thousands of neighbors and routes
- Frequent updates: New features, bug fixes, protocol extensions
- Integration with standard ecosystems: Linux, containers, cloud orchestration
You need a CPU. ASICs can't provide this flexibility, and FPGAs are too expensive to program (FPGA development requires hardware design expertise, not just software skills).
This is why network equipment has evolved into heterogeneous systems: ASICs for the fast path (packet forwarding), CPUs for the control plane (protocols, management), FPGAs for specialized middle ground (custom acceleration), and multiple memory types (SRAM for speed, DRAM for capacity, TCAM for parallel search).
It's not redundancy, poor design, or vendor profit maximization (okay, maybe a little of that). It's the inevitable result of physics (electron mobility, capacitance, power density), economics (NRE amortization, silicon cost, development expertise), and the relentless demands of moving packets at wire speed while also running BGP.
The next time you complain about network equipment costs (and you will, because a 100G router costs more than most cars), remember: inside that chassis is an orchestra of specialized silicon, each type doing what only it can do, because the universe gave us an iron triangle of performance, flexibility, and cost. You get to choose two, and network hardware chooses all three by using three different types of chips.
And that's why your packets need a CPU, an ASIC, and maybe an FPGA to get anywhere. Physics is a harsh mistress, but at least she's consistent.