Container Networking: Giving Every Process Its Own Network Stack
Containers are virtualization, despite a decade of developers insisting they're "not VMs." They virtualize the OS instead of hardware, then we run them inside actual VMs (because security), creating virtualization stacked on virtualization. The result is network namespaces, veth pairs, overlay networks, CNI plugins, and iptables/nftables chains, giving every container its own complete network stack. Because apparently giving processes their own filesystem wasn't enough complexity.
And then Kubernetes decided the word "namespace" wasn't confusing enough, so they created Kubernetes namespaces (logical resource groupings) completely unrelated to Linux network namespaces (kernel isolation). Thanks for that. Nothing says intuitive design like overloading fundamental terminology, forcing you to ask "which kind?" before troubleshooting.
This complexity buys CI/CD velocity: deploy microservices fast, accept something like 15% VXLAN overhead and 10% iptables CPU cost, and let the operations team sort out the performance later. Let's explore how container networking actually works, why it's complex, how different orchestrators handle it, and why combining virtualization layers creates fascinating problems when you try running high-performance InfiniBand through multiple software abstraction layers.
Linux Network Namespaces: The Foundation of Container Networking
Container networking is built on Linux network namespaces, a kernel feature that creates isolated network stacks. When you create a network namespace, you get a completely independent set of network interfaces, routing tables, firewall rules, and sockets. Processes in different namespaces can't see each other's network configuration. This is how containers achieve network isolation without full virtualization.
The basic mechanics are straightforward. The Linux kernel maintains a single global network stack by default. Network namespaces partition this into multiple independent stacks. Each namespace has:
Its own loopback interface: Every namespace gets a separate lo interface. 127.0.0.1 in one namespace is completely distinct from 127.0.0.1 in another.
Independent routing tables: Routes configured in one namespace don't affect others. Each namespace makes its own routing decisions.
Separate netfilter chains: Iptables/nftables rules are per-namespace. Firewall configurations don't leak between namespaces.
Isolated socket state: TCP connections, UDP sockets, and listening ports in one namespace are invisible to others.
Own network interfaces: Physical or virtual interfaces can be assigned to specific namespaces.
Creating a network namespace is simple:
ip netns add container1
Now you have a new namespace called container1 with nothing in it. No interfaces, no routes, completely isolated. To make it useful, you need to connect it to something, which brings us to virtual ethernet pairs.
Virtual Ethernet Pairs: The Pipe Between Worlds
A veth (virtual ethernet) pair is exactly what it sounds like: two virtual network interfaces connected by a virtual wire. Packets sent to one end come out the other end. It's a pipe, but for network frames instead of bytes.
This is how you connect network namespaces to the host or to each other. Create a veth pair, put one end in the container's namespace, keep the other end in the host namespace (or another container's namespace), configure IP addresses, and suddenly your isolated container can communicate.
ip link add veth0 type veth peer name veth1
ip link set veth1 netns container1
ip addr add 10.0.0.1/24 dev veth0
ip netns exec container1 ip addr add 10.0.0.2/24 dev veth1
ip link set veth0 up
ip netns exec container1 ip link set veth1 up
Congratulations, you've manually created what Docker does automatically for every container. veth0 in the host namespace can now communicate with veth1 in container1's namespace. No bridge is involved yet: the kernel simply forwards frames from one end of the veth pair to the other, crossing the namespace boundary.
The performance cost is real but not terrible. Each packet traverses the veth pair, which involves context switching between namespaces and some kernel overhead. For low packet rates this is fine. For high packet rates (10+ Gbps, millions of packets per second), the veth overhead becomes measurable. You're adding microseconds of latency and CPU cycles for the privilege of network isolation. Usually worth it, but the cost exists.
Linux Bridges: Connecting Multiple Containers
Veth pairs connect two namespaces. What if you have 10 containers that need to communicate? You could create veth pairs between every pair of containers (45 pairs for 10 containers in a full mesh), or you could use a Linux bridge.
A Linux bridge is a software switch. It learns MAC addresses, forwards frames, and behaves like a physical Ethernet switch. Docker creates a bridge called docker0 by default, and each container gets a veth pair with one end attached to this bridge. All containers on the same bridge can communicate at Layer 2.
The bridge typically gets an IP address and acts as the default gateway for containers. The host's kernel routes traffic from containers to the outside world, performing NAT if needed. This is the simplest container networking model and what you get by default with Docker.
It works fine for single-host deployments. When you scale to multiple hosts, things get interesting (read: complicated).
Overlay Networks: Tunneling Your Way to Multi-Host Networking
Containers on different physical hosts need to communicate, but they're on separate Layer 2 networks. The solution that emerged is overlay networking: create a virtual Layer 2 network on top of the existing Layer 3 network by encapsulating container traffic in tunnels.
The most common overlay technology is VXLAN (Virtual Extensible LAN). VXLAN wraps Ethernet frames in UDP packets and sends them across the underlying IP network. From the containers' perspective, they're on the same Layer 2 network. From the physical network's perspective, it's just UDP traffic between hosts.
How VXLAN Works
VXLAN adds a 50-byte header to every packet: 8 bytes for VXLAN header, 8 bytes for UDP header, 20 bytes for IP header, 14 bytes for outer Ethernet header. Your 1500-byte MTU on the container network becomes 1550 bytes on the physical network (or you reduce container MTU to 1450 to fit within 1500-byte physical MTU).
This overhead is permanent. Every container-to-container packet on different hosts pays the encapsulation tax. A chatty microservices application doing thousands of requests per second across hosts is wrapping and unwrapping thousands of packets per second, burning CPU cycles for the privilege of pretending containers are on the same Layer 2 network.
The VXLAN Network Identifier (VNI) is 24 bits, providing 16 million possible overlay networks. This seems like plenty until you consider that some organizations run hundreds of Kubernetes clusters with dozens of namespaces each (Kubernetes namespaces, the resource grouping kind, not network namespaces). VNI exhaustion is unlikely but theoretically possible if you're really committed to microservices madness.
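The arithmetic in this section is easy to sanity-check. A quick sketch in plain Python, just redoing the byte counting and the VNI math from the paragraphs above:

```python
# Per-packet VXLAN overhead, the reduced container MTU, and the 24-bit VNI space.

VXLAN_HDR, UDP_HDR, IP_HDR, ETH_HDR = 8, 8, 20, 14
OVERHEAD = VXLAN_HDR + UDP_HDR + IP_HDR + ETH_HDR  # 50 bytes per packet

def container_mtu(physical_mtu: int = 1500) -> int:
    # Shrink the inner MTU so the encapsulated packet still fits on the wire.
    return physical_mtu - OVERHEAD

def overhead_pct(payload: int) -> float:
    # The encapsulation tax as a percentage of the inner packet size.
    return 100 * OVERHEAD / payload

VNI_SPACE = 2 ** 24  # 24-bit VXLAN Network Identifier

print(OVERHEAD)                      # 50
print(container_mtu())               # 1450
print(round(overhead_pct(1500), 1))  # 3.3
print(round(overhead_pct(100)))      # 50
print(VNI_SPACE)                     # 16777216
```

The small-packet numbers are the ones to stare at: a chatty RPC workload sending 100-byte payloads pays the 50% tax on every single request.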
The Performance Cost of Overlays
Let's be honest about overlay network performance. You're adding:
Encapsulation/decapsulation CPU overhead: Every packet must be wrapped and unwrapped. This burns CPU cycles on both sender and receiver.
Increased packet size: 50 bytes of overhead per packet means 3.3% overhead for 1500-byte packets, but 50% overhead for 100-byte packets. Small packets suffer most.
MTU fragmentation issues: If your physical network has 1500-byte MTU and you don't reduce container MTU, packets get fragmented. Fragmentation destroys performance.
Additional latency: Encapsulation adds microseconds per packet. Not huge, but measurable at scale.
Hardware offload limitations: Not all NICs can offload VXLAN encapsulation. Without hardware offload, the CPU does all the work.
Typical overlay network performance: 70-90% of host-to-host performance, depending on packet size, hardware offload support, and CPU capabilities. You're trading 10-30% of network performance for the abstraction of a flat network across hosts. Whether that's a good trade depends on your priorities.
Alternatives to Overlays
Some CNI plugins avoid overlays entirely by using routing:
Calico: Uses BGP to advertise container IP routes to the physical network. Each container gets a real routable IP, no encapsulation. Requires network infrastructure that supports this (you can't do it on AWS without VPC CNI).
Cilium: Can use routing or overlays. In routing mode, it programs the network to route container IPs natively. In overlay mode, it uses VXLAN or Geneve (similar to VXLAN).
Host routing: Some setups simply route container traffic through the host's network stack using the host's IP and port mapping. This is what Docker does by default (NAT from container IPs to host IPs).
Routing is faster than overlays but requires more coordination with the physical network. Overlays are slower but work on any IP network. Classic performance versus flexibility trade-off.
CNI Plugins: Because Standard Container Networking Would Be Too Easy
The Container Network Interface (CNI) is a specification for how container runtimes should configure container networking. Instead of every orchestrator implementing its own networking, CNI defines a plugin interface. The orchestrator calls CNI plugins to set up networking for each container.
This sounds reasonable until you realize there are dozens of CNI plugins, each with different features, performance characteristics, and configuration complexity. Choosing a CNI plugin for Kubernetes is like choosing a web framework for JavaScript: there are too many options, they all have trade-offs, and by the time you've evaluated them all, three new ones have launched.
How CNI Works
When a container starts, the container runtime (containerd, CRI-O) calls a CNI plugin, passing it configuration and the container's network namespace. The CNI plugin:
1. Creates network interfaces (typically veth pairs)
2. Assigns IP addresses to the container
3. Sets up routing inside the container namespace
4. Configures any necessary tunneling or encapsulation
5. Updates external systems (IPAM databases, route servers) as needed
CNI plugins are executables that follow a simple JSON-based protocol. This makes them language-agnostic and theoretically simple to implement. In practice, production CNI plugins are thousands of lines of code dealing with edge cases, performance optimization, and integration with various networking stacks.
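The protocol itself really is small. Here's a toy sketch of the ADD half of it in Python; the handler is written as a pure function so the plumbing is visible, and the interface name, addresses, and config values are all invented for illustration. A real plugin would do the veth, IPAM, and routing work where the comments say so.

```python
"""Toy sketch of the CNI plugin contract. A real plugin is an executable:
the runtime passes network config JSON on stdin, sets CNI_COMMAND /
CNI_CONTAINERID / CNI_NETNS / CNI_IFNAME in the environment, and reads a
result JSON from stdout."""

def cni_add(conf: dict, env: dict) -> dict:
    """Handle CNI_COMMAND=ADD. A production plugin would create a veth pair,
    move one end into the namespace at env["CNI_NETNS"], ask its IPAM module
    for an address, and program routes before returning the result."""
    return {
        "cniVersion": conf.get("cniVersion", "1.0.0"),
        "interfaces": [{"name": env.get("CNI_IFNAME", "eth0")}],
        # Hypothetical address; real plugins delegate this to an IPAM plugin.
        "ips": [{"address": "10.22.0.5/24", "gateway": "10.22.0.1"}],
    }

# Roughly how the runtime would invoke a real plugin binary:
#   echo '{"cniVersion":"1.0.0","name":"demo","type":"toy"}' | \
#     CNI_COMMAND=ADD CNI_NETNS=/var/run/netns/container1 CNI_IFNAME=eth0 ./toy-plugin
conf = {"cniVersion": "1.0.0", "name": "demo", "type": "toy"}
result = cni_add(conf, {"CNI_COMMAND": "ADD", "CNI_IFNAME": "eth0"})
print(result["ips"][0]["address"])  # 10.22.0.5/24
```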
Popular CNI Plugins and Their Trade-offs
Flannel: The simplest overlay network. Uses VXLAN by default. Easy to set up, limited features, decent performance. Good for getting started, outgrown quickly.
Calico: Routing-based networking using BGP. No overlay overhead when in pure routing mode. Can also do overlays when routing isn't viable. Includes network policy enforcement. More complex to operate but better performance.
Cilium: eBPF-based networking and security. Can do routing or overlays. Uses eBPF to bypass much of the Linux network stack for better performance. Includes advanced features like transparent encryption and API-aware network policies. Requires newer kernels (4.9+, realistically 5.x+) and is complex to debug when things break.
Weave: Another overlay network with built-in encryption. Simple to use but performance isn't great. Has declined in popularity as Cilium and Calico improved.
AWS VPC CNI: Gives each pod a real VPC IP address. No overlays, no encapsulation, full AWS network integration. Only works on AWS, and you can exhaust VPC IPs quickly with large clusters.
Azure CNI / AKS: Similar to AWS VPC CNI but for Azure. Integrates with Azure Virtual Networks, assigns Azure IPs to pods.
Multus: Not really a CNI plugin but a CNI multiplexer. Lets you attach multiple network interfaces to a single pod. Used when you need containers connected to multiple networks (management network, data network, storage network).
The choice depends on your environment (bare metal, cloud, hybrid), performance requirements (overlay overhead acceptable or not), and operational complexity tolerance (simple is easy to operate, complex gives more features).
Kubernetes Pod Networking: Why Is This So Complicated?
Kubernetes has strong opinions about networking. The Kubernetes network model requires:
1. Every pod gets its own IP address
2. Pods can communicate with all other pods without NAT
3. Pods can communicate with nodes without NAT
4. The IP that a pod sees itself as is the same IP others see it as
This model is clean from an application perspective (every pod is directly addressable), but it creates operational complexity. You need thousands of IP addresses (one per pod), routing or overlay networks to make pod-to-pod communication work across hosts, and careful IP allocation to avoid conflicts.
Pod-to-Pod Communication: The Surprisingly Complex Flow
Let's trace a packet from Pod A on Node 1 to Pod B on Node 2 using a typical overlay network (VXLAN):
1. Application in Pod A sends packet to Pod B's IP address
2. Packet exits Pod A's network namespace via veth pair into Node 1's namespace
3. Node 1's CNI bridge receives the packet
4. Node 1's routing table knows Pod B is on Node 2 (learned via CNI plugin)
5. Packet is encapsulated in VXLAN (UDP packet to Node 2's IP)
6. VXLAN packet traverses physical network from Node 1 to Node 2
7. Node 2 receives VXLAN packet, decapsulates it
8. Node 2's CNI bridge receives the inner packet (original frame from Pod A)
9. Packet enters Pod B's namespace via veth pair
10. Application in Pod B receives the packet
Count the hops: pod namespace → host namespace → VXLAN encapsulation → physical network → VXLAN decapsulation → host namespace → pod namespace. Seven network stack traversals for what could have been one physical hop if the pods were on the same host.
And we haven't even talked about Kubernetes Services yet.
Services and Kube-proxy: More Layers, More Complexity
Kubernetes Services provide stable IPs and load balancing for pods. Applications talk to Service IPs, and kube-proxy (or eBPF replacements) transparently load balances to backend pods.
This is implemented via iptables (or nftables, or eBPF) NAT rules. Every Service gets a cluster IP, and kube-proxy programs rules that rewrite destination IPs from the Service IP to individual pod IPs.
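Concretely, kube-proxy's iptables mode builds a chain of chains per Service. The fragment below shows the shape in iptables-save format, trimmed down and with invented addresses and chain-name hashes: match the cluster IP, pick a backend at random, then DNAT to a real pod IP.

```
# Sketch of kube-proxy-style rules (addresses and chain hashes invented).
# Step 1: the Service's cluster IP matches in the shared KUBE-SERVICES chain.
-A KUBE-SERVICES -d 10.96.37.11/32 -p tcp --dport 80 -j KUBE-SVC-AAAAAAAAAAAAAAAA
# Step 2: one probability rule per backend selects an endpoint at random.
-A KUBE-SVC-AAAAAAAAAAAAAAAA -m statistic --mode random --probability 0.50 -j KUBE-SEP-BBBBBBBBBBBBBBBB
-A KUBE-SVC-AAAAAAAAAAAAAAAA -j KUBE-SEP-CCCCCCCCCCCCCCCC
# Step 3: the endpoint chains rewrite the destination to a pod IP.
-A KUBE-SEP-BBBBBBBBBBBBBBBB -p tcp -j DNAT --to-destination 10.244.1.5:8080
-A KUBE-SEP-CCCCCCCCCCCCCCCC -p tcp -j DNAT --to-destination 10.244.2.9:8080
```

Multiply that handful of rules by every Service and every endpoint, and the ruleset explosion described next follows directly.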
In iptables mode (still the default in many clusters), every Service creates dozens of iptables rules. A cluster with 1000 Services might have 50,000+ iptables rules. Every packet traversing the host must evaluate these rules. Rule evaluation is O(n) where n is the number of rules. This scales terribly.
IPVS mode (IP Virtual Server) uses the kernel's IPVS load balancer instead of iptables. IPVS uses hash tables instead of linear rule lists, so lookup is O(1) regardless of number of Services. Much better scaling, but IPVS has its own quirks and debugging IPVS is different from debugging iptables.
eBPF-based CNI plugins like Cilium bypass both iptables and IPVS, implementing Service load balancing directly in eBPF programs attached to network interfaces. This is fastest but requires newer kernels and expertise with eBPF tooling.
So now our packet flow includes: pod namespace → veth → iptables/IPVS/eBPF NAT rewriting → VXLAN encapsulation → physical network → VXLAN decapsulation → iptables/IPVS/eBPF reverse NAT → veth → pod namespace. Even more fun.
From IPTables to NFTables: Because We Needed Another Transition
Just as everyone became comfortable with (or at least resigned to) debugging iptables, Linux moved to nftables. NFTables is the modern replacement for iptables, ip6tables, arptables, and ebtables, consolidating them into a single framework with better performance and more flexible rule syntax.
The transition is, predictably, messy. Many CNI plugins still use iptables. Some support nftables. Some use eBPF and bypass both. Kubernetes itself defaults to iptables for kube-proxy but has support for nftables mode. The result is a mixture of iptables and nftables rules in many clusters, each with different semantics and performance characteristics.
What NFTables Actually Improves
NFTables addresses several iptables limitations:
Better performance: NFTables uses more efficient data structures and evaluates rules faster. For large rulesets (thousands of rules), nftables can be 10x faster than iptables.
Single transaction model: IPTables applies rule changes one at a time. Nftables uses atomic transactions, so complex rule updates are applied all at once, with no intermediate state where some rules are updated and others aren't.
More flexible matching: NFTables supports sets, maps, and verdict maps, enabling complex matching logic that required dozens of iptables rules to implement.
Unified syntax: IPTables had different commands (iptables, ip6tables, arptables, ebtables) for different protocols. Nftables has one command (nft) with consistent syntax.
Better scripting: NFTables rules can be dumped and reloaded as text files more easily than iptables. Backup and restore are simpler.
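To see what sets and verdict maps buy you, here's a small nftables ruleset sketch (the table, chain names, and element values are invented): a single set match replaces one-rule-per-port, and a verdict map dispatches straight to a per-service chain. Loading it with `nft -f` applies the whole thing atomically.

```
# Hypothetical ruleset; load atomically with: nft -f ruleset.nft
table inet demo {
    set web_ports {
        type inet_service
        elements = { 80, 443, 8080 }
    }
    map svc_verdict {
        type ipv4_addr : verdict
        elements = { 10.96.0.10 : jump dns_svc, 10.96.37.11 : jump web_svc }
    }
    chain dns_svc { }
    chain web_svc { }
    chain input {
        type filter hook input priority 0; policy drop;
        tcp dport @web_ports accept
        ip daddr vmap @svc_verdict
    }
}
```

The verdict map is the interesting part: one hash lookup replaces the linear chain of per-Service match rules that iptables would need.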
The Migration Reality
Despite nftables being better, migration is slow:
Tooling inertia: Decades of scripts, automation, and documentation assume iptables. Rewriting all of this takes time.
Knowledge gap: Engineers know iptables. Learning nftables syntax and debugging takes effort.
Compatibility layer confusion: Linux provides iptables-nft, a compatibility layer that translates iptables commands to nftables rules. This works but creates a franken-system where you think you're using iptables but you're actually using nftables under the hood. Debugging this is delightful…
CNI plugin support: Not all CNI plugins support nftables yet. Running Kubernetes with nftables mode enabled might break networking depending on your plugin.
The industry is slowly moving to nftables, but the transition will take years. In the meantime, you get to debug both iptables and nftables simultaneously. Fun times…
Container Networking Beyond Kubernetes
Kubernetes dominates container orchestration mindshare, but it's not the only game. Different orchestrators make different networking trade-offs.
Docker Swarm: The Simpler Time
Docker Swarm is Docker's native orchestration. It's simpler than Kubernetes, which means less flexibility but also less complexity.
Swarm uses overlay networks via libnetwork, Docker's built-in networking library. Each overlay network is a separate VXLAN. Containers on the same overlay network can communicate regardless of which host they're on. The implementation is straightforward: VXLAN tunnel, container bridge, done.
Swarm also has ingress routing mesh for load balancing. Every node in the Swarm can accept traffic for any Service, and Swarm routes it to appropriate containers using IPVS. This is similar to Kubernetes Services but built into Swarm rather than bolted on via kube-proxy.
The downside is limited extensibility. You're stuck with Docker's networking decisions. No CNI plugin system, no eBPF acceleration, no advanced network policies. It works, it's simple, but it doesn't scale to Kubernetes-size deployments or complexity.
Slurm: HPC Doesn't Do Containers (Much)
Slurm is the dominant job scheduler in high-performance computing, where long-running jobs traditionally ran on bare metal with MPI for communication. Containers were seen as unnecessary overhead, another layer of abstraction that just slows things down.
But containers have snuck into HPC via Singularity (now Apptainer), a container runtime designed for HPC. Unlike Docker, Singularity containers run as the user's own UID, don't require root privileges, and integrate better with HPC environments (shared filesystems, InfiniBand, GPUs).
Singularity containers typically use the host network namespace directly. No network isolation, no overlay networks, no NAT. The container sees the host's network interfaces and can use InfiniBand, high-performance Ethernet, or whatever networking the host has. This eliminates virtualization overhead but also eliminates isolation. For HPC, where jobs run on dedicated compute nodes for hours or days, isolation matters less than performance.
Slurm itself doesn't orchestrate container networking. It schedules jobs, those jobs might run in containers, but the container networking is the container runtime's problem (typically just using host networking).
Hypervisors: When Containers Need Even More Isolation
Sometimes container isolation isn't enough. For security-sensitive workloads or multi-tenant environments, you want full VM-level isolation. Enter nested virtualization: containers running inside VMs.
KVM: The standard Linux hypervisor. Each VM gets virtual network interfaces (virtio-net), which connect to bridges or Open vSwitch on the host. Inside the VM, you run containers with their own networking stack. Now you have VM networking (virtual interfaces, bridges, VXLAN between hosts) and container networking (network namespaces, veth pairs, CNI plugins) stacked on top of each other. Two layers of virtualization, two layers of networking overhead.
The packet path for container-to-container communication across VMs on different hosts: container namespace → veth → CNI bridge → container VXLAN → VM's network namespace → virtio-net → host bridge → physical network → host bridge → virtio-net → VM's network namespace → container VXLAN decap → CNI bridge → veth → container namespace. Thirteen (see, that lucky number says a lot about this architecture…) network stack traversals. The performance is about what you'd expect.
Cloud providers do this extensively. Your Kubernetes (EKS) cluster on AWS runs on EC2 instances (VMs), which run containers. Two layers of virtualization is the price of multi-tenancy and security isolation.
MicroVMs: Fast Boot, Less Overhead
MicroVMs like Firecracker (AWS Lambda, Fargate) and Cloud Hypervisor aim for VM-level isolation with near-container overhead. They boot in milliseconds, use minimal memory, and narrow the gap between VMs and containers.
Firecracker uses a minimal device model. One virtual network interface (virtio-net), no legacy devices, no BIOS. Inside the microVM, you typically run a single container or minimal container runtime. The goal is to make the VM so lightweight that you don't notice it's there.
Networking is simplified: each microVM gets a tap device on the host, connected to a bridge or routing table. Traffic flows microVM → tap → host routing → physical network. No overlay networks inside the microVM (usually), just simple routing or NAT.
MicroVMs reduce but don't eliminate virtualization overhead. You still have virtio-net, still have a VMM (virtual machine monitor) managing the guest, still have context switches between guest and host. It's better than full VMs but not as fast as bare containers.
The AI Training Paradox: InfiniBand Meets Kubernetes
Here's a fun contradiction: AI training and inference increasingly run in Kubernetes, but high-performance AI workloads require InfiniBand or other RDMA-capable networks for GPU-to-GPU communication. Kubernetes adds virtualization overhead (network namespaces, overlay networks, CNI plugins). InfiniBand is all about eliminating overhead (kernel bypass, zero-copy, sub-microsecond latency). Combining them creates architectural tension.
Training GPT-4-scale models requires 10,000+ GPUs communicating constantly via collective operations (all-reduce, all-gather). These operations are latency-sensitive. Every microsecond of network latency multiplies into hours of additional training time. InfiniBand provides sub-microsecond latency. Kubernetes networking adds microseconds of overhead. This is a problem.
The InfiniBand-in-Containers Challenge
To use InfiniBand from containers, you need:
Host InfiniBand drivers and RDMA stack: The kernel modules and userspace libraries must be present on the host.
Device passthrough to containers: The InfiniBand device (/dev/infiniband/*) must be accessible inside the container. This means mounting host devices into containers, breaking some isolation.
Shared RDMA resources: InfiniBand uses kernel resources (protection domains, queue pairs, memory regions) that must be shared between containers. Proper isolation requires SR-IOV or careful resource management.
CNI plugin support: Not all CNI plugins understand InfiniBand. You need specialized CNI plugins (Multus + Whereabouts + SR-IOV CNI, or Mellanox's (now Nvidia’s) UFM integration) to assign InfiniBand interfaces to pods.
The typical solution is Multus, which lets you attach multiple network interfaces to pods. One interface uses the standard CNI plugin for Kubernetes networking (Calico, Flannel, whatever). Another interface directly attaches the InfiniBand device via SR-IOV CNI, bypassing most of the Kubernetes networking stack.
Applications inside the pod use the InfiniBand interface for high-performance communication (GPU-to-GPU, storage access) and the standard interface for Kubernetes control plane communication (health checks, metrics, logs). Two separate networks, two separate IP addressing schemes, completely different performance characteristics.
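In Kubernetes terms, the two-network pattern looks roughly like this. It's a sketch: the NetworkAttachmentDefinition name, IPAM range, image, and device-plugin resource name are all site-specific and invented here.

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ib-sriov              # the secondary, high-performance network
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
  }'
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  annotations:
    # Multus attaches this as a second interface alongside the default CNI's eth0.
    k8s.v1.cni.cncf.io/networks: ib-sriov
spec:
  containers:
  - name: worker
    image: my-training-image   # placeholder
    resources:
      limits:
        # Advertised by the SR-IOV device plugin; the resource name varies by site.
        nvidia.com/mlnx_sriov_ib: 1
```

The pod ends up with eth0 on the cluster network for control-plane traffic and a VF-backed interface for RDMA, exactly the split described above.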
SR-IOV: Hardware Virtualization for Network Devices
SR-IOV (Single Root I/O Virtualization) lets a physical network device present itself as multiple virtual devices. Each virtual device (VF, virtual function) can be assigned to a different container or VM with near-native performance.
This bypasses most of the container networking stack. Instead of veth pairs and bridges and overlays, the container gets direct access to a slice of the physical NIC. Packets go directly from application to hardware with minimal kernel involvement.
The downside is complexity. SR-IOV requires compatible hardware, kernel drivers, and careful configuration. You're limited by the number of VFs the hardware supports (typically 64-256 per device). Debugging SR-IOV issues requires understanding PCI passthrough, IOMMU, and device firmware.
For AI training clusters running thousands of GPUs, SR-IOV InfiniBand is increasingly standard. The complexity is worth it to avoid container networking overhead on latency-critical paths.
The Performance Gap
Let's quantify the overhead:
Bare metal InfiniBand: 0.5-0.7 microseconds latency, 200 Gbps bandwidth (HDR), near-zero CPU overhead with RDMA.
InfiniBand via SR-IOV in containers: 0.6-0.8 microseconds latency (slightly higher due to IOMMU and virtualization), 190+ Gbps bandwidth, still near-zero CPU overhead.
InfiniBand through standard container networking: Don't even try. RDMA doesn't work through veth pairs and VXLAN tunnels. You'd fall back to IP over InfiniBand with kernel stack overhead, losing all the benefits of RDMA.
The paradox is real but solvable: use Kubernetes for orchestration and standard services, bypass Kubernetes networking for performance-critical data paths. It's awkward, complex to configure, and philosophically questionable (why use Kubernetes if you're working around its networking?), but it works. Welcome to production AI infrastructure.
IPv4 Exhaustion in Container Networks: The Self-Inflicted Wound
Remember when we ran out of IPv4 addresses on the public Internet? Now we're running out of IPv4 addresses in private networks too, and containers are to blame.
Traditional infrastructure uses one IP per server, maybe a few dozen per rack, hundreds or thousands per datacenter. With containers, you need one IP per container. A modest Kubernetes cluster with 100 nodes running 30 pods per node needs 3,000 IPs just for pods, plus IPs for nodes, Services, load balancers, and infrastructure. A large cluster easily consumes a /16 (65,536 IPs) or multiple /16 blocks.
The RFC 1918 private address space (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) seemed infinite when it was defined in 1996. Now organizations with large Kubernetes deployments are running out. A /8 provides 16 million addresses. Sounds like plenty until you realize you're burning through millions per cluster.
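The exhaustion math is worth doing explicitly. A quick check with Python's stdlib `ipaddress` module:

```python
# The address arithmetic behind the sprawl claims.
import ipaddress

pods = 100 * 30                      # 100 nodes x 30 pods per node
print(pods)                          # 3000

slash16 = ipaddress.ip_network("10.123.0.0/16")
print(slash16.num_addresses)         # 65536

rfc1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
total_private = sum(n.num_addresses for n in rfc1918)
print(total_private)                 # 17891328 (~17.9M private IPv4 addresses)

# How many clusters, each reserving a /16, fit in all of RFC 1918?
print(total_private // slash16.num_addresses)  # 273
```

273 clusters sounds like a lot until you remember the same space also has to hold your offices, VPNs, VMs, and every other private network in the organization.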
The IP Sprawl Problem
Container churn makes it worse. Pods are ephemeral. They start, run for minutes or hours, then terminate. Each pod consumes an IP address during its lifetime. If you're deploying hundreds of times per day (hello CI/CD velocity!), you're allocating and releasing thousands of IPs daily.
IP Address Management (IPAM) systems must track this churn. Kubernetes assigns pod IPs from configured CIDR blocks. When blocks fill up, you add more. Eventually you run out of RFC 1918 space in your entire organization. Now what?
Some organizations resort to using public IP space for private networks (please don't; yes, unannounced DoD space is still public address space…), others implement complex NAT schemes, and a few actually migrate to IPv6 (the correct solution that nobody wants to do).
IPv6: The Solution We're Too Stubborn to Use
IPv6 has 340 undecillion addresses. For comparison, IPv4 has 4.3 billion. IPv6 has enough addresses to give every atom on Earth's surface a /48 network. Container IP sprawl is not a problem in IPv6.
Yet adoption is slow. As of 2025, most Kubernetes clusters run IPv4-only or dual-stack with IPv4 primary. Why?
Tooling gaps: Not all CNI plugins fully support IPv6. Some have bugs. Dual-stack (IPv4 + IPv6) is complex to configure.
Application compatibility: Older applications assume IPv4. They parse IP addresses incorrectly, use IPv4-only libraries, or hardcode IPv4 addresses.
Network infrastructure: Not all switches, routers, and load balancers fully support IPv6, especially in older datacenters.
Knowledge gap: Network engineers learned IPv4. IPv6 requires learning new concepts (no broadcast, neighbor discovery instead of ARP, SLAAC, etc.).
Inertia: IPv4 works (mostly), migrating is effort, nobody wants to be first.
The irony is thick: we're running out of private IP addresses because we're deploying too many containers, but we refuse to adopt the technology (IPv6) that solves this problem, instead creating increasingly complex workarounds (NAT, CIDR reclamation, IP address overlapping) that make networking harder to operate and debug.
Workarounds for IPv4 Exhaustion
Since we're not moving to IPv6, we get creative:
Carrier-Grade NAT (CGN): NAT multiple private networks behind shared private IPs, then NAT again to public IPs. NAT on NAT. Debugging this is a nightmare.
Overlapping IP space: Use the same RFC 1918 ranges in different clusters, rely on VPNs or overlay networks to isolate them. Works until you need to connect the clusters.
Aggressive CIDR reclamation: Shrink pod CIDR blocks to minimum necessary, reclaim unused space aggressively. Operationally intensive.
Host networking: Run pods in the host network namespace, avoiding pod IP allocation entirely. Destroys isolation, creates port conflicts, but saves IPs.
Service meshes with overlay routing: Tools like Istio can route traffic without unique pod IPs, using labels and service discovery instead. Complex but avoids IP exhaustion.
Or, you know, just use IPv6. But that would be too easy.
The Debugging Nightmare: When Container Networking Breaks
Container networking failures are spectacular because there are so many layers where things can break:
Network namespace misconfiguration: Wrong interfaces in wrong namespaces, or namespaces that weren't cleaned up from previous containers.
Veth pair failures: One end of the pair is up, the other is down. Or both ends are in the wrong namespaces. Or the pair was deleted but state remains.
Bridge/routing misconfigurations: CNI bridge has wrong IP, routing tables missing entries, or conflicting routes from multiple CNI plugins.
Overlay network problems: VXLAN tunnel endpoints misconfigured, VNI conflicts, MTU mismatches causing silent packet drops.
Iptables/nftables rule conflicts: Rules from CNI plugins, kube-proxy, and manual configuration interfering with each other. Rule evaluation order matters and is non-obvious.
IPAM exhaustion: CNI plugin runs out of IPs to allocate, pods fail to start with cryptic errors.
MTU mismatches: Overlay adds 50 bytes, but nobody reduced MTU in containers, packets get fragmented or dropped.
DNS resolution failures: Kube-dns or CoreDNS misconfigured, pods can't resolve service names (it’s always DNS).
Service routing broken: Kube-proxy rules not updating, or IPVS load balancing to wrong backends.
Physical network issues blamed on containers: Sometimes it really is a bad cable or switch, not the container networking.
Debugging requires understanding all these layers and having tools to inspect each one. You need to know ip netns, tcpdump, iptables/nftables, ethtool, routing tables, bridge utilities, and CNI-specific debugging tools. Good luck.
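Of all those failure modes, the MTU mismatch is the one you can reason about with pure arithmetic. A minimal sketch (the 50-byte figure is standard VXLAN over IPv4: outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8; the other entries are the analogous sums):

```python
# Encapsulation overhead in bytes for common overlays.
OVERHEAD = {
    "vxlan-ipv4": 50,  # 14 (Ethernet) + 20 (IPv4) + 8 (UDP) + 8 (VXLAN)
    "vxlan-ipv6": 70,  # outer IPv6 header is 40 bytes instead of 20
    "geneve": 50,      # base header; TLV options add more
}

def inner_mtu(physical_mtu: int, overlay: str) -> int:
    """Largest MTU a container interface can use without fragmentation."""
    return physical_mtu - OVERHEAD[overlay]

print(inner_mtu(1500, "vxlan-ipv4"))  # 1450
print(inner_mtu(9000, "vxlan-ipv4"))  # jumbo frames leave ample headroom
```

If the container interface is left at 1500 while the overlay eats 50 bytes, full-size packets either fragment or, with DF set, silently disappear, which is exactly the "silent packet drop" symptom above.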
Essential Debugging Tools
When container networking breaks, reach for:
ip netns list: Show network namespaces
ip netns exec <namespace> <command>: Run commands in a specific namespace
nsenter: Enter a container's namespace for debugging
tcpdump: Capture traffic at various points (host interface, container interface, bridge)
iptables-save / nft list ruleset: Dump all firewall rules
conntrack -L: Show connection tracking state
bridge fdb show: Show the bridge forwarding database (MAC-to-port mappings, including VXLAN remote VTEP entries)
ss -tunap: Show TCP and UDP sockets with their owning processes; combine with ip netns exec to inspect sockets inside a specific namespace
kubectl get pods -o wide: Basic Kubernetes pod info including IPs
CNI plugin logs: Usually in /var/log or journalctl, essential for understanding CNI failures
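A lightweight complement to those tools: every process exposes its network namespace as a symlink at /proc/<pid>/ns/net, and two processes share a network stack exactly when those links point to the same identifier. This sketch groups PIDs by namespace (Linux only; PIDs you lack permission to read are skipped):

```python
import os
from collections import defaultdict

def processes_by_netns():
    """Map each network-namespace identifier to the PIDs living in it."""
    groups = defaultdict(list)
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            # Symlink target looks like 'net:[4026531992]'.
            ns = os.readlink(f"/proc/{entry}/ns/net")
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or we lack permission
        groups[ns].append(int(entry))
    return dict(groups)

if __name__ == "__main__":
    for ns, pids in sorted(processes_by_netns().items()):
        print(f"{ns}: {len(pids)} processes")
```

This is handy for spotting a leaked namespace (one with a lone pause process) or confirming that a sidecar really shares its pod's network stack.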
The Uncomfortable Truth About Container Networking
Container networking is complex because we're solving a hard problem: make thousands of isolated processes with independent network stacks communicate efficiently across multiple hosts while maintaining security, observability, and operational simplicity. Those requirements are fundamentally in tension.
We chose complexity. Every container gets its own network stack because isolation is valuable. We use overlay networks because they provide location transparency. We layer CNI plugins on top of CNI plugins because we want extensibility. We run containers inside VMs because we need security. We integrate InfiniBand via SR-IOV because we need performance. We stick with IPv4 because we're too lazy to migrate.
Each decision makes sense in isolation. The combination creates a system with so many layers, abstractions, and moving parts that it's nearly impossible for any one person to understand completely. Debugging requires expertise in kernel namespaces, virtual networking, overlay protocols, iptables/nftables, CNI specifications, Kubernetes internals, and the specific CNI plugin you're using.
And we did this to ourselves. Containers were supposed to be lightweight. We made them heavy by giving each one a full network stack. Kubernetes was supposed to be simple. We made it complex by trying to support every networking model. High-performance networking was supposed to be fast. We made it slow by inserting virtualization layers.
But it works. Barely, sometimes, with enough expertise and operational overhead. Container networking enables microservices architectures, rapid deployment, and massive scale. The cost is complexity. Whether that's a good trade depends on your priorities, but we're too far in to turn back now.
Living with Container Networking
If you're deploying containers in production, here's what you need to accept:
Virtualization overhead is real: Every abstraction layer costs performance. Network namespaces, veth pairs, overlay networks, and CNI processing add latency and reduce throughput. Measure the impact, optimize where critical, accept it elsewhere.
Complexity scales with features: Simple CNI plugins are easy to operate but limited. Advanced CNI plugins offer more features but require more expertise. Choose based on your operational capacity, not just feature lists.
IPv6 is the answer: Container IP sprawl is a self-inflicted wound. IPv6 solves it. Bite the bullet and migrate, or continue band-aiding IPv4 exhaustion with increasingly complex workarounds.
Bypass when necessary: For performance-critical workloads (AI training, HPC, high-throughput data processing), bypass container networking overhead using SR-IOV, host networking, or dedicated physical interfaces. Container networking is good enough for most things, not optimal for everything.
Invest in observability: Container networking fails in complex ways. Monitoring, logging, and distributed tracing are essential. You can't fix what you can't see, and you can't see inside network namespaces without proper tooling.
Test failure modes: Don't just test happy path. Test what happens when CNI plugins fail, when overlay networks have packet loss, when IP allocation exhausts. Failures will happen in production; understanding them in testing is cheaper.
Document your networking: Future you (and your team) will thank present you. Document which CNI plugins you use, why, what IP ranges are allocated, how overlays are configured, and where physical network integrates with container networking.
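For the observability and failure-testing points above, even a trivial probe beats nothing. A hedged sketch of a TCP reachability check you might run from inside a pod (the commented-out endpoint names are made-up examples, not real services):

```python
import socket

def check_endpoints(endpoints, timeout=2.0):
    """Attempt a TCP connect to each (host, port); return {endpoint: reachable?}."""
    results = {}
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # DNS failure, connection refused, and timeout all land here --
            # each points at a different layer from the failure list above.
            results[(host, port)] = False
    return results

# Hypothetical in-cluster targets: a service VIP and the cluster DNS server.
# print(check_endpoints([("my-service.default.svc", 80), ("10.96.0.10", 53)]))
```

Run from inside the pod's namespace, a probe like this quickly separates "DNS is broken" from "routing is broken" from "the backend is down", which is most of the triage battle.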
Container networking is a mess of abstractions stacked on abstractions. It's slower than it should be, more complex than it needs to be, and harder to debug than anyone wants. But it enables deploying and scaling applications at a speed that wasn't possible before. Whether that's worth the operational overhead depends on your use case, but for most of us, we're stuck with it.
Welcome to modern infrastructure, where every process gets its own network stack, and nobody can remember why we thought that was a good idea. But it makes CI/CD faster, so we ship it anyway.