MTU: The Simple Number That Breaks Everything
Maximum Transmission Unit seems like it should be simple. It's just a number representing the largest packet your network can handle. Ethernet is 1500 bytes, done, right? Except it's not done, because we have tunnels that add overhead, PPPoE that subtracts bytes, jumbo frames that some equipment supports and some doesn't, VPNs that encapsulate everything, Path MTU Discovery that's supposed to fix this automatically but breaks when firewalls block ICMP, TCP that tries to negotiate around the problem but gets sabotaged by middleboxes, and fragmentation that works in theory but creates performance problems in practice. MTU is a number, but that number propagates through every layer of the network stack, interacts with every protocol, and breaks in creative ways when any component doesn't handle it correctly. This is the story of how a simple size limit became one of networking's most persistent troubleshooting problems.
Let's explore what MTU actually is, how various mechanisms try to deal with it, why they all break in production networks, and the workarounds we've built to paper over the fundamental problem.
MTU Basics: Size Matters
The Maximum Transmission Unit is the largest packet (in bytes) that a network interface can transmit without fragmentation. Different link types have different MTUs:
- Ethernet: 1500 bytes (the standard since forever)
- Wi-Fi: also 1500 bytes (inherited from Ethernet)
- PPPoE: 1492 bytes (1500 minus 8-byte PPPoE header)
- VPN/Tunnel: variable, typically 1400-1450 bytes (depends on tunnel overhead)
- Jumbo Frames: 9000 bytes (or sometimes 9216, depending on vendor)
- Token Ring: 4464 bytes (ancient history)
- FDDI: 4352 bytes (also ancient)
- ATM: 9180 bytes (mostly dead)
- Loopback: 65536 bytes (local only, no physical constraints)
The MTU includes the IP header and everything above it, but not the layer 2 framing. So for standard Ethernet:
1500 bytes MTU =
  20 bytes IP header (minimum, can be up to 60 with options)
+ 20 bytes TCP header (minimum, can be up to 60 with options)
+ up to 1460 bytes of actual data (with minimum headers)
This seems straightforward until you realize that every path through the Internet might have different MTUs along the way, and discovering the smallest MTU (the bottleneck) is surprisingly difficult.
MSS: TCP's MTU Abstraction
TCP introduced the Maximum Segment Size (MSS) to abstract away MTU concerns from applications. During the TCP three-way handshake, each side advertises its MSS in the SYN packet.
MSS = MTU - IP header - TCP header
For standard Ethernet:
MSS = 1500 - 20 (IP) - 20 (TCP) = 1460 bytes
The idea is that TCP will never send more than MSS bytes of data in a segment, preventing fragmentation. Each side independently advertises its MSS based on its local interface MTU. The sender uses the minimum of the two advertised MSS values.
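The arithmetic is simple enough to sketch in shell (assuming minimal 20-byte IP and TCP headers; real headers can be larger when options are present):

```shell
# MSS for a few common link MTUs, assuming minimal IP and TCP headers
for mtu in 1500 1492 1400; do
  echo "MTU $mtu -> MSS $((mtu - 20 - 20))"
done
# MTU 1500 -> MSS 1460
# MTU 1492 -> MSS 1452
# MTU 1400 -> MSS 1360
```

The 1452 value is why PPPoE connections commonly advertise that MSS, and 1360 is a typical clamped value behind VPNs.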
This works perfectly when:
- Both ends have the same MTU
- Every hop in between has equal or larger MTU
- No tunnels or encapsulation add overhead
In real networks, none of these assumptions hold reliably.
The Fragmentation Problem
When a router receives a packet larger than the outgoing interface's MTU, it has two options:
Option 1: Fragment It
Split the packet into multiple smaller fragments:
Original packet (1500 bytes):
  [IP Header][Data bytes 1-1480]

After fragmentation for a 1400-byte MTU:
  Fragment 1: [IP Header + fragment info][Data bytes 1-1376]
  Fragment 2: [IP Header + fragment info][Data bytes 1377-1480]

(Fragment 1 carries 1376 bytes rather than the 1380 the MTU would allow, because the Fragment Offset field counts in 8-byte units: every fragment except the last must carry a multiple of 8 data bytes.)
Each fragment gets its own IP header with:
- Fragment Offset: Where this fragment belongs in the original packet
- More Fragments (MF) flag: Set on all fragments except the last
- Identification field: All fragments from the same packet share an ID
The destination reassembles fragments back into the original packet.
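A minimal sketch of how the sizes work out, assuming a 20-byte IP header and remembering the 8-byte granularity of the Fragment Offset field:

```shell
# Split a 1480-byte payload for a 1400-byte next-hop MTU
mtu=1400; iphdr=20; payload=1480
per_frag=$(( (mtu - iphdr) / 8 * 8 ))   # 1376: rounded down to a multiple of 8
off=0
while [ "$off" -lt "$payload" ]; do
  len=$(( payload - off ))
  mf=0                                  # More Fragments flag
  if [ "$len" -gt "$per_frag" ]; then
    len=$per_frag; mf=1
  fi
  # the on-wire offset field would store off/8
  echo "fragment: offset=$off len=$len MF=$mf"
  off=$(( off + len ))
done
# fragment: offset=0 len=1376 MF=1
# fragment: offset=1376 len=104 MF=0
```

Note the loss-amplification problem in miniature: losing the 104-byte second fragment forces retransmission of the entire 1500-byte original.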
Why fragmentation seems like a solution: The network can handle any packet size by splitting and reassembling.
Why fragmentation is terrible:
- Performance: Routers must spend CPU cycles fragmenting. Destinations must buffer fragments and reassemble. This is slow.
- Loss amplification: If any single fragment is lost, the entire original packet must be retransmitted. One lost 100-byte fragment forces retransmission of the whole 1500-byte packet.
- Firewall problems: Firewalls inspecting the first fragment see TCP/UDP headers, but subsequent fragments have no L4 headers. Many firewalls drop fragments or handle them poorly.
- Security issues: Fragment attacks (tiny fragments, overlapping fragments) have been used for firewall evasion and DoS attacks.
- NAT breaks: NAT devices must reassemble fragments to see port numbers for translation, adding processing overhead and state.
- MTU discovery breaks: If the response fragments are blocked by firewalls, connections fail mysteriously.
Modern best practice: don't fragment. But how?
Option 2: Drop It and Send ICMP
The router can drop the packet and send an ICMP error message back to the sender.
For IPv4: ICMP Type 3, Code 4 (Fragmentation Needed and DF Set) includes the MTU of the next hop.
For IPv6: ICMPv6 Type 2 (Packet Too Big, PTB) includes the MTU. IPv6 eliminated router fragmentation entirely; only endpoints can fragment.
The sender receives this ICMP message and should:
- Reduce its packet size to the indicated MTU
- Retransmit with smaller packets
- Remember this MTU for this destination (PMTU cache)
This is called Path MTU Discovery (PMTUD), and it's a beautiful idea that breaks constantly.
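On Linux, the per-destination PMTU cache mentioned above can be inspected directly (a sketch using the standard `iproute2` tools; 203.0.113.1 is a placeholder documentation address):

```shell
# Show the route for a destination; a cached PMTU appears as "mtu N" in the output
ip route get 203.0.113.1
# Flush cached route exceptions, e.g. after a path change, forcing rediscovery
ip route flush cache
```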
The DF Bit: Don't Fragment, Please
The IP header has a Don't Fragment (DF) bit. When set, routers must not fragment this packet. If the packet is too large, the router drops it and sends ICMP Fragmentation Needed.
Why DF exists:
- Applications can signal they don't want fragmentation
- Enables Path MTU Discovery
- Avoids fragmentation's performance problems
Modern TCP behavior: Almost all TCP implementations set DF by default. TCP wants to do its own size management at Layer 4 rather than rely on IP fragmentation.
The problem: This makes TCP entirely dependent on PMTUD working correctly. When PMTUD breaks (and it does), TCP connections blackhole.
Path MTU Discovery: The Mechanism That Should Work
PMTUD (RFC 1191 for IPv4, RFC 8201 for IPv6) is elegant in theory:
- Start optimistically: Send packets at your local MTU with DF set
- Receive ICMP: If a router sends Fragmentation Needed, reduce MTU
- Converge: Eventually discover the path's true MTU
- Cache it: Remember this MTU for future connections
- Timeout: Periodically probe with larger packets in case path changed
This should just work. It doesn't.
Why PMTUD Breaks
Firewall ICMP Filtering: The number one reason PMTUD fails. Network security teams, having read somewhere that ICMP can be used for attacks, block all ICMP. This breaks PMTUD completely. The sender never receives Fragmentation Needed messages, continues sending too-large packets, and everything times out.
Asymmetric Routing: ICMP messages might take a different path back and get filtered somewhere else.
Broken ICMP Generation: Some routers don't properly generate ICMP Fragmentation Needed, either due to bugs, misconfiguration, or CPU protection (rate limiting ICMP).
NAT/Stateful Firewalls: Many stateful devices don't associate ICMP errors with the original connection, dropping them as unsolicited ICMP.
Wrong MTU in ICMP: Some broken routers send ICMP Fragmentation Needed but report the wrong MTU (usually 0 or their inbound MTU instead of outbound).
ICMP Rate Limiting: Routers rate-limit ICMP generation to prevent DoS. During a burst of large packets, only the first few generate ICMP messages.
IPv6 PTB Validation: RFC 8201 requires validating that PTB messages actually correspond to a flow. Many implementations are overly strict, rejecting legitimate PTB messages.
The result: PMTUD works in lab networks with no ICMP filtering. It fails in production networks where security teams block ICMP. This creates the dreaded "MTU black hole."
The MTU Black Hole
Here's a common scenario:
- User connects to VPN (MTU now 1400 bytes due to tunnel overhead)
- User's computer tries to load a website
- TCP handshake succeeds (small SYN/SYN-ACK packets)
- HTTP request sent (small packet)
- Server responds with large packet (1500 bytes, the server doesn't know about the VPN)
- VPN gateway can't forward 1500-byte packet through 1400-byte tunnel
- VPN gateway drops packet and sends ICMP Fragmentation Needed
- Corporate firewall blocks ICMP (because "security")
- Server never learns it needs smaller packets
- Server retransmits same 1500-byte packet repeatedly
- Every retransmission is dropped
- Connection times out after 2+ minutes
- User reports "Internet is broken"
This is the MTU black hole. Small transfers work (initial handshake, small requests), but large transfers fail mysteriously. It's one of the most frustrating networking problems because:
- Basic connectivity works (ping succeeds)
- Some sites work, others don't (depends on their packet sizes)
- It's intermittent (depends on transfer size)
- Users report "slow" instead of "broken" (because retransmissions eventually time out)
Diagnosing MTU Black Holes
The classic test:
```bash
# Ping with various sizes, with DF set (-M do)
ping -M do -s 1472 example.com  # 1472 + 28 (IP + ICMP headers) = 1500, should work on Ethernet
ping -M do -s 1473 example.com  # Would require fragmentation, but DF prevents it
```
If 1472 works but 1473 times out silently, you have PMTUD failure.
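Where available, `tracepath` can locate the hop where the MTU drops, assuming the path actually returns the relevant ICMP errors (example.com is a placeholder):

```shell
# Probe per-hop path MTU without root privileges;
# lines like "pmtu 1400" show where the path MTU shrinks
tracepath -n example.com
```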
For TCP specifically:
```bash
# Try fetching with curl
curl -v https://example.com/largefile
# If it hangs after the initial connection, an MTU black hole is likely
```
Packet captures show:
- TCP handshake completes
- Small packets in both directions work
- Large packet sent, never acknowledged
- Retransmissions of same large packet
- No ICMP Fragmentation Needed received
MSS Clamping: The Ugly Workaround
Since PMTUD is unreliable, the networking community invented MSS clamping (MSS adjustment, MSS rewriting).
The idea: intercept TCP handshakes and rewrite the MSS option to a safe value.
Original SYN packet MSS: 1460
Gateway rewrites MSS to: 1360 (accounting for VPN overhead)
Both sides now use 1360 as their MSS, preventing packets larger than will fit in the tunnel.
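On a Linux gateway this is typically a one-line netfilter rule (a sketch; the interface name `wg0` and the value 1360 are illustrative, not prescriptive):

```shell
# Clamp MSS on forwarded SYNs down to the path MTU the kernel knows about
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

# Or pin an explicit value for traffic leaving a known tunnel interface
iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --set-mss 1360
```

Matching on `--tcp-flags SYN,RST SYN` ensures only handshake packets are rewritten, since the MSS option only appears in SYN segments.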
Where MSS clamping happens:
- VPN gateways (almost universal)
- PPPoE routers (common)
- Cellular gateways (common)
- Firewalls (sometimes)
- Load balancers (occasionally)
Why MSS clamping works:
- Prevents MTU problems proactively
- No dependency on ICMP
- Transparent to endpoints
- Handles the most common case (TCP, which is most traffic)
Why MSS clamping is ugly:
- Layer violation (Layer 3/4 device manipulating Layer 4 option)
- Only works for TCP (UDP is unprotected)
- Middlebox must understand TCP options
- Must recalculate TCP checksums
- Doesn't help existing connections (only new ones)
- Slightly reduces throughput (smaller packets = more overhead)
The reality: MSS clamping is everywhere. Every VPN gateway does it. It's the workaround that makes modern networks functional despite PMTUD being broken. Network engineers grudgingly accept it because the alternative (users calling about broken websites) is worse.
TCP MTU Probing: Trying Again
Since PMTUD fails in the wild, TCP implementations added their own mechanisms.
RFC 4821: Packetization Layer Path MTU Discovery (PLPMTUD)
The idea: Don't trust ICMP, probe with actual data packets.
How it works:
- Start with safe small MTU (usually 1280 for IPv6 compatibility)
- Connection works but uses small packets
- Periodically send probe packets slightly larger
- If probe is acknowledged, increase packet size
- If probe is lost (after retransmission timeout), stick with current size
- Gradually converge to maximum working size
Advantages:
- No ICMP dependency
- Works even with ICMP filtering
- Eventually finds optimal MTU
- Self-correcting if path changes
Disadvantages:
- Starts with suboptimal small packets (reduced throughput initially)
- Takes time to converge (multiple RTTs)
- Lost probe packets might be congestion, not MTU (false signal)
- Not universally implemented
- Adds complexity to TCP stack
Modern Linux and Windows implement variants of this. It helps but doesn't completely solve the problem.
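On Linux, TCP-level probing is controlled by sysctls (values as documented in the kernel's ip-sysctl documentation):

```shell
# 0 = disabled, 1 = enable only after a black hole is detected, 2 = always on
sysctl -w net.ipv4.tcp_mtu_probing=1
# Starting MSS used when probing kicks in
sysctl -w net.ipv4.tcp_base_mss=1024
```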
TCP Black Hole Detection
Some TCP stacks detect potential MTU black holes:
- Multiple retransmissions of the same segment without any progress
- Newer segments acknowledged but older large segment not
- Pattern suggests MTU issue not congestion
When detected, the stack:
- Reduces MSS temporarily
- Retransmits with smaller segments
- Gradually probes larger sizes if smaller works
This is a heuristic, not a solution, but it helps connections recover from MTU black holes instead of timing out.
IPv6 and MTU: Better But Still Broken
IPv6 tried to fix fragmentation:
No router fragmentation: Only endpoints can fragment. If a packet is too large, routers must send ICMPv6 Packet Too Big (PTB).
Minimum MTU: IPv6 requires 1280 byte minimum MTU on all links. This ensures a baseline that should always work.
PTB validation: RFC 8201 requires validating PTB messages correspond to actual traffic, preventing spoofing attacks.
These improvements help but don't solve the core issue: firewalls still block ICMPv6. The IETF literally published RFC 4890 titled "Recommendations for Filtering ICMPv6 Messages in Firewalls" explaining which ICMP types must not be blocked. Many security teams ignored it.
IPv6 PMTUD breaks for the same reasons as IPv4. The 1280-byte minimum helps (endpoints can fall back to that), but it's not always sufficient for efficient throughput.
The Special Horror of Tunnels
Every tunnel reduces MTU:
- GRE: adds 24 bytes (20 IP + 4 GRE)
- IPSec ESP: adds 50-60+ bytes (IP + ESP + padding + auth)
- IPSec with NAT-T: adds 60-70+ bytes (IP + UDP + ESP + padding + auth)
- GRE over IPSec: adds 74-84+ bytes (original packet really squeezed)
- VXLAN: adds 50 bytes (outer IP + UDP + VXLAN header)
- GTP (cellular): adds 36-52 bytes
- MPLS: adds 4 bytes per label (usually 4-8 bytes total)
- PPPoE: subtracts 8 bytes from Ethernet (encapsulation overhead)
- 6in4/6to4: adds 20 bytes (second IP header)
- IPv6 over IPv4: adds 20 bytes
Multiple tunnels stack: VPN over PPPoE over cellular means all overhead adds up.
Consider this nightmare scenario:
1500 (Ethernet) - 8 (PPPoE) = 1492 effective

Then VPN adds:
- 20 (outer IP)
- 8 (UDP for NAT-T)
- 50 (ESP)
= 78 bytes overhead

1492 - 78 = 1414 bytes available for inner packet
Your computer thinks it has 1500 MTU. The actual path MTU is 1414. That 86-byte gap is where packets go to die.
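The stacking is just subtraction, which makes it easy to script a sanity check (overhead figures taken from the scenario above):

```shell
# Effective inner MTU after PPPoE plus IPSec with NAT-T
eth=1500; pppoe=8
outer_ip=20; natt_udp=8; esp=50
link=$(( eth - pppoe ))                        # link MTU after PPPoE
inner=$(( link - outer_ip - natt_udp - esp ))  # what fits inside the tunnel
echo "link MTU: $link, usable inner MTU: $inner, gap: $(( eth - inner ))"
# link MTU: 1492, usable inner MTU: 1414, gap: 86
```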
The DF Bit in Tunnels
When tunneling packets, what do you do with the DF bit?
Option 1: Copy DF to outer header: If inner packet has DF, outer packet has DF. Preserves PMTUD but risks dropping packets if tunnel MTU is smaller.
Option 2: Clear DF on outer header: Allow fragmentation of outer packet. Works but creates fragmentation problems. Also, some networks filter fragments.
Option 3: Drop packet and send ICMP: Honor inner DF by not tunneling the packet if it's too large. Correct behavior but risks ICMP black holes.
There's no perfect answer. Most tunnels copy DF and rely on MSS clamping to prevent problems. It's ugly but practical.
Jumbo Frames: The Temptation
Jumbo frames (MTU > 1500, typically 9000) reduce overhead for high-throughput applications. Fewer packets for the same data means:
- Less CPU per byte (fewer interrupts, less header processing)
- Less bandwidth wasted on headers
- Better throughput for large transfers
Why jumbo frames are tempting: In controlled environments (data centers, HPC clusters), they provide measurable performance improvements.
Why jumbo frames are a trap:
- Must be configured everywhere: Every switch, router, NIC, and operating system in the path must support the same jumbo MTU. Miss one device and packets get dropped.
- Silently breaks: Misconfigured devices might accept jumbo frames but drop them randomly or corrupt them. Diagnosis is painful.
- Different standards: 9000, 9216, or something else? Vendors differ, and some hardware tops out at odd values like 9018 bytes. Mismatch = pain.
- Internet doesn't support them: Jumbo frames only work in your local network. Any traffic to the Internet reverts to 1500.
- Complicates troubleshooting: Now you have two MTU configurations (jumbo internal, 1500 external) with translation points that can break.
- Not as valuable as you think: With modern NICs, TSO (TCP Segmentation Offload) and GRO (Generic Receive Offload) reduce the CPU advantage of jumbo frames significantly.
Recommendation: Unless you're in a controlled data center with homogeneous equipment and proven need (storage networks, big data), stick with 1500. The operational complexity of jumbo frames exceeds the benefit for most networks.
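If you do deploy jumbo frames, configure and then verify end to end; a Linux sketch (the interface name `eth0` and `peer-host` are placeholders):

```shell
# Set the interface MTU (every switch and NIC in the path must match)
ip link set dev eth0 mtu 9000
# Verify with a DF ping: 9000 - 28 bytes of IP + ICMP headers = 8972 payload
ping -M do -s 8972 -c 3 peer-host
```

If the ping fails while a 1472-byte DF ping succeeds, some device in the path is still at 1500.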
UDP and MTU: The Wild West
TCP has MSS negotiation, probing, and black hole detection. UDP has none of this. UDP applications must:
- Know the path MTU somehow (how?)
- Keep packets small enough to avoid fragmentation
- Handle fragmentation if it happens
- Deal with PMTUD failures
Common UDP strategies:
Conservative sizing: DNS uses 512 bytes by default (tiny, always works). EDNS0 allows larger responses; since the 2020 DNS Flag Day, the widely recommended default buffer size is 1232 bytes (safe for most paths, including tunnels).
Application-level fragmentation: Some UDP applications (like QUIC) do their own fragmentation and reassembly at the application layer, avoiding IP fragmentation entirely.
Just hope: Many UDP applications send 1500-byte packets and hope for the best. Gaming protocols, VoIP (sometimes), streaming video all often do this. It mostly works because most paths are 1500, and when it doesn't work, users complain about "lag" without realizing it's MTU.
Probe with ICMP: Some applications send large UDP packets and check if ICMP comes back, adjusting size. This fails with ICMP filtering.
UDP's MTU handling is essentially "every application for itself." There's no standard, no protocol support, just hope and occasional failure.
ICMP and PMTU: The Catch-22
ICMP Type 3 Code 4 (Fragmentation Needed) is critical for PMTUD. Yet it's often blocked because:
Security concerns: ICMP can be used for reconnaissance, OS fingerprinting, and some attacks.
Blanket filtering: "Block all ICMP" is easier than understanding which types are safe.
Legacy advice: 1990s security guides recommended blocking ICMP. Many networks never revisited this.
Ignorance: Network security and network operations are often separate teams. Security blocks ICMP without understanding it breaks PMTUD. Operations troubleshoots weird failures without knowing security blocked ICMP.
The recommendations (RFC 4890 for ICMPv6, and similar operational guidance for IPv4) are clear: ICMPv6 Type 2 (Packet Too Big), ICMPv4 Type 3 Code 4 (Fragmentation Needed), and a few other types must not be filtered. Yet filtering happens anyway.
This creates a Catch-22:
- PMTUD requires ICMP
- Security blocks ICMP
- PMTUD breaks
- Workarounds (MSS clamping, conservative sizing) are deployed
- Networks "work" (sort of)
- No pressure to fix ICMP filtering
- Cycle continues
The correct solution is to allow necessary ICMP types. The actual solution is MSS clamping and hoping for the best.
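For teams willing to fix the root cause, the allow rules are small (a Linux netfilter sketch; adapt to your firewall platform):

```shell
# IPv4: permit Fragmentation Needed (Type 3, Code 4)
iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
# IPv6: permit Packet Too Big (Type 2), which RFC 4890 says must not be dropped
ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
```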
Modern Trends and Remaining Problems
TSO/GSO/GRO: Modern NICs support TCP Segmentation Offload (TSO), Generic Segmentation Offload (GSO), and Generic Receive Offload (GRO). These allow the OS to generate large "packets" (64KB+) and have the NIC split them into MTU-sized packets. This reduces CPU overhead without requiring jumbo frames.
QUIC: The QUIC transport (originated at Google, now the foundation of HTTP/3) does its own packetization in userspace, avoiding IP fragmentation entirely. QUIC performs its own datagram-based path MTU discovery (DPLPMTUD, RFC 8899), sidestepping the ICMP dependency that plagues classic PMTUD.
eBPF and XDP: Modern Linux allows packet manipulation in kernel or even in the NIC driver. This enables smarter MTU handling and probing.
IPv6 adoption: Eventually, IPv6's cleaner MTU handling (no router fragmentation, 1280 minimum) will help. But IPv6 still has ICMP filtering problems.
VPN proliferation: WireGuard, OpenVPN, IPSec, SSL VPNs, and corporate remote access mean more users behind tunnels with reduced MTU. MSS clamping is critical.
Cloud networking: Cloud providers use various tunnel protocols (VXLAN, Geneve, etc.) internally, reducing effective MTU. They mostly handle this transparently, but edge cases remain.
Living With MTU Problems
MTU should be simple. It's a number. Yet it touches every layer of the network stack, depends on fragile PMTUD mechanisms that break constantly, and causes mysterious failures that users describe as "slow" or "sometimes works."
The practical approach for network engineers:
- Expect 1500 as maximum: Design for standard Ethernet MTU
- Deploy MSS clamping: On every VPN, tunnel, and encapsulation point
- Document actual MTU: Know what your tunnels reduce it to
- Allow necessary ICMP: Fight the security team; allow ICMPv6 Type 2 and ICMPv4 Type 3 Code 4
- Monitor for black holes: Watch for timeout patterns suggesting MTU issues
- Test with large transfers: "ping works" isn't sufficient, test with actual data
- Avoid jumbo frames: Unless you're running a data center and know what you're doing
- Keep it simple: Fewer layers of encapsulation = fewer MTU problems
MTU problems will never completely go away. We've built too many layers, too many tunnels, too many workarounds. PMTUD breaks because security teams filter ICMP. MSS clamping works around PMTUD failures. TCP probing works around MSS clamping limitations. Each layer of workaround adds complexity.
But the Internet keeps working, mostly, because enough workarounds exist that packets usually find their way through. When they don't, network engineers troubleshoot MTU black holes at 3 AM, curse the complexity, deploy more MSS clamping, and move on.
Welcome to MTU, where a simple number becomes the source of endless troubleshooting, where elegant mechanisms like PMTUD break in production, and where ugly workarounds like MSS clamping are what actually keep the Internet running. It's not pretty, but it's what we have.