TCP Tuning: The Knobs You Can Turn and the Ones You Shouldn't Touch
TCP is the most successful protocol in the history of computing. It was designed in 1974, ratified in 1981, and has carried the overwhelming majority of human internet traffic ever since. It runs on your phone, your web server, your database, your streaming service, and the undersea cables that connect continents. It has survived the dial-up era, the broadband transition, the mobile revolution, and the hyperscaler age without a fundamental redesign.
It has also been mistuned, abused, and willfully broken by systems administrators more times than anyone cares to count.
Here's the uncomfortable truth: TCP is not a fixed protocol sitting passively in your kernel, quietly doing its job. It is a collection of mechanisms, each with default values chosen for hardware that existed decades ago, running in environments that were fundamentally different from your 100Gbps spine or your globally distributed microservices deployment. Every Linux system ships with TCP defaults that were probably chosen for a server with 128MB of RAM on a 100Mbps link. Your server has 256GB of RAM and a 100Gbps NIC. Those defaults do not fit.
But TCP is also not a slot machine where pulling levers produces better throughput. Most of the tuning advice floating around the internet, particularly the kind packaged as "paste this sysctl.conf and your performance will improve," ranges from cargo-cult superstition to actively harmful. Some of the knobs you shouldn't touch at all. Some you should understand deeply before you touch them. And a handful, tuned correctly, can meaningfully transform your network's behavior.
Before touching a single parameter, capture your baseline:
uname -r
sysctl net.ipv4.tcp_congestion_control net.ipv4.tcp_available_congestion_control
sysctl net.core.default_qdisc
tc -s qdisc show
ss -tmiH | head -n 20
That last command is the most useful diagnostic you have. Look for rwnd_limited, sndbuf_limited, cwnd, rtt, and pacing_rate. If you're rwnd_limited, your congestion control tweak will do nothing. If you're sndbuf_limited, the kernel isn't the problem: your application is. If you're CPU-limited, network knobs are mostly decorative. Measure first. Then tune.
Let's peel back the stack, understand what TCP is actually doing under the hood, and figure out which levers are worth pulling.
The Foundation: What TCP Is Actually Managing
Before touching a single sysctl, you need to internalize what TCP is trying to accomplish. It is solving four distinct problems simultaneously, and the knobs you'll tune are all in service of one of these four.
Reliability, meaning every byte sent arrives exactly once in the correct order. TCP sequences every byte, acknowledges receipt, and retransmits anything that doesn't get acknowledged. This requires memory on both sides and a mechanism to detect loss.
Flow control, meaning the sender cannot overwhelm the receiver. If your sender is doing 10Gbps and your receiver can only process 1Gbps, you need the receiver to communicate its capacity. TCP does this through the receive window: a field in every segment telling the sender how much data the receiver is ready to accept.
Congestion control, meaning the sender cannot overwhelm the network between sender and receiver. This is the hard part. The network doesn't tell you directly that it's congested; you have to infer it from signals like packet loss and increasing round-trip time. TCP's congestion control algorithms are elaborate inference engines sitting inside your kernel, trying to probe available bandwidth without causing collapse.
Connection management, meaning establishing and tearing down sessions in a way both sides can agree on. The three-way handshake for establishment, the four-way exchange for teardown, the TIME_WAIT state that lingers after connections close, all of these are load-bearing and frequently misunderstood.
Every knob you can turn is connected to one of these four concerns. Understanding which concern you're addressing is what separates tuning from thrashing.
The Buffer Architecture: Where Most People Start and Often Stop
The most commonly tuned TCP parameters are buffer sizes, and for good reason: the defaults are almost certainly wrong for any modern server. But "make the buffers bigger" is not a tuning strategy. It's a heuristic with real consequences.
The Bandwidth-Delay Product, and Why It Matters More Than Your Buffer Size
The fundamental insight of TCP buffer sizing is the bandwidth-delay product: bandwidth multiplied by round-trip time gives the number of bytes that must be in flight to keep the path full. If your network path supports 10Gbps and your RTT to a distant client is 100ms, then to fully utilize that path you need 10,000,000,000 bits/second * 0.1 seconds = 1,000,000,000 bits = 125MB of data in flight simultaneously.
If your TCP socket buffer is smaller than 125MB, your sender will run out of window space and sit idle, waiting for acknowledgments before it can send more. Your 10Gbps link will deliver a fraction of that. Not because the network is congested, not because your CPUs are saturated, but because your kernel is waiting.
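The arithmetic is worth encoding once so you can check it against your own link speeds and RTTs. A minimal sketch (the helper name bdp_bytes is mine, not a kernel API):

```c
#include <stdint.h>

/* Bandwidth-delay product: the number of bytes that must be in flight
 * to keep the path full. bandwidth_bps is in bits per second, rtt_ms
 * in milliseconds. Divide by 8 first to avoid overflow on fast links. */
uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t rtt_ms) {
    return bandwidth_bps / 8 * rtt_ms / 1000;
}
```

For 10Gbps at 100ms, this yields 125,000,000 bytes, the 125MB figure above; the same function tells you a 1Gbps LAN path at 1ms RTT needs only 125KB.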
This is why buffer sizing is the first real lever most people need to pull.
The Kernel Buffer Sysctl Parameters
Linux exposes these through the net.core and net.ipv4.tcp namespaces. The critical ones:
net.core.rmem_max
net.core.wmem_max
net.core.rmem_default
net.core.wmem_default
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
The net.core parameters are the system-wide ceiling. The tcp_rmem and tcp_wmem are three-value tuples: minimum, default, and maximum. The kernel uses the autotuning system to dynamically size each socket's buffer between the minimum and maximum based on observed RTT and throughput.
A starting point for a high-bandwidth, moderate-latency server. Note the documented kernel defaults for the middle value: tcp_rmem default is 131072 bytes, tcp_wmem default is 16384 bytes. Keep those middle values accurate; the maximums are what you're actually raising:
net.core.rmem_max = 134217728    # 128MB
net.core.wmem_max = 134217728    # 128MB
net.ipv4.tcp_rmem = 4096 131072 134217728
net.ipv4.tcp_wmem = 4096 16384 134217728
The critical parameter most tuning guides skip: net.ipv4.tcp_mem. This controls how much total memory, across all TCP sockets combined, the kernel will allocate for TCP buffers. It's a three-value tuple in pages (not bytes, a distinction that causes no end of confusion), and its defaults are computed at boot based on available RAM. Do not hardcode this value from a blog post written for a different machine. Check what your kernel calculated:
sysctl net.ipv4.tcp_mem
getconf PAGESIZE
free -m
If you're raising per-socket maximums to 128MB with 10,000 simultaneous connections, the math is obvious: 10,000 * 128MB is a theoretical 1.28TB of TCP buffer space. Your machine doesn't have that, and the kernel will start throttling sockets before the per-socket limits are reached. Treat tcp_mem changes as capacity planning, not performance tuning.
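Because tcp_mem is denominated in pages while everything else is in bytes, it pays to do the conversion explicitly when auditing. A small sketch of the arithmetic (both helper names are mine):

```c
#include <stdint.h>

/* tcp_mem is counted in pages, not bytes. Convert a byte budget into
 * pages (rounding up) so it can be compared against the sysctl value. */
uint64_t bytes_to_pages(uint64_t bytes, uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

/* Worst case: every connection inflated to its per-socket maximum. */
uint64_t worst_case_buffer_bytes(uint64_t connections, uint64_t per_socket_max) {
    return connections * per_socket_max;
}
```

With 4KB pages, a 128MB per-socket ceiling is 32,768 pages per socket; 10,000 such sockets at maximum is about 1.3TB, which is the capacity-planning mismatch described above.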
Autotuning: The Mechanism You Should Not Disable
Linux introduced TCP receive buffer autotuning in 2.6.7, and it is one of the most important improvements to TCP performance in the last two decades. With autotuning enabled (the default), the kernel continuously adjusts each socket's receive buffer based on the observed RTT and the rate at which the application is consuming data. A connection to a nearby client with low RTT doesn't need a massive buffer. A connection crossing continents at 150ms RTT does. Autotuning figures this out per-socket without you having to specify it.
The knob:
net.ipv4.tcp_moderate_rcvbuf = 1
Leave this at 1. Always. If you ever see tuning advice telling you to set this to 0, close the tab and question the rest of the article's credibility.
The reason people sometimes disable autotuning is that they've seen a specific high-throughput use case where autotuning's conservative initial behavior causes slow ramp-up. The correct fix is to increase the maximum ceiling, not to disable autotuning.
Congestion Control: The Algorithm Doing All the Real Work
Congestion control is where TCP gets interesting, contentious, and genuinely complex. The algorithm your kernel uses to pace transmissions is the single biggest determinant of throughput over any non-trivial network path, and you have real choices here.
One important clarification before we go further: TCP congestion control runs on the sender. Enabling BBR on a server controls how that server sends data. It does not affect how data is sent to that server. If you're trying to improve download speeds to your clients, you need BBR running on the server that is doing the sending, not on the clients doing the receiving. This is obvious once stated, but it confuses people regularly.
Start by knowing what your kernel actually supports:
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_allowed_congestion_control
Do not assume bbr2 or any specific algorithm is present. The available list is determined at kernel build time and varies by distribution and version.
A Brief History of Why This Problem Is Hard
The original TCP specification had no congestion control at all. In 1986, the internet collapsed. Literally. Van Jacobson at LBL observed the ARPANET experiencing a congestion collapse where throughput dropped by a factor of 1,000. Routers were dropping packets, senders were retransmitting, which caused more packets, which caused more drops, which caused more retransmissions. A feedback loop of destruction.
Van Jacobson's 1988 paper introduced the mechanisms that became the foundation of TCP congestion control: slow start, congestion avoidance, fast retransmit, and fast recovery. These mechanisms form the skeleton of every congestion control algorithm that followed. Understanding them is not optional if you want to understand what you're tuning.
Slow Start and Congestion Avoidance
When a TCP connection first opens, the sender has no idea how much capacity the network has. It starts with a small congestion window (cwnd), typically 10 segments in modern kernels (controlled by the initcwnd route parameter), and doubles it every RTT. This is slow start: exponential growth until either the congestion window reaches the slow start threshold (ssthresh) or loss is detected.
After the threshold is crossed, the algorithm transitions to congestion avoidance: linear growth, increasing cwnd by one segment per RTT. This continues until loss is detected.
On loss, the classic behavior is to cut the congestion window in half and reset ssthresh. This is the additive increase, multiplicative decrease (AIMD) algorithm. It's fair, it's stable, and on long-latency high-bandwidth paths, it is also painfully slow.
Here's the problem: if you're running across a 100ms RTT path and your cwnd gets cut because a single packet was dropped due to a brief queue spike that resolved immediately, you'll spend many RTTs recovering to your previous throughput. Each RTT is 100ms. Your throughput chart looks like a sawtooth wave, not a flat line.
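The cost of that sawtooth is easy to quantify under the idealized Reno model, where congestion avoidance adds one segment of window per RTT (CUBIC recovers faster; the helper name is mine):

```c
#include <stdint.h>

/* After a classic AIMD halving, congestion avoidance claws back one
 * segment per RTT, so returning to the old window takes old_cwnd / 2
 * RTTs. Multiply by the RTT to get wall-clock recovery time. */
uint64_t recovery_time_ms(uint64_t old_cwnd_segments, uint64_t rtt_ms) {
    return (old_cwnd_segments / 2) * rtt_ms;
}
```

For a 1,000-segment window on a 100ms path, one halving costs 50 seconds of below-peak throughput, which is exactly why loss-based AIMD is painful on high-BDP routes.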
This is the fundamental motivation for the modern generation of congestion control algorithms.
The Algorithm Zoo: Your Real Options
Linux exposes the congestion control algorithm selection through:
net.ipv4.tcp_congestion_control
Reno is the direct descendant of Van Jacobson's original mechanisms, now used mainly as a baseline for comparison. Loss-based, sawtooth throughput, slow recovery on high-BDP paths. Do not use Reno in production on modern networks.
CUBIC is the Linux default since kernel 2.6.19, and the default for most of the internet. CUBIC's insight is that the congestion window should grow as a cubic function of time since the last loss event, rather than linearly. This allows faster recovery after loss and better utilization of high-bandwidth paths. CUBIC is also designed to be fair to Reno-based connections on shared paths.
CUBIC is a loss-based algorithm. It infers congestion from packet drops. This creates a structural problem: CUBIC must actually fill the queue and cause drops in order to discover the bandwidth ceiling. The queue delay you observe on CUBIC-tuned paths is not a bug; it is CUBIC doing exactly what it's designed to do.
BBR (Bottleneck Bandwidth and RTT), developed at Google and released in 2016, represents a fundamentally different philosophy. BBR is a model-based algorithm: instead of reacting to loss, it tries to build an explicit model of the available bandwidth and the propagation delay of the path, then paces transmissions to match that model without overfilling queues.
BBR measures two quantities: the maximum bandwidth seen recently (BtlBw, bottleneck bandwidth) and the minimum RTT seen recently (RTprop, round-trip propagation delay). It paces at BtlBw and sizes cwnd to BtlBw * RTprop. The goal is to operate at the "BBR optimal point": sending at exactly the bottleneck bandwidth, with no excess queue buildup.
The results on long-latency, lossy paths are dramatic. Google's published benchmarks from the original netdev paper showed BBR achieving throughput orders of magnitude higher than CUBIC on a 10Gbps link at 100ms RTT with 1% random loss, where CUBIC collapses to a few Mbps and BBR remains near the bottleneck rate. On the open internet, where paths frequently pass through lossy wireless links or cross-continental routes, BBR's loss-independence is a significant advantage.
BBRv1 has known issues. Most importantly: it can be unfair to CUBIC connections on shared bottleneck links. BBRv1 does not back off on loss the way CUBIC does, so on a shared path, a BBR flow can occupy a disproportionate share of bandwidth while CUBIC flows repeatedly cut their window. This is not hypothetical; it is well-documented in measurement studies.
BBRv2 addresses most of these fairness issues. If your kernel has it, it will appear in tcp_available_congestion_control. If it's there:
net.ipv4.tcp_congestion_control = bbr2
Do not assume it exists. Some distributions backport it; most stock kernels do not yet include it by default. Check first.
HTCP is worth mentioning for high-latency, high-bandwidth environments. It's less common than BBR but performs well in specific WAN scenarios. CDG (CAIA Delay Gradient) is a delay-based algorithm similar in philosophy to BBR, available as a loadable kernel module on some distributions.
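The sysctl only sets the default for new sockets. An application can also select an algorithm per connection with the TCP_CONGESTION socket option, restricted to the tcp_allowed_congestion_control list for unprivileged processes. A minimal sketch (set_cc is my name for the wrapper; the socket option itself is real):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Select a congestion control algorithm for one socket. Algorithms
 * outside net.ipv4.tcp_allowed_congestion_control require CAP_NET_ADMIN.
 * Returns 0 on success, -1 if the algorithm isn't available. */
int set_cc(int sock, const char *algo) {
    return setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION,
                      algo, strlen(algo));
}
```

This is useful when one host serves both data-center peers and internet clients: a proxy can run BBR toward the WAN while leaving internal connections on the system default.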
Setting the Initial Congestion Window
One of the highest-leverage, lowest-risk tuning parameters is the initial congestion window (initcwnd) on your routing table. This controls how many segments TCP will send before it has received any acknowledgment, essentially how fast slow start begins.
The default in modern Linux kernels is 10 segments (about 14KB with typical MSS), standardized by RFC 6928. For a small web page or API response that fits in a handful of segments, this is fine. For a large file transfer or a streaming protocol, you'll sit in slow start for multiple RTTs before you've even found your cruising speed.
You can increase initcwnd per route:
ip route change default via <gateway> initcwnd 50
The right value depends on your use case. For LAN or low-latency paths, aggressive values (50-100) make sense. For paths with significant propagation delay where you genuinely need slow start to probe bandwidth, be conservative. Blindly setting initcwnd to 1000 on all routes is how you congest your network the moment load increases.
The Queue Discipline: AQM and TCP Congestion Interaction
TCP congestion control and your network queue management algorithm are not independent systems. They are in a co-evolutionary relationship, and tuning one without understanding the other produces suboptimal results.
By default, Linux uses pfifo_fast for interface queuing, a simple FIFO with three priority bands. It has no active queue management, meaning it buffers packets until the queue fills, then drops. This is tail-drop, and it causes two problems: buffer bloat (excessive queuing delay because packets sit in a full queue) and TCP synchronization (multiple flows all experiencing drop simultaneously, all cutting their window simultaneously, causing a burst of underutilization followed by a burst of retransmission).
FQ-CoDel (Fair Queuing Controlled Delay) is well-studied (CoDel is RFC 8289; FQ-CoDel is RFC 8290) and a strong default for most scenarios, particularly when your host is the bottleneck or when you're shaping traffic at the edge. CoDel's insight is that a queue's target delay, not its depth, is the right signal for congestion. It sets a target delay (5ms by default) and signals congestion when packets have been sitting in the queue longer than that target for a sustained interval.
One practical caveat: modern NICs are often multiqueue. Blindly replacing the root qdisc can interact unexpectedly with multiple hardware transmit queues. Verify your qdisc state before and after:
tc -s qdisc show
tc -s qdisc show dev <IFACE>
Set it conditionally:
# If you control the bottleneck or see bufferbloat symptoms:
tc qdisc replace dev <IFACE> root fq_codel
Or as the system default for newly created qdiscs:
net.core.default_qdisc = fq_codel
FQ (Fair Queue), paired with BBR, is what Google uses internally. FQ provides per-flow queuing so that a single bursty flow doesn't delay packets from other flows, and it paces TCP segments according to BBR's rate estimate. The in-kernel BBR implementation explicitly notes that without FQ for pacing, BBR falls back to internal pacing and may use more CPU resources. The combination of BBR + FQ is currently the best-performing pairing for high-throughput, low-latency paths:
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
If your server is behind a large fat pipe and the bottleneck is somewhere else on the network, qdisc changes may do little. Test before and after. The qdisc is not a universal performance knob; it's a bottleneck management mechanism, and it only matters at the bottleneck.
The Handshake Overhead: Reducing Connection Establishment Latency
TCP Fast Open
Every new TCP connection begins with a three-way handshake: SYN, SYN-ACK, ACK. Before any data can be sent, you've spent one full RTT on handshake overhead. On a 100ms path, that's 100ms of dead time before the first byte of application data moves.
TCP Fast Open (TFO), specified in RFC 7413, allows data to be sent in the SYN packet itself, eliminating that RTT for repeated connections to the same server. The first connection still requires the handshake; TFO issues a cryptographic cookie to the client. On subsequent connections, the client includes the cookie in the SYN along with data, and the server can process that data before the handshake completes.
Here's where the original guidance on this knob was wrong in a way that matters operationally. net.ipv4.tcp_fastopen is a bitmask, not a simple 0-through-3 enum. The bits compose:
0x1: Enable TFO on the client side (this is the documented default, likely already on)
0x2: Enable TFO on the server side for sockets that explicitly call TCP_FASTOPEN
0x400: Enable TFO for all listeners without requiring a per-socket TCP_FASTOPEN enable
Check what you actually have:
sysctl net.ipv4.tcp_fastopen
If you want client and server TFO enabled globally for all listeners without per-socket opt-in:
sysctl -w net.ipv4.tcp_fastopen=$((0x403))
The second thing the bitmask explanation glosses over: on the client side, TFO requires the application to use sendmsg() or sendto() with the MSG_FASTOPEN flag, not a plain connect() followed by send(). If your application isn't using MSG_FASTOPEN, you can enable every TFO bit in the kernel and nothing changes. "TFO is enabled" and "TFO is being used" are different states.
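In code, the client-side pattern looks like this sketch, assuming a blocking IPv4 socket (tfo_send is my name for the wrapper; MSG_FASTOPEN is the real flag):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Client-side TFO: hand the first payload to sendto() with MSG_FASTOPEN
 * instead of calling connect() then send(). Without a cached cookie, or
 * when the kernel's blackhole detection has tripped, this transparently
 * falls back to a normal three-way handshake followed by the send. */
ssize_t tfo_send(int sock, const void *buf, size_t len,
                 const struct sockaddr_in *server) {
    return sendto(sock, buf, len, MSG_FASTOPEN,
                  (const struct sockaddr *)server, sizeof(*server));
}
```

Note that the data passed here is exactly the data exposed to the replay concern below, so it should be the idempotent part of your protocol.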
TFO has two failure modes worth knowing. First, some middleboxes drop SYN packets with data, causing connections to fail or stall. Linux handles this with a blackhole detection mechanism that falls back to standard three-way handshake, but you might not be getting the benefit you expect. Second, TFO introduces a replay attack surface: a SYN with data can be replayed, causing the server to process the same data twice. For idempotent requests this is harmless. For non-idempotent requests, it's a correctness issue.
Enable TFO where it benefits you. Monitor for middlebox failures. Don't use it for connection-level protocols where replaying the first data segment would cause state inconsistency.
TIME_WAIT: The State Everyone Hates and No One Should Disable
After a TCP connection is closed, the side that performed the active close, the one that sent the first FIN, enters TIME_WAIT for a duration of 2 * MSL (Maximum Segment Lifetime), which comes out to 60 seconds in most Linux configurations. During TIME_WAIT, the kernel keeps the four-tuple (source IP, source port, dest IP, dest port) in a table and will refuse to reuse it.
The reason for TIME_WAIT is subtle and important: the last ACK sent during connection teardown might be lost, causing the remote side to retransmit its FIN. Without TIME_WAIT, a new connection on the same four-tuple might receive that stale FIN and be incorrectly terminated.
The problem: a busy server handling tens of thousands of short-lived connections can accumulate massive TIME_WAIT tables. Each entry uses kernel memory. Port exhaustion becomes a real threat when you're initiating outbound connections from a limited source port range.
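The port-exhaustion ceiling is simple arithmetic (the helper name is mine): with every four-tuple parked in TIME_WAIT for a fixed interval, the sustained rate of new outbound connections to a single destination IP and port is capped by the ephemeral range divided by that interval.

```c
#include <stdint.h>

/* Sustained new-connection rate to one (dest IP, dest port), given an
 * inclusive ephemeral port range and the TIME_WAIT duration in seconds. */
uint32_t max_conn_rate_per_dest(uint32_t port_range_lo, uint32_t port_range_hi,
                                uint32_t tw_len_s) {
    return (port_range_hi - port_range_lo + 1) / tw_len_s;
}
```

With the default range of 32768-60999 (28,232 ports) and 60-second TIME_WAIT, that's roughly 470 connections per second per destination. A proxy doing more than that to one backend will stall on port allocation no matter how much CPU it has.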
The wrong fix is tcp_tw_recycle, which was removed entirely in Linux 4.12 because it was so frequently misused. tcp_tw_recycle enabled per-host timestamp tracking to accelerate TIME_WAIT reuse, but it broke horribly when multiple clients were behind NAT, since their timestamps appeared to travel backward. Systems administrators would see apparently random connection failures from NATted clients, spend days debugging packet captures, and eventually discover that tcp_tw_recycle was the culprit. If you find this parameter in an old runbook, congratulations, you found a fossil.
The right approach to TIME_WAIT is more nuanced than "set tcp_tw_reuse = 1." Kernel documentation actually classifies tcp_tw_reuse as a parameter that "should not be changed without advice or request of technical experts." Its documented modes:
0: disabled
1: global enable (risky, see below)
2: loopback-only (the documented default)
Setting tcp_tw_reuse = 1 globally allows the kernel to reuse TIME_WAIT sockets for new outbound connections when PAWS (Protection Against Wrapped Sequence numbers) can verify safety. On networks with stable, correctly-functioning timestamps, this works. On networks with NAT or timestamp inconsistencies, it can cause subtle connection failures. Treat it as a targeted last-resort tool when you have measured port exhaustion, not a default performance setting. Always verify your current value before changing it:
sysctl net.ipv4.tcp_tw_reuse
Increase the ephemeral port range as the safer first move when facing outbound port exhaustion. The default ip_local_port_range is 32768-60999. A busy load balancer or proxy can exhaust this. When expanding it, resist the urge to go all the way to 1024, which risks colliding with service ports. A safer expansion:
sysctl -w net.ipv4.ip_local_port_range="20000 65535"
sysctl -w net.ipv4.ip_local_reserved_ports="22,80,443,5432,6379"
Adjust the reserved list to match the services actually running on the box.
net.ipv4.tcp_max_tw_buckets: The maximum number of TIME_WAIT sockets the kernel will maintain. If this limit is exceeded, the kernel destroys sockets immediately and logs a warning. The kernel docs are explicit: do not lower this value below the default; it exists as a DoS guard. If you're seeing those limit warnings, increase it:
net.ipv4.tcp_max_tw_buckets = 1440000
And verify you're actually hitting the limit before changing it:
dmesg | grep -i timewait
sysctl net.ipv4.tcp_max_tw_buckets
Nagle's Algorithm: The Knob That Ruins Latency-Sensitive Applications
In 1984, John Nagle wrote a memo to the ARPANET engineering community describing a problem: hosts were sending streams of tiny packets, one character at a time, creating a 40-byte IP/TCP header for each 1-byte payload. The network was being clogged by overhead.
Nagle's algorithm, which became RFC 896, addressed this by buffering small sends: if there is unacknowledged data in flight, don't send a new segment until either the buffer reaches MSS size or all previous data is acknowledged. This coalesces small writes into full-sized segments, dramatically reducing small-packet overhead.
For throughput-oriented workloads, Nagle's algorithm is unambiguously correct. For latency-sensitive applications, particularly request-response protocols where you send a small request and wait for a small response, it is a silent performance catastrophe.
The classic failure mode: a client sends a small request (less than MSS). Nagle buffers it. Meanwhile, the response hasn't arrived yet, so there's unacknowledged data. The kernel waits. After 200ms (the default delayed ACK timeout, which compounds the problem), the delayed ACK timer fires on the receiver, the ACK is sent, the sender's Nagle condition clears, the small packet finally goes out. You've introduced 200ms of completely artificial latency.
This is why every database client library, every HTTP/2 implementation, every RPC framework disables Nagle at the socket level with TCP_NODELAY:
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
TCP_NODELAY is a per-socket decision made in application code. There is no system-wide sysctl for it. If you are writing a latency-sensitive application and not setting TCP_NODELAY, you are leaving latency on the floor. If you are writing a throughput-oriented application that sends large streams of data, leave Nagle enabled (the default).
The paired problem is delayed ACK. By default, receivers wait up to 200ms to send an ACK, hoping they can piggyback it on outgoing data. This is fine for bidirectional flows but interacts badly with Nagle on unidirectional ones. Disabling delayed ACK requires TCP_QUICKACK, which is also per-socket:
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
Note that TCP_QUICKACK is not sticky: you must set it after every ACK, or at least periodically. Some applications set it in their receive loop. Most don't bother, leaving 200ms of latency on the table on every request-response cycle.
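The receive-loop pattern looks like this sketch (recv_quickack is my name for the wrapper; re-arming before each blocking read is the point):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* TCP_QUICKACK is not sticky: the kernel may re-enter delayed-ACK mode
 * after it fires. Re-arm it before every blocking read so ACKs for this
 * request-response exchange go out immediately. */
ssize_t recv_quickack(int sock, void *buf, size_t len) {
    int one = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    return recv(sock, buf, len, 0);
}
```

The setsockopt call is cheap relative to a syscall-bound receive loop, so the overhead of re-arming on every read is rarely measurable.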
Keepalives: The Least Surprising Way to Lose Connections
TCP keepalives are a mechanism to detect dead connections. When enabled, the kernel sends a probe packet after the connection has been idle for tcp_keepalive_time seconds. If no response is received, it sends additional probes tcp_keepalive_intvl seconds apart, up to tcp_keepalive_probes times. If all probes fail, the connection is considered dead and closed.
The documented defaults:
net.ipv4.tcp_keepalive_time = 7200     # 2 hours before first probe
net.ipv4.tcp_keepalive_intvl = 75      # 75 seconds between probes
net.ipv4.tcp_keepalive_probes = 9      # 9 probes before giving up
With defaults, a dead connection is not detected for 2 hours + (9 * 75 seconds) = roughly 2 hours and 11 minutes. If your application opens a connection, goes idle, and the remote side disappears (crashed, network partition, NAT timeout), your application won't know for over two hours.
Shorter values for applications that need quicker dead-peer detection:
# IF you have long-lived idle connections and need faster dead-peer detection:
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=6
This detects dead connections within roughly 480 seconds. How aggressive to be depends on your workload.
Important: TCP keepalives require the application to enable them on each socket with SO_KEEPALIVE. Setting the sysctl values without ensuring the application enables keepalives on its sockets does nothing. The sysctls set defaults for sockets that have opted in; they are not a system-wide override.
Also important: NAT devices have their own idle timeout, typically 30-300 seconds depending on the device. A connection that is idle longer than the NAT timeout will have its state silently dropped by the NAT device. Setting tcp_keepalive_time lower than the NAT timeout prevents this, which is the most practical reason to reduce the default 2-hour value.
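Applications can also override the sysctls per socket via TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT, which is usually preferable to retuning the whole host. A sketch combining the opt-in with per-socket values (the wrapper name is mine):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Opt this socket into keepalives and override the global sysctls for it.
 * Worst-case dead-peer detection is idle_s + cnt * intvl_s seconds. */
int enable_keepalive(int sock, int idle_s, int intvl_s, int cnt) {
    int on = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    return setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}
```

Called as enable_keepalive(sock, 300, 30, 6), this matches the sysctl example above, but only for the sockets that ask for it.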
Segment Offloading: Letting Your NIC Do the Work
Modern NICs support a family of offload capabilities that shift computational work from the CPU to the NIC hardware. These are not strictly TCP tuning parameters, but they interact with TCP performance so directly that they belong in this discussion.
TSO: TCP Segmentation Offload
TCP Segmentation Offload allows the kernel to pass very large chunks of data (up to 64KB) to the NIC as a single segment, letting the NIC split it into MTU-sized packets. Without TSO, the kernel's TCP stack would have to split every write into MSS-sized segments before handing them to the driver.
TSO dramatically reduces CPU overhead for high-throughput TCP flows. On a 10Gbps link without TSO, the kernel might need to process 750,000 segments per second. With TSO, it hands the NIC a smaller number of large buffers and lets hardware handle segmentation.
Verify and control TSO:
ethtool -k eth0 | grep -i seg
ethtool -K eth0 tso on
TSO should almost always be enabled. The cases where you'd disable it: debugging packet-level behavior (TSO makes Wireshark captures confusing because you see giant coalesced frames that don't match what's actually on the wire), or when dealing with NICs that have buggy TSO implementations (rare but it happens). If you need to disable it for testing, do it temporarily and measure CPU impact before making the change permanent.
GRO: Generic Receive Offload
The receive-side equivalent of TSO. GRO coalesces incoming packets that belong to the same flow into larger buffers before handing them to the kernel, reducing the number of receive-side interrupts and context switches.
ethtool -K eth0 gro on
GRO and TSO together are the single highest-impact NIC tuning you can do for throughput. Leave them on.
GSO: Generic Segmentation Offload
GSO is TSO implemented in software, for cases where the NIC doesn't support TSO or where the protocol isn't TCP (UDP, for instance). It sits between the kernel TCP stack and the NIC driver. For TCP, TSO hardware offload is preferred, but GSO provides the same benefit when hardware TSO is unavailable.
RSS and RPS: Spreading the Load
Receive Side Scaling (RSS) allows a NIC with multiple receive queues to hash incoming packets across queues based on the five-tuple (source IP, source port, dest IP, dest port, protocol), ensuring that packets from the same flow go to the same queue, and different flows spread across queues. This enables multiple CPU cores to process receive traffic in parallel.
If your NIC supports RSS, it's configured via the NIC firmware or driver. The number of receive queues should generally match your number of physical cores; more queues than cores buys nothing:
ethtool -L eth0 combined 8    # or 'rx 8', depending on what the driver exposes
Receive Packet Steering (RPS) is RSS implemented in software, for NICs that don't support RSS in hardware. Configure via:
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
The hex value is a CPU bitmask. ff means all of the first 8 CPUs.
Interrupt Coalescing
High-throughput NICs can generate enormous numbers of interrupts, each requiring a CPU context switch. Interrupt coalescing tells the NIC to batch interrupts: fire once every N microseconds or once every N packets, whichever comes first.
ethtool -C eth0 rx-usecs 50 tx-usecs 50
More aggressive coalescing (higher usecs values) reduces CPU overhead at the cost of increased latency. For bulk transfer workloads, coalesce aggressively. For latency-sensitive applications, reduce coalescing or disable it. This is a genuine trade-off with no universally correct answer.
Retransmission Behavior: What Happens When Things Go Wrong
TCP detects loss through two mechanisms: duplicate ACKs (fast retransmit) and the retransmission timeout (RTO). Tuning retransmission behavior affects how aggressively TCP probes for loss and how long it waits before giving up on a connection.
SACK and DSACK
Selective Acknowledgment (SACK) allows the receiver to tell the sender exactly which segments it has received, rather than only the highest contiguous sequence number received. Without SACK, the sender must retransmit everything from the last acknowledged point. With SACK, it retransmits only the specific missing segments.
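The difference is easy to see in miniature. This toy model (not the kernel's actual recovery logic, which also tracks dup-ACK thresholds and RTO state) contrasts the two behaviors for one lost segment in a ten-segment window:

```python
def retransmit_set(sent, received, sack=True):
    """Which segments get retransmitted after loss. Without SACK the
    sender knows only the highest contiguous ACK, so it resends
    everything from the first hole onward; with SACK, only the holes."""
    missing = sorted(set(sent) - set(received))
    if not missing:
        return []
    if sack:
        return missing
    first_hole = missing[0]
    return [s for s in sent if s >= first_hole]

sent = list(range(1, 11))
received = [s for s in sent if s != 3]  # only segment 3 was lost
print(retransmit_set(sent, received, sack=True))   # [3]
print(retransmit_set(sent, received, sack=False))  # [3, 4, ..., 10]
```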
net.ipv4.tcp_sack = 1
This should be 1. Always. SACK has been recommended since RFC 2018 in 1996. Any argument for disabling it in production is almost certainly wrong.
One historical note: the "SACK Panic" CVEs (specifically CVE-2019-11477 and related) were real kernel vulnerabilities triggered through malformed SACK sequences. The correct mitigation was patching, not disabling SACK. If you find a runbook somewhere that says "disable SACK to fix CVE-2019-11477," that runbook was written during the emergency window and never updated. Patch your kernel.
DSACK (Duplicate SACK) extends SACK to let the receiver report duplicate segments, giving the sender information about whether loss was real or whether the ACK was lost in transit. This helps congestion control avoid unnecessary window reductions.
net.ipv4.tcp_dsack = 1
Also leave this at 1.
The Retransmission Timeout
When TCP doesn't receive an ACK within the RTO window, it retransmits the unacknowledged segment and doubles the RTO (exponential backoff). The correct sysctl to tune the global minimum RTO on current kernels is:
net.ipv4.tcp_rto_min_us
Note the _us suffix: this is in microseconds. The documented default is 200000 (200ms). The kernel docs also define a clear precedence chain: per-route rto_min overrides socket-level options, which override this sysctl. That hierarchy matters.
For a controlled, low-latency data center path where you want tighter RTO, prefer the per-route approach to avoid affecting every connection on the host:
ip route change <DEST_PREFIX> via <GW> dev <IFACE> rto_min 200ms
If you're changing the global minimum, use the correct knob and understand what you're doing:
sysctl net.ipv4.tcp_rto_min_us # check current value first
The 200ms default is an eternity on a 1ms LAN. On internet paths where RTT varies significantly, leave it alone and use per-route scoping instead.
tcp_retries2 controls how many times TCP will retransmit before giving up on the connection entirely. The default of 15 amounts to roughly 15-20 minutes of retransmission attempts with exponential backoff. For most server applications, that patience is appropriate. For applications that need to detect dead connections faster, reduce this and rely on application-level keepalives instead.
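The arithmetic behind that "roughly 15-20 minutes" claim can be sketched directly (assuming a 200ms starting RTO and Linux's 120s TCP_RTO_MAX cap; the real timeline depends on the connection's measured RTT):

```python
def retrans_timeline(rto_initial, retries, rto_max=120.0):
    """Exponential backoff schedule: each retransmission doubles the
    RTO, capped at TCP_RTO_MAX (120s on Linux). Returns per-attempt
    waits in seconds."""
    waits, rto = [], rto_initial
    for _ in range(retries):
        waits.append(min(rto, rto_max))
        rto *= 2
    return waits

waits = retrans_timeline(rto_initial=0.2, retries=15)
print(f"total before giving up: {sum(waits):.1f}s (~{sum(waits)/60:.1f} min)")
# Dropping retries to 4 shrinks the same schedule to a few seconds.
print(f"with tcp_retries2=4: {sum(retrans_timeline(0.2, 4)):.1f}s")
```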
ECN: Explicit Congestion Notification
ECN is a mechanism that allows routers to signal congestion to endpoints without dropping packets. Routers that support ECN mark packets with the Congestion Experienced (CE) bit when their queues are filling. TCP endpoints that support ECN respond to CE-marked packets by reducing their congestion window, exactly as they would respond to loss, but without the retransmission overhead.
The documented default for net.ipv4.tcp_ecn is 2, not 0:
0: ECN disabled entirely
1: ECN enabled, request it on all outgoing connections
2: ECN passive mode (the default), accept it if the remote side initiates but don't request it
Modern kernels also support AccECN (Accurate ECN) negotiation modes via higher values. Check your kernel docs for the full list of supported modes on your version.
ECN has historically had middlebox compatibility problems: some firewalls and routers incorrectly drop or modify ECN-marked packets, causing connection failures. This has improved significantly as network equipment has been updated. Enabling mode 1 (requesting ECN on all connections) is reasonable on controlled internal networks. For public-facing services, test middlebox behavior with packet captures before flipping it globally.
sysctl net.ipv4.tcp_ecn # verify current value before changing
TCP Window Scaling
The original TCP receive window field is 16 bits, limiting the advertised window to 65,535 bytes. In 1981, when TCP was standardized, a 64KB window seemed enormous. On a modern 10Gbps WAN link with 100ms RTT, the bandwidth-delay product is 125MB. A 64KB window limits throughput to 64KB / 0.1s = 5.24 Mbps on that path. Not 5 Gbps. 5 Mbps.
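A quick check of that arithmetic, using a 10Gbps path with 100ms RTT as the example:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8

def max_throughput_bps(window_bytes, rtt_s):
    """Throughput ceiling imposed by a fixed window: one window per RTT."""
    return window_bytes * 8 / rtt_s

print(f"BDP of 10Gbps x 100ms: {bdp_bytes(10e9, 0.1) / 1e6:.0f} MB")        # 125 MB
print(f"64KB window on that path: {max_throughput_bps(65535, 0.1) / 1e6:.2f} Mbps")  # 5.24
```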
RFC 7323 (which obsoletes RFC 1323) governs TCP window scaling. The Linux sysctl:
net.ipv4.tcp_window_scaling = 1
This must be 1. On any high-bandwidth or high-latency path without window scaling, TCP performance will be limited to a small fraction of available bandwidth. Window scaling is negotiated during the handshake: both sides must support it, and if either side disables it, neither can use it.
Some ancient middleboxes mishandle the Window Scale option. If you're seeing mysteriously limited throughput on specific paths and both endpoints look correctly configured, a middlebox stripping or zeroing the Window Scale option is worth investigating. tcpdump will reveal it: look at the options field in SYN packets.
Multipath TCP: The Protocol That Makes TCP Smarter Than Your Network
Multipath TCP (MPTCP), most recently specified in RFC 8684 (MPTCPv1), extends TCP to allow a single connection to use multiple network paths simultaneously. A smartphone with WiFi and LTE can open an MPTCP connection that uses both interfaces at once, aggregating bandwidth and seamlessly failing over between them without dropping the connection.
This is not a minor optimization. It is a fundamentally different model of what a "connection" means.
How MPTCP Works
MPTCP begins as a regular TCP connection. During the handshake, both sides negotiate MPTCP capability using TCP options. If both sides support MPTCP, they can then establish additional subflows: new TCP connections between different address pairs that contribute traffic to the same logical MPTCP connection.
Each subflow is a standard TCP connection from the network's perspective. Middleboxes see normal TCP. MPTCP's multipath behavior is entirely in the endpoints. The MPTCP layer presents a single byte stream to the application, multiplexed across however many subflows are available.
MPTCP in Linux
MPTCP is in-kernel starting with Linux 5.6. One important clarification the original guidance got wrong: net.mptcp.enabled = 1 enables MPTCP capability in the kernel, but it does not transparently convert all existing TCP connections to MPTCP. Upstream Linux MPTCP is per-socket. Applications create MPTCP connections by using IPPROTO_MPTCP instead of IPPROTO_TCP:
int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
For legacy applications that you can't modify, mptcpize is the compatibility shim:
mptcpize run <your-application>
Enable MPTCP capability in the kernel:
net.mptcp.enabled = 1
Verify it's working:
sysctl net.mptcp.enabled
ss -iaM # shows MPTCP-specific socket info
Path Manager Configuration
The path manager controls which additional subflows get created. One sysctl note: net.mptcp.pm_type is deprecated as of kernel 6.15. Current kernels use net.mptcp.path_manager. Always check what your kernel actually exposes:
sysctl -a | grep -E 'net.mptcp.(pm_type|path_manager)'
Configure endpoints for subflow creation:
ip mptcp endpoint add <address> dev <interface> subflow
ip mptcp endpoint add <address> dev <interface> signal # advertises address to remote
MPTCP Scheduler: Which Subflow Gets the Data
When MPTCP has multiple subflows available, it needs a scheduling algorithm to decide which subflow to use for each segment:
Default (minRTT): Send on the subflow with the lowest RTT. Optimizes for latency. Good for interactive workloads.
Redundant: Send the same data on all subflows, using whichever ACK arrives first. Eliminates head-of-line blocking at the cost of doubling bandwidth usage. Useful for video streaming where latency variability is damaging.
Round-robin: Distribute segments across subflows in sequence. Maximizes bandwidth aggregation on equal-quality paths.
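The selection logic each policy implies can be sketched in a few lines. This is a toy model, not the kernel scheduler (which also accounts for congestion windows and in-flight data), but it makes the policies concrete:

```python
def pick_subflows(subflows, scheduler="minrtt"):
    """Toy model of MPTCP segment scheduling. 'subflows' maps a subflow
    name to its smoothed RTT in ms; returns which subflow(s) carry the
    next segment under each policy."""
    if scheduler == "minrtt":
        return [min(subflows, key=subflows.get)]  # lowest-RTT path only
    if scheduler == "redundant":
        return list(subflows)                     # duplicate on every path
    raise ValueError(f"unknown scheduler: {scheduler}")

paths = {"wifi": 5.0, "lte": 80.0}
print(pick_subflows(paths, "minrtt"))     # ['wifi']
print(pick_subflows(paths, "redundant"))  # ['wifi', 'lte']
```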
Where MPTCP Actually Helps
Mobile devices with multiple interfaces are the canonical use case. Apple has used MPTCP since iOS 7 for Siri and expanded its use significantly since then. A connection seamlessly transitions from WiFi to cellular without the user noticing because MPTCP maintains the connection while switching the active subflow.
Data center applications with multiple paths between servers (common in spine-leaf topologies) can aggregate bandwidth across multiple paths. Traditional ECMP hashing distributes flows across paths but pins any single flow to one path. MPTCP allows one application-level connection to use all available paths simultaneously.
High-availability requirements: MPTCP connections survive the failure of individual subflows, transparently migrating to surviving paths. The application doesn't see a connection failure; it sees a momentary throughput reduction.
Where MPTCP Complicates Things
Congestion control across multiple subflows is harder than single-path TCP. Each subflow runs its own congestion control algorithm, but they compete for bandwidth on shared paths. MPTCP-aware congestion control (LIA, OLIA, BALIA) attempts to treat multiple subflows as a coupled system, preventing any individual subflow from being unfair to single-path TCP on shared links. These algorithms are not in the mainline kernel by default and require out-of-tree patches or specific distributions. For most use cases, running BBR on individual subflows and accepting the potential unfairness is the practical choice.
Reordering is another concern. Packets from different subflows may arrive out of order, since different paths have different latencies. MPTCP handles reordering, but it introduces a reorder buffer at the receive side that can add latency. If subflow RTT differences are large (say, 5ms WiFi vs 80ms LTE), the scheduler should prefer the low-latency path for latency-sensitive data.
The Knobs You Probably Shouldn't Touch
tcp_timestamps: Leave It On
net.ipv4.tcp_timestamps = 1
TCP timestamps, defined in RFC 7323, serve two purposes. They enable more accurate RTT measurement (the kernel can measure the time from sending a segment to receiving its ACK with millisecond precision). And they enable Protection Against Wrapped Sequence Numbers (PAWS), which prevents old segments from being mistaken for new ones on high-speed connections where the 32-bit sequence number can wrap around quickly.
Modern kernels use randomized timestamp offsets per connection, which eliminates the uptime fingerprinting concern while retaining the protocol benefits. Some administrators disable timestamps because the option adds 10 bytes of overhead to every segment (12 with alignment padding). The RTT accuracy benefit to congestion control is significant, and PAWS is genuinely necessary on any multi-gigabit connection. Leave timestamps on.
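The PAWS argument comes down to how fast 2^32 bytes of sequence space goes by. A quick calculation:

```python
def seq_wrap_seconds(bandwidth_bps):
    """Time for the 32-bit TCP sequence space (2^32 bytes) to wrap at a
    given send rate -- the window during which an old duplicate segment
    could be mistaken for new data without PAWS protection."""
    return 2**32 / (bandwidth_bps / 8)

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: sequence space wraps in {seq_wrap_seconds(gbps * 1e9):.1f}s")
```

At 10 Gbps the sequence space wraps in under four seconds, which is shorter than a segment can plausibly linger in the network without PAWS noticing.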
tcp_syn_retries and tcp_synack_retries: Don't Reduce Aggressively
net.ipv4.tcp_syn_retries = 6
net.ipv4.tcp_synack_retries = 5
These control how many times TCP retries the initial SYN and SYN-ACK before giving up on connection establishment. Reducing them makes your server faster to declare remote hosts unreachable, which sounds useful until a remote host that was briefly unavailable (rebooting, brief network partition) fails to connect because you gave up too quickly.
Reducing tcp_synack_retries to 2 or 3 is sometimes done on public-facing servers to limit the resource consumption from SYN flood attacks (where attackers send SYN packets without completing the handshake). This is a legitimate use case, but always pair it with SYN cookies (net.ipv4.tcp_syncookies = 1), which let the server survive floods without tracking half-open state.
net.ipv4.tcp_fin_timeout: Understand Before Reducing
The FIN_WAIT_2 timeout, not to be confused with TIME_WAIT. After sending FIN, if the remote side sends ACK but then goes silent without sending its own FIN (the connection is half-closed), the kernel waits tcp_fin_timeout seconds before cleaning up. The documented default is 60 seconds.
Reducing this helps when you have applications that close connections without properly completing the four-way teardown. It's a reasonable thing to reduce if you're seeing large numbers of FIN_WAIT_2 sockets. "Reduce FIN_WAIT_2 timeout" and "reduce TIME_WAIT timeout" are different operations with different implications: conflating them causes confusion.
tcp_mem Maximum: Do the Math, Don't Guess
The maximum value of tcp_mem limits total TCP memory usage. Remember it's in pages. Blindly copying a large value from a tuning guide can cause your system to allocate enormous amounts of memory to TCP buffers, starving other processes. Do the math: expected concurrent connections * expected per-socket buffer size gives you the theoretical maximum. Your actual tcp_mem maximum should be bounded by your available RAM, with room for everything else running on the system. The kernel computes a reasonable default at boot. Change it with intention, not cargo-cult.
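The math described above, as a sketch (the connection count and per-socket buffer size are made-up example numbers; substitute your own measurements):

```python
PAGE = 4096  # bytes; tcp_mem is denominated in pages, not bytes

def tcp_mem_pages(concurrent_conns, per_socket_bytes):
    """Theoretical worst-case TCP memory in pages: every connection
    filling its buffers to per_socket_bytes. Compare against the
    tcp_mem max you're about to set -- and against your actual RAM."""
    return concurrent_conns * per_socket_bytes // PAGE

pages = tcp_mem_pages(concurrent_conns=50_000, per_socket_bytes=256 * 1024)
print(f"{pages:,} pages = {pages * PAGE / 2**30:.1f} GiB worst case")
```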
The Fastest Ways to Break TCP Performance
Let's talk about the hall of shame: the most reliably effective ways to make TCP perform badly.
Disabling window scaling on one side. A single sysctl on one endpoint caps throughput on every high-BDP path from that host. The beauty of this mistake is that it looks fine on LAN tests, only manifesting on WAN connections. You'll spend days looking at the wrong things.
Setting socket buffer sizes too small in application code. The application sets SO_SNDBUF or SO_RCVBUF explicitly and pins them to a value below the bandwidth-delay product. The kernel respects explicit socket buffer sizes and disables autotuning for that socket. Common in older application code that set socket options when 64KB was a large buffer. The symptom is throughput that plateaus far below link capacity on high-latency paths.
Setting tcp_tw_reuse = 1 globally without understanding PAWS assumptions. If your network has NAT devices with inconsistent timestamp behavior, global TIME_WAIT reuse can cause subtle, intermittent connection failures that are genuinely difficult to diagnose. Use it if you need it. Know why you need it. Check what it's actually doing to your connections.
Mismatched MTU causing constant fragmentation. Setting a jumbo frame MTU (9000 bytes) on your server when the path MTU to clients is 1500 bytes causes every large packet to be fragmented, or worse, silently dropped if the "Don't Fragment" bit is set. PMTUD (Path MTU Discovery) should handle this, but firewalls blocking ICMP "fragmentation needed" messages break PMTUD silently. Symptoms: small packets work fine, large transfers fail or perform terribly.
Aggressive net.ipv4.tcp_retries2 reduction. Dropping tcp_retries2 to 3 or 4 means TCP gives up on connections after a few seconds of packet loss, rather than the default 15-20 minutes. For a server whose upstream path has a 30-second brownout, this means dropping every active TCP connection during the brownout. The default's patience is usually a feature.
Disabling SACK. Every lost packet in a window requires retransmitting all subsequent packets. On any lossy path, this is catastrophic. The only reason to disable SACK is if you've found a buggy endpoint that mishandles SACK (extremely rare, and the correct fix is to update the endpoint, not disable SACK globally). If you disabled SACK as a CVE-2019-11477 mitigation and forgot to re-enable it after patching, you are still running with your hand tied behind your back.
Running CUBIC on a very lossy path. CUBIC performs well on reliable paths. On a path with even 0.5% random loss (typical for some wireless links), CUBIC's window gets cut constantly and throughput collapses. Use BBR on lossy paths.
Mixing MTUs within a bonded or LAG interface. Setting different MTUs on the member interfaces of a bond or LAG causes intermittent and mysterious fragmentation depending on which interface the packet was sent on. Always set MTU on the bond interface, not the member interfaces.
Forgetting to set TCP_NODELAY in a latency-sensitive application and wondering why the p99 is 200ms. This happens more than the industry would like to admit. The Nagle-delayed-ACK interaction is a 200ms floor that sits invisibly under your application latency until someone looks at it with a packet capture.
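The fix is a one-line socket option. A minimal Python sketch (the equivalent setsockopt call exists in every language's socket API):

```python
import socket

def make_low_latency_socket():
    """Create a TCP socket with Nagle disabled. TCP_NODELAY can be set
    before connect() and applies to every subsequent send."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s

s = make_low_latency_socket()
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # non-zero once set
s.close()
```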
Using the wrong RTO knob. Setting net.ipv4.tcp_rto_min = 200 instead of net.ipv4.tcp_rto_min_us = 200000 sets a value 1,000x smaller than intended. The kernel won't reject the value. Your connections will behave erratically and you will not immediately understand why.
A Coherent Tuning Profile: Putting It Together
Safe baseline (minimal changes, explicit defaults, low drama):
# Verify what you have before changing anything
uname -r
sysctl net.ipv4.tcp_available_congestion_control
tc -s qdisc show
ss -tmiH | head -n 20

# Feature flags: all should already be on, make them explicit
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_dsack=1
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1

# ECN: documented default is 2 (passive). Move to 1 only after testing middleboxes.
sysctl -w net.ipv4.tcp_ecn=2

# Keepalives (only if your application enables SO_KEEPALIVE)
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=6
High-BDP bulk transfer (after confirming rwnd_limited or cwnd_limited via ss):
# Buffer ceilings: only if measured BDP exceeds current autotuned ceiling
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728

# Keep middle values consistent with documented defaults; raise max only
sysctl -w net.ipv4.tcp_rmem="4096 131072 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 16384 134217728"

# If BBR is available and you control the sender:
sysctl net.ipv4.tcp_available_congestion_control # verify first
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
For MPTCP on a multi-homed server:
sysctl -w net.mptcp.enabled=1
# Configure endpoints with ip mptcp endpoint
# Use IPPROTO_MPTCP in application socket calls, or mptcpize for legacy apps
ss -iaM # verify subflows
For ultra-low-latency workloads: set TCP_NODELAY and TCP_QUICKACK at the application level, reduce interrupt coalescing to near-zero, disable TSO and GRO to eliminate batching latency (and measure the CPU cost), and consider DPDK to eliminate the kernel entirely. That is a different article.
The Actual Lesson
TCP gives you a remarkable amount of control over its behavior, and that control is genuinely useful. The buffer sizing math is real. The congestion control algorithm selection is real. The Nagle/delayed-ACK interaction is real and it bites people constantly. The tcp_rto_min_us vs tcp_rto_min distinction is a unit error waiting to cause a production incident for someone.
But most TCP performance problems are not kernel tuning problems. They are application problems: connection pooling missing, buffers set explicitly in application code to wrong values, TCP_NODELAY not set, keepalives not enabled. Fix the application first. Then measure. Then tune the kernel parameters that the measurement tells you are the bottleneck.
The kernel defaults are not optimal for modern hardware, but they're not random either. They represent decades of careful calibration by people who understood TCP deeply. When you deviate from them, know why, verify your version's actual defaults, and measure before and after.
The best TCP tuning you can do is to understand what TCP is trying to accomplish well enough to tell whether a given knob is helping or hurting. Everything else is just moving numbers around until something looks better in a benchmark you may not be running correctly.
Have strong opinions about BBR fairness, a war story about tcp_tw_recycle eating your production traffic, or a unit error that cost you a night of sleep? Leave me a comment on https://www.linkedin.com/in/nvscottmorrison/