
RoCE and the Coming Collapse of Reactive Networking: A Hypothesis About Predictive Traffic Engineering in AI Infrastructure

Scott Morrison, March 08, 2026
rdma roce dcqcn pfc ecn ai-infrastructure traffic-engineering sonic gpu-networking hypothesis
PFC stops drops by pausing the fabric. DCQCN stops congestion by signaling rate reduction. Neither of them knows that the all-reduce operation consuming your entire spine was scheduled by a compiler before the first packet was generated, and that is the entire problem.

The internet was built for conversations. Request, response, acknowledge, repeat. The protocols that carry it, TCP, BGP, OSPF, were designed around an assumption so deeply embedded that we rarely bother to name it: traffic patterns are fundamentally unpredictable, so routing and congestion control must be reactive. Something happens, then the network responds to it.

That assumption is becoming wrong, and the infrastructure underpinning large-scale AI training and inference is the reason why.

When a frontier-class model performs a forward pass across a thousand GPUs, the communication patterns are not random. They are derived from a computational graph that was compiled hours or days before execution. The all-reduce operations across a ring of a thousand GPUs, the activation tensors flowing between pipeline stages, the KV cache transfers that happen during speculative decoding, all of it is mathematically deterministic once the model architecture and batch configuration are fixed. The network has no way to know this. BGP is routing packets as if they came from a human clicking a link. The congestion control stack is reacting to drops it could have predicted milliseconds in advance. The entire reactive architecture of the modern data center network is a structural mismatch for the workload it is being asked to carry.

This paper makes a specific hypothesis: the next fundamental shift in AI infrastructure networking will not be a faster protocol or a bigger buffer. It will be the collapse of the boundary between the compute scheduler and the network scheduler. The execution graph that Groq's compiler uses to pre-schedule every clock cycle of chip-to-chip communication, extended from silicon to fiber, applied not just within a chassis but across the entire fabric, becomes the architecture for a network that does not react to congestion because it predicted and routed around it before the first packet was generated.

Let's start at the bottom, with the protocol that makes lossless AI networking possible today, understand why it is structurally limited, and then build the case for what replaces it.


Part One: RDMA over Converged Ethernet

The Problem With Normal TCP for This Workload

An all-reduce operation across 1,024 GPUs in a transformer training run transfers hundreds of gigabytes of gradient data. The latency budget is measured in microseconds. A TCP connection to accomplish this would require kernel processing for every packet, memory copies from NIC buffers to kernel space to user space, interrupt handling, scheduling jitter, all adding latency that compounds across thousands of nodes. A 10-microsecond kernel overhead per message is irrelevant on one GPU. It is catastrophic when every GPU in your cluster is waiting on a collective that must complete before the next forward pass can begin.

This is the founding problem that Remote Direct Memory Access (RDMA) was designed to solve. RDMA allows one host to directly read from or write to the memory of another host, bypassing the kernel entirely on both sides. No CPU involvement. No memory copy. The NIC accesses system memory directly via DMA, and the remote NIC writes directly to the destination process's virtual memory space. The result is single-digit-microsecond latency and message rates in the tens of millions per second per port, consuming a fraction of the CPU resources that TCP would require.

RDMA was born on InfiniBand, a lossless fabric designed from the ground up for high-performance computing. InfiniBand handles congestion through a credit-based flow control system that prevents senders from sending more data than receivers can accept, maintaining lossless delivery without drops. This is elegant and it works, but it requires dedicated InfiniBand hardware, InfiniBand switches, InfiniBand management infrastructure, and an InfiniBand operations team. For hyperscalers building clusters with tens of thousands of nodes, InfiniBand at scale costs a number that hyperscalers prefer not to write in documents.

Enter RDMA over Converged Ethernet, which is the uncomfortable compromise the industry made when it decided that existing Ethernet infrastructure was too large an investment to throw away.

RoCE v1 and v2: The Architecture

RoCE maps RDMA's communication semantics onto Ethernet. There are two versions with fundamentally different network-layer behavior.

RoCE v1 operates at Layer 2. It encapsulates InfiniBand transport packets directly into Ethernet frames. This means RoCE v1 is non-routable: it cannot cross a Layer 3 boundary, which limits it to a single broadcast domain. For a cluster that fits within a single top-of-rack or end-of-row deployment, this can work. For anything at hyperscaler scale, Layer 2 non-routability is a non-starter.

RoCE v2 encapsulates RDMA packets inside UDP/IP. The outer IP header makes the traffic routable across Layer 3 boundaries, which is what makes RoCE v2 the version that actually gets deployed in large clusters. The UDP destination port 4791 identifies RoCE v2 traffic. The IP layer provides addressability; the UDP layer provides port-based demultiplexing; the InfiniBand transport layer (IB BTH) inside the UDP payload handles reliability, ordering, and the actual RDMA semantics.

The network stack for a RoCE v2 transmission looks like this: your application calls ibv_post_send() with a work request. The RDMA NIC (called an RNIC, or more commonly an HCA, Host Channel Adapter) processes the work request, constructs the InfiniBand transport packet, wraps it in UDP/IP/Ethernet headers, and hands it to the wire. The remote RNIC receives the packet, extracts the RDMA operation, and places the payload directly into the destination memory address specified in the packet. The remote CPU never touches it.

The InfiniBand transport layer provides two primary service types:

Reliable Connected (RC): One queue pair maps to exactly one remote queue pair. Reliability is guaranteed by ACKs and retransmission. Most RDMA workloads use RC. The problem with RC in large clusters is that the total number of queue pairs scales as O(N²): every host needs a queue pair to every other host it communicates with. At 1,024 nodes that is roughly a thousand QPs per host and over a million across the cluster, and real deployments multiply the per-host count further with one QP per NCCL channel or per GPU. The QP context state overflows the RNIC's on-chip caches, and the resulting cache misses cause significant performance degradation.

Unreliable Datagram (UD): One queue pair communicates with any remote queue pair. No guaranteed delivery, no ordering. Significantly more scalable, but reliability becomes the application's problem. MPI implementations and some collective communication libraries use UD with application-level error handling.

Two further service types target the QP scalability problem. Reliable Datagram (RD) was specified in InfiniBand for exactly this purpose but has essentially never shipped in hardware. Extended Reliable Connected (XRC) is the practical answer: it reduces the O(N²) QP problem by sharing destination-side QPs across multiple initiators. It is supported on Mellanox/NVIDIA ConnectX hardware and is increasingly important in large deployments.
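The QP scaling math is worth making concrete. A toy calculation (the per-peer QP multiplier is an illustrative assumption; real values depend on the collective library's channel configuration):

```python
def per_host_rc_qps(nodes: int, qps_per_peer: int = 1) -> int:
    """RC queue pairs one host must hold: one (or more) per remote peer."""
    return (nodes - 1) * qps_per_peer

def cluster_rc_qps(nodes: int, qps_per_peer: int = 1) -> int:
    """Total RC queue pairs cluster-wide: the O(N^2) term."""
    return nodes * (nodes - 1) * qps_per_peer

# 1,024 nodes, one QP per peer: ~1K QPs per host, ~1M cluster-wide.
print(per_host_rc_qps(1024))        # 1023
print(cluster_rc_qps(1024))         # 1047552

# With, say, 8 QPs per peer (one per collective channel), per-host
# state reaches thousands of QP contexts, which is what overflows
# the RNIC's on-chip caches.
print(per_host_rc_qps(1024, qps_per_peer=8))  # 8184
```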

Why RoCE Needs a Lossless Network, and What That Costs You

InfiniBand's credit-based flow control prevents packet drops natively. Ethernet does not. Ethernet, under congestion, drops packets. This is fine for TCP, which expects drops and has an entire congestion control architecture built around recovering from them gracefully.

For RDMA, a dropped packet is not an opportunity for congestion control to do its job. It is a disaster. When an RC queue pair misses a packet, the entire connection stalls waiting for retransmission. When the retransmit timer fires, the sender performs go-back-N recovery: it retransmits everything from the last acknowledged point, potentially resending many packets the receiver already has. A single drop can therefore inject latency measured in milliseconds, and pathological retry sequences latency measured in hundreds of milliseconds, into a communication fabric that was supposed to operate at single-digit microseconds.

More critically: RDMA retransmission storms can cause cascading congestion. A retransmit flood at one congestion point creates more drops elsewhere, which creates more retransmits, which creates more drops. You have reproduced the 1986 Internet congestion collapse, in your AI training cluster, at 400Gbps.

This is why RoCE v2 absolutely requires a lossless (or near-lossless) network fabric. Two mechanisms deliver that losslessness on Ethernet: Priority Flow Control and Explicit Congestion Notification.


Part Two: Priority Flow Control

PFC is a brute-force solution to a subtle problem. The problem: Ethernet cannot express backpressure at the granularity of individual flows. When a switch port's buffer fills, it drops packets indiscriminately from all flows, including RDMA flows that cannot tolerate drops.

IEEE 802.1Qbb (PFC, or Priority Flow Control) extends Ethernet PAUSE frames to operate per-priority-class rather than per-port. A switch can now tell an upstream device: "stop sending traffic on priority class 3 while letting priorities 0, 1, 2, 4, 5, 6, and 7 continue." The RNIC maps its RDMA traffic to a specific priority class (typically class 3 in standard deployments), and PFC ensures that class is never dropped due to local congestion.

The mechanism: when a switch's ingress buffer occupancy for priority class 3 crosses the XOFF threshold, the switch sends a PFC PAUSE frame out of that ingress port to the upstream device, instructing it to stop sending priority 3 traffic for a specified number of pause quanta. The upstream device pauses transmission, the buffer drains below the XON threshold, and the switch sends a resume frame.
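A toy model of the XOFF/XON behavior on a single priority queue makes the hysteresis visible (threshold values here are illustrative, not vendor defaults):

```python
class PfcQueue:
    """Toy model of one priority queue's PFC XOFF/XON hysteresis.
    Byte thresholds are illustrative, not vendor defaults."""

    def __init__(self, xoff=600_000, xon=200_000):
        self.xoff, self.xon = xoff, xon
        self.occupancy = 0
        self.paused_upstream = False  # have we sent XOFF upstream?

    def enqueue(self, nbytes):
        self.occupancy += nbytes
        if self.occupancy >= self.xoff and not self.paused_upstream:
            self.paused_upstream = True   # emit PFC PAUSE (XOFF) upstream

    def dequeue(self, nbytes):
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.occupancy <= self.xon and self.paused_upstream:
            self.paused_upstream = False  # emit resume (XON) upstream

q = PfcQueue()
q.enqueue(700_000)
print(q.paused_upstream)  # True: occupancy crossed XOFF, upstream is paused
q.dequeue(550_000)
print(q.paused_upstream)  # False: drained below XON, resume frame sent
```

The gap between XOFF and XON is what prevents the switch from flapping pause/resume frames on every packet.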

This stops drops. It also introduces a set of failure modes that are genuinely spectacular.

The Problems With PFC

Head-of-line blocking. PFC pauses an entire priority class on an entire port. If two RDMA flows share a port and one flow's destination is congested, the PFC pause halts both flows, even if the second flow's path is completely clear. The second flow is innocent. PFC punishes it anyway. In a large fat-tree or Clos topology with many flows competing for port bandwidth, head-of-line blocking can reduce effective throughput to well below the physical link rate.

PFC deadlocks. In a network with multiple switch hops, PFC pause frames can propagate upstream in circular patterns. Switch A pauses Switch B. Switch B pauses Switch C. Switch C pauses Switch A. Nothing moves. This is a deadlock in the classic sense, a cycle of buffer dependencies, and it is not theoretical: it has occurred in production deployments. The conditions that trigger PFC deadlocks are topology-dependent and can be extremely difficult to reproduce in test environments, which means they often appear for the first time under production load patterns your test lab never generated.

PFC storms. A single congested destination can trigger a cascade of PFC pauses that propagates through the entire fabric, pausing traffic far removed from the actual congestion point. A switch at the edge of a fat-tree topology can receive PFC pauses from downstream and propagate them upstream to the spine, effectively shutting down significant portions of the fabric to protect one congested queue. This is a correctness feature (packets are not dropped) that behaves like a failure mode (the network stops working).

Watchdog mechanisms. Most switch platforms (SONiC among them) implement PFC watchdog timers that detect a pause lasting longer than expected and respond by disabling PFC on the affected queue, dropping its traffic until the condition clears. This prevents indefinite hangs from PFC deadlocks, but it converts a "network is paused" condition into packet loss, which stalls or tears down the affected queue pairs, leaving the application to recover. You traded a deadlock for a disconnection. Depending on your recovery logic, this is either slightly better or significantly worse.

The fundamental problem with PFC is architectural: it is a hop-by-hop backpressure mechanism that was designed for Ethernet, not for the traffic patterns of distributed AI workloads. It operates at the granularity of priority classes on ports, which is far too coarse for a network carrying thousands of concurrent RDMA flows with wildly different destination congestion states.

ECN provides a more surgical alternative.


Part Three: Explicit Congestion Notification in RDMA Fabrics

ECN (RFC 3168) allows switches to mark packets as congestion-experienced without dropping them. The CE (Congestion Experienced) bit in the IP header is set by a switch when its queue depth exceeds a threshold. The receiver copies the CE marking into its acknowledgment, and the sender reduces its transmission rate.

For TCP, we covered ECN in the TCP tuning context. For RDMA, ECN is implemented through a mechanism called DCQCN (Data Center Quantized Congestion Notification), developed jointly by Mellanox and Microsoft and described in a 2015 SIGCOMM paper. DCQCN is now the de facto standard for RoCE congestion management, though you'll also encounter variations like HPCC (High Precision Congestion Control) from Alibaba and SWIFT from Google.

DCQCN: The Mechanism

DCQCN has three components: the switch, the receiver, and the sender.

The switch uses RED (Random Early Detection) with ECN marking. When queue occupancy exceeds a minimum threshold, packets are marked with probability that increases linearly as queue occupancy increases toward a maximum threshold. Above the maximum threshold, all packets are marked. The critical parameters are the minimum and maximum queue thresholds, which must be tuned to your link speed and RTT. Too low, and you mark traffic that would have drained without congestion. Too high, and you allow queue buildup that causes PFC triggers.
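The marking curve itself is simple; a minimal sketch of the linear RED-style probability function, using the threshold names from above:

```python
def ecn_mark_probability(queue_bytes: int, kmin: int, kmax: int,
                         pmax: float = 1.0) -> float:
    """Linear RED-style ECN marking curve: no marking below Kmin,
    linear ramp up to Pmax at Kmax, mark everything above Kmax."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)

# With Kmin=400KB and Kmax=2MB: nothing marked at 300KB, half the
# packets marked at the midpoint, everything marked past 2MB.
print(ecn_mark_probability(300_000, 400_000, 2_000_000))    # 0.0
print(ecn_mark_probability(1_200_000, 400_000, 2_000_000))  # 0.5
print(ecn_mark_probability(2_500_000, 400_000, 2_000_000))  # 1.0
```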

The receiver (CNP generation): When the RNIC receives a CE-marked packet, it generates a Congestion Notification Packet (CNP). The CNP is a special RDMA packet sent back to the sender on a separate out-of-band path. CNPs are generated at most once per RTT per flow, rate-limited to prevent CNP floods from adding congestion while trying to signal congestion.

The sender (rate reduction and recovery): When the RNIC receives a CNP, it applies a multiplicative decrease to the transmission rate of the affected flow. The rate reduction factor (typically 0.8, meaning 20% reduction) is applied immediately. The sender then enters a recovery phase where it gradually increases the rate back toward line rate using a combination of additive increase phases (byte counter based) and timer-based recovery. The total rate controller mimics AIMD at the RDMA transport layer.
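A simplified sketch of the sender side, assuming a flat 20% cut per CNP and a fixed additive-increase step. Real DCQCN tracks an EWMA congestion estimate (alpha) and adds fast-recovery and hyper-increase stages; the constants here are illustrative:

```python
class DcqcnSenderSketch:
    """Simplified DCQCN-style sender: multiplicative decrease on CNP,
    additive increase during recovery. Constants are illustrative."""

    def __init__(self, line_rate_gbps=100.0, md_factor=0.8, ai_gbps=5.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps
        self.md = md_factor   # fraction of rate kept per CNP (0.8 = 20% cut)
        self.ai = ai_gbps     # additive increase per recovery tick

    def on_cnp(self):
        """Congestion Notification Packet received: cut immediately."""
        self.rate *= self.md

    def on_recovery_tick(self):
        """Timer/byte-counter recovery event: creep back toward line rate."""
        self.rate = min(self.line_rate, self.rate + self.ai)

s = DcqcnSenderSketch()
s.on_cnp()
s.on_cnp()
print(round(s.rate, 1))   # 64.0 Gbps after two 20% cuts
for _ in range(8):
    s.on_recovery_tick()
print(s.rate)             # 100.0: recovered and capped at line rate
```

The asymmetry is the point: a cut is instantaneous and multiplicative, recovery is gradual and additive, which is why bursty flows that take frequent CNPs spend most of their life below line rate.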

DCQCN Parameters: What You Actually Need to Tune

DCQCN has enough parameters to keep a graduate student occupied for a semester. The ones that materially affect behavior:

Switch ECN thresholds (Kmin, Kmax, Pmax): The minimum queue depth for ECN marking to begin, the maximum queue depth at which all packets are marked, and the maximum marking probability. These must be calibrated to your link speed. On 100Gbps links, a reasonable starting point is Kmin = 400KB, Kmax = 2MB, Pmax = 1.0. On 400Gbps links, these values need to scale up proportionally. The critical insight is that Kmin must be large enough to accommodate the bandwidth-delay product of your fabric (link speed * RTT), otherwise you'll trigger ECN on flows that are not actually causing congestion.
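The calibration rule is plain arithmetic; a quick sketch of the bandwidth-delay product check:

```python
def bdp_bytes(link_gbps: float, rtt_us: float) -> float:
    """Bandwidth-delay product: bytes in flight on a link in one RTT."""
    return link_gbps * 1e9 / 8 * rtt_us * 1e-6

# A 100Gbps fabric with a 10us RTT keeps 125KB in flight, so a Kmin
# of 400KB sits comfortably above the BDP. At 400Gbps the same RTT
# gives 500KB in flight, and a 400KB Kmin would mark flows that are
# merely filling the pipe -- Kmin must scale with link speed.
print(bdp_bytes(100, 10))  # 125000.0
print(bdp_bytes(400, 10))  # 500000.0
```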

CNP frequency (CNP_Timer): The rate at which CNPs are generated. The default is one CNP per RTT per flow. Increasing CNP generation frequency improves reaction time but increases CNP traffic load on the fabric. On densely scheduled AI fabrics with thousands of concurrent flows, CNP traffic itself can become a non-trivial overhead.

Rate reduction and recovery gains (the multiplicative decrease factor, plus Rai and Hai, the additive and hyper-additive increase rates): The default decrease to 0.8 of the current rate per CNP (a 20% cut) is aggressive. If your flows are bursty (which all-reduce patterns are), too aggressive a reduction means you spend most of your time in rate recovery, never approaching line rate. If too conservative, queue buildup continues despite the rate reduction.

Initial rate (IRR, Initial Rate on Rate Recovery): After a congestion event, the rate the sender begins the additive increase phase at. Setting this too high means you re-congest immediately. Too low means you take many RTTs to recover.

min_timer for rate recovery: The timer that triggers byte counter-based rate recovery phases. Critical for ensuring that low-bandwidth flows that don't accumulate byte counts quickly still eventually recover their rate.

The uncomfortable fact about DCQCN is that its parameters are highly workload-sensitive. The values that work well for uniform all-to-all traffic (collective operations in training) differ from the values that work well for incast traffic (many-to-one, common in inference serving). Most deployments tune for one pattern and accept suboptimal behavior on the other. Nobody has solved the general case.

HPCC and SWIFT: The Alternatives

HPCC (High Precision Congestion Control), developed at Alibaba, takes a different approach to rate control. Rather than reacting to ECN marking, HPCC uses INT (In-Network Telemetry) metadata inserted by switches to give senders precise visibility into queue occupancy and link utilization at every hop along the path. The sender computes its fair share of each link and sets its rate to avoid exceeding any bottleneck link. The result is faster convergence to fair rates, lower queuing delay, and significantly less sensitivity to parameter tuning than DCQCN.

HPCC requires switch support for INT, which is available on programmable ASICs (Tofino, for example) but not on all merchant silicon. It also requires a software RNIC or a programmable RNIC that can process INT metadata and update rate limiters accordingly. These requirements limit its deployability to environments with full-stack control, which is why you see it at hyperscalers (Alibaba, and variants at other large operators) but not in standard enterprise deployments.

SWIFT (Google's delay-based datacenter congestion control) infers congestion from delay rather than ECN marks. Using NIC hardware timestamps, it decomposes measured delay into fabric and host components and compares the result against a target; queue buildup shows up in the delay signal before ECN thresholds are crossed. This allows proactive rate reduction before congestion actually occurs, which is one step toward the predictive model this paper is arguing for, applied at the flow level rather than the fabric level.
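A sketch of what delay-based control looks like in spirit. The update rule and constants here are illustrative, not SWIFT's published algorithm: increase additively while delay is under target, cut multiplicatively in proportion to the overshoot:

```python
def delay_based_rate_update(rate_gbps: float,
                            measured_delay_us: float,
                            target_delay_us: float,
                            ai_gbps: float = 1.0,
                            max_md: float = 0.5) -> float:
    """Illustrative delay-based rate update: additive increase under
    the target delay, multiplicative decrease scaled by how far the
    measured delay overshoots it (capped at a 50% cut)."""
    if measured_delay_us <= target_delay_us:
        return rate_gbps + ai_gbps
    overshoot = min(1.0, (measured_delay_us - target_delay_us) / target_delay_us)
    return rate_gbps * (1 - max_md * overshoot)

print(delay_based_rate_update(100.0, 20.0, 25.0))  # 101.0: under target, creep up
print(delay_based_rate_update(100.0, 50.0, 25.0))  # 50.0: 100% overshoot, max cut
```

The appeal of the delay signal is that it is continuous: the controller sees queue depth growing through the delay measurement, where ECN only reports threshold crossings.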


Part Four: The Topology Problem

DCQCN and PFC together can maintain a near-lossless RoCE fabric under benign traffic conditions. They struggle when the traffic pattern becomes adversarial to the topology, and AI workloads are structurally adversarial to the topologies we currently deploy.

Fat-Tree and the Assumption of Uniform Traffic

The dominant data center topology for AI workloads is the fat-tree (a folded Clos network), built so that the aggregate uplink bandwidth at every level equals the aggregate downlink bandwidth. Fat-trees are non-blocking under uniform traffic: if every host sends to a random destination, no link is oversubscribed in expectation.

AI training workloads are not uniform. They are structured, periodic, and highly correlated. During the backward pass of a transformer training run, every GPU in the cluster is simultaneously performing gradient all-reduce operations with a specific set of peers. The all-reduce pattern in ring-based implementations creates a specific communication graph: each GPU communicates with two neighbors in the ring, and the ring is typically configured to maximize bandwidth by traversing the fabric in a way that avoids uplink/downlink bottlenecks.

But "maximize bandwidth" in the context of ring-allreduce still means that hundreds of high-bandwidth flows start simultaneously, share common links in the fabric, and create synchronized congestion events that PFC and DCQCN were not designed to handle. The issue is not that individual links are oversubscribed; the issue is temporal. For the 50 milliseconds during which all-reduce is executing, the traffic matrix of the fabric is radically different from the equilibrium that ECMP (Equal-Cost Multipath) hashing was configured to balance. ECMP sees a different traffic matrix, hashes flows to paths that seemed balanced at hash-time, and concentrates many flows onto common links regardless.
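The flow-concentration effect is easy to demonstrate with a toy Monte Carlo of uniform ECMP hashing (seed, flow counts, and path counts are illustrative):

```python
import random

def ecmp_hotspot_ratio(flows: int, paths: int, trials: int = 1000) -> float:
    """Average load on the most-loaded path when equal-rate flows are
    hashed uniformly onto equal-cost paths, relative to a perfectly
    even split. 1.0 would be ideal; >1.0 means hot spots."""
    rng = random.Random(42)  # fixed seed for reproducibility
    worst_sum = 0
    for _ in range(trials):
        bins = [0] * paths
        for _ in range(flows):
            bins[rng.randrange(paths)] += 1
        worst_sum += max(bins)
    ideal = flows / paths
    return (worst_sum / trials) / ideal

# 64 synchronized all-reduce flows over 16 equal-cost paths: the
# hottest path typically carries around twice its fair share, even
# though the fabric as a whole has ample capacity.
print(round(ecmp_hotspot_ratio(64, 16), 2))
```

This is exactly the mismatch described above: hashing is statistically fair over long timescales, but a synchronized collective cares about the worst link during one 50ms window.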

This is the structural problem: reactive routing (ECMP, OSPF, BGP) makes decisions based on the current state of the network. AI training workloads create future states that are deterministically predictable from the execution graph. There is an information asymmetry between what the network knows and what the scheduler knows, and that asymmetry is the root cause of congestion in large AI clusters.

The Scale Problem

The state of the art in AI infrastructure in 2025 is clusters of 10,000 to 100,000 GPUs. NVIDIA's NVLink and NVSwitch handle intra-rack and intra-pod all-reduce with tremendous bandwidth. But inter-pod and inter-rack communication still traverses Ethernet or InfiniBand fabrics, and at the scale of 100,000 GPUs, even 400Gbps per port fabrics face traffic engineering problems that current protocols cannot solve.

The numbers: a 100,000 GPU cluster with B300s, each with 400Gbps of network bandwidth, generates 40 petabits per second of potential network traffic. During a synchronized all-reduce, a significant fraction of that traffic is in flight simultaneously. The spine switches in a three-tier Clos topology at this scale are handling traffic from thousands of pods simultaneously, and the routing decisions that determine which paths that traffic takes were made by OSPF days ago based on link utilization metrics that have no relationship to the AI workload's communication pattern.

The hypothesis this paper is building toward: the solution to this problem is not better reactive protocols. It is removing the word "reactive" from the vocabulary of AI fabric networking.


Part Five: Groq's Compiler and the Determinism Insight

Groq builds Language Processing Units (LPUs), purpose-built inference accelerators. The architectural decision at the heart of their performance story is static scheduling: rather than relying on dynamic hardware dispatch, cache coherency protocols, and runtime arbitration, Groq's compiler pre-computes the entire execution graph, including every inter-chip communication event, down to the individual clock cycle.

From their own technical description: "Our compiler pre-computes the entire execution graph, including inter-chip communication patterns, down to the individual clock cycles. This static scheduling eliminates cache coherency protocols, reorder buffers, speculative execution overhead, and runtime coordination delays."

The result is a system where the software knows exactly when data will arrive at every chip. Not approximately. Not with some probability distribution. Exactly. Periodic software synchronization adjusts for crystal-based clock drift, but the execution graph itself is a deterministic schedule. Groq describes it as operating "like a single-core supercluster, sidestepping complex coordination problems found in traditional architectures by starting with the compiler."

This is not a curiosity. This is the existence proof of a claim that many network engineers would find implausible: at scale, in real production workloads, you can eliminate the need for dynamic coordination entirely by precomputing the schedule and making the execution deterministic. Groq does this within a chassis. The question this paper asks is what happens when you extend this principle to the network fabric.

The Execution Graph as a Network Demand Oracle

Consider what the compiler knows at the point when a training job is launched:

  • The model architecture: layer dimensions, attention head counts, MoE expert placement, everything
  • The parallelism strategy: tensor parallelism degree, pipeline parallelism degree, data parallelism degree
  • The batch configuration: batch size, sequence length, microbatch schedule for pipeline parallelism
  • The device mapping: which GPU is at which network address, in which rack, connected to which ToR switch
  • The collective communication library configuration: which all-reduce algorithm, which ring topology, which chunk size

From this information, the collective communication library (NCCL, for example) generates a communication schedule: GPU 0 sends tensor chunk 0 to GPU 1 starting at time T0, GPU 1 sends tensor chunk 1 to GPU 2 starting at time T1, and so on through the entire ring-allreduce operation across all nodes. This schedule is deterministic given the inputs. NCCL computes it before the first iteration begins.
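The determinism claim can be made concrete. A sketch of the reduce-scatter phase of a textbook ring all-reduce schedule (NCCL's actual channel and chunking logic is more elaborate, but equally deterministic):

```python
def ring_reduce_scatter_schedule(n: int):
    """Deterministic send schedule for the reduce-scatter phase of a
    ring all-reduce over n ranks: in step s, rank r sends chunk
    (r - s) mod n to rank (r + 1) mod n. Every (step, src, dst, chunk)
    tuple is known before the first packet exists."""
    events = []
    for step in range(n - 1):
        for rank in range(n):
            events.append((step, rank, (rank + 1) % n, (rank - step) % n))
    return events

sched = ring_reduce_scatter_schedule(4)
print(len(sched))   # 12 events: (n-1) steps x n ranks
print(sched[0])     # (0, 0, 1, 0): step 0, rank 0 sends chunk 0 to rank 1
```

Nothing in this schedule depends on runtime state: given the ring membership, the full communication graph, including which links each transfer will traverse, is computable before iteration 1 begins.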

The network knows none of this. The packets arrive at the ToR switch as if they were HTTP requests from random clients. BGP and OSPF respond as if the destination could be anywhere in the internet. ECMP distributes flows with a hash function that was not designed for, and does not account for, the temporal correlation structure of collective communication patterns.

Here's the uncomfortable truth about the current state of AI networking: the compute scheduler and the communication library have solved the problem of deterministic scheduling at the application layer. The network, carrying the communication those libraries produce, is operating as if the application layer doesn't exist.


Part Six: Predictive Traffic Engineering, a Hypothesis

The hypothesis is this: AI training and inference frameworks will eventually expose their communication schedules to the network, and the network will use those schedules to perform prospective traffic engineering, routing traffic around predicted congestion points before the traffic that would cause that congestion has been generated.

This is not incremental improvement to existing protocols. It is a different model of what a network is.

What "Predictive" Actually Means

Predictive traffic engineering in this context means the following specific claim: given the communication schedule derived from the execution graph, the network can compute, for each future time step T+n, the expected utilization of every link in the fabric, and use that computation to pre-position flow routing decisions, pre-configure switch forwarding entries, and preemptively drain queues of lower-priority traffic, all before the traffic that would cause congestion has been generated.

The scale of prediction matters. This paper suggests three distinct time horizons, each with different implementation requirements:

T+1 steps ahead (millisecond scale, within a single iteration): The communication library knows that an all-reduce operation will begin in roughly 50ms, when the current forward pass completes. The network can use those 50ms to pre-drain competing flows from links that the all-reduce will need, reduce ECN thresholds on those links in anticipation of the burst, and pre-warm alternative path entries in switch FIBs so that flows can be rapidly redistributed if the predicted pattern diverges from actual traffic.

T+5 steps ahead (second scale, across multiple iterations): Training runs are periodic. The communication pattern in iteration N+1 is identical to iteration N. After a handful of iterations, the network has a perfect model of the communication schedule. It can begin reserving bandwidth for the upcoming all-reduce phase before the forward pass of the current iteration even begins, distributing lower-priority traffic to alternative paths while the reservation holds.

T+10 steps ahead (tens of seconds, across pipeline stages and KV cache events): In a long-context inference serving deployment running speculative decoding, the KV cache placement decisions made during prompt processing determine the communication pattern for the subsequent generation phase. The network can receive KV cache placement signals from the inference scheduler and pre-position the network paths that will carry KV cache transfers before the generation request queue reaches those entries.
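The core computation behind all three horizons is the same: turn a time-indexed flow schedule into a per-link load forecast. A minimal sketch (names, shapes, and the schedule format are illustrative, not a real controller API):

```python
def forecast_link_load(schedule, flow_to_links, horizon_steps: int):
    """Given a flow schedule [(step, flow_id, gbps), ...] and the set
    of links each flow traverses, compute the expected load on every
    link at every future step within the horizon."""
    load = {}  # (step, link) -> gbps
    for step, flow_id, gbps in schedule:
        if step >= horizon_steps:
            continue
        for link in flow_to_links[flow_id]:
            load[(step, link)] = load.get((step, link), 0.0) + gbps
    return load

# Two all-reduce flows sharing a ToR-to-spine link at step 0, a KV
# cache transfer on a different link at step 1 (all names illustrative).
schedule = [(0, "ar0", 200.0), (0, "ar1", 200.0), (1, "kv0", 100.0)]
links = {"ar0": ["tor1-spine1"], "ar1": ["tor1-spine1"], "kv0": ["tor2-spine1"]}

load = forecast_link_load(schedule, links, horizon_steps=2)
print(load[(0, "tor1-spine1")])  # 400.0: a predicted hot link, known in advance
```

A controller holding this forecast can move one of the two colliding flows to an alternative path before step 0 begins; a reactive network discovers the collision only from the queue that forms.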

KV Cache Traffic as a Demand Oracle

KV cache placement is one of the most interesting traffic engineering opportunities in inference infrastructure, and one of the most underexplored.

During inference serving, the KV cache for a given sequence accumulates across all attention layers as tokens are generated. For long-context models, the KV cache for a single sequence can occupy tens of gigabytes. In a disaggregated prefill-decode architecture (where the prefill phase, which processes the prompt, runs on different hardware from the decode phase, which generates tokens), the KV cache must be transferred from prefill hardware to decode hardware. This transfer happens once, is large, and occurs at a predictable point in the request lifecycle.

An inference scheduler that manages KV cache placement knows, at the moment a prefill request is dispatched, which decode instance will eventually receive that sequence's KV cache. It knows the approximate size of that cache (based on the prompt length and model architecture). It knows when the prefill will complete (based on prompt length and prefill hardware performance). It knows the network path between the prefill hardware and the decode hardware.

This is a traffic demand forecast with concrete parameters: a specific source, a specific destination, a specific data volume, and a specific time window. A predictive traffic engineering system would consume this forecast and pre-clear the network path for that KV cache transfer, ensuring that when the transfer begins, it finds available bandwidth rather than competing with the ongoing all-reduce operations from the training jobs sharing the same fabric.
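Every parameter of that forecast is computable from information the scheduler already holds. A sketch, with the model shape and throughput numbers as illustrative assumptions:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence: K and V tensors
    per layer, per KV head, fp16/bf16 elements by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def transfer_forecast(seq_len: int, prefill_tokens_per_s: float,
                      gbps_available: float, **shape):
    """Demand-forecast record: transfer size, when prefill completes
    (and the transfer starts), and how long the transfer occupies the
    path at the given bandwidth."""
    size = kv_cache_bytes(seq_len, **shape)
    return {
        "bytes": size,
        "start_s": seq_len / prefill_tokens_per_s,
        "duration_s": size * 8 / (gbps_available * 1e9),
    }

# 128K-token context on an illustrative 80-layer, 8-KV-head,
# 128-dim-per-head model, prefill and path numbers also illustrative.
fc = transfer_forecast(131_072, prefill_tokens_per_s=20_000,
                       gbps_available=100,
                       n_layers=80, n_kv_heads=8, head_dim=128)
print(fc["bytes"] / 1e9)       # ~43 GB of KV cache to move
print(round(fc["duration_s"], 2))  # several seconds on a 100Gbps path
```

The point is not the specific numbers but that source, destination, volume, start time, and duration are all derivable before the transfer exists, which is precisely what a reservation system needs.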

The current state: the KV cache transfer competes for bandwidth with everything else on the fabric, protected only by whatever DSCP marking the inference system applies, handled by whatever QoS policy someone configured on the switches weeks ago. The predictive state: the inference scheduler publishes a traffic demand forecast to the network controller, which pre-routes accordingly.

The Training Pipeline as a Clock

The observation that unlocks T+5 and T+10 step prediction for training workloads is simple: training iterations are periodic. The communication pattern of iteration N+1 is identical to iteration N, modulo any dynamic batching decisions. The periodicity is not approximate. It is exact, bounded only by the variance in backward pass timing introduced by load imbalance.

This means that after iteration 1 completes, the network has a perfect empirical model of the iteration's communication schedule. It has observed every flow, measured every link utilization peak, identified every congestion event, and has the full measurement available before iteration 2 begins.

The traffic engineering insight: use iteration N's measured communication schedule to pre-engineer the network for iteration N+1. This is not machine learning or prediction in any probabilistic sense. It is deterministic replay of a known schedule. The only uncertainty is whether anything in the cluster changes between iterations (a failed link, a GPU that falls behind), and for those events, fallback reactive mechanisms (PFC, DCQCN) continue to provide protection.
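A sketch of the replay idea: take iteration N's measured per-link peaks and emit bandwidth reservations for iteration N+1. The hot-link threshold and record format are illustrative:

```python
def reservations_from_replay(observed, capacity_gbps: float = 400.0):
    """Deterministic replay: from iteration N's measurements
    [(link, phase_start_ms, peak_gbps), ...], emit reservations for
    iteration N+1 on every link that ran hot. Threshold illustrative."""
    plan = []
    for link, start_ms, peak_gbps in observed:
        if peak_gbps > 0.8 * capacity_gbps:   # link ran hot last iteration
            plan.append({
                "link": link,
                "reserve_gbps": peak_gbps,
                "at_ms": start_ms,            # same phase offset next iteration
                "evict_best_effort": True,    # move low-priority traffic off first
            })
    return plan

# One spine link peaked near line rate during the all-reduce phase;
# another stayed cold. Only the hot link gets a reservation.
observed = [("spine3-pod7", 120.0, 390.0), ("spine1-pod2", 120.0, 80.0)]
plan = reservations_from_replay(observed)
print(len(plan))          # 1
print(plan[0]["link"])    # spine3-pod7
```

No probability enters anywhere: the plan is a function of last iteration's measurements, and PFC/DCQCN remain as the fallback for the cases where the replay assumption breaks.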

The same logic applies to the communication schedule within a pipeline parallel training run. Stage N of the pipeline sends activations to Stage N+1 at a predictable time relative to the start of each microbatch. The network can schedule these flows with guaranteed bandwidth reservations rather than best-effort delivery, because the schedule is known.


Part Seven: Why Existing Protocols Cannot Implement This

The reason this predictive architecture does not exist today is not engineering difficulty. The engineering challenges are real but tractable. The reason is that the protocols that run data center networks, BGP and OSPF, were designed for a different problem and cannot be extended to solve this one.

BGP's Fundamental Limitations

BGP (Border Gateway Protocol) is a path vector protocol that disseminates reachability information. Its job is to answer the question: "Is this destination reachable, and via which autonomous system?" It is not designed to answer: "What is the instantaneous utilization of every link on the path to this destination?" It has no mechanism for expressing time-varying bandwidth reservations. It has no concept of flow-level routing. It operates at the granularity of prefixes, not flows, and its convergence time (the time between a topology change and the propagation of updated routes to all routers) is measured in seconds.

Traffic engineering extensions to BGP exist. BGP-TE (Traffic Engineering), BGP-LU (Labeled Unicast), SR-TE (Segment Routing Traffic Engineering) all extend BGP's capabilities in the direction of traffic engineering. They allow operators to express path preferences, bandwidth constraints, and flow steering policies. They are substantially more powerful than vanilla BGP.

They are still not sufficient for what predictive traffic engineering requires. SR-TE can steer a specific flow down a pre-configured path. What it cannot do is receive a time-indexed traffic demand forecast from an AI framework, compute the optimal routing assignment for a thousand concurrent flows across the fabric for each of the next ten iterations, and install those routing decisions in the forwarding plane before the traffic is generated. SR-TE is a tool for network engineers to configure traffic policies. Predictive traffic engineering requires the AI framework to drive the network configuration programmatically, in real time, at the scale of individual flow routing decisions.

OSPF, as an IGP, has similar limitations. OSPF TE extensions allow links to advertise bandwidth availability, and RSVP-TE can be used to signal bandwidth reservations. But OSPF convergence takes seconds. A training iteration takes tens of milliseconds. The protocol convergence time is three orders of magnitude too slow to be useful for intra-iteration traffic engineering.

The fundamental mismatch: BGP and OSPF operate on timescales of seconds to minutes, with message-based control planes that were designed for human-operated networks where configuration changes happen occasionally. Predictive traffic engineering for AI workloads requires a control plane that can make and install routing decisions on timescales of single-digit milliseconds, driven by programmatic input from the AI framework, at the granularity of individual flows.

What a Purpose-Built Protocol Needs

A protocol capable of implementing predictive traffic engineering for AI workloads needs several capabilities that no existing protocol has in combination:

Demand signaling: The AI framework must be able to express, to the network controller, a demand forecast: "In approximately T milliseconds, a flow of approximately B bytes will begin from source S to destination D, and will last approximately L milliseconds." This is a structured, time-indexed, quantitative signal that is categorically different from anything BGP or OSPF can carry.

Fabric-level topology visibility: The network controller must maintain a real-time model of link utilization across the entire fabric, at sub-second granularity. This requires telemetry from every switch in the fabric, aggregated and processed fast enough to be useful for millisecond-timescale routing decisions. INT (In-band Network Telemetry) and streaming telemetry via QUIC/gNMI provide the data collection mechanism. The processing pipeline must be fast enough to consume it.

Flow-level path programming: The network must be able to route individual flows to specific paths based on demand signals, not ECMP hash buckets. This requires either flow-level forwarding entries in switch FIBs (expensive, does not scale to millions of concurrent flows) or programmable tunnel headers that steer flows through pre-computed paths without per-flow state in every switch. Segment Routing (SR) provides the mechanism: a source can encode the entire path in the packet header as a stack of segment identifiers, so intermediate switches simply pop the next SID and forward without maintaining flow state.
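The stateless-forwarding property of segment routing can be illustrated with a small sketch. This models the SR-MPLS-style pop behavior described above (the SRv6 variant uses a Segments Left pointer rather than popping, but the principle is the same); all names are illustrative:

```python
def encode_segment_stack(path: list[int]) -> list[int]:
    """The source encodes a pre-computed path as a stack of segment
    IDs in hop order. All routing state travels in the packet header;
    intermediate switches keep no per-flow entries."""
    return list(path)

def forward_one_hop(stack: list[int]) -> tuple[int, list[int]]:
    """Model an intermediate switch: pop the top SID to learn the
    next hop, forward the packet with the remaining stack."""
    next_sid, remaining = stack[0], stack[1:]
    return next_sid, remaining
```

A demand-driven controller would only need to compute the SID stack once per flow and hand it to the source host; the fabric's FIBs never see the flow.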

Admission control: When the demand forecast exceeds available capacity, the system needs a mechanism to reject or defer new flows, or to reduce the priority of existing flows to make room. This is admission control, and it requires the network to have a model of committed bandwidth on every link and to enforce that model when accepting new reservations.

Fallback and graceful degradation: The predicted schedule will sometimes be wrong. A GPU completes its forward pass faster than expected, triggering the all-reduce early. A link fails, invalidating the pre-computed path. The predictive system must degrade gracefully to reactive behavior (DCQCN, PFC) when predictions fail, without making the failure worse.

None of these requirements are individually exotic. Datacenter SDN controllers (like ONOS, OpenDaylight, and various proprietary implementations) address subsets of them. The novelty is the requirement that the AI framework be the entity driving demand signals, and that the network respond on the same timescale as the AI framework's execution.


Part Eight: SONiC and the Open Stack That Makes This Possible

The reason this architectural shift is plausible, rather than purely theoretical, is that the network operating system layer has been disaggregated and open-sourced. SONiC (Software for Open Networking in the Cloud), developed initially at Microsoft and now a Linux Foundation project with contributions from Google, Alibaba, Dell, Arista, and others, provides an open-source NOS that runs on standard merchant silicon (Broadcom, Marvell, Mellanox) and exposes a programmable control plane via a rich set of APIs.

SONiC's architecture is relevant here for several reasons:

The SAI (Switch Abstraction Interface) layer provides a vendor-neutral API for programming switch forwarding tables, ACLs, QoS configurations, and buffer management parameters. A traffic engineering controller can use SAI to install flow-level forwarding entries, modify ECN thresholds dynamically, and configure buffer allocations per-flow-class without vendor-specific CLI commands. This is the programmatic access layer that makes predictive traffic engineering implementable in software.

The CoPP (Control Plane Policing) and buffer management APIs allow dynamic adjustment of DCQCN parameters at runtime. A predictive controller could, upon receiving a demand signal for an upcoming all-reduce, lower the ECN marking threshold on the links that all-reduce will traverse (to start signaling congestion earlier, before queue buildup), and raise it on links that the all-reduce will not use (to avoid false-positive congestion signals on flows that have plenty of bandwidth).
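The ECN retuning logic described above is simple enough to sketch. This is hypothetical controller pseudocode in Python: set_ecn_min_threshold stands in for the real SONiC/SAI WRED-profile programming path, and the threshold values are illustrative, not recommendations:

```python
# Illustrative thresholds; real values depend on link speed and buffer depth.
AGGRESSIVE_KB = 100   # mark early on links the all-reduce will traverse
RELAXED_KB = 1000     # avoid false-positive marking elsewhere

def retune_ecn(all_links: set[str], allreduce_links: set[str],
               set_ecn_min_threshold) -> dict[str, int]:
    """On receiving a demand signal for an upcoming all-reduce, lower
    the ECN marking threshold on the links it will use and raise it on
    the rest. set_ecn_min_threshold(link, kb) is an injected stand-in
    for the SAI buffer/WRED programming call."""
    plan = {}
    for link in all_links:
        kb = AGGRESSIVE_KB if link in allreduce_links else RELAXED_KB
        set_ecn_min_threshold(link, kb)
        plan[link] = kb
    return plan
```

The interesting property is that the thresholds become a function of the forecast rather than a static configuration: they move with the workload, iteration by iteration.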

SONiC's Redis-backed APP_DB/ASIC_DB architecture provides an event-driven coordination mechanism between the control plane (the Python/C++ control applications) and the dataplane (the SAI adapter to the switch ASIC). A demand signal from a PyTorch training job, received over QUIC by a SONiC application, can translate to a SAI route entry modification and be installed in the ASIC forwarding table in milliseconds.

P4 and programmable data planes: SONiC running on a P4-programmable switching ASIC (Tofino, or the emerging generation of programmable Broadcom ASICs) provides the data plane flexibility to implement INT, custom congestion control mechanisms, and flow classification logic without modifying switch vendor software. A predictive traffic engineering system could run a P4 program that tags packets with flow identifiers derived from RDMA QP identifiers, tracks per-flow queue occupancy, and feeds that telemetry to the control plane in real time.

The combination of an open, programmable NOS, a rich programmatic API for forwarding table manipulation, and the ability to run custom data plane programs is the infrastructure substrate that makes the predictive traffic engineering hypothesis implementable. What does not yet exist is the integration layer between the AI framework and the network control plane.


Part Nine: The Integration Layer That Does Not Yet Exist

The missing piece is not hardware. It is not the NOS. It is the protocol and the software stack that allows a distributed AI framework to expose its communication schedule to the network and allows the network to act on it.

This integration layer needs to solve several problems simultaneously:

The Communication Schedule Export Problem

PyTorch and JAX, the dominant training frameworks, generate communication schedules through their collective communication libraries (NCCL, Gloo, XLA). These schedules currently exist only within the training process. They are expressed in terms of device IDs (GPU 0, GPU 1, etc.) and operation types (all-reduce, all-gather, reduce-scatter), not in terms of network addresses and bandwidth demands.

A first step is translating these schedules into network-relevant terms: device ID 0 corresponds to IP address 10.1.1.1, a ring-allreduce on N devices with chunk size C will produce flows of approximately B = C * 2 * (N-1) / N bytes each on specific source-destination pairs, starting at approximately time T after the forward pass completes.
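The per-link volume formula above is pure arithmetic and can be checked directly. A ring all-reduce is a reduce-scatter followed by an all-gather, each moving (N-1)/N of the data over every ring link:

```python
def ring_allreduce_bytes_per_link(chunk_bytes: int, n_devices: int) -> float:
    """Per-link traffic for a ring all-reduce of chunk_bytes over
    n_devices: reduce-scatter and all-gather each send (n-1)/n of the
    data, so each ring link carries 2 * (n-1)/n * chunk_bytes."""
    return 2 * (n_devices - 1) / n_devices * chunk_bytes
```

For example, a 1 GiB gradient reduced over 8 GPUs puts roughly 1.75 GiB on each ring link, and the figure approaches 2x the chunk size as N grows, which is exactly the kind of volume a demand forecast would carry.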

NCCL already does a version of this translation internally when it selects the ring topology and computes the communication schedule. Exposing this information via an API is an engineering problem, not a research problem. The NCCL team could expose an "upcoming communication events" signal via a callback or a shared memory region that a local network agent reads.

The Demand Aggregation and Conflict Resolution Problem

In a cluster where training and inference jobs are collocated (as is increasingly common in large deployments), the network controller receives demand signals from multiple sources simultaneously. A training job's all-reduce schedule may conflict with an inference job's KV cache transfer. The controller must resolve these conflicts, applying admission control and prioritization, before installing forwarding decisions.

This is fundamentally a scheduling problem, closely analogous to the CPU scheduling problems that operating systems have solved for decades, applied to network bandwidth. The difference is that network bandwidth is not fungible across paths: 10 Gbps of capacity on path A does not substitute for 10 Gbps of capacity on path B if both jobs need path A. The conflict resolution algorithm must be path-aware.

A practical approach: the demand signals include priority, elasticity (can this flow be deferred without unacceptable cost?), and deadline (the training iteration waits for the all-reduce to complete before proceeding; the deadline is the start of the next forward pass). The controller runs a flow admission control algorithm that maximizes total weighted throughput subject to link capacity constraints, expressed as an integer program over the network graph. For reasonable fabric sizes (hundreds of links, thousands of concurrent flows), this can be solved in milliseconds by a controller with adequate compute.
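The full formulation above is an integer program, but the path-awareness at its core can be shown with a much simpler greedy sketch (a simplification, not the optimization the paper describes; all names are illustrative):

```python
def admit(flows, link_capacity_gbps):
    """Greedy path-aware admission control. Consider flows in
    descending priority; admit a flow only if every link on its path
    still has enough residual capacity, and debit those links on
    admission. flows: list of (flow_id, priority, gbps, path) where
    path is a list of link names. Returns (admitted_ids, deferred_ids)."""
    residual = dict(link_capacity_gbps)
    admitted, deferred = [], []
    for flow_id, _prio, gbps, path in sorted(flows, key=lambda f: -f[1]):
        if all(residual[link] >= gbps for link in path):
            for link in path:
                residual[link] -= gbps
            admitted.append(flow_id)
        else:
            deferred.append(flow_id)
    return admitted, deferred
```

Deferred elastic flows would be retried at the next scheduling epoch; inelastic deferred flows fall back to the reactive mechanisms (DCQCN, PFC) rather than being dropped.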

The Signal Latency Problem

For T+1 prediction to be useful, the demand signal must travel from the AI framework to the network controller, be processed, and result in forwarding table updates installed in the switch ASIC, all within the time budget before the predicted traffic burst begins. If the all-reduce starts in 50ms, the entire control loop must complete in less than 50ms.

This is achievable with modern software stacks. QUIC over a local network connection has single-digit millisecond latency. SONiC's SAI programming path from application to ASIC takes single-digit milliseconds for bulk route updates. A well-implemented demand signal processing pipeline can complete the full control loop in under 20ms, leaving 30ms of margin before the predicted event.

For T+5 and T+10 prediction, the signal latency requirement is relaxed by definition: you're acting on predictions far enough ahead that the control loop latency is negligible relative to the prediction horizon.

The State Synchronization Problem

The network controller must maintain a consistent model of current link utilization, committed bandwidth reservations, and pending demand signals, all updated faster than the AI framework produces new events. At the scale of 10,000 GPUs with 400Gbps per port, the state machine is large.

Streaming telemetry (gNMI SUBSCRIBE with STREAM mode) from switches provides per-port utilization at configurable granularity (as fast as sub-second in modern implementations). A controller cluster (not a single controller) aggregates this telemetry, runs the demand signal processing, and distributes forwarding decisions back to the switches. The controller cluster itself must be highly available and partitionable across the fabric so that controller failures do not halt the training job.

This is a distributed systems problem more than a networking problem, and it is the problem that ONOS, OpenDaylight, and their successors were designed to address. The novelty in the AI fabric context is the demand signal interface and the sub-second control loop requirement.


Part Ten: Toward a Formal Protocol Specification

The integration layer described above would benefit from a formal protocol rather than a collection of bespoke API calls. A formal protocol provides interoperability between AI framework implementations and network controller implementations, a published specification that accelerates ecosystem adoption, and a clear extension mechanism for capabilities not anticipated in the initial design.

The protocol this paper hypothesizes, which does not yet exist and which this paper proposes calling the Network Demand Signal Protocol (NDSP) for purposes of discussion, would operate as follows:

Topology registration: When a training or inference job begins, the workload manager (Kubernetes, Slurm, or a proprietary job scheduler) registers a topology map with the NDSP controller: which device ID corresponds to which network address, which communication library is in use, which collective algorithms are configured. This establishes the mapping between compute-layer abstractions and network-layer addresses.

Demand signal messages: As the training job executes, it emits demand signal messages containing: the predicted start time (as an offset from current time), the source and destination (as device IDs translated to network addresses), the predicted data volume (bytes), the predicted duration, the priority class (is this blocking the next iteration, or is it background traffic?), and the elasticity flag (can this flow be deferred if bandwidth is unavailable?).
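The fields listed above can be collected into a message schema. This is a hypothetical sketch of the NDSP demand signal (the paper proposes Protocol Buffers over QUIC as the wire format; the field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class DemandSignal:
    """Hypothetical NDSP demand signal message, mirroring the fields
    the text enumerates. A real implementation would define this as a
    Protocol Buffer message carried over QUIC."""
    start_offset_ms: float  # predicted start time, offset from now
    src: str                # source network address (device ID resolved)
    dst: str                # destination network address
    volume_bytes: int       # predicted data volume
    duration_ms: float      # predicted flow duration
    priority: int           # higher = blocks the next iteration
    elastic: bool           # True if the flow can be deferred
```

The reservation response would echo a subset of these fields plus the assigned SR segment stack, closing the request/acknowledge loop.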

Reservation responses: The NDSP controller responds with an acknowledgment that includes the accepted bandwidth reservation, the assigned path (expressed as an SR segment ID stack), and, when the requested bandwidth was not fully available, an advisory that a fallback mechanism will be used.

Telemetry feedback: The NDSP controller continuously emits fabric state summaries back to registered workloads: current link utilizations, recent congestion events, path availability. This allows the AI framework to make informed decisions about batch sizing and communication schedule adjustments, closing the control loop from network to application as well as from application to network.

Path deviation signaling: When actual traffic deviates from the predicted schedule (a GPU completes early, a link fails), both the workload and the network controller emit deviation signals that trigger rapid rescheduling. This is the fallback mechanism.

The protocol would run over QUIC, with Protocol Buffer-defined message schemas. The controller implementation would be a SONiC application, running in SONiC's application container and accessing the SAI layer via the existing SONiC infrastructure. Reference implementations would be provided as NCCL plugins (for PyTorch training) and as vLLM or SGLang extensions (for inference serving).


Part Eleven: The Implications

For Hardware

If this architecture becomes standard, the requirements on switch hardware shift. The ability to install flow-level forwarding entries at high rate (the controller is installing reservations in real time) becomes more important than the ability to maintain large routing tables for internet-scale BGP deployments. Programmable ASICs become more attractive relative to traditional fixed-function silicon. The ability to run INT at line rate, to generate precise queue occupancy telemetry at microsecond granularity, becomes a baseline requirement rather than a premium feature.

SmartNICs and DPUs (Data Processing Units, as NVIDIA calls them) become the natural home for the per-host NDSP agent: a processor capable of running the demand signal emission logic, translating device IDs to network addresses, and communicating with the fabric controller, all without consuming CPU cycles on the GPU host.

For Software

The collective communication libraries (NCCL, OpenMPI, Gloo) will need to expose their scheduling information via NDSP-compatible APIs. This is a small change relative to their existing complexity: they already compute communication schedules internally. Exposing them via a callback interface is days of engineering work once the NDSP specification exists.

The training framework (PyTorch, JAX) will need to integrate with the NDSP agent for demand signal emission. This is the point of integration that requires the most careful design: the demand signals must be emitted early enough to be useful, but the framework cannot block on network confirmation before proceeding with computation. The signal emission must be non-blocking, with the network providing best-effort accommodation rather than synchronous reservation confirmation.
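The non-blocking constraint can be sketched with a producer/consumer pattern: the training loop enqueues signals and proceeds immediately, while a background thread ships them to the controller. The DemandEmitter name is hypothetical, and send_fn stands in for the real QUIC transport:

```python
import queue
import threading

class DemandEmitter:
    """Non-blocking demand signal emission. emit() never blocks the
    training step; a daemon thread drains the queue and delivers each
    signal via send_fn (a stand-in for the QUIC client)."""
    def __init__(self, send_fn):
        self._q = queue.Queue()
        self._send = send_fn
        threading.Thread(target=self._drain, daemon=True).start()

    def emit(self, signal) -> None:
        self._q.put(signal)  # returns immediately; framework proceeds

    def flush(self) -> None:
        self._q.join()       # wait for in-flight signals (tests/shutdown)

    def _drain(self) -> None:
        while True:
            signal = self._q.get()
            self._send(signal)
            self._q.task_done()
```

If the network cannot honor a signal in time, the flow simply runs under the existing reactive mechanisms; the emitter never gates computation on a reservation confirmation.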

For Operations

The operational model changes fundamentally. Instead of network engineers configuring traffic policies based on their knowledge of the workloads, the workloads configure the network directly, and network engineers configure the policies that govern how workloads interact with each other. This is a shift from imperative network configuration to declarative policy, analogous to the shift from manually writing kernel driver code to writing device drivers against a standard DDK.

The network operations team's job becomes: define the priority policies between training jobs, inference serving, and other traffic classes; set the bandwidth guarantees and admission control parameters; monitor the health of the demand signal processing pipeline; and respond when predictions diverge significantly from actuals.

The Groq Insight Applied at Scale

Return to what Groq is doing within a chassis. Their compiler pre-computes the inter-chip communication schedule to individual clock cycles. The network carries those communications on a plesiochronous fabric where the timing of every packet is known before it is sent. Congestion cannot occur because the schedule prevents it.

The NDSP hypothesis is the extension of this principle beyond the chassis. The AI framework compiler produces not just a compute schedule but a network schedule. The compute schedule runs on GPUs. The network schedule runs on SONiC switches programmed by the NDSP controller. The two schedules are derived from the same execution graph and are consistent by construction.

The difference in scale is significant: Groq's system operates on a chip-to-chip interconnect with nanosecond timing precision. The NDSP system operates on an Ethernet fabric with millisecond timing precision. The precision requirement is relaxed by several orders of magnitude, which is why the hypothesis is achievable with current technology rather than requiring new physics.

But the conceptual step is identical: from a reactive network that responds to traffic after it arrives, to a scheduled network that accommodates traffic before it is generated, because the schedule is known.


The Actual Hypothesis, Stated Precisely

The technology exists today to implement the following system. It has not been implemented because the organizational boundary between "AI infrastructure team" and "network infrastructure team" does not currently allow the AI framework to drive the network control plane. The hypothesis is that this organizational boundary will erode as the performance penalty for not crossing it becomes impossible to ignore.

By the time AI training clusters reach the scale of 500,000 to 1,000,000 GPUs, the congestion losses on a purely reactive fabric will be large enough to represent hundreds of millions of dollars of wasted compute annually. At that cost, the engineering investment to implement a predictive traffic engineering system with NDSP-style integration becomes trivially justified.

The firms most likely to implement this first are the ones that control both the AI framework stack and the network infrastructure stack end-to-end. Google, with XLA, TPU pods, and a proprietary network fabric that already runs Swift. Meta, with PyTorch, NCCL, and the infrastructure to deploy custom SONiC applications at scale. Microsoft, with Azure's AI infrastructure and deep expertise in SONiC (they wrote it). Amazon, with Trainium, Annapurna-designed EFAs, and the SDN infrastructure.

Open source implementations will follow, driven by the same economics that led SONiC to be open-sourced in the first place: a shared infrastructure problem that no single company benefits from solving in proprietary isolation.

The future of AI infrastructure networking is not a faster version of BGP. It is the collapse of the boundary between the execution graph and the network graph, until the network is not a separate system that the AI framework communicates with, but a schedulable resource that the AI framework controls the same way it controls GPU memory allocation and tensor placement: precisely, deterministically, and ahead of demand.

The reactive era of AI networking is ending. The question is only who builds the replacement first.


This paper describes a hypothesis, not an implemented system. The author welcomes challenges to the underlying assumptions and is particularly interested in hearing from teams who have attempted variants of demand-signal-driven fabric management and run into the walls that made them give up. Reach out and comment on https://www.linkedin.com/in/nvscottmorrison/.