OSPF: The IGP That Scales Until It Doesn't

Scott Morrison, November 15, 2025
Tags: OSPF, interior gateway protocol, link-state routing, network routing, OSPF areas, stub areas, SPF algorithm, routing scalability, IGP, network design
Open Shortest Path First promises fast convergence and hierarchical scalability, but delivers area design nightmares, CPU-intensive SPF recalculations, and the routing flexibility of a brick. It's better than RIP and it's the standard everyone knows, so we're stuck with a protocol that works fine for 50 routers and struggles beyond that.

Open Shortest Path First is the Interior Gateway Protocol that network engineers love to hate. It's infinitely better than RIP (which is a low bar: RIP literally counts to infinity when things break), but it's also a protocol that punishes you for growth. OSPF was designed with elegant mathematics, link-state algorithms that converge quickly, and a hierarchical area structure that should scale beautifully. In practice, it's a protocol where adding routers means recalculating your entire area design, where one misconfigured router can bring down your network, and where the word "stub" appears in configuration so often you start to wonder if it's describing the protocol or your career choices. OSPF works, technically, but it's the network protocol equivalent of a sports car that needs constant tuning and breaks if you look at it wrong.

Let's explore how OSPF actually works, why its design seemed like a good idea in 1989, and why network engineers spend so much time fighting its limitations.

The Link-State Revolution

Before OSPF, most networks ran distance-vector protocols like RIP. Distance-vector protocols are simple: routers share how far away destinations are (hop count), and everyone picks the shortest distance. This simplicity comes with severe problems:

Slow Convergence: When topology changes, updates propagate slowly from router to router. During convergence, routing loops can form.

Count to Infinity: When a link fails, routers can get stuck incrementing hop counts forever (well, to 16, which RIP defines as infinity).

Limited Metrics: RIP uses only hop count. A 10 Gbps link and a 56k modem look identical if they're both one hop.

No Load Balancing: You can't split traffic across equal-cost paths effectively.

Link-state protocols like OSPF (1989) promised to solve these problems with a fundamentally different approach.
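To make the count-to-infinity problem above concrete, here is a minimal sketch of two RIP-style routers bouncing a dead route between them. The two-router setup, the function name, and the lack of split horizon are all assumptions chosen to show the failure mode, not a model of any real RIP implementation.

```python
INFINITY = 16  # RIP treats a metric of 16 as unreachable


def count_to_infinity(initial_b_metric=2):
    """Routers A and B both route to network N.

    A was directly attached to N; B reached N through A (metric 2).
    The A-N link fails. With no split horizon, each router keeps
    trusting the other's stale advertisement and adds one hop to it.
    Returns the (A metric, B metric) pair after every update round.
    """
    a, b = INFINITY, initial_b_metric  # A has just lost its direct route
    history = [(a, b)]
    while a < INFINITY or b < INFINITY:
        a = min(INFINITY, b + 1)  # A believes B still has a path
        b = min(INFINITY, a + 1)  # B's next hop is A, so it follows A
        history.append((a, b))
    return history


if __name__ == "__main__":
    for round_no, (a, b) in enumerate(count_to_infinity()):
        print(f"round {round_no}: A={a} B={b}")
```

With these assumptions it takes eight update rounds before both routers agree the network is gone, and each round is a full update interval in real life.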

How OSPF Actually Works

Instead of sharing distance to destinations, OSPF routers share information about the links they're connected to. Each router builds a complete map of the network topology, then runs Dijkstra's shortest-path-first algorithm to compute the best paths.

LSAs: The Building Blocks

OSPF's fundamental unit is the Link-State Advertisement (LSA). There are several types, each serving different purposes:

Type 1 (Router LSA): Every router generates one, describing its directly connected links and their costs. "I'm Router A, I have links to Router B (cost 10) and Router C (cost 20)."

Type 2 (Network LSA): Generated by the Designated Router on multi-access segments (Ethernet), describing which routers are on that segment.

Type 3 (Summary LSA): Generated by Area Border Routers, summarizing routes from other areas. "Area 0 has these networks available."

Type 4 (ASBR Summary LSA): Advertises how to reach an Autonomous System Boundary Router that's injecting external routes.

Type 5 (AS External LSA): External routes redistributed into OSPF from other protocols (BGP, static routes, etc.).

Type 7 (NSSA External LSA): Like Type 5 but for Not-So-Stubby Areas (yes, that's the real name).

Each LSA has a sequence number, age, and checksum. Routers flood LSAs throughout the area, ensuring everyone has identical topology databases.
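Those three header fields are what let a router decide whether an incoming LSA is newer than the copy it already holds. The sketch below follows the comparison order as I understand it from the OSPFv2 specification (sequence number, then checksum, then age); the class and function names are hypothetical, and real implementations handle sequence-number wraparound and acknowledgments that are left out here.

```python
from dataclasses import dataclass

MAX_AGE = 3600      # seconds; an LSA at MaxAge is being flushed
MAX_AGE_DIFF = 900  # seconds; ages closer than this are treated as equal


@dataclass(frozen=True)
class LsaHeader:
    ls_type: int
    link_state_id: str
    advertising_router: str
    sequence: int
    checksum: int
    age: int

    @property
    def key(self):
        # These three fields identify "the same LSA" across instances
        return (self.ls_type, self.link_state_id, self.advertising_router)


def is_newer(a, b):
    """True if instance a is more recent than instance b of the same LSA."""
    if a.sequence != b.sequence:
        return a.sequence > b.sequence
    if a.checksum != b.checksum:
        return a.checksum > b.checksum
    if (a.age == MAX_AGE) != (b.age == MAX_AGE):
        return a.age == MAX_AGE
    if abs(a.age - b.age) > MAX_AGE_DIFF:
        return a.age < b.age
    return False  # considered identical


def install(lsdb, lsa):
    """Store lsa if it is new or newer; return True if it should be flooded on."""
    current = lsdb.get(lsa.key)
    if current is None or is_newer(lsa, current):
        lsdb[lsa.key] = lsa
        return True
    return False
```

A router that receives an older copy simply drops it, which is what keeps flooding from looping forever.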

The SPF Calculation

Once a router has all the LSAs, it runs the Dijkstra algorithm:

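  1. Start with yourself as the root, at cost 0
  2. Examine all links of the node you just added, recording the total cost from the root to each neighbor
  3. Pick the candidate not yet in the tree with the lowest total cost from the root (not just the cheapest single link)
  4. Add that destination to the tree
  5. Repeat from step 2 until all destinations are processed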

The result is a shortest-path tree with your router at the root, showing the best path to every destination. This becomes the routing table.
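Here is a minimal sketch of that calculation using a priority queue. The topology reuses Router A's links from the Type 1 example (B at cost 10, C at cost 20); the B-C link at cost 5 is an assumption added so the tree has something to decide. Real implementations work from the LSA database rather than a dictionary, so treat this as an illustration only.

```python
import heapq


def spf(graph, root):
    """graph maps router -> {neighbor: link cost}.

    Returns (total cost to each router, first hop used to reach it).
    """
    dist = {root: 0}
    first_hop = {}
    in_tree = set()
    heap = [(0, root, None)]  # (total cost from root, router, first hop)
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in in_tree:
            continue  # stale entry; a cheaper path was already accepted
        in_tree.add(node)
        if hop is not None:
            first_hop[node] = hop
        for nbr, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if nbr not in in_tree and new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                # Neighbors of the root are their own first hop
                next_hop = nbr if node == root else hop
                heapq.heappush(heap, (new_cost, nbr, next_hop))
    return dist, first_hop


topology = {
    "A": {"B": 10, "C": 20},
    "B": {"A": 10, "C": 5},   # B-C cost is a made-up value
    "C": {"A": 20, "B": 5},
}

if __name__ == "__main__":
    print(spf(topology, "A"))
```

From A's point of view, C is reached through B at a total cost of 15, even though A has a direct link to C at cost 20.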

Hello Protocol and Neighbor Discovery

OSPF routers send Hello packets every 10 seconds (default) on each interface. These serve multiple purposes:

  • Discover neighbors
  • Elect Designated and Backup Designated Routers (on multi-access networks)
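  • Detect failures (if no Hello arrives within the dead interval, by default four Hello intervals or 40 seconds, the neighbor is declared dead)
  • Verify configuration compatibility (area ID, authentication, Hello and dead intervals, stub flag; MTU is checked later, during database exchange)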
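The compatibility check is easy to picture as a field-by-field comparison of what each side puts in its Hello. This is a rough sketch: the class and function names are hypothetical, the field list is limited to parameters I'm confident a Hello carries, and the mask check really only applies on multi-access segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HelloParams:
    area_id: str
    auth_type: int
    hello_interval: int = 10   # seconds
    dead_interval: int = 40    # seconds, four Hello intervals by default
    network_mask: str = "255.255.255.0"
    stub_flag: bool = False


def hello_mismatches(local, remote):
    """Return the names of fields that differ; an empty list means compatible."""
    fields = (
        "area_id",
        "auth_type",
        "hello_interval",
        "dead_interval",
        "network_mask",
        "stub_flag",
    )
    return [f for f in fields if getattr(local, f) != getattr(remote, f)]


if __name__ == "__main__":
    ours = HelloParams("0.0.0.0", auth_type=0)
    theirs = HelloParams("0.0.0.0", auth_type=0, hello_interval=5, dead_interval=20)
    print(hello_mismatches(ours, theirs))
```

Any non-empty result means the two routers ignore each other's Hellos and never get past the Down or Init state.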

Neighbors go through states: Down, Init, Two-Way, ExStart, Exchange, Loading, Full. The link-state databases are exchanged during Exchange and Loading; Full means the two databases are synchronized.

Designated Router: Reducing Chattiness

On an Ethernet segment with N routers, without optimization each router would need to form adjacencies with N-1 others, creating N(N-1)/2 adjacencies. That's a lot of LSA flooding.

OSPF's solution: elect a Designated Router (DR) and Backup DR (BDR). All other routers form adjacencies only with the DR and BDR. The DR is responsible for generating the Network LSA and forwarding LSAs on the segment.
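The savings are easy to check with arithmetic. In this small sketch (function names are mine), the DR/BDR count assumes every other router peers with both the DR and the BDR, plus one adjacency between the DR and BDR themselves.

```python
def full_mesh_adjacencies(n):
    """Adjacencies if every router on the segment peers with every other."""
    return n * (n - 1) // 2


def dr_bdr_adjacencies(n):
    """Adjacencies when the n-2 other routers peer only with the DR and BDR."""
    if n < 2:
        return 0
    return 2 * (n - 2) + 1  # +1 for the DR-BDR adjacency


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, full_mesh_adjacencies(n), dr_bdr_adjacencies(n))
```

At 10 routers that is 45 adjacencies versus 17; at 50 it is 1,225 versus 97.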

This reduces complexity but creates its own problems. DR election uses Router Priority (configurable, 0-255) and Router ID as tiebreaker. If your DR fails, the BDR takes over and a new BDR is elected. Sounds great, except:

  • DR election is not preemptive (if a better router joins, it doesn't automatically become DR)
  • If the wrong router becomes DR (low-spec device), you're stuck unless you reset OSPF
  • DR failure causes brief disruption even with BDR
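The non-preemptive behavior is the part that bites people, so here is a deliberately simplified sketch of it. It assumes priority 0 means "never eligible" and that router IDs compare numerically; it skips the BDR step of the real election entirely, and the function names are hypothetical.

```python
from ipaddress import IPv4Address


def rank(router):
    """Sort key: higher priority wins, higher router ID breaks ties."""
    priority, router_id = router
    return (priority, int(IPv4Address(router_id)))


def elect(candidates, current_dr=None):
    """candidates is a list of (priority, router_id) tuples.

    Returns the DR, or None if nobody is eligible.
    """
    eligible = [r for r in candidates if r[0] > 0]  # priority 0 never becomes DR
    if current_dr in eligible:
        return current_dr  # non-preemptive: a working DR keeps the role
    return max(eligible, key=rank) if eligible else None


if __name__ == "__main__":
    segment = [(1, "10.0.0.1"), (1, "10.0.0.2")]
    dr = elect(segment)
    segment.append((255, "10.0.0.9"))      # a "better" router joins later
    print(elect(segment, current_dr=dr))   # the original DR stays in place
```

The high-priority latecomer only wins if the election starts from scratch, which is why "reset OSPF" ends up being the fix.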

Cost Metrics: Better Than Hop Count, Still Problematic

OSPF uses link cost (default: 100 Mbps / bandwidth). So a 10 Mbps link has cost 10, a 100 Mbps link has cost 1. This is better than RIP's hop count but has issues:

Fast Link Problem: With the default formula, 100 Mbps, 1 Gbps, 10 Gbps, and 100 Gbps links all have cost 1 (formula assumes 100 Mbps reference). You must manually configure costs or adjust the reference bandwidth.

No Traffic Engineering: OSPF picks the lowest-cost path, period. If you want to load-balance or prefer certain paths for policy reasons, tough luck. You can manipulate costs, but that affects all traffic, not specific flows.

Equal-Cost Multipath (ECMP): OSPF can load-balance across equal-cost paths, which is nice. But if your paths aren't exactly equal cost, it won't use the second path at all, even if it's only slightly worse.

This is a fundamental difference from BGP. BGP lets you manipulate routing with dozens of attributes, communities, and policies. OSPF is "lowest cost wins, no negotiation."
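Going back to the fast link problem for a moment, the default cost formula is simple enough to sketch. This assumes integer division with a floor of 1 and a ceiling of 65535 (the metric is a 16-bit field); the function name and the 100 Gbps reference value in the second half are my choices for illustration.

```python
def ospf_cost(link_bps, reference_bps=100_000_000):
    """Cost = reference bandwidth / link bandwidth, clamped to 1..65535."""
    return max(1, min(65535, reference_bps // link_bps))


MBPS = 1_000_000
GBPS = 1_000_000_000

if __name__ == "__main__":
    # Default 100 Mbps reference: everything at or above 100 Mbps collapses to 1
    for bw in (10 * MBPS, 100 * MBPS, 1 * GBPS, 10 * GBPS, 100 * GBPS):
        print(bw, ospf_cost(bw))
    # Raising the reference to 100 Gbps makes fast links distinguishable again
    for bw in (100 * MBPS, 1 * GBPS, 10 * GBPS, 100 * GBPS):
        print(bw, ospf_cost(bw, reference_bps=100 * GBPS))
```

Whatever reference you pick has to be the same on every router, or two routers will disagree about what the same link costs.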

Areas: The Scalability Solution That Creates New Problems

Here's OSPF's dirty secret: it doesn't scale. The problem is the SPF calculation. Every time an LSA changes (link flap, cost change, router reboot), every router in the area must:

  1. Update its link-state database
  2. Recalculate SPF
  3. Update the routing table

With a large flat OSPF network (hundreds of routers in one area):

  • Any topology change triggers SPF on every router
  • SPF calculation is CPU-intensive (Dijkstra is O(N²) in a naive implementation, or roughly O(E log N) with a priority queue, for N routers and E links)
  • Frequent changes can cause "SPF churn" where routers are constantly recalculating
  • Large LSA databases consume memory
  • Flooding LSAs creates traffic

OSPF's solution is hierarchical design with areas.

The Area Hierarchy

Area 0 (Backbone): The special area that all other areas must connect to. Area 0 forms the routing backbone, interconnecting all other areas.

Regular Areas: Non-backbone areas (Area 1, Area 2, etc.). Routers in Area 1 only run SPF for Area 1 topology. They receive summary routes to other areas via ABRs.

Area Border Routers (ABRs): Routers with interfaces in multiple areas. They run separate SPF calculations for each area and generate Summary LSAs to share routes between areas.

Autonomous System Boundary Routers (ASBRs): Routers that redistribute external routes (from BGP, other protocols) into OSPF.

The hierarchy reduces SPF churn. A link flap in Area 1 only triggers a full SPF recalculation in Area 1, not the entire network. If the ABR summarizes Area 1's prefixes, Area 0 routers see only a stable summary route.
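The "stable summary" effect depends on the ABR actually summarizing. As a rough sketch, assume an ABR configured with one summary range that it advertises as long as at least one component prefix is alive; the function name and the prefixes are made up, and metric changes on the summary are ignored.

```python
import ipaddress


def abr_summary(area_prefixes, summary_range):
    """Return the summary to advertise, or None if no component prefix is up."""
    summary = ipaddress.ip_network(summary_range)
    active = [
        p for p in map(ipaddress.ip_network, area_prefixes)
        if p.subnet_of(summary)
    ]
    return summary if active else None


if __name__ == "__main__":
    before = abr_summary(["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24"], "10.1.0.0/22")
    # 10.1.1.0/24 flaps inside Area 1
    after = abr_summary(["10.1.0.0/24", "10.1.2.0/24"], "10.1.0.0/22")
    print(before, after, before == after)
```

From the backbone's side nothing changed, which is the whole point. Without a summary range, the ABR advertises each prefix individually and the flap leaks out.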

The Area Design Nightmare

This sounds great until you actually try to design it. The constraints:

All areas must connect to Area 0: If you have a branch site in Area 2, it must connect to an Area 0 router or use a virtual link (don't use virtual links, they're terrible and complicated). This creates hub-and-spoke topologies even when your physical topology isn't hub-and-spoke.

Area 0 must be contiguous: You can't split Area 0. If Area 0 becomes disconnected, OSPF breaks. You need careful design and possibly virtual links.

ABRs are complex: Running multiple SPF instances consumes CPU and memory. ABRs are often performance bottlenecks.

Renumbering is painful: Changing a router's area means disrupting service, reconfiguring interfaces, and potentially redesigning your addressing scheme.

Area boundaries are rigid: Unlike BGP where you can flexibly control route announcements, OSPF area boundaries are hard divisions. Moving functionality across boundaries is difficult.

In practice, many organizations end up with poor area designs:

  • Everything in Area 0 (defeats the purpose)
  • Overly complex multi-area designs that are hard to troubleshoot
  • Areas that don't match organizational or traffic patterns

OSPF's scalability solution requires expert network design, and most networks don't have that expertise.

Stub Areas: OSPF's Way of Saying "You Don't Need To Know Everything"

As if regular areas weren't complex enough, OSPF has multiple types of stub areas, each removing different LSA types to reduce overhead.

Stub Area

A stub area doesn't receive Type 5 (external) LSAs. The ABR injects a default route instead. This reduces LSA count and SPF complexity in the stub area.

Rationale: Branch offices don't need to know about every external route in the organization. Give them a default route and let the core handle specifics.

Limitations:

  • Can't have ASBRs (no external routes)
  • Still receives Type 3 Summary LSAs from other areas
  • Still needs to know internal topology

Totally Stubby Area

Originally a Cisco extension (now supported by most vendors), a totally stubby area doesn't receive Type 3, 4, or 5 LSAs from outside. The ABR injects a single default route (itself a Type 3 LSA) for everything outside the area.

This is maximum simplification: the stub area only knows its own topology and has a default route for everything else. Great for very simple branch sites.

Limitations:

  • Extremely limited routing knowledge
  • No path selection for external destinations
  • Cisco proprietary (initially, now supported elsewhere)

Not-So-Stubby Area (NSSA)

Someone realized stub areas are great, but what if you need to redistribute some external routes in a stub area? Enter NSSA.

NSSA allows an ASBR in a stub area. External routes are advertised as Type 7 LSAs within the NSSA, and the ABR converts them to Type 5 LSAs when advertising to other areas.

This solves a specific problem: branch site with an Internet connection (external routes) that still benefits from stub area simplification.

Limitations:

  • More complex configuration
  • Type 7 to Type 5 conversion has quirks
  • Yet another thing to understand and configure

Totally NSSA

Because we didn't have enough stub variants, there's also Totally NSSA (combining totally stubby with NSSA external route allowance). At this point, you need a flowchart to remember which LSA types appear where.
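In lieu of a flowchart, a lookup table does the job. This sketch encodes the LSA types described in the sections above; the dictionary and function names are mine, and the single default route that totally stubby and totally NSSA areas receive from the ABR is noted in comments rather than modeled.

```python
LSA_TYPES_BY_AREA = {
    "normal":         {1, 2, 3, 4, 5},
    "stub":           {1, 2, 3},     # externals replaced by a default route
    "totally_stubby": {1, 2},        # plus one default route from the ABR
    "nssa":           {1, 2, 3, 7},  # local externals travel as Type 7
    "totally_nssa":   {1, 2, 7},     # plus one default route from the ABR
}


def allows(area_type, lsa_type):
    """True if LSAs of lsa_type circulate inside an area of area_type."""
    return lsa_type in LSA_TYPES_BY_AREA[area_type]


if __name__ == "__main__":
    for area in LSA_TYPES_BY_AREA:
        print(f"{area:15} {sorted(LSA_TYPES_BY_AREA[area])}")
```

The "why isn't this route in my table?" question usually comes down to one lookup in this table.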

The Stub Area Problem

Stub areas reduce overhead, which is good. But they also:

  • Add configuration complexity
  • Require careful planning (can't casually add an ASBR to a stub area)
  • Reduce visibility (you don't know what's beyond the ABR)
  • Make troubleshooting harder (Why isn't this route in my table? Oh right, stub area.)

In practice, many networks avoid stub areas because the operational complexity outweighs the benefits, especially on modern routers with plenty of CPU and memory.

The Scalability Ceiling

Even with areas, OSPF hits scalability limits. The general guidelines:

  • 50 routers per area: Conservative, probably safe
  • 100+ routers per area: Possible with good hardware, but expect issues
  • 200+ routers per area: You're pushing it, SPF churn and convergence problems likely

These numbers seem tiny compared to BGP, which handles 950,000+ routes globally. Why does OSPF scale so poorly?

Full Topology Knowledge: Within an area, every router must have complete topology knowledge. BGP only needs best paths.

Frequent SPF: Any topology change triggers SPF recalculation. BGP has incremental updates.

CPU Intensive: Dijkstra is more expensive than BGP's simpler path vector operations.

Memory: Storing complete topology (all LSAs) for even medium networks consumes significant memory.

Chatty Protocol: OSPF constantly sends Hellos, LSAs, and acknowledgments. BGP is relatively quiet after initial convergence.

The Convergence Problem

OSPF is supposed to converge faster than distance-vector protocols, and it does, for small networks. In large networks, convergence has issues:

LSA Flooding: When topology changes, LSAs must flood through the area. On large, complex topologies, this takes time.

SPF Delay: Routers don't run SPF the moment an LSA arrives; they wait (the SPF holdtime) to batch changes. This prevents SPF thrashing but adds delay.

Multiple Changes: If multiple links fail simultaneously, routers may run SPF multiple times as they learn about each failure.

Blackholes: During convergence, temporary routing loops or blackholes can form. Packets are dropped until routing stabilizes.

In practice, OSPF convergence in large networks can take several seconds, even tens of seconds in pathological cases. This is better than RIP's minutes but worse than the subsecond convergence we'd like.
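The SPF delay mentioned above is often implemented as an exponential backoff: a short initial wait, then a hold time that doubles while changes keep arriving, up to a ceiling. The sketch below assumes that scheme; the function name and all the millisecond values are hypothetical, and actual timers and defaults vary by vendor.

```python
def spf_waits(initial_ms, hold_ms, max_ms, events):
    """Wait applied before each of `events` back-to-back SPF runs."""
    waits = []
    hold = hold_ms
    for i in range(events):
        if i == 0:
            waits.append(initial_ms)        # first change: react quickly
        else:
            waits.append(min(hold, max_ms))  # later changes: back off
            hold *= 2
    return waits


if __name__ == "__main__":
    print(spf_waits(initial_ms=50, hold_ms=200, max_ms=5000, events=7))
```

With these made-up numbers, a flapping link pushes the wait from 50 ms to the 5-second ceiling within seven events, which is where the "several seconds" of convergence time comes from.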

The Flexibility Problem: OSPF vs BGP

BGP is complex, but it's flexible. You can manipulate routing with:

  • Local preference
  • AS path prepending
  • Communities
  • MED attributes
  • Route maps
  • Complex policies

OSPF offers: cost. That's it. Well, technically you have:

  • Cost: Set link cost
  • Cost: Adjust reference bandwidth
  • Cost: Did we mention cost?

If you want to prefer certain paths for policy reasons (not pure shortest-path), you must hack it with cost manipulation. This affects all traffic, not specific flows. There's no "route this traffic via path A for policy reasons while using shortest path for everything else."

Want to control how routes are exported from OSPF to BGP? You'll need redistribution policies, which are separate from OSPF and often messy.

Want to do traffic engineering? You're better off using MPLS or segment routing on top of OSPF rather than trying to do it within OSPF.

This inflexibility is by design. OSPF is an IGP (Interior Gateway Protocol), meant for internal routing where shortest-path should be good enough. But real networks have requirements beyond shortest-path: security zones, traffic engineering, cost optimization, regulatory compliance. OSPF handles these poorly.

The Configuration Complexity

OSPF configuration seems simple at first:



! Start OSPF process 1 and place every interface in 10.0.0.0/8 into Area 0
router ospf 1
 network 10.0.0.0 0.255.255.255 area 0

But as your network grows, you need:

  • Area design
  • Stub area configuration
  • Summarization to reduce LSAs
  • Authentication (because someone will try to inject rogue LSAs)
  • Cost tuning
  • Virtual links (if your topology forces it)
  • Graceful restart configuration
  • BFD (Bidirectional Forwarding Detection) for fast failure detection
  • Route filtering at ABRs and ASBRs
  • Passive interfaces (to prevent OSPF on user-facing ports)

A properly configured enterprise OSPF network has hundreds of lines of configuration per router, and every router must be consistent in its area assignments and authentication.

Misconfiguration consequences:

  • Wrong area: Router can't form adjacencies, becomes isolated
  • Mismatched authentication: Neighbors won't form
  • Mismatched MTU: Neighbors get stuck in ExStart
  • Wrong wildcard mask in a network statement: Interfaces silently left out of OSPF
  • Cost misconfiguration: Suboptimal or asymmetric routing
  • Missing passive interface: Your users receive OSPF Hellos, someone might inject routes
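The wildcard mistake is worth a closer look because it fails silently. A wildcard mask is an inverted netmask: bits set to 1 are ignored when matching. This sketch (the function name is hypothetical) checks whether an interface address falls under a network statement like the one shown earlier.

```python
from ipaddress import IPv4Address


def network_statement_matches(interface_ip, network, wildcard):
    """True if interface_ip is covered by `network <network> <wildcard>`."""
    care = ~int(IPv4Address(wildcard)) & 0xFFFFFFFF  # bits that must match
    return (int(IPv4Address(interface_ip)) & care) == (int(IPv4Address(network)) & care)


if __name__ == "__main__":
    # Intended statement: network 10.0.0.0 0.255.255.255 area 0
    print(network_statement_matches("10.20.30.1", "10.0.0.0", "0.255.255.255"))
    # Typo'd wildcard: only 10.0.0.x matches, so this interface is skipped
    print(network_statement_matches("10.20.30.1", "10.0.0.0", "0.0.0.255"))
```

The second case produces no error on the router. The interface just never runs OSPF.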

OSPF's complexity is a fractal: it looks simple from far away but reveals more complexity at every zoom level.

Why Do We Still Use OSPF?

Given all these problems, why is OSPF still the dominant enterprise IGP?

Standard: OSPF is an open standard (unlike EIGRP, which was Cisco proprietary until 2013). Multi-vendor networks need standards.

Better Than RIP: RIP is worse in every way. OSPF is the next step up.

Good Enough: For small to medium networks (under 50 routers), OSPF works fine. Most enterprises fit this.

Known Quantity: Network engineers understand OSPF. There's decades of documentation, troubleshooting guides, and trained staff.

No Great Alternatives:

  • EIGRP: Was Cisco-only for years, still has limited multi-vendor support
  • IS-IS: Technically superior but less common, harder to find engineers who know it
  • RIP: lol no

It's Not BGP: Using BGP internally (which some large data centers do) is complex and requires different thinking. Most enterprises aren't ready for that.

OSPF is the network protocol equivalent of democracy: the worst option except for all the others.

IS-IS: The Protocol That's Better But Nobody Uses

Integrated IS-IS is technically superior to OSPF in several ways:

  • More flexible TLV-based design (easier to extend)
  • Slightly better scalability
  • Simpler encapsulation (runs directly over layer 2, not IP)
  • No IP-related vulnerabilities

Large ISPs prefer IS-IS for internal routing. But enterprises stick with OSPF because:

  • IS-IS is less familiar
  • Fewer engineers are trained on it
  • It has less documentation and tooling
  • "If it ain't broke, don't switch to an equally complex but different protocol"

IS-IS is the Betamax of routing protocols: technically superior, but it lost the market adoption war.

OSPF in Modern Networks

Modern networks have partially moved beyond pure OSPF:

Segment Routing: Extends OSPF with source-routing capabilities, providing traffic engineering without the signaling complexity of LDP and RSVP-TE.

OSPF Extensions: OSPFv3 (for IPv6), traffic engineering extensions, and graceful restart improve functionality.

BGP Everywhere: Some data centers use BGP for both external and internal routing (BGP unnumbered, BGP EVPN). This simplifies by using one protocol for everything.

SDN Controllers: Software-defined networking controllers can compute paths centrally and install forwarding rules, using OSPF only for backup or basic connectivity.

Shorter Hold Times: With BFD, networks can detect failures in milliseconds instead of OSPF's 40-second default dead interval, improving convergence.

These improvements help, but they don't fix OSPF's fundamental limitations. They're Band-Aids on a protocol that's showing its age.
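For a sense of scale on the BFD point, compare detection times. BFD declares a neighbor down after a number of consecutive missed packets, so detection time is roughly the interval times a multiplier. The 50 ms interval and multiplier of 3 below are hypothetical example values, not defaults from any particular platform.

```python
def bfd_detection_ms(interval_ms, multiplier):
    """Approximate BFD failure detection time in milliseconds."""
    return interval_ms * multiplier


def ospf_dead_interval_ms(hello_s=10, missed_hellos=4):
    """OSPF dead interval with default timers, in milliseconds."""
    return hello_s * missed_hellos * 1000


if __name__ == "__main__":
    bfd = bfd_detection_ms(interval_ms=50, multiplier=3)
    ospf = ospf_dead_interval_ms()
    print(f"BFD: {bfd} ms, OSPF defaults: {ospf} ms, ratio: {ospf // bfd}x")
```

Detection is only the first step, though. The LSA flooding and SPF run still have to happen afterward.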

The Enterprise Network Reality

Here's the uncomfortable truth about OSPF in enterprise networks:

Most are over-designed or under-designed. You either have:

Flat Network: Everything in Area 0. This "works" until you hit about 50 routers, then SPF churn and convergence issues appear. But it's simple to understand and configure.

Over-Engineered Hierarchy: Multiple areas, stub configurations, complex summarization, designed by a consultant who's no longer available. When something breaks, nobody fully understands the design.

The ideal is a well-designed multi-area OSPF network with appropriate stub usage, clean summarization, and logical area boundaries matching organizational structure. Most networks aren't this. They're OSPF configurations that evolved organically, accumulated cruft over years, and now "work well enough that nobody wants to touch it."

This is fine. Perfect is the enemy of good. An imperfect OSPF design that's well-documented and understood by your team beats a theoretically optimal design nobody comprehends.

Living With OSPF's Limitations

OSPF is a protocol from a simpler time, when networks were smaller, traffic was more predictable, and shortest-path routing was good enough. Modern networks need more flexibility, better scalability, and faster convergence.

But we're stuck with OSPF because:

  1. It's the standard everyone knows
  2. Replacing it is expensive and risky
  3. Alternatives have their own issues
  4. For most networks, it's good enough

The path forward is incremental:

  • Keep networks small enough for OSPF (under 100 routers)
  • Use areas properly (but don't over-complicate)
  • Deploy BFD for fast failure detection
  • Consider alternatives (IS-IS, BGP) for very large networks
  • Use OSPF as basic connectivity with overlay protocols (MPLS, segment routing, EVPN) for advanced features
  • Accept that OSPF will never be perfect

OSPF is like an old car that's paid off. It has quirks, needs regular maintenance, and isn't as nice as newer models. But you know how to fix it, parts are available, and it gets you where you need to go. Trading it in seems like too much hassle.

So we live with OSPF's scalability limits, area complexity, and inflexibility. We work around its limitations with careful design, supplementary protocols, and occasional cursing. Because the cure, ripping out your IGP and replacing it, is worse than the disease.

OSPF: it's not great, but it's what we have, and we're probably stuck with it until someone invents something dramatically better that's also backward-compatible, easy to configure, and well-understood. Don't hold your breath.