
Route 53: The Most Undervalued AWS Service With Skeletons They Won't Sweep Out of the Closet

Scott Morrison · December 29, 2025
Tags: Route 53, AWS, DNS, shuffle-sharding, anycast, control plane, data plane, Nitro card, DNSSEC
Amazon Route 53 delivers the only 100% uptime SLA in AWS through shuffle-sharding, anycast routing, and architectural brilliance, but its control plane lives in us-east-1 like it's 2015 and nobody learned anything from a decade of outages. Let's explore why Route 53's data plane is engineering excellence while its control plane is a single point of failure wrapped in a 60-minute RTO band-aid.

Let's talk about Amazon Route 53. To be clear, Route 53 isn't a single service; it's a suite of services that perform various DNS functions. They include: Route 53 Hosted Zones (public hosted zones), Route 53 Registrar, Route 53 Private Hosted Zones, Route 53 Health Checks, Route 53 Resolver (also known as VPC resolver, also known as the .2 resolver, also known as the VPC +2 resolver, also known as... can you see where I'm going here... branding...), Route 53 Resolver Endpoints, Route 53 Profiles, Route 53 Resolver DNS Firewall, and Route 53 Global Resolver (which was released to public preview at re:Invent 2025).

In my opinion, Route 53 (the entire suite of products) is the most undervalued service at AWS, but as the title says, they have a few skeletons in their closet that they just won't sweep up. Now, I want to be up-front: I was employed by AWS until December 2025, when I left for a new opportunity. None of the following information is secret or proprietary. Anyone can see it in the docs or by using the services. I also want to reiterate the disclaimer at the bottom of every page on my site: the views expressed here are my own and don't represent those of any current, former, or future employer or associate. OK! Now that I'm past all the disclaimers, let's get into the fun part.

Why Route 53 is the most undervalued service at AWS

When DNS fails, it's a really bad day, but Route 53 focuses on making sure those bad days aren't a product of the provider (in this case, AWS). Let's start with the easy one to talk about: Route 53 public hosted zones.

When you create a hosted zone, like theinternetpapers.com, you get four nameservers (go dig NS theinternetpapers.com and you can see what mine are). Those four nameservers represent four different data planes. Now these aren't just four different data planes, they are on four different top-level domains (TLDs). If you are in the commercial partition like most customers, those are .com, .net, .org, and .co.uk. So when AWS rolls out a code update, they roll it out to one data plane at a time and make sure it's ok before rolling it out to the next so that there is always a data plane available. If the provider of a TLD is having a bad day, at most two of the four data planes might be affected (.net and .com are both run by Verisign, so in theory they could both have the same impact, but that's unlikely).
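If you want to see your own zone's TLD spread, here's a minimal sketch using the third-party dnspython package (an assumption; plain old dig works just as well). The output shown is illustrative, not my real nameservers:

```python
# List the NS records for a Route 53 hosted zone and note the TLD spread.
# Assumes dnspython is installed: pip install dnspython
import dns.resolver

answers = dns.resolver.resolve("theinternetpapers.com", "NS")
for name in sorted(str(rr) for rr in answers):
    print(name)

# Illustrative output -- your four names will differ, but they land on
# four different TLDs (.com, .net, .org, .co.uk):
#   ns-123.awsdns-45.com.
#   ns-1567.awsdns-03.co.uk.
#   ns-1890.awsdns-44.org.
#   ns-678.awsdns-21.net.
```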

Shuffle-sharding, or how AWS convinced you to share nameservers without sharing problems

Those four nameservers are four of a pool of 2,048 virtual nameservers run by Route 53. You share no more than two nameservers with any other customer on the planet. This is called shuffle-sharding, and it's mathematically beautiful. With 2,048 nameservers selected in groups of four, there are approximately 730 billion possible combinations. When someone launches a DDoS attack against your domain, only your four virtual nameservers see the traffic spike. Nobody else is affected, because they share at most two nameservers with you, and those two are different from the two you share with anyone else.
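If you want to sanity-check that math, the combinatorics is one standard-library call in Python:

```python
# Number of distinct four-nameserver shards from a pool of 2,048.
import math

print(f"{math.comb(2048, 4):,}")  # 730,862,190,080 -- roughly 730 billion
```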

It's the networking equivalent of living in an apartment building where you only share walls with two neighbors maximum, and those neighbors each share walls with completely different people. If your neighbor throws a loud party, it affects you and maybe one other person, but not the entire building. AWS open-sourced the implementation in the Route 53 Infima library on GitHub if you want to get really nerdy about it.

Anycast, or how the same IP address exists in 200 places simultaneously

Those four nameservers are actually many physical servers that use anycast to announce the same IP addresses from many different points of presence (PoPs) across the globe. Route 53 operates from over 410 PoPs, including more than 400 edge locations in over 90 cities across 48 countries. These include AWS edge sites, AWS regions, and embedded sites inside ISPs.

Here's the clever part: Route 53 typically advertises only one TLD stripe per edge location, not all four. Sydney might serve .com, Singapore might serve .org, Hong Kong might serve .net. This intentional distribution provides Internet path diversity. If one edge location fails, resolvers have three other nameservers at completely different locations to fall back to. The Internet's BGP routing converges in about five minutes, and queries continue working because you've got alternative paths.

So you could have some type of congestion event from a noisy neighbor that affects two of your nameservers, an issue with a TLD or data plane affecting a third nameserver, and lose a few physical servers in your fourth nameserver pool and still not miss answering any of your clients' DNS queries.

The 100% uptime SLA, and what it actually covers

This is why Route 53 hosted zones are the only AWS offering with a 100% uptime SLA (short of force majeure and some fine print, although it will probably survive that too). It's also cheap for all that it provides, which isn't just a 100% uptime SLA but some cool features like geolocation routing, latency-based routing, geoproximity routing, failover routing, aliasing (a somewhat proprietary record type), IP-based routing, managed DNSSEC, query logging, and built-in denial-of-service (DoS) protection.
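As a taste of those routing features, here's a minimal sketch of a failover record pair with boto3 (an assumption, as are the placeholder zone ID, health check ID, and addresses):

```python
# A minimal sketch of failover routing via the Route 53 API, using boto3.
# The zone ID, health check ID, and IPs are hypothetical placeholders.
import boto3

r53 = boto3.client("route53")

def upsert_failover(role, ip, health_check_id=None):
    """UPSERT one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:  # the PRIMARY needs a health check to fail away from
        record["HealthCheckId"] = health_check_id
    r53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # hypothetical zone ID
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

upsert_failover("PRIMARY", "192.0.2.10", health_check_id="11111111-2222-3333-4444-555555555555")
upsert_failover("SECONDARY", "198.51.100.10")
```

The PRIMARY carries a health check so Route 53 knows when to fail away from it; the SECONDARY answers only while the primary is unhealthy.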

But here's the thing about that SLA: it only applies to answering DNS queries (data plane), not changing the records (control plane). Which brings us to the skeletons.

Skeleton #1: The us-east-1 control plane, or how we learned nothing from a decade of outages

It is pretty well known that the control plane for Route 53 hosted zones is in us-east-1. This means that when us-east-1 has issues, which happens from time to time, Route 53's control plane may or may not be available. But what about that 100% uptime SLA? As noted above, it only applies to answering DNS queries (data plane), not changing the records (control plane).

The December 7, 2021 outage was a masterclass in what happens when your control plane lives in one region. Automated scaling triggered "unexpected behavior from a large number of clients" on AWS's internal network, which is a polite way of saying AWS DDoS'd itself. Route 53 APIs remained impaired from 7:30 AM PST until 2:30 PM PST, seven hours during which customers could not modify DNS records. The good news? Existing DNS entries and query resolution continued functioning throughout, because the data plane doesn't care what the control plane is doing.

Accelerated recovery, or the 60-minute band-aid

AWS released "accelerated recovery" for public hosted zones at re:Invent 2025 to help address this skeleton. At the time of writing, the docs say "Route 53 accelerated recovery for managing public DNS records is designed to achieve a 60-minute Recovery Time Objective (RTO) in the event of service unavailability in the US East (N. Virginia) Region." While a 60-minute RTO is better than "eventually," that's still an hour before you can change your records. It also forces you to use us-west-2 as your secondary region.

For most customers this isn't an issue, but customers with pedantic data residency requirements may find that storing their control plane data in the US isn't an option, and unfortunately these requirements are becoming more popular. If you are the person who writes those requirements, don't. If someone is storing something of any level of secrecy in a DNS zone, public or private, they are already in the wrong, and if an IP address, public or private, is secret, you're still wrong.

The multi-provider fallacy

If you think you're going to increase your availability by using multiple DNS providers, DON'T! Route 53 control plane issues don't happen frequently and you can architect yourself in a way to minimize and even potentially eliminate impact from them on your business. By introducing a second DNS provider you create complexity that will almost certainly lead to mistakes and actually reduce your availability. So in the grand scheme of things, just accept it and move on.

What AWS should actually do

If you're an AWS customer big enough to have an account team, here are the feature requests I'd be pushing if I were you:

1. Let you choose the region you create your hosted zone in, and have that be the primary region that controls your global data plane.
2. Let you choose the second region for failover.
3. Give you a highly available way to switch which region is primary and which is secondary, based on your choices from 1 and 2.
4. Give you a global zone-specific DNS record to control your zone, which switches which region it points to based on 3.
5. Make failover complete in under five minutes.

Private hosted zones: mostly the same problems, with bonus annoyances

Private hosted zones are similar in almost every way to public hosted zones, though they don't currently offer accelerated recovery. Most of their skeletons are the same as the public ones, with a few exceptions: first, you must attach a private hosted zone to a VPC on create. Many people like to create private hosted zones in a centralized account, which may not have a VPC at all. This means they have to create a VPC just to create the zone, then delete that VPC immediately after (sketched below). So annoying.
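For the record, the dance looks something like this with boto3 (an assumption, as are all the IDs below):

```python
# The VPC dance for a centralized private hosted zone, sketched with boto3.
# All IDs are hypothetical placeholders.
import time

import boto3

r53 = boto3.client("route53")

# 1. Create the zone -- a VPC is mandatory here, even if it's a throwaway.
zone_id = r53.create_hosted_zone(
    Name="internal.example.com",
    CallerReference=str(time.time()),  # must be unique per request
    HostedZoneConfig={"Comment": "central private zone", "PrivateZone": True},
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123throwaway"},
)["HostedZone"]["Id"]

# 2. Associate the VPCs you actually care about (cross-region is fine).
r53.associate_vpc_with_hosted_zone(
    HostedZoneId=zone_id,
    VPC={"VPCRegion": "us-west-2", "VPCId": "vpc-0456realworkload"},
)

# 3. Now detach the throwaway (a zone must always keep at least one VPC,
#    so this only works after step 2) and go delete that VPC.
r53.disassociate_vpc_from_hosted_zone(
    HostedZoneId=zone_id,
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123throwaway"},
)
```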

The second is that private hosted zones don't support DNSSEC. While the security between the Route 53 Resolver, Resolver Endpoints, and the private hosted zone is, dare I say, impenetrable, the lack of DNSSEC can create a compliance headache for some folks. That said, it's more of a policy issue than a technical one.

Route 53 Registrar: the boring but necessary service

Let's get this out of the way up front: Route 53 Registrar is more of a domain reseller than a registrar. Buying domains is pretty boring, and to be fair, it's also a low-margin business. There just isn't much meat in it for AWS beyond giving domain purchases an API and putting the bill on a single vendor, which is useful for customers whose painful procurement processes make billing consolidation attractive.

They do a good job; it's just not that interesting one way or another. A few downsides: they don't support premium domains right now, they don't have the best TLD support in the industry (though they routinely add more supported TLDs), and they resell a lot of domains, although they are the actual registrar for some too. What does this mean to you? It may just alter who shows up in the WHOIS record. Honestly, given all the legal overhead of being a domain registrar, I wouldn't want to be one either.

If you're not using a premium domain, and already using AWS for everything else, you might as well use AWS as your registrar. If you prefer someone else, use someone else, so long as they are trustworthy.

Health checks: powerful but expensive or cheap but limited

Health checks are also pretty boring. They can do a simple TCP check to see if the 3-way handshake succeeds; a more complex HTTP/HTTPS check that looks at the response code; an Application Recovery Controller health check (which gives you a big red button to push if one of the other health checks gets into a nondeterministic state); CloudWatch alarm health checks (which we will talk about in a minute); and calculated health checks that aggregate a set of other health checks.

Application Recovery Controller is super expensive (about $1,800 per month per recovery cluster). Alternatively, you can hack together something similar by creating a simple HTTP health check against an S3 object in a region different from the one you're failing away from: create the object to signal healthy, delete it to force the failover (see the sketch below). TCP is super simple: are you there or aren't you? HTTP(S) can be thorough if you build a path on your endpoint that does deep health checking of the system and distills it down to an HTTP response code.
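Here's the S3 hack sketched with boto3 (an assumption); the bucket name and region are hypothetical, and the bucket needs static website hosting enabled so the health checker gets a plain HTTP 200 or 404:

```python
# The cheap "big red button": an HTTP health check against an S3 object,
# sketched with boto3. Bucket name and region are hypothetical; the bucket
# has static website hosting enabled and lives outside the failing region.
import uuid

import boto3

r53 = boto3.client("route53")
s3 = boto3.client("s3", region_name="us-west-2")

BUCKET = "my-failover-flags"  # hypothetical

r53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTP",
        "FullyQualifiedDomainName": f"{BUCKET}.s3-website-us-west-2.amazonaws.com",
        "Port": 80,
        "ResourcePath": "/primary-healthy",  # object exists => 200 => healthy
        "RequestInterval": 10,
        "FailureThreshold": 1,
    },
)

# Pull the lever: delete the object and the health check starts failing...
s3.delete_object(Bucket=BUCKET, Key="primary-healthy")
# ...recreate it to fail back.
s3.put_object(Bucket=BUCKET, Key="primary-healthy", Body=b"ok")
```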

CloudWatch health checks are painful for many reasons

You can only specify a CloudWatch alarm in a single region, while HTTP(S) and TCP health checkers run from multiple regions. You cannot use high-resolution metrics, extended statistics, metric math, or a bunch of other scenarios, and if you change the alarm configuration you must re-sync it to the Route 53 health check or you end up in a weird nondeterministic state. You can use calculated health checks to get around some of that, but not much.

Health checking leaves a lot to be desired. It's great at what it does, but more depth and breadth would make it a lot better. A simpler, less expensive version of Application Recovery Controller that lets you just flip the switch without a single-region dependency would also be nice.

Route 53 Resolver: the free service with no SLA that's more reliable than most paid services

Route 53 Resolver, beyond the fact that it goes by too many names, has a few key issues, but it also has some really great properties: when you send a query to the resolver IP, it is intercepted by the Nitro card on the box, duplicated, and sent to two hosts in the resolver fleet. Each of those hosts resolves the query and sends a response back to the Nitro card, which deduplicates and returns a single answer to the guest OS. You can see this if you control the nameserver being queried: you'll see two queries arrive from a client that only sent one.

This means that if one of those queries fails, due to a network issue, a resolver host failing, or any of the million other possible causes, the client still gets the other answer. Route 53 Resolver doesn't have an SLA, though. Why not? Well, simply stated, SLAs are a billing construct for refunding money when something doesn't work, and resolver is free; how do you refund something that is free? But rest assured it is extremely available, and while individual hosts/instances might see an issue for some reason, it's very unlikely that many hosts in a VPC will see the same issue.

Skeleton: the 1,024 packets-per-second limit, or when the Nitro card says "not my problem"

One of the painful things about Route 53 Resolver is the 1,024 packets-per-second limit. This is a limit on link-local services: the resolver, instance metadata, the Amazon Time Sync Service (NTP), and, if you enjoy the pain of Windows Server, the Windows licensing service. Combined, these services can send at most 1,024 packets per second per ENI. For people running a single, monolithic service on an instance, this isn't so bad. For people running many services, usually in containers, it can be much more painful.

The problem is that all of these services are proxied by the Nitro card. The Nitro card, if you've watched any of the re:Invent videos or read the blogs, has many jobs: running the hypervisor, the networking stack, link-local services, Nitro Enclaves, and plenty more, and it is a single compute unit with finite CPU cycles and memory. The limit is not tied to instance type and cannot be increased. When you exceed it, DNS queries are silently dropped: no error is returned, and the throttled queries don't appear in query logs.

Monitoring requires checking the linklocal_allowance_exceeded metric via ethtool -S eth0 | grep linklocal. Mitigations include enabling local DNS caching, implementing NodeLocal DNSCache for Kubernetes, and distributing workloads across more ENIs.
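A minimal watcher built on that ethtool counter might look like this (the interface name is an assumption, and you'd feed this into real monitoring rather than printing):

```python
# A minimal watcher for the silent drops: poll the ENA driver's
# linklocal_allowance_exceeded counter and complain when it grows.
import subprocess
import time

def linklocal_exceeded(iface="eth0"):
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    # Lines look like "    linklocal_allowance_exceeded: 0"
    for line in out.splitlines():
        if "linklocal_allowance_exceeded" in line:
            return int(line.split(":")[1])
    return 0

last = linklocal_exceeded()
while True:
    time.sleep(60)
    now = linklocal_exceeded()
    if now > last:
        print(f"WARNING: {now - last} link-local packets dropped in the last minute")
    last = now
```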

Skeleton: EDNS Client Subnet remains unsupported

Another downside to Route 53 Resolver is that it doesn't currently support EDNS Client Subnet (ECS), an extension field carrying a truncated form of the IP of the client that made the query. Support would be nice: public hosted zones could make decisions based on the public IP of the host making the request, and private hosted zones based on its private IP. Currently, both can only make decisions based on the public IP used by the resolver host in the resolver fleet, which is unknown to the user.

Geolocation and latency-based routing still work, because AWS knows these resolver IPs and third parties know the location of the subnets through published geolocation data (MaxMind, for example), but you can't get down to the level of the actual client right now, and that could be a really handy feature. In AWS's defense, they built a really resilient and complex system, and implementing ECS in it is incredibly hard, so let's cut them a break for a while and hope they get there sooner rather than later.
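For context, here's what an ECS option looks like when you build one by hand with the third-party dnspython package (an assumption); 8.8.8.8 is used because Google's public resolver honors ECS, which makes the contrast with the VPC resolver easy to see:

```python
# What an ECS option looks like on the wire, built with dnspython.
import dns.edns
import dns.message
import dns.query

query = dns.message.make_query("www.example.com", "A")
# Attach a truncated client subnet (203.0.113.0/24 is a documentation range).
ecs = dns.edns.ECSOption("203.0.113.0", srclen=24)
query.use_edns(edns=0, options=[ecs])

response = dns.query.udp(query, "8.8.8.8", timeout=3)
for opt in response.options:
    print(opt)  # an ECS option echoed back means the resolver used it
```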

Skeleton: DNSSEC validation operates silently

Another thing that annoys some people is how Route 53 Resolver handles DNSSEC validation. You can enable DNSSEC validation on Route 53 Resolver, but it happens before the response reaches the guest OS, and the validation markers aren't passed along. So you must trust that Route 53 is doing it correctly. You have no way to really verify (you could, I presume, send some queries for names with bad signatures and watch them fail to resolve, but that only proves those specific queries were validated, not that all are).
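If you want to run that spot check yourself, here's a sketch with the third-party dnspython package (an assumption), pointed at the usual VPC +2 resolver address (hypothetical here) and at dnssec-failed.org, a public test zone with deliberately broken signatures:

```python
# Spot-checking resolver DNSSEC behavior from a guest OS with dnspython.
import dns.exception
import dns.flags
import dns.message
import dns.query
import dns.rcode

VPC_RESOLVER = "10.0.0.2"  # hypothetical: your VPC CIDR base + 2

def check(name):
    q = dns.message.make_query(name, "A")
    q.flags |= dns.flags.AD  # politely ask for validation info back
    try:
        r = dns.query.udp(q, VPC_RESOLVER, timeout=3)
        ad = bool(r.flags & dns.flags.AD)
        print(f"{name}: rcode={dns.rcode.to_text(r.rcode())} AD={ad}")
    except dns.exception.Timeout:
        print(f"{name}: timed out")

check("www.example.com")    # resolves, but don't expect the AD bit back
check("dnssec-failed.org")  # with validation on, this should fail to resolve
```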

For most people, trusting AWS is fine; having worked there and seen how the sausage is made, I personally trust them to do this validation. But some people have auditors, or serious trust issues, that require them to validate for themselves, and that presents a problem. Those people tend to bypass the resolver, losing all the resilience it brings, and run their own resolver on EC2 or elsewhere.

Route 53 Resolver Endpoints: the service people deploy incorrectly and pay too much for

Route 53 Resolver Endpoints are an extension to Resolver that allows ingress and egress of queries from outside a VPC (read: on-premises). It is a modest service that places ENIs in a VPC of your choosing and lets you either resolve queries in the context of that VPC (inbound endpoints) or use resolver rules to forward queries for a particular zone out of that VPC via those ENIs to IP addresses of your choosing (outbound endpoints).

The big skeleton in the closet here is people misusing the service and incurring the cost to boot. You pay an hourly charge for every endpoint ENI you create ($0.125/hour per ENI, translating to $182.50/month minimum per endpoint with two ENIs). Many people create an inbound and outbound endpoint in every, or at least most, of the VPCs they have. You don't have to do this.

You can instead attach the private hosted zones for all those VPCs to one central VPC and put a single inbound endpoint there. Likewise, you can create resolver rules against a single outbound endpoint in that same VPC, share them to other accounts, and associate them with all of your VPCs. The only thing you can't do is span regions, so you need one set of outbound endpoints per region that needs those rules. Inbound can, in theory, live in a single region, since private hosted zones span regions, but hosting it in a single region might be an availability risk you're not happy with.
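Sketched with boto3 (an assumption, along with every ID below), the centralized pattern looks like this; cross-account sharing of the rule itself goes through AWS RAM and is omitted:

```python
# One outbound endpoint, one forward rule, many VPC associations.
import uuid

import boto3

r53r = boto3.client("route53resolver")

# One rule that forwards corp.example.com to on-prem DNS, bound to the
# single outbound endpoint in the central VPC.
rule_id = r53r.create_resolver_rule(
    CreatorRequestId=str(uuid.uuid4()),
    Name="corp-example-com",
    RuleType="FORWARD",
    DomainName="corp.example.com",
    TargetIps=[{"Ip": "10.10.0.53", "Port": 53}],  # hypothetical on-prem resolver
    ResolverEndpointId="rslvr-out-0123example",    # hypothetical outbound endpoint
)["ResolverRule"]["Id"]

# Associate that one rule with every VPC in the account -- no per-VPC
# endpoints, no per-VPC hourly ENI charges.
for vpc_id in ["vpc-0aaaexample", "vpc-0bbbexample", "vpc-0cccexample"]:
    r53r.associate_resolver_rule(ResolverRuleId=rule_id, VPCId=vpc_id)
```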

Route 53 DNS Firewall: defense in depth, not defense in full

One very cool feature that Route 53 came out with a few years ago is Route 53 Resolver DNS Firewall. It filters the DNS queries that a client on EC2, or one using an inbound endpoint, is allowed to make. It's a handy first line of defense; just remember there are plenty of ways for malicious users and code to bypass it: a direct query to an outside resolver, DNS over HTTPS (which looks like a regular HTTPS packet), skipping the query entirely and going straight to an IP, and so on. So remember, this is a defense-in-depth tool.

Critical limitations: DNS Firewall filters domain name strings only; it never resolves domains to IP addresses, and it only filters queries routed through the VPC resolver. Bypass methods include direct queries to external DNS (8.8.8.8, 1.1.1.1), DNS over HTTPS (DoH), DNS over TLS (DoT), direct IP access if attackers already know their targets, and VPN/proxy tunneling. It is not free, but it sure is handy.
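A minimal setup with boto3 (an assumption; names and the VPC ID are placeholders) looks like this:

```python
# One blocked domain list, one rule group, attached to one VPC.
# Remember: this only sees queries that go through the VPC resolver.
import uuid

import boto3

r53r = boto3.client("route53resolver")

def req():
    return str(uuid.uuid4())

domain_list_id = r53r.create_firewall_domain_list(
    CreatorRequestId=req(), Name="known-bad-domains"
)["FirewallDomainList"]["Id"]
r53r.update_firewall_domains(
    FirewallDomainListId=domain_list_id,
    Operation="ADD",
    Domains=["malware.example.com", "*.phishing.example.com"],
)

group_id = r53r.create_firewall_rule_group(
    CreatorRequestId=req(), Name="baseline-dns-firewall"
)["FirewallRuleGroup"]["Id"]
r53r.create_firewall_rule(
    CreatorRequestId=req(),
    FirewallRuleGroupId=group_id,
    FirewallDomainListId=domain_list_id,
    Priority=100,
    Action="BLOCK",
    BlockResponse="NXDOMAIN",  # answer "no such domain" instead of timing out
    Name="block-known-bad",
)

# Attach the rule group to a VPC; lower association priority runs earlier.
r53r.associate_firewall_rule_group(
    CreatorRequestId=req(),
    FirewallRuleGroupId=group_id,
    VpcId="vpc-0123example",
    Priority=101,
    Name="baseline-for-this-vpc",
)
```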

Route 53 Profiles: automation with a regional pricing quirk

Route 53 Profiles is an abstraction layer that basically does control plane automation for attaching private hosted zones, resolver rules, DNS Firewall rules, query logging configurations, and private hosted zones associated with PrivateLink endpoints to many VPCs and accounts.

The downsides: there is a higher upfront cost for the first 100 VPCs ($0.75/hour, approximately $540/month), but scaling past that is pretty cheap ($0.0014/hour per additional association). So if you only have 20 VPCs this might be a lot, but if you have thousands of VPCs it's probably worth it. Profiles are also regional, which means if you have 100 VPCs in each of 30 regions, you aren't paying the base price once, you are paying it times 30.
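Plugging the quoted prices into a quick cost model makes the regional multiplier obvious (roughly 730 hours per month assumed):

```python
# Quick cost model for Profiles' regional pricing, using the prices above.
HOURS = 730

def profiles_monthly_cost(vpcs_per_region, regions):
    base = 0.75 * HOURS                                    # first 100 VPCs, per region
    extra = max(0, vpcs_per_region - 100) * 0.0014 * HOURS
    return regions * (base + extra)

print(f"${profiles_monthly_cost(20, 1):,.0f}")    # ~$548/mo: steep for 20 VPCs
print(f"${profiles_monthly_cost(100, 30):,.0f}")  # ~$16,425/mo: the base price times 30
print(f"${profiles_monthly_cost(2000, 1):,.0f}")  # ~$2,489/mo: thousands of VPCs, one region
```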

If you don't like the price, you can build all this yourself, but then you own the code and operational cost, so weigh your choices. There are also some quirks with what information you can see about resources in a profile if you are a member account and not the owner of the resource. That presents some concern if you don't have strong governance around those profiles and what people can attach. You are left having to trust what's in the profiles.

Route 53 Global Resolver: amazing data plane, us-east-2 control plane (really?!)

Route 53 Global Resolver (not to be confused with VPC Resolver or Resolver Endpoints... can you see the branding nightmare...) is the baby of the Route 53 family. While it may be the baby, it's been a long time coming. I, personally, was extremely excited to see it released and am a big proponent of the feature. Then I got my hands on it, and I gave some engineers and friends of mine big grief over one major mistake they made.

Before I get into the bad part, let me give all the good parts. Global Resolver lets you install an agent on your machines that signs DNS requests to a configured endpoint, allowing you to query Route 53 Resolver from anywhere on the globe, and probably beyond if you're a rocket-ship billionaire. If you don't want to install the agent, you can allow-list a set of source IPs that are permitted to query the endpoint anonymously (aka unsigned).

It's highly available, highly scalable, and allows you to resolve your private hosted zones and take advantage of DNS Firewall and query logging across your entire fleet outside of AWS. From a data plane perspective, this is an amazing feature, and I'd look real hard and deep at any reason to use inbound resolver endpoints over this. There are reasons to still use inbound resolver endpoints, but a lot of the old reasons can go away with this.

The skeleton that makes me question if anyone learned anything

I promised skeletons, and Global Resolver has a big one in my opinion: its control plane is hosted in a single region, us-east-2. Well, at least it isn't us-east-1, I suppose, but us-east-2 is no less predisposed to failure than us-east-1. One would think that after a decade of complaining and pain from the single-region control plane of a global service (Route 53 hosted zones), the lesson would have been learned, but here we are at the end of 2025, releasing a service with a single-region control plane.

At the time of writing, Global Resolver is still in public preview, not generally available, and I honestly hope they pull it back and figure out how to do regional control planes with a global data plane, giving the customer the choice of a primary and secondary control plane region for Global Resolver.

The conclusion: data plane excellence, control plane disappointment

Route 53 has created a family of DNS data plane excellence. When I build services, I go all in on Route 53, and I would encourage anyone starting fresh to choose it too. Its control planes leave something to be desired, but I have hope they will improve over the coming years. Route 53 has changed the mantra from "It's always DNS" to "It's always the DNS control plane," and that makes horrible days a lot less horrible.

Their engineers, while not perfect, have made some of the best decisions in the industry, building some of the best products, and for all their faults, they have made the industry a better place.