Inside the AWS Outage

What really happened, one week on.

⏱️ 7-minute read.

With the dust now settled on one of the most disruptive cloud outages of the past decade, this article reviews the AWS outage in the us-east-1 region, the race condition that caused it, and the broader issue of single-vendor risk.

The DNS Management System

Between 11.48pm Pacific Daylight Time (PDT) on the 19th of October and 2.20pm on the 20th of October, Amazon Web Services experienced a serious and prolonged outage in its N. Virginia (us-east-1) region.

The incident was triggered by a race condition that lay hidden within Amazon’s automated DNS management system.

Once triggered, it caused complete failures for over 1,000 services that relied on this AWS region, from Roblox, Snapchat and Robinhood to IoT hospital beds and coffee machines.

At the centre of the incident was DynamoDB, AWS's managed database service. DynamoDB relies on an automated DNS management system to maintain its DNS records, a key component of load balancing in each AWS region.

DNS records are like a phone book for internet services; they are what computers use to find their way to whatever website, API endpoint or application they are looking for.  

The DNS management system maintains this phone book for DynamoDB and updates it on a rapid, automated schedule to divert traffic away from hardware failures, add compute capacity and generally manage traffic through AWS services.

For example, if a particular server is experiencing errors, the DNS management system will update its records to divert or reduce traffic to that server. This keeps uptime high and disruption minimal in the event of failures, which are inevitable in the vast data centres that AWS operates.
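
To make the phone-book analogy concrete, here is a minimal sketch in Python of how weighted DNS-style records might be adjusted to steer traffic away from a failing target. The endpoint name, IP addresses and weights are hypothetical, and this is not AWS's actual data model.

```python
# Hypothetical weighted "phone book": each endpoint maps to a list of
# (IP address, weight) pairs, and traffic is split in proportion to the weights.
dns_records = {
    "dynamodb.us-east-1.example.com": [
        ("198.51.100.10", 100),  # healthy load balancer
        ("198.51.100.11", 100),  # healthy load balancer
        ("198.51.100.12", 100),  # load balancer that starts returning errors
    ],
}

def set_weight(records, endpoint, ip, new_weight):
    """Steer traffic away from a failing target by lowering its weight."""
    records[endpoint] = [
        (addr, new_weight if addr == ip else weight)
        for addr, weight in records[endpoint]
    ]

# A health monitor notices errors from 198.51.100.12 and shifts traffic away from it.
set_weight(dns_records, "dynamodb.us-east-1.example.com", "198.51.100.12", 0)
```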

DNS Management System Architecture

Within this DNS management system lay a race condition that had been dormant for an unspecified amount of time.

Before explaining what happened, I will give a brief overview of the architecture. There are two primary components in the DNS management system: the DNS Planner and the DNS Enactor.

The DNS Planner monitors the health and capacity of the load balancers and creates new DNS plans for each of the service’s endpoints. Each plan defines how traffic should be distributed across the available load balancers by assigning weights to each.  

The second component is the DNS Enactor. This component takes the plans from the DNS Planner and applies the required changes.  

Each AWS region is split into at least three Availability Zones (AZs). For resiliency, the DNS Enactor operates an independent instance in each AZ, meaning there is always at least one Enactor applying the latest DNS plan.  

Each instance of the Enactor looks for new DNS plans and updates the appropriate service by replacing the old plan with the new plan, one endpoint at a time, using a transaction to ensure the plans remain in a consistent state. This transaction is where the race condition came into play.
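
AWS has not published the internal data structures, but conceptually a plan can be pictured as a numbered generation of per-load-balancer weights that an Enactor applies only if it is newer than what is already live. The sketch below is a rough Python illustration under that assumption; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DnsPlan:
    """A generation of desired DNS state produced by the DNS Planner."""
    generation: int                               # monotonically increasing plan number
    weights: dict = field(default_factory=dict)   # load balancer IP -> traffic weight

@dataclass
class EndpointState:
    """The DNS state currently applied for one service endpoint."""
    applied_generation: int
    records: dict = field(default_factory=dict)

def apply_plan(endpoint: EndpointState, plan: DnsPlan) -> bool:
    """Apply a plan to one endpoint, but only if it is newer than what is live.

    The freshness check and the write are intended to be atomic; separating
    them in time is what opened the door to the race described later.
    """
    if plan.generation <= endpoint.applied_generation:
        return False                              # stale plan: do not overwrite newer state
    endpoint.records = dict(plan.weights)
    endpoint.applied_generation = plan.generation
    return True
```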

The Process

Under normal circumstances, this is how a DNS Enactor instance is supposed to operate:

  1. Enactor instance reads the latest DNS plan.
  2. Enactor then checks that the plan it has just read is newer than the plan it is overwriting.
  3. Enactor applies the changes to each service endpoint.  
  4. Enactor does a clean-up of outdated plans.
  5. Return to step 1.  

This is a rapid process and ensures that the DNS state is always freshly updated.  
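
Building on the apply_plan sketch above, the loop might be arranged roughly like this. The plan_store object, with its latest() and delete_older_than() methods, is a hypothetical stand-in for wherever the plans are actually kept.

```python
import time

def enactor_loop(plan_store, endpoints, poll_interval=1.0):
    """Simplified Enactor cycle: read, check, apply, clean up, repeat."""
    while True:
        plan = plan_store.latest()                            # 1. read the latest DNS plan
        for endpoint in endpoints:
            if plan.generation > endpoint.applied_generation: # 2. only apply newer plans
                apply_plan(endpoint, plan)                    # 3. apply to each endpoint
        plan_store.delete_older_than(plan.generation)         # 4. clean up outdated plans
        time.sleep(poll_interval)                             # 5. return to step 1
```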

A DNS Enactor instance may encounter delays if another instance is currently updating the same endpoint.

When this happens, the Enactor instance will retry each endpoint until the plan is successfully applied.  
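
That retry behaviour might look something like the following, where EndpointBusyError is a hypothetical signal that another Enactor instance currently holds the endpoint; this is a sketch, not AWS code.

```python
import time

class EndpointBusyError(Exception):
    """Hypothetical error raised when another Enactor is updating the same endpoint."""

def apply_with_retries(update_endpoint, max_attempts=5, delay=0.5):
    """Keep retrying a blocked endpoint update until it succeeds or we give up."""
    for _ in range(max_attempts):
        try:
            return update_endpoint()   # e.g. a call to apply_plan() from the earlier sketch
        except EndpointBusyError:
            time.sleep(delay)          # another instance holds the endpoint; wait and retry
    raise EndpointBusyError(f"endpoint still busy after {max_attempts} attempts")
```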

Slow and Steady Wins the Race Condition

Last week, however, one DNS Enactor instance (let's call it DNS Enactor 1) experienced unusually high delays and had to retry its updates on several DNS endpoints; it may have tried several times in a row to update an endpoint that was already being updated by another Enactor instance.

The delay was such that several new DNS plans had been produced by the time it began working through the service endpoints. Meanwhile, another instance of the DNS Enactor (let's call it DNS Enactor 2) began applying one of these new plans. The unfortunate timing of these events triggered the race condition.

DNS Enactor 2 finished writing the new plan before DNS Enactor 1 had completed its own writes, and then moved on to cleaning up outdated plans. At this point, DNS Enactor 1 was still applying a plan that was several generations old because of the unusually high delays it had experienced.

DNS Enactor 1 was operating on the basis that it had the newest plan, as it had already carried out the freshness check. It therefore continued writing; however, the plan it was using had now been deleted, so Enactor 1 began overwriting IP addresses with null values. This continued until all of the addresses were gone.

Because the current plan no longer existed, the other DNS Enactor instances were unable to check whether their plans were the most recent, so they simply stopped updating, leaving the phone book blank.
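
The sequence is easier to see laid out end to end. The toy reconstruction below (in Python, with hypothetical plan generations and addresses, heavily simplified from the real system) shows the essence of a check-then-act race: the freshness check and the write are separated in time, so a delayed writer can act on a plan that has since been superseded and deleted, leaving empty records behind.

```python
# Toy reconstruction of the race condition, not AWS's implementation.
plan_store = {                # plan generation -> plan contents (IP -> weight)
    7: {"198.51.100.10": 100},
    8: {"198.51.100.11": 100},
}
live = {"generation": 7, "records": plan_store[7]}     # currently applied DNS state

# --- Enactor 1: reads plan 8, checks that 8 > 7, then stalls before writing. ---
enactor1_plan_id = 8
assert enactor1_plan_id > live["generation"]           # freshness check passes... for now

# --- Enactor 2: meanwhile applies a much newer plan 12, then cleans up old plans. ---
plan_store[12] = {"198.51.100.12": 100}
live = {"generation": 12, "records": plan_store[12]}
for generation in [g for g in plan_store if g < 12]:
    del plan_store[generation]                         # clean-up deletes plans 7 and 8

# --- Enactor 1 resumes, still trusting its earlier check, and writes plan 8. ---
stale_records = plan_store.get(enactor1_plan_id, {})   # plan 8 is gone: nothing left to write
live = {"generation": enactor1_plan_id, "records": stale_records}

print(live)   # {'generation': 8, 'records': {}}  <- the phone book is now blank
```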

This caused all systems that needed to connect to the DynamoDB service in the us-east-1 region to immediately experience failures.

AWS engineers kicked into action, and by 2.25am on the 20th of October they had identified the issue and restored the correct DNS information. But whilst the earthquake had been dealt with, the tsunami had not yet arrived.

The Tsunami

The cascading effect of inconsistent network states meant that even though the underlying load balancers had recovered, health checks continued to oscillate between passing and failing.  

This instability caused additional DNS churn, with target IP addresses being repeatedly removed and re-added. Meanwhile, a massive backlog of serverless workloads, including Lambda function invocations and queue service messages, had accumulated.

This put further strain on capacity within the us-east-1 region, causing errors for nearly 12 hours after the initial DNS error had been resolved.  
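
As a rough illustration of why a backlog prolongs recovery: while capacity is constrained, the queue only shrinks at the difference between the drain rate and the arrival rate. The numbers below are entirely hypothetical, chosen only to show the shape of the effect, and are not AWS figures.

```python
# Entirely hypothetical figures, for intuition only; AWS has not published these numbers.
backlog = 5_000_000     # invocations/messages queued up during the outage
arrival_rate = 40_000   # new work arriving per minute once DNS was restored
drain_rate = 47_000     # work the still-constrained region can process per minute

minutes_to_clear = backlog / (drain_rate - arrival_rate)
print(f"~{minutes_to_clear / 60:.1f} hours to clear the backlog")   # roughly 11.9 hours
```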

Ultimately, all issues were fully resolved by 2.20pm PDT on the 20th of October.

Single Vendor Risk and Data Resilience

The AWS outage underscores the inherent danger of single vendor dependency. When critical infrastructure, data storage and access all sit within one provider’s ecosystem, even a rare failure can cascade across thousands of services.  

At Binarii Labs, our distributed architecture is designed to mitigate exactly this risk.

Because data storage and replication are decentralised across multiple independent environments, our customers retained uninterrupted access to their data throughout the outage.

While much of the internet stood still, Binarii Labs systems continued to operate as designed: resilient, independent and available when it mattered most.

Learn More: https://www.binariilabs.com/

Original Article: https://www.binariilabs.com/post/inside-the-aws-outage