The Typo Heard ’Round the World

Remember earlier this week when like a third of the internet was broken for several hours? It quickly came out that the disruption was Amazon’s fault: a whole mess of their servers shut down when they (obviously) shouldn’t’ve. Those servers were part of S3, the company’s cloud storage service, which holds data for who knows how many websites. Now, Amazon has revealed that: 1) they kinda-sorta did it on purpose; and 2) the outage was caused by a typo.

Dammit, Gary! What'd you do?!

Someone Needs Grammarly!

On Tuesday, 28 February, Amazon’s S3 commandos were working on debugging the billing system. (Sidebar: good to know there’s a bug in Amazon’s billing system. That $#!t only handles, what, billions of dollars a day? No problem!) In the course of debugging, the S3 team needed to shut down a few servers. Had it all gone smoothly, no one would have noticed.

Of course, it did not go smoothly. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” Amazon said in a CYA statement posted yesterday. “The servers that were inadvertently removed supported two other S3 subsystems,” one of which “manages the metadata and location information of all S3 objects in the region.” With those subsystems off the grid, websites that relied on them couldn’t perform basic tasks like data retrieval and storage.

Always Back Up the Backups of the Backups

Since they’re not total dummies, Amazon designed the S3 system with failsafes that enable it to keep running if it loses a few servers. However, it wasn’t built to handle a shutdown of this scale. As a result, a number of systems required a full restart. The average laptop computer takes about 45 minutes to restart[1], so you can imagine how long it takes to reboot a system the size of S3[2].

The restart knocked out a bunch of other Amazon services as well, including the Elastic Compute Cloud (EC2), the virtual-server rental service used by countless websites. One system that was still firing on all cylinders throughout the outage was the online giant’s AWS Service Health Dashboard, which erroneously showed that all services were up and running as usual. Green lights across the board.

Unsurprisingly, Amazon is now working to revamp the S3 system to avoid similar events in the future. Hopefully, they’ll also fix that glitch where something ordered with two-day shipping on Thursday isn’t delivered until Monday.

[1] Citation needed.
[2] It’s roughly thrice as big as four extra-large server farms.

Photo credit: IntelFreePress via Foter.com / CC BY
