When a large portion of the internet went offline on Tuesday, February 28, 2017, no one could have guessed that the cause would be a simple typo. A mistyped command at Amazon Web Services removed a large set of servers supporting two critical S3 subsystems in the US-EAST-1 region. The first was the index subsystem, which manages the metadata and location information for every S3 object in the region and is needed to serve GET and LIST requests. The other was the placement subsystem, which allocates storage for new objects and is used during PUT requests. Neither subsystem had been fully restarted in years, and both had expanded considerably since the last restart. To get S3 back, both systems required a full restart.
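S3's actual internals are not public, but the division of labor described above can be made concrete with a toy model. Everything below is invented for illustration and is not AWS code; it only shows why losing servers behind these two subsystems breaks GETs and PUTs respectively, and why both had to come back before S3 could serve requests.

```python
# Toy illustration of the two subsystems described above.
# All names here are invented; this is not AWS code.

class IndexSubsystem:
    """Maps every object's key to the location of its stored data.
    Without it, GET and LIST requests cannot be answered."""
    def __init__(self):
        self._locations = {}  # object key -> storage location

    def record(self, key, location):
        self._locations[key] = location

    def lookup(self, key):
        return self._locations[key]  # needed to serve a GET

class PlacementSubsystem:
    """Decides where the data for a brand-new object should live.
    Without it, PUT requests for new objects cannot be placed."""
    def __init__(self, servers):
        self._servers = servers

    def allocate(self, key):
        # Trivial placement policy: hash the key onto a server.
        return self._servers[hash(key) % len(self._servers)]

# A PUT needs placement (to pick a location) and the index (to record
# it); a GET needs only the index. Losing servers behind either
# subsystem therefore breaks a different half of the API.
```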
The outage had a knock-on impact on a number of AWS services hosted in the region that rely on S3 for backend support, including Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), and AWS Lambda. AWS employees could not even update the service health dashboard during the outage, because the console itself relied on S3 in the affected region. It took Amazon nearly four hours to resolve the issue, during which more than half of the top 100 online retailers saw their websites slow by 20% or more, and the outage cost companies in the S&P 500 index $150 million, according to Cyence Inc., a startup that specializes in estimating cyber risk. “The difference is that the ones who have fully embraced Amazon’s design philosophy to have their website data distributed across multiple regions were prepared,” said Shawn Moore, CTO at Solodev.
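Spreading data across regions is something S3 supports natively through cross-region replication. As a minimal sketch, assuming a primary and a replica bucket in different regions and an existing IAM replication role (the bucket names and role ARN below are placeholders), the setup with boto3 looks roughly like this:

```python
import boto3

s3 = boto3.client("s3")

# Cross-region replication requires versioning on both buckets.
for bucket in ("my-primary-bucket", "my-replica-bucket"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every object written to the primary bucket into the
# replica bucket, which lives in a different region. The role ARN
# is a placeholder; it must grant S3 permission to replicate.
s3.put_bucket_replication(
    Bucket="my-primary-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [
            {
                "ID": "replicate-all",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-replica-bucket"},
            }
        ],
    },
)
```

With a rule like this in place, a regional S3 outage leaves a readable copy of the data in the replica bucket's region, which is the preparedness Moore is describing.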
To be fair to AWS, it has already made several changes to its operational practices as a result of the incident.
The company said that while removing capacity is a routine operational task, the tool the employee used allowed too much capacity to be removed too quickly.
The company goes on to explain that S3's subsystems are designed to withstand the removal or failure of significant capacity with no customer impact, but that because of the enormous growth S3 has experienced, the process of restarting the subsystems and running the necessary safety checks took longer than expected.
Looking to avoid a similar snafu, AWS said Thursday it is implementing several changes: additional safety checks on its capacity-removal tooling, faster recovery times for critical subsystems, and a plan to break its services down into smaller cells so that a comparable outage would affect far fewer customers.
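AWS has not published what its tooling looks like internally, so the following is only a sketch of the kind of safeguard described: a removal tool that refuses to take a fleet below a minimum capacity floor and drains servers a few at a time rather than all at once. Every name and threshold here is hypothetical.

```python
# Hypothetical sketch of a capacity-removal safeguard; the names,
# thresholds, and fleet model are invented for illustration.

MIN_CAPACITY_FRACTION = 0.9   # never drop a fleet below 90% of target
MAX_REMOVALS_PER_STEP = 2     # drain slowly, a few servers at a time

def remove_capacity(fleet, requested):
    """Remove up to `requested` servers from `fleet`, refusing any
    step that would take the fleet below its capacity floor."""
    floor = int(fleet["target_size"] * MIN_CAPACITY_FRACTION)
    removable = fleet["active_size"] - floor
    if removable <= 0:
        raise RuntimeError(f"{fleet['name']}: already at capacity floor")
    # Cap the request at both the floor and the per-step rate limit,
    # so a typo like 'remove 100' cannot drain the fleet in one shot.
    allowed = min(requested, removable, MAX_REMOVALS_PER_STEP)
    fleet["active_size"] -= allowed
    return allowed

fleet = {"name": "index-subsystem", "target_size": 100, "active_size": 100}
print(remove_capacity(fleet, 100))  # -> 2, not 100
```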
AWS assures everyone that it’s prepared for the occasional failure.