Amazon S3/EC2/AWS outage take II
A few days ago I wrote about the Amazon outage of the popular S3/EC2/AWS services. Yesterday I received some more detailed information via Shahram that tries to explain what happened. The messages below were posted on an Amazon bulletin board where Amazon kept track of the issues.
And for those of you who don’t want to read through all of it: they had the nicest problem one can have, a success disaster. Too many people were using the service, pushing it beyond its capacity. In this particular case it seems that a cryptographic sub-system could not handle all the requests that were thrown at it.
First message from an Amazon employee:
Quick note to keep everyone up to date. The team continues to be heads down focused on getting to root cause on this morning’s problem. One of our three geographic locations for S3 was unreachable beginning at 4:31 a.m. PST and was back to near normal performance at 6:48 a.m. PST (a small number of customers experienced intermittent issues for a short period thereafter). Though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable and we won’t be satisfied until it’s perfect. We will be providing additional information on this thread as soon as we have it.
Second message from an Amazon employee:
Here’s some additional detail about the problem we experienced earlier today.
Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.
Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
As we said earlier today, though we’re proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
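To make action (a) a bit more concrete, here is a minimal sketch of what monitoring the proportion of authenticated requests could look like. This is my own illustration, not Amazon's implementation: the class name `AuthRatioMonitor`, the window size and the alert threshold are all hypothetical.

```python
from collections import deque


class AuthRatioMonitor:
    """Tracks the share of authenticated requests among the last `window` requests.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, window=1000, threshold=0.5):
        self.samples = deque(maxlen=window)  # 1 = authenticated, 0 = anonymous
        self.threshold = threshold
        self.auth_count = 0

    def record(self, authenticated):
        # When the window is full, the oldest sample is about to be evicted,
        # so remove it from the running count first.
        if len(self.samples) == self.samples.maxlen:
            self.auth_count -= self.samples[0]
        value = 1 if authenticated else 0
        self.samples.append(value)
        self.auth_count += value

    def ratio(self):
        # Fraction of authenticated requests in the current window.
        return self.auth_count / len(self.samples) if self.samples else 0.0

    def should_alert(self):
        return self.ratio() > self.threshold
```

The point is that the alert depends on the mix of requests, not on the total volume, which is exactly the signal the message says was missing.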
And a non-Amazon party (a company that uses the service) reported this:
What caused the problem however was a sudden unexpected surge in a particular type of usage (PUT’s and GET’s of private files which require cryptographic credentials, rather than GET’s of public files that require no credentials). As I understand what Kathrin said, the surge was caused by several large customers suddenly and unexpectedly increasing their usage. Perhaps they all decided to go live with a new service at around the same time, although this is not clear. What is clear however is that S3 was the momentary victim of its own success, but the problem was quickly rectified.
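To see how the overall request volume can stay within normal ranges while the authentication service still tips over, here is a small back-of-the-envelope sketch. It assumes, as the second Amazon message says, that every request needs account validation and that authenticated requests cost extra on top of that. The cost units and the capacity figure are made-up numbers for illustration, not anything Amazon published.

```python
def auth_service_load(total_requests, authenticated_requests,
                      validation_cost=1, crypto_cost=5):
    """Load on the authentication service, in arbitrary (hypothetical) cost units.

    Every request pays validation_cost; authenticated requests additionally
    pay crypto_cost for the cryptographic check.
    """
    return total_requests * validation_cost + authenticated_requests * crypto_cost


CAPACITY = 2500  # hypothetical maximum load the service can handle

# Same total volume in both cases; only the mix of requests changes.
normal_mix = auth_service_load(total_requests=1000, authenticated_requests=100)
surge_mix = auth_service_load(total_requests=1000, authenticated_requests=400)

print(normal_mix, normal_mix > CAPACITY)  # 1500 False
print(surge_mix, surge_mix > CAPACITY)    # 3000 True
```

With these assumed numbers, a monitor that only watches total requests sees 1000 in both cases, yet the second mix is over capacity.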