Feb 15 2008

Ouch – Amazon S3/EC2/AWS outage

Jesse over at O’Reilly Radar reports that there was a significant outage of the Amazon S3/EC2/AWS services this morning, starting at 6:30am EST. Just last week I listened to Werner Vogels (CTO and VP of amazon.com) talk about the rationale behind those services and their reliability requirements. This has got to be a major blow to the reputation of the service. I think lots of people operated under the assumption that if those pieces also drive the Amazon store itself, they have to be rock solid. However, let’s face it, shit happens. It happens everywhere and it will happen again. There’s no service out there that can guarantee 100% uptime. Even a simple “Hello World!” will fail under the right circumstances.

I also don’t buy Bob’s assessment (over at SmoothSpan Blog) that unpredictable growth and the “low friction” effect of the Internet are to blame for this incident. I’m certain that Amazon knows how much traffic their web services can theoretically handle. I’m also sure that traffic trends are being monitored very, very carefully. Every responsible web service out there will have emergency brakes (throttling, signup queues, etc.) in order to avoid service degradation for all users. Amazon is no exception here.
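To make "throttling" a bit more concrete, here is a minimal sketch of one common way to build such an emergency brake: a token bucket. This is purely my own illustration and says nothing about how Amazon actually does it. The class name `TokenBucket` and the parameters `rate`, `capacity` and `clock` are made up for this example.

```python
import time


class TokenBucket:
    """Hypothetical throttle: admit about `rate` requests per second,
    allowing short bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.clock = clock        # injectable time source (handy for testing)
        self.tokens = capacity    # start with a full bucket
        self.last = clock()       # time of the last refill

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A service would call `allow()` once per incoming request and answer with a "please retry later" error whenever it returns False, so that a sudden spike slows down the newcomers rather than degrading things for everybody.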

There’s a different reason for this outage. Whether we ever find out what the real problem was remains to be seen. I do hope that Amazon isn’t going to sit on the post-mortem details, and instead provides full disclosure of the root cause. I would be much more inclined to go with a service provider who offers transparency than with one who tries to sweep things under the carpet.

PS: Would also love to hear how SmugMug’s architecture allowed them to avoid customer-facing problems during the outage (as indicated in Don’s comment here).

Update: Don indeed responds and promises to provide more details once he has found out why they were not affected …

