Network status

Amazon outage: Summary, and lessons learned
26 Apr 2011

Here's a short update on our experiences with the Amazon outage in Virginia last week.

Like many others, we were badly hit by the outage, which struck two of the availability zones we use and so affected two-thirds of the systems involved in serving our Public Status Pages.

Luckily, the only problem really visible to end users was short-lived. It was caused by our AWS load balancer not detecting the affected systems automatically, so it kept routing a percentage of the API requests to back-end nodes that were not healthy. This wasn't Amazon's fault: the affected database nodes were in a sense functioning correctly, apart from getting time-outs on their EBS volumes. The (network-based) fail-over algorithm of the database cluster couldn't see the disk errors, which were hidden by the kernel. These in turn translated into time-outs on database queries, and on the front-end web caches. But as not all requests timed out, the load-balancer thresholds weren't immediately hit. After about 30 minutes we did a manual fail-over, after which all queries were served again by the remaining nodes.
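To illustrate why partial time-outs can slip under a load-balancer threshold, here is a minimal sketch. It assumes a health check that only marks a node unhealthy after a fixed number of consecutive failed probes; the function name and the threshold value are hypothetical and are not taken from our actual configuration.

```python
def marked_unhealthy(results, unhealthy_threshold=3):
    """Return True if a node would be taken out of rotation.

    `results` is a sequence of booleans, one per health probe
    (True = probe passed). The node is marked unhealthy only after
    `unhealthy_threshold` failures in a row; any success resets
    the counter. Both names are assumptions for this sketch.
    """
    consecutive_failures = 0
    for ok in results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return True
    return False


# A node failing two probes out of every three never reaches the threshold.
intermittent = [False, False, True] * 10
print(marked_unhealthy(intermittent))

# A node that fails outright is caught after three probes.
print(marked_unhealthy([True, False, False, False]))
```

Under these assumptions, a node that answers even occasionally keeps resetting the counter, which matches the behaviour we saw: most requests to it failed, yet it stayed in rotation.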

Other than that, this outage was luckily more of an interesting test case for our cloud architecture. I'm happy to say that our asynchronous designs have proven themselves: not a single update to the public status pages was lost, and the incoming result queues built up nicely during the outage. The monitoring queues were emptied quite soon after our systems were stable again, clearing the backlog of status page updates.
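The queueing behaviour described above can be sketched roughly as follows. This is an in-memory illustration only, not our production code: the `ResultBuffer` class and the `write` callable are hypothetical names, and a real deployment would use a durable queue rather than a Python `deque`.

```python
from collections import deque


class ResultBuffer:
    """Hold monitoring results until the status-page store accepts them."""

    def __init__(self, write):
        # `write` is any callable that stores one result and raises
        # an exception while the store is unavailable (assumption).
        self._write = write
        self._pending = deque()

    def submit(self, result):
        """Queue a result, then try to flush whatever is pending."""
        self._pending.append(result)
        self.drain()

    def drain(self):
        """Write pending results in order; return how many remain queued."""
        while self._pending:
            try:
                self._write(self._pending[0])
            except Exception:
                # Store is still down: keep the result and retry later.
                return len(self._pending)
            # Only discard a result once it has been written.
            self._pending.popleft()
        return 0
```

Because a result is removed from the queue only after a successful write, an outage of the store delays updates but does not drop them, and the backlog drains in the original order once the store is reachable again.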

Of course, there is always room for improvement, and we learned a lot from this for our future operations. The main lesson, on which I suppose many others would agree, is that multiple availability zones can indeed fail at once, and that we should therefore prepare for this by replicating to a different data centre at a different hosting provider. We'll start working on this in the coming months; luckily, due to the nature of our architecture, a multi-DC setup should not be very difficult to implement.

Another lesson is that we had created a situation in which the load balancer's fail-over detection couldn't work properly. I'm currently looking into how we can help the load balancer with this.
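One possible way to help the load balancer, shown here purely as a sketch and not as the approach we have settled on, is a "deep" health check: the endpoint the load balancer polls exercises the disk itself and reports failure if that takes too long. The function names, the probe directory, and the two-second time-out below are all assumptions made for illustration.

```python
import concurrent.futures
import os
import tempfile


def disk_probe(directory):
    """Write and fsync a small file so a stalled volume shows up as a slow call."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"ping")
        os.fsync(fd)
    finally:
        os.close(fd)
        os.unlink(path)


def health_status(probe, timeout_s=2.0):
    """Return 200 if `probe` finishes within `timeout_s` seconds, else 503.

    The probe runs in a worker thread so that a hung disk call cannot
    block the health endpoint itself. A hung worker cannot be killed;
    it is simply abandoned and the node reports itself unhealthy.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        future.result(timeout=timeout_s)
        return 200
    except Exception:
        # Covers both a time-out and an I/O error raised by the probe.
        return 503
    finally:
        pool.shutdown(wait=False)
```

A health endpoint could then answer with `health_status(lambda: disk_probe("/path/on/the/data/volume"))`, where the path is a placeholder for a directory on the volume that matters. The trade-off is that a slow but working disk may cause false alarms, so the time-out needs to be chosen with care.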

If you have any further questions about our architecture, let me know. I'm happy to address any concerns or explain any aspect in more detail.

Pieter Ennes (@skion)
VP Engineering

Public Pages briefly affected by Amazon outage
20 Apr 2011

We lost a few nodes in a major outage at Amazon AWS this morning, but managed to get back online quickly using replicated nodes in different availability zones.

None of our core monitoring or alerting systems were affected during this period, as they are served from dedicated systems at RackSpace.

Updates to the status pages from checks that were executed during the outage were queued up automatically, and are being fed into the status pages as fast as possible at this moment. We expect the result queues to be emptied and the status pages to be fully up to date in a few minutes.

We are currently looking at what more we can learn from this outage, and why the AWS load balancer wasn't able to pick up the failing services automatically.

Update: The same nodes are still affected by the ongoing outage in Virginia, but we have enough redundancy to mitigate any problems. Aside from a manual fail-over we had to do earlier to bypass the bad nodes serving the Public Status Pages, the only known issue at the moment is a small lag in status page updates, due to some precautionary measures we're taking to offload work to fresh nodes.


Routing issues between US and Europe
14 Mar 2011
Three of our stations (Portugal, Denmark and Ukraine) are experiencing difficulties connecting to our main stations in the USA. We've opened tickets with the hosting providers to check for intermediate routing issues with their upstream providers.

Connectivity problems in Moscow, Russia
5 Mar 2011
Our hosting provider in Moscow has informed us they are experiencing difficulties in their data center. We've put the station in maintenance mode until they have been able to resolve the problems.

Issues with South African station
13 Jan 2011
We are aware of the issues with our station in South Africa and have contacted the local hosting party to provide us with a solution ASAP.
