Data Centre Outages

In view of the recent issues at Atlanta…

A little while ago, we were inconvenienced by a series of DDoS attacks on Hurricane Electric (Fremont). I'm sure that The Planet (Dallas) has also had its problems.

The simple fact is that you will never get 100% uptime, anywhere. Buildings burn down (or, in the case of HE, fall into the San Andreas Fault ;-)) and other disasters can occur.

No matter how many redundant anythings you have in a data centre, you will get outages. If we, as users, have mission-critical sites/applications running, it falls to us to provide contingencies on top of those provided by Chris and Team, data centre staff, etc.

My approach is to have a redundant Linode but NOT in the same data centre. I don't have an automatic failover system, but I dump my HE databases every night and transfer them over to The Planet. Likewise, any uploaded files get rsync'd over. These dumps/files are also copied down to the server in my office as part of the process. Hey, if we lost the USA, I could run the whole lot off my laptop, although I wouldn't like to say what sort of shape the InterWeb would be in ;-)
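
For what it's worth, the nightly job isn't anything clever. Here's a rough sketch of it in Python (a cron'd shell script would do just as well); the database names, hosts and paths below are made up for illustration, and it assumes MySQL plus passwordless SSH keys to the box at The Planet:

```python
#!/usr/bin/env python3
"""Nightly off-site backup sketch. All names and paths are illustrative only;
it assumes MySQL databases and SSH key access to the secondary box."""
import subprocess
from datetime import date

DATABASES = ["shopdb", "cmsdb"]                 # hypothetical database names
BACKUP_DIR = "/var/backups/nightly"             # staging directory on the HE Linode
UPLOAD_DIR = "/var/www/uploads/"                # user-uploaded files to mirror
SECONDARY = "backup@secondary.example.com"      # the Linode at The Planet

stamp = date.today().isoformat()

# 1. Dump each database to a dated file.
#    (MySQL credentials are assumed to come from ~/.my.cnf, not the command line.)
for db in DATABASES:
    with open(f"{BACKUP_DIR}/{db}-{stamp}.sql", "wb") as out:
        subprocess.run(["mysqldump", "--single-transaction", db],
                       stdout=out, check=True)

# 2. Push the dumps and the uploaded files to the secondary data centre.
subprocess.run(["rsync", "-az", f"{BACKUP_DIR}/",
                f"{SECONDARY}:/var/backups/nightly/"], check=True)
subprocess.run(["rsync", "-az", UPLOAD_DIR,
                f"{SECONDARY}:/var/www/uploads/"], check=True)

# 3. The office server pulls the same directories down with its own rsync jobs.
```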

If I get an outage that looks like it's going to persist, I load up the databases from the dumps at The Planet, change DNS (I have a short TTL set) and, about half an hour later, I'm running on the secondary.
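
The recovery side is equally unglamorous. Roughly (using the same made-up names as the backup sketch), bringing the secondary up means loading the latest dumps back in and then repointing DNS; the 1800-second TTL in the comment below is just an example figure consistent with the half-hour switchover:

```python
#!/usr/bin/env python3
"""Manual failover sketch, run on the secondary box at The Planet.
Names and paths are the same illustrative ones as in the backup sketch."""
import glob
import subprocess

DATABASES = ["shopdb", "cmsdb"]
BACKUP_DIR = "/var/backups/nightly"

# Load the newest dated dump for each database back into MySQL.
for db in DATABASES:
    latest = sorted(glob.glob(f"{BACKUP_DIR}/{db}-*.sql"))[-1]  # ISO dates sort correctly
    with open(latest, "rb") as dump:
        subprocess.run(["mysql", db], stdin=dump, check=True)

# After that it's just a DNS change: with a TTL of around 1800 seconds on the
# A records, resolvers pick up the secondary's address within roughly half an hour.
```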

There are more elegant ways in which this can be done - multiple replicated databases, round-robin DNS, etc. - but these are not things that I or my clients (all small businesses) can afford.

What we should not do is to turn round and blame Linode when we ourselves have failed to identify and make contingencies for a single point of failure in a critical system.

13 Replies

I nominate this for sticky status

I agree completely; you can never be "too" careful with regard to things like this.

This was pretty much entirely out of linode's hands. They handled everything they could as well as they could. The server failed and it was replaced.

However, let me put this clearly.

With the exception of a major catastrophe or barring other extremely rare circumstances, what happened with the power at the Atlanta Datacenter is completely unacceptable.

Rackspace had an outage a while back, but that was a combination of multiple points of failure, as well as failures on the part of the electric company.

A few batteries failing in the UPS system is unacceptable, especially given their excuse that they check the thing every 6 months.

I work in a datacenter, and at today's biweekly meeting where we go over happenings in our datacenter (any issues, outages, technician coverage problems), we laughed at the reasons Atlanta gave for their problems.

Needless to say, most businesses try to cut corners like they did, but it has now bitten them in the ass (as it eventually will for any business that tries it).

One of the reasons we may use linode.com is their ability to assess a datacenter. I'd like to know their thoughts on their supplier in Atlanta. Are they going to dump that supplier? Does Linode think the Atlanta datacenter guys were just unlucky? Incompetent?

On a more positive note, I love that linode.com is so open about those problems and uses the forum to communicate.

@A-KO:

> A few batteries failing in the UPS system is unacceptable, especially given their excuse that they check the thing every 6 months.

I just re-read the copy of the message posted in the thread in the announcements section, and noticed the way they worded this at the end:

> we are increasing the battery pm schedule to monthly from biannual.

I always thought bi-annual meant every other year. Like bi-weekly means every other week. But it makes sense to a degree if they meant twice a year ;)

Biennial = every other year

Bi-annual = twice per year

…and I hope that the DC has got this the right way round!

Will the disk image cloning feature of Linode work if I move one of my Linodes to a different datacenter? And will cloning be "free" in terms of the monthly traffic limit? Having your eggs in different baskets seems like a good idea.

"Like bi-weekly means every other week."

This is the problem with the "bi-" terminology. Bi-weekly could, logically, mean "twice a week". Indeed, in England, it's more likely to mean that because we have the perfectly usable word "fortnightly" to mean "every 2 weeks".

Ain't the English language fun!

@harmone:

> Will the disk image cloning feature of Linode work if I move one of my Linodes to a different datacenter? And will cloning be "free" in terms of the monthly traffic limit? Having your eggs in different baskets seems like a good idea.

Yes, you are able to clone images between data centers. Using the cloning utility does not count against your monthly bandwidth quota.

-Tom

@jcr:

> One of the reasons we may use linode.com is their ability to assess a datacenter. I'd like to know their thoughts on their supplier in Atlanta. Are they going to dump that supplier? Does Linode think the Atlanta datacenter guys were just unlucky? Incompetent?
>
> On a more positive note, I love that linode.com is so open about those problems and uses the forum to communicate.

The dust is still settling here and we have yet to form a final opinion. We did visit the data center in person before deploying, and we were impressed with it. They were also extremely reliable over the past year, the last month aside. Two things did happen yesterday as a result of this, however:

1 - Ten brand new hosts were to be picked up yesterday morning by FedEx, on their way to Atlanta. FedEx was turned around and these hosts will now go to Dallas.

2 - We signed a new contract which effectively doubles our cage size in Dallas.

-Tom

Hi,

Anybody thinking they'd want to move from Atlanta should think twice, especially after the incident. Here's why I think so:

Atlanta has been very reliable over the past while, but an unfortunate incident occurred which seems to be very rare. Their backup systems are very impressive, and you can be sure that because of this incident they are now especially vigilant, perhaps even more so than another DC where something like this is just waiting to happen.

–deckert

On Friday we had a very lengthy conversation with Jeff Hinkle, president of the facility we use in Atlanta. Our feeling is that the reliability we came to expect from them has been restored.

A few key points from our conference call:

The intermittent network troubles we've been experiencing since early December have been caused by their Global Crossing backbone. Atlanta has been working with Global Crossing over the past month to troubleshoot the problem with no luck. It may not sound like much to you or me, but shutting off a backbone is a big deal to a datacenter. Not only in terms of the additional bandwidth now going over the other providers, but from a financial aspect as well. On Thursday afternoon Jeff made the decision to shut Global Crossing down. Since then we've noticed increased throughput and no latency issues. Hopefully this sticks. They are also adding Level3 to their network on February 1st.

Regarding the power outage, they have already implemented steps to help prevent this from happening again. They have purchased their own testing equipment and started conducting tests on a more frequent basis (daily and monthly, in addition to their regularly scheduled vendor maintenance). They have also installed a third string of batteries. When Chris and I toured the data center prior to deploying there, we were impressed with their redundant power configuration. Their cooling system is also rigged to their generators, to prevent overheating in such an event (I think I read somewhere that a data center has ~ 15 minutes before an unacceptable operating temperature is reached and they need to start shutting equipment down – not so in Atlanta).

Outages (network, power, hardware failure, etc) are inevitable at any data center - no matter how many 9's they stick in their SLA. I think Atlanta just had a series of compounding issues, which we believe are now resolved.

-Tom

Before everyone raises a ticket to leave the Atlanta datacenter, it's worth remembering that there was a massive, 8-hour power outage at the Dallas datacenter on 31 March 2005, again caused by multiple failures and/or configuration errors in a (supposedly) redundant power system. ThePlanet presumably learned their lesson, because they have been rock steady ever since.
