Fremont Power Outage

Today a power outage affected Fremont (thanks to the team on site, service was back after about 2 hours of downtime).

That hadn't happened for quite some time, fortunately.

However, Fremont seems to be plagued with recurrent power issues.

I'm wondering whether migrating to another site would be likely to increase uptime.
Could we have more details about today's issue, so we can assess whether it was merely an unlucky event?

Thank you.

16 Replies

Linode will likely post a postmortem to the status page once they've put one together. Sh*t happens, and it's not any more or less likely to happen in any other datacenter. You can only plan so much, and things can and will go wrong (Murphy's Law).

In today's cloud world, you are no longer worried so much about machine failure. The cloud providers mitigate this.

Now you have to worry about datacenter failure, and while it's rare, it happens several times a year to everyone.

Azure also had a datacenter outage today.

AWS had a couple last year I believe.

If you had planned to be multi-datacenter or multi-cloud, you would have mitigated a physical outage like this.

Thankfully we had planned for something like this and were able to fail over within an hour (we would have liked to be faster).

Not all of our services were planned out like this, though.

It all depends on what outages cost your business.

If an AWS Region or AZ goes down, you may be able to mitigate it, but if you rely on services that are hosted on AWS you may still be affected, as those are out of your control.

You now plan for failure and consider failure a norm instead of a once in a lifetime event.
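For anyone wondering what "planning for failure" can look like in practice, here is a minimal sketch of priority-based failover between sites. It is not how we (or Linode) actually do it: the `Site` class, the `pick_active_site` function, and the site names are all hypothetical, and the health check is passed in as a plain function so you can plug in whatever probe fits your setup.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Site:
    """A deployment location (hypothetical model, for illustration only)."""
    name: str      # e.g. "fremont"
    priority: int  # lower number = preferred site


def pick_active_site(
    sites: Sequence[Site],
    is_healthy: Callable[[Site], bool],
) -> Optional[Site]:
    """Return the most-preferred site that passes the health check.

    Returns None when every site is down, so the caller has to handle
    the total-outage case explicitly.
    """
    for site in sorted(sites, key=lambda s: s.priority):
        if is_healthy(site):
            return site
    return None


if __name__ == "__main__":
    sites = [Site("fremont", 0), Site("dallas", 1)]
    # Simulate the primary site losing power.
    active = pick_active_site(sites, lambda s: s.name != "fremont")
    print(active.name if active else "no healthy site")
```

The selection logic is the easy part. What usually takes the hour is everything around it: keeping data replicated to the second site, and repointing DNS or a load balancer once the choice is made.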

I understand that an outage can happen to any datacenter, but why were some of my Linodes shut down and rebooted? I had to spend time restoring the replication topology for my DB servers because of the data corruption caused by the abrupt shutdown :(

I thought all Linode datacenters met the Tier 4 standard, so there should be reliable backup power such as a separate power line, on-site UPS, and diesel generators?

This is definitely the worst thing I have seen as a 6-year Linode user. Will consider moving to another provider soon.

I'm fine with those guys being unable to keep their servers running or do a graceful shutdown during a power outage, considering that the UPS system (if there is one) might have gone down as well. I'm okay with repairing my corrupted file system, although some files were already lost. What I can't make sense of is their backup system. It just stopped working after the outage. I got messages saying they failed to restore the latest backup. And hours after the outage was over, their backup system even failed to perform a full backup (saying something went wrong with their backend lol…). So the underlying message seems to be: never trust us and you'd better go away! :-)

@dwfreed said:

Linode will likely post a postmortem to the status page once they've put one together. Sh*t happens, and it's not any more or less likely to happen in any other datacenter. You can only plan so much, and things can and will go wrong (Murphy's Law).

I'm pretty sure Murphy never ran a datacenter, and if he did, he would have necessarily gone bankrupt instantly by trying to achieve infinite redundancy, which is… not cost effective.

I appreciate your underlying point that despite the best planning, sometimes you still have to deal with failure. However, I have to point out that not all datacenters are equal in terms of reliability. They are complex and dynamic systems, and accurately measuring things is a critical part of successful operation. This means knowing not only whether failures can happen (hint: they always can), but also how likely they are to happen, and how often they have actually happened in the past. One of the popular standards for datacenter reliability comes from the Uptime Institute, which defines four different tiers and outlines the criteria for each - here's a summary of tier criteria on Wikipedia.

Sometimes an outage in a datacenter takes down a customer's service, while another customer's service keeps operating because they invested more in redundancy, etc. The fact that all systems are vulnerable to some kinds of failures is unrelated to the fact that overall service availability can be influenced by the operators of that service. Planning, measuring, cost/benefit analysis… all important things that are way more complicated than a coin flip.
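To put rough numbers on the cost/benefit side, here is a small sketch that converts an availability figure into expected downtime per year, and estimates the availability of a service that stays up as long as at least one of several sites is up. The function names and the 99.9% figure are made up for illustration, not measurements of any real datacenter, and the redundancy formula assumes site failures are independent, which real outages often are not.

```python
from typing import Iterable

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability in [0, 1]."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability when the service is up if at least one site is up.

    Assumes failures are independent across sites (an optimistic
    assumption), so the chance of a total outage is the product of the
    individual outage probabilities.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


if __name__ == "__main__":
    single = 0.999  # illustrative figure only
    redundant = combined_availability([single, single])
    print(f"one site:  {downtime_hours_per_year(single):.2f} h/year")
    print(f"two sites: {downtime_hours_per_year(redundant):.4f} h/year")
```

Under those assumptions a second site cuts expected downtime by orders of magnitude, which is why two customers in the same facility can have very different outcomes from the same outage.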

You can find a current timeline of this incident and our response on the incident's status page.

We do have backup power generators in our Fremont data center. However, even with these in place there was still a partial power outage. Our teams are working with the data center to determine why this occurred. This information isn't available to us at the moment, and for that we apologize. As soon as we do have more details to share, we will indeed update the status page with a postmortem.

While migrating to a different data center is certainly an option (just open up a support ticket for any Linodes you'd like us to migrate), such failures are a risk at any data center, as discussed above.

Two of my three Linodes in Fremont went down ungracefully, and with the status page being "closed", can we still expect a postmortem? This doesn't sound like a power outage but a UPS failure. I know events like this happen, but I find this alarming. Maybe our next upgrade could be redundant power?

There was a utility power outage; obviously some circuits' backup systems worked, others' didn't. Bbigger is Linode staff; if he says they plan on posting a postmortem, they plan on posting a postmortem. The incident is closed on the status page so that it doesn't show up in the current incidents list, because it isn't a current incident anymore, but a postmortem can still be added to it (see this previous incident for example).

Interestingly, their backup system still does not work (at least for my Linode). I constantly get error messages when trying to perform a backup, and it's been several days. I just can't believe how they've handled this issue.

Anyway, I've cancelled the service and am ready to leave.

The Linode Backup Service is completely unrelated to this thread. If you're having issues with the Linode Backup Service, you should open a ticket so that support can help you sort it out.

@neoweb

Our Backups storage systems are housed in the same facility as our hosts, meaning they were impacted by the outage. It's possible you were having issues related to the loss of power. If that is the case, open a ticket (if you haven't already) and we'll help you out.

Any update on the postmortem?

I don't see a postmortem posted? (Unless you're referring to the previous incident I linked.) Anyway, if it was a UPS failure, a DC UPS isn't going to be connected to all the servers on that circuit to tell them to gracefully power off before battery is exhausted. This isn't a desktop UPS we're talking about, but a UPS designed to service several hundred amps of equipment long enough for the generator to start and come online. There's no way it could be connected to 100+ servers to say "crap, the batteries are almost dead, you better shut off now before the power dies."

Yes it can. But let's wait for the postmortem. If it's the same as the other one you linked, then Linode has a systemic problem here.

Still waiting for that post-mortem…

A postmortem has been posted to the Incident Report.
