Fremont is down.

Any updates from your end, Linode?

147 Replies

I sent in a ticket and they asked me to check this page for updates: http://status.linode.com/2011/05/outage-in-fremont-facility.html

Thanks - what time is it currently there (I am in Australia)

It's 4:05 AM on the West Coast, 7:05 AM on the East.

Still no updates on what has happened. I haven't felt this uneasy for a long time.

Not sure what's up exactly, but a reply on my ticket said there might have been something funky with an update.

At least I found that we have IPv6 support now!

And I am halfway through restoring a backup to a new London Linode, just to avert a longer downtime.

I hope they bring Fremont up quickly.

At about 3AM PST it looks like their announcements were removed from the global BGP table:

May 6 03:03:27 [email protected] bgpd[27687]: Rib Loc-RIB: neighbor 206.51.38…. (eqixhea) AS25658: withdraw 173.230.144.0/20

They are still missing.
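For anyone who wants to watch for this in their own router logs, here's a minimal sketch that pulls withdrawn prefixes out of a bgpd log. The regex is a guess based on the single OpenBGPD-style sample line above, so treat the format as an assumption:

```python
import re

# Matches withdraw lines shaped like the sample in this thread; the exact
# log format is an assumption based on that one line.
WITHDRAW_RE = re.compile(
    r"bgpd\[\d+\]: .*neighbor (?P<neighbor>\S+) \((?P<desc>[^)]+)\) "
    r"AS(?P<asn>\d+): withdraw (?P<prefix>\S+)"
)

def withdrawn_prefixes(log_lines):
    """Return the set of prefixes withdrawn according to the log lines."""
    prefixes = set()
    for line in log_lines:
        m = WITHDRAW_RE.search(line)
        if m:
            prefixes.add(m.group("prefix"))
    return prefixes
```

Feeding it the syslog stream would let you alert the moment a provider's announcements disappear from your view of the table.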

Right, just found out the backups are stored in the same datacenter.

Well, that isn't very useful. :evil:

Word of note:

If your particular service or application cannot be down, pick up another linode and have some sort of load-balancing/HA system configured.
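As a sketch of the simplest form of that advice: a client-side picker that falls through to a second Linode in another datacenter when the first fails a cheap TCP health check. The hostnames are placeholders, and a real HA setup would use heartbeats, DNS failover, or a load balancer rather than ad-hoc checks like this:

```python
import socket

# Hypothetical backends in two different datacenters; names are placeholders.
BACKENDS = ["fremont.example.com", "dallas.example.com"]

def is_up(host, port=80, timeout=2.0):
    """Cheap TCP health check: can we open a connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend(backends, healthy=is_up):
    """Return the first backend that passes the health check, else None."""
    for host in backends:
        if healthy(host):
            return host
    return None
```

The `healthy` parameter is injectable mainly so the failover logic can be exercised without a network.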

Yeah, I asked that question when they first announced the facility, and the linode backup may be on the next machine to the source.

Having been through that at WestHost, I wasn't going to get caught the same way twice…

@A-KO:

Word of note:

If your particular service or application cannot be down, pick up another linode and have some sort of load-balancing/HA system configured.

Yes. It was and is foolish of me to think Linode is bulletproof. :oops:

Hurricane Electric on Twitter: A utility failure is affecting much of Fremont, CA, USA, damaging some electrical equipment. We are currently working to restore power.

From the page at PG&E's website

http://www.pge.com/myhome/customerservice/energystatus/outagemap/

The Fremont outage is scheduled for resolution at about 6:15 local time, which is 13:15 UTC.

> 7:53am (EDT): Power appears to have been restored and we are working on bringing Linodes up now.

http://status.linode.com/2011/05/outage-in-fremont-facility.html

Downtime lasted 2.2 hrs. I am glad it's back up.

Do these facilities not have a UPS/genset? It seems relatively common for datacenters (not Linode's in particular) to suffer from power failures, regardless of whether they have backup power or not.

@Guspaz:

Do these facilities not have UPS/genset?
I've wondered the same thing, as I recall that there have been several power incidents at the Fremont DC in the last year or two.

Hi!

Can someone on Linode weigh-in on this question:

1) Can you confirm that Linode does NOT use UPS services?

2) Can you confirm that Linode does NOT use generator backup?

And if both of those are true, can we assume that Linode will go down with even the slightest power interruption?

Thanks,

Like pretty much every colocation centre, HE has UPS and generator backup facilities.

Like pretty much every colocation centre, they either failed to deliver the goods when needed, or some previous incident damaged part or all of the UPS system, so the facility was running on street power whilst the UPS systems were being repaired; and when street power goes, so do the servers.

I'm convinced the highly paid, highly qualified, highly regarded engineers who design these systems are morons, because they have failed to notice that the systems they design and specify fail time and time again.

@dbuckley:

I'm convinced the highly paid, highly qualified, highly regarded engineers who design these systems are morons, because they have failed to notice that the systems they design and specify fail time and time again.
The last time there was a major outage at HE Fremont 1 (20–21 November 2010), it was caused by a lightning strike that took out a bunch of UPS units. There was another outage on 23 November, caused by a one-second break in utility power that could not be protected against since the UPS systems were under repair.

Designers have to balance the cost of power protection systems against the severity of the events that they can withstand. No system that we would want to pay a part of the costs for will survive a lightning strike on a nearby switching station or utility pole. Also, testing production power protection systems is notoriously difficult; nobody wants to pull a breaker just to see if everything works, but no amount of simulation and half-assed exercises can really prove the system. Let’s wait for the RFO before calling them morons.

That being said, this is the second time in six months that Fremont 1 has had a power outage. I await the RFO with interest. If the cause of the outage was anything less than some kind of natural catastrophe, HE has some explaining to do.

Has anyone else noticed that the times/dates of updates on status.linode.com are backdated? I.e., the 9:18 AM EDT update showed up sometime between 3 PM and 5 PM PDT. Is there some totally innocuous reason behind this, or is it just to keep up appearances?

-John

@pclissold:

Let’s wait for the RFO before calling them morons.

Sorry Peter, I failed to make myself clear. I'm not suggesting that the specific folks responsible for the design at HE are morons because of this one incident; I'm saying that decades of experience with datacentres that (if you believe their owners) are as unsinkable as the Titanic shows that they simply aren't that good, and that their designers, as a group, are morons.

The mistake all the big datacentres I've seen make is assuming there is an economy of scale in power systems, when really there isn't; to get the so-called economy-of-scale cost efficiencies, sacrifices to system integrity are made, and availability suffers as a result.

At the data centre at my place of work (OK, it's not a big facility: about 400 kVA, basically a tier 2 facility with tier 3 power, but only a single genset), the first Wednesday of every month the supply to the datacentre is pulled (from upstream distribution, not even in the datacentre building) for an hour or two, just to see what happens.

This little datacentre has suffered from the "economy of scale" problem I mentioned above; it was originally commissioned with 200 kVA UPSs on 400 kVA infrastructure, but the UPSs were upgraded to 400 kVA by paralleling another 200 kVA set. Paralleled UPSs are less reliable than single UPSs, so additional risk has been accepted for a cheaper upgrade. Only time will tell if this has a deleterious effect on availability.

When the power protection was needed in anger, twice (the two big earthquakes that damaged Christchurch in New Zealand, with widespread and prolonged utility outages), the datacentre (and all the IT services) didn't miss a beat.

I'm reasonably convinced the datacentre will survive a lightning strike to the distribution; it is an anticipated possible event (even though it has never happened historically) and the protection is in place in case it should.

But even this little datacentre, designed by guys (and a girl!) with many years in the high-availability power field and responsible for many facilities in London, still has unfixed flaws. In the early days there was a flaw (now fixed) which caused the cooling systems to shut down, and an internal power outage to some systems. Despite the fact that these engineers are really nice people, seem really competent, and have bags of experience and history under their collective belts, they still made design errors I was seeing twenty years ago.

And that is largely the reason I call this group collectively morons. They aren't learning from history; they are to this day building systems with the same shortcomings that we discovered 20 years ago will lead to outages.

/rant

@john.bloom:

Has anyone else noticed that the times/dates of updates on status.linode.com are backdated? I.e., the 9:18 AM EDT update showed up sometime between 3 PM and 5 PM PDT. Is there some totally innocuous reason behind this, or is it just to keep up appearances?

-John

I don't understand what you're talking about. http://status.linode.com/ is hosted at http://www.sixapart.com/ on the West Coast, somewhere around Oakland. The blog post for 9:18 AM EDT showed up at 6:18 AM PDT. So the different posts showed up between 3:12 AM PDT and 6:18 AM PDT, or, adding 3 hours (the offset between EDT and PDT), between 6:12 AM EDT and 9:18 AM EDT.

I didn't mean to get timezones confused in this. What I wanted to know was whether other people were seeing updates to status.linode.com appear long after the time they were posted. At this point I've seen it on more than one internet connection, on Firefox, Chrome and Safari.

-John

Ah, looking at other posts, I see. Their timezone is correct (EDT/EST). They made the Fremont post at 6:18 AM EDT, with the first update timestamped 6:12 AM EDT. It was just a coincidence that they made the original post exactly 3 hours before the issue was resolved.

Looks like Typepad doesn't change the dates of posts when you go back and edit them.
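The offset arithmetic above is easy to sanity-check with Python's zoneinfo (the times are taken from this thread; zoneinfo needs Python 3.9+):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")     # EDT (UTC-4) in May
pacific = ZoneInfo("America/Los_Angeles")  # PDT (UTC-7) in May

# The 9:18 AM EDT status update from the outage day.
post_time = datetime(2011, 5, 6, 9, 18, tzinfo=eastern)
print(post_time.astimezone(pacific).strftime("%H:%M %Z"))  # → 06:18 PDT
```

So a post stamped 9:18 AM EDT showing up at 6:18 AM PDT is exactly on time, not backdated.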

Jeez, it would be nice if Linodes automatically booted after failures like this, so they aren't down all day until I can log in to the Manager and press the Boot button.

They do. Have you enabled Lassie? Log into the Linode manager, click on the Settings tab for your node.

It's been so long since I looked, I can't recall if it's enabled by default or not.

Description:
> Lassie is a Shutdown Watchdog that monitors your Linode and will reboot it if it powers off unexpectedly. It works by issuing a boot job when your Linode powers off without a shutdown job being responsible.

To prevent a loop, Lassie will give up if there have been more than 5 boot jobs issued within 15 minutes.
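The loop-prevention rule quoted above can be modelled in a few lines. This is just a sketch of the stated rule, not Linode's actual implementation:

```python
from collections import deque

class BootWatchdog:
    """Sketch of Lassie's loop guard as described above: stop issuing
    boot jobs once max_boots have been issued inside the window."""

    def __init__(self, max_boots=5, window_seconds=15 * 60):
        self.max_boots = max_boots
        self.window = window_seconds
        self.boots = deque()  # timestamps of recent boot jobs

    def should_boot(self, now):
        # Forget boot jobs that fell out of the 15-minute window.
        while self.boots and now - self.boots[0] > self.window:
            self.boots.popleft()
        if len(self.boots) >= self.max_boots:
            return False  # give up: too many boots recently
        self.boots.append(now)
        return True
```

The sliding window means the watchdog resumes rebooting on its own once the flapping stops, rather than staying disabled forever.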

@waldo:

They do. Have you enabled Lassie? Log into the Linode manager, click on the Settings tab for your node.

It's been so long since I looked, I can't recall if it's enabled by default or not.

Description:
> Lassie is a Shutdown Watchdog that monitors your Linode and will reboot it if it powers off unexpectedly. It works by issuing a boot job when your Linode powers off without a shutdown job being responsible.

To prevent a loop, Lassie will give up if there have been more than 5 boot jobs issued within 15 minutes.
I see, thanks. I must have disabled lassie for some reason. Did it work in this case?

Linode and Linode support are wonderful, but I can't believe power outages repeatedly knock out a world-class datacenter like HE.

From the November outage RFO:

"The Fremont facility is consulting with the UPS manufacturer to make sure the system is more robust in order to protect against similar failures in the future. We plan to follow up with them and ensure that the reliability of Linode's infrastructure meets our expectations."

@kenyon:

I see, thanks. I must have disabled lassie for some reason. Did it work in this case?

Yep.

given this is 3 strikes for HE,

What steps will Linode be taking to ensure that its provider is fit to provide a Tier 1 service?

@waldo:

@kenyon:

I see, thanks. I must have disabled lassie for some reason. Did it work in this case?

Yep.

Worked for me too. I just spent an hour going through logs looking for the reason for re-start and came up blank. Came to the forums to ask a question and here's my answer :)

Being in this DC is starting to worry me :|

I had done some work on my Linode the night before, and I usually sync backups to another machine at my place. Add to that, I had turned off the machine for the night. Just my luck that the DC would go down and those same configuration files were lost :lol:

Oh well. That is what I get for not backing things up. :D Will have to rewrite them when I can again.

@JeremyD:

Being in this DC is starting to worry me :|

I had done some work on my Linode the night before, and I usually sync backups to another machine at my place. Add to that, I had turned off the machine for the night. Just my luck that the DC would go down and those same configuration files were lost :lol:

Oh well. That is what I get for not backing things up. :D Will have to rewrite them when I can again.

um, what? how were your configurations lost? the power went down, didn't see anything about any hosts blowing up.

@dbuckley:

I'm convinced the highly paid, highly qualified, highly regarded enginers who design these systems are morons, because they have failed to notice that the systems they design and specify fail time and time again.

I'm not defending HE here, because they did screw up, but building redundant power is way harder than it sounds. UPSs fail at the worst times even though they pass weekly tests. Diesel generators fail even though they pass weekly tests. Switching equipment jams, things overheat. Idiots wire dual-PSU servers to one power circuit and then swear blind they used both. Racks of servers all start up at the same time because the BIOS random start-up delay wasn't set and management didn't pay 10 times more for staged start-up PDUs. Air conditioning gets a bit old and starts to draw more current than the specs say. And humans normally screw things up big time when they realize they don't actually have a procedure for the current situation and start panicking.

Everything that can go wrong will go wrong. Everything that can't go wrong will go wrong anyway. And every bit of equipment that solves one problem introduces another one.

I don't think it's actually possible to do better than one rackmount UPS per server sitting in the rack right next to that server. That's what EMC do with their storage arrays. It's pretty expensive and hard to manage but it's the only thing I've seen that actually works.

@sednet:

I don't think it's actually possible to do better than one rackmount UPS per server sitting in the rack right next to that server. That's what EMC do with their storage arrays. It's pretty expensive and hard to manage but it's the only thing I've seen that actually works.
One rackmount UPS per power supply, not per server. Obviously with redundant power supplies.

@sednet:

Everything that can go wrong will go wrong. Everything that can't go wrong will go wrong anyway.

To paraphrase the late, great Douglas Adams… The only difference between something that can go wrong and something that can't go wrong is that when something that can't go wrong does go wrong, it's much harder to fix.

@JshWright:

To paraphrase the late, great Douglas Adams… The only difference between something that can go wrong and something that can't go wrong is that when something that can't go wrong does go wrong, it's much harder to fix.

I don't get you, or the other dude… haha. If Fremont can't handle an occasional storm, it might be time to look for another facility near Fremont that can. And that might sound harsh, but you'd think by now they'd have some sort of plan that doesn't involve downtime.

Additionally, I think I'd be less worried about it if it happened at all the Linode datacenters; instead, it seems like Fremont gets this issue pretty often. And apparently… I'm not the only one who isn't cool with downtime. :roll:

@superfastcars:

And apparently… I'm not the only one who isn't cool with downtime. :roll:

If you can't stand downtime, then you have already bought additional server(s) in other physical location(s) and set up failover, so that one facility going down (which happens because, obviously, nothing is perfect) doesn't screw you over. If you haven't set that up, then you don't actually care about downtime.

@Alucard:

@superfastcars:

And apparently… I'm not the only one who isn't cool with downtime. :roll:

If you can't stand downtime, then you have already bought additional server(s) in other physical location(s) and set up failover, so that one facility going down (which happens because, obviously, nothing is perfect) doesn't screw you over. If you haven't set that up, then you don't actually care about downtime.

That's a very nice example of the "No True Scotsman" logical fallacy. You are claiming that people who "really" care about downtime would do the thing you suggest, and anyone who doesn't doesn't "really" care about downtime.

The logical fallacy is in the presumption that "caring about downtime" implies caring about downtime to the exclusion of all else. Obviously different people balance their needs differently, and it's perfectly possible to care about downtime to the extent that you buy into a service with a reasonable belief that it will maintain a given level of uptime, and then find later that it does not meet your expectations. That does not mean you didn't care about downtime to begin with just because you didn't choose the most extreme option for minimizing it; it just means your initial assessment of the service was wrong, either because you assumed too much or because the service promised too much.

I am not sure which is the case here, but it does certainly seem that H.E. does not meet the same level of uptime as other data centers. It seems to me that there has been a history of problems with H.E. exceeding that of other data centers, and I think it is reasonable to express concern about this.

The SANS NewsBytes had a great quote about the Amazon outage that fits perfectly for this situation.

@John Pescatore:

Anyone who plans on using cloud without planning on workarounds for outages is not doing their due diligence.

@hoopycat:

(I do find it interesting that the most intricate and failure-prone utility spawns the biggest outrage when it breaks; if a failover router had blown a turboencabulator and seized the common-mode ambaphascient lunar wain shaft, taking out network reachability for a similar period of time, this thread probably would have not gone on this long.)

I'd be grumpy if HE's turboencabulators were blowing several times as often as any of the other data centers. (Like at DreamHost!)

(Well, not really, since I'm in Dallas. Which actually did experience network downtime several times as often as the other data centers back in 2008, thanks to DDoSes.)

I can't seem to access my servers in Fremont again. Anyone else with the same issue?

Yeah mine in Fremont are down as well - I can't even hit the AJAX console. Looks to be network related, since the manager claims they're still up

Yeah, having the same issue: node in Fremont not available, AJAX console management not available. It looks like a network issue.

No sign of an update on the status page; hope they are aware and working on it.

funny…I have a ticket open with no acknowledgements whatsoever for well over 14 minutes.

@autone:

funny…I have a ticket open with no acknowledgements whatsoever for well over 14 minutes.

If nobody is watching the tickets I won't be laughing!

I can see the network routes slowly getting better, so someone is working on it… not fast enough though! :P

@ybop:

@autone:

funny…I have a ticket open with no acknowledgements whatsoever for well over 14 minutes.

If nobody is watching the tickets I won't be laughing!

I am definitely not laughing. Fremont has been down so often it is getting sore, and this is the first time Linode support has been so slow.

18 minutes and still no response.

My 2 linodes in Fremont down too.

At the least they should take time to post a status on the status page; it would take all of 30 seconds to do. Being left in the dark with no idea of what's gone wrong or how long it's going to take to fix makes it thrice as bad.

Logged onto the console of the host my VM sits on, and it says it is powered off - sounds like there was a failure of the power systems.

Totally agree. Still no reply to my ticket, no status updates. Nothing.

I opened a ticket and it has been acknowledged by support.

They are working on the issue.

@klightspeed:

Logged onto the console of the host my VM sits on, and it says it is powered off - sounds like there was a failure of the power systems.

I can't believe it's the power supplies again. It seems like they have a power outage every second month or so. Surely it must be some other issue this time?

@qwerty123:

I opened a ticket and it has been acknowledged by support.

They are working on the issue.

Good for you. My ticket is still aging nicely without a response. Well over 30 minutes now.

I wonder what has happened to Linode's legendary support. Seems to have taken a break.

There's been a status update. Another power outage. Is it so impossible to have backup generators that work?

one of my three is back up after 49 minutes

@ybop:

There's been a status update. Another power outage. Is it so impossible to have backup generators that work?

Fremont seems unreliable; the only reason my linodes are still there is that it provides the best response times for Australia.

Might look at moving to Dallas.

1 of 2 still down. Over an hour downtime now.

Looking at my records there have now been four serious power outages at Fremont over the past 9 months:

November 2010

Dec 2010

May 2011

August 2011

Obviously the power issues there are systemic and have not been properly addressed (i.e. nobody is prepared to spend the money to fix this properly).

How about some compensation for those of us who are stuck using Fremont and have to put up with this 3rd-world service??

The host my VM is on mustn't have come up properly first time, as it was rebooted a few minutes ago.

@autone:

I wonder what has happened to Linode's legendary support. Seems to have taken a break.

That's worrying. I used to recommend Linode like a crazy man. Less so, recently.

@BigPhil:

How about some compensation for those of us who are stuck using Fremont and have to put up with this 3rd-world service??

Honest question - why are you stuck using Fremont? I chose it somewhat arbitrarily since I live nearby, but I'm thinking of switching. Just wondering what would make it impossible for a Linode customer to switch DCs?

> Honest question - why are you stuck using Fremont? I chose it somewhat arbitrarily since I live nearby, but I'm thinking of switching. Just wondering what would make it impossible for a Linode customer to switch DCs?

Best RTT to Australia/NZ (i.e. closest to our end-users).

Okay, my Linodes are up again.

I will probably schedule a migration to Dallas soon, before the next Fremont disaster hits in 3 months' time!

Not sure if anyone saw this, and it doesn't help all that much, but HE have a twitter account:

http://twitter.com/#!/henet

They posted earlier that it was a breaker on a UPS or something. Weird.

@BigPhil:

Best RTT to Australia/NZ (i.e. closest to our end-users).

I am in the same boat as you.

But looking at http://www.linode.com/speedtest/ Fremont is not that much faster than Dallas.

If it means an extra 5% uptime and some additional sleep (it is nearly 1:00AM here) the additional latency is probably worth it.

> If it means an extra 5% uptime and some additional sleep (it is nearly 1:00AM here) the additional latency is probably worth it.

You're probably right. An extra 30 ms from my test, not a lot. The grass is always greener, though; wonder if they have a decent UPS? Having good power is just so fundamental, it does my head in that they can't do it properly.
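For anyone weighing the latency trade-off themselves, here's a rough sketch of how to compare datacenters from your own connection. The hostnames would come from the speedtest page, and timing a TCP handshake includes more than pure network latency, so treat the numbers as approximate:

```python
import socket
import time

def tcp_rtt_ms(host, port=80, timeout=3.0):
    """Rough RTT estimate: time a single TCP handshake. Includes connection
    setup overhead, but good enough for comparing two datacenters."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0
```

Running it a handful of times against each datacenter's test host and taking the minimum gives a fair comparison, since the minimum filters out transient queueing delay.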

@roopesh:

@autone:

I wonder what has happened to Linode's legendary support. Seems to have taken a break.

That's worrying. I used to recommend Linode like a crazy man. Less so, recently.

Take a step back on this. Would you rather they be answering support tickets in under 5 minutes like normal or fixing the problem? In a major issue like this, I'd rather the latter.

I still recommend them like crazy, just not Fremont

I'm looking at moving away from Linode at this point. If the engineers can't figure this out, then there's no point in staying.

My linodes didn't even reboot properly, meaning I had to get in there Sunday AM to manually sort the Linode crap out again…

I'm not sure why I'm paying Linode almost $2k per mo (and growing), only to have constant headaches from their service.

Anyways, my fault for not considering moving sooner… I need a host with some decent uptime/service… time to start hunting and testing new hosts :)

@BigPhil:

Looking at my records there have now been four serious power outages at Fremont over the past 9 months:

November 2010

Dec 2010

May 2011

August 2011

Obviously the power issues there are systemic and have not been properly addressed (i.e. nobody is prepared to spend the money to fix this properly).

How about some compensation for those of us who are stuck using Fremont and have to put up with this 3rd-world service??

@glg:

@roopesh:

@autone:

I wonder what has happened to Linode's legendary support. Seems to have taken a break.

That's worrying. I used to recommend Linode like a crazy man. Less so, recently.

Take a step back on this. Would you rather they be answering support tickets in under 5 minutes like normal or fixing the problem? In a major issue like this, I'd rather the latter.

I still recommend them like crazy, just not Fremont
I'm not going to try all their locations to see which is best. When I first signed up with Linode, there were hardly any outages. Service outages were well communicated, apologies and refunds issued, and explanations provided. Now we get "hold on while we try to restore" and nothing else. On my ticket today, the support person didn't even know what was going on. So yeah, I don't recommend them anymore. Sure, I haven't left (yet), but that's more a function of laziness than a tribute to their support.

I've never had a single power issue in Newark…

@tyrelb:

I'm looking at moving away from Linode at this point. If the engineers can't figure this out, then there's no point in staying.

You realize that the problem is HE, not Linode, right? And only one of Linode's datacenters is with HE… So you really ought to take your issue up with HE's engineers, as Linode's people are not the source of your woes.

@tyrelb:

My linodes didn't even reboot properly, meaning I had to get in there Sunday AM to manually sort the Linode crap out again…

If your nodes didn't reboot properly, it's due to either a hardware failure or your config. If it's a hardware issue, you don't have to do anything, because the DC folks have to fix the hardware, and then it's back to normal. If it's your config, then it's not a DC or Linode issue at all; it's that your config is borked.

@tyrelb:

I'm not sure why I'm paying Linode almost $2k per mo (and growing), only to have constant headaches from their service.

If you're paying $2k per month, why don't you have geographic redundancy?

akeri - I agree with most of your points.

Unfortunately, I have a relationship with Linode, not HE, so I can't really go chase around HE's engineers… I guess I should have clarified that I haven't seen a sufficient resolution from Linode's point of view (really nothing other than "sorry, it won't happen again").

That's fine, but I think I'll move to a host which either runs their own datacenters or has a better relationship with the datacenter engineers.

Most of the linodes did reboot, but the 'Lassie' feature seems not to work 100% of the time. For the linodes that didn't reboot properly, all I had to do was log in, assess the situation, realize that Lassie didn't do the reboot correctly, and issue a manual reboot. So the server configuration was fine; it was just the Linode architecture.

And I agree, I should have geo redundancy. But that wouldn't have solved the fact that the linodes didn't reboot; even with redundancy I would have had this issue. Still, I take your comments as constructive, and they reinforce the point that I should have some sort of redundancy built in.

Thanks again for your suggestions.

@akerl:

@tyrelb:

I'm looking at moving away from Linode at this point. If the engineers can't figure this out, then there's no point in staying.

You realize that the problem is HE, not Linode, right? And only one of Linode's datacenters is with HE… So you really ought to take your issue up with HE's engineers, as Linode's people are not the source of your woes.

@tyrelb:

My linodes didn't even reboot properly, meaning I had to get in there Sunday AM to manually sort the Linode crap out again…

If your nodes didn't reboot properly, it's due to either a hardware failure or your config. If it's a hardware issue, you don't have to do anything, because the DC folks have to fix the hardware, and then it's back to normal. If it's your config, then it's not a DC or Linode issue at all; it's that your config is borked.

@tyrelb:

I'm not sure why I'm paying Linode almost $2k per mo (and growing), only to have constant headaches from their service.

If you're paying $2k per month, why don't you have geographic redundancy?

I will definitely agree that I wouldn't trust the fremont datacenter for things that needed uptime.

My main point is that it's just fremont that has this issue. All the other datacenters have been rocksolid, or at least as close to that as anything technical can be.

@qwerty123:

I will probably schedule a migration to Dallas soon - before the next Fremont disaster hits in 3 months time!

I scheduled a migration to Dallas as soon as my linode came up, and the process was pretty painless: file a support ticket, wait for them to set up the migration (took less than five minutes), shut down the linode, click the migrate button, and update DNS.

@akerl:

I will definitely agree that I wouldn't trust the fremont datacenter for things that needed uptime.

In my opinion, it seems Fremont is best suited for those who perform their own backups or have development environments. Place your live server in another datacenter and your backup server in Fremont. If your backup server goes down your customers don't notice and, with a few exceptions, you can probably handle some unscheduled downtime of your backup server.

@carmp3fan:

@akerl:

I will definitely agree that I wouldn't trust the fremont datacenter for things that needed uptime.
In my opinion, it seems Fremont is best suited for those who perform their own backups or have development environments. Place your live server in another datacenter and your backup server in Fremont. If your backup server goes down your customers don't notice and, with a few exceptions, you can probably handle some unscheduled downtime of your backup server.
Why keep your backup server in a location that you know to be unreliable, if you can have it elsewhere for exactly the same cost? After all these outages, I can't think of a single reason to keep a node in Fremont, other than latency across the Pacific Ocean. But if it's a backup server, latency doesn't matter.

I'm really seriously hoping that Linode gets a new datacenter on the West Coast and/or in the Asia-Pacific region.

We rely on the Fremont facility for the lower latency to Australia.

I think Linode needs to make its position clear on what action they will be taking to ensure HE has its house in order.

If it continues to simply be business as usual, then we will migrate our VMs away from Linode, as this has been the third outage due to power at this particular facility.

This can only be attributed to a lack of investment by HE at Fremont, and Linode should seriously start assessing the collateral damage this is doing both to itself and to its customers!

Don't get me wrong, I am a huge fan. But business is business and I cannot justify to my clients why they should continue to use nodes at a facility that is simply sub-par.

@hybinet:

@carmp3fan:

@akerl:

I will definitely agree that I wouldn't trust the fremont datacenter for things that needed uptime.
In my opinion, it seems Fremont is best suited for those who perform their own backups or have development environments. Place your live server in another datacenter and your backup server in Fremont. If your backup server goes down your customers don't notice and, with a few exceptions, you can probably handle some unscheduled downtime of your backup server.
Why keep your backup server in a location that you know to be unreliable, if you can have it elsewhere for exactly the same cost? After all these outages, I can't think of a single reason to keep a node in Fremont, other than latency across the Pacific Ocean. But if it's a backup server, latency doesn't matter.

Let me partially rephrase: Fremont is a good solution for backup or development systems if other options aren't available. It wouldn't be my first choice, but if I had to use Fremont I would only use it for backups or development.

I think there is some consensus here: move ALL production environments away from Fremont as there have been multiple unexplained, uncontrollable, unfixable downtime events over the past year, and there is little-to-no information that would suggest otherwise.

Twitter shows the situation is a disaster with HE now: http://twitter.com/#!/search/henet

Not too much Linode can do here, other than move their own datacenters.

I'll probably keep a linode account for testing purposes (not even reliable enough for staging purposes)… currently in the process of migrating 12 production and staging linodes (most >1024) off… what a way to spend a beautiful Sunday! :)

At least the outbound transfer speed at Linode is FAST! :)

Probably move the 15 production and testing ones later next week / next month… :(

@carmp3fan:

@hybinet:

@carmp3fan:

In my opinion, it seems Fremont is best suited for those who perform their own backups or have development environments. Place your live server in another datacenter and your backup server in Fremont. If your backup server goes down your customers don't notice and, with a few exceptions, you can probably handle some unscheduled downtime of your backup server.
Why keep your backup server in a location that you know to be unreliable, if you can have it elsewhere for exactly the same cost? After all these outages, I can't think of a single reason to keep a node in Fremont, other than latency across the Pacific Ocean. But if it's a backup server, latency doesn't matter.

Let me partially rephrase: Fremont is a good solution for backup or development systems if other options aren't available. It wouldn't be my first choice, but if I had to use Fremont I would only use it for backups or development.

@tyrelb:

I think there is some consensus here: move ALL production environments away from Fremont as there have been multiple unexplained, uncontrollable, unfixable downtime events over the past year, and there is little-to-no information that would suggest otherwise.

One was mother nature. A huge lightning storm in the area took down pretty much everything in the Bay Area. :) Pretty rare for this area though.

http://forum.linode.com/viewtopic.php?t=6301&postdays=0&postorder=asc&start=30

@reaktor:

http://i.imgur.com/2VoyJ.jpg

(Bay Bridge)

It was one of the worst lightning storms in the Bay Area in recent memory.

@tyrelb:

Not too much Linode can do here, other than move their own datacenters.
What about putting in their own rack UPSes? Where there's a will, there's a way… Someone just needs to get their wallet out.

@BigPhil:

@tyrelb:

Not too much Linode can do here, other than move their own datacenters.
What about putting in their own rack UPSes? Where there's a will, there's a way… Someone just needs to get their wallet out.
That might help prevent damages to hardware, but no UPS will keep your server online when the rest of the datacenter (including the networking equipment) has no power.

> That might help prevent damages to hardware, but no UPS will keep your server online when the rest of the datacenter (including the networking equipment) has no power.

Depends how long the power is off for. Routers and switches boot quickly. At each of the four power outages it's taken around two hours before most VPSes were restored. I suspect the power is coming back on a lot sooner than this, but we're waiting for hosts/SANs/Linode stuff to come up.

One would hope, in 2011, that their core network equipment had reliable dual power sources, surely…

@carmp3fan:

In my opinion, it seems Fremont is best suited for those who perform their own backups or have development environments.(…)
Well, not really encouraging.

A move to another DC seems easy, according to this thread.

But the data move (stopping initial server, move, start, DNS change, DNS cache lag time…) has its share of painfulness.

The real question is :

  • do we have any information that shows there is something going on in order to prevent such problem in a near future?

  • or should we better move?

If it is the latter I'd prefer to read some kind of official information that says "Fremont is a test platform." - at least we'd be warned.

@~root:

  • do we have any information that shows there is something going on in order to prevent such problem in a near future?

I asked what their plans were to mitigate these ongoing power issues at Fremont, and the response was that they either do not currently have any, or are not in a position to disclose them.

@BigPhil:

Looking at my records there have now been four serious power outages at Fremont over the past 9 months:

November 2010

Dec 2010

May 2011

August 2011

Obviously the power issues there are systemic and have not been properly addressed (i.e. nobody is prepared to spend the money to fix this properly).

How about some compensation for those of us who are stuck using Fremont and have to put up with this 3rd-world service??

You should add the 90 second bgp outage on 6/21

http://forum.linode.com/viewtopic.php?t=7294&postdays=0&postorder=asc&start=15

@tyrelb:

I'm looking at moving away from Linode at this point. If the engineers can't figure this out, then there's no point in staying.

My linodes didn't even reboot properly, meaning I had to get in there Sunday AM to manually sort the Linode crap out again…

I'm not sure why I'm paying Linode almost $2k per mo (and growing), only to have constant headaches from their service.

Anyways, my fault for not considering moving later… I need a host with some decent uptime / service… time to start hunting and testing new hosts :)

@BigPhil:

Looking at my records there have now been four serious power outages at Fremont over the past 9 months:

November 2010

Dec 2010

May 2011

August 2011

Obviously the power issues there are systemic and have not been properly addressed (i.e. nobody is prepared to spend the money to fix this properly).

How about some compensation for those of us who are stuck using Fremont and have to put up with this 3rd-world service??

Glad to know there are others with a lot of nodes.

Some programs won't work with redundancy (e.g. VoIP), or it's just not cost effective.

Ideally we move to Dallas, hope for better days and pray latency is good enough.

i saw this product today and it reminded me of he.net :)

http://www.engadget.com/2011/08/08/sony-intros-200-pound-battery-to-power-businesses-government-ag/

Anyone else on node 390, 391 and 411 ?

We're seeing super high latency on those - maybe a switch issue?

@ultramookie:

i saw this product today and it reminded me of he.net :)

http://www.engadget.com/2011/08/08/sony-intros-200-pound-battery-to-power-businesses-government-ag/

Typical Sony. APC will sell you a 2200VA 3.4 kWh UPS/battery for $1850, but Sony thinks that they can charge $27,500 for a 1000VA 2.4 kWh UPS.
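
To put a number on that comparison, here's a quick back-of-the-envelope price-per-kWh calculation using the figures quoted above (battery capacity only; it ignores inverter quality, VA rating, and anything else the extra money might buy):

```python
# Back-of-the-envelope: dollars per kWh of battery capacity,
# using the prices and capacities quoted in the post above.
apc = {"name": "APC Smart-UPS", "price_usd": 1850, "capacity_kwh": 3.4}
sony = {"name": "Sony", "price_usd": 27500, "capacity_kwh": 2.4}

for ups in (apc, sony):
    ups["usd_per_kwh"] = ups["price_usd"] / ups["capacity_kwh"]
    print(f"{ups['name']}: ${ups['usd_per_kwh']:.0f} per kWh of battery")

ratio = sony["usd_per_kwh"] / apc["usd_per_kwh"]
print(f"Sony costs roughly {ratio:.0f}x more per kWh")
```

Roughly $544/kWh for the APC versus about $11,458/kWh for the Sony, so around a 21x markup per unit of stored energy.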

@Guspaz:

@ultramookie:

i saw this product today and it reminded me of he.net :)

http://www.engadget.com/2011/08/08/sony-intros-200-pound-battery-to-power-businesses-government-ag/

Typical Sony. APC will sell you a 2200VA 3.4 kWh UPS/battery for $1850, but Sony thinks that they can charge $27,500 for a 1000VA 2.4 kWh UPS.

Sony must be using Monster Cables for all of the wiring….

I think the spec sheet states they use ionized copper for the ultrapure sine wave of power.

> Sony

Just wait until it gets hacked…

RFO has been posted on the status blog: http://status.linode.com/2011/08/fremont-power-outage-rfo.html

@bjl:

RFO has been posted on the status blog: http://status.linode.com/2011/08/fremont-power-outage-rfo.html
Thanks, that answers my questions.

They should probably split the load across several breakers in order:

1. to minimize the scope of a power outage

2. to identify faster which machine(s) tripped the breaker

@vonskippy:

I think the spec sheet states they use ionized copper for the ultrapure sine wave of power.

APC's Smart-UPS products all produce pure sine waves these days, unlike Back-UPS, which are still approximated. They also have their Smart-UPS On-Line products; they're not that much more expensive, and provide much better quality power, obviously.

Anyone else experiencing erratic accessibility to Fremont linodes today? Mine has gone down twice, down right now. Trace from my LAN ends here:

14 19 ms 18 ms 21 ms linode-llc.10gigabitethernet2-2.core2.fmt1.he.net [64.71.180.158]

15 * * * Request timed out.

I can't get into LISH either.

Just curious…

yes, multiple physical nodes are not responding.

Yep….down again. :evil:

Fremont seems to be down for me too. Since ~4:00PM PST.

I guess the last of us will be migrating to Dallas now. For some stupid reason I didn't do it the other day.

@haus:

I guess the last of us will be migrating to Dallas now. For some stupid reason I didn't do it the other day.

yes, us too.

@haus:

I guess the last of us will be migrating to Dallas now. For some stupid reason I didn't do it the other day.
This.

(I'd just got my IPv6 stuff all working with good DNS etc etc, but yet another outage…. sigh)

This problem does not seem to be affecting me, but I am in Fremont also… Very weird.

EDIT: Linode's status page is reporting a partial outage.

I had two nodes in Fremont. After the last outage I moved one to a different datacenter.

I asked on my support ticket if this datacenter should be used for production linodes. They said "You can utilize any of our datacenters that you wish – all of our datacenters are production ready."

I am beginning to doubt this. My linodes didn't even restart automatically after the outage last time!

The only reason I keep hanging on in there with Fremont is to have at least one node as close as possible to Asia/Pacific.

Is everyone else happy with the other datacenters or is this a more general degradation of service with Linode?

The consensus seems to be that this is a Fremont thing. I'm far from an expert on this, but does Dallas vs. Fremont make that big a difference from Asia/Australia (the two places people seem to really care about Fremont)?

It's a shame, you'd think that in a town that is fairly synonymous with Silicon Valley you could keep a server running. There's nothing out there but office buildings and nice weather.

there is a power outage @ Fremont.

we're told to wait until our Node gets power - we made a new node and went to a backup. (so it would be built on a node with power).

Haus,

For me:

Sydney -> Fremont ~ 170ms

Sydney -> Dallas ~ 220ms

West coast is really the best for Asia/Pac. This is - again- disappointing.

Why is fremont down AGAIN!?

@tyrelb:

Why is fremont down AGAIN!?

Yes.

The question is: when is it back?

@fiat:

Haus,

For me:

Sydney -> Fremont ~ 170ms

Sydney -> Dallas ~ 220ms

West coast is really the best for Asia/Pac. This is - again- disappointing.

what about to Hawaii?

@Alohatone:

there is a power outage @ Fremont.

we're told to wait until our Node gets power - we made a new node and went to a backup. (so it would be built on a node with power).
Where "are you told"? Did you receive a mail, or is it from a web page?

(didn't receive anything…)

@~root:

@Alohatone:

there is a power outage @ Fremont.

we're told to wait until our Node gets power - we made a new node and went to a backup. (so it would be built on a node with power).
Where "are you told"? Did you receive a mail, or is it from a web page?

(didn't receive anything…)

Support tickets.

@Alohatone:

@~root:

@Alohatone:

Support tickets.
Of course, thank you.

It would be painful (ask clients to change some of their DNS settings) but it seems a linode migration is to be planned :-(

Or is there the slightest hope that the situation in Fremont is going to improve?

(btw, just asking: when migrating to another linode, does Linode manage to keep the same public IP at the new location? I doubt it, but…)

@~root:

@Alohatone:

@~root:

Of course, thank you.

It would be painful (ask clients to change some of their DNS settings) but it seems a linode migration is to be planned :-(

Or is there the slightest hope that the situation in Fremont is going to improve?

(btw, just asking: when migrating to another linode, does Linode manage to keep the same public IP at the new location? I doubt it, but…)

when moving between datacenters, you change public IP.

Here is the HE report from 9 days ago: http://prgmr.com/~lsc/incident08072011.pdf

@~root:

is there the slightest hope that the situation in Fremont is going to improve?

Judging from the frequency of power outages, it just seems to be getting worse and worse.

I have a couple of dirt cheap OpenVZ VPS's with another company that used to colocate in HE's Fremont datacenter. In mid-July they moved all of their gear to another datacenter in the Bay Area. Took all of two hours, the IPs stayed the same, and they already dodged two power outages. It's a shame that my $3/mo throwaway VPS has better uptime than a Linode in Fremont.

Hope Linode does something about the Fremont situation soon…

@Alohatone:

Here is the HE report from 9 days ago: http://prgmr.com/~lsc/incident08072011.pdf
Looks like there is no huge improvement since Aug 7.

@~root:

@tyrelb:

Why is fremont down AGAIN!?

Yes.

The question is: when is it back?

I can't wait to read the RFO for this outage… I wonder if it's the UPSes again, or some other failure.

@thehousecat:

Is everyone else happy with the other datacenters or is this a more general degradation of service with Linode?
It's a "degradation of service" of Hurricane Electric's Fremont 1 data center. That doesn't magically affect the other facilities Linode colos at, run by other companies in different cities.

@~root:

Looks like there is no huge improvement since Aug 7.
The RFO did say "the next few months".

For all we know this outage happened when someone tripped over a wire while trying to improve things.

@mnordhoff:

For all we know this outage happened when someone tripped over a wire while trying to improve things.
I consider this to be a positive note.

Thank you for the update. I keep my fingers crossed in the meantime…

So far the clients were not much affected.

@~root:

I consider this to be a positive note.

Thank you for the update. I keep my fingers crossed in the meantime…

So far the clients were not much affected.
To be clear, what I said was optimistic speculation. I have zero information.

Is this 4-5 in the last few years related to power?

@OverlordQ:

Is this 4-5 in the last few years related to power?

Full power outage @ HE.net on 8/7

Partial power outage @ HE.net on 8/16

There's definitely been more than 2.

@OverlordQ:

There's definitely been more than 2.

Also there was the BGP issue @ HE.net in June/July 2011.

We're in the process of moving to Dallas; hopefully the issues don't follow us.

OR maybe Fremont can be moved to the new HE.net datacenter.

From digging through the status blog I found these two other power issues:

Nov 23, 2010 and May 6, 2011

http://status.linode.com/2011/05/outage-in-fremont-facility.html

http://status.linode.com/2010/11/possible-power-outage-in-fremont.html

So it's been 4+ power issues in the last year alone.

August 16, 2011 = 1 Hour / Partial

August 7, 2011 = 4 Hours

May 6, 2011 = 3 Hours

November 23, 2010 = 5 hours

@mnordhoff:

@thehousecat:

Is everyone else happy with the other datacenters or is this a more general degradation of service with Linode?
It's a "degradation of service" of Hurricane Electric's Fremont 1 data center. That doesn't magically affect the other facilities Linode colos at, run by other companies in different cities.

No need to be patronizing - your magic comment is stating the obvious.

I didn't ask about 3rd parties anyway - I have no business relationship with them. I specifically asked about the service level at different Linode datacenters.

Does anyone who has wider experience find the other datacenters more reliable?

Thanks

@thehousecat:

Does anyone who has wider experience find the other datacenters more reliable?
Yes. I have Linodes at Newark and London, as well as Fremont, and they have many fewer problems. The Fremont power problems all seem to stem from a nearby lightning strike last year. They are getting replacement UPSs (that hopefully really will be uninterruptible this time), but this fix is months away due to the lead time on the equipment. In the meanwhile, customers are leaving HE's Fremont 1 datacenter like rats off a sinking ship.

This is still not as bad as that time that my previous VPS provider decided to move all his servers to a new datacenter, physically damaging most of them in the process, and never told customers about it. And by this I mean the first sign of a problem was that all of our VPS instances went down without warning, and stayed down for several weeks with zero communication from the host about why they were down, and the host stopped answering e-mails and phone calls during this period. Only after weeks of downtime did the host finally update their website with an explanation. Up until then, most customers were convinced that he had cut and run.

Eventually, we got so fed up waiting for our VPS to return that we signed up with Linode and restored from not-as-recent-as-we'd-like backups… while in the back of a van driving between Montreal and Toronto (~560 KM). It was an interesting experience, to say the least. Since that disaster, we've instituted a more regular backup procedure; on-site backups with Linode (via their backup service), and off-site nightly incremental backups to a file server in my apartment.
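
For anyone setting up similar off-site nightly incrementals, here's a minimal sketch of how they can be driven with rsync's `--link-dest` option, which hard-links unchanged files against the previous night's snapshot so each night costs only the changed data. The host, paths, and layout here are made-up examples, not the actual setup described above:

```python
# Build a nightly incremental rsync command. Each night gets its own
# dated snapshot directory; --link-dest points at the previous night's
# snapshot, so unchanged files are hard-linked instead of re-copied.
# The source host and backup paths below are hypothetical examples.
import datetime

def nightly_rsync_cmd(src="user@linode:/var/www/",
                      dest_root="/backups/www",
                      today=None):
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    return [
        "rsync", "-a", "--delete",
        f"--link-dest={dest_root}/{yesterday.isoformat()}",
        src,
        f"{dest_root}/{today.isoformat()}/",
    ]

if __name__ == "__main__":
    # Print the command a nightly cron job would run.
    print(" ".join(nightly_rsync_cmd(today=datetime.date(2011, 8, 17))))
```

A cron entry on the home file server would then invoke this (or the equivalent shell one-liner) every night, and pruning old snapshots is just deleting old dated directories.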

@Guspaz:

This is still not as bad as that time that my previous VPS provider decided to move all his servers to a new datacenter, physically damaging most of them in the process, and never told customers about it. And by this I mean the first sign of a problem was that all of our VPS instances went down without warning, and stayed down for several weeks with zero communication from the host about why they were down, and the host stopped answering e-mails and phone calls during this period. Only after weeks of downtime did the host finally update their website with an explanation. Up until then, most customers were convinced that he had cut and run.

Eventually, we got so fed up waiting for our VPS to return that we signed up with Linode and restored from not-as-recent-as-we'd-like backups… while in the back of a van driving between Montreal and Toronto (~560 KM). It was an interesting experience, to say the least. Since that disaster, we've instituted a more regular backup procedure; on-site backups with Linode (via their backup service), and off-site nightly incremental backups to a file server in my apartment.

Weeks down? dang, you are too patient.

After 4 hours we were rebuilding someplace else.

I looked up the original timeline in my e-mail. It looks like our server went down Friday May 15th, 2009, and we rebuilt our server on Friday May 22nd, 2009. We were on an unreliable mobile phone connection on wifi in the back of a van, thank goodness for screen…

It looks like the original VPS did come up later on the 22nd, allowing us to pull in the latest data to supplement our newly deployed system (having been based on an older backup). So my memory was faulty, the downtime wasn't weeks, but about one week. (EDIT: Other sources say we may have been down for two weeks, but got access to the data after one week?) The rest of the recollection would be accurate in that there was no news from the host for the first chunk of it, and only later on did we start getting updates from the host.

We probably should have migrated sooner, but a combination of factors meant we didn't:

1) While our event's pre-registration was live at the time (and in that sense we were losing hundreds or thousands of dollars in sales), our event is of the type where people were likely to wait until the server was back up to pre-register. It was also not a critical point in the pre-registration process (which would be closer to the end of it). In fact, our biggest concern with restoring from backup, and one reason we were hesitant to rebuild, was what we should do about registrations that had been paid and processed, but that we no longer had any data about. We obviously couldn't refuse an attendee who had paid and registered just because we had lost their data… We could have reviewed our PayPal records to rebuild the important parts of the information (the names of people who paid, if not their assigned registration numbers). We were lucky that we later got access to the original data and were able to re-merge it back in regardless.

2) After 2 days, I redirected our DNS to a server at the local university which we controlled and posted a downtime notice. A day later, with the server still down, I put our website's design around the message to at least make it look a bit more official, and redirected all 404s to the message. We also redirected our mail to a server we controlled so that we could bring mail services back up on a temporary server.

3) By this point we were actively researching a new host to switch to. Picking a hosting provider is a big decision, and despite the urgency, we still had to do our due diligence, and by this point we were pretty much settled on Linode.

4) After the situation went from "Our server is down" to "Our host is MIA", we started trying to gather up all the backups we could from various servers and sources. Database and site code/content was primarily pulled from these older backups, and we refreshed our content with the newer data from the wayback machine. We were still hesitant to restore from backup because of the difficulty of a later merge if we did get access to the original data.

5) Eventually, the situation became intolerable; we were leaving for Anime North, the largest convention in the country, and we needed our server up for promotion purposes. This is why at this point we took the plunge and started rebuilding from our gathered backups. From the back of a car, going down the highway.

It's always a difficult decision to make. How much downtime do you tolerate before you go from waiting for your server to come back online to rebuilding from backups? As a somewhat loosely organized non-profit company, we're also not the kind of organization that has policies or procedures for this sort of thing.

Since then, we've at least taken precautions. As I've said, we have nightly incrementals off-site, and on-site linode backups. At this exact moment, since our event has ended for the year (literally just three days ago), our registration system is not active and downtime would be relatively unimportant; if we did need to take emergency actions our forums are probably all we'd care about. But nevertheless, we'd probably still take action sooner.

Of course, we've also gone from a fly-by-night operation to a first-class hosting provider (Linode), so I'm relatively confident that we'd be unlikely to have to restore from off-site backups. If our linode's host should die, we can restore from Linode's on-site backups in a matter of minutes. If Linode's datacenter should go down, we can restore from the off-site backup with a little bit of rebuilding: it's an incomplete backup, so we'd need to deploy a full linode, layer our backup on top of that, do some checking, and get back up in about 3 hours. (I've got two bonded VDSL2 lines and some other connections to my apartment, so I can push 14 megs upstream on my fastest link, and probably 20 megs up total if I combine that with cable, 3G, and free wifi.) And if Linode went down entirely (nuclear bomb exploding at Linode headquarters?) then we have a much better picture of the VPS hosting industry, such that we could move to a new host and be up and running again probably in 6 to 12 hours. Of course, all these times are after we make the decision that the original machine is a write-off and we need to restore from backups…
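
As a rough sanity check on restore-time estimates like those above, the raw transfer time for a backup of a given size over a given uplink is easy to compute. The 10 GB size below is a hypothetical example, not the actual backup size; real restores also spend time deploying the node and verifying data:

```python
# Raw transfer time: bits to move divided by link rate.
# Backup size here is a made-up example for illustration.

def transfer_hours(size_gb, mbps):
    bits = size_gb * 8 * 1000**3        # decimal GB -> bits
    seconds = bits / (mbps * 1_000_000)  # link rate in bits/sec
    return seconds / 3600

for mbps in (14, 20):   # fastest single link vs. all uplinks combined
    print(f"10 GB at {mbps} Mbps: {transfer_hours(10, mbps):.1f} hours")
```

At 14 Mbps a 10 GB backup takes about 1.6 hours of pure transfer, and about 1.1 hours at 20 Mbps, which leaves room in a 3-hour window for deploying and checking the restored system.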

@thehousecat:

No need to be patronizing - your magic comment is stating the obvious
I didn't mean to be patronizing, just a jerk. :P Um, I'm sorry. I was grumpy about the people who blamed Linode for the outage rather than HE. (Yes, it is Linode's fault for using HE, and maybe Linode should take some sort of action, but it's not entirely Linode's fault.)

@thehousecat:

Does anyone who has wider experience find the other datacenters more reliable?
None of the other data centers have had luck this bad, at least in the last few years. On the other hand, neither had Fremont up until a year ago.

You might want to read the "Datacenter reliability comparison" thread, but it doesn't really add much to what HoopyCat and I have said, except for some more details.

Thanks everyone.

It just stings a bit when there are outages like this so close together! Hopefully they will get it sorted soon. I'm going to keep a VM in Fremont, but that datacenter is on its final warning :-)

mnordhoff - no worries.

@Guspaz:

….. and off-site nightly incremental backups to a file server in my apartment.
Now see, that right there, makes you a geek :)

@graq:

@Guspaz:

….. and off-site nightly incremental backups to a file server in my apartment.
Now see, that right there, makes you a geek :)
Doesn't everyone do that?

Does an old laptop count as a file server for the purpose of this requirement? :)

@thehousecat:

Does anyone who has wider experience find the other datacenters more reliable?

I seem to recall a circuit breaker tripped in Newark a few years back, taking out a half-dozen servers (not affecting mine). Things were back up in about a half hour. That's about all I can remember for power-related issues outside of Fremont.

I've never deployed long-term stuff in Fremont, and I can't recall ever claiming a SLA credit. It's difficult to accumulate enough downtime anywhere else.

@~root:

They should probably split the load across several breakers in order:

1. to minimize the scope of a power outage

2. to identify faster which machine(s) tripped the breaker

The breaker in question is a very large one which, if I'm reading this correctly (and I hope I'm not), protects the input side of the UPS. So, there's already only one machine on the circuit. Internal distribution is going to be normal branch circuits, the tripping of which would only affect a small handful of machines.

@carmp3fan:

The SANS NewsBytes had a great quote about the Amazon outage that fits perfectly for this situation.

@John Pescatore:

Anyone who plans on using cloud without planning on workarounds for outages is not doing their due diligence.

s/cloud/Internet/ and I'm pretty much on the same page.

The Internet (and the various cloud-computing technologies within) exist on real equipment in a real world. There will be failures. It sucks when they happen, but they will, and usually not in the way you're expecting.

I'm not excusing this outage by any stretch, but, well, it'll happen again. Maybe not Fremont again (although I said that last time, didn't I?!), but electrical power is particularly tricky to do right. I'm personally a big fan of DC, but it's one of those IPv6-like chicken-and-egg problems, except actual real capex is involved and there are no dominant standards yet.

This is anticipated to change this year, though it would be unwise to get your hopes up for immediate adoption, and this is just one failure mode of many.

(I do find it interesting that the most intricate and failure-prone utility spawns the biggest outrage when it breaks; if a failover router had blown a turboencabulator and seized the common-mode ambaphascient lunar wain shaft, taking out network reachability for a similar period of time, this thread probably would have not gone on this long.)
