Host Reboots - October 27th, 2009

Guys, this is BRUTAL.

Why are you doing shared library upgrades in the middle of the afternoon when they can affect ALL of your sites like this? I like the fact that I can have geographical diversity with your services but what's the point if you do things that take out all the sites…

Do you have any updates or an ETA on when this will be resolved?

102 Replies

I agree, this is brutal… My monitors show that my outage started at 1:10 PM (PDT). It's now 2:34 PM… help

I have to admit that pushing the upgrade to all data centres at once wasn't the smartest option, but then again, it may have been inevitable to do all data centres at once. I'm sure it'll be elaborated upon once this issue has been resolved.

My linode is up again (downtime less than 2 hours), so thanks for fixing it :D

What DC are you in?

I'm in Atlanta and my host is still up and running fine (knock on wood).

I wish I could say the same about the linode.com page; it's slow as hell (including the forums).

Glad you said that. I was going nuts trying to figure out what was wrong with my network connection.

2 hours and counting…

Argh!!

My Linode and sites are running, but my host actually has a load (first time I've seen that since I've been at Linode). My SSH session is locked up and I can't reboot or shutdown my Linode.

I'm in Fremont.

linode.com/ is slow as molasses for me as well, except for accessing the support page. That loaded and sent my support request nice and speedy.

Support referred me to this thread.

They sent me to this thread, after I had already posted on it. And support isn't updating the thread, so what's the point?

This isn't OK. 3 hours. Deafening silence. No progress updates. I've loved Linode up till now, but I can't accept this type of lack of communication.

I think we would all feel better if Linode would commit to providing regular and frequent updates. This lack of communication is beyond frustrating.

So we were supposed to be notified via the support system of this outage? We didn't receive anything. Our linodes were shut down ungracefully, which of course isn't good for our databases.

We run a public safety avalanche forecast site on one of these linodes and it's the first day of wet, unstable snow. We only saw 8 minutes of downtime, luckily. 2 hours like others are reporting would've definitely been unacceptable.

Well, my linode at Fremont was rebooted around 30 minutes ago. Total downtime is around 20 minutes according to Pingdom, although from the Linode Manager it looks like at least 1-2 hours offline. The downtime would have been even shorter had I not forgotten to bootstrap some services in the RC script. Everything is speedy again – unlike the Linode website and this forum.

You guys hosted on a Linode 360?!

3 hours and 20 minutes. It looks like my server is 80% up now. I still see some errors. And the instance is slooooooooooooooow.

I've been offline since 4:05 PM EST. That's nearly 4 hours now, and I've seen one (1)! update in this thread in that time. They're not large sites, but I was running some critical processes at the time. I'm fed up.

We've been working as fast as we can to bring the machines affected to a stable state.

There are still a few problem hosts we're working through, although we expect all issues to be resolved shortly. Please stand by for additional updates.

-Chris

My linode has been down over 4 hours now with no warning and no information other than a canned response from support referring me to this thread. Not impressed.

Can we please get some forewarning next time?

I understand these things happen, but it sounds to me like this could have been avoided, or at least delayed.

We could have at least gotten some warning to let our clients know.

I'm sorry but I too have to question doing this in the middle of the day US time. This has been quite the outage.

I love linode and have recommended it to so many people over the years, but when things like this happen, it not only makes you look bad, but it makes me look bad to everyone I have referred here.

That is not even taking into account the users of my own site…

I love Linode, but this is absolutely unacceptable. Not only did your unscheduled, un-communicated upgrade break some customers, your "solution" caused abrupt hard resets for a great many who were not broken, including myself.

This is not how production systems are run. Billing credit needs to be extended to all affected customers, and this needs to never happen again.

Maintenance that has any chance of causing breakage needs to be done in off-peak hours with at least 24 hours' notice to affected customers, and in case of emergencies, emails need to be sent out at least 30 minutes in advance of taking actions that you KNOW will cause problems for customers not currently affected by the problem you're trying to solve.

See if you can guess where the trouble started ;)

~~![](http://mibbit.com/down.png)

Unfortunately my nodes were rebooted one by one, over the course of 2 hours, which meant far more downtime than if they had all been restarted together.~~

I didn't notice being offline but I did notice extremely slow response times. It seems to be much faster now.

Looking at my stats, there's a huge gap from 21:00 to 0:00 - I'm guessing that's the outage, and I just wasn't here to notice.

My linode says "Up since Oct 28, 12:00 AM".

I think my linode is in Dallas.

-=-

It would be better to do these kinds of things 4 AM eastern (1 AM Pacific).

Fortunately our setup has linode mirrors in Atlanta, Dallas, Newark and SFO and our monitoring system takes any that are down out of the lineup automatically, but the timing (non-wee-hours-of-the-AM) was awkward:

Tue Oct 27 20:01:09 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:02:19 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:03:31 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:05:03 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:06:14 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:07:28 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:08:48 EDT 2009 ERROR: 66.220.1.164 http is down!

Tue Oct 27 20:09:46 EDT 2009 ERROR: 66.220.1.164 http is down!

Stuff Happens. Still, it would have been better to have this particular Stuff Happen at 3 or 4 AM when nobody would notice …

My server is still down as well. Unfortunately, guys, virtual servers are still shared servers, and no matter what time they do maintenance I guess there would be complaints, as they have customers all over the world.

RE: status updates, I guess they are too busy fixing the problem (been there myself; in an office environment everyone rings you to find out what's going on, so much so that you don't get much time to fix the problem, until you go into the server room and lock the door!).

Here's hoping everything is up soon!

awesome

JobID: 1498922 - Host initiated restart

Job Entered 01/03/1974 11:00:00 PM Status In Queue

Host Start Date Host Finish Date

Host Duration waiting on host Host Message

This totally killed my linode, way after you guys already knew there was an issue. Did you not stop?

Oh, newbies… take it easy. For me this is the first time that there has been a problem at all (as relatively minor as it is) since I signed up in 2007 (my linode is in Fremont, btw). And this isn't even a real problem, it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon's services. :)

@razza:

My server is still down as well. Unfortunately, guys, virtual servers are still shared servers, and no matter what time they do maintenance I guess there would be complaints, as they have customers all over the world.

Colo companies have the exact same problem with power and network maintenance, but the difference is they tell their customers in advance and they schedule it for the least possible impact. You can't get it perfect, but there are far better times to reboot a bunch of US-based servers than late afternoon/early evening US time.

Mine are both still down and have been for almost 4 hours.

All the boot requests are just sitting in the queue.

Guess my plans for tonight are canceled…

Just out of curiosity, will linode be giving refunds for downtime? My host was down 30 minutes, so I won't get a refund, but I see there are people who have been affected for hours.

http://www.linode.com/faq.cfm#what-is-your-sla

What is your SLA?

We're not going to lie to you: server maintenance, upgrades, hardware and network issues, all affect a Linode in the same ways as any other provider. What we can boast about is our commitment to resolving these issues in the quickest fashion possible. Most customers will tell you the last time they rebooted was to take advantage of a plan upgrade. 99.99% uptime, or your lost time is refunded back to your account.

@Infinito:

Oh, newbies… take it easy. For me this is the first time that there has been a problem at all (as relatively minor as it is) since I signed up in 2007 (my linode is in Fremont, btw). And this isn't even a real problem, it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon's services. :)

Who are you to tell us to take it easy? Not all of us use our Linodes as expensive shell accounts. I have clients who are pissed, and we got ZERO warning. Try taking it easy when this is costing you money.

I feel for you guys; 4 hours is a long time. It seems my linode was down for less than one. Everything's back to normal here.

Keep in mind Linode has an amazing track record compared to any of their competitors. Yes, the timing was terrible and there was no warning, but these things happen in this industry.

That being said, I am a bit irked as I haven't gotten any explanation on my current situation: my servers have come back up and I can SSH in but I cannot get a shell prompt and I am unable to reboot from the dashboard.

@nknight:

Colo companies have the exact same problem with power and network maintenance, but the difference is they tell their customers in advance and they schedule it for the least possible impact. You can't get it perfect, but there are far better times to reboot a bunch of US-based servers than late afternoon/early evening US time.
Colo customers also pay a premium for that level of redundancy/reliability.

I'm not saying that the linode folks don't deserve some ribbing for this, or that customers don't deserve some credit on their accounts. Rather, just trying to provide some perspective on the issue. If you need umpteen nines of reliability, then linode probably isn't the service for you. For the rest of us, it's pretty dang good. I've been a customer since 2004, and since that time, the number of issues like this can be counted on say, 2 fingers. Pretty damn good if you ask me. While any amount of downtime sucks, this afternoon's incident surely isn't going to turn me away from Linode. Once the dust settles, I'm sure that we'll get an explanation from caker, as well as some credit on our accounts is likely as well.

4.5 hours. That one hurt.

@anderiv:

Colo customers also pay a premium for that level of redundancy/reliability.

I'm not saying that the linode folks don't deserve some ribbing for this, or that customers don't deserve some credit on their accounts. Rather, just trying to provide some perspective on the issue. If you need umpteen nines of reliability, then linode probably isn't the service for you. For the rest of us, it's pretty dang good. I've been a customer since 2004, and since that time, the number of issues like this can be counted on say, 2 fingers. Pretty damn good if you ask me. While any amount of downtime sucks, this afternoon's incident surely isn't going to turn me away from Linode. Once the dust settles, I'm sure that we'll get an explanation from caker, as well as some credit on our accounts is likely as well.

Totally agree. I came from a shared hosting environment (DreamHost) that had one nine worth of uptime. Since moving here to linode in April, outside of today's 30m of downtime, the rest of the downtime was caused by my own stupidity. I won't be leaving linode anytime soon. That said, it was kind of irksome to suddenly lose my VPS without warning.

@anderiv:

Colo customers also pay a premium for that level of redundancy/reliability.

This has nothing to do with redundancy or reliability; this is purely a procedural and judgment problem. Linode made a bad choice (a series of them, actually), resulting in wholly avoidable downtime and hard resets for many customers.

Saying we should have to pay a premium to avoid a rather horrific error in judgment on the part of Linode is like saying we should pay a premium to get a car that has a steering wheel. It's nonsense.

Any hosting service that holds itself out as being suitable for commercial purposes should not be making these kinds of mistakes, regardless of their level of redundancy or hardware reliability.

@ultramookie:

99.99% uptime, or your lost time is refunded back to your account.

According to that, a 4 hour outage on a Linode 2880 (the most expensive) gets you 88 cents! :)

–John
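For the curious, the back-of-envelope math behind a figure like that is just a pro-rated hourly rate. A minimal sketch, assuming a roughly $158/month price for the 2880 plan and a 720-hour billing month (both assumptions here, not Linode's published terms):

```shell
# Hypothetical pro-rated SLA refund: monthly price spread over an
# assumed 720-hour month, multiplied by hours of downtime.
prorated_refund() {
    # $1 = monthly price in dollars, $2 = hours of downtime
    awk -v p="$1" -v h="$2" 'BEGIN { printf "%.2f\n", p * h / 720 }'
}
```

So `prorated_refund 158.40 4` prints 0.88, i.e. 88 cents back for a 4-hour outage.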

This outage has affected several of my linodes. While I am not pleased, especially after the Atlanta issue took so long to resolve, it still pales in comparison to what ThePlanet put my customers through with a 2 week outage several years ago.

The big difference there is that even with their lame CEO's emails and speeches, there was customer contact. Even though inside my soul I cried "BS" at every word they said about the incident, at least they appeared to try. What I see here is chaos and pain.

I'm not sure how the company is built personnel-wise, but I'm sure we'll hear something useful when it's all said and done. Customers are what make any company, and I'm sure upsetting us is not on their to-do list.

It seems I've been unaffected by this problem, so I wonder if it only affects older hosts and linodes (as I got mine a few days ago). Plus, my host is always idle.

@jpw:

@ultramookie:

99.99% uptime, or your lost time is refunded back to your account.

According to that, a 4 hour outage on a Linode 2880 (the most expensive) gets you 88 cents! :)

–John

Exactly! I can hardly wait for my refund on my 720 for the past 4.5 hours. I'm sure my customers won't mind either.

@Orrin:

My linode has been down over 4 hours now with no warning and no information other than a canned response from support referring me to this thread. Not impressed.

A) Unplanned outage, how do they warn against those?

B) Do you want them to fix your box or post here?

@nknight:

This has nothing to do with redundancy or reliability; this is purely a procedural and judgment problem. Linode made a bad choice (a series of them, actually), resulting in wholly avoidable downtime and hard resets for many customers.

See above.

C) You get what you pay for.

D) Run it yourself if you think you can do better.

So I finally get to test that my automated back-up works, for real, after 4 months. Shit happens.

There's no way I'd be up at 4am performing upgrades, so I wouldn't expect anyone to do the same for me.

Some of us do appreciate you're working your balls off to sort the issue out. We've all been there.

Keep up the good work, Mr Linode!

I had around 1.5 hours of downtime. Good thing I decided to check in on my IRC server before leaving this morning, it seems a typo in my crontab prevented the ircd from starting up after the reboot. Some sort of email notification would have been great; there are automated notifications for exceeding CPU thresholds, etc., how about an automated notification for a shutdown not initiated by the user?

@techman224:

It seems I've been unaffected by this problem, so I wonder if it only affects older hosts and linodes (as I got mine a few days ago). Plus, my host is always idle.

I don't think so. I am also unaffected (fingers crossed) and I got my Linode in 2006. It's on Dallas 5.

@OverlordQ:

B) Do you want them to fix your box or post here?

Both.

j/k

Actually, I'm not; I totally agree with the people in this thread who complain about a lack of communication.

Posting an update takes very little time so I don't think it's too much to ask to keep people in the loop about the scope of the problem and what's being done to fix it.

@DharmaTech:

Our linodes were shut down ungracefully, which of course isn't good for our databases.
This is the biggest issue as far as I'm concerned. No hosting company would consider walking into the data center and literally pulling the power cords out of the servers in a rack, and the same care should be taken with virtual servers as with physical ones.

So far 2 of my Linodes have been down, and fortunately come back up again (atlanta54 and atlanta57), but my database server (atlanta20) is still down at the moment, and will hopefully not have data issues after having its power plug yanked out!

How hard is it to set up a mailing list of some sort that we can subscribe to to get announcements of upcoming maintenance? A former ISP of mine did just that, and it worked out great. They would email out a description of maintenance that was to be performed, who would be impacted, and a time window.

Not only would it save a lot of customer frustration, but it would make me look good to the other members of the non-profit I host when I warn them of upcoming maintenance. Right now it has the opposite effect, making it look to them like I dropped the ball. Not cool.

There are people whose linodes are completely down or inoperable, and Linode seems to be rebooting nodes that are perfectly fine, like mine…

@OverlordQ:

A) Unplanned outage, how do they warn against those?

B) Do you want them to fix your box or post here?

Usually "unplanned outage" would imply that something unexpected happened outside of the control of the administrators. Chris initially said it was due to "a shared library update distributed to our hosts". Based on the thoroughness I've observed in the past from Linode, I would expect that that sort of update is (1) scheduled by Linode staff and (2) tested on a staging host before pushing to production hosts. If (1) is true then the update was planned, even if the outage was not. I think the point of many posters here is that such maintenance should be announced, even if no outage is expected. If they aren't doing (2), they should be, although that doesn't always catch the problem.

–John

@OverlordQ:

A) Unplanned outage, how do they warn against those?

The original outage was unplanned, the maintenance that caused it was not.

@OverlordQ:

D) Run it yourself if you think you can do better.

I do. My linode is for my personal business use.

I've been responsible for one particular corporate production service with thousands of customers since early 2006.

You know what each customer is paying us? The equivalent of a few US dollars per month. You know what our contractual uptime obligations are to them? Nothing. You know how much impact 24 hours of downtime would have on our customers? 95% probably wouldn't even notice.

You know how many people are involved with this service? At peak, it was 4. Now it's just me.

And yet, in all that time, all maintenance has taken place during off-peak hours, as has all planned downtime (which was communicated to customers well in advance). We have had approximately 30 minutes worth of unplanned downtime in that period, and about two hours of "partial" downtime due to one of our upstream ISP's flapping BGP (causing approximately 50% of customers to have intermittent difficulty connecting to the service).

I can do better, and I have done better. I do better every day. I'm confident Linode can and will, too, but they have seriously dropped the ball today, and need to be held accountable for it.

@dmuth:

How hard is it to set up a mailing list of some sort that we can subscribe to to get announcements of upcoming maintenance? A former ISP of mine did just that, and it worked out great. They would email out a description of maintenance that was to be performed, who would be impacted, and a time window.

Not only would it save a lot of customer frustration, but it would make me look good to the other members of the non-profit I host when I warn them of upcoming maintenance. Right now it has the opposite effect, making it look to them like I dropped the ball. Not cool.

I feel your pain, as I've had four angry customers call me and I'm left holding the bag, but remember that it's always hard to warn someone of unexpected downtime ;)

@Infinito:

Oh, newbies… take it easy. For me this is the first time that there has been a problem at all (as relatively minor as it is) since I signed up in 2007 (my linode is in Fremont, btw). And this isn't even a real problem, it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon's services. :)

+1. I've been with linode since Oct 2003 (also in Fremont) and this is the first real outage I'm aware of. The only other issues I've ever experienced are short outages due to DDoS etc.

Fortunately my box is up and running again :D

@OverlordQ:

A) Unplanned outage, how do they warn against those?
Did you read the OP?
> To recover from this we may be issuing host reboots to upgrade their software to our latest stack, and then bringing the Linodes to their last state. We're working on this now and expect to have additional updates shortly. We'll also be notifying those affected via our support ticket system.

So no, I didn't get notified via the support system.

@OverlordQ:

B) Do you want them to fix your box or post here?
Oh yeah, because that's an either/or thing right?

I'm not bashing Linode and overall I've been very happy with the service, but today they fell short of my expectations.

JobID: 1505768 - Host initiated restart

Job Entered 01/04/1974 12:00:00 AM Status In Queue

Host Start Date Host Finish Date

Host Duration waiting on host Host Message

Thought that was kinda funny! My Newark node is all sorts of borked now. :shock:

@spearson:

Job Entered 01/04/1974 12:00:00 AM Status In Queue

Host Start Date Host Finish Date

Host Duration waiting on host Host Message

Thought that was kinda funny! My Newark node is all sorts of borked now. :shock:

Nah, they do that to force the boot job to the front of the queue.

Ever since the downtime today I cannot get my linode to boot up correctly. I cannot reach it by ssh. When I use the AJAX console it seems stuck, and here is the message:

INIT: /etc/inittab[33]: rlevel field too long (max 11 characters)
INIT: /etc/inittab[34]: rlevel field too long (max 11 characters)
INIT: /etc/inittab[35]: rlevel field too long (max 11 characters)
INIT: /etc/inittab[36]: missing action field
Enter runlevel:

If I enter runlevel 3 it gives me this error:

INIT: Entering runlevel 3                                         
INIT: no more processes left in this runlevel

What can I do at this point? I need help bad here. All my websites are down. I am dead in the water…
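For what it's worth, those INIT errors point at specific malformed lines in /etc/inittab (the fields are id:runlevels:action:process). A hypothetical quick check you could run from the rescue environment or Lish to find them:

```shell
# Flag inittab lines that sysvinit would reject: a runlevels field
# longer than 11 characters, or an empty action field.
check_inittab() {
    awk -F: '
        /^[[:space:]]*(#|$)/ { next }   # skip comments and blank lines
        length($2) > 11 { printf "line %d: rlevel field too long\n", NR }
        $3 == ""        { printf "line %d: missing action field\n", NR }
    ' "$1"
}
```

Running `check_inittab /etc/inittab` should list the same line numbers INIT complains about; from there you can repair or restore the file before rebooting.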

@hiscom:

Still, it would have been better to have this particular Stuff Happen at 3 or 4 AM when nobody would notice …

Which 3 or 4 AM?

From Linode's 'Interesting stats': "131 countries customer diversity".

Both my linodes (on different hosts) went down because of your mistake. One of them is still down. Why didn't you try your updates on just one host and then decide whether or not to push the updates on the others? You don't do this kind of thing on all hosts at once. You would have saved a lot of people a lot of trouble if you'd have used common sense!

Of course no one likes downtime, but it's a reality of the business.

I myself have to answer to my customers whom I referred to Linode and who pay me to make sure their services stay online. However, they understand that outages are a reality of the business.

Even the big guys like Rackspace, Amazon, The Planet, Google, and Facebook have outages.

It's not a matter of if… it's a matter of when. It could be worse.

Two of my customers who I emailed to pass along information about this incident replied back and told me this is nothing in comparison to the outages they endured at Media Temple. It was not uncommon for servers to remain offline for an entire day or more at a time without resolution.

Earlier this week, I was bragging about my linode uptime on a mailing list. Oh well, that will teach me! Karma has its way …

My linode appears to be up again, according to the dashboard. I got connection refused when ssh'ing, so I was fairly confident it was indeed running (otherwise, it would have timed out). I used the AJAX console to inspect why the ssh service was down, and lo and behold, it was stuck at the initramfs prompt complaining about a failed fsck. Ran fsck -y, and then… lost connection to the console, which has been down since then. Arrrgh!!!11

Rebooting now. Hopefully it will come up just fine. Hopefully.

While I am not impressed with the lack of advance notification, I am displeased even further in the lack of reboot notification.

It was stated that those affected would receive notification if any hosts require a reboot. Thus far, 2 of my Linodes have been restarted without any notification - after I'd read that I would be notified.

If you say you are going to notify customers, you should do just that. I was expecting notification if downtime was going to occur - not just the downtime!

@andrewz:

What can I do at this point? I need help bad here. All my websites are down. I am dead in the water…
Open a new thread, or support ticket…

@Rogi:

@hiscom:

Still, it would have been better to have this particular Stuff Happen at 3 or 4 AM when nobody would notice …

Which 3 or 4 AM?

From Linode's 'Interesting stats': "131 countries customer diversity".
Countries exist outside the USA?

:lol:

@OverlordQ:

A) Unplanned outage, how do they warn against those?

By creating [email protected] which lists all maintenance, without exception. That way I can look at that mail archive, see today's date and 'upgrading libraries on xen hosts', and go… ahhhh!

> B) Do you want them to fix your box or post here?

Actually, I want them to post here first, then fix the problem.

I've just had to tell a customer "your servers seem OK but may be rebooted at some point" because I don't know if they are going to reboot all hosts. I had to tell them that because there is NO official information that I can find on the status of the problem (or even much on the cause, fix ETA, etc.).

Take 5 mins to post all the info you have. Say what is going to happen to rebooted hosts (are they now ok) and what about un-rebooted hosts (will they be rebooted later or are they fine).

Part of taking the credit and fanboy love we have for this wonderful thing (and I think linode is great) is also taking responsibility for the fsck-ups that happen along the way.

I'm a professional sysadmin, so I understand things 'go wrong', and that is fine; people have to live with that. But what you can do is be open and honest and fully informative about the problem. It takes 5 mins to do, and often stopping and thinking about the problem enough to lay it out clearly can actually help.

For example, xen instances can be saved to disk. Is there some reason that admins can't do the following:

  • save all xen instances on a host

  • reboot the host

  • restore the xen instance

If that was workable, then maybe you could have a 2 minute 'hang' for each host and not need a reboot. shrugs Maybe linode should look at trying that (xm save savefile) and see if it could be used to reduce the impact next time.
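A rough sketch of that idea (hypothetical guest names and save paths; assumes the xm toolstack). Generating the command list first, rather than running it blind, would let an admin review the plan per host:

```shell
# Emit the save/restore commands for a host's guests, skipping dom0.
checkpoint_cmds() {
    for dom in "$@"; do
        [ "$dom" = "Domain-0" ] && continue   # never checkpoint dom0
        echo "xm save $dom /var/xen-saves/$dom.chk"
    done
}
restore_cmds() {
    for dom in "$@"; do
        [ "$dom" = "Domain-0" ] && continue
        echo "xm restore /var/xen-saves/$dom.chk"
    done
}
```

The flow would then be: checkpoint all guests, reboot the host, and run the restore commands once it's back, so each guest resumes where it left off instead of taking a hard reset.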

> C) You get what you pay for.

D) Run it yourself if you think you can do better.

I pay for a service and part of that service involves updating the:

  • forums

  • outage announcements

  • blogs

  • twitter

None of which have any useful information.

I have taken 5 mins to email my clients and say "linode hosted servers are to be taken as 'unreliable' until further notice". :-) There. I did better. ;-)

That totally sucked. So what if they were marked down? They were running just fine until being rebooted…

Darn, too bad I queued up a shutdown, a startup and then another shutdown… :)

My linode in Dallas was down and rebooted. But now for some reason my website, which runs Magento, is no longer responding.

So I have tried to see if I can just access or bring up a file, and I'm still not able to get anything. Is anyone else seeing this type of problem?

Once again, we sincerely apologize for these issues. All system administrators are continuing to work to restore full service for all affected customers. Support tickets are being processed as rapidly as possible. Thank you for your patience.

@EtienneG:

Earlier this week, I was bragging about my linode uptime on a mailing list. Oh well, that will teach me! Karma has its way …
Aha! So you're the one to blame for all this. ;-)

I've been a customer for over 4 years, but never posted here before. Before this, my downtime was approximately 0 (perhaps one reboot the whole time) and a few slowdowns from the occasional DDos to the DC. That's quite remarkable for $20/month. This is definitely the worst outage I've seen here, if judging by metrics such as duration and number of hosts affected. One of my hosts is still down. But keep it in perspective people:

a) This is not a frequent occurrence.

b) I'm sure they like it even less than you do. Reputation is important to small business like this.

c) I'm sure they will take steps to make sure it doesn't happen again. However, the nature of a shared virtual hosting environment with geographic (e.g., timezone) diversity means that unscheduled problems (aren't they all?) will always happen at a bad time for somebody. And they have to be able to do scheduled non-reboot service to hosts without coordinating windows with 40 people. As for notification of a patch to a host, sure, they could do that. But it still wouldn't have stopped this. Would you rather they patch the running host, or bring down 40 VMs unnecessarily? Clearly, they've done this kind of maintenance before and it worked fine, except for this time. If it were isolated to one host, no more than 40 people would notice, but this was widespread.

d) They are busy fixing the problem; posting "we're working on it" messages to the forums is counter-productive. They've said just that on IRC.

e) Linode.com clearly states one 9 of uptime. In my experience, it exceeds this SLA. But if you want five 9s, you're going to have to pay more for it. If you're running a revenue generating operation that's that important (e.g., amount of money at stake per minute downtime) then you have either a backup/business continuity plan involving another provider or are paying to colocate your own racks, and even then you still have a backup plan with another provider. In other words, you're spending more money somewhere.

f) Other providers (EC2?) have unscheduled downtime / outages / capacity problems all the time.

All this doesn't make it OK. It's frustrating to have a host go down. But I think it's hard to find as reliable a host for the money.

One other observation, my original linode on a UML host has been rock solid. Anecdotally, I think UML is more stable than Xen. But Xen is faster … go figure.

@pparadis:

Once again, we sincerely apologize for these issues. All system administrators are continuing to work to restore full service for all affected customers. Support tickets are being processed as rapidly as possible. Thank you for your patience.

What would you guess your ETA is for solving this problem for all hosts? How many hosts have you fixed, and how many are left? It's been five hours of downtime for one of my linodes now, and it's still down.

The guy who did this (or the boss who ordered him, if that was the case) should get punished by being forced to work 4 weekend days without pay, and customers whose linodes saw more than one hour of downtime should get one month of service credited to their account.

That way the technician/boss would think twice before 1. not notifying customers of the potentially risky upgrade well in advance, and 2. not testing the update on a few hosts before pushing the update onto all (?) of your hosts at once.

And the company as a whole would get a clear financial incentive not to repeat such foolishness in the future.

That said, I will not move to a different provider just because of today's downtime. I just won't recommend Linode as enthusiastically any more. You're still better than any other alternative I've heard of.

For the future:

Let's say you have 1000 hosts you wish to update. You can never be sure nothing will break. You should therefore test your update on a test-server. If that works you should try it on a live production server. If that works, you should try it on two additional production servers at once. If that works you should try it on 4 more servers at once. If that works, try it on 8, 16, 32 and so on. After 11 tests you would have upgraded 1024 servers. Testing to see if everything is ok before proceeding to update even more servers, eleven times, is a reasonable "waste of time" considering the impact one undiscovered error would have for so many people. If you do upgrades on just a few servers at a time (as suggested above in this message), any problems you miss, we customers will catch because we are so many people.
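The doubling scheme described above is easy to sketch; here is a minimal version in Python (the `upgrade` and `healthy` callbacks are hypothetical stand-ins for the real upgrade command and health check):

```python
def staged_rollout(hosts, upgrade, healthy):
    """Upgrade hosts in doubling batches: 1, 2, 4, 8, ...
    Stop as soon as any upgraded host fails its health check."""
    batch_size = 1
    i = 0
    while i < len(hosts):
        batch = hosts[i:i + batch_size]
        for h in batch:
            upgrade(h)
        if not all(healthy(h) for h in batch):
            # Return only the hosts confirmed good before the bad batch.
            return hosts[:i]
        i += batch_size
        batch_size *= 2
    return hosts

# Example: 1000 hosts, all healthy -> everything upgraded in ~11 batches.
done = staged_rollout(list(range(1000)),
                      upgrade=lambda h: None,
                      healthy=lambda h: True)
assert len(done) == 1000
```

The point of the doubling is exactly what the post argues: a bad update is caught while it affects one host, or a handful, rather than the whole fleet.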

And please create a mailing list anyone can join for updates on planned upgrades, potential problems, progress reports, and so on. Maybe one list per datacenter? I only want to know about updates you do in the Atlanta DC, for example.

Anyway, good luck fixing today's problems. It's 03:26 in Sweden now and I'm hitting the sack.

My server looks like it's running, but I can't connect by SSH, and even ping doesn't work. It's already been more than 5 hours.

I logged in by Lish and I see that my kernel is Ubuntu. What the hell? My server runs Gentoo.

If you are still experiencing issues, please join #linode-is-broken on irc.oftc.net and I'll help you first-come-first-serve. We're working through these one by one, and do appreciate your patience.

Well I came to this thread annoyed and looking for answers.

I didn't necessarily find any answers, but I'm now more annoyed at the customers posting in this thread than I am at Linode.

Thank God I am not the underling or child of any of the perfectionists posting in here. May He have mercy on anyone who has to live up to your standards.

To Linode: I've been with a lot of companies since 2002 and your reliability has been the best I've dealt with to date. Going forward, I would appreciate a little more emphasis on communication as we have people to communicate with as well. It is difficult to do that when we have little or no information to go on. Please consider that - but otherwise good work.

@randrp:

Well I came to this thread annoyed and looking for answers.

I didn't necessarily find any answers, but I'm now more annoyed at the customers posting in this thread than I am at Linode.

Thank God I am not the underling or child of any of the perfectionists posting in here. May He have mercy on anyone who has to live up to your standards.

To Linode: I've been with a lot of companies since 2002 and your reliability has been the best I've dealt with to date. Going forward, I would appreciate a little more emphasis on communication as we have people to communicate with as well. It is difficult to do that when we have little or no information to go on. Please consider that - but otherwise good work.

Another tool who clearly only uses his Linode server as an expensive shell account.

@nsajeff:

Another tool who clearly only uses his Linode server as an expensive shell account.

:D I love that the internet gives people like you a place to be "tough"

@randrp:

@nsajeff:

Another tool who clearly only uses his Linode server as an expensive shell account.

:D I love that the internet gives people like you a place to be "tough"

And you the place to preach to others about what is acceptable to everyone else.

C'mon folks, no need to get personal here.

I was down for a good part of the evening too, but to be fair, this is maybe the second real outage (other than brief network issues, etc) I've had in what… 5 years? That's a pretty good track record to me.

I don't like it either, but these things happen. That's simply the reality of it.

Wow, my Linode went down for around seven hours today. Honestly, it's pretty bad when people who depend on your server being up are angry and demanding an explanation. What's worse is not being able to provide one.

Linode guarantees 99.99% uptime, which works out to roughly 53 minutes a year of expected downtime (nine hours a year would be 99.9%). That's fine. What is not okay is having a major outage (all three of my servers went down) in the middle of the day and, more importantly, not being able to provide detailed information about the outage that I can pass on to my clients.
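For reference, the downtime budget implied by an uptime percentage is simple arithmetic; a quick sketch (the year and month here are plain 365- and 30-day approximations):

```python
# Downtime allowed by an uptime guarantee over a given period.
def allowed_downtime_minutes(uptime_pct, period_days):
    return period_days * 24 * 60 * (1 - uptime_pct / 100)

# 99.99% over a year allows about 53 minutes of downtime;
# nine hours a year corresponds to roughly 99.9%.
per_year = allowed_downtime_minutes(99.99, 365)    # ~52.6 minutes
per_month = allowed_downtime_minutes(99.99, 30)    # ~4.3 minutes
```

Measured over a month, as some posters say the guarantee is, the budget shrinks to about four minutes.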

This whole situation could have been a lot smoother if some administrator had posted perhaps every half an hour with something along the lines of,

These nodes are down. We expect to bring these nodes up in the next half hour. We expect to bring these nodes up in the next hour. And we expect to bring these nodes up in the next two hours. We expect to have the entire system back online within eight hours.

Not only would I have an ETA to give to the people who depend on this server, but I also wouldn't have to check back every five minutes to ascertain whether or not my servers were operable.

CAN SOMEONE TELL US WHAT IS GOING ON!?!?! I AM COMPLETELY DOWN AND THERE IS NO HELP AT ALL FROM SUPPORT?!!!!!

@andrewz:

CAN SOMEONE TELL US WHAT IS GOING ON!?!?! I AM COMPLETELY DOWN AND THERE IS NO HELP AT ALL FROM SUPPORT?!!!!!

My suggestion is for you to go to IRC, #linode or #linode-is-broken at irc.oftc.net, the staff is there 24/7, you might be able to get help faster.

@nsajeff:

@Infinito:

Oh, newbies… take it easy. For me this is the first time there has been a problem (at all, as relatively minor as it is) since I signed up in 2007 (my Linode is on Fremont, btw). And this isn't even a real problem, it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon services. :)

Who are you to tell us to take it easy? Not all of us use our Linodes as expensive shell accounts. I have clients who are pissed, and we got ZERO warning. You try taking it easy when this is costing you money.

I think rather than poking at other members and making assumptions about their usage, it might be better to take a step back and evaluate your own configuration. The reality is that whether you are a sole proprietor or Google, things can and will happen, and if you want to insulate yourself and your customers from those events, you need to plan, develop, and deploy your solution in a manner consistent with your level of concern.

That's not to say things couldn't have been done better today, and I would like to think that they will be in the future.

I was actually working on the site I have here at Linode earlier this evening. It was working fine.

I logged into the forum here and learned of the problems. I thought I had lucked out until I tried to access my site. Only the front static page was visible. I could not log in via ssh. I was able to log in via the Linode console. Mysql is up but not responding through the site.

I am lucky, the site I have here is still in development. Thank goodness.

I have had service since February. This is the first problem I have had. It is just another lesson in server administration. Constant vigilance and backups!

Jeff

You have the right to be pissed, nsajeff, I'm not saying you don't. Maybe I should've kept my mouth shut, but I felt like defending them because they do a fine job, imo. You've read the posts: 5 years of uptime, that's pretty impressive. And Linode is not an expensive shell account to me, but you're right; it's also where I host clients' websites.

Don't think I don't know what being under pressure and annoyance from customers and bosses is like. Next week there will be a real-life 'test drive' of the red5/flash video conferencing/video broadcast/chat tool I (along with two other developers) have been developing for the last 8 months, where 2,700 people (that's right) will be using it at the same time, and I haven't been able to work out all the performance glitches yet.

@randrp:

Thank God I am not the underling or child of any of the perfectionists posting in here. May He have mercy on anyone who has to live up to your standards.

Production systems operations is a tough, uncompromising, and precise business. If you can't take the heat, get out of the kitchen. This is no more a reflection on how we conduct our personal lives than a surgeon's precision in an operating room is on his personal life.

Mistakes happen, but this was not a "mistake". They didn't mistype a hostname or accidentally bounce the wrong box. This was negligence.

If I had failed so spectacularly to adhere to basic industry standards of care at any of the companies I've worked for, there is a very good chance I would have simply and instantly been shown the door. At the very least, I would not have been allowed near production boxes again for months.

Like most I was disappointed about the unexpected downtime today (about 2.5 hours downtime for me), but it happens so rarely that it's not the end of the world. I work in a technical field, and it's understandable that occasionally issues pop up that weren't ever seen before or expected.

What I hope is that once this is fixed, (and the admins get a chance to sleep) is that this "event" becomes a "lesson learned" from a communications standpoint. Since they are much bigger now than they've ever been, they need to find a way to communicate to a large number of customers at once on occasion. (Maybe have someone post every hour the list of affected hosts) Other than this big incident, though, the communication from the staff has been great!

I'm still a satisfied customer, though, and I won't hesitate to recommend Linode to someone else!

Hey guys! Crap breaks.

If you really want a better SLA, then you should probably get one in writing.

Thanks for getting it back up and running, linode staffers.

@kbrantley:

Hey guys! Crap breaks.

If a RAID controller had started puking all over itself or a power supply had popped a capacitor, I assure you I would not be here loudly voicing my displeasure. Based on their past performance in such situations, I'd probably be thanking them for prompt action.

The present situation is not "crap breaks", it is "Linode broke it".

@nknight:

@kbrantley:

Hey guys! Crap breaks.

If a RAID controller had started puking all over itself or a power supply had popped a capacitor, I assure you I would not be here loudly voicing my displeasure. Based on their past performance in such situations, I'd probably be thanking them for prompt action.

The present situation is not "crap breaks", it is "Linode broke it".

It's a good thing that their uptime guarantee (section 5) covers both hardware error and human error then, huh? The part where it is 99.99% over a month as opposed to a year is nice too.

Second, not to be the obvious counter-whiner, but if you're really that pissed then take it to their contact page or ultimately, your wallet - not the boards here. It won't change a thing.

Interesting that this turned into a flaming thread. Oh well.

I absolutely call BS on Linode for the unacceptable uncommunicated maintenance. And the continuing silence after the event. And even now, 8+ hours after the event started, they haven't chimed in with much information.

And let's focus on those issues. Who cares if your Linode is a personal shell (mine is not) or a production system? We all expect to know when something we pay for is broken, and when it'll be fixed. Oh, and a warning before it's going to be broken, if you know it.

My point? Focus. Let's focus on one thing.

Communication.

Let's all demand more and better communication from Linode. I agree that we don't want to pull people away from resolving the issues. And we don't want to be kept in the dark, either.

So let's all demand clear and timely communication from Linode. Whether it's in forums, blogs, tweets, or email, what we want to know is "What's going on?".

We don't care about the medium, we care about the content.

We don't care about the framing, we care about the content.

Please, Linode, communicate with us. I haven't seen a single post from someone say they're going to leave Linode (yet), but why wait till it comes to that point? Why wait till you're in the position of begging for customers before talking to the ones you have?

/ramble

@kbrantley:

It's a good thing that their uptime guarantee (section 5) covers both hardware error and human error then, huh? The part where it is 99.99% over a month as opposed to a year is nice too.

The existence of an SLA and the paltry refunds do not excuse the careless behavior in evidence here.

@kbrantley:

Second, not to be the obvious counter-whiner, but if you're really that pissed then take it to their contact page or ultimately, your wallet - not the boards here. It won't change a thing.

Really? I'd always been under the impression that Linode was one of those rare companies that actually encourages open dialogue with their community, including in their official forums.

If that is not the case, then that is indeed a reason to consider taking my business elsewhere.

To be sure, there are local datacenters where I could put one of my spare 1Us for little more than I pay Linode, and get more resources in the process. My time is valuable enough that I've elected to stay with Linode rather than administer my own hardware, but perhaps I should reevaluate that stance if Linode truly does not care what is posted in their own forums.

@nknight:

My time is valuable enough that I've elected to stay with Linode rather than administer my own hardware, but perhaps I should reevaluate that stance if Linode truly does not care what is posted in their own forums.

Don't care? You clearly don't use IRC, where they are active even right now … I'm pretty sure the standard operating procedure is ticket -> IRC -> forum for problem resolution (i.e., the forum is not for realtime support).

@kg1866:

Don't care? You clearly don't use IRC where they are active even right now …

I was responding specifically to kbrantley's frankly irrelevant assertions.

And no, I generally don't use IRC these days. It is an inefficient and rushed form of communication for which I generally have little use.

@kg1866:

I'm pretty sure the standard operating procedure is ticket -> IRC -> forum for problem resolution (e.g., forum is not for realtime support).

Which is neither here nor there. If you'd care to look at what I've actually posted in this thread, you'll see that not once have I sought any form of support or assistance. You and kbrantley seem to be addressing a strawman.

Please refer to the newly posted service status announcement for additional information on outage resolution efforts. Thank you again for your patience.

Honestly, I've been very happy with Linode since I signed up with them a year and a half ago. However, as someone who was a professional UNIX admin for 5 years, there are two things that really bother me about this outage:

1. There was zero communication about the maintenance being done. I understand Linode has international customers, so they can't schedule downtime that will make everyone happy. On the other hand, it would have been nice to know this was coming so we wouldn't have all been caught unaware.

2. That every server in every datacenter was upgraded at once. Even if this upgrade was tested prior to pushing it out to production, there was no way of knowing for sure that there would be zero problems. Even the best-tested upgrade can go awry. My business isn't big enough yet to have taken a serious hit from what looks like 4 hours of downtime. My plan for when it gets that big was to buy a second Linode in another datacenter, but if upgrading everything at once is going to be the policy going forward, I'll probably end up buying my secondary server from a different VPS provider.

@craversp:

2. That every server in every datacenter was upgraded at once. Even if this upgrade was tested prior to pushing it out to production, there was no way of knowing for sure that there would be zero problems.

This is what bugs me as well. No matter how well it's been tested, at the very least, upgrades should be broken up by datacenter and performed with some gap between them (i.e., more than 5 minutes), so that the upgrade can bake in; in this case, only one DC would have been affected.

To address someone else's comment (too lazy to go back and find it), would I rather someone post a communication or work on the problem. To a sysadmin, the answer isn't intuitive, but it's post a communication. Working for a very large company, I've been angered when my boss has pulled me off of fixing things to communicate about the outage, but in the end, it's the right move. Taking 2 minutes to dash off a communication calms people and gets them off your back to a much greater degree than the cost of that 2 minutes getting the last server up.

@glg:

This is what bugs me as well. No matter how well it's been tested, at the very least, upgrades should be broken up by datacenter and performed with some gap between. ie more than 5 minutes, so that the upgrade can bake in and in this case only one DC would have been affected.

It could be that Linode's structure is such that pushing it to all datacentres at once was vital. But we won't know until they tell us :)

@gnummep-martin:

It could be that Linode's structure is such that pushing it to all datacentres at once was vital. But we won't know until they tell us :)
I sure as hell hope not. If this turns out to be the case, then it adds serious weight to the "your hot standby machine should be with another provider" argument. If this is the case, Linode should (and most likely will) address the problem by changing the structure that required simultaneous upgrades.

@pclissold:

@gnummep-martin:

It could be that Linode's structure is such that pushing it to all datacentres at once was vital. But we won't know until they tell us :)
I sure as hell hope not. If this turns out to be the case, then it adds serious weight to the "your hot standby machine should be with another provider" argument. If this is the case, Linode should (and most likely will) address the problem by changing the structure that required simultaneous upgrades.

Well, perhaps it was the nature of the upgrade, I don't know. And we won't know unless they publish a better explanation of some sort.

My linode was bounced, however I cannot find a support ticket from the linode team in my email or their support system.

….Such as "General Discussion"?

I get notified of every post to "System and Network Status", and I suspect that I'm not the only one.

I want to be notified about status changes, but in the last 12 hours my inbox has been overflowing with notifications about postings that offer no additional information.

@cirric:

I want to be notified about status changes, but in the last 12 hours my inbox has been overflowing with notifications about postings that offer no additional information.

There is a link in the lower left corner of the page saying "Stop watching this topic" - maybe that's what you're looking for ?

A tip to everyone whose Linode is still down: create a support ticket asking them to fix your host. I did that, and 15 minutes later my Linode was alive and kicking. So now I've had a downtime of 18.5 hours, but at least it's up again. I hope someone is bringing food to their admins, as I imagine they've been working nonstop since the problem was first reported.

Good luck everyone!

EDIT:

Btw: This is the response I got from my ticket:

> Hello,

I repaired /etc/inittab and /etc/fstab and issued a boot job – and the Linode appears to have booted correctly. We apologize for this inconvenience.

Please let us know if there's anything else we can assist you with.

Regards,

-Chris

So if you can mount your disk image with the Finnix rescue disk image (create a profile for booting the Finnix disk image and mount your real disk image as the second disk image), you could maybe fiddle around with /etc/inittab and /etc/fstab and perhaps fix the problem yourselves. Then again, if that were generally possible, Linode staff would probably have posted instructions on how we could fix our own Linodes. But someone who knows more than me and still has a nonfunctioning Linode could at least try while waiting.

EDIT2:

You can look at a copy of my inittab and fstab files on the below urls if you want to see a functioning version. I run my Linode on Debian and use ext3 as my filesystem.

http://neo101.org/fstab

http://neo101.org/inittab

EDIT3:

Can someone post a non-working fstab and inittab file? It would be interesting to see what the differences are. Maybe we could make a short how-to for those who would rather fix the problem themselves instead of waiting for the admins.
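One simple way to do that comparison, sketched in Python (the file paths are hypothetical; under Finnix you would point these at the copies on your mounted disk image):

```python
import difflib

def show_diff(good_path, broken_path):
    """Return a unified diff between a known-good config file
    and the possibly-broken copy from the rescued disk image."""
    with open(good_path) as f:
        good = f.readlines()
    with open(broken_path) as f:
        broken = f.readlines()
    return "".join(difflib.unified_diff(good, broken,
                                        fromfile=good_path,
                                        tofile=broken_path))

# Example usage once both files are at hand:
# print(show_diff("fstab.good", "/mnt/linode/etc/fstab"))
```

Any line prefixed with `-`/`+` in the output is one the upgrade changed and a candidate for restoring by hand.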

I'll just throw in my two cents here:

Those who are complaining about communication problems were obviously looking in the wrong places. Their Twitter account got several updates during the downtime, and the staff were on IRC the whole time giving us updates.

From what I gather, Linode was doing a shared library update of the sort that happens every once in a while with no downtime. I assume this type of thing has been done in the past without error and this was just another routine host update. Something about the libraries, or the way they were installed, caused issues on the hosts. They then decided to fix every host one at a time. Having over 500 hosts and a staff as small as Linode's means some downtime.

My complaint isn't so much about communication while they work the issue, they did a decent job there. My complaint is that they don't often announce it ahead of time when they are going to be doing system maintenance. And don't tell me that "it was minor" or "what could you have done in advance anyway". EVERY type of system maintenance, no matter how small, has SOME chance of bringing things to a screeching halt. And even if we couldn't have done anything, letting customers know in advance when there is a maintenance window, at a minimum, buys the sys admins a little working time before people start expecting updates if something goes wrong.

@oliver:

There is a link in the lower left corner of the page saying "Stop watching this topic" - maybe that's what you're looking for ?

That option doesn't appear until you reply to a particular thread. But even before I replied, I was subscribed to the forum. So, I got notification of any replies posted to the forum.

The notification email includes a link to stop watching this forum, but I don't want to disable that, because I'll miss the original postings of announcements by staff.

Yes, it's a limitation of the forum software. I'm just asking that folks use "General Discussion" for general discussion, and reserve "System and Network Status" for actual status updates.

While I admit that my Linode is not much more than an experimental and vanity playground, I do not understand why people are so quick to blame Linode for their loss of business. Linode is not responsible for your revenue stream. You are. Stop making Linode responsible for your business. If you have a mission-critical, ZOMG-CANT-HAVE-ANY-DOWNTIME type of application, then for Pete's sake, have some redundancy. Don't just trust caker for your income. Trust no single point of failure. Get other servers somewhere else. Make it so that if caker and co. suddenly decide it's not worth their time to run Linode anymore, you're not left with nothing. The Linode staff are responsible for their own business. Take care of yours yourself.

Cheers,

Antonio

It's not a question of if the ball will get dropped… Oh, it will, believe you me. As in all dealings with humans, it's a question of when. So be prepared.
