Linode Community Forums
 


Full Moon Related? Downtime

       Linode.com Forum Forum Index -> System and Network Status
Author Message
caker



Joined: 15 Apr 2003
Posts: 2386
Location: Galloway, NJ

Posted: Fri Sep 12, 2003 5:09 pm    Post subject: Full Moon Related? Downtime  

Seriously, a few bizarre things happened today - all at the same time! What gives?!

We had a run-away Linode process on host4 that slowed it to a crawl. It just so happened that we started a handful of migrations off host4 to host8 around the same time. After a while and a lot of convincing that I should restart the Linode in question, I did. That fixed host4, and the Linodes on host4 looked a lot happier.

By that time the migrations (using scp, which was a bad idea in the first place because of performance) were hosed, with only one migration completing. I un-migrated the rest back to where they were. "Something" also caused host8 to either drop off the network or freeze completely. I wasn't able to get console access, so I had to reboot host8 :(

In the process of getting host8 rebooted, "someone" (*whistle*, looks around innocently) managed to power cycle host6. D'oh!

I'm disappointed because I thought the issue with host4 was hardware related. I'm starting to believe now that it's a software/kernel condition, since both host4 and host8 have had similar freezes. I need to capture an oops. Why haven't I already, you ask?

The one remote-console unit I have at ThePlanet is quirky. It only stays alive for a few minutes at a time and then drops off the network. Baytech wants me to send the bad module in for repair work, but I just bought a bunch of new console units for a new rack I'm building. I'm awaiting shipment of the new console units, and I'm sending two of the units up to ThePlanet as a replacement.

The host4 issue also caused a few shutdowns to not complete, so I had to follow up on those, as well.

With regard to shutdowns hanging, there seems to be a bug that's been hit about 7 or 8 times over the last few days. The job gets stuck because a call to uml_mconsole (a management utility for UMLs) is hanging.
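For anyone scripting around a call that can hang like this, here is a minimal sketch of one common guard: run the command with a timeout so the calling job can give up and report instead of getting stuck. It is not how the Linode job system works; `run_with_timeout` is a hypothetical helper name, and the command in the demo is a stand-in sleep rather than a real uml_mconsole invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments).

    Returns (returncode, stdout) if the command finishes in time,
    or (None, None) if it is still running after timeout_s seconds.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so nothing is left behind to clean up here.
        return None, None
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Stand-in for a management call that never returns (assumption:
    # the real command and its arguments would be substituted here).
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    rc, out = run_with_timeout(hung, 1.0)
    if rc is None:
        print("command timed out; flag the job for follow-up")
```

A caller that gets `(None, None)` back can mark the shutdown job as failed and move on rather than blocking everything queued behind it.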

I'm also questioning the stability of the UML kernels I released. I now know (after the fact) that there were bugs in 2.4.22-1um and 2um. Hopefully 2.4.22-3um (linode9) is working fine. Just so you guys don't think I'm a total flake, I do perform a kernel compile inside the new UML kernels as a test. Lesson learned: If it ain't broke, don't fix it. I'll keep making newer kernels available, but I won't be pointing "Latest 2.4 Kernel" to anything that you guys don't approve of first.

Our 2.4.21 kernel has uptimes of months (and counting).

I won't touch anything else today, I promise.

-Chris
adamgent



Joined: 23 Jun 2003
Posts: 261

Posted: Fri Sep 12, 2003 5:23 pm    Post subject:  

At least you only restarted one machine by mistake.

It is not as bad as logging in to the wrong remote power unit and taking offline an array of UPS systems and all of the attached devices (100+).

Didn't realise what had happened until I started getting emails, phone calls and pages telling me that the servers were down, not to mention all the other people who got the alerts as well.

I learnt two things from that. First, that the monitoring service worked, and worked well.

Second, have different passwords for everything, so you don't log into something by mistake, and actually look at what you're doing.

Adam
caker



Joined: 15 Apr 2003
Posts: 2386
Location: Galloway, NJ

Posted: Fri Sep 12, 2003 5:29 pm    Post subject:  

Thanks Adam, other people's suffering always makes me feel better :-)

This will forever hold a special place in my memory, along with the time I sshed into a machine (which was not connected to remote console) and ran "/etc/rc.d/network stop".

Guess what plug #8 points to on my RPC unit? Host6, of course! With everything that was going on, and that I only have a window of a few minutes when going through my console server, I must have lost total brain function.

I also almost got into an accident (on the bike, no less) when I went out for lunch an hour ago.

I'm just going to sit in the corner for a while.

-Chris
adamgent



Joined: 23 Jun 2003
Posts: 261

Posted: Fri Sep 12, 2003 5:35 pm    Post subject:  

I was lucky: the system was not yet live, although if I had done it 2 days later it would have been.

It was put down as testing the response of the engineers to a complete loss of all power to the system and how they recovered.

Everyone has bad days, although I tend to have bad months, especially when doing web development.

Adam
bji



Joined: 27 Aug 2003
Posts: 182

Posted: Fri Sep 12, 2003 6:47 pm    Post subject:  

I certainly don't want to add to your worries but I have to say that this episode has me a little concerned, on a few fronts.

First - how does a single Linode bring an entire system to a crawl? I thought that each Linode can only add a maximum of 1 to the load average of the system? So the first other Linode, or host server process, to start contending with the runaway Linode will still get roughly 50% of the CPU, right? I've sat at plenty of Linux systems with a single process pegged at 100% and not even noticed. How does a single Linode manage to hog so much CPU that the system as a whole (and presumably all Linodes on that system) is seriously adversely affected?

Second - what would happen if Chris really were to have an accident on the way to lunch? I certainly don't mean to be morbid by suggesting such a possibility (I ride a CBR600F4i myself, I wouldn't wish an accident on anyone) ... but, imagine the scenario where Chris drops his bike on the way to lunch, has to go to the hospital with a broken leg, and in the meantime host8 hangs due to whatever unsolved problems still remain. What happens to all of the Linodes that are on host8? It would make me feel so much better about my Linode if there were just 1 other person with the ability to manage the systems ...

I'm getting very close to paying for a yearly contract for my Linode because I've been very happy with it so far and would really like the extra disk space. But threads like these start giving me cold feet when I think about my server being unavailable due to the lack of redundancy in the personnel department at Linode.com ...
alphonso



Joined: 03 Sep 2003
Posts: 19

Posted: Fri Sep 12, 2003 7:12 pm    Post subject:  

Don't blame the full moon when there's a simple, scientific answer:
Mercury is retrograde (until sometime next week) :?
bji



Joined: 27 Aug 2003
Posts: 182

Posted: Sat Sep 13, 2003 9:43 pm    Post subject:  

Any comments on the issues I have raised? Anyone?
caker



Joined: 15 Apr 2003
Posts: 2386
Location: Galloway, NJ

Posted: Sat Sep 13, 2003 10:40 pm    Post subject:  

bji wrote: I certainly don't want to add to your worries but I have to say that this episode has me a little concerned, on a few fronts.

First - how does a single Linode bring an entire system to a crawl? I thought that each Linode can only add a maximum of 1 to the load average of the system? So the first other Linode, or host server process, to start contending with the runaway Linode will still get roughly 50% of the CPU, right? I've sat at plenty of Linux systems with a single process pegged at 100% and not even noticed. How does a single Linode manage to hog so much CPU that the system as a whole (and presumably all Linodes on that system) is seriously adversely affected?
That was not an ordinary situation (something ran amok). The UML process was disk i/o bound, not CPU bound. Just like CPU time, disk access is scheduled by the host kernel, and no one has come up with a magic scheduler that works for every type of workload. Situations like this one (the first so far) are either reported very quickly or caught by monitoring, and can be dealt with.

bji wrote: Second - what would happen if Chris really were to have an accident on the way to lunch? I certainly don't mean to be morbid by suggesting such a possibility (I ride a CBR600F4i myself, I wouldn't wish an accident on anyone) ... but, imagine the scenario where Chris drops his bike on the way to lunch, has to go to the hospital with a broken leg, and in the meantime host8 hangs due to whatever unsolved problems still remain. What happens to all of the Linodes that are on host8? It would make me feel so much better about my Linode if there were just 1 other person with the ability to manage the systems ...

I'm getting very close to paying for a yearly contract for my Linode because I've been very happy with it so far and would really like the extra disk space. But threads like these start giving me cold feet when I think about my server being unavailable due to the lack of redundancy in the personnel department at Linode.com ...
Linode.com is new, just a few months old (launched June 16th, 2003). I have worked extremely hard to develop it, market it, and provide great customer service. My vision for Linode is real, is proven, and is on track. Financially, Linode is stable and growing, but I don't have the option of making hires at this time. It will be a few months before I start looking. All I have to offer is the reputation of the level of service Linode.com provides, and my commitment to Linode. However, I realize you do have a choice, and I can appreciate your very valid concern.

I have every incentive to make sure the uptime and service is as good as possible. I will work on improving contingency plans to minimize the risk of "Bus vs. Chris" situations.

Thanks for your support,
-Chris
zan



Joined: 16 Jul 2003
Posts: 30
Location: Australia

Posted: Sun Sep 21, 2003 9:09 am    Post subject:  

Well, I'm a lot happier that Chris didn't try to spin us some story like some other hosting companies that I have been with. He has not left us in the dark, and that should be applauded.