 |
Linode.com Forum Linode Community Forums
|
| Author |
Message |
sarge
Joined: 19 Dec 2004
Posts: 58
|
| Posted: Mon Dec 20, 2004 3:49 am Post subject: Linode crashed (powered off state), twice within 24 hours |
|
|
OS: Debian Sarge
Kernel: Latest 2.4 Series (2.4.28-linode37-1um)
Host: host36
Plan: Linode96
This is the 2nd time this linode crashed in the past 24 hours.
The job queue doesn't show any shutdown or reboot request. It just shows the linode is powered off.
Last crash occurred around 2:20 am CST (10 minutes ago). Prior one happened during the day.
I'll try to reproduce the problem, any advice in the meantime would be appreciated--particularly advice on what to look for in the logs produced by syslog-ng.
Also, is it possible to automatically power-on the Linode when something like this happens again?
P.S. Sorry if I'm light on info, it's my turn to crash....zzz |
|
sarge
Joined: 19 Dec 2004
Posts: 58
|
| Posted: Mon Dec 20, 2004 4:08 am Post subject: More info |
|
|
I think I found the cause (at least circumstantial evidence)--but in theory, Linux should not be this fragile if this is reproducible.
It seems the crash happened at the exact same time as an hourly cron job that issues a "shorewall refresh" after adding dshield.org blacklist entries.
Perhaps this causes a crash if there is traffic (such as ssh tunnelling) at the time of the shorewall refresh. Since this is an hourly cron job and it doesn't crash every hour, there must be some other factor such as network traffic involved with this. And at most, I only have up to ~40 blacklist entries.
Anyway, it would be nice if the Linode automatically powered on when it crashes like this. |
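To picture the kind of hourly job described above, here is a minimal sketch. It is not sarge's actual script. It assumes the new entries have already been fetched into a plain file with one IP address per line; the function name `merge_blacklist` and both file arguments are hypothetical. It appends only addresses that are not already listed, so the blacklist does not collect duplicates every hour.

```shell
#!/bin/bash
# Hypothetical helper: append addresses from NEW_FILE to BLACKLIST_FILE,
# skipping blank lines, comment lines, and addresses already present.
# Prints the number of addresses that were added.
# Usage: merge_blacklist NEW_FILE BLACKLIST_FILE
merge_blacklist() {
    local new_file=$1 blacklist_file=$2 added=0 ip
    touch "$blacklist_file"
    while IFS= read -r ip || [ -n "$ip" ]; do
        # Skip blank lines and lines starting with '#'.
        case $ip in
            ''|'#'*) continue ;;
        esac
        # -x: match the whole line, -F: fixed string (dots are literal), -q: quiet.
        if ! grep -qxF -- "$ip" "$blacklist_file"; then
            printf '%s\n' "$ip" >> "$blacklist_file"
            added=$((added + 1))
        fi
    done < "$new_file"
    echo "$added"
}
```

A cron job could then call something like `merge_blacklist /tmp/new-ips.txt /etc/shorewall/blacklist` and run "shorewall refresh" only when the printed count is non-zero. Both paths are assumptions, not details taken from this thread.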
|
bji
Joined: 27 Aug 2003
Posts: 182
|
| Posted: Mon Dec 20, 2004 8:58 am Post subject: Re: More info |
|
|
sarge wrote: I think I found the cause (at least circumstantial evidence)--but in theory, Linux should not be this fragile if this is reproducible.
It seems the crash happened at the exact same time as an hourly cron job that issues a "shorewall refresh" after adding dshield.org blacklist entries.
Perhaps this causes a crash if there is traffic (such as ssh tunnelling) at the time of the shorewall refresh. Since this is an hourly cron job and it doesn't crash every hour, there must be some other factor such as network traffic involved with this. And at most, I only have up to ~40 blacklist entries.
Anyway, it would be nice if the Linode automatically powered on when it crashes like this.
I'll bet that with a little bit of lish scripting, and another host permanently connected to the network, you could accomplish this. You could write a script that, every minute, attempts to connect to your Linode via lish (ssh you@hostXX.linode.com), detects whether the Linode is down, and issues a boot/reboot command if it is. |
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Mon Dec 20, 2004 1:24 pm Post subject: |
|
|
I've had a few reports of mysterious crashes with no additional information in the console log with Linodes on a host machine that is running the 2.6.8.1-4 kernel (check cat /proc/cpuinfo), which host36 is.
However, when this crash happens, please check Lish's console log and provide me that information if you can.
I'll have a new host online today with a kernel we've been testing for a while. I'd like to move your Linode over to it and see if that eliminates the problem. Up for that?
Thanks,
-Chris |
|
dmuench
Joined: 30 Oct 2003
Posts: 51
Location: Rochester, NY
|
| Posted: Mon Dec 20, 2004 2:34 pm Post subject: |
|
|
I've been having similar problems, and have been working with Chris to try to solve them. In the meantime, here is a watchdog script that I use. I run it from a linux box at my house (via cable modem). It checks every 15 minutes to make sure it's booted. Even if it makes a mistake, the worst it will do is issue a boot command to an already running linode (which does nothing).
Replace <linode> with the name of your linode, <username> with your LPM username, and hostxx with the host your linode is on. This script is dependent on you having set up your ssh key in lish, and having the key loaded into an agent on the machine you're running the watchdog from. If you can "ssh username@hostxx.linode.com" and get the lish prompt without being prompted for a password, you should be set.
Dave
Code:
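#!/bin/bash
#
# Simple watchdog to make sure <linode> is running, and if not boot it.
# Fill in the three values below before running. They are quoted so the
# shell does not treat the < and > in the placeholders as redirections.
#
LINODE="<linode>"
LISH="<username>@hostxx.linode.com"
MAILTO="youremail@hostname.com"

while true; do
    echo "Checking $LINODE at: $(date)"
    # No "OK Linux" in the output of lish's "version" command is treated as down.
    if ! ssh "$LISH" version | grep "OK Linux"; then
        echo "$LINODE down!"
        echo "$LINODE is down, booting up" | mailx -s "$LINODE down" "$MAILTO"
        ssh "$LISH" boot
    fi
    sleep 900   # 15 minutes between checks
done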
|
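As a variation on the watchdog above, here is a hedged sketch of a one-shot check that could be run from cron instead of sleeping in a loop. The function name `check_once` is hypothetical, and `LISH` uses the same placeholder form as the script above. Like that script, it simply assumes that "OK Linux" in the output of lish's "version" command means the Linode is running.

```shell
#!/bin/bash
# One-shot variant of the watchdog: check once, boot if needed, then return.
# LISH is a placeholder; set it to your own <username>@hostxx.linode.com.
LISH="${LISH:-<username>@hostxx.linode.com}"

# Returns 0 if the linode looked up, 1 if a boot command was issued.
check_once() {
    # Same test as the looping script: look for "OK Linux" in the
    # output of the lish "version" command.
    if ssh "$LISH" version | grep -q "OK Linux"; then
        return 0
    fi
    echo "linode down at $(date), issuing boot"
    ssh "$LISH" boot
    return 1
}
```

A wrapper script that calls `check_once` could be scheduled with a crontab entry such as `*/15 * * * * /path/to/check_linode.sh` (the path is made up). Note that cron jobs do not inherit a running ssh-agent, so the key would have to be made available to the job some other way.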
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Mon Dec 20, 2004 9:26 pm Post subject: |
|
|
caker wrote: I'll have a new host online today with a kernel we've been testing for a while. I'd like to move your Linode over to it and see if that eliminates the problem. Up for that?
The new server is online -- shoot me a support ticket and I'll set up the migration for you.
-Chris |
|
sarge
Joined: 19 Dec 2004
Posts: 58
|
| Posted: Mon Dec 20, 2004 9:40 pm Post subject: Anyone thinking of signing up should see this thread |
|
|
Chris,
I'll try to reproduce some of the same conditions that were present during the last crash and take a closer look at the existing logs.
I appreciate the offer and I'm up for moving to a new host (sooner the better). The timing is great since we're not live yet.
Also, I prefer 2.4 kernels but am open to trying the 2.6 series as soon as you think it is ready for production use with Debian Sarge. I got interested when you mentioned the cryptoloop limitations in 2.4 kernels.
Dave,
Thanks for the script!!! Every bit helps as I'm working insane hours.
Everyone else,
So far, the Linode experience has been Chris (caker) proving that he really knows his stuff (unlike other hosting companies with clueless staff) and very helpful fellow customers in both IRC and forums. But the thing I love most so far is the ability to fix almost everything I can possibly mess up without waiting for support staff--even screwed up sshd configs or init.d scripts. |
|
sarge
Joined: 19 Dec 2004
Posts: 58
|
| Posted: Mon Dec 20, 2004 11:00 pm Post subject: Able to reproduce power-off crash |
|
|
Hi,
I was able to reproduce the power-off crash on the new server after migrating.
I don't know if the crash is due to fragile code in iptables, shorewall, kernel 2.4, openssh, etc. or something else. Maybe these are red herrings, and the crash is caused by some subtle network error that generally doesn't cause problems except under special conditions like the following test.
1. I had another hosted test server use siege with 40 concurrent users on the linode server.
2. The test server and the linode had an ssh tunnel between them for these requests.
3. The linode had an ssh tunnel between it and a private server (adsl at home office) which generated the html content.
This test runs with a bunch of free RAM and idle cpu available on both the test server and linode server.
The crash occurs when I perform a 'shorewall refresh' command on the linode during the test, which adds a couple dozen blacklisted ip addresses using iptables. This command does not always bring down the server on the first try--it might have to be executed a few times to make the crash happen. |
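Since the refresh may need to be issued several times before the crash shows up, a small helper like the one below could repeat it while the load test is running. This is only a sketch; the name `repeat_cmd` and its arguments are hypothetical and not part of sarge's test setup.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to MAX times, pausing PAUSE seconds
# after each run, and stop early (returning 1) if the command itself fails.
# Usage: repeat_cmd MAX PAUSE command [args...]
repeat_cmd() {
    local max=$1 pause=$2 i
    shift 2
    for ((i = 1; i <= max; i++)); do
        echo "run $i of $max: $*"
        "$@" || return 1
        sleep "$pause"
    done
}
```

On the Linode this might be invoked as `repeat_cmd 10 5 shorewall refresh` while siege runs on the test server. The count and pause are arbitrary values, and if the Linode powers off the loop simply dies with it.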
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Mon Dec 20, 2004 11:10 pm Post subject: |
|
|
Ok. And looking at your console logs, I can see it just terminates without any debugging information.
Since you can reproduce the problem, would you mind trying 2.6.9-linode9? You will need to "mv /lib/tls /lib/tls-disabled" before booting into 2.6.9. My hope is that 2.6.9 will provide *some* console output relating to the problem.
-Chris |
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Mon Dec 20, 2004 11:29 pm Post subject: |
|
|
Actually, if you wouldn't mind popping into the IRC channel, I might have a way to get some more information out of the kernel (thanks to Jeff Dike).
http://www.linode.com/forums/viewtopic.php?t=588
-Chris |
|
sarge
Joined: 19 Dec 2004
Posts: 58
|
| Posted: Sat Jan 08, 2005 5:52 am Post subject: |
|
|
UPDATE:
The problem was reproduced using both 2.4 and 2.6 kernels.
Caker was kind enough to stay up late to monitor the host machine and help try to debug this problem. Ultimately, it is thought to be a bug in UML that was most likely introduced in a recent version.
Around December 21, at caker's request, I provided a crashkit as a standalone Linode disk image so that he could reproduce the problem.
He offered to pass the info along to the UML dev team after taking a look at the crashkit.
As-is, the crashkit requires 2 external machines to crash the Linode. It might be possible to create a much simpler crashkit, but my schedule has become swamped (hence the message at 4:45am). I guess if another customer encounters this bug, they can attempt to create a simpler crashkit to help figure this out.
Given that the Linode completely powers off unexpectedly, I'm hoping this nasty UML bug gets fixed before it causes loss of valuable data. |
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Tue Jan 11, 2005 7:49 pm Post subject: |
|
|
Just an update:
Using sarge's crashkit as a jumping off point, I've been able to reproduce this bug very easily:
http://www.theshore.net/~caker/uml/crashkit-no-console-output/
Quote: All you need is an UML guest assigned an IP on your network, iptables, and
the two files below in the same directory. Run the script, while ping-flooding
the UML's IP address from the host, or another machine on your network.
This script works best with a 2.6 based UML. I can recreate the crash easily using
2.6.9-linode9 (based on -bb4), also available on this website.
I've forwarded the info to Jeff, so hopefully we'll see a fix soon enough.
Thanks,
-Chris |
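The crashkit itself lives at the URL above and is not reproduced in this thread. As a rough illustration of the kind of script the quote describes (iptables plus a file of IPs), here is a hypothetical stand-in, not caker's actual script. The function name `block_ips` and the overridable `IPTABLES` variable are assumptions; the "Blocking ..." message imitates the output visible in Jeff's IRC paste later in the thread.

```shell
#!/bin/bash
# Hypothetical stand-in for a crashkit-style script: add one DROP rule per
# address listed in a file. IPTABLES can be overridden for a dry run.
IPTABLES="${IPTABLES:-iptables}"

# Usage: block_ips FILE   (FILE holds one IP address per line)
block_ips() {
    local file=$1 ip
    while IFS= read -r ip || [ -n "$ip" ]; do
        # Skip blank lines.
        [ -z "$ip" ] && continue
        echo "Blocking $ip..."
        # -A INPUT: append to the INPUT chain; -s: source address;
        # -j DROP: silently discard matching packets.
        $IPTABLES -A INPUT -s "$ip" -j DROP
    done < "$file"
}
```

Setting `IPTABLES="echo iptables"` before calling `block_ips ips.txt` prints the rules instead of applying them, which is a safe way to check the file first. Applying the rules for real needs root inside the UML guest.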
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Tue Jan 11, 2005 9:00 pm Post subject: |
|
|
Quote: 19:39 < caker> jdike: http://www.theshore.net/~caker/uml/crashkit-no-console-output/
19:41 < caker> jdike: very easy to reproduce -- just a script, another file with IPs, and ping flood the UML's IP
19:51 < caker> time for food .. brb
20:14 < jdike> Blocking 222.64.190.133...
20:14 < jdike> Jan 11 20:06:28 usermode syslogd: recvfrom unix: Resource temporarily unavailable
20:14 < jdike> Segmentation fault
20:14 < jdike> Looking like a stack overflow
20:16 < jdike> and that doesn't surprise me too much
20:27 < jdike> When caker comes back, someone tell him that his problem smells heavily of a stack overflow
20:28 < jdike> Probably easily fixed, although I need to think about it some
20:35 < caker> jdike: thanks for taking a look at that
20:37 < jdike> caker: For some reason, you can't make the idle thread stack overflow, or at least not badly enough to cause a crash
20:37 < jdike> caker: however when its doing something else in the kernel, and I'm guessing iptables is generating deep stacks on its own, it does cause a crash
20:38 < jdike> caker: theory only, I haven't poked at it at all yet
20:38 < caker> jdike: ok -- glad it was easy enough for you to reproduce
20:38 < jdike> caker: yup
20:38 < jdike> caker: the right test case is a wondrous thing
20:40 * jdike disappears again
|
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Wed Jan 12, 2005 12:33 pm Post subject: patch |
|
|
Jeff is QUICK:
http://marc.theaimsgroup.com/?l=user-mode-linux-devel&m=110554794608577&w=2
Code:
This patch fixes a long-standing problem in skas mode process creation. Chris
Aker has been seeing it at linode, and found a way of reproducing it. Once
I spotted the bug, I found an easier way:

    ping flood the UML from the host while running
        while true; do ls > /dev/null; done

In 10-15 seconds, UML will simply exit back to the shell with a segfault,
no panic, no output, no nothing.

When UML sets up the kernel stack for a new process, it sends itself a
SA_ONSTACK signal with the signal stack being the new kernel stack. It
calls setjmp there to set up a context that it can longjmp to when the new
process is run for the first time.

The problem was that, while signals were blocked during this, they were
re-enabled before SA_ONSTACK was disabled. Thus, a signal arriving at the
wrong time, between signals being turned on and SA_ONSTACK being disabled,
would cause the signal to be handled on the stack, destroying the context
that had been set up there.

When the new process ran, it would longjmp to this trashed stack, and UML
would die.

The fix is obvious:

Index: 2.6.10/arch/um/kernel/skas/process.c
===================================================================
--- 2.6.10.orig/arch/um/kernel/skas/process.c 2005-01-12 11:17:22.000000000 -0500
+++ 2.6.10/arch/um/kernel/skas/process.c 2005-01-12 11:18:03.000000000 -0500
@@ -323,9 +323,10 @@
block_signals();
if(sigsetjmp(fork_buf, 1) == 0)
new_thread_proc(stack, handler);
- set_signals(flags);
remove_sigstack();
+
+ set_signals(flags);
}
void thread_wait(void *sw, void *fb)

Jeff
I'll be releasing kernels later today that fix this and the recent local root exploit vulnerability.
-Chris |
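For anyone trying Jeff's simpler reproduction, the endless `while true` loop from his message can be given a time limit so a test run ends on its own. The sketch below is a bounded variant; the function name `spawn_loop` is hypothetical, and it relies on bash's built-in `SECONDS` counter.

```shell
#!/bin/bash
# Bounded version of the loop from Jeff's message: run "ls" repeatedly
# for roughly SECS seconds, then print how many times it was started.
# Usage: spawn_loop SECS
spawn_loop() {
    local secs=$1 count=0 end
    # SECONDS is bash's count of whole seconds since the shell started.
    end=$((SECONDS + secs))
    while [ "$SECONDS" -lt "$end" ]; do
        ls > /dev/null      # each iteration creates a new process
        count=$((count + 1))
    done
    echo "$count"
}
```

Inside the UML guest this could be run as `spawn_loop 30` while the host flood-pings the guest's address (for example with `ping -f`, which normally needs root). The 30 seconds is an arbitrary choice based on the 10-15 second figure in Jeff's message.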
|
caker
Joined: 15 Apr 2003
Posts: 2404
Location: Galloway, NJ
|
| Posted: Thu Jan 13, 2005 1:12 pm Post subject: |
|
|
Update: I can still reproduce this bug. Jeff is having another look.
-Chris |
|