Advice on tracking 'random' server hangs....

Hi,

Had a couple of odd server hangs recently and wonder if anyone had any good tips on tools to use to diagnose them.

The server's running Debian Lenny, Apache, MySQL, PHP and Postfix/Dovecot for mail handling.

The situation is that CPU usage hits 100% and then the server locks up - memory and disk seem to be running fine.

This was odd because the linode monitoring showed that the server was up, yet the shell access through the linode website got no further than lish: it never reached the login prompt it would normally give if the server were up.

I'm assuming that the server was responding to pings, but everything else was locked up.

In the past when I've had problems it's been PHP and I've been able to do the following:

1) ensure max execution time limit set in php.ini

2) check using ps aux / top to see what was stuck (httpd in previous cases)

3) having set ExtendedStatus On in httpd.conf I could use lynx localhost/server-status through the shell on linode to see detailed diagnostics of apache and find the culprit script

Obviously not being able to get any kind of shell access this was not possible and all I could do was restart.

What I'd like to know is:

1) what useful things could I be logging so that when I grep my logfiles after restarting I could see what had gone wrong

2) how could I force a hard %CPU execution limit on all processes from a single executable - i.e. ensure that if it is PHP then it can't take 100% CPU, so that I'll always be able to shell in.
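On point 2, a minimal sketch of one partial approach, which is not a true percentage cap: the shell's `ulimit -t` limits the total CPU seconds a process may use, and `nice` lowers its scheduling priority so that a login shell is more likely to get a look-in. The name `run_capped` and the values are illustrative assumptions. For Apache/PHP the limits would have to be applied wherever the daemon is started, which this sketch does not cover.

```shell
# run_capped SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND at reduced priority with a cap on
# total CPU seconds. This limits CPU *time*, not a percentage of CPU.
run_capped() {
    secs="$1"
    shift
    (
        # The limit applies to this subshell and everything it execs.
        ulimit -t "$secs" || exit 1
        # nice -n 10 lowers the scheduling priority; the kernel ends the
        # process once it has used up its CPU-time allowance.
        exec nice -n 10 "$@"
    )
}
```

Something like `run_capped 30 some-command` would then stop `some-command` after roughly 30 seconds of CPU time, however long it has been running on the wall clock.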

I've googled various things (and, as I said above, the ExtendedStatus was useful), and my guess is that there are many ways to do this. So I wonder if anyone could share their favourite quick tips / pointers to ways to limit damage and log excessive resource utilisation?

Best wishes

Peter

6 Replies

One thing you can try is leaving 'top' running on your lish console. When your machine locks up, log in to lish and check which process is using the CPU.

I think it's more likely that you're OOMing rather than the CPU maxing out, because you've got four virtual cores; full CPU usage would show up as 400%, not 100%. It also means that you'd need four processes hammering the CPU to lock up the node, and even then merely maxing out a CPU doesn't cause the machine to lock up.
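To check the out-of-memory theory directly, here is a small sketch that reads the kernel's memory counters from `/proc/meminfo`. The function name `mem_summary` is an illustrative assumption. Very little free memory together with exhausted swap would point towards OOM rather than CPU.

```shell
# mem_summary [FILE]
# Prints total/free RAM and swap in MB. FILE defaults to /proc/meminfo,
# whose values are reported in kB.
mem_summary() {
    awk '/^(MemTotal|MemFree|SwapTotal|SwapFree):/ {
        printf "%s %d MB\n", $1, $2 / 1024
    }' "${1:-/proc/meminfo}"
}
```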

Check /var/log/syslog and /var/log/messages for kernel warnings. You can also install munin: http://library.linode.com/server-monitoring/munin/debian-5-lenny

What kernel version are you running? I had a similar problem with random hangs: for some reason the clock stopped, which caused all kinds of hell, and switching kernel fixed it. If you have a lot of repeated timestamps at the end of your logs, it could be that you had the same issue.
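A rough way to look for repeated timestamps is to count the most frequent ones near the end of a log. This is only a sketch: the name `repeated_stamps` is hypothetical, and it assumes the traditional syslog prefix of month, day and time in the first three fields. A burst of messages in the same second can also have ordinary causes, so treat the result as a hint.

```shell
# repeated_stamps LOGFILE [LINES]
# Shows the five most frequent timestamps among the last LINES lines
# (default 200). Assumes lines start with "Mon DD HH:MM:SS".
repeated_stamps() {
    tail -n "${2:-200}" "$1" |
        awk '{ print $1, $2, $3 }' |
        sort | uniq -c | sort -rn | head -n 5
}
```

For example, `repeated_stamps /var/log/messages 500` looks at the last 500 lines; each output line is a count followed by the timestamp.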

Hi guys,

Thanks for the replies. It happened again yesterday so I'm really focusing on it again.

@Guspaz - sorry I should have mentioned that it does max out at 400% (I meant 100% of the physical server, but you're right, that shows as 400% on the linode web interface).

Leaving top running might not be practical because it happens about once a fortnight, and when I spot it I can't even log in through the lish console. Yesterday it at least came up with a login prompt, but hung when the password was typed in.

What I'd like is to be able to limit any apache/php/etc. processes to, say, 80% of total CPU, then at least I'd be able to shell in through lish or directly and see what's going on.

@obs - Thanks for the suggestions - the link to linode's library with a section called 'server-monitoring' made me smile. I tried Google when I should have first tried Linode ;-)

As for kernel version, uname -a tells me:

2.6.18.8-linode22 #1 SMP Tue Nov 10 16:12:12 UTC 2009 i686 GNU/Linux

which I think is Linode's latest Debian build.

Looking at /var/log/messages I can see a boatload of messages at the same timestamp, but they are all related (apparently) to the hard reboot we did. Just before that there are some telltale memory ones.

Every day there is the usual

kernel: imklog 3.18.6, log source = /proc/kmsg started.

rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="1112" x-info="http://www.rsyslog.com"] restart

But yesterday morning this was followed an hour or so later by a stream of kernel memory ones, e.g.:

<:0 all_unreclaimable? no

…..

oom-killer: gfp_mask=0x201d2, order=0

[] out_of_memory+0x1c9/0x200

[] __alloc_pages+0x28f/0x310

[] __do_page_cache_readahead+0x139/0x2f0

…..

HighMem per-cpu: empty

So OOM seems to be the final thing that killed it, with the CPU maxing out because some process was hanging as it couldn't get any memory alloc-ed? Does that sound reasonable?
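A quick way to find these entries after a reboot is to grep the logs for the OOM killer. This is a sketch only: the name `oom_events` is hypothetical, and the exact wording of kernel messages varies between versions, so the pattern is an assumption based on the `oom-killer` line quoted above.

```shell
# oom_events LOGFILE...
# Prints every line mentioning the OOM killer, prefixed with file name
# and line number. Exit status is non-zero when nothing matches.
oom_events() {
    grep -H -n -i -E 'oom-killer|out of memory' "$@"
}
```

Running `oom_events /var/log/messages /var/log/syslog` gives the line numbers, which makes it easier to see what was logged just before the first hit.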

Looking in /var/log/syslog at the time /var/log/messages complained (7:24 am) I can see something extremely suspicious… quite a few of these:

postfix/local[24786]: warning: maildir access problem for UID/GID=33/33: create maildir file /var/www/Maildir/tmp/1276153966.P24786.li62-252.members.linode.com: Permission denied

postfix/local[24786]: warning: perhaps you need to create the maildirs in advance

I'll go and check out the server-monitoring stuff you suggest, but my working hypothesis now is that I've royally screwed up the postfix configuration….. I'm sure that that would be quite capable of bringing the server to its knees….

Thanks again to both of you for replying to my post. Sorry about the long reply. What I really like about this forum is that you get a chance not just to fix your screw-ups, but learn something along the way :-)

Cheers

Peter

Something chewed up your RAM, hence the OOM error. It could be postfix, though I'm not sure why postfix is trying to write to /var/www; mine doesn't do that.

If in doubt purge postfix (not just uninstall) then reinstall and re-create your config files. If you're only sending emails from postfix then it should only take around 10-15 mins.

You don't need to log in via LISH. Just run top and leave LISH logged in when you disconnect. You should then be able to reconnect (since LISH is not hosted by your linode) and see the results of top. If you're OOMing, I'd suggest sorting by RAM usage (capital M).
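For a one-off listing outside of top, something along these lines gives the same ordering. It is a sketch that assumes the procps version of `ps` (the one Debian ships), and the name `top_procs` is made up for illustration.

```shell
# top_procs [mem|cpu] [COUNT]
# Lists the COUNT (default 10) processes using the most memory or CPU,
# plus the ps header line. Assumes procps-style "ps --sort".
top_procs() {
    case "${1:-mem}" in
        mem) key=-rss ;;
        cpu) key=-%cpu ;;
        *) echo "usage: top_procs [mem|cpu] [count]" >&2; return 2 ;;
    esac
    ps aux --sort="$key" | head -n "$(( ${2:-10} + 1 ))"
}
```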

Another thing to do is log the output of 'ps aux' to a file, so when it goes down you can reboot and check out the log.
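As a rough sketch of that idea, the function below writes a timestamped copy of the `ps aux` output, along with the load average, to its own file. The name `snapshot` and the default directory are assumptions chosen for illustration.

```shell
# snapshot [DIR]
# Writes the date, load average and full process list to a new
# timestamped file in DIR and prints the file name.
# The default directory is a hypothetical choice.
snapshot() {
    dir="${1:-/var/log/ps-snapshots}"
    mkdir -p "$dir" || return 1
    out="$dir/ps-$(date +%Y%m%d-%H%M%S).log"
    {
        date
        cat /proc/loadavg 2>/dev/null
        ps aux
    } > "$out"
    printf '%s\n' "$out"
}
```

After a hang and reboot, the last few files in the directory show what was running shortly before the machine stopped responding.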

Thanks for the further replies….

@Guspaz,

Didn't know that Lish should stay logged in when I left the webpage - I assumed it would timeout somehow. Useful to know.

@obs,

To get Postfix up and running I followed a pretty good howto - http://workaround.org/ispmail/lenny/

It recommends setting dovecot as follows:

mail_location = maildir:/var/vmail/%d/%n/Maildir

I saw an earlier FAQ version that had it pointing to /home/vmail which is a bit pants, but I'm not in a rush to move it….

You're right that it might not be postfix - just that that was the thing logging errors into the syslog closest to the crashtime.

I think I've fixed the postfix config problem now, but will leave lish running top.

re: logging ps aux to a file, I assume you're thinking of something like a 5 minute cronjob with a day or so of logrotate.
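One way this could look, as a sketch: it assumes the snapshots are written as `ps-*.log` files into a single directory (a naming convention invented here), and it prunes old files with `find` rather than logrotate. The script paths in the crontab comments are hypothetical.

```shell
# prune_snapshots DIR [DAYS]
# Deletes ps-*.log files in DIR whose age in whole days is greater than
# DAYS (default 1), so by default files at least two days old are removed.
prune_snapshots() {
    dir="$1"
    days="${2:-1}"
    find "$dir" -type f -name 'ps-*.log' -mtime +"$days" -exec rm -f {} +
}

# Possible crontab entries (script paths are assumptions):
#   */5 * * * *  /usr/local/bin/ps-snapshot.sh
#   15 3 * * *   /usr/local/bin/ps-prune.sh
```

At one snapshot every five minutes, two days of history comes to a few hundred small files.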

Thanks again for the comments. I'm determined to get to the bottom of this :-)

Peter
