Strange random freezes after lucid upgrade

OK, so for a few days now I've been getting strange freezes. This started immediately after I upgraded from Karmic to Lucid. The load average goes way up, into the 30-70 range, and the system becomes totally unresponsive: all network services, the console, everything.

It happens about once per day, but not at any regular time. Sometimes it will go two days before freezing, sometimes it freezes a few times in one day.

I can't find anything interesting in logs. I only know the load average is high because I leave a terminal open with htop running, and the last loadavg displayed before the SSH session disconnects shows something like this:

  1  [ 1.6%]     Tasks: 382 total, 1 running
  2  [ 0.0%]     Load average: 38.85 37.81 33.63
  Mem[868/2020MB]   Swp[42/719MB]     Uptime: 2 days, 12:29:11

  Broadcast message from root@h4 (unknown) at 19:10 ...
  The system is going down for power off NOW!

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
32657 kiomava   20   0  2816  1472   940 R  1.0  0.1  1h15:41 htop

The only way to recover is to issue a reboot from the Linode dashboard; left alone, it stays unresponsive for hours until I do.

I've tried several kernels, pretty much all of the recent 2.6 ones. The problem started while running the latest 2.6 paravirt kernel.

This system has been in more or less the same configuration (packages installed, their config files, etc.) for a year. This all started exactly when I upgraded to Lucid, so it seems almost certain there is some Lucid-specific problem in play.

I have munin running, and I see nothing leading up to the points where it freezes. There is just a discontinuity in the various graphs from the point where it dies until it recovers. Even the load average doesn't spike; it must happen too quickly for munin to catch and log. Only the running htop catches the load average spike. Munin shows nothing suspicious: no slowly increasing memory/CPU/load, no slowly increasing process count, nothing. Just total normality until it dies.

The logs also show nothing interesting. Just discontinuity when it dies. I've scoured apache logs, java appserver logs, etc., and found nothing interesting around when it dies.

There are no cron jobs scheduled near when these freezes happen. They happen at seemingly random times; I haven't seen any pattern such as it failing every day around the same time or minute.
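(For anyone wanting to repeat that check on a stock Ubuntu install, something along these lines covers the usual places; Ubuntu's cron logs to syslog with a CRON tag:)

# list system-wide and per-user cron jobs, then look at recent cron activity in syslog
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly
for u in $(cut -d: -f1 /etc/passwd); do sudo crontab -l -u "$u" 2>/dev/null; done
grep CRON /var/log/syslog | tail -50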

So…

Has anybody seen anything similar?

Anybody have suggestions how to better instrument this to see what's going on?

Any other suggested courses of action to fix this?

Thanks in advance for any help you all can offer.

7 Replies

Hmm, 382 processes seems like a rather large number… is it normally that high?

In steady state it's around 200. It certainly climbs during the load problem; the last crash went up to 571. I think this is because new processes are spawned as incoming HTTP requests come in, but they hang and just keep increasing in number.
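A quick way to watch that would be to log the apache2 process count next to the load average once a second, something like this (rough sketch; the log filename is arbitrary):

# count apache2 processes and print them alongside /proc/loadavg every second
while true; do
  printf '%s  apache2=%s  ' "$(date '+%H:%M:%S')" "$(ps -C apache2 --no-headers | wc -l)"
  cat /proc/loadavg
  sleep 1
done >> apache-count.log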

My latest theory is that disk access is the thing that completely freezes.

I ran this script:

#!/bin/bash

# append a timestamped snapshot of the load average and the full
# process list to a local file once a second
unif="stats"

while true; do
  date >> "$unif"
  cat /proc/loadavg >> "$unif"
  ps -fel >> "$unif"
  sleep 1
done

And then I also ran this in a screen session from another server:

while true; do date; cat /proc/loadavg; cat /proc/vmstat; ps -fel; sleep 1; done

The last output from the one that appends to the file had this loadavg line:

Fri Sep 17 06:05:03 GMT 2010
0.80 0.87 0.84 2/358 32751

The last loadavg/date output from the console-only one was this:

Fri Sep 17 06:24:29 GMT 2010
59.26 50.67 31.28 1/571 6556

The plain shell loop that repeatedly ran ps was chugging along fine. In fact it just kept going after I logged in to its screen at 6:24 GMT to check.

But the script that repeatedly appends to a file stopped dead at 6:05 GMT.
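If local disk writes really are what freezes, one follow-up would be to time a small fsynced write every second and ship the output to another box, so the log itself survives the stall. Rough, untested sketch (user@otherhost is just a placeholder):

# timestamp before and after a tiny synced write; a local disk stall shows up
# as a gap between the "before" and "after" lines on the remote side
while true; do
  date '+%H:%M:%S before write'
  dd if=/dev/zero of=/tmp/writetest bs=4k count=1 conv=fsync 2>/dev/null
  date '+%H:%M:%S after write'
  sleep 1
done | ssh user@otherhost 'cat >> disk-latency.log'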

Can't recall if this has already been checked, but… are you running Apache? If so, are you using the prefork mpm? What's MaxClients set to in apache2.conf?

Apache+mod_php, by default, will eventually shred your system in this manner if you have less than a few GB of memory. -rt
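(For reference, on Lucid's Apache 2.2 the prefork limits sit in a block like this in apache2.conf; the values below are roughly the stock Ubuntu defaults:)

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>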

It's got 2GB of RAM; MaxClients is 150.

I'll drop MaxClients to 50 and see if that helps. Thanks for the suggestion.

At the two points in my last post, the console-based ps showed 98 apache2 processes; when the file output stopped there were only 11. I haven't seen RAM usage push into swap, but I'll add a little more to the console-based loop to check that.
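Something like the same console loop with free -m added should do it (sketch only):

while true; do date; cat /proc/loadavg; free -m; cat /proc/vmstat; ps -fel; sleep 1; done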

It did the high loadavg crash again with the reduced MaxClients.

Also at the point where it died, it had plenty of free memory.

Hi,

Have you been able to resolve this? Because I'm experiencing exactly the same problems…

I ended up having to reinstall to a fresh new instance, and now it's stable. Never did figure out the cause of the problems I described in this thread.
