CPU spike, unresponsive SSH and corrupted logs
Today at around 8 am I lost the ability to log in via SSH or the web console, though I could still access the website (lighttpd+php5+mysql), FTP (proftpd) and the database (postgresql), and all of them responded quickly. Later I found out that email (postfix+courier) had also stopped working at around the same time, ~8 am. At that point I decided not to wait until night and restarted the server from the dashboard; the shutdown took almost two minutes.
When I started reading through /var/log/ I didn't find any suspicious records, except that some of the logs are corrupted: they have garbage in them, entries out of order, or entries from other log files mixed in. So far I haven't found any corrupted files other than logs.
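To get a list of which log lines are affected, a rough Python sketch like this can flag entries whose timestamp goes backwards or does not parse at all. It assumes the classic syslog prefix ("Jun  7 08:01:02 ..."); the function name and the placeholder year are made up for illustration, and logs in another format would need a different parser.

```python
from datetime import datetime


def check_syslog_order(lines, year=2000):
    """Return (line_number, reason) pairs for suspicious lines.

    Assumption: each line starts with a 15-character syslog timestamp
    such as "Jun  7 08:01:02". Those stamps carry no year, so a
    placeholder (leap) year is used purely to make them comparable.
    """
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        stamp = line[:15]
        try:
            current = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            # Garbage, or an entry that does not follow the assumed format.
            problems.append((number, "unparseable timestamp"))
            continue
        if previous is not None and current < previous:
            problems.append((number, "out of order"))
        previous = current
    return problems
```

It only looks at timestamps, so entries from another log that happen to be in order would not be caught; a log that spans New Year would also be reported as out of order once.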
In the graphs I can see that today at ~8:00 CPU suddenly maxed out at ~95-105% and stayed there for about an hour and a half, until ~9:30. The IO rate had been steady at 50 with occasional (once in ~12 hours) spikes to ~250-300; during the CPU spike from ~8:00 to ~9:30 it gradually dropped to 0 and then started jumping around erratically, peaking at ~600.
I would really appreciate it if anyone could give me some insight into how I can find out what happened and how I can prevent it from happening again.
3 Replies
When this occurred I also noticed the system date had been changed to 1914. I've had this date error occur one other time on another Linode, where it didn't cause corrupted logs or a CPU spike, but that might be because I noticed the date change before it caused too much of a problem (just a few hours after the change). I could see no hint in the logs as to the cause of the date change.
NTP is set to ntp.ubuntu.com, but if that were the problem I assume I'd be seeing a lot of reports of it. Also, ntpd doesn't correct the date when this occurs, because the size of the offset exceeds its sanity limit. ntpd's -g option could be used to bypass the sanity check, but I'm not going to use it until I'm sure of the cause.
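For what it's worth, the sanity check amounts to comparing the clock offset against a limit. Here is a minimal Python sketch of that comparison, assuming the limit is 1000 seconds (which, as far as I know, is ntpd's default panic threshold; check your own ntpd documentation). The function name and the example dates are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: 1000 s is the default ntpd panic threshold.
PANIC_THRESHOLD_SECONDS = 1000.0


def offset_exceeds_limit(system_ts, reference_ts, limit=PANIC_THRESHOLD_SECONDS):
    """Return True if the two clocks differ by more than `limit` seconds."""
    return abs(system_ts - reference_ts) > limit


# Hypothetical example: a clock that jumped to 1914 versus a made-up "real" time.
jumped = datetime(1914, 1, 1, tzinfo=timezone.utc).timestamp()
actual = datetime(2010, 6, 7, 8, 0, tzinfo=timezone.utc).timestamp()
print(offset_exceeds_limit(jumped, actual))
```

A jump of decades is far beyond any such limit, which is consistent with ntpd refusing to step the clock back on its own.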
The date issue resolves with a reboot, but then the modification times of all the affected files need to be fixed.
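Fixing the modification times can be scripted. This is only a sketch under assumptions: it treats any regular file with an mtime before a cutoff you choose as "affected" and sets its mtime to a timestamp you pass in (or to now). The function names are made up, and it is worth running the find step on its own first to review the list before changing anything.

```python
import os
import time


def find_files_older_than(root, cutoff_ts):
    """Return sorted paths under `root` whose mtime is before `cutoff_ts`."""
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if mtime < cutoff_ts:
                stale.append(path)
    return sorted(stale)


def reset_mtime(paths, new_ts=None):
    """Set the mtime of each path to `new_ts` (default: now), keeping atime."""
    if new_ts is None:
        new_ts = time.time()
    for path in paths:
        atime = os.stat(path).st_atime
        os.utime(path, (atime, new_ts))
```

The real modification time is lost either way, so the new value is just a plausible stand-in, for example the time of the reboot.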
@capitan_dorko:
When this occurred I also noticed the system date had been changed to 1914.
That is a kernel bug in 2.6.34… if you're still running that version after a reboot, make sure you're set to run Latest 2.6 Paravirt and aren't locked to a specific kernel.
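If you want to check this from a script after the reboot, here is a small Python sketch. It only compares the running kernel's release string against the 2.6.34 series mentioned above; the function names are made up and the exact format of the release suffix on your system is an assumption.

```python
import platform
import re


def kernel_series(release):
    """Extract (major, minor, patch) from a release string, or None."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_reported_buggy(release):
    """True if the release belongs to the 2.6.34 series named in this thread."""
    return kernel_series(release) == (2, 6, 34)


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    print(running, "affected" if is_reported_buggy(running) else "not 2.6.34")
```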
Again. Much thanks.