CPU and Disk IO go out of control every few days

Wondering if someone here has some suggestions on how to solve this problem (or at least rein it in):

Every several weeks (the time period is not always the same), my CPU usage and disk IO peg the meter. My system becomes unresponsive, and the only way to get control back is to reboot.

Does anyone have any suggestions on process monitors that can keep an eye on runaway processes, kill them if they get too high for too long, and restart them?

21 Replies

Though this doesn't quite answer your question… if you're running MySQL as part of a CMS installation, lack of caching and/or slow queries are likely to have something to do with it. Simply enabling some sort of caching (and the slow query log) for MySQL could help.

I see CPU usage of 300% to 400% at random times. I used CentOS 5.0 with Virtualmin before and hit the issue every 1 or 2 days. Later I switched to Ubuntu 8.04 LTS with Virtualmin (because the installation script can no longer find a 5.0 ISO mirror on the CentOS site; they only provide mirrors for 5.2). But I still hit it every 3 or 4 days.

I bought the account a month ago; it is a Linode 360 located in Newark, NJ, USA.

I came to post a similar question.

It's happened a couple times in the past but it just happened twice in the same week. CPU spikes up but I/O flatlines.

Checking my /var/log/messages I see a ton of these

Nov 13 01:02:58 li4-190 kernel: ''IN-internet':'IN=eth0 OUT= MAC=fe:fd:42:dc:01:be:00:0c:db:fc:8b:59:08:00 SRC=88.31.96.105 DST=66.220.1.190 LEN=52 TOS=0x08 PREC=0x00 TTL=55 ID=18 15 DF PROTO=TCP SPT=4242 DPT=445 WINDOW=59584 RES=0x00 SYN URGP=0

Same thing when it crashed on Nov 9.

Logging in to LISH just gives me a screen full of that.

Not running a high profile site. 10 hits a day honestly.

It's not quite a process monitor, but I would suggest installing Munin (http://munin.projects.linpro.no/). That way, when this next happens, you can see if there was anything that accompanied that, such as high levels of network traffic or disk activity.

Based on what you've said so far, I'll bet you ran out of RAM and started to chew up swap, which would explain both the CPU and disk usage. One way Munin can help is by showing you RAM usage over time, so you can see whether used memory slowly climbed until this happened.

Good luck!

@pmarsh:

I came to post a similar question.

It's happened a couple times in the past but it just happened twice in the same week. CPU spikes up but I/O flatlines.

Checking my /var/log/messages I see a ton of these

Nov 13 01:02:58 li4-190 kernel: ''IN-internet':'IN=eth0 OUT= MAC=fe:fd:42:dc:01:be:00:0c:db:fc:8b:59:08:00 SRC=88.31.96.105 DST=66.220.1.190 LEN=52 TOS=0x08 PREC=0x00 TTL=55 ID=18 15 DF PROTO=TCP SPT=4242 DPT=445 WINDOW=59584 RES=0x00 SYN URGP=0

Same thing when it crashed on Nov 9.

Logging in to LISH just gives me a screen full of that.

Not running a high profile site. 10 hits a day honestly.

Those are netfilter (firewall) log messages. Destination port 445 is the SMB (Windows file sharing) port, so this is most likely someone scanning for open shares. You can configure your logging to send these somewhere else so they don't spam your console so much.
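
If you want to see who is knocking and on which ports, the KEY=VALUE fields in those lines are easy to pull apart. Here is a minimal Python sketch; `parse_netfilter_fields` is a name I made up, and the sample line is a shortened version of the one posted above.

```python
import re

def parse_netfilter_fields(line):
    """Return a dict of the KEY=VALUE pairs found in a netfilter log line."""
    # Keys are upper-case words; bare flags such as SYN or DF have no '=' and are skipped.
    return dict(re.findall(r"\b([A-Z]+)=(\S*)", line))

# Shortened sample based on the log line quoted in this thread.
sample = (
    "Nov 13 01:02:58 li4-190 kernel: IN=eth0 OUT= "
    "SRC=88.31.96.105 DST=66.220.1.190 PROTO=TCP SPT=4242 DPT=445 SYN URGP=0"
)
fields = parse_netfilter_fields(sample)
print(fields["SRC"], fields["PROTO"], fields["DPT"])  # 88.31.96.105 TCP 445
```

Counting the `SRC` and `DPT` values over a whole log would show whether it is one noisy host or general background scanning.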

I was having what sounds like the same problem almost every day: a huge IO spike, and then the CPU is maxed out until I reboot. From watching top I could see that all the RAM and swap were consumed. The OOM killer tries unsuccessfully to save the system:

Code:

Out of Memory: Kill process 28567 (apache2) score 41314 and children.

It seems that Apache goes out of control creating new processes and eats up all the memory. A few days ago I lowered the MaxClients and KeepAliveTimeout options and it hasn't happened again since.

Hope this helps.

Instead of Apache, you might want to give nginx a try. It's much nicer to your memory. :-)

http://nginx.net/

http://blog.kovyrin.net/2006/05/30/nginx-php-fastcgi-howto/

I can only second that; nginx is great. I'm using it with fcgi-php: small memory footprint, fast, rock solid.

Check this site for English documentation: http://wiki.codemongers.com/Main

I see one "Out of Memory" message in my logs (from today), where mysqld is killed. As suggested, I have lowered some of the values for Apache (MaxSpareServers, MaxRequestsPerChild, etc). Looks like I need to look into something similar for MySQL as well.

Thanks for your help, everyone.

@hotgazpacho: mysqld may not be the culprit.

My understanding of how the OOM system works (someone correct me if I'm wrong) is that when the system runs out of memory, the kernel basically goes into "triage mode" and decides that it has to kill off some processes in order to keep the machine running. It does this by killing off what it perceives as the worst offenders, but its pick is not always the real culprit.

Case in point: if you're getting hammered with connections, Apache will create as many as 256 child processes to handle those connections, which will eat up your memory. However, each of those processes MAY be smaller than a single MySQL process. Going by per-process memory usage alone, it would appear at first glance that MySQL is using the most memory, when Apache (with all of its child processes) is really the culprit.
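
To make that concrete, here is a rough Python sketch that adds up memory per process name instead of per process. The function name and the numbers are made up for illustration; on a real box you would feed it (name, RSS) pairs taken from something like ps output. Keep in mind that RSS counts shared pages once per process, so the Apache total is an overestimate.

```python
from collections import defaultdict

def total_rss_by_name(processes):
    """Sum resident memory (in MB) per process name.

    `processes` is an iterable of (name, rss_mb) pairs.
    """
    totals = defaultdict(float)
    for name, rss_mb in processes:
        totals[name] += rss_mb
    # Largest aggregate consumer first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up numbers: 40 Apache children at 12 MB each, one 60 MB mysqld.
sample = [("apache2", 12.0)] * 40 + [("mysqld", 60.0)]
for name, total in total_rss_by_name(sample):
    print(f"{name}: {total:.0f} MB")
```

With these figures mysqld is the biggest single process, but Apache as a whole uses eight times as much.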

@dmuth Point taken.

Here are my new Apache settings:

StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 128
MaxClients 128
MaxRequestsPerChild 500

The old settings were:

StartServers 5
MinSpareServers 5
MaxSpareServers 15
ServerLimit 256
MaxClients 256
MaxRequestsPerChild 5000

@hotgazpacho:

Here are my new Apache settings:

StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 128
MaxClients 128
MaxRequestsPerChild 500

Does your linode have enough RAM to support 128 Apache instances?
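
A back-of-the-envelope way to answer that, as a Python sketch. All three inputs are assumptions: I'm using the 360 MB plan mentioned earlier in the thread, guessing 160 MB for MySQL and the rest of the system, and guessing 20 MB per Apache child. Measure your own figures before trusting the result.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, avg_child_mb):
    """Rough upper bound on Apache prefork children that fit in RAM without swapping."""
    if avg_child_mb <= 0:
        raise ValueError("avg_child_mb must be positive")
    available = total_ram_mb - reserved_mb
    return max(0, int(available // avg_child_mb))

# Assumed figures: 360 MB plan, 160 MB held back for MySQL and the OS, 20 MB per child.
print(estimate_max_clients(360, 160, 20))  # 10
```

Under those assumptions the ceiling is around 10, far below 128.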

Probably not. That's half the default, though.

Guess I need to be more aggressive, no?

Maybe. Remember what I said about installing Munin some replies back? One of the reasons for doing that is so that you can get metrics and baselines. i.e., in normal operation, what is typical resource usage?

That way, when things fail on your box, you will know not just the /cause/ of the failure, but /how bad of a cause it was/. Based on whatever activity you see the next time this happens, it will give you a better picture of whether 128 servers was way too many, or just a few servers too many.

Munin graphs show absolutely nothing. Meaning it stopped recording at the time of the spike.

The Linode CPU graph shows 70% CPU usage, no disk I/O, and no network I/O.

No spike in traffic. The only thing running is 1 Mongrel; nothing fancy, just a blog.

Logging into Lish shows nothing. No login prompt.

It's like the entire thing "blue screened" if that were possible.

Anyone else still having problems?

@pmarsh:

Munin graphs show absolutely nothing. Meaning it stopped recording at the time of the spike.

Well, you need to look at what was going on up to that problem. Are you watching Apache and MySQL in Munin? Those aren't on by default.

I was getting similar issues. What I found to help was tuning Apache and removing unneeded features in Virtualmin.

Here are some tips. This is written with Virtualmin in mind, but could be applied elsewhere.

  • Don't run BIND DNS for your domains. Use Linode's DNS Manager instead. Turn off this feature in Virtualmin.

  • If you forward your emails to another provider (e.g. GMail, Yahoo), disable SpamAssassin and Virus Filtering in Virtualmin.

  • Disable DAV in Virtualmin if you aren't using it. Also, edit your httpd.conf via Webmin and comment out the following lines:

#LoadModule dav_module modules/mod_dav.so
#LoadModule dav_fs_module modules/mod_dav_fs.so

Then open the subversion.conf file and comment out:

#LoadModule dav_svn_module     modules/mod_dav_svn.so
#LoadModule authz_svn_module   modules/mod_authz_svn.so

This keeps Apache from loading unneeded modules. Restart Apache.

  • Tune Apache. There are two sections you need to tune in httpd.conf: prefork and worker. Here's my setup:

<IfModule prefork.c>
StartServers       8
MinSpareServers    5
MaxSpareServers   10
ServerLimit      128
MaxClients       128
MaxRequestsPerChild  500
</IfModule>

<IfModule worker.c>
StartServers         2
MaxClients         150
MinSpareThreads     25
MaxSpareThreads     75
ThreadsPerChild     25
MaxRequestsPerChild  0
</IfModule>

Restart Apache for the changes to take effect.

  • Take a close look at the LoadModule section in your httpd.conf. You may not need every module that's loading. With Virtualmin, there are quite a few.

  • Look at the Features and Plugins page. If you aren't using it, uncheck it. Dovecot, Postgres, Mailman, subversion, etc. After you disable the feature, visit the System->Bootup section in Webmin and verify that those services aren't set to start at boot. For some reason Postgres was set to startup on boot for me and I wasn't using it.

  • Use SFTP instead of FTP. Then you can stop ProFTPd as well. You can use a client such as FileZilla to do your transfers this way. It's also much more secure.

That ought to get you started.

Yup Apache and MySQL are in there. There is no spike or indication of any run-up to the problem.

Everything is showing almost zero activity and then all of a sudden the graph just disappears.

Would be great if I could see what was actually using CPU. Anyone know of any other way to track what programs are using what? Kind of like a recording of top?
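
One low-tech option is a small script that samples /proc and appends the busiest processes to a log file, so there is something to read after the reboot. This is only a sketch under my own assumptions: the function names, the 60-second interval, and the log path are made up, and it is Linux-only. Packaged tools such as sysstat (sar) or atop do the same job more thoroughly.

```python
import os
import time

def parse_stat_line(stat_line):
    """Return (pid, command, cpu_ticks) from the text of /proc/<pid>/stat."""
    # The command sits in parentheses and may itself contain spaces or ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    command = stat_line[open_paren + 1:close_paren]
    rest = stat_line[close_paren + 1:].split()
    # rest[0] is the state (field 3), so utime and stime (fields 14, 15) are rest[11], rest[12].
    return pid, command, int(rest[11]) + int(rest[12])

def snapshot():
    """Map pid -> (command, cpu_ticks) for every process readable right now."""
    procs = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                pid, command, ticks = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        procs[pid] = (command, ticks)
    return procs

def record(log_path, interval=60, top_n=5):
    """Every `interval` seconds, append the busiest processes to `log_path`."""
    previous = snapshot()
    while True:
        time.sleep(interval)
        current = snapshot()
        # CPU ticks consumed since the previous sample, for processes seen both times.
        usage = [
            (ticks - previous[pid][1], pid, command)
            for pid, (command, ticks) in current.items()
            if pid in previous
        ]
        usage.sort(reverse=True)
        with open(log_path, "a") as log:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for delta, pid, command in usage[:top_n]:
                log.write(f"{stamp} pid={pid} cmd={command} cpu_ticks={delta}\n")
        previous = current
```

Start it in the background with something like `record("/var/log/cpu-recorder.log")` (a hypothetical path). The file is reopened and closed on every sample, so the last entries before a hang have a fair chance of reaching disk.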

It really looks like an all out crash.

There's no BIND running, no virtualmin, Apache is stripped down. No memory issues at all on the box.

If it wasn't for the Linode CPU graph I'd have no idea that the box was doing anything at all. Would just appear to the outside world to have gone down.

The only thing that spikes is CPU? What's your IO rate at when the CPU spikes?

Ah shoot I didn't save the graphs from linode. Next time I will.

CPU was right at 70%. Disk I/O and network activity were 0, not even registering on the graph.

New linode user, and happy with it.

Started noticing the same CPU and I/O problems, CPU maxing for hours at 400%, I/O going up and down in the graph.

To maybe help someone else looking for solutions, here is what I did. I started with Ubuntu 8.10.

apache: changed the settings for MaxClients and threads to around 1/4 of the default, as mentioned earlier.

mysql: skip-innodb, set it to log slow queries

These helped lower the CPU max to around 280%. In my case skipping InnoDB helped, even though I had to switch to another engine for some databases.

timezone: set to my timezone (Europe)

As the peaks started around the hour, I asked myself if cron jobs were partly to blame. They weren't.

ufw: can't remember what I set as default

About 6 hours later I was back at 400%. So I decided to switch to Ubuntu 8.04, as it is LTS and 8.10 isn't.

I repeated the above steps, and changed UFW to default deny. The graphs are now looking nice, bubbling along the bottom near the X axis.

@hotgazpacho:

I see one "Out of Memory" message in my logs (from today), where mysqld is killed. As suggested, I have lowered some of the values for Apache (MaxSpareServers, MaxRequestsPerChild, etc). Looks like I need to look into something similar for MySQL as well.

Thanks for your help, everyone.

It could be a bug in your code (PHP, etc.) where a tight loop keeps exhausting the total allowable memory for an Apache thread. When that thread runs out of memory, it freezes, so the next request spawns another thread, and that thread ends up sharing the same fate. That goes on until your system runs out of memory. So MySQL starves to death, and so on. The path of execution can be such that the end result occurs only once every few days. If you have access to a shell, type ps -ef | grep httpd (or apache) at the prompt and see what is going on; then you can use strace to see what the dead thread was doing last, and where.
