Linode keeps OOMing; High IOWAIT; High load average - help?

I have never had this much trouble with a server and I am at a loss…

I have a Linode 2048 w/ extra RAM (2678 MB total). I am running Ubuntu 10.04 LTS w/ Apache and MySQL. There is a single WordPress website, teleread.com, which gets an average of 5000 uniques a day.

 cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.1 LTS"

uname -a
Linux teleread 2.6.32.16-linode28 #1 SMP Sun Jul 25 21:32:42 UTC 2010 i686 GNU/Linux
 apache2ctl -V
Server version: Apache/2.2.14 (Ubuntu)
Server built:   Apr 13 2010 19:28:27
Server's Module Magic Number: 20051115:23
Server loaded:  APR 1.3.8, APR-Util 1.3.9
Compiled using: APR 1.3.8, APR-Util 1.3.9
Architecture:   32-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT=""
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="/var/run/apache2/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="/etc/apache2/mime.types"
 -D SERVER_CONFIG_FILE="/etc/apache2/apache2.conf"
 mysql --version
mysql  Ver 14.14 Distrib 5.1.41, for debian-linux-gnu (i486) using readline 6.1

A few times a day the IOWAIT starts to rapidly increase, the SWAP starts to thrash and the load average jumps. I have tuned, and tuned, and retuned Apache and MySQL, but no matter what I do, it keeps happening.

Running Apache2 w/ prefork MPM

<IfModule mpm_prefork_module>
    StartServers            8
    MinSpareServers         5
    MaxSpareServers        20
    ServerLimit           300
    MaxClients            300
    MaxRequestsPerChild  4000
</IfModule>

MySQL:

[mysqld]
user            = mysql
port            = 3306
socket          = /var/run/mysqld/mysqld.sock
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir          = /tmp
skip-external-locking
skip-innodb

key_buffer_size = 64M
table_open_cache = 1048
sort_buffer_size = 1M
read_buffer_size = 1M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size =16
query_cache_size = 32M
tmp_table_size=64M
max_heap_table_size=64M
back_log = 100
max_connections = 301
max_connect_errors = 5000
join_buffer_size=1M
open_files_limit = 10000
interactive_timeout = 300
wait_timeout = 300
thread_concurrency = 8

I've tried php-cgi, fcgid, and regular mod_php to see if anything makes a difference, but nothing helps.

Eventually I start getting these in the Apache error log:

[Wed Sep 22 09:19:23 2010] [warn] child process 18258 still did not exit, sending a SIGTERM
[Wed Sep 22 09:19:23 2010] [warn] child process 18303 still did not exit, sending a SIGTERM
[Wed Sep 22 09:19:23 2010] [warn] child process 18304 still did not exit, sending a SIGTERM

But I think that's a sign of the OOMing/thrashing, not a sign of the culprit.

According to netstat, at any given moment I have about 120+ TCP connections to www, but occasionally it'll spike… I have seen these in the Apache error log (when I lowered MaxClients to test):

[Mon Sep 20 11:51:13 2010] [error] server reached MaxClients setting, consider raising the MaxClients setting

The lowest I've set MaxClients is 150, and I just set it back to 300 after the latest issue.

From top it looks like Apache is using 23M (RES) per process… at this moment netstat says I have 174 connections to www and ps shows 41 apache2 processes… that's roughly 943MB of RAM

At this moment MySQL is at 42M (RES)
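
For reference, here is roughly how I'm adding up the Apache footprint (assuming GNU ps is available; the exact numbers drift from minute to minute):

````
# sum the resident memory of all apache2 children (ps reports RSS in kB)
ps -o rss= -C apache2 | awk '{sum+=$1; n++} END {printf "%d processes, avg %.1f MB, total %.1f MB\n", n, sum/n/1024, sum/1024}'
````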

These are the stats at this very moment in time

teleread# netstat -t | grep -c www
163

teleread# ps auxww | grep -c www-data
27

teleread# top
top - 10:05:44 up 17:43,  3 users,  load average: 0.69, 0.67, 3.93
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.2%us,  1.3%sy,  0.0%ni, 92.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   2708280k total,   978832k used,  1729448k free,    63564k buffers
Swap:   262136k total,    14024k used,   248112k free,   312024k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18886 mysql     20   0  121m  45m 5460 S    0  1.7   0:41.06 mysqld
19435 www-data  20   0 52696  26m 4020 S    0  1.0   0:05.63 apache2
19474 www-data  20   0 52440  26m 4024 S    0  1.0   0:04.07 apache2
19451 www-data  20   0 49648  24m 4056 S    9  0.9   0:07.32 apache2
19429 www-data  20   0 49656  24m 4060 S    0  0.9   0:06.01 apache2
19496 www-data  20   0 49880  24m 3844 S    0  0.9   0:02.08 apache2
19492 www-data  20   0 49652  24m 4056 S    0  0.9   0:04.87 apache2
19331 www-data  20   0 49652  24m 4056 S    1  0.9   0:10.19 apache2
19469 www-data  20   0 49644  23m 4024 S    0  0.9   0:03.30 apache2
19473 www-data  20   0 49636  23m 4020 S    0  0.9   0:04.82 apache2
19479 www-data  20   0 49728  23m 3828 S    0  0.9   0:02.79 apache2
19507 www-data  20   0 49652  23m 3844 S   11  0.9   0:01.59 apache2
19495 www-data  20   0 49644  23m 3844 S    0  0.9   0:01.88 apache2
19508 www-data  20   0 49472  23m 4012 S    0  0.9   0:01.76 apache2
19501 www-data  20   0 49580  23m 3880 S    0  0.9   0:02.13 apache2
19433 www-data  20   0 49368  23m 4028 S    0  0.9   0:04.99 apache2
19487 www-data  20   0 49476  23m 3860 S    0  0.9   0:03.64 apache2

So… I am looking for some ideas, as I said, I am at a loss.

Thank you.

Lew

10 Replies

My first impression is that this is a grossly overpowered box for what you're serving. But maybe your WordPress application really is that inefficient.

What's probably happening here is that your PHP processes, which normally take ~1 GB in total, do something that pushes them to ~2 GB or more, so you spiral into swap use and start I/O thrashing.

But you need better data. Run vmstat, iotop, htop, or something similar, and report what the box is doing at the times it goes into swap.

Maybe install something like munin to monitor your system; you might see some patterns there.
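
If you don't want to set up full monitoring right away, even a crude logger left running in screen should catch the next incident. Something along these lines (the log path and interval are just placeholders, adjust to taste):

````
#!/bin/sh
# every minute, dump swap/memory/IO counters plus the biggest processes
while true; do
    {
        date
        vmstat 1 5
        ps aux --sort=-rss | head -15
    } >> /var/log/thrash-watch.log
    sleep 60
done
````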

Maybe you can use mod_php instead of fastcgi? Maybe 80% of your traffic could be handled by a reverse squid proxy?

Well, the thing is, he's using prefork.

If he switched to mod_fastcgi, good, but it's still prefork that sucks up so much RAM. Keep FastCGI, switch to worker, and seriously, MaxClients down to 100-120 is totally enough, especially after you cut KeepAliveTimeout down to 5.
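
Roughly what I have in mind, with the numbers as a ballpark rather than gospel:

````
# worker MPM sketch: short keepalives, modest client cap
KeepAlive            On
KeepAliveTimeout     5

<IfModule mpm_worker_module>
    StartServers          2
    MaxClients          120
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxRequestsPerChild 1000
</IfModule>
````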

*cough* wordpress sucks *cough*

````
[Mon Sep 20 11:51:13 2010] [error] server reached MaxClients setting, consider raising the MaxClients setting

````
Do not ever listen to what this error message says. It is very misleading in a VPS environment! When you're running out of memory, you want a lower MaxClients setting, not higher.

Some rules of thumb, assuming mpm_prefork and mod_php (see the quick math after the list):

Linode 512: MaxClients 25 or less

Linode 1024: MaxClients 50 or less

Linode 1536: MaxClients 75 or less

Linode 2048: MaxClients 100 or less

Anything more and you're likely to swap.
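
Those numbers just come from dividing whatever RAM is left after MySQL and the OS by a typical mod_php child size. As a rough sanity check (the 300 MB of overhead and 25 MB per child are assumptions, not measurements):

````
# RAM left over for Apache, divided by a typical prefork/mod_php child RSS
echo $(( (2048 - 300) / 25 ))    # => 69, so ~50-100 children on a Linode 2048
````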

Also, install the WordPress Super Cache plugin. This solves the problem 95% of the time. I wouldn't even start looking for another source of trouble until this was installed.

Thanks everyone for responding… the system started thrashing again today around 1:30 PM EST (I wasn't nearby to deal with it, so I don't have any good details)…

I was running iostat -x 1 when it took place and got this (not very useful to me) -- last few entries:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.77    0.00    7.79   91.32    0.12    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda             19.00     0.00 1157.00    5.00 33456.00    40.00    28.83   220.60  218.08   5.02 583.00
xvdb            415.00   206.00  246.00 1519.00  5056.00 13824.00    10.70   981.47  636.18   3.47 611.90

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.96    0.00    5.76   93.16    0.12    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.67     1.01  116.95    0.17  3393.29     4.03    29.01    42.31  357.67   6.29  73.61
xvdb             32.72    24.16   14.77  164.93   408.05  1551.68    10.91   125.80  726.63   4.20  75.42

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.41    0.00   11.54   86.92    0.12    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              1.56     0.00   82.58    0.31  2397.51     7.47    29.01    27.58  242.03   4.11  34.11
xvdb             26.75    11.35   15.40   95.02   364.54   831.10    10.83    52.70  481.36   3.16  34.93

I had tried the worker MPM with FastCGI, but it was still happening. In fact, I just changed it to prefork with mod_php this past Friday in an attempt to fix the issue.

I also have W3 Total Cache installed, which I personally think outperforms Super Cache.

I'll try lowering MaxClients tonight.

I know what I have is overkill for the traffic being generated, which is why I am utterly confused. I manage several servers, and this is the only one giving me grief.

Thanks again.

Your numbers clearly say that you're out of memory and swapping hard. In the default configuration, /dev/xvdb is the swap partition.
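
You can confirm it live with standard tools during an episode:

````
swapon -s     # in the default Linode layout, /dev/xvdb shows up here as swap
vmstat 1 10   # watch the si/so columns; sustained nonzero values mean active swapping
````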

@lewayotte:

I also have W3 Total Cache installed, which I personally think outperforms Super Cache.
IIRC, if you use Total Cache, you still need to go through the PHP interpreter. Total Cache has truckloads of features, but it comes at the expense of having to use PHP. (After all, its features are written in PHP.) At least that's how I remember it; recent versions may have changed a bit.

In contrast, Super Cache tinkers with your .htaccess file so that most pages completely bypass PHP. The two plugins have different performance characteristics, so judge not only by raw speed but also by how each behaves in your low-memory situation. I think the "bypass PHP altogether" approach has clear benefits in this regard, but YMMV.
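
For the curious, the rules it drops into .htaccess look roughly like this (simplified from memory, not the exact block the plugin generates):

````
# serve a pre-built static copy when one exists, so PHP never runs
RewriteEngine On
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{QUERY_STRING} ^$
RewriteCond %{HTTP_COOKIE} !wordpress_logged_in [NC]
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{SERVER_NAME}/$1/index.html -f
RewriteRule ^(.*)$ /wp-content/cache/supercache/%{SERVER_NAME}/$1/index.html [L]
````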

I would suggest using Apache worker with mod_fastcgi. Something like StartServers 4 and ThreadsPerChild 32 (or even 64). Apache would then eat much less memory. FcgidMaxProcesses maybe at 64? And I would (for testing purposes) disable caching unless you are hitting too much I/O (CPU is rather irrelevant). You will serve pages more slowly, but you should be able to serve more of them.

Using prefork, you would save next to nothing, since the PHP engine is loaded even for serving static content.

I advise against Apache if you don't need any Apache specific features. Lighty or Nginx will do just fine.

I am running a low-profile website (average 1200 visitors, 400-450 thousand hits per day) in PHP, using worker and mod_fastcgi. With a bit of tuning, it all runs in a near-constant memory footprint (my usual "used RAM" readout with all services running is ~170 MB, of which the whole set of Apache and php-cgi processes is maybe about 100).

It's way easier than trying to convert all the Apache dependencies into lighttpd's or nginx's syntax, setting up proxying, and/or (in the case of nginx) compiling the whole mess by hand to get anything resembling a "recent version".

If anyone's interested…

apc.shm_size = 64
(...)
    StartServers          2
    MaxClients          250
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxRequestsPerChild   0
    # Linux default is 8MB, a ton of wasted VMem... 2MB seems totally safe, probably could be reduced further.
    ThreadStackSize 2097152
(...)
        FastCgiConfig \
                -idle-timeout 120 \
                -initial-env PHP_FCGI_CHILDREN=24 \
                -initial-env PHP_FCGI_MAX_REQUESTS=500 \
                -killInterval 100000 \
                -listen-queue-depth 300 \
                -maxClassProcesses 1 \
                -singleThreshold 0

I know it's a very blunt-force solution, but I'd honestly suggest just backing up your database and WordPress configs and reinstalling the server. Then triple-check your backed-up configs and toss them back on. Odds are, you accidentally deleted a # in a random config file while tuning things, and uncommented "eat-all-swap-space=1" or something.

All the power to you if you want to debug it, but a rebuild would probably be quite a bit faster.

@akerl:

Odds are, you accidentally deleted a # in a random config file while tuning things, and uncommented "eat-all-swap-space=1" or something.
I doubt it 8)

@lewayotte:

StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 300
MaxClients 300
MaxRequestsPerChild 4000
Three hundred prefork processes, each fully loaded with the PHP engine. No surprise the box is OOMing.
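
The back-of-the-envelope version, using the ~23 MB RES per child from the top output above:

````
# worst-case memory demand at MaxClients 300 vs. what the box actually has
echo $(( 300 * 23 ))    # => 6900 MB potentially needed, on a 2678 MB Linode
````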
