High Load (possibly IO wait)

I've been seeing a lot of mysterious load spikes on my Ubuntu 8.04 linode in the past 24+ hours, with load average ranging from .25 up to 1.5+, almost all of that coming from IO wait (as far as I can tell). I can ameliorate the problem a lot by turning off swap, but that's clearly not a good long-term solution. I've tried several kernels (2.6.18, 2.6.27, and 2.6.28) with no change. I've turned off all of my servers to no effect, and both rkhunter and chkrootkit (freshly reinstalled) came back clean. Finally, I've installed munin to track things going forward.

Is there something else I should be looking for on my server? I'm at a loss at this point for other things that I can do. Is anyone else having similar problems on Atlanta36?

23 Replies

You might take a look at the output from $ vmstat 5 and see whether it is io or swap activity that is loading you down.

To be honest, if you are actively swapping you probably need to cut down your memory usage or upgrade to a bigger linode.
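As a rough sketch of how to read that vmstat output: in the usual procps layout, si and so (columns 7 and 8) are swap-in and swap-out, and wa (column 16) is the percentage of CPU time spent waiting on I/O. The helper name summarize_vmstat is hypothetical, and the column positions are an assumption about your vmstat version, so check them against the header line first.

```shell
# Hypothetical helper: average the si, so and wa columns of vmstat output.
# Assumes the classic procps layout (si=7, so=8, wa=16); verify against the header.
summarize_vmstat() {
    awk '
        # Header lines start with text; data rows start with a number.
        $1 ~ /^[0-9]+$/ {
            si += $7; so += $8; wa += $16; n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d avg_si=%.1f avg_so=%.1f avg_wa=%.1f\n", n, si/n, so/n, wa/n
        }
    '
}

# Usage: take 6 samples, 5 seconds apart, and drop the first data row,
# which reports averages since boot rather than current activity.
# vmstat 5 6 | sed '3d' | summarize_vmstat
```

Sustained non-zero si/so points at swapping; high wa with si/so near zero points at disk I/O that is not swap related.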

I'm on atlanta12 and I noticed a spike in CPU usage, loadavg and iostat at around 2am-4am this morning. No idea why.

The Linode dashboard graph shows it as flat, but munin caught it.

Thanks. I'll check out vmstat if things get bad again. I'm not actively swapping, though – about 1/2 of my physical memory is disk cache, so there's plenty that's available for use. When swap is enabled, a few MB get used up (as I've seen on every Linux box I've worked on), but it has never been heavily used.

I've been seeing very poor performance on my linode at newark62 as well. I've seen my loads spike when the vps isn't doing anything.

Me too. I'm on fremont59 and my load average is a ridiculous 3.20. I'm not doing anything out of the ordinary, I'm not swapping at all. (I'm using only half of my RAM.) I don't see any unfamiliar processes running in the background, so I know I wasn't hacked. But the CPU is spending 20-40% of its time waiting for disk I/O.

Edit: The load average is now below 1.00. Whatever the problem was, it seems to be going away. But I don't like things like this, anyway!

Can somebody among the staff look into the issue, please? Now you have a list of hosts (newark62, atlanta12, atlanta36, fremont59) to look into, and a corresponding list of times at which the I/O problems seem to have occurred. What was going on on these hosts?

Using the program iotop you can monitor which programs are waiting on io/using io.
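Where iotop is not available, the system-wide share of time spent in iowait can be estimated from two samples of the first line of /proc/stat. This is only a sketch: iowait_percent is a hypothetical helper name, and it assumes the field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal). It tells you how much time is lost to I/O wait, not which process is responsible.

```shell
# Hypothetical helper: given two "cpu ..." lines from /proc/stat (earlier one
# first) on stdin, print the percentage of CPU time spent in iowait between them.
# Assumed field order: user nice system idle iowait irq softirq steal
iowait_percent() {
    awk '
        $1 == "cpu" {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i   # user..steal
            if (seen) {
                dt = total - prev_total
                if (dt > 0) printf "%.1f\n", 100 * ($6 - prev_wait) / dt
                else print "0.0"
            }
            prev_total = total; prev_wait = $6; seen = 1
        }
    '
}

# Usage: sample twice, 5 seconds apart.
# { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | iowait_percent
```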

Problem persists on fremont59.

Operations that require even a moderate amount of CPU + IO are feeling very sluggish. All MySQL queries are slower than usual, even though my DB is sitting idle at this time. Public key based SSH logins take several seconds longer than usual. Apt-get update takes forever. Rsync takes ~100 times longer than usual to generate a file list.

My linode usually has a load average of 0.1 or lower. Right now my load average is 0.8, and a reboot doesn't do anything to solve the problem. I waited a few minutes after the reboot to check my load average, so that all those daemons starting up after the reboot don't affect my short-term load average. No noticeable drop. Oh, and it took several times longer than usual for my daemons to start up after the reboot.

I've been monitoring this using top, iotop, vmstat, and munin. I'm not using any more CPU or IO than usual – CPU average 4%, IO average 200 according to the Dashboard. I'm not doing anything unusual with my linode, nor does anyone else seem to be doing something nasty with my account. Nonetheless, many of my processes are waiting for IO to complete. As if I were trying to do some work while a degraded RAID array was being rebuilt!

Which makes me suspect……….

Is somebody fsck'ing a disk on fremont59 (and the other nodes mentioned above)?

I also noticed this behavior last night on newark74.

@hybinet

Doesn't sound like the problem's on your linode. I'd open a ticket about it.

@btmorex

Already did, three hours ago. I'll post in this thread if I get an answer that might be helpful to others as well. Or better yet, a staff member should let us know what's going on.

UPDATE

Might be kernel related. I was running the new 2.6.27.4 kernel when I came across the anomalies described above. I switched back to "Latest 2.6 series" twelve hours ago, and the problem has disappeared since then. (I've been monitoring my load averages at 10 minute intervals throughout the night.)

Caker also says that Linux kernels newer than 2.6.20 may have obscure IO issues, so I for one am willing to attribute my problems to my premature adoption of the bleeding-edge kernel.

But what about others? @patrickpkt, oliver, astrashe3 : Which kernel version have you been running?

2.6.18.8-linode10 (SMP) i686

Whatever was going on isn't nearly as bad today – it might not be happening at all.

I'm running 2.6.18.8-linode16. I didn't run iotop because it complained that my kernel was too old.

I reinstalled my linode's OS yesterday with 32-bit Ubuntu 8.10. I had the problem before and after the reinstall.

I don't know about everyone else, but mine is running pretty well now. I don't know what was going on, but I haven't changed anything to fix it. It got better on its own.

I'm on newark61 and over the last 30 days my IO has averaged 10K by looking at the graphs in my control panel.

3 weeks ago this was less than 1K, and I'm not doing anything different on my server. It just seems to use up swap space more than normal!

CPU usage is currently at 10% (5% average for the last 30 days).

@richard.scott:

I'm on newark61 and over the last 30 days my IO has averaged 10K by looking at the graphs in my control panel.

3 weeks ago this was less than 1K, and I'm not doing anything different on my server. It just seems to use up swap space more than normal!

CPU usage is currently at 10% (5% average for the last 30 days).

If you're regularly using swap, you have a problem. It also hurts the performance of whatever you have on the server, so find a convenient time to reboot your server and see if you can control your swap usage.

@hybinet:

If you're regularly using swap, you have a problem.

I am using swap space, but I don't know why!

I used to run my email server on a co-located Mini-ITX VIA PD10000 1GHz motherboard with only 512MB of RAM, and that never touched swap space.

I would have thought that a linode was quicker than that? But since moving it all to a Linode 540, it's done nothing but swap!

…but, and here's the really annoying part….

I've only been swapping like crazy for the last 3 weeks, and I've been using the server as my mail server since the start of December!

I don't understand why the performance has degraded so much in the past 3 weeks :shock:

Rich

The problem we've been discussing in this thread only hits from time to time, for a few hours at most, not for three weeks straight. Hardware problems tend to get fixed pretty soon, so if it's been three weeks, something's definitely wrong with one of your programs.

A linode 540 obviously has more RAM than your old colo, and it's also many times more powerful. Also, it was fine until three weeks ago, right? Then try to find out what happened when the problems began. Did you change some configurations? Did you update your programs from your distro's repository? Did you suddenly get a lot of visitors to your web site or a lot of spam to your mail server? Perhaps someone hacked your server without your knowledge?

There are many diagnostic tools that can tell you which program is using the most CPU and RAM. top is the most simple of all; htop looks prettier; munin is a bit more sophisticated. You can get halfway towards a solution if you can pinpoint the bad guy.
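For a quick non-interactive view, something like the following can total resident memory per program name. It is a minimal sketch: rss_by_command is a hypothetical helper, it only looks at the first word of the command name, and summing RSS overstates usage where processes share pages, so treat the numbers as a rough ranking rather than exact accounting.

```shell
# Hypothetical helper: sum resident memory (RSS, KiB) per command name and
# print the largest consumers first. Reads "rss comm" pairs on stdin.
rss_by_command() {
    awk '$1 ~ /^[0-9]+$/ { total[$2] += $1 }
         END { for (c in total) printf "%d %s\n", total[c], c }' |
        sort -rn | head -n "${1:-5}"
}

# Usage: show the five largest consumers.
# ps -eo rss,comm | rss_by_command 5
```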

Head hung in shame… I think it's a config error!

I've been poking about and found that the grey-listing daemon on my mail server is configured to keep connections open to MySQL.

It doesn't seem to re-use connections, but just keeps opening them up to the 100 limit set in gld.conf

Hopefully I've fixed it and my Linode can go back to being awesome! :wink: :oops:
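One way to confirm a connection pile-up like this is to count open MySQL connections per user. The sketch below assumes the tab-separated output that the mysql client produces when piped; connections_by_user is a hypothetical helper, and the account the daemon connects as depends on your own setup.

```shell
# Hypothetical helper: count open MySQL connections per user.
# Expects tab-separated SHOW PROCESSLIST rows (Id, User, Host, ...) on stdin.
connections_by_user() {
    awk -F '\t' 'NF >= 2 { n[$2]++ }
                 END { for (u in n) printf "%d %s\n", n[u], u }' | sort -rn
}

# Usage (-N drops the column header; add credentials as needed):
# mysql -N -e 'SHOW PROCESSLIST' | connections_by_user
```

A count that climbs steadily towards the configured limit and never drops suggests connections are being opened but not reused.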

OK, I spoke too soon, but I think I've found the reason.

Using 'htop' I've found that clamd is using 311MB of RAM! WTF :cry:

No wonder nothing else has room to work nicely without swapping all the time :roll:

OK, I've found a possible fix for my I/O thrashing.

If I change my "vm.swappiness" value from the default of 60 to 95 it seems to help?

I run this at boot time:

sysctl -q -w vm.swappiness=95

However, it totally fills my swap space up :shock:

top - 12:34:33 up 20 min,  1 user,  load average: 0.24, 0.14, 0.10
Tasks: 104 total,   1 running, 103 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  1.0%sy,  0.1%ni, 95.9%id,  1.9%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:    553176k total,   537280k used,    15896k free,     2480k buffers
Swap:   262136k total,   218204k used,    43932k free,   297812k cached

But it seems to have reduced my disk IO :lol:

I'll keep an eye on it.

Rich
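To put the top output above in perspective, the memory actually held by applications can be worked out by subtracting buffers and page cache from the "used" figure. The arithmetic below uses only the numbers from that output; the variable names are just for illustration.

```shell
# Figures copied from the top output above (all in KiB).
mem_used=537280
buffers=2480
cached=297812
swap_used=218204

# Memory held by applications, excluding buffers and page cache.
app_used=$((mem_used - buffers - cached))
echo "applications: ${app_used}k, page cache: ${cached}k, swapped out: ${swap_used}k"
```

In other words, with swappiness at 95 roughly 213MB of application memory has been pushed out to swap while about 290MB of RAM is being used as disk cache, which would be consistent with the lower disk IO reported.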

Using up most of the swap space during normal system operation is probably not a great idea. If your load increases even slightly for whatever reason, your machine is going to quickly slow to a crawl.

If none of the programs you're running are using abnormal amounts of memory, I would suggest just getting a bigger linode.

@btmorex:

Using up most of the swap space during normal system operation is probably not a great idea. If your load increases even slightly for whatever reason, your machine is going to quickly slow to a crawl.

I appreciate that it's not a good position to be in, but it's better than it was with regards to performance.

Is there anything specific in the configuration of the Xen host that would make this use swap space more often?

@btmorex:

If none of the programs you're running are using abnormal amounts of memory, I would suggest just getting a bigger linode.

I've changed my vm.swappiness value to 5 to test both ends of the scale. My swap space isn't used, but my load average goes up to around 3!

Reverting vm.swappiness back to 95 results in a load average of 0.40!

Rich

I've edited this post, as my fix was to create more indexes in MySQL :oops:

Do not edit your vm.swappiness value unless you have a good reason to do so :wink:
