Diagnosing Lag Spikes

I'm running a MUD on my VPS, and for the past couple of weeks I've been noticing pretty pronounced lag spikes. I started profiling the MUD with gprof and added benchmark timing to the code to further profile the major loops, database functions, file I/O, etc.
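For reference, the gprof workflow is roughly this (a sketch; I'm assuming a C codebase built with gcc here, and the exact flags will vary):

# build with profiling instrumentation
gcc -pg -O2 -o mud src/*.c

# run the server normally; it writes gmon.out on a clean exit
./mud

# generate the flat profile and call graph
gprof ./mud gmon.out > profile.txt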

I can't find anything in the MUD itself that would be causing the lag spikes, so I'm turning to the VPS itself or something else running on it. Even in a shell over SSH I notice random bursts of lag, so I can confirm the problem is affecting both me in my shell and my various MUD players.

I've been looking at "top" to measure CPU and memory usage, but I'm not finding anything out of the ordinary.

Are there any tools or utilities out there I can use to keep track of various system stats over time?

Anyone have any suggestions on how to otherwise track down or diagnose problems like this?

9 Replies

I am also having problems with unexpected load. My VPS runs idle most of the time, but not anymore; any use seems to slow everything down. I also use Linode backups, and the time it takes to do a backup has increased 4-5 fold. I opened a ticket twice and was told everything is OK despite my high disk I/O.

You wouldn't happen to be on dallas152?

@crazylane:

You wouldn't happen to be on dallas152?

zunzun.com is on dallas105 and is seeing very bad lag spikes and lost packets. I thought this was on my end at first, but not according to my tests here - it's only Linode.

James

@crazylane:

You wouldn't happen to be on dallas152?

Nope, atlanta8.

I haven't done much digging into network stats yet. I can't find any hints in CPU usage, memory usage, or file I/O, even though my SSH connection lags at the same moments players on my MUD complain - so something is definitely up.
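When I get some time I'll try quantifying the network side with something like this (a rough sketch; substitute your Linode's actual IP for the placeholder):

# send 100 pings at 0.2-second intervals and summarize loss/latency
ping -c 100 -i 0.2 YOUR_LINODE_IP

# or, if mtr is installed, collect per-hop loss statistics
mtr --report --report-cycles 100 YOUR_LINODE_IP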

The best answer I received from support was that they can move me to a different server. I would rather have a better answer than that; I don't want the downtime at all.

Here is my current top:

top - 18:21:03 up 1 day, 9:22, 1 user, load average: 14.26, 10.75, 7.53
Tasks: 190 total, 2 running, 187 sleeping, 0 stopped, 1 zombie
Cpu(s): 1.2%us, 0.9%sy, 0.1%ni, 41.4%id, 56.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1549180k total, 1247216k used, 301964k free, 153940k buffers
Swap: 524280k total, 4180k used, 520100k free, 765600k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12017 apache    20   0 34708   17m 3716 R  4.6  1.2   0:00.23 httpd
 2889 mysql     20   0  246m   95m 5340 S  3.3  6.3  64:45.48 mysqld
10947 apache    20   0 35316   18m 4752 S  0.7  1.2   0:01.29 httpd
11478 apache    20   0 34156   17m 4292 S  0.7  1.2   0:00.55 httpd
12018 apache    20   0 34164   17m 3812 S  0.7  1.1   0:00.22 httpd
 9488 root      20   0 13408  9928 2380 S  0.3  0.6   0:00.51 backup.pl
10686 root      20   0 13140   10m 1560 S  0.3  0.7   0:28.26 lfd
11622 apache    20   0 34208   17m 4412 S  0.3  1.2   0:00.38 httpd
22437 root      39  19  1968   648  284 S  0.3  0.0   1:43.58 gzip
25285 root      20   0 28604   14m 4724 S  0.3  1.0   0:28.19 httpd
28122 root      20   0  2416  1184  828 R  0.3  0.1   0:22.29 top
    1 root      20   0  2152   604  560 S  0.0  0.0   0:00.38 init

No idea if this is related, but for a few days now (not sure… less than a week, but I might not have been watching dstat during these loads earlier) I've been noticing much larger I/O contention. Whenever I hit data that's not in cache, I get a whole CPU core stuck in iowait, and the disk read speed is measured in hundreds of KB/s instead of tens of MB/s as before. These are the exact same files (a large database used once a day), so it's something in the host's disk subsystem. Even launching emacs gets stuck in iowait for quite a few seconds, and it's not my SSH connection, since the other half of the split screen is still scrolling dstat.
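If you want a hard number to show support, something like this measures raw read throughput while bypassing the page cache (a rough sketch; the file path is just a placeholder, and iflag=direct needs GNU dd):

# read 256 MB straight from disk, skipping the page cache
dd if=/path/to/largefile of=/dev/null bs=1M count=256 iflag=direct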

I'm on newark10.

Try running one of the line-per-second monitoring tools (maybe in a screen session?), like 'vmstat 1' or 'dstat -c', and look for large amounts of CPU time spent in iowaits when the stalls happen.
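If you want a history you can look back through later (which sounds like what you're after), you can also log samples with timestamps, along these lines (a rough sketch; adjust the interval and log path to taste):

# append a timestamped vmstat sample every 5 seconds
vmstat -n 5 | while read line; do echo "$(date '+%F %T') $line"; done >> vmstat.log

# or install the sysstat package and let its cron job collect samples,
# then review historical CPU/iowait figures with:
sar -u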

Anything intensive (rsync, tar, gzip) just totally kills the VPS.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  5   3956 270452 170008 785656    0    0    23    16    3    5  1  0 75 24  0
 0  6   3956 265384 170044 785632    0    0    36     0  824  401  1  1 30 69  0
 0  7   3956 261408 170088 785700    0    0    32    92  370  164  0  0 42 57  0
 0  9   3956 254920 170120 785668    0    0    32     0  508  168  1  1  9 89  0
 0  9   3956 252740 170160 785700    0    0    40     0  212   91  0  0 35 65  0
 1  9   3956 243120 170172 785700    0    0    12     0  406  162  1  1  0 98  0
 0 10   3956 236460 170188 785700    0    0    16     0  972  501  1  1 41 56  0
 0 11   3956 236460 170224 785664    0    0    32   236  166   82  0  0  0 100 0
 0 11   3956 236460 170260 785700    0    0    36     0   43   71  0  0 38 62  0
 0 11   3956 236460 170284 785676    0    0    24     0   45   54  0  0  0 100 0
 0 11   3956 236540 170320 785704    0    0    36     0   69   71  0  0 40 60  0
 0 11   3956 236540 170356 785668    0    0    32     8   75   60  0  0  0 100 0
 0 11   3956 236572 170368 785704    0    0    12     4   44   49  0  0 40 60  0
 0 10   3956 236572 170380 785692    0    0    12     0   47   43  0  0  0 100 0
 0 10   3956 236760 170404 785704    0    0    24     0   45   58  0  0 39 61  0
 1 10   3956 236760 170420 785688    0    0    16    32   48   48  0  0 12 88  0
 0 10   3956 236760 170440 785704    0    0    20     0   49   48  0  0 35 65  0
 0 11   3956 235140 170468 785676    0    0    36    40  169   56  0  0  0 99  0
 0  6   3956 235320 170504 785712    0    0    36     0   87   83  0  0 38 62  0
 0  6   3956 235196 170528 785688    0    0    24     0   61   54  0  0  0 100 0
 0  5   3956 238056 170572 785708    0    0    44     0  224  114  0  0 40 60  0
 1  5   3956 236320 170608 785712    0    0    36     0  233  118  0  0  0 99  0

Well, you really should take up Linode on their offer to move you to another host. If somebody else on the box is hitting the disk pretty hard, why would you want to stay on that box?

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 0 5 3956 270452 170008 785656 0 0 23 16 3 5 1 0 75 24 0 […]

Yep, iowaits. You can try complaining so the Linode staff locate whoever the disk hog is and convince them to stop, or you can accept the host-switch offer.
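It's also worth ruling out your own processes first, for example along these lines (a rough sketch; iotop needs root, and the per-process counters need a reasonably recent kernel):

# show only processes currently doing I/O, with accumulated totals
iotop -o -a

# or check one process's cumulative read/write counters
# (2889 is mysqld's PID from the top output above)
cat /proc/2889/io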

And I thought I had it bad with 25-30% in waits… you seem to be hitting 60-100%… >.>

PS. [ code ] tag is useful.
