Xen, 100% CPU usage, but not on UML

Hi

I have a weird problem, and I've run out of ideas. I am hoping someone here can suggest what else I can do to track it down, or better yet, tell me the solution. :)

THE PROBLEM

I have 4 linodes in 3 data centres; one is UML and the other three are Xen (fremont45, newark28, newark56, dallas5).

At seemingly random intervals (days to months apart) I'll get an email from the Linode alert service saying that one of the Xen linodes is using 100% CPU. Investigation shows that neither SSH nor Lish is responding. Graphs on the dashboard show CPU at 100%, with network and disk I/O both at 0. Typical values on these graphs for the quietest of the linodes are CPU 2%, network 500 bits/sec, disk I/O 10.

A reboot issued from the dashboard eventually brings the system back.

This has never happened on the UML linode, which makes me wonder whether it is Xen itself, or some strange interaction between Xen and something I'm running.

ENVIRONMENT

All 4 linodes have the same monitoring and intrusion-detection packages (e.g. monit, munin-node and fail2ban). I'm using firehol to configure the firewall. All are Debian installs based on etch, with some packages from lenny installed. I'm using the "Latest 2.6 series" kernel on all 4 linodes.

The versions of some of these packages are different on the different linodes.

The purposes of the linodes vary, so the other packages installed vary too. The common ones include mysql and exim4 (both listening only on 127.0.0.1) and apache.

WHAT I HAVE TRIED

I won't list everything here, as this message will be long enough without all the gory detail.

I've gone through every log I can find for anything at all unusual just before the CPU increase, with no result.

Monit is set to alert on CPU (user|wait|system) > 50%, and there is no corresponding syslog entry of that alert firing, which suggests the CPU load shot up too quickly for monit to notice.
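For reference, the kind of rule I mean looks roughly like the following in monitrc; the hostname and exact thresholds here are illustrative rather than a copy of my configuration:

check system uggweb1                       # hostname is a placeholder
    if cpu usage (user) > 50% then alert   # alert only; don't restart anything
    if cpu usage (system) > 50% then alert
    if cpu usage (wait) > 50% then alert

Since monit only samples these values once per poll cycle, a spike that develops between two cycles can be missed entirely.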

Two of the linodes have heartbeat running. When the master node recently had this problem, the other node (uggweb2) had the following in syslog (I am using the local gateway, 209.123.234.1, as a ping node, and uggweb2 is also a mysql slave to the master node):

Jan 22 09:12:05 uggweb2 ipfail: [3358]: debug: Got asked for num_ping.
Jan 22 09:12:05 uggweb2 ipfail: [3358]: debug: Found ping node 209.123.234.1!
Jan 22 09:12:06 uggweb2 ipfail: [3358]: info: Telling other node that we have more visible ping nodes.
Jan 22 09:12:06 uggweb2 ipfail: [3358]: debug: Sending youaredead.
Jan 22 09:12:06 uggweb2 ipfail: [3358]: debug: Message [youaredead] sent.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Got asked for num_ping.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Found ping node 209.123.234.1!
Jan 22 09:12:07 uggweb2 ipfail: [3358]: info: Telling other node that we have more visible ping nodes.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Sending youaredead.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Message [youaredead] sent.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Got asked for num_ping.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Found ping node 209.123.234.1!
Jan 22 09:12:07 uggweb2 ipfail: [3358]: info: Telling other node that we have more visible ping nodes.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Sending youaredead.
Jan 22 09:12:07 uggweb2 ipfail: [3358]: debug: Message [youaredead] sent.
Jan 22 09:23:59 uggweb2 heartbeat: [2388]: info: all clients are now paused
Jan 22 09:25:05 uggweb2 mysqld[1445]: 090122 9:25:05 [ERROR] Slave I/O thread: error reconnecting to master 'xxx@192.168.yyy.zzz:3306': Error: 'Lost connection to MySQL server at 'reading initial communication packet', system error: 113' errno: 2013 retry-time: 60 retries: 86400
Jan 22 09:30:08 uggweb2 heartbeat: [2388]: WARN: Message hist queue is filling up (376 messages in queue)
Jan 22 09:30:11 uggweb2 heartbeat: [2388]: WARN: Message hist queue is filling up (377 messages in queue)
Jan 22 09:30:14 uggweb2 heartbeat: [2388]: WARN: Message hist queue is filling up (378 messages in queue)
Jan 22 09:30:17 uggweb2 heartbeat: [2388]: WARN: Message hist queue is filling up (379 messages in queue)
Jan 22 09:30:20 uggweb2 heartbeat: [2388]: WARN: Message hist queue is filling up (380 messages in queue)
Jan 22 09:30:23 uggweb2 heartbeat: [2388]: WARN: Message hist queue is filling up (381 messages in queue)

uggweb2 did not take over from the main node, I guess because I don't have heartbeat set up properly (still learning).
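For context, the ping node and ipfail come from a heartbeat v1-style ha.cf along these lines; the node names and timing values here are illustrative, not a copy of my actual file:

# /etc/ha.d/ha.cf (heartbeat v1 style) -- values are illustrative
logfacility local0
keepalive 2                  # heartbeat interval in seconds
warntime 10
deadtime 30                  # declare the peer dead after 30 seconds of silence
initdead 60
udpport 694
bcast eth0                   # or: ucast eth0 <peer address>
auto_failback on
node uggweb1                 # placeholder for the master node's uname -n
node uggweb2
ping 209.123.234.1           # local gateway used as the ping node
respawn hacluster /usr/lib/heartbeat/ipfail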

I've googled and checked bug lists for many of the packages I've installed, and the only reports I've found that might be related at all are some posts on the Linode forums about people having one-off problems with 100% CPU on Xen linodes. I've checked, and where relevant changed, my setup in the ways those people did, and I still have the problem.

Load on all four is low at the moment, but two of the Xen ones are due to take over from an existing web server elsewhere, probably during the next month. I really want to sort out this problem before then.

The biggest difficulty is that I don't know how to trigger the problem or predict when it might occur. All I've been able to do so far is make a change and wait to see whether the problem happens again. I think the longest stretch I've had with none of the linodes doing this has been about 6 weeks.

All ideas gratefully listened to.

Cheers,

amj.

7 Replies

No ideas, but some sympathy.

I've been a linode customer since May 2005, and I've been a big big fan. Recommended linode to lots of people; some have even spent money :-)

But I've got to say that I've had more downtime this past year than in the previous 3 put together. And since I got moved to Xen (September 2008; not my choice), performance and reliability have got even worse. Even interactive access to my linode seems slow. Dunno. Maybe I'm getting fussy :-) I still believe Linode is good value for money, but I'm no longer an evangelist for the service.

Xen definitely seems less stable than other technologies that are more mature. Look at all the "host1234 hit a Xen bug so we rebooted it" notices on this forum! Not a week goes by without one of those reboot notices. I really hope that the Xen guys make it stable enough soon…

How can I tell if I'm on Xen?

Under Xen, /proc/cpuinfo shows info for the four cores that you have access to as if they were four processors. Under UML, /proc/cpuinfo contains references to 'UML' and 'User Mode Linux'.
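A quick check based on that (the exact strings vary by kernel version, so treat this as a rough heuristic rather than a definitive test):

# UML kernels identify themselves in /proc/cpuinfo; a Xen guest just shows
# the host's CPU model once per virtual CPU.
if grep -qiE 'user mode linux|uml' /proc/cpuinfo; then
    echo "This looks like UML"
else
    echo "No UML markers; on a Linode this most likely means Xen"
fi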

@sweh:


I use Xen on our office servers, admittedly not as complex as Linode's setup (just a couple of domUs on each dom0), but I haven't once had to reboot due to a Xen bug, so it's not bad as far as I can tell.

I've never had a problem with Xen on my linode since the migration, so things work okay for me. Like others, I have noticed a bit of a slowdown in SSH sessions, but I rarely log in so it doesn't bother me; I also just tested and SSH seems responsive at the moment.

I use SSH and SFTP on a regular basis. I don't notice any slowdown at all compared to other servers I work with.

Jeff

@amj:


Like others have said, I've also found Xen to be slower and less reliable than UML (though I haven't experienced any reliability problems with the latest kernel).

The next time you get a notice that one of your Xen boxes is using 100% CPU, try to log in using console access (Lish). That gives you a direct serial console to your machine, and you should be able to get on it even if SSH is down. From there, you can use the "top" command to see which process is spinning… (I don't know if you've tried this or not.)
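If you do get a console before it locks up completely, something along these lines will capture the worst offenders to a file (standard procps commands, nothing Linode-specific; the output path is just an example):

# Snapshot the top CPU consumers, in case the console freezes as well.
top -b -n 1 | head -n 25 > /tmp/cpu-snapshot.txt
ps aux --sort=-%cpu | head -n 15 >> /tmp/cpu-snapshot.txt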
