Hard kernel crash, rcu_sched stall, kernel 3.7.5-linode48

For the last ~1.5 months I've been experiencing seemingly random kernel lock-ups, now 1-2 times per week. All services become unresponsive, and the Lish console shows some messages but does not respond to input. This requires a "reboot", and, to avoid waiting several minutes for the gentle sync/reboot to work, a "destroy" operation.

The messages vary, but they all seem to be related to memory page faults. Looks like a Xen bug. Does anyone have recommendations on how I can alleviate this problem? Downgrade to a known-good kernel, etc.? Many thanks.

INFO: rcu_sched self-detected stall on CPU
        3: (779947 ticks this GP) idle=1b9/140000000000001/0
         (t=780012 jiffies)
Pid: 24440, comm: gs Tainted: G    B D      3.7.5-linode48 #1
Call Trace:
 [<c0193e8f>] ? print_cpu_stall+0xdf/0x190
 [<c078bae1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 [<c016890f>] ? update_wall_time+0x18f/0x290
 [<c019435a>] ? rcu_check_callbacks+0x12a/0x230
 [<c013f965>] ? update_process_times+0x35/0x70
 [<c016f2fd>] ? tick_sched_timer+0x6d/0xc0
 [<c0151f65>] ? __remove_hrtimer+0x45/0xa0
 [<c016f290>] ? tick_nohz_handler+0xe0/0xe0
 [<c01520ed>] ? __run_hrtimer+0x4d/0xf0
 [<c0152569>] ? hrtimer_interrupt+0x119/0x2f0
 [<c01068f7>] ? xen_timer_interrupt+0x17/0x30
 [<c018d9ff>] ? handle_irq_event_percpu+0x3f/0x150
 [<c018fed5>] ? irq_get_irq_data+0x5/0x10
 [<c04f97a5>] ? info_for_irq+0x5/0x20
 [<c04f9e60>] ? evtchn_from_irq+0x10/0x40
 [<c0190191>] ? handle_percpu_irq+0x31/0x50
 [<c04f9664>] ? __xen_evtchn_do_upcall+0x164/0x210
 [<c04fa868>] ? xen_evtchn_do_upcall+0x18/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c018007b>] ? update_if_frozen+0x6b/0xd0
 [<c04f00d8>] ? irq_cpu_rmap_add+0x88/0x90
 [<c01013a7>] ? xen_hypercall_sched_op+0x7/0x20
 [<c04f9ed7>] ? xen_poll_irq_timeout+0x47/0x60
 [<c0108295>] ? xen_spin_lock_slow+0x65/0xd0
 [<c010835c>] ? xen_spin_lock_flags+0x5c/0x70
 [<c078ba97>] ? _raw_spin_lock_irqsave+0x27/0x40
 [<c01ab83d>] ? pagevec_lru_move_fn+0x5d/0xb0
 [<c01ab170>] ? pagevec_lookup+0x20/0x20
 [<c01c0bd7>] ? exit_mmap+0x37/0x110
 [<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c0101227>] ? xen_hypercall_xen_version+0x7/0x20
 [<c0106297>] ? xen_force_evtchn_callback+0x17/0x30
 [<c01308eb>] ? mmput+0x2b/0xa0
 [<c0136113>] ? exit_mm+0xd3/0x100
 [<c078bac0>] ? _raw_spin_lock_irq+0x10/0x20
 [<c0137b9d>] ? do_exit+0x11d/0x3a0
 [<c0131f17>] ? print_oops_end_marker+0x27/0x30
 [<c010c272>] ? oops_end+0x72/0xa0
 [<c012687e>] ? __bad_area_nosemaphore+0xae/0x140
 [<c018da40>] ? handle_irq_event_percpu+0x80/0x150
 [<c018fed5>] ? irq_get_irq_data+0x5/0x10
 [<c012696b>] ? bad_area+0x3b/0x50
 [<c0126f32>] ? __do_page_fault+0x402/0x410
 [<c04f96ce>] ? __xen_evtchn_do_upcall+0x1ce/0x210
 [<c01947b3>] ? rcu_irq_exit+0x53/0xb0
 [<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c078c2fe>] ? error_code+0x5a/0x60
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c01a78f8>] ? get_page_from_freelist+0x118/0x3c0
 [<c0103138>] ? load_TLS_descriptor+0x58/0xa0
 [<c01a7e81>] ? __alloc_pages_nodemask+0x141/0x6d0
 [<c01aa98d>] ? __do_page_cache_readahead+0xdd/0x1a0
 [<c01aaa6e>] ? ra_submit+0x1e/0x30
 [<c01a3099>] ? filemap_fault+0x309/0x3e0
 [<c01ba5a5>] ? __do_fault+0x75/0x570
 [<c01bdbf0>] ? handle_pte_fault+0xa0/0x2f0
 [<c01bdf35>] ? handle_mm_fault+0xf5/0x1b0
 [<c0126c6a>] ? __do_page_fault+0x13a/0x410
 [<c01c36a5>] ? sys_mprotect+0x1b5/0x1f0
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c078c2fe>] ? error_code+0x5a/0x60
 [<c0126f40>] ? __do_page_fault+0x410/0x410

16 Replies

I recently got one of these myself - very similar to what you have there. The error I had referenced my webserver, Litespeed.

INFO: rcu_sched self-detected stall on CPU

1: (239698 ticks this GP) idle=98d/140000000000001/0

(t=240004 jiffies)

Pid: 2486, comm: litespeed Not tainted 3.7.5-linode48 #1

Call Trace:

After a recent reboot, I see high CPU for a process named rcu_sched. I've never seen that in my years with Linode. Ubuntu 12.10 64 bit, latest kernel.

James

A 3.7 rcu_sched stall bug related to TCP was fixed in 3.7.8 upstream. Linode might want to cherry-pick it or just roll out 3.7.9, which was just released:

commit 09ea1383126d942a993b0895cec16e0961db5af9

Author: Eric Dumazet <edumazet@google.com>

Date: Thu Jan 10 07:06:10 2013 +0000

tcp: splice: fix an infinite loop in tcp_read_sock()

[ Upstream commit ff905b1e4aad8ccbbb0d42f7137f19482742ff07 ]

commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)

added a regression.

[ 83.843570] INFO: rcu_sched self-detected stall on CPU

[ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)

[ 83.844582] Task dump for CPU 6:

[ 83.844584] netperf R running task 0 8966 8952 0x0000000c

[ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000

[ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10

[ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8

[ 83.844594] Call Trace:

[ 83.844596] [] ? vprintk_emit+0x1c9/0x4c0

[ 83.844601] [] ? schedule+0x29/0x70

[ 83.844606] [] ? tcp_splice_data_recv+0x42/0x50

[ 83.844610] [] ? tcp_read_sock+0xda/0x260

[ 83.844613] [] ? tcp_prequeue_process+0xb0/0xb0

[ 83.844615] [] ? tcp_splice_read+0xc0/0x250

[ 83.844618] [] ? sock_splice_read+0x22/0x30

[ 83.844622] [] ? do_splice_to+0x7b/0xa0

[ 83.844627] [] ? sys_splice+0x59c/0x5d0

[ 83.844630] [] ? putname+0x2b/0x40

[ 83.844633] [] ? do_sys_open+0x174/0x1e0

[ 83.844636] [] ? system_call_fastpath+0x16/0x1b

if recv_actor() returns 0, we should stop immediately,

because looping won't give a chance to drain the pipe.

Signed-off-by: Eric Dumazet <edumazet@google.com>

Cc: Willy Tarreau <w@1wt.eu>

Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
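
For anyone wondering whether their currently booted kernel already contains that fix, here is a rough sketch of a version check against 3.7.8, the first stable release carrying the backport. This is just my own helper, not anything official; it assumes GNU sort with version-sort support and an "x.y.z-linodeNN" style version string:

#!/bin/sh
# Rough check: does the running kernel predate 3.7.8, the first stable
# release carrying the tcp_read_sock() splice fix?
fix="3.7.8"
running="$(uname -r | cut -d- -f1)"    # drop the "-linode48" style suffix
newest="$(printf '%s\n%s\n' "$fix" "$running" | sort -V | tail -n1)"
if [ "$newest" = "$running" ]; then
    echo "Kernel $running is $fix or newer; the splice fix should be present."
else
    echo "Kernel $running predates $fix; a newer kernel is worth considering."
fi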

3.7.9 kernels are inbound!

Thanks for the heads up about the TCP bug. However, all of mine seem to be memory / page fault related.

reboot   system boot  3.7.5-linode48   Mon Feb 18 05:53
reboot   system boot  3.7.5-linode48   Mon Feb  4 12:58
reboot   system boot  3.7.5-linode48   Mon Feb  4 07:22
reboot   system boot  3.6.5-linode47   Mon Jan 28 21:39
reboot   system boot  3.6.5-linode47   Sun Jan 27 13:51
reboot   system boot  3.6.5-linode47   Mon Jan 14 07:50
reboot   system boot  3.6.5-linode47   Sun Jan  6 15:42
reboot   system boot  3.6.5-linode47   Sat Dec 22 13:09

It would be amazing/heroic if Linode could provide:

1) some sort of very simple "external" monitoring service, i.e. does a particular URL respond to at least 1 of 5 retried requests over 60 seconds;
2) hook the automatic reboot capability into this;
3) notify me of the event.

I realize this has its own dangers and complexities, but I'm fairly sure that kernel bugs like this will pop up from now to eternity, and the silent hard lockups are a real pain. (I'm using Server Density to do monitoring, which is how I discover these outages, but I'm grandfathered in under their old [sane!] pricing.) Charge me $5/month for this, or give it away free knowing that I will be more hesitant to leave Linode thanks to this extra automatic monitoring/reliability.
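
Until something like that exists, a very rough equivalent can be run from cron on any other machine. This is only a sketch under my own assumptions: the URL, mail address, and alert method are placeholders, and wiring it up to an actual reboot through the provider's API is left out:

#!/bin/sh
# Poor man's external monitor: passes if at least 1 of 5 attempts,
# spread over roughly 60 seconds, gets an HTTP response from $URL.
URL="http://example.com/health"        # placeholder: the URL to probe

for attempt in 1 2 3 4 5; do
    if curl -fsS --max-time 10 -o /dev/null "$URL"; then
        exit 0                         # reachable; nothing to do
    fi
    sleep 12                           # 5 attempts spaced ~12s apart
done

# All five attempts failed: send an alert (triggering a reboot through
# the provider's API could go here, but is not shown).
echo "$URL unreachable after 5 attempts" | mail -s "linode down?" you@example.com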

You can configure the Linux kernel to panic on an oops, meaning your kernel would exit and Lassie (the reboot watchdog) would then bring your system back online.

echo 1 > /proc/sys/kernel/panic          # reboot (in our case, exit) 1 second after a panic

echo 1 > /proc/sys/kernel/panic_on_oops  # give up (panic) after an oops
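
Those echo commands only last until the next boot; a minimal sketch for making them persistent (assuming a standard sysctl setup with the usual /etc/sysctl.conf):

# Persist the settings across reboots via /etc/sysctl.conf
# (or a drop-in file under /etc/sysctl.d/ on distros that use it):
cat >> /etc/sysctl.conf <<'EOF'
kernel.panic = 1
kernel.panic_on_oops = 1
EOF
sysctl -p    # apply the file now without rebooting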

-Chris

Hi Chris, learned something, thanks! Will try the /proc/sys/kernel/panic_on_oops & /proc/sys/kernel/panic settings and hope that Lassie comes to save me next time :) - Mike

Just crashed again a few minutes ago. Fortunately, it did reboot itself as per caker's suggestion (thanks!). Unfortunately, there are no log mentions of the issue at all, making it just about impossible to debug further.

@caker:

3.7.9 kernels are inbound!

Have you been able to estimate an arrival date?

James

@caker:

3.7.9 kernels are inbound!

Have you been able to estimate an arrival date?

James

FWIW, I've been running (custom) 3.7.9 on 3 Linodes for a few days with no issues. Then again, I didn't have any issues with the earlier 3.7s either, but at least it hasn't broken anything else yet.

@caker:

3.7.9 kernels are inbound!

Have you been able to estimate an arrival date?

James

3.7.10-linode49 and 3.7.10-x86_64-linode30 were released today. "Latest" now points to them.

http://www.linode.com/kernels/

http://www.linode.com/kernels/rss.xml

Enjoy!

-Chris

@caker:

3.7.10-linode49 and 3.7.10-x86_64-linode30 were released today

Outstanding, just outstanding. Thank you.

James

Thanks for the tip; otherwise I wouldn't have known I needed to reboot to pick up the new kernel.
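
For anyone else following along: after the configuration profile is set to "Latest" and the Linode is rebooted, a quick way to confirm the new kernel actually took effect is:

uname -r    # should now report 3.7.10-linode49 (or 3.7.10-x86_64-linode30)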
