Hard kernel crash, rcu_sched stall, kernel 3.7.5-linode48

For the last ~1.5 months I've been experiencing seemingly random kernel lock-ups, now 1-2 times per week. All services become unresponsive, and the Lish console shows some messages but does not respond to input. This requires a "reboot", and, to avoid waiting several minutes for the gentle sync/reboot to work, a "destroy" operation.

The messages vary, but they all seem to be related to memory page faults. Looks like a Xen bug. Does anyone have recommendations on how I can alleviate this problem? Downgrade to a known-good kernel, etc.? Many thanks.

INFO: rcu_sched self-detected stall on CPU
        3: (779947 ticks this GP) idle=1b9/140000000000001/0
         (t=780012 jiffies)
Pid: 24440, comm: gs Tainted: G    B D      3.7.5-linode48 #1
Call Trace:
 [<c0193e8f>] ? print_cpu_stall+0xdf/0x190
 [<c078bae1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 [<c016890f>] ? update_wall_time+0x18f/0x290
 [<c019435a>] ? rcu_check_callbacks+0x12a/0x230
 [<c013f965>] ? update_process_times+0x35/0x70
 [<c016f2fd>] ? tick_sched_timer+0x6d/0xc0
 [<c0151f65>] ? __remove_hrtimer+0x45/0xa0
 [<c016f290>] ? tick_nohz_handler+0xe0/0xe0
 [<c01520ed>] ? __run_hrtimer+0x4d/0xf0
 [<c0152569>] ? hrtimer_interrupt+0x119/0x2f0
 [<c01068f7>] ? xen_timer_interrupt+0x17/0x30
 [<c018d9ff>] ? handle_irq_event_percpu+0x3f/0x150
 [<c018fed5>] ? irq_get_irq_data+0x5/0x10
 [<c04f97a5>] ? info_for_irq+0x5/0x20
 [<c04f9e60>] ? evtchn_from_irq+0x10/0x40
 [<c0190191>] ? handle_percpu_irq+0x31/0x50
 [<c04f9664>] ? __xen_evtchn_do_upcall+0x164/0x210
 [<c04fa868>] ? xen_evtchn_do_upcall+0x18/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c018007b>] ? update_if_frozen+0x6b/0xd0
 [<c04f00d8>] ? irq_cpu_rmap_add+0x88/0x90
 [<c01013a7>] ? xen_hypercall_sched_op+0x7/0x20
 [<c04f9ed7>] ? xen_poll_irq_timeout+0x47/0x60
 [<c0108295>] ? xen_spin_lock_slow+0x65/0xd0
 [<c010835c>] ? xen_spin_lock_flags+0x5c/0x70
 [<c078ba97>] ? _raw_spin_lock_irqsave+0x27/0x40
 [<c01ab83d>] ? pagevec_lru_move_fn+0x5d/0xb0
 [<c01ab170>] ? pagevec_lookup+0x20/0x20
 [<c01c0bd7>] ? exit_mmap+0x37/0x110
 [<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c0101227>] ? xen_hypercall_xen_version+0x7/0x20
 [<c0106297>] ? xen_force_evtchn_callback+0x17/0x30
 [<c01308eb>] ? mmput+0x2b/0xa0
 [<c0136113>] ? exit_mm+0xd3/0x100
 [<c078bac0>] ? _raw_spin_lock_irq+0x10/0x20
 [<c0137b9d>] ? do_exit+0x11d/0x3a0
 [<c0131f17>] ? print_oops_end_marker+0x27/0x30
 [<c010c272>] ? oops_end+0x72/0xa0
 [<c012687e>] ? __bad_area_nosemaphore+0xae/0x140
 [<c018da40>] ? handle_irq_event_percpu+0x80/0x150
 [<c018fed5>] ? irq_get_irq_data+0x5/0x10
 [<c012696b>] ? bad_area+0x3b/0x50
 [<c0126f32>] ? __do_page_fault+0x402/0x410
 [<c04f96ce>] ? __xen_evtchn_do_upcall+0x1ce/0x210
 [<c01947b3>] ? rcu_irq_exit+0x53/0xb0
 [<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c078c2fe>] ? error_code+0x5a/0x60
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c01a78f8>] ? get_page_from_freelist+0x118/0x3c0
 [<c0103138>] ? load_TLS_descriptor+0x58/0xa0
 [<c01a7e81>] ? __alloc_pages_nodemask+0x141/0x6d0
 [<c01aa98d>] ? __do_page_cache_readahead+0xdd/0x1a0
 [<c01aaa6e>] ? ra_submit+0x1e/0x30
 [<c01a3099>] ? filemap_fault+0x309/0x3e0
 [<c01ba5a5>] ? __do_fault+0x75/0x570
 [<c01bdbf0>] ? handle_pte_fault+0xa0/0x2f0
 [<c01bdf35>] ? handle_mm_fault+0xf5/0x1b0
 [<c0126c6a>] ? __do_page_fault+0x13a/0x410
 [<c01c36a5>] ? sys_mprotect+0x1b5/0x1f0
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c078c2fe>] ? error_code+0x5a/0x60
 [<c0126f40>] ? __do_page_fault+0x410/0x410

16 Replies

I recently got one of these myself - very similar to what you have there. The error I had referenced my webserver, Litespeed.

INFO: rcu_sched self-detected stall on CPU

1: (239698 ticks this GP) idle=98d/140000000000001/0

(t=240004 jiffies)

Pid: 2486, comm: litespeed Not tainted 3.7.5-linode48 #1

Call Trace:

After a recent reboot, I see high CPU for a process named rcu_sched. I've never seen that in my years with Linode. Ubuntu 12.10 64 bit, latest kernel.

James

A 3.7 rcu_sched stall bug related to TCP was fixed in 3.7.8 upstream. Linode might want to cherry-pick it or just roll out 3.7.9, which was just released:

commit 09ea1383126d942a993b0895cec16e0961db5af9

Author: Eric Dumazet <edumazet@google.com>

Date: Thu Jan 10 07:06:10 2013 +0000

tcp: splice: fix an infinite loop in tcp_read_sock()

[ Upstream commit ff905b1e4aad8ccbbb0d42f7137f19482742ff07 ]

commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)

added a regression.

[ 83.843570] INFO: rcu_sched self-detected stall on CPU

[ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)

[ 83.844582] Task dump for CPU 6:

[ 83.844584] netperf R running task 0 8966 8952 0x0000000c

[ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000

[ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10

[ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8

[ 83.844594] Call Trace:

[ 83.844596] [] ? vprintk_emit+0x1c9/0x4c0

[ 83.844601] [] ? schedule+0x29/0x70

[ 83.844606] [] ? tcp_splice_data_recv+0x42/0x50

[ 83.844610] [] ? tcp_read_sock+0xda/0x260

[ 83.844613] [] ? tcp_prequeue_process+0xb0/0xb0

[ 83.844615] [] ? tcp_splice_read+0xc0/0x250

[ 83.844618] [] ? sock_splice_read+0x22/0x30

[ 83.844622] [] ? do_splice_to+0x7b/0xa0

[ 83.844627] [] ? sys_splice+0x59c/0x5d0

[ 83.844630] [] ? putname+0x2b/0x40

[ 83.844633] [] ? do_sys_open+0x174/0x1e0

[ 83.844636] [] ? system_call_fastpath+0x16/0x1b

if recv_actor() returns 0, we should stop immediately,

because looping won't give a chance to drain the pipe.

Signed-off-by: Eric Dumazet <edumazet@google.com>

Cc: Willy Tarreau <w@1wt.eu>

Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
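
For anyone wondering whether their currently booted kernel already contains that fix, here is a rough sketch of a version check against 3.7.8, the first stable release carrying the backport. This is just my own helper, not anything official; it assumes GNU sort with version-sort support and an "x.y.z-linodeNN" style version string:

#!/bin/sh
# Rough check: does the running kernel predate 3.7.8, the first stable
# release carrying the tcp_read_sock() splice fix?
fix="3.7.8"
running="$(uname -r | cut -d- -f1)"    # drop the "-linode48" style suffix
newest="$(printf '%s\n%s\n' "$fix" "$running" | sort -V | tail -n1)"
if [ "$newest" = "$running" ]; then
    echo "Kernel $running is $fix or newer; the splice fix should be present."
else
    echo "Kernel $running predates $fix; a newer kernel is worth considering."
fi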

3.7.9 kernels are inbound!

Thanks for the heads up about the TCP bug. However, all of mine seem to be memory / page fault related.

reboot   system boot  3.7.5-linode48   Mon Feb 18 05:53
reboot   system boot  3.7.5-linode48   Mon Feb  4 12:58
reboot   system boot  3.7.5-linode48   Mon Feb  4 07:22
reboot   system boot  3.6.5-linode47   Mon Jan 28 21:39
reboot   system boot  3.6.5-linode47   Sun Jan 27 13:51
reboot   system boot  3.6.5-linode47   Mon Jan 14 07:50
reboot   system boot  3.6.5-linode47   Sun Jan  6 15:42
reboot   system boot  3.6.5-linode47   Sat Dec 22 13:09

It would be amazing/heroic if Linode could provide:

1) some sort of very simple "external" monitoring service, i.e. does a particular URL respond to at least 1 of 5 retried requests over 60 seconds;
2) hook the automatic reboot capability into this;
3) notify me of the event.

I realize this has its own dangers and complexities, but I'm fairly sure that kernel bugs like this will pop up from now to eternity, and the silent hard lockups are a real pain. (I'm using Server Density to do monitoring, which is how I discover these outages, but I'm grandfathered in under their old [sane!] pricing.) Charge me $5/month for this, or give it away free knowing that I will be more hesitant to leave Linode thanks to this extra automatic monitoring/reliability.
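
Until something like that exists, a very rough equivalent can be run from cron on any other machine. This is only a sketch under my own assumptions: the URL, mail address, and alert method are placeholders, and wiring it up to an actual reboot through the provider's API is left out:

#!/bin/sh
# Poor man's external monitor: passes if at least 1 of 5 attempts,
# spread over roughly 60 seconds, gets an HTTP response from $URL.
URL="http://example.com/health"        # placeholder: the URL to probe

for attempt in 1 2 3 4 5; do
    if curl -fsS --max-time 10 -o /dev/null "$URL"; then
        exit 0                         # reachable; nothing to do
    fi
    sleep 12                           # 5 attempts spaced ~12s apart
done

# All five attempts failed: send an alert (triggering a reboot through
# the provider's API could go here, but is not shown).
echo "$URL unreachable after 5 attempts" | mail -s "linode down?" you@example.com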

You can configure the Linux kernel to panic on an oops, meaning your kernel would exit and Lassie (the reboot watchdog) would then bring your system back online.

echo 1 > /proc/sys/kernel/panic          # reboot (in our case, exit) 1 second after a panic

echo 1 > /proc/sys/kernel/panic_on_oops  # give up (panic) after an oops
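
Those echo commands only last until the next boot; a minimal sketch for making them persistent (assuming a standard sysctl setup with the usual /etc/sysctl.conf):

# Persist the settings across reboots via /etc/sysctl.conf
# (or a drop-in file under /etc/sysctl.d/ on distros that use it):
cat >> /etc/sysctl.conf <<'EOF'
kernel.panic = 1
kernel.panic_on_oops = 1
EOF
sysctl -p    # apply the file now without rebooting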

-Chris

Hi Chris, learned something, thanks! Will try the /proc/sys/kernel/panic_on_oops & /proc/sys/kernel/panic settings and hope that Lassie comes to save me next time :) - Mike

Just crashed again a few minutes ago. Fortunately, it did reboot itself as per caker's suggestion (thanks!). Unfortunately, there are no log mentions of the issue at all, making it just about impossible to debug further.

@caker:

3.7.9 kernels are inbound!

Have you been able to estimate an arrival date?

James

@caker:

3.7.9 kernels are inbound!

Have you been able to estimate an arrival date?

James

FWIW, I've been running (custom) 3.7.9 on 3 Linodes for a few days with no issues. Then again, I didn't have any issues with the earlier 3.7s either, but at least it hasn't broken anything else yet.

@caker:

3.7.9 kernels are inbound!

Have you been able to estimate an arrival date?

James

3.7.10-linode49 and 3.7.10-x86_64-linode30 were released today. "Latest" now points to them.

http://www.linode.com/kernels/

http://www.linode.com/kernels/rss.xml

Enjoy!

-Chris

@caker:

3.7.10-linode49 and 3.7.10-x86_64-linode30 were released today

Outstanding, just outstanding. Thank you.

James

Thanks for the tip; otherwise I wouldn't have known I needed to reboot to pick up the new kernel.
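
For anyone else following along: after the configuration profile is set to "Latest" and the Linode is rebooted, a quick way to confirm the new kernel actually took effect is:

uname -r    # should now report 3.7.10-linode49 (or 3.7.10-x86_64-linode30)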
