Custom kernel 2.6.35-rc3 and issues with FastCGI/Django?

Hi all,

I'm currently bouncing between kernel versions because over the last few weeks, I haven't found one that is issue-free for me.

2.6.33-linode24: The best of the bunch, but twice now the time has "frozen", stopping all CPU timers, cronjobs, etc. This may be fixed now thanks to some advice from the Linode staff (relates to Xen and clocksources), but it's very hard to tell because it only happened somewhat randomly, and only after 10-15 days of uptime. Probably the most frustrating kind of bug – intermittent, non-reproducible, and totally fatal! (Nevertheless, this is the kernel that I'm sticking with at the moment.)

2.6.32.12-linode25: Every few hours -- seemingly randomly and uncorrelated to cronjobs or external load -- we'd see a huge spike in load average up to 30-40 or so, just for a few seconds, but enough to set off our monitoring software and for the server to be barely responsive to other tasks for 30 to 60 seconds or so. No noticeable spikes on any of the Linode graphs during these events.

2.6.35-rc3: My latest attempt was to compile a custom kernel, stock from kernel.org, using the .config file from http://linode.com/src/2.6.32.12-linode25.tar.bz2 as my starting point. Thanks to the excellent article "Running Custom Kernels with PV-GRUB", I had no problem compiling the kernel and getting it started.

Everything seems to run great on 2.6.35-rc3, including the web server (lighttpd), database (mysql), e-mail (qmail), asterisk, etc… with one major exception: my Python Django FastCGI processes will run, but will only seem to take one or two requests from lighttpd. After that, lighttpd continues to try to pass them requests (via tcp localhost:3303), but there's no answer. In the lighttpd logs, I get:

"establishing connection failed: Connection timed out socket: tcp:127.0.0.1:3303"

The python processes continue running and don't seem to use any CPU.
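For anyone chasing a similar symptom, it can help to probe the FastCGI port directly and see whether the connection is refused outright or just hangs. A hang like the "Connection timed out" above would suggest the listener exists but is not accepting connections (or that packets are being dropped), though that is only a guess. A minimal sketch, assuming bash (for its /dev/tcp redirection) and coreutils `timeout` are installed; `probe_port` is a hypothetical helper name, not part of lighttpd or Django:

```shell
# Try to open a TCP connection and report "open", "timeout" or "failed".
# Usage: probe_port HOST PORT [SECONDS]
probe_port() (
    host=$1
    port=$2
    secs=${3:-3}
    rc=0
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
    # coreutils timeout exits with status 124 if it had to kill the command.
    timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null || rc=$?
    case $rc in
        0)   echo open ;;
        124) echo timeout ;;
        *)   echo failed ;;   # refused, unreachable, or tools missing
    esac
)
```

Running `probe_port 127.0.0.1 3303` a few times while lighttpd is logging errors would show which of the three cases applies.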

Since the userland was exactly the same, and only the kernel was changing, my only thought so far was that it might be firewall related (arno-iptables-firewall). However, I tried disabling the firewall entirely, but still had identical results!

Any ideas / clues? Hoping to make the custom kernel work. Why would this one piece out of everything be affected so dramatically by a new kernel? Thanks in advance!

Mike

43 Replies

Re: 2.6.33-linode24: I had the same issue.

Re: 2.6.32.12-linode25: I had a different issue; nginx had page faults under this kernel.

So I've gone the custom kernel route: I'm using Ubuntu 9.10 with the 2.6.31-307-ec2 kernel and I have no problems.

What distro are you using?

Hi obs,

Thanks for the reply – very interesting to hear that you've experienced similar issues with those two kernels (2.6.33-linode24 and 2.6.32.12-linode25). I thought I was going nuts!

Still surprised to see that there are such issues that are so dependent on the kernel version.

I'm using Debian stable (lenny).

I'll have to try a handful of different versions of custom kernels and see if it's specific to this 2.6.35-rc3.

Thanks again,

Mike

Try using the default Debian kernel linux-image-2.6.26-2-486 (assuming you're using 32-bit). It might not boot, though: the Ubuntu default kernel doesn't, but the ec2 one does.

Give me a few and I'll roll an "official" 2.6.34 to test.

-Chris

OK - 2.6.34-linode26 and 2.6.34-x86_64-linode13 are out there. Test away.

-Chris

@caker:

> OK - 2.6.34-linode26 and 2.6.34-x86_64-linode13 are out there. Test away.
>
> -Chris

Sounds good, I'm going to be unavailable for a few days, compumike if you get a chance to test it I'd be interested in your results.

Hi Chris (& obs),

Thanks for your quick replies! Currently spending a few hundred bucks trying out reddit.com ads today, so this isn't a good time to experiment, but my plan is to give it a shot early tomorrow. Will write in and let you know how it goes.

Just curious – when you built the kernel, did you do anything other than get the stock kernel from kernel.org, copy in a .config from one of your recent linode kernels (like 2.6.32.12-linode25), run "make oldconfig" and answer with the defaults, and then build? Any patches to apply or special config options? Just trying to track down my issues with 2.6.35-rc3.

(Also noticed that http://www.linode.com/src/ hasn't been updated. If you get a chance, I'd really just like the .config file so I know we're working from the same point.)

(One more "quickie" -- http://www.linode.com/irc/logs/ permissions issue? A lot of my search results ended up pointing there, but it's 403'ed.)

Thanks again,

Mike

I've been doing Xen a long time and have learned a few things along the way - it can be sensitive to config options and even certain toolchain versions. Fortunately, mainline support for pv_ops means no external patching or other shenanigans, so it's pretty much kernel.org it, copy my precious working-config from a version past, and then make oldconfig, etc.

I'll push up the tarballs once I know the kernels will stick around for more than a few days. For now, zcat /proc/config.gz :)

I fixed the /irc/ folder permissions on one of our loadbalancers. Thanks for the heads-up.

-Chris
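The workflow described above (stock source, a known-good config, then make oldconfig) can be sketched roughly as follows. This is only an illustration: `seed_kernel_config` is a hypothetical helper name, and /proc/config.gz only exists when the running kernel was built with CONFIG_IKCONFIG_PROC.

```shell
# Decompress a gzip'ed kernel config into TREE/.config.
# Usage: seed_kernel_config [CONFIG_GZ] [TREE]
# Defaults: the running kernel's config and the current directory.
seed_kernel_config() (
    src=${1:-/proc/config.gz}
    tree=${2:-.}
    [ -r "$src" ]  || { echo "cannot read $src" >&2; return 1; }
    [ -d "$tree" ] || { echo "no such directory: $tree" >&2; return 1; }
    # gzip -dc does the same job as zcat and is a bit more portable.
    gzip -dc "$src" > "$tree/.config"
)
```

After seeding the tree, running `yes "" | make oldconfig` inside it accepts the default answer for every option that is new since the old config, which matches the "answer with the defaults" step mentioned earlier in the thread.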

Hi Chris,

Just booted with the 2.6.34-linode26 kernel (32-bit) and everything seems to be fine! (All userspace services working properly. No issue with the lighttpd/django fastcgi intercommunication, even when stressed via "ab".)

It is yet to be seen whether the issues I experienced with 2.6.33-linode24 (with CPU timers stopping) and with 2.6.32.12-linode25 (with random load average spikes) will repeat themselves, as those seemed to happen over the course of weeks and hours respectively. But for at least 15 minutes of uptime, I can safely say that it's actually working, which was definitely not the case for my attempt to build 2.6.35-rc3.

Will watch it carefully this weekend and report back if there are any issues.

Thanks!

Mike

11+ hours of uptime and all is still running fine.

At this point I would have expected to experience the 2.6.32.12-linode25 load average spike issue, so it's very good news that I haven't.

Mike

Hi all,

Now at 3 days, 18 hours of uptime, and it's been the most rock-solid I've experienced. Finally had a nice quiet weekend without issues – seriously, this has made a tremendous change.

Still will have to see if it has the same once-every-few-weeks "timer stopped" fatal issues as with 2.6.33-linode24, but I'm hoping not!

I recommend trying 2.6.34-linode26 if you are having issues with one of the other kernels similar to those I've described earlier in this thread.

Thanks!

Mike

OK, I've cloned a linode, booted it up with the new kernel, and asked my users to go hammer it to see if it has issues. I'll leave it up for a few days and let you know what happens.

It's been running for 2 days no issues! This kernel's a keeper!

2.6.34-linode26 has frozen the clock for me on 2 different linodes.

Hi stefantalpalaru,

Good to know. Mine is still OK at the moment. Hopefully we can try to track this down! Let's collect some basic information that might be helpful to those more knowledgeable about the clock / timer freezing issue. Here are some questions I thought might be important, as well as my personal answers.

stefantalpalaru and obs, and anyone else who has had the clock freeze issue, can you both respond to these questions so we can make sure we're on the same page?

1) Can you confirm that the nature of the clock freeze is identical to what I described in my first post? (System is still reachable and serves web pages, but cronjobs don't run, load average statistics don't update, problems with interactive ssh sessions, non-working lish console. Running "ssh me@mylinode uptime" shows some fields that do update (number of users, amount of uptime) and some fields that don't (current time, load averages). Running "ssh me@mylinode date" still updates with the correct time.)

2) What distribution are you running? (I'm on Debian lenny / stable)

3) What Linode plan? (I'm on the Linode 1024, which was 720 when I had my two clock freezes)

4) What kind of load was that kernel seeing? (Based on Linode graphs, I'm typically around 4-5% CPU, sometimes bursting to maybe 110% for some batch jobs. I use almost zero swap space and make a serious effort to keep everything in RAM. This includes vm.swappiness=0, use of memcached, and making strategic use of "tmpfs" RAM-backed filesystems for certain parts of my application.)

5) Are you running ntpd? (I have seen clock freezes on 2.6.33-linode24 both with and without ntpd, but just for the record…)

6) Have you had this issue with other kernels too? (So far, I've only experienced it with 2.6.33-linode24 -- not yet with 2.6.34-linode26.)

7) How much uptime did the box have before the clocks froze? (I had roughly 10-15 days uptime on both occurrences.)

8) Have you been able to make a correlation / guess as to whether the issue occurs with high CPU usage, high IO usage, high network usage, etc? Any unusual log messages from those incidents? (I have not been able to find anything that I thought might be related.)

9) Do you have any way to quickly / controllably reproduce the clock freeze? (Unfortunately I don't.)

10) What datacenter are you in? (I'm in newark)

Mike

@stefantalpalaru:

> 2.6.34-linode26 has frozen the clock for me on 2 different linodes.

Hrmm, well, that's not good. Maybe I didn't test for long enough. Have you raised a ticket with support?

No need, we're already watching this thread.

compumike, here's the info:

1. the system responds to ping and I can ssh into it, but can't input anything in the interactive session. The time I see in the shell prompt is way off. The web server doesn't work, CPU usage is at 100%.

2. Debian unstable and Gentoo ~x86

3. Linode 1024 and Linode 4096

4. very low CPU usage. see the munin graphs: http://munin.od-eon.com/com/od-eon.com/index.html

5. yes, ntpd runs on both linodes

6. yes, all the paravirt kernels I've tried, with varying periods of time between clock freezes (some of them lasted for more than a month). But I need the latest DRBD version so I keep trying to stabilize it. Most of the kernels had custom configs (booted with PV_GRUB) so I was pretty much on my own, but now I see the same problem with the official config.

7. last uptimes: 5 and 2 days

8. no, but I suspect it's all triggered by the clock as presented by Xen

9. no

10. Dallas and Newark

Here's my info

1) Yes, the web server still serves pages (but not for long, since the firewall detects a synflood and blocks connections). SSH, however, doesn't work: it just locks up if the session is already connected, and if not connected it hangs on connection.

2) Ubuntu 9.10 32 bit.

3) 512 which was 360 when the freeze happened.

4) Not sure since it was a long time ago, but if I was to hazard a guess probably around 5-10%.

5) Yes - logs didn't show NTPD trying to change the time if memory serves.

6) No (I'm currently using pv_grub with ubuntus ec2 kernel)

7) 12-24 hours at most, the server is in use almost 24/7 and people tended to yell at me as soon as it locked up

8) Absolutely nothing. I delved into my logs and could find diddly squat; it seemed completely random.

9) Again no sadly.

10) Dallas

Hi obs,

Two days ago you said, in reference to 2.6.34-linode26:

> It's been running for 2 days no issues! This kernel's a keeper!

Is that box still running – now up to 4 days / 96 hours? If so, that seems like a significant departure from your

> 12-24 hours at most

description about earlier kernels.

Mike

No, it's not still running. It was a clone (I didn't want to risk the live server), and after 2 days I deleted it. But it did run fine for 2 days, which is an improvement on the 12-24 hours :)

For what it's worth, my Linode is now at 10 days, 21 hours uptime on running 2.6.34-linode26 with no issues at all – so far, so good! However, my last two timer freezes (both on the earlier kernel version 2.6.33-linode24) were with roughly 10 and 14 days uptime, so we're not out of the woods yet.

Can anyone suggest useful things to try to record in the event that it does freeze again (before rebooting)? Catting particular files within /proc perhaps?
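As one possible starting point, a small script could copy a handful of timekeeping-related files somewhere safe before the reboot. This is a rough sketch only: `collect_snapshot` is a hypothetical helper, and the default file list is a guess at what might be informative, not a list anyone in this thread has confirmed to be useful.

```shell
# Copy readable files into OUTDIR for later inspection.
# Usage: collect_snapshot OUTDIR [FILE...]
collect_snapshot() (
    out=$1
    shift
    if [ "$#" -eq 0 ]; then
        # Guessed defaults: uptime, load, interrupt counts, timer state,
        # and the clocksource currently in use.
        set -- /proc/uptime /proc/loadavg /proc/interrupts /proc/timer_list \
               /sys/devices/system/clocksource/clocksource0/current_clocksource
    fi
    mkdir -p "$out" || return 1
    date -u > "$out/date.txt"    # wall clock as userspace sees it
    for f in "$@"; do
        # Flatten the path into a file name: /proc/uptime -> proc_uptime
        name=$(echo "$f" | sed 's|^/||; s|/|_|g')
        if [ -r "$f" ]; then
            cat "$f" > "$out/$name" 2>/dev/null
        else
            echo "unreadable: $f" >> "$out/missing.txt"
        fi
    done
    echo "$out"
)
```

Since non-interactive commands over ssh reportedly kept working during at least one of the freezes described in this thread, calling `collect_snapshot /root/freeze-1` that way, a minute or so apart into two directories, might show which counters have stopped advancing.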

With great speculation: this may be load triggered in some way, either as a cumulative load in some kernel variable that isn't getting reset properly, or as an instantaneous load that causes some virtual interrupt to get missed or something like that. Alternatively, it may be host-triggered.

However, the fact that Obs suggests that it happens rather consistently on a 12-24 hour time span is really interesting. Obs, does this mean you're still manually rebooting it every 12-24 hours at this point? When you made a clone to test with the new 2.6.34-linode26 kernel for two days, was that clone taking any of your client load, or was it unloaded?

Since mine only occurs on the time period of weeks on the 2.6.33-linode24 kernel, it's just about impossible for me to do any testing. But if I had a setup that I knew would lock up within a period of minutes or hours, then I think we could really get to the bottom of this.

I currently run the ubuntu 2.6.31-307-ec2 kernel using pv_grub, so no I don't reboot every 12-24 hours (I'd have no business if I did!)

The clone was under load, I asked my users to use the server as normal and they did.

One thing I can say with reasonable confidence is that it's not related to network load: my backups are sent to s3 storage every night around the same time and it never crashed during that. It really was quite random.

I can tell you what runs on the box.

Nginx takes the brunt of the web serving (static files) and passes PHP back to Apache; MySQL is running as the database, and there's also an nsd DNS server. The usual system utilities (logrotate, munin, etc.) are running.

Now, if memory serves, the last lockup was around 4:37pm. With it being such an odd time, no cron jobs or backups were running. I checked my nginx, mysql, munin, etc. logs from when it happened and couldn't spot anything unusual.

My linode froze up sometime in the middle of last night running the 2.6.34 kernel after 6 days of up time. It's the first time I've ever had the problem though, the previous kernels have all been flawless for me.

I'll add a few additional notes about my setup on the off chance it helps get to the bottom of this as there's nothing obvious in my logs to say what happened:

  • Webserver still serves pages for a short period. SSH lets me connect but doesn't allow me to issue commands.

  • OS is Ubuntu 10.04 32-bit running the 2.6.34 kernel

  • Linode is a 512 (formerly a 360)

  • Load on the server is very low

  • NTPD didn't try to correct the time as far as I can tell

  • The linode is located in newark

Hi Dru, welcome to the unlucky club.

Did you have any cron jobs running? Can you provide a time stamp from when it was happening, maybe the linode guys can check the host for any weirdness at that time?

I did have a cron job running, it's the last thing logged in fact. It's not a particularly complex job, all it does is ping a website every 30 minutes. The last time it ran was at 1:30am UTC, if the linode guys want to check the host.

Does switching clocksources have any effect?

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

-Chris
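Before writing to that file it may be worth checking which clocksources the kernel actually offers, since the name has to appear in `available_clocksource` in the same sysfs directory. A hedged sketch follows; `show_clocksource` and `set_clocksource` are hypothetical helper names, and the optional directory argument is there only so the functions can be tried against a scratch directory.

```shell
# Default sysfs directory for the first clocksource device.
CS_DIR=/sys/devices/system/clocksource/clocksource0

# Print the active clocksource and the ones the kernel offers.
# Usage: show_clocksource [DIR]
show_clocksource() (
    dir=${1:-$CS_DIR}
    echo "current:   $(cat "$dir/current_clocksource")"
    echo "available: $(cat "$dir/available_clocksource")"
)

# Switch clocksource, but only if the kernel lists it as available.
# Needs root when pointed at the real sysfs directory.
# Usage: set_clocksource NAME [DIR]
set_clocksource() (
    want=$1
    dir=${2:-$CS_DIR}
    for cs in $(cat "$dir/available_clocksource"); do
        if [ "$cs" = "$want" ]; then
            echo "$want" > "$dir/current_clocksource"
            return 0
        fi
    done
    echo "clocksource '$want' not offered by this kernel" >&2
    return 1
)
```

As root, `show_clocksource` followed by `set_clocksource tsc` would then do the same thing as the echo above, with a clearer error if tsc is not on offer. The change does not survive a reboot, so it would need to be reapplied at boot (for example from a startup script).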

I'll knock up a node with a kernel that I have issues with and get back to you, I should have something in a few days.

Hi Chris,

I am at 11 days, 12 hours uptime (and still have not had any issue with 2.6.34-linode26).

I'm guessing that you have a good reason to believe that tsc is a winner, so I have now switched to the tsc clocksource (previously was xen) without rebooting. Ntpd appears to be maintaining time fine after the switch.

(Of course, I expect the more useful testing results to come from the other users who have had this issue with greater frequency – looking forward to seeing your test results!)

Mike

Ok so I created a new node at Jun 29 18:08:06 UTC and at Jun 30 18:47:24 UTC the clock froze, this is without the changed clocksource.

I've changed the clocksource and will report back if it crashes.

Please note that for now I've reverted the "Latest Paravirt" entries to point back to 2.6.32. You'll need to select 2.6.34 specifically.

-Chris

I'm testing on 2.6.33 since that's the one I know I have issues on; I have "2.6.33-linode24" selected specifically.

07:13:33 up 4 days

That's with the change of clocksource, seems to be doing better. Anyone else had success/problems with tsc?

2.6.34-linode26, now up 17 days. (I had switched to the TSC clocksource on day 11 as per my earlier post in this thread.) No issues beyond my lighttpd process dying this morning, but I think that's entirely unrelated, and was solved without a reboot.

Good to hear that switching to tsc might be a solution!

It's still up

21:29:09 up 5 days, 1:37, 1 user, load average: 0.08, 0.02, 0.01

I've never had uptime that long on that kernel before so I'd say tsc seems to fix it.

Now it's up to the linode guys to figure out why!

Btw, I'm curious… is the tsc clocksource based on the RDTSC CPU instruction?

As what will hopefully be a conclusion to this thread:

2.6.34-linode26, now up 32 days. (I had switched to the TSC clocksource on day 11 as per my earlier post in this thread.)

Summary for those just joining this thread:

  • use 2.6.34-linode26

  • set clocksource to tsc

seems to avoid this issue!

@compumike:

> As what will hopefully be a conclusion to this thread:
>
> 2.6.34-linode26, now up 32 days. (I had switched to the TSC clocksource on day 11 as per my earlier post in this thread.)
>
> Summary for those just joining this thread:
>
>   • use 2.6.34-linode26
>
>   • set clocksource to tsc
>
> seems to avoid this issue!

2.6.34-linode27 solves another related issue, so it might be even more stable for you.

The correct fix is to run our "Latest 2.6 Paravirt" kernel - as you'll always get the latest and greatest working kernel.

Hard coding a specific kernel version in your config profile is just asking for trouble down the road.

-Chris

Hi guys.

Is there any news on when one can expect the "Latest 2.6 Paravirt" entry to point to a recent 32-bit kernel? I've seen that you're already up to 2.6.35 for the x86_64 kernels. One _could_ run this on a 32-bit system as well if carefully deployed, but I'd rather avoid that.

Any update would be great.

Thanks a lot,

matthew

2.6.32 is a reasonably recent kernel. Is there something you need from a newer kernel?

Hi.

My main and only concern is security. There is unfortunately too little official documentation/feedback from Linode about how they maintain their kernels with respect to security leaks.

I could naturally fight myself through countless git backlogs, security reports and so on to see if their 2.6.32.16 based kernel is reasonably safe or if there are open issues. But I honestly lack the time and I don't think that would be my job to do as I am not the maintainer of those kernels.

Don't get me wrong, I love Linode and I think their service is simply fabulous. It's just that I would like to get more insight into their kernel maintenance and see where we stand with respect to security. That's all.

So long,

matthew

Well, it doesn't look like the linode kernel has had any updates whatsoever in the better part of a year. Surely there have been vulnerabilities in the past 8 months…
