Rebooted and Now Can't Even Connect

I logged on today to find my sites were down, and a stream of emails about my Linode exceeding its IO threshold (badly).

You can see the dashboard graphs here; something clearly went wrong overnight: http://img12.imageshack.us/img12/9017/graphsu.png

The first thing I did was reboot the Linode, and that seems to have been completely the wrong move. Now I can't even connect to it by SSH anymore :( (Though the Linode Manager does say it's running.)

Does anyone have any ideas what I should do? I'm really at a loss now :(

12 Replies

Have you tried logging in via Lish? Check the "Console" tab in the Linode Manager. From there, you should be able to find out why SSH isn't running.

Thanks ayman. Logging in via Lish and restarting SSH that way has let me connect again via SSH itself at least, thanks! :)

Restarting Apache now fails with the error below. I'm reading more about it online:

> (30)Read-only file system: apache2: could not open error log file /var/log/apache2/error.log.

The few results I've found seem to say it's something to do with filesystem errors, but I can't work out what could have changed to make this suddenly appear.

The "dmesg" command will show you the output from the kernel's log; that might help. Alternatively, detaching from the console (ctrl-A then d) and using the "logview" command at the lish prompt will show you the last bit of the previous run's log and anything that's happened on the console this time around.

But indeed, it does sound filesystem-related.

Hi hoopycat, thanks for that! dmesg sounds like a really useful command, I'll remember that for the future! :)

I may have things sorted temporarily.

The problem was that Ubuntu had remounted the filesystem read-only. The most common reason given for that seems to be that it detected a disk error. Given the crazy stats in the graphs I posted, though, I'm guessing something on my server caused it :(

Running the "fsck" command was enough to fix it though.
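For anyone hitting the same "Read-only file system" error, here is a minimal sketch of how you might check which filesystems are currently mounted read-only. It assumes the standard Linux `/proc/mounts` layout (device, mount point, type, options); the function name `read_only_mounts` and the sample text are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose option list contains 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mounts line
        device, mount_point, _fstype, options = fields[:4]
        # Split on commas so 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result


# Hypothetical sample in /proc/mounts format, not taken from the real server.
SAMPLE_MOUNTS = """\
/dev/xvda / ext3 ro,relatime,errors=remount-ro 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
"""

for device, mount_point in read_only_mounts(SAMPLE_MOUNTS):
    print(f"{device} on {mount_point} is mounted read-only")
```

On a real system you would pass in the contents of `/proc/mounts` instead of the sample. Some pseudo-filesystems are normally read-only, so only a read-only root or data filesystem is a warning sign.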

With that said, it's only been fixed for about half an hour now. Will be watching those graphs to see if the issue comes back.

Does anyone know how I could find out what is causing the massive load? (if it does come back)

What could have happened is your machine went into swap hell.

Basically, if you run out of memory then your system starts to swap. This is normal. But if you're REALLY short of memory then it can swap a lot. So much that the system spends nearly all of its time swapping pages in and out. Responsiveness drops to almost zero, I/O activity goes through the roof… it almost looks like the machine has crashed.
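As a rough illustration of how to see how deep into swap a machine is, here is a small sketch that parses `/proc/meminfo`-style text and reports the fraction of swap in use. The `SwapTotal` and `SwapFree` fields are standard; the function names and the sample numbers are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field_name: kilobytes} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_fraction(info):
    """Return the fraction of swap in use (0.0 when no swap is configured)."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


# Made-up numbers for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:  368640 kB
MemFree:     4096 kB
SwapTotal: 262144 kB
SwapFree:   65536 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print(f"swap in use: {swap_used_fraction(info):.0%}")
```

A high figure here does not prove thrashing by itself; the swap-in and swap-out rates (the `si` and `so` columns of `vmstat`) are the better indicator of swap hell.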

Now if, at this point, you told the control panel to reboot your machine, it would attempt a graceful shutdown. BUT if your Linode was in swap hell then it might not have been able to complete it, so the control panel may have switched to a more aggressive reboot method and effectively pressed the "Reset" button.

This is an unclean shutdown and can leave filesystems needing an fsck afterwards; the machine then doesn't fully boot, and the only way to access it is via Lish.

If this is what happened then you need to look into why your machine started taking up so much memory. Are you running MySQL or similar? If so check the dozens of threads here on how to ensure MySQL never explodes like this. Similarly there are threads on how to tune Apache.

Basically, you just need to tune all your application processes so they can live happily in memory and not cause swap hell.
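To find out which processes are actually eating the memory, one approach is to total resident memory per command name. The sketch below assumes input in the format produced by `ps -eo rss,comm` (RSS in kilobytes, then the command name); the function name and the sample figures are hypothetical.

```python
from collections import defaultdict


def rss_by_command(ps_output):
    """Sum RSS (kB) per command name, largest first."""
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header and blank lines
        totals[parts[1].strip()] += int(parts[0])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample in the format of `ps -eo rss,comm`.
SAMPLE_PS = """\
  RSS COMMAND
 1200 sshd
45000 mysqld
22000 apache2
21000 apache2
"""

for command, rss_kb in rss_by_command(SAMPLE_PS):
    print(f"{command:10s} {rss_kb / 1024:6.1f} MB")
```

Summed RSS overcounts memory shared between processes (Apache children share a lot), so treat the totals as an upper bound rather than an exact footprint.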

Thanks Sweh! The issue you described there does line up with the symptoms my server had. When I came on this morning, the server hadn't fully crashed, it was just slow to a point of being useless. The restart was what caused the complete crash.

I'll take a look at the things you mentioned.

The strange part is that my sites are only running fairly standard scripts: WordPress, phpBB, and Coppermine Photo Gallery. I'll take a look at them all (and any mods/plugins especially) like you said though; hopefully I'll be able to avoid a repeat!

Thanks again for your detailed reply, really helps to get an understanding of what happened!

WordPress and phpBB are typical culprits, especially if you're using a MySQL backend and have done no tuning. Many of these programs assume a full-sized server, and if running on a smaller Linode (e.g. a Linode 360) they can quickly use up all resources.

You might want to look at http://library.linode.com/databases/mysql/install-mysql-centos-5 for some MySQL tuning hints.
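On the Apache side, a common rule of thumb (not something taken from the guide linked above) is to cap the number of worker processes so that they all fit in whatever RAM is left after MySQL and the OS have had their share. A back-of-the-envelope sketch follows; the 160 MB reserve and the 25 MB per-process figure are assumptions for illustration, so measure your own values before changing `MaxClients`.

```python
def estimate_max_clients(total_mb, reserved_mb, per_process_mb):
    """Estimate how many Apache processes fit in the RAM left after the reserve.

    total_mb:       total RAM on the machine
    reserved_mb:    RAM set aside for MySQL, the OS and everything else
    per_process_mb: typical resident size of one Apache process
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    available_mb = total_mb - reserved_mb
    # Never return less than 1, or Apache could not serve anything at all.
    return max(1, int(available_mb // per_process_mb))


# Illustrative numbers only: a 360 MB machine, 160 MB reserved,
# and Apache processes of roughly 25 MB each.
print(estimate_max_clients(360, 160, 25))
```

With these assumed numbers the estimate is 8 processes, which would be far below what a stock prefork configuration allows, and that is exactly how a small machine ends up in swap.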

Thanks Sweh, read that now. I'll have a look at adjusting my MySQL settings now.

The sites have hit the exact same issue again, but I haven't restarted the server this time.

What's the best way to handle this for now?

I have 2 active sites on this. If I disable one (just using a2dissite), will that be enough to stop it from causing any more trouble if that site is the culprit? Or would I need a different way of detecting it?

Update, looks like you were right about MySQL being the issue! Lish shows this as soon as I load it:

fsck from util-linux-ng 2.16
/dev/xvda: clean, 62959/770048 files, 2408634/3072000 blocks
FATAL: Module nf_conntrack_ftp not found.
FATAL: Module nf_nat_ftp not found.
FATAL: Module nf_conntrack_irc not found.
FATAL: Module nf_nat_irc not found.
 * Setting preliminary keymap...
 * Setting up console font and keymap...
 * Stopping NTP server ntpd
 * Starting OpenBSD Secure Shell server sshd
 * Starting NTP server ntpd
 * Starting MySQL database server mysqld ...done.
 * Checking for corrupt, not cleanly closed and upgrade needing tables.
 * Starting Postfix Mail Transport Agent postfix
 * Starting NTP server ntpd
 * Starting web server apache2

Ubuntu 9.10 merlin hvc0

merlin login: Out of memory: kill process 2338 (mysqld_safe) score 224717 or a child
Killed process 2446 (mysqld)
Out of memory: kill process 2873 (apache2) score 197820 or a child
Killed process 6182 (apache2)
Out of memory: kill process 2873 (apache2) score 197787 or a child
Killed process 6464 (apache2)
Out of memory: kill process 2873 (apache2) score 93674 or a child
Killed process 6492 (apache2)
Out of memory: kill process 2338 (mysqld_safe) score 110602 or a child
Killed process 7010 (mysqld)
Out of memory: kill process 2873 (apache2) score 98892 or a child
Killed process 6649 (apache2)
Out of memory: kill process 2873 (apache2) score 98406 or a child
Killed process 8311 (apache2)

Time to upgrade, or tune your stack to use less memory. You don't have enough memory to sustain your applications, so Linux is ragekilling applications it determines to be a threat.

Option one is significantly easier than option two.

@Michael-Martin:

> Update, looks like you were right about MySQL being the issue! Lish shows this as soon as I load it:

Well, not necessarily. The Out of memory (OOM) killer doesn't necessarily kill the program that's exploding.

But given past experience in these forums, it probably is a combination of Apache instances (too many?) and MySQL going mad (it normally is :-))
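To see at a glance what the OOM killer has been hitting, you could tally the "Killed process" lines from the console log. This is only a sketch: the regular expression matches the message format shown in the log posted above, the function name is made up, and, as noted, the victim is not necessarily the process that caused the shortage.

```python
import re
from collections import Counter

# Matches lines such as: Killed process 2446 (mysqld)
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def tally_oom_kills(log_text):
    """Count OOM-killer victims by process name."""
    return Counter(name for _pid, name in KILLED.findall(log_text))


# The "Killed process" lines from the console log in this thread.
CONSOLE_LOG = """\
Killed process 2446 (mysqld)
Killed process 6182 (apache2)
Killed process 6464 (apache2)
Killed process 6492 (apache2)
Killed process 7010 (mysqld)
Killed process 6649 (apache2)
Killed process 8311 (apache2)
"""

for name, count in tally_oom_kills(CONSOLE_LOG).most_common():
    print(f"{name}: killed {count} time(s)")
```

For this log the tally is five apache2 kills and two mysqld kills, which fits the "too many Apache instances plus MySQL" theory without proving it.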

As for how to resolve the problem… depends on how urgent your needs are.

If you need "working web site now!" then upgrade to a bigger Linode (heck, even a 2880). Then work very hard on getting your footprint down to a reasonable size, then downgrade to the smallest Linode that meets your needs. Because of how Linode prorates usage, you'll get a credit back (not a refund) for the unused 2880 period, and this can be used to pay for the smaller Linode.

(Umm, I think I'm right; I'm sure Linode staff will correct me if I've misstated the billing/refund policies.)

If this is still in the "I don't care if it's down" stage, then work on reducing the footprint and accept the outages.

@sweh:

> Umm, I think I'm right

Yes.

Thanks for the replies guys. I've upgraded to the next Linode plan now, and still have one of the sites disabled. The amount of traffic to the site hasn't changed, so maybe this will be enough.

I'm looking into optimizing things now while it's back up. Do you know of any good resources (online, or even books!) I should start with?

Thanks again for the help! :)
