Newark Data Center Power Failure

For those of you wondering why you got unexpected "Host initiated restart" notices overnight, here's why:

http://status.linode.com/

One of our servers is still down. I can ping it but cannot connect to it via LISH, ssh, etc. Longview shows no recent activity, but the graphs in the Linode Manager show a little CPU activity (including a spike from a cron job).

Anyone else having issues?

15 Replies

Or check the charts on the dashboard; fsck should show up as some nontrivial amount of disk I/O

Update:

I can connect via LISH, but nothing else.

  • all normal services are running (including ssh, ftp, http, etc).

  • I've turned off iptables in case it was a firewall issue.

  • I've rebooted.

  • I've recreated /etc/resolv.conf and restarted the networking service.

Any ideas?

I've tried connecting to it from one of our test VPSs located in the same data center. The test VPS is running normally.

  • the test VPS cannot connect to the problem VPS via LISH, ssh, ftp, http, etc.

  • the problem VPS cannot connect to the test VPS via LISH, ssh, ftp, http, etc.

  • the problem VPS can ping domains not on the problem server and get the response.

I can use wget on the problem VPS to get webpages from sites located on the problem VPS, but not from any other server.
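For anyone repeating these checks, the pattern above (ping answers, but ssh, ftp, and http do not) can be confirmed from a second machine with a plain TCP connection test. This is only a minimal sketch: the helper names `tcp_reachable` and `check_services` are made up for illustration, and the port numbers are the usual defaults, which are an assumption about how the server is configured.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host, ports, timeout=3.0):
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: tcp_reachable(host, port, timeout) for name, port in ports.items()}


# Assumed default ports; adjust to whatever the server really listens on.
DEFAULT_PORTS = {"ssh": 22, "ftp": 21, "http": 80}
```

Calling `check_services("203.0.113.10", DEFAULT_PORTS)` (a placeholder documentation address, not the real server) returns a dict such as `{"ssh": False, ...}`. If every port fails while ping still succeeds, the problem is more likely in routing or filtering than in the services themselves.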

Support has suggested booting into 'Rescue Mode' and performing a filesystem check. I'm currently cloning the file system and will try rescue mode.

Hrm, could be a myriad of things. What are the contents of your network config files? What's the output of route -n?

obs,

I'll check 'route -n' once the fsck is done. The 'e2fsck -f' has been running for over an hour and it's still on 'Pass 1'. It's an 82 GB file system image.

I've never run into an fsck that has taken this long. Ugh.

That's not good. I've run fsck on 7 boxes in Newark in less time than that. What's on the box? Are there lots of files? Is the disk pretty full?

It's a production web server with a few dozen websites. I think the free space on the drive was about 30%.

I'd hate to lose the hour that it's been running. Is there any way to check if the VPS is still running in recovery mode without losing the progress of the fsck (if there has been any)?

If you're running via LISH and haven't started SSH in rescue mode then nope, you've only got one terminal session which you can access. You could try asking support if they can see what's going on.

The fsck finished. I had lost the LISH connection again (each connection has only been lasting a few minutes, and LISH doesn't always respond when I try to reconnect).

Here's the output of 'route -n':

[[email protected] ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         198.74.60.1     0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.213.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.212.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.210.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         50.116.48.1     0.0.0.0         UG    0      0        0 eth0
50.116.48.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
66.175.210.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
66.175.212.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
66.175.213.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
198.74.60.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

I'm not sure why 198.74.60.1 is in that list, though I assume it's our gateway at Linode (resolves to gw-li557.linode.com).

@hoopycat:

Or check the charts on the dashboard; fsck should show up as some nontrivial amount of disk I/O

I didn't see any activity on the graphs during the 1 1/2 hours the VPS was in rescue mode.

You should only have one entry starting with 0.0.0.0. What are the contents of your network config file? And what's the primary IP of the node (i.e. the one assigned to eth0)?
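As a rough illustration of that check, here is a minimal Python sketch that picks the default routes out of `route -n` output and shows when there is more than one. The function name `default_routes` is made up for this example, and the sample text is a shortened copy of the table posted above; it assumes the standard eight-column net-tools layout.

```python
def default_routes(route_output):
    """Return (gateway, iface) for every default route in `route -n` output."""
    routes = []
    for line in route_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns:
        # Destination Gateway Genmask Flags Metric Ref Use Iface
        if len(fields) != 8 or fields[0] == "Destination":
            continue  # skip the title line, the header row, and blank lines
        destination, gateway, genmask, flags = fields[:4]
        # A default route matches everything (0.0.0.0/0) and uses a gateway.
        if destination == "0.0.0.0" and genmask == "0.0.0.0" and "G" in flags:
            routes.append((gateway, fields[7]))
    return routes


# Shortened sample taken from the routing table posted in this thread.
SAMPLE = """Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         198.74.60.1     0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.213.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         50.116.48.1     0.0.0.0         UG    0      0        0 eth0
50.116.48.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
198.74.60.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
"""

if __name__ == "__main__":
    found = default_routes(SAMPLE)
    if len(found) > 1:
        print("More than one default route:")
    for gateway, iface in found:
        print(f"  via {gateway} on {iface}")
```

On a live box the same function could be fed the real command output, for example via `subprocess.run(["route", "-n"], capture_output=True, text=True).stdout`, assuming the net-tools `route` command is installed.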

obs,

The primary IP is 50.116.48.0. The 66.175.X.X IPs are additional IPs used for SSLs for ecommerce sites on that server.

Which config file(s) would you like to see?

James

I really suspect that the problem is in Linode's network somewhere. Perhaps a piece of network gear or a server that failed due to the power failure.

We've never had any problems with this server in the past and I haven't changed the configuration on this server for quite some time.

The network ones. I don't know what OS you're using, but if it's Ubuntu it'd be /etc/network/interfaces. I suspect you have multiple gateway lines when you should only have one; see here: https://library.linode.com/networking/configuring-static-ip-interfaces#sph_debian-ubuntu

The routing table should look something like this

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         50.116.33.1     0.0.0.0         UG    100    0        0 eth0
50.116.33.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
50.116.37.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
50.116.38.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
50.116.39.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
173.230.133.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.128.0   0.0.0.0         255.255.128.0   U     0      0        0 eth0
198.74.52.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

This is from a box with multiple SSL certs on it.

To All,

Thank you for your help. This issue has been resolved by the Linode Support staff (who have graciously put up with my pestering nature while dealing with the aftermath of last night's power outage). Support has resolved a configuration issue on their end and now everything is responding correctly.

obs,

This VPS is running CentOS. I am in the planning stages of moving all the sites to Ubuntu LTS servers so I'm not going to try to figure out why my routing table seems to be a bit funky (though it may turn out to be a rabbit I chase anyway). I'm going to wait and see the reviews of 14.04 LTS before deciding whether to go with 14.04 or if I should stick with 12.04 (which I have on other VPSs).

Thanks again,

James

Glad you solved it. For Ubuntu 14.04, the general rule is to wait until the first point release (14.04.1) before deploying critical stuff on it. 12.04 is a good OS; I rarely have problems with it.
