Newark Data Center Power Failure

forum:Main Street James 10 years, 4 months ago

For those of you wondering why you got unexpected "Host initiated restart" notices overnight, here's why:

http://status.linode.com/

One of our servers is still down. I can ping it but cannot connect to it via LISH, SSH, etc. Longview has no recent activity, but the graphs in the Linode Manager show a little CPU activity (including a spike from a cron job).

Anyone else having issues?

15 Replies

forum:Main Street James 10 years, 4 months ago

Update:

I can connect via LISH, but nothing else.

All normal services are running (including ssh, ftp, http, etc).
I've turned off iptables in case it was a firewall issue.
I've rebooted.
I've recreated /etc/resolv.conf and restarted the networking service.

Any ideas?

I've tried connecting to it from one of our test VPSs located in the same data center. The test VPS is running normally.

The test VPS cannot connect to the problem VPS via LISH, ssh, ftp, http, etc.
The problem VPS cannot connect to the test VPS via LISH, ssh, ftp, http, etc.
The problem VPS can ping domains not hosted on the problem server and get a response.

I can use wget on the problem VPS to get webpages from sites located on the problem VPS, but not from any other server.

Support has suggested booting into 'Rescue Mode' and performing a filesystem check. I'm currently cloning the file system and will try rescue mode.
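For anyone repeating the connectivity checks above, here is a minimal Python sketch that probes the same TCP services (ftp, ssh, http) from another machine. The function names and the default port list are illustrative assumptions, not anything Linode provides. If ping works but every port times out, that hints at a routing or upstream problem rather than a single stopped service.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False

def probe(host, ports=(21, 22, 80)):
    """Check each port in turn and return {port: reachable}."""
    return {port: port_open(host, port) for port in ports}
```

Calling `probe()` with the problem server's address from the test VPS, and again in the other direction, gives a quick per-port picture of what is reachable.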

forum:obs 10 years, 4 months ago

Hrm, could be a myriad of things. What are the contents of your network config files? What's the output of route -n?

forum:Main Street James 10 years, 4 months ago

obs,

I'll check 'route -n' once the fsck is done. The 'e2fsck -f' has been running for over an hour and it's still on 'Pass 1'. It's an 82 GB file system image.

I've never run into an fsck that has taken this long. Ugh.

forum:obs 10 years, 4 months ago

That's not good. I've run fsck on 7 boxes in Newark in less time than that. What's on the box? Are there lots of files? Is the disk pretty full?

forum:Main Street James 10 years, 4 months ago

It's a production web server with a few dozen websites. I think the free space on the drive was about 30%.

I'd hate to lose the hour that it's been running. Is there any way to check if the VPS is still running in recovery mode without losing the progress of the fsck (if there has been any)?

forum:obs 10 years, 4 months ago

If you're running via LISH and haven't started SSH in rescue mode then nope, you've only got one terminal session which you can access. You could try asking support if they can see what's going on.

forum:hoopycat 10 years, 4 months ago

Or check the charts on the dashboard; fsck should show up as some nontrivial amount of disk I/O

forum:Main Street James 10 years, 4 months ago

The fsck finished. I had lost the LISH connection again (it's only been lasting a few minutes at a time, and it doesn't always respond when I try to reconnect).

Here's the output of 'route -n':

[root@www ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         198.74.60.1     0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.213.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.212.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.210.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         50.116.48.1     0.0.0.0         UG    0      0        0 eth0
50.116.48.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
66.175.210.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
66.175.212.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
66.175.213.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
198.74.60.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

I'm not sure why 198.74.60.1 is in that list, though I assume it's our gateway at Linode (resolves to gw-li557.linode.com).

forum:Main Street James 10 years, 4 months ago

@hoopycat:

> Or check the charts on the dashboard; fsck should show up as some nontrivial amount of disk I/O

I was in rescue mode and I didn't see any activity on the graphs during the 1 1/2 hours it was in rescue mode.

forum:obs 10 years, 4 months ago

@Main Street James:

> The fsck finished. I had lost the LISH connection again (it's been only lasting a few minutes at a time but doesn't always respond when trying to reconnect).
>
> Here's the output of 'route -n':
>
> [root@www ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 0.0.0.0         198.74.60.1     0.0.0.0         UG    0      0        0 eth0
> 0.0.0.0         66.175.213.1    0.0.0.0         UG    0      0        0 eth0
> 0.0.0.0         66.175.212.1    0.0.0.0         UG    0      0        0 eth0
> 0.0.0.0         66.175.210.1    0.0.0.0         UG    0      0        0 eth0
> 0.0.0.0         50.116.48.1     0.0.0.0         UG    0      0        0 eth0
> 50.116.48.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
> 66.175.210.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
> 66.175.212.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
> 66.175.213.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
> 169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
> 198.74.60.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
>
> I'm not sure why 198.74.60.1 is in that list, though I assume it's our gateway at Linode (resolves to gw-li557.linode.com).

You should only have one entry starting with 0.0.0.0 (that is, one default route). What are the contents of your network config file? And what's the primary IP of the node (i.e. the one assigned to eth0)?
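As a rough illustration of that check, here is a small Python sketch that counts the default routes in `route -n` output. The function name is made up for this example, and the sample text is an excerpt of the table posted earlier in the thread. It assumes the classic eight-column net-tools layout.

```python
def default_gateways(route_output):
    """Return the gateway of every default route (destination 0.0.0.0, G flag set)."""
    gateways = []
    for line in route_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns:
        # Destination Gateway Genmask Flags Metric Ref Use Iface
        if len(fields) != 8 or fields[0] != "0.0.0.0":
            continue
        if "G" in fields[3]:
            gateways.append(fields[1])
    return gateways

# Excerpt of the routing table posted in this thread.
SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         198.74.60.1     0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.213.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.212.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         66.175.210.1    0.0.0.0         UG    0      0        0 eth0
0.0.0.0         50.116.48.1     0.0.0.0         UG    0      0        0 eth0
50.116.48.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
198.74.60.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
"""

gateways = default_gateways(SAMPLE)
if len(gateways) > 1:
    print("Multiple default routes:", ", ".join(gateways))
```

More than one gateway in the result means the kernel has several competing default routes, which is the situation described above.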

forum:Main Street James 10 years, 4 months ago

obs,

The primary IP is 50.116.48.0. The 66.175.X.X IPs are additional IPs used for SSLs for ecommerce sites on that server.

Which config file(s) would you like to see?

James

forum:Main Street James 10 years, 4 months ago

I really suspect that the problem is in Linode's network somewhere. Perhaps a piece of network gear or a server that failed due to the power failure.

We've never had any problems with this server in the past and I haven't changed the configuration on this server for quite some time.

forum:obs 10 years, 4 months ago

The network ones. I don't know what OS you're using, but if it's Ubuntu it'd be /etc/network/interfaces. I suspect you have multiple gateway lines when you should only have one; see here: https://library.linode.com/networking/configuring-static-ip-interfaces#sph_debian-ubuntu

The routing table should look something like this

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         50.116.33.1     0.0.0.0         UG    100    0        0 eth0
50.116.33.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
50.116.37.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
50.116.38.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
50.116.39.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
173.230.133.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.128.0   0.0.0.0         255.255.128.0   U     0      0        0 eth0
198.74.52.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0

This is from a box with multiple SSL certs on it.
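To make the "multiple gateway lines" suspicion concrete, here is a hedged Python sketch that lists every `gateway` option in Debian/Ubuntu-style /etc/network/interfaces text. The helper name is invented, and the sample file is hypothetical: it uses documentation-range addresses, not the poster's real config. It only applies to the Debian/Ubuntu file format mentioned above.

```python
def gateway_lines(interfaces_text):
    """Return (interface, family, gateway) for each 'gateway' option found."""
    found = []
    iface, family = None, None
    for raw in interfaces_text.splitlines():
        fields = raw.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if fields[0] == "iface" and len(fields) >= 3:
            # e.g. "iface eth0 inet static" -> name "eth0", family "inet"
            iface, family = fields[1], fields[2]
        elif fields[0] == "gateway" and len(fields) >= 2:
            found.append((iface, family, fields[1]))
    return found

# Hypothetical config with the suspected mistake: a gateway on each address.
SAMPLE_INTERFACES = """\
auto eth0
iface eth0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    gateway 192.0.2.1

auto eth0:0
iface eth0:0 inet static
    address 198.51.100.20
    netmask 255.255.255.0
    gateway 198.51.100.1
"""

ipv4 = [g for g in gateway_lines(SAMPLE_INTERFACES) if g[1] == "inet"]
if len(ipv4) > 1:
    print("More than one IPv4 gateway configured:", ipv4)
```

The usual fix is to keep the `gateway` line only on the primary address and drop it from the additional ones.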

forum:Main Street James 10 years, 4 months ago

To All,

Thank you for your help. This issue has been resolved by the Linode Support staff (who have graciously put up with my pestering nature while dealing with the aftermath of last night's power outage). Support has resolved a configuration issue on their end and now everything is responding correctly.

obs,

This VPS is running CentOS. I am in the planning stages of moving all the sites to Ubuntu LTS servers so I'm not going to try to figure out why my routing table seems to be a bit funky (though it may turn out to be a rabbit I chase anyway). I'm going to wait and see the reviews of 14.04 LTS before deciding whether to go with 14.04 or if I should stick with 12.04 (which I have on other VPSs).

Thanks again,

James

forum:obs 10 years, 4 months ago

Glad you solved it. For Ubuntu 14.04, the general rule is to wait until the first point release (14.04.1) before deploying critical stuff on it. 12.04 is a good OS; I rarely have problems with it.
