Two of my nodes running heartbeat suddenly became unable to talk to each other, and when they tried to send email alerts to tell me about it, their DNS lookups failed.
So I have 5 servers reporting multiple local network failures at exactly the same time (2012/06/25 10:14 UTC), which suggests it's not individual nodes that are failing but the network they have in common.
Any idea what's up?
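The reasoning above (many nodes failing in the same minute points at shared infrastructure rather than at any one node) can be sketched roughly as below. This is only an illustration, not anything the monitoring tools do on their own: the node names, the one-minute window, and the three-node threshold are all assumptions, and timestamps are assumed to be naive datetimes already in UTC.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # naive datetimes are treated as UTC (assumption)


def correlated_failures(reports, window=timedelta(minutes=1), min_nodes=3):
    """Group (node, timestamp) failure reports into fixed time windows.

    Returns a list of (window_start, sorted_node_names) for every window in
    which at least `min_nodes` distinct nodes reported a failure. Such a
    window hints at a shared cause (e.g. the common network) rather than
    independent node faults.
    """
    step = window.total_seconds()
    buckets = defaultdict(set)
    for node, ts in reports:
        bucket = int((ts - EPOCH).total_seconds() // step)
        buckets[bucket].add(node)

    hits = []
    for bucket in sorted(buckets):
        nodes = buckets[bucket]
        if len(nodes) >= min_nodes:
            start = EPOCH + timedelta(seconds=bucket * step)
            hits.append((start, sorted(nodes)))
    return hits


if __name__ == "__main__":
    # Hypothetical node names; only the 10:14 UTC timestamp comes from the post.
    sample = [
        ("node-a", datetime(2012, 6, 25, 10, 14, 5)),
        ("node-b", datetime(2012, 6, 25, 10, 14, 9)),
        ("node-c", datetime(2012, 6, 25, 10, 14, 30)),
        ("node-a", datetime(2012, 6, 24, 3, 2, 0)),  # isolated, unrelated blip
    ]
    for start, nodes in correlated_failures(sample):
        print(start.isoformat(), nodes)
```

A fixed window is the simplest choice; it can split a burst that straddles a minute boundary, so a sliding window would be more robust if that matters.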
I opened a support ticket; hopefully the issue will be resolved soon.
This whole 250k giveaway bullshit is probably the cause; they probably overloaded their network with the new servers they added. I am extremely disappointed.
Issues can happen anywhere and at any time. If you need to be up 24/7, you need many servers in many different physical locations.
The network is not oversold.
We've had repeated problems with newark490, newark491, and newark492 the past few days. Those three machines are being evacuated to known good hardware as I type this.
The network issues affecting some Newark segments last night were unrelated and were caused by a briefly misconfigured device, which caused some residual trouble.
Rest assured we're working on taking care of everything, as we always do – but problems do happen and we do our best to correct them and prevent them from recurring.
Thanks for letting us know, caker.