| Author |
Message |
caker Linode.com Staff

Joined: 15 Apr 2003 Posts: 2715 Location: Galloway, NJ
|
Posted: Tue Jul 21, 2009 7:18 pm Post subject: Migration: dallas165 |
|
|
We're investigating a problem with dallas165. Updates in a bit.
-Chris |
|
|
 |
jed Linode.com Staff

Joined: 28 Mar 2009 Posts: 357 Location: New Jersey
|
Posted: Tue Jul 21, 2009 8:42 pm Post subject: |
|
|
Early this afternoon, one hard drive in dallas165's RAID 10 array failed. Calls were made, tickets filed, and a plan of action put into place. Customers would never have been any the wiser had no other drives failed; however, at around 8 PM EDT, another drive did.
Not even RAID can prepare for double drive failure. Two drives failing within six hours of each other is unprecedented, and quite unlucky. This is an extremely rare situation for Linode, and one we regret immensely. After extensive triage and troubleshooting, we have determined that all customer data on dallas165 has been lost.
However, hardware does fail; this sort of situation is mostly outside of our control. Let me be the first, on behalf of Linode, to apologize if you are affected by this host failure.
Customers on this host have been moved to dallas166, and tickets opened to discuss specifics relating to their account. If you have any questions whatsoever, please don't hesitate to open a ticket or e-mail us. Check your e-mail for a ticket from us, if you were on dallas165.
Once again, I apologize.
_________________
Jed Smith
Developer & Systems Administrator
Linode.com |
|
|
 |
Xan Senior Member

Joined: 08 Feb 2004 Posts: 552 Location: Austin
|
Posted: Tue Jul 21, 2009 8:56 pm Post subject: |
|
|
Ouch... My condolences to people affected. Once again, the lesson is: backups! Any valuable data must be backed up, whether it's on Linode or anywhere. Events like this are good to remind us of that, because it's a matter of when, not if, data loss hits us all.
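Tangentially, some RAID levels (RAID 6, for example) can certainly protect against any double drive failure, but RAID 10 can't guarantee it: it survives two failures only if they land in different mirror pairs. In any case, RAID is not a backup.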
[edited to fix error pointed out by hybinet]
Last edited by Xan on Thu Jul 23, 2009 2:52 am; edited 1 time in total |
|
|
 |
freedom_is_chaos Senior Member
Joined: 12 Sep 2008 Posts: 166
|
Posted: Wed Jul 22, 2009 12:00 am Post subject: |
|
|
| jed wrote: | | Not even RAID can prepare for double drive failure. Two drives failing within six hours of each other is unprecedented, and quite unlucky. |
Someone should tell that to our SAN system: we had 17 drives out of our 32-drive array fail semi-simultaneously.
It is unfortunate that all customer data is lost, and unfortunate that Linode Backup isn't completely up and running yet either. But we knew what un-managed meant.
_________________
If it ain't broke, you didn't tweak it enough. If it is broke, use more duct tape.
http://independentchaos.com |
|
|
 |
essentialdots
Joined: 22 Jul 2009 Posts: 1
|
Posted: Wed Jul 22, 2009 11:04 am Post subject: |
|
|
We had our Linode hosted on dallas165.
It was a complete shock when I logged in and saw no disk images in the dashboard. Talk about scary... I hope that none of you will ever see something like that.
We had a backup of everything locally. However, we also had a bunch of things set up on our Linode (mail, web, svn, mysql...) with a lot of optimizations and tweaks (custom patched Apache...). So just restoring these would take days, I guess (even with the server log we manually update).
We had the "luck" to move to dallas165 at the beginning of July (this was our most important image, which was 2 years old). Tech support somehow managed to recover our two-week-old image file. This was very convenient, as in the end only emails and the two projects we are working on right now had to be recovered in addition to the old image file. Even so, just those took us a full working day (8 hours) to restore. All of our live production projects were up and running within an hour (so we were "offline" for a very short period of time in the Central European time zone).
I now feel as lucky as I was desperate this morning. Finally, we have put this behind us.
The moral of the story: when you do backups, don't think just about backing up. The reverse process and its speed are equally important.
I hope that Linode backup will be available soon.
My condolences to the rest of our "cohabitants" on dallas165.
_________________
Nikola Stojiljkovic
Essential Dots |
|
|
 |
Guspaz Senior Member

Joined: 26 May 2009 Posts: 357
|
Posted: Wed Jul 22, 2009 12:55 pm Post subject: |
|
|
There seem to have been a few double-drive failures of late (well, I think this is only the second recently).
Nevertheless, considering the huge impact when they do occur (and we've seen that they do occur on occasion), has Linode considered switching from RAID10 to RAID6 or switching to simple RAID1 with 3 larger drives instead of (presumably) the four smaller drives used in RAID10? Either of these solutions would provide for the ability to survive two-drive failures. |
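To illustrate the comparison above, here is a minimal Python sketch that brute-forces every possible two-drive failure for each layout. It assumes a four-drive RAID 10 built from two mirror pairs (the thread only presumes four drives), and it ignores rebuild windows, controller faults, and correlated failures, so treat it as a counting exercise rather than a reliability model. All function and variable names are made up for the example.

```python
from itertools import combinations


def survives_raid10(failed, pairs):
    # RAID 10 loses data only when every drive in some mirror pair has failed.
    return not any(set(pair) <= failed for pair in pairs)


def survives_raid6(failed):
    # RAID 6 keeps two parity blocks per stripe, so any two failures are tolerated.
    return len(failed) <= 2


def survives_mirror(failed, n_drives):
    # An n-way mirror survives as long as at least one copy remains.
    return len(failed) < n_drives


def two_failure_survival(n_drives, survives):
    # Enumerate every distinct pair of failed drives and count the survivable ones.
    cases = [set(c) for c in combinations(range(n_drives), 2)]
    ok = sum(1 for failed in cases if survives(failed))
    return ok, len(cases)


# Assumed layout: drives 0+1 form one mirror pair, drives 2+3 the other.
raid10_pairs = [(0, 1), (2, 3)]

results = {
    "RAID 10 (4 drives)": two_failure_survival(
        4, lambda f: survives_raid10(f, raid10_pairs)),
    "RAID 6 (4 drives)": two_failure_survival(4, survives_raid6),
    "RAID 1 (3-way mirror)": two_failure_survival(
        3, lambda f: survives_mirror(f, 3)),
}

for name, (ok, total) in results.items():
    print(f"{name}: survives {ok} of {total} two-drive failure combinations")
```

Under these assumptions the four-drive RAID 10 survives four of the six possible two-drive failures, while RAID 6 and the three-way mirror survive all of them.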
|
|
 |
hybinet Senior Member
Joined: 02 May 2008 Posts: 445
|
Posted: Wed Jul 22, 2009 1:13 pm Post subject: |
|
|
| jed wrote: | | Two drives failing within six hours of each other is unprecedented, |
AFAIK, drives purchased at the same time from the same vendor tend to do that. They most likely are from the same production line (same "batch"), which sort of explains why they might have similar potentials for premature failure.
| Xan wrote: | | it's a matter of if, not when, data loss hits us all. |
I think you got it backwards. It's a matter of when, not if, data loss will occur. |
|
|
 |
Xan Senior Member

Joined: 08 Feb 2004 Posts: 552 Location: Austin
|
Posted: Thu Jul 23, 2009 2:52 am Post subject: |
|
|
| pfft, what a doof, thanks! |
|
|
 |
smiffy Senior Member

Joined: 23 Jan 2007 Posts: 88 Location: Rural South Australia
|
Posted: Fri Jul 24, 2009 5:24 pm Post subject: |
|
|
| Once again, I find myself wishing that Linux kernel licensing were different so that we could have ZFS. That CAN cope with multiple disc failures. There's even a video of a guy taking a sledgehammer to a pair of discs in a hot system, plugging new discs in and watching it all rebuild without a hitch. |
|
|
 |
Guspaz Senior Member

Joined: 26 May 2009 Posts: 357
|
Posted: Mon Jul 27, 2009 9:09 am Post subject: |
|
|
| smiffy wrote: | | Once again, I find myself wishing that Linux kernel licensing were different so that we could have ZFS. That CAN cope with multiple disc failures. There's even a video of a guy taking a sledgehammer to a pair of discs in a hot system, plugging new discs in and watching it all rebuild without a hitch. |
ZFS on its own can't handle multiple disk failures; it has no inherent redundancy. That comes from the pool layout: RAID-Z2 can handle two disk failures (RAID-Z can handle one).
But RAID-5 can handle one disk failure, and RAID-6 can handle two.
ZFS/RAID-Z's advantages are not in the number of disk failures they can handle; they're in other things. |
|
|
 |
irgeek Linode.com Staff

Joined: 21 Jun 2003 Posts: 151 Location: Absecon, NJ
|
Posted: Mon Jul 27, 2009 12:03 pm Post subject: |
|
|
Most of you who were affected by the RAID crash have probably seen the ticket updates, but I just wanted to let everyone know that we finally managed to get the RAID to respond again this weekend. All customer data was copied off to a standby host and it's sitting there now in case anyone wants access to it.
If you haven't redeployed yet, we can put your Linode back the way it was. If you have redeployed and you'd just like access to the disks, let us know and we'll set it up for you.
-James |
|
|
 |
jsr Junior Member
Joined: 09 Dec 2008 Posts: 43 Location: Gilbert, AZ
|
Posted: Tue Jul 28, 2009 8:45 am Post subject: |
|
|
| I wasn't affected by this, but it is good to know that Linode kept on working on getting the data back and didn't just give up. Good job Linode. |
|
|
 |
freedom_is_chaos Senior Member
Joined: 12 Sep 2008 Posts: 166
|
Posted: Thu Jul 30, 2009 12:56 am Post subject: |
|
|
| jsr wrote: | | I wasn't affected by this, but it is good to know that Linode kept on working on getting the data back and didn't just give up. Good job Linode. |
This is why people who come to Linode stay with Linode.
_________________
If it ain't broke, you didn't tweak it enough. If it is broke, use more duct tape.
http://independentchaos.com |
|
|
 |
|