Computer hardware is pretty reliable these days. However, even with good procedures and hardware in place, there is still the possibility of data loss, as we found out on Monday night. Despite having a well-documented and tested workflow, RAID data redundancy, monitoring, top-notch personnel, and operating in a very conservative manner, we had a data loss event that impacted 6 of our virtual hosting customers.
This is, to the best of my recollection, the first major data loss event we've had related to our hosting since we began the hosting service over 11 years ago. I wanted to document what happened, both to provide information to the clients who were impacted and as a lesson for other readers.
Read on if you are interested in all the gory details. In big-enterprise circles, this is called a Service Outage Analysis (SOA).
First of all, let me state that if you are a customer and are worried this may have impacted your services: don't. The few customers this impacted have been contacted and we've heard back from them. There were no other issues related to this.
A hardware or software failure, almost certainly in the RAID controller or its Battery Backup Unit, caused corruption of one of the drives in a RAID-10 array. This, coupled with drive errors on the other drive in the mirror, resulted in data corruption.
This all started this weekend when a drive in one of the RAID arrays that holds virtual machines failed. We have a number of 4-drive RAID-10 arrays. Our monitoring detected the drive failure, and we scheduled a maintenance window for Monday night to replace it.
It's important for the discussion that follows to remember that a 4-drive RAID-10 array is 2 pairs of mirrored disks, striped together.
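To make that layout concrete, here is a minimal sketch (in Python, which is what we used for our recovery tooling) of how a logical byte offset maps onto such an array. The 64KB stripe size matches how our arrays were configured; the pair numbering is just for illustration:

```python
# Map a logical offset on a 4-drive RAID-10 array to its location.
# Drives 0+1 form one mirror pair, drives 2+3 the other; the two
# pairs are striped together. Stripe size is an assumption here
# (64 KB, matching our arrays' configuration).

STRIPE = 64 * 1024

def locate(logical_offset, stripe=STRIPE):
    stripe_no = logical_offset // stripe
    pair = stripe_no % 2             # which mirror pair holds this stripe
    pair_stripe = stripe_no // 2     # how many stripes into that pair
    drive_offset = pair_stripe * stripe + logical_offset % stripe
    return pair, drive_offset        # data exists on BOTH drives of the pair
```

Because each stripe lives on both drives of its mirror pair, losing one drive per pair is survivable; losing both drives of the same pair is not, which is the scenario we ended up in.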
On Monday during the window, Kyle visually inspected the drives in the array and confirmed that the drive we expected to have failed (the second from the right in the chassis) indeed showed no disk I/O, while the other drives did. He shut down the system and then replaced the failed drive.
Before we deployed these systems, we had developed and tested a workflow for replacing the drives. This was largely because the LSI RAID controller's management tool is not very intuitive, and we wanted documentation to prevent any problems in the (sadly, fairly complicated) replacement procedure.
However, after step 5 in the workflow, it became clear there were problems. Normally, the system would show that the array was degraded: one mirror set OK, one mirror set with only a single drive, and lastly an unconfigured drive.
In this case, the array instead showed one mirror set OK and TWO unconfigured drives.
This RAID controller, like most these days, stores the RAID configuration metadata on the hard drives as well as in the controller's own NVRAM. So if the drives are moved to another controller (say, because your controller fails), or you swap drives around, the controller can detect that and continue operating correctly.
In this case, somehow the drive holding the mirrored data for the one that failed did not have the RAID controller metadata on it.
Despite the controller thinking that the array was bad, the data should have all been there.
So, we decided to take the conservative path: pull the drives from the array, put in fresh drives, reconstruct the data from the drives that should have held the good copies, and stream it over to the newly constructed array. This had the benefit that the drives with the data on them would be treated as read-only.
So we set up a test array and laid out some test data on it. Because we had a workflow for the system setup, we knew exactly how the array had been configured when it was created. We used that information to verify the on-disk layout, and it was exactly as we expected: 64KB on the first drive, the next 64KB on the second drive (both at offset 0), then the next 64KB back on the first drive, and so on. We wrote a small Python program to reconstruct data laid out this way, and against the test array it produced exactly the right results.
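The core of that reconstruction amounts to something like the following sketch. The file names are placeholders; it assumes one raw image dumped from a good drive of each mirror set, with the stripes interleaved at the 64KB size described above:

```python
# Sketch of RAID-10 stripe reconstruction from two mirror-set images.
# Set A's image holds stripes 0, 2, 4, ... contiguously; set B's image
# holds stripes 1, 3, 5, ...; alternating sequential reads re-interleave
# them into the original logical volume. File names are hypothetical.

STRIPE = 64 * 1024  # stripe size, per the array's original configuration

def reconstruct(set_a_path, set_b_path, out_path):
    with open(set_a_path, "rb") as a, \
         open(set_b_path, "rb") as b, \
         open(out_path, "wb") as out:
        sources = (a, b)
        i = 0
        while True:
            chunk = sources[i % 2].read(STRIPE)
            if not chunk:          # either image exhausted: done
                break
            out.write(chunk)
            i += 1
```

The key property is that the source images are only ever read, never written, so a mistake in the script can't make the situation on the original drives any worse.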
So then we began laying down this data onto the new array. Unfortunately, the best throughput we could get was 15MB/sec (the bottleneck was writing to the new array), and the copy ended up taking 36 hours to complete.
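For scale, a quick back-of-envelope check of those numbers (the total array size isn't something I've stated here; this just shows what 36 hours at 15MB/sec works out to):

```python
# Back-of-envelope: how much data does 36 hours at 15 MB/sec move?
rate_mb_s = 15
hours = 36
total_mb = rate_mb_s * hours * 3600
total_tb = total_mb / 1_000_000
print(f"{total_mb:,} MB (~{total_tb:.1f} TB)")
```

In other words, roughly two terabytes copied at that rate, which is why the recovery window stretched across multiple days.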
At that point, I did a few last checks, and then booted the system.
The host runs VMware ESX, and VMware booted fine. However, once boot completed, it wasn't able to see any of the virtual machines.
It was close to working, but as one of my teachers used to like to say, "Close only counts in horseshoes and hand grenades." The corruption that caused the controller to not see the drive had obviously extended into the VMFS datastore where the virtual machines were housed as well.
At this point we decided to attempt to reconstruct the array from the original drives, using the RAID controller itself. This is the riskier approach, because if things aren't done exactly right, you risk the controller mirroring a bad drive over a good one.
LSI very helpfully provided an engineer to walk us through this step. The system booted, but behaved exactly as it had with the reconstructed array in place -- it couldn't see the virtual machines. So clearly that additional drive in the array was corrupted, but in such a way that the controller didn't report it to the monitoring system.
At this point I gave up the data recovery effort, and we concentrated on either recovering the virtual machines from backups, or setting up bare virtual machines for those that did not have backups (except the ones already set up for customers who had opted not to wait for the recovery).
A few curious things:
Unfortunately, we can't come up with any definitive reason why this second drive in the array was corrupted. While a RAID-10 array can survive 2 drives failing (as long as they are in different mirror pairs), this was the one drive of the 3 remaining whose loss the array could not survive: the mirror partner of the drive that had already failed.
What went right:
What we could have improved:
In conclusion: I feel terrible that data was lost "on my watch". But it happened despite all the measures we had in place to try to prevent it, so overall I feel our procedures were very good. We will definitely use this as a case to learn from. But there's a reason this hasn't happened in 11 years: we have pretty good procedures.