
Computer hardware is pretty reliable these days. However, even with good procedures and hardware in place, there is still the possibility of data loss, as we found out on Monday night. Despite having a well-documented and tested workflow, RAID data redundancy, monitoring, top-notch personnel, and operating in a very conservative manner, we had a data-loss event that impacted 6 of our virtual hosting customers.

This is, to the best of my recollection, the first major data-loss event we've had related to our hosting since we began the hosting service over 11 years ago. I wanted to document what happened, both to provide information to the clients that were impacted and as a lesson for other readers.

Read on if you are interested in all the gory details. In big-enterprise circles, this is called a Service Outage Analysis (SOA).

Notice

First of all, let me state that if you are a customer and are worried this may have impacted your services: don't. The few customers this impacted have been contacted and we've heard back from them. There were no other issues related to this.

Executive Summary

A hardware or software failure, almost certainly in the RAID controller or its Battery Backup Unit, corrupted one of the drives in a RAID-10 array. This, coupled with drive errors on the other drive in the same mirror pair, resulted in data loss.

Details of Outage

This all started over the weekend, when a drive failed in one of the RAID arrays that hold virtual machines. We have a number of 4-drive RAID-10 arrays. Our monitoring detected the drive failure, and we scheduled a maintenance window for Monday night to replace it.

It's important for the next discussion to remember that a 4-drive RAID-10 array is 2 pairs of mirrored disks that are then striped together.
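To make that concrete, here is a tiny illustrative sketch of how such a layout places data (the drive names are placeholders, not the actual slot numbering in our chassis):

    # Toy model of a 4-drive RAID-10: two mirror pairs, with data striped
    # across the pairs in fixed-size chunks. Drive names are placeholders.
    MIRROR_PAIRS = [("driveA", "driveB"), ("driveC", "driveD")]

    def drives_holding(chunk_index):
        """Return the mirror pair that holds a given stripe chunk;
        both drives in the pair store identical copies of it."""
        return MIRROR_PAIRS[chunk_index % len(MIRROR_PAIRS)]

    for i in range(4):
        print(i, drives_holding(i))
    # 0 ('driveA', 'driveB')
    # 1 ('driveC', 'driveD')
    # 2 ('driveA', 'driveB')
    # 3 ('driveC', 'driveD')

The point that matters below is that the array can lose one drive from each pair, but it cannot lose both drives of the same pair.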

On Monday, during the window, Kyle visually inspected the drives in the array and confirmed that the drive we expected to have failed (the second from the right in the chassis) indeed showed no disk I/O, while the other drives did. He shut down the system and then replaced the failed drive.

Before we deployed these systems, we had developed and tested a workflow for replacing the drives. This was largely because the LSI RAID controller's management tool is not very intuitive, and we wanted to make sure that we had documentation to prevent any problems in the (sadly, fairly complicated) replacement procedure.

However, after step 5 in the workflow, it became clear there were problems. Normally, the system would show that the array was degraded, with one mirror set OK, one mirror set having only one drive, and lastly an unconfigured drive (the replacement we had just inserted).

In this case, the array instead showed one mirror set as OK, and then TWO unconfigured drives.

This RAID controller, like most these days, stores the RAID configuration meta-data on the hard drives as well as in its BIOS. So if the drives are moved to another controller (say, because your controller fails), or you swap drives around, the controller can detect that and continue operating correctly.

In this case, somehow the drive with mirrored data for the one that failed did not have the RAID controller meta-data on it.

Despite the controller thinking that the array was bad, the data should have all been there.

So, we decided to take the conservative path: pull the drives from the array, put in fresh drives, reconstruct the data from the drives that should have had the good data on them, and stream that data over to the newly constructed array. This had the benefit that the drives with data on them would be treated as read-only.

So we set up a test array and laid out some test data on it. Because we had a workflow for the system setup, we knew exactly how the array had been configured when it was created. We were able to use that information to determine the on-disk layout and make sure it was what we expected (and it was: 64KB on the first drive, the next 64KB on the second drive, both at offset 0, then the next 64KB back on the first drive, and so on). We tested a small Python program that would reconstruct this data, and it produced exactly the right results.
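For the curious, a minimal sketch of that kind of interleaving is below. This is not the actual program we ran; the device paths and output target are hypothetical, and it simply assumes the layout described above (64KB chunks alternating between the two surviving stripe members, starting at offset 0):

    # Sketch only: rebuild the striped data by round-robining 64KB chunks
    # from one surviving (read-only) drive of each mirror pair.
    CHUNK = 64 * 1024  # 64KB stripe size, as confirmed against the test array

    def reconstruct(stripe_members, output_path):
        """Interleave CHUNK-sized blocks from each stripe member into output_path."""
        sources = [open(path, "rb") for path in stripe_members]
        try:
            with open(output_path, "wb") as out:
                done = False
                while not done:
                    for src in sources:          # first drive, second drive, repeat
                        block = src.read(CHUNK)
                        if not block:            # reached the end of a drive
                            done = True
                            break
                        out.write(block)
        finally:
            for src in sources:
                src.close()

    # Hypothetical invocation; the real source drives and target differed:
    # reconstruct(["/dev/sdb", "/dev/sdc"], "/path/to/new-array-target")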

So then we began laying down this data onto the new array. Unfortunately, the best speed we were able to get was 15MB/sec (the write speed to the new array was the bottleneck), and it ended up taking 36 hours to complete the copy.
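As a sanity check on that duration (the total size here is inferred from the speed and the time, not measured directly):

    # Back-of-the-envelope: ~15 MB/s sustained for 36 hours
    throughput_mb_per_s = 15
    hours = 36
    total_tb = throughput_mb_per_s * 3600 * hours / 1000 / 1000
    print(round(total_tb, 2), "TB copied")   # roughly 1.9 TB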

At that point, I did a few last checks, and then booted the system.

The host runs VMware ESX, and VMware booted fine. However, once the boot completed, it wasn't able to see any of the virtual machines.

It was close to working, but as one of my teachers used to like to say “Close only counts in horseshoes and hand grenades.” The corruption which caused the controller to not see the drive had obviously extended into the VMFS where the virtual machines were housed as well.

At this point we decided to attempt to reconstruct the array from the original drives, using the RAID controller itself. This is the riskier approach, because if things aren't done exactly right, you risk having the controller mirror a bad drive onto a good drive, etc.

LSI very helpfully provided an engineer to walk us through this step. The system booted, but it behaved exactly like it did with the reconstructed array in place: it couldn't see the virtual machines. So clearly that additional drive in the array was somehow corrupted, but in such a way that the controller didn't report it to the monitoring system.

At this point I gave up on the data recovery effort, and we started concentrating on either recovering the virtual machines from backups or setting up bare virtual machines for the ones that did not have backups (except for the ones that had already been set up for customers who opted not to wait for the recovery).

A few curious things:

  • The controller correctly saw the “failed” drive, which still had RAID meta-data on it. That, plus the monitoring (which said slot “2” had a bad drive) and the RAID meta-data on that drive marking it as “BAD” (indicating it had been failed by the controller), tells us that this wasn't a case of pulling the wrong drive.
  • From when the drive went bad until we shut the machine down, the machine continued to operate normally. This, plus our monitoring, leads us to believe that we didn't have two drives fail. Also, the RAID array didn't mark that last drive as “BAD”; it was as if it had no RAID meta-data on it at all.
  • We've done this replacement, using the workflow, 3 or 4 other times over the last 6 to 12 months, with no problems. Perhaps it was a bad controller?
  • The LSI engineer's best guess seems to be that the other drive in the mirror set failed. However, that doesn't agree with the lack of reports from monitoring, or with the fact that the drive was not marked as “BAD” like the drive we know failed. I'm sure LSI sees this all the time: monitoring isn't set up and nobody notices the first drive failure until the second drive fails. That is common, but it wasn't the case here.

Unfortunately, we can't come up with any definitive reason why this other drive in the array was corrupted. While a RAID-10 array can survive two drives failing (one from each mirror pair), of the 3 remaining drives this was the only one whose loss the array could not survive.

Summary

What went right:

  • We had a tested workflow for doing the replacement.
  • Because of this, when things were wrong, Kyle knew right away and called for another pair of eyes.
  • Our backups worked.
  • We had enough hardware resources to be able to try reconstructing the array without risking the old drives.
  • We had enough staff resources to be able to have some working on this system, some contacting impacted clients, and some bringing services back up on other virtual machines.
  • We had shut the system down so that if a drive mis-swap occurred, we would have been able to recover without live data streaming onto the array.
  • Instead of having a less-experienced NOC person replace the drive, we deferred the replacement to have a higher-level person on-site.

What we could have improved:

  • In retrospect, we probably should have migrated the virtual machines to another host first. However, this is a lengthy process and also involves some level of risk. In the future, we will probably do this, as I (not surprisingly) don't trust the RAID cards that are in these virtual machine hosts. NOTE: Our dedicated hosts use a different RAID controller, but those controllers weren't an option for the virtual hosts because array health monitoring was not available for them with ESXi.
  • Some of the virtual machines had important data on them but were not being backed up. Migrating the machines (the item above) really only masks this problem; this is the root of it. We made backups a separate option because of feedback from customers who preferred lower prices for machines whose data was already backed up elsewhere. As one of these customers said in response to this: “because of the level of redundancy on the system, I'd gotten lax in doing the backups”.
  • The list of email addresses of customers impacted by this maintenance was something Kyle had on his laptop. So when we needed to contact them with a status update, it would have required pulling Kyle off the recovery effort to give the update. We either should have done that, or should have had the address list available to someone else.

Action Items

  • We need to discuss making the backup option standard. We had tried to price it as low as possible, so that the customers who needed it had little incentive not to get it. But, when it comes right down to it, our customers are kind of relying on us to make sure that the right things are happening. Backups aren't optional, they're required.
  • I anticipate that we will escalate our plans to replace the existing virtualization infrastructure. ESX has worked relatively well for us, but it definitely does have a lot of limitations. While those limitations didn't directly cause this issue, they definitely contributed to it: ESX makes it very hard to migrate servers between hosts; we can't use our “standard” RAID card with ESX because we would lose the ability to detect when an array goes bad; and the virtual machine storage uses a proprietary format.

In conclusion: I feel terrible that data was lost “on my watch”. But it happened despite all the things we had in place to try to prevent it, so overall I feel our procedures were very good. We will definitely use this as a case to learn from. But there's a reason this hasn't happened in 11 years: we have pretty good procedures.
