Wednesday September 17, 2008 at 00:00
Subject: Why do we test all our drives?
Keywords:
Hard drives, Technical, Testing
Posted by: Sean Reifschneider
Related entries: What's the deal with Hitachi drives lately? by Sean Reifschneider, Tuesday September 16, 2008 at 23:22
In short, we started seeing dramatically fewer drive errors after
running read/write testing of drives before putting them into production.
As I mentioned in my previous post (linked above), we run fairly
long-running tests (around a week with 500GB drives) before using them.
However, in my experience this is fairly unusual -- it's rare that I
run into others that do similar levels of testing. Even for RAID systems
(all the systems we deploy these days are using RAID), reduced errors in
production save time and attention.
We started doing these tests because we were finding that maybe one in
5 or 10 drives would report sporadic read errors, but when we replaced the
drive and ran badblocks on the removed drive, it showed no problems.
I theorized that there might be some marginal areas of the discs,
perhaps from production, perhaps from shipping, I don't know. These areas
of the disc were good enough to be written to, but then after some fairly
short time the data there could not be read again because of bit-rot.
Once I started doing the read/write badblocks testing, writing and
then verifying the full disc 20 to 40 times, we found that these issues
basically stopped showing up in production.
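A minimal sketch of such a burn-in loop follows, assuming a hypothetical device name /dev/sdX that holds no data worth keeping (write-mode badblocks destroys everything on it). The function name burn_in and the BADBLOCKS override are illustrative only. Each badblocks -w run already writes and verifies four patterns; whether the 20 to 40 figure above counts runs or patterns is not stated, so the run count is left as a parameter.

```shell
#!/bin/sh
# Sketch: repeated destructive read/write testing of a drive before deployment.
# WARNING: badblocks -w overwrites the whole device.

# BADBLOCKS can be overridden (e.g. for a dry run); defaults to the real tool.
BADBLOCKS="${BADBLOCKS:-badblocks}"

# burn_in DEVICE RUNS
# Runs write-mode badblocks RUNS times. Returns 1 as soon as a run lists any
# bad block, 2 if badblocks itself could not run, 0 if every run was clean.
burn_in() {
    device="$1"
    runs="$2"
    log="$(mktemp)" || return 2
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "run $i of $runs on $device"
        # -w: write-mode test (write patterns, read them back)
        # -s: show progress, -v: verbose, -o: write the bad block list to a file
        if ! $BADBLOCKS -w -s -v -o "$log" "$device"; then
            echo "badblocks could not complete run $i" >&2
            rm -f "$log"
            return 2
        fi
        # A non-empty list means at least one block failed verification.
        if [ -s "$log" ]; then
            echo "bad blocks found on run $i" >&2
            rm -f "$log"
            return 1
        fi
        i=$((i + 1))
    done
    rm -f "$log"
    echo "all $runs runs completed on $device"
}

# Example (hypothetical device name, needs root):
#   burn_in /dev/sdX 10
```

The bad block list is checked rather than the exit status alone, since a run that finds bad blocks may still finish normally.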
| Comment |
Matt Taggart Subject: badblocks |
This is how I understand it; I don't remember where I heard this or even know if it's correct.
Drives have their own internal list of bad blocks and also additional storage beyond the published size of the drive. When the drive determines a block is bad, it marks it bad and allocates a block from this additional storage so that the size of the drive stays constant.
So I think your heavy testing is causing marginal blocks that otherwise might remain in use to fail and be listed as bad and replacement blocks to be allocated. This sounds like a good thing and is probably a good reason for everyone to do such testing. When you deploy the drive you'll have less of the "spare" space left and when that space is used up you will start getting real errors. But that is what that space is there for, better to use it sooner rather than later.
Anyone who knows this stuff, please correct the above if I have it wrong.
Matt
| Comment |
Author:
Sean Reifschneider Subject: Wikipedia has a good discussion... |
Wikipedia has a good discussion of Bad Sectors. It includes basically what you said above, but also goes over the distinction between bad sectors at read time and at write time.
What must be happening in the cases I saw was that data was written to the disc, and then the sector went bad and couldn't be read later. But once you write new data there, say when doing later badblocks testing, it does the remapping and the drive shows as fine.
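One way to watch this remapping happen is to compare the drive's reallocated-sector count before and after a test run. The sketch below assumes smartmontools is installed and that the drive reports the common Reallocated_Sector_Ct attribute (ID 5), which not every drive does. The helper name is hypothetical.

```shell
#!/bin/sh
# Sketch: extract the raw reallocated-sector count from `smartctl -A` output.
# Assumes the usual attribute table layout, where the first column is the
# attribute ID and the tenth column is the raw value.

# reallocated_count: reads `smartctl -A` output on stdin and prints the raw
# value of attribute 5 (Reallocated_Sector_Ct), or nothing if it is absent.
reallocated_count() {
    awk '$1 == 5 && $2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Example (hypothetical device name, needs root):
#   smartctl -A /dev/sdX | reallocated_count
```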
One thing to consider about this is its impact on RAID controllers. Most controllers will fail a rebuild if they run into an error on a read.
So imagine you have a 2 drive RAID-1 array with 20GB of data on 500GB drives. And one drive falls out of the array... You replace the drive and tell the array to rebuild. The controller is going to read all 500GB from one drive and write it to the other. Now, what if one of the sectors on the source drive has a read error? This read error is likely to be outside the used space, but the controller is now likely to set the array as failed (instead of simply degraded), and treat both drives as failed.
Some controllers now have an option for "continue rebuild on read error", which may result in the unreadable sectors being corrupted, but at least the bulk of the data will be available.
This is why it's important to make use of your controller's "verify array" functionality regularly. That should detect bad sectors early and allow remapping before bad sectors occur on both drives.
This is one of the few things that I don't like about the Linux software RAID. It doesn't have this verify functionality. To do it you would have to fail a drive from the array and add it back in and allow it to rebuild. However, if you fail one drive and there are bad sectors on the other, you're hosed. I really wish the Linux software RAID had a "verify array" function.
What I've done for this situation in the past is to set up regular jobs which read from the underlying devices, hoping that if a bad sector is detected it will cause the RAID array to fail the drive. A better test may be to do a full read from the RAID device to ensure that bad sectors are definitely detected on at least one set of the drives if they exist, even if the bad sector is outside of the used data space.
One nice thing about reading from the underlying device is that you can run the read on different schedules. I like to run the test on one drive once a week, but the other drive twice a week. I feel like I'm less likely to wear out the drives at the same rate so that hopefully they don't fail within close proximity of each other.
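A minimal sketch of such a read job follows, assuming GNU dd and hypothetical member devices such as /dev/sda and /dev/sdb. The function name is illustrative only, and this is not necessarily the exact job described above.

```shell
#!/bin/sh
# Sketch: read an entire block device (or file) and report read errors.

# read_scan DEVICE
# Reads DEVICE from start to end and discards the data. dd exits non-zero
# if a read fails, which is the signal of interest here.
read_scan() {
    device="$1"
    if dd if="$device" of=/dev/null bs=1M 2>/dev/null; then
        echo "OK: $device read completely"
    else
        echo "ERROR: read failure on $device" >&2
        return 1
    fi
}

# Example (hypothetical device name, needs root):
#   read_scan /dev/sda
```

Calling read_scan from separate cron entries, one per member device, is what allows the different schedules: for instance one drive weekly and the other twice a week.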
Sean
| Comment |
Durval Menezes Subject: "verify array" on Linux |
>I really wish the Linux software RAID had a "verify array" function.
Well, in recent kernels (at least since 2.6.20, possibly older ones
too), you can do a
    echo "check" >/sys/block/md<N>/md/sync_action

(just replace <N> with your array number). The system will then start a full verify on the array, where all blocks will be read and (if the array is of a redundant type like RAID1, RAID5, etc.) the redundancy will be checked too. In case you get physical read errors, the kernel will automatically recompute the block in error and then try to write to it and read it right afterwards (in fact this will solve most read errors, as the disk's firmware will simply reallocate the block to a good one from its internal block reallocation list). All this ends up logged in syslog.

It's also possible (because of firmware issues, FUBARed controllers, etc.) to have the redundant blocks being read OK but not agreeing: in this case, you will get a counter incremented in /sys/block/md<N>/md/mismatch_cnt, but no syslog messages, so be sure to check mismatch_cnt after the check operation finishes. The above "echo" returns immediately and the check operation is actually done in the background; you can monitor it via syslog (the end of the check operation gets logged), via "cat /proc/mdstat", or by checking /sys/block/md<N>/md/sync_action to see when it turns to idle.

In fact, Ubuntu 8.04 installs a script in /usr/share/mdadm/checkarray that starts an array check, and by default it is run by cron via /etc/cron.d/mdadm on the first Sunday of each month.
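The steps above (start the check, wait for sync_action to return to idle, then look at mismatch_cnt) can be sketched as a small script. The function names, the polling interval, and the SYSFS_ROOT override are assumptions made for illustration; only the sysfs file names come from the description above.

```shell
#!/bin/sh
# Sketch: trigger an md array check and report mismatch_cnt when it finishes.

# SYSFS_ROOT is overridable so the functions can be tried against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# md_dir ARRAY: print the sysfs md directory for an array name such as "md0".
md_dir() {
    echo "$SYSFS_ROOT/block/$1/md"
}

# start_check ARRAY: request a background check (the write returns at once).
start_check() {
    echo check > "$(md_dir "$1")/sync_action"
}

# wait_idle ARRAY [INTERVAL]: poll sync_action until it reads "idle".
wait_idle() {
    while [ "$(cat "$(md_dir "$1")/sync_action")" != "idle" ]; do
        sleep "${2:-60}"
    done
}

# report_mismatches ARRAY: print mismatch_cnt and fail if it is non-zero.
report_mismatches() {
    count="$(cat "$(md_dir "$1")/mismatch_cnt")"
    echo "$1 mismatch_cnt=$count"
    [ "$count" -eq 0 ]
}

# Example (hypothetical array name, needs root):
#   start_check md0 && wait_idle md0 && report_mismatches md0
```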