In short, we saw dramatically fewer drive errors in production once we started running read/write tests on drives before deploying them.
As I mentioned in my previous post (linked above), we run fairly long tests (around a week for a 500GB drive) before putting a drive into service. In my experience this is unusual; I rarely run into others who test to a similar level. But even on RAID systems (everything we deploy these days uses RAID), fewer errors in production saves time and attention.
We started doing these tests because we were finding that maybe one in 5 or 10 drives would report sporadic read errors, but when we replaced the drive and ran badblocks on the removed drive, it reported no problems.
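For reference, a scan like the one we ran on pulled drives is badblocks in its default read-only mode, roughly as below. The device name is a placeholder; in this mode badblocks only reads, so it is safe on a disc that still holds data.

```shell
DEVICE=/dev/sdX   # placeholder for the pulled drive
# Default badblocks mode is read-only, so this does not touch the data.
# -s shows progress, -v reports errors as they are found.
badblocks -sv "$DEVICE" || echo "scan failed or found bad blocks on $DEVICE"
```

The catch, as described below, is that a read-only pass can only find sectors that are already unreadable; it says nothing about areas that will go bad shortly after being written.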
I theorized that the discs had some marginal areas, perhaps from manufacturing, perhaps from shipping, I don't know. These areas were good enough to be written to, but after some fairly short time the data there could no longer be read back because of bit-rot.
Once I started doing the read/write badblocks testing, writing and then verifying the full disc 20 to 40 times, these issues basically stopped showing up in production.
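That burn-in boils down to looping badblocks in write-mode. The sketch below is illustrative rather than the exact script we used; DEVICE and PASSES are placeholders. Each -w run writes and verifies four patterns (0xaa, 0x55, 0xff, 0x00) across the whole disc, so five to ten runs gives the 20 to 40 full write/verify passes mentioned above. Write-mode is destructive: only run it on a drive holding no data you care about.

```shell
#!/bin/sh
# Destructive read/write burn-in using badblocks write-mode.
# DEVICE and PASSES are placeholders -- set them for your hardware.
# WARNING: -w overwrites every block on DEVICE.
DEVICE=/dev/sdX
PASSES=5    # each -w run is 4 write+verify patterns, so 5 runs = 20 passes

for run in $(seq 1 "$PASSES"); do
    echo "badblocks write-mode run $run of $PASSES on $DEVICE"
    badblocks -wsv -o "badblocks-run$run.log" "$DEVICE" || {
        echo "run $run reported errors; failing this drive" >&2
        break
    }
done
```

Any run that logs bad blocks is grounds to reject the drive before it ever sees production, which is exactly the class of marginal media this post is about.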