
As many of you who know me are aware, I've been a pretty big advocate of IBM/Hitachi drives for a very long time. I've probably deployed around 500 of them over the years with a very low failure rate, and when I have had problems, IBM and more recently Hitachi have been extremely helpful. Even with the drives that brought about the “Death Star” nickname, I had almost no problems.

However, over the last 45 days we've been trying to work with Hitachi to resolve some problems and have made basically zero progress, despite giving them quite a bit of time to provide some information. I wanted to open up the discussion to others to see whether they are seeing similar results, or whether it's just the level of testing we do and nobody else is doing this kind of testing.

In short, we test all drives before we put them into production by using a read/write badblocks test. We've done this for probably 5 to 7 years as part of our standard hardware burn-in. No hard drive goes into our managed hosting facility without going through 4 to 7 days of full read/write testing.

With our most recent batch of 500GB drives (part number 0A35415, with labels that say “May 2008”), we started seeing errors during the burn-in. And I'm not talking about one drive: I'm talking about 22 tested drives out of 2 different 20-packs, tested on 4 different systems with different controllers, all of which reported no errors when testing drives from older batches.

Have you seen similar problems with recent Hitachi drives? What sort of testing do you do on new drives, and would you even have noticed? Read on for more details of what we've been seeing, including information on how to do your own testing.

The interesting thing to note is that the first batch of drives we got was in the factory shipping case, but the case had been opened and re-sealed at some point. All the drives appeared to be factory sealed, though. When we started seeing these errors we thought that maybe somebody had damaged the drives, but the RMA we got from our vendor resulted in a second case, this time factory sealed, which had exactly the same problem. At this point I suspected that the problem was with new firmware on the drives, since we had tested probably 60 to 80 500GB drives over the previous 6 months without problems.

So what exactly is the problem? To test the drives we're running “badblocks -svw -p 5 /dev/sda”, which does 5 passes, each consisting of 4 full read/write cycles with different patterns. Normally, if the drive were bad, badblocks would report blocks as bad. This is the result, as I understand it, of writing a block, reading it back, and getting invalid data.
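
To put the length of this burn-in in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that a 500GB drive holds 500 × 10^9 bytes, that each of the 20 cycles (5 passes × 4 patterns) writes the whole disc once and reads it back once, and that the drive sustains a constant 80MB/sec. The function name and parameters are just for illustration.

```python
def burn_in_hours(capacity_bytes, rate_bytes_per_sec, passes=5, patterns=4):
    """Estimate how long a destructive badblocks burn-in takes.

    Assumes every pass runs all `patterns` patterns, and that each pattern
    costs one full write of the disc plus one full read-back.
    """
    cycles = passes * patterns                 # 5 passes x 4 patterns = 20 cycles
    bytes_moved = capacity_bytes * 2 * cycles  # one write + one read per cycle
    seconds = bytes_moved / rate_bytes_per_sec
    return seconds / 3600


if __name__ == "__main__":
    hours = burn_in_hours(500 * 10**9, 80 * 10**6)
    print("about %.1f hours (%.1f days)" % (hours, hours / 24))
```

Under those assumptions the run comes out at roughly 69 hours, or just under 3 days. Treat that as a lower bound: real drives slow down toward the inner tracks, so the sustained rate over a whole disc is lower than the peak figure.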

The problems we're seeing, however, are different. badblocks is incrementing its error counter, and the kernel is reporting something similar to the following:

[ 8524.147171] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
[ 8524.147178] ata1.00: BMDMA stat 0x25
[ 8524.147185] ata1.00: cmd c8/00:80:80:ff:ff/00:00:00:00:00/ef tag 0 dma
[ 8524.147187] ata1.00: res 51/04:80:80:ff:ff/00:00:00:00:00/ef Emask 0x1 (device error)
[ 8524.147192] ata1.00: status: { DRDY ERR }
[ 8524.147277] ata1.00: error: { ABRT }
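
If you want to decode the “res” line yourself rather than rely on the kernel's summary, the following is a minimal Python sketch. It assumes that the first two hex bytes after “res” are the drive's status and error registers (that is my reading of the kernel's output format), and it only knows the handful of bits that the kernel spells out by name. The helper name decode_res is made up for this example.

```python
import re

# Bit names for the ATA status and error registers. This is only the subset
# the kernel prints by name in its "status:" and "error:" lines.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Assumption: "res SS/EE:..." carries status (SS) and error (EE) first.
RES_RE = re.compile(r"res ([0-9a-fA-F]{2})/([0-9a-fA-F]{2}):")


def decode_res(line):
    """Return (status_flags, error_flags) for a 'res' line, or None."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    status_flags = [name for bit, name in STATUS_BITS.items() if status & bit]
    error_flags = [name for bit, name in ERROR_BITS.items() if error & bit]
    return status_flags, error_flags
```

Fed the “res 51/04:...” line above, this gives DRDY and ERR for the status and ABRT for the error, which matches the “status:” and “error:” lines the kernel printed. In other words, the drive itself aborted the command; the controller did not report a transport problem.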

It's not showing these as bad blocks, which makes me think that the data is being written correctly, but that the drive is reporting an error along the way. Perhaps the command is retried and then goes through. So, is this a “normal error”? I just don't know, and I have no way to tell. However, the issue is that this is not something we normally see in our testing, and if this drive were in production and we were getting messages like the above, how would we know whether they are “normal errors” or “critical errors”?

Since the drives are, apparently, not corrupting data to or from the drive, it makes me wonder whether anyone else would even notice the problem? I know a lot of people just take drives and start using them. It could be that literally nobody else has been testing these drives to the extent that they'd see this problem.

The response from Hitachi has been pretty poor. In the past I've received superb support from IBM and Hitachi on the rare occasions when I've had to contact them. This time we provided detailed information on the problems we're seeing (provided below), but the support rep we were speaking to never followed up with us; we basically had to pester them to get back to us about it.

Then we were told to ship the second bad case of drives back to Hitachi, but they turned around and said we needed to test them first, so they're apparently in the process of shipping the case back to us. We had tested only 2 of the 20 drives in this case, because both of them failed in exactly the same way as the drives in the previous case, where we tested all 20.

If you are interested in replicating the testing, you basically need to boot the CentOS 5.2 rescue CD with “linux rescue”, tell it to skip mounting the partitions, and then run a destructive test with “badblocks -svw -p 5 /dev/sda”. That will wipe all data on the drive. Once it's done, look at the results that were written to the screen, but also look at the output of “dmesg” to see if any errors similar to the above were produced.
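
Reading through a long dmesg by eye is tedious, so here is a small Python sketch that counts the exception lines per ATA device. It assumes you have saved the dmesg output to a file and are looking at it on a machine with Python 3 (I would not count on the rescue CD having it), and that the errors look like the ones shown above. The function name is just for illustration.

```python
import re
from collections import Counter

# Matches lines such as "[ 8524.147171] ata1.00: exception Emask 0x0 ..."
EXCEPTION_RE = re.compile(r"(ata\d+(?:\.\d+)?): exception Emask")


def count_ata_exceptions(dmesg_text):
    """Count 'exception Emask' lines per ATA device in saved dmesg output."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, after “dmesg > burnin.log”, calling count_ata_exceptions(open("burnin.log").read()) returns a mapping from device name to the number of exceptions logged. A drive that passed cleanly should not appear in it at all.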

Below is the information I wrote up for Hitachi about the problems we were seeing. It's fairly detailed about the issues, the hardware we tested with, and the results.

Dear Winston,

We've tested the hard drives doing:

   badblocks -svw -p 5 /dev/sda

under a CentOS 5 (the free version of Red Hat Enterprise Linux version 5) system. This does 5 read/write passes of “badblocks”. Each pass writes one of 4 different patterns to the whole disc, reads it back in, and verifies that the pattern is intact, and then does the same for each of the other 3 patterns.

This is our standard procedure for verifying hard drives prior to placing them in production. Hitachi is our preferred drive, and we currently have 200 to 300 of them spinning in our production environment that have been deployed over the last 3 years, with another few hundred that we've deployed for other customers.

We burn in all drives we deploy for production to identify problems, which are very rare, but also to exercise the drives, doing a full read/write cycle 20 times with different patterns, allowing any marginal blocks to get remapped.

The short results are that out of a case of 20 500GB drives we have 12 that consistently fail when running in our typical configuration. The remaining 8 drives show some problems, but we have been unable to run them through as extensive a set of tests as the other 12.

We have tested the problem drives in a number of different systems and with different controllers, using a combination of a dozen of the drives from this case:

  • Supermicro 5015M-MT+ with on-board Intel PIIX controller with CentOS 5
    All tested drives reported errors.

  • Supermicro 6025B-3V with on-board Intel PIIX controller with CentOS 5
    All tested drives reported errors.

  • Supermicro 6025B-3V with on-board Intel PIIX controller with Ubuntu Hardy
    All tested drives reported errors.

  • Supermicro 6025B-3V with Supermicro AOC-SAT2-MV8 controller with CentOS 5
    Controller I believe uses Marvell chipset.
    All tested drives reported errors.

  • Supermicro 6025B-3V with 3Ware 9550 controller with CentOS 5
    Controller running in JBOD mode, but was running at only 7MB/sec
    instead of 80+MB/sec like all other combinations.
    This configuration I only tested one drive on (it took 24 hours to
    fill the disc just once).
    *** No errors were reported.

  • Supermicro 6025B-3V with on-board Adaptec controller with CentOS 5
    All tested drives reported errors.

  • Home-built system, one of the 3 systems we usually use for burning in drives, with on-board Intel PIIX controller
    All tested drives reported errors.

We tested all dozen drives on the Intel PIIX controllers and all of them failed. We then used one of these failed drives to test in a number of the other configurations.

We initially pulled 8 of the drives out of the case and put them in a server we were deploying. I initially was using the Supermicro AOC-SAT2-MV8 controller for this, because a few months ago we deployed a similar system with 8 Hitachi Deskstar drives in that same configuration (same controller/chassis/CPU). However, when I had these drives connected to that controller, during boot one of the drives would fail to be recognized by the controller BIOS.

I tried replacing the two drives that were acting “funny”, but the replacement drives acted exactly the same. I replaced the controller and it still continued to act “funny”. I switched to using the on-board SATA/SAS controller and was able to run some tests without problems, but the rate I was getting was slow. It seems that if the rate is slow enough, the problem doesn't show up.

These 8 drives are over an hour away now, and we haven't had a chance to go get them and run them through this testing configuration yet, but we definitely have to switch to the Supermicro controller so resolving these drive issues is going to be required.

Note that I've deployed something around 50 Hitachi Deskstar drives connected to the AOC-SAT2-MV8 controllers with no problems until now.

As far as the “badblocks” testing goes, over the last year we've performed this test on 100 to 150 Hitachi Deskstar drives in this configuration, many on exactly the same systems we used for the tests above, with no problems unless the drives were definitely having issues (drives that had been pulled out of production because of errors or the like).

Also note that on these same systems we have tested other drives from our stock, 80 and 250GB Deskstar drives, and they have reported no errors.

I've attached a couple of photographs showing the dmesg report for one of the drives. They are all reporting similar failures, featuring lines like:

   [ 8524.147171] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
   [ 8524.147178] ata1.00: BMDMA stat 0x25
   [ 8524.147185] ata1.00: cmd c8/00:80:80:ff:ff/00:00:00:00:00/ef tag 0 dma 65536 in
   [ 8524.147187] ata1.00: res 51/04:80:80:ff:ff/00:00:00:00:00/ef Emask 0x1 (device error)
   [ 8524.147192] ata1.00: status: { DRDY ERR }
   [ 8524.147277] ata1.00: error: { ABRT }

This is not a capture of the errors, but a transcription. I'm sure your testing will show similar problems.

The hard drives are:

  • 0A35415BA27270C85
  • P/N 0A35415
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • S/N [redacted] May-2008
  • 8 left to do - they're in a server at our facility, which we won't be able to get the S/N off of until the replacements arrive from our vendor.

It's also worth noting that the case of drives that we received had been re-sealed; it was not factory sealed. However, the individual drives were in sealed silver plastic bags. We started buying drives in case quantities from our vendor around 9 months ago because a batch of drives we got from another vendor had a very high failure rate. Those drives usually weren't even detected by the BIOS during boot, or else they would run badblocks very slowly, like 6MB/sec instead of 65MB/sec for the 250GB Deskstars.

In a batch of a dozen drives we had ordered from the other vendor we had something like 8 fail. This is more failures than we had out of 100 drives we had been running over the last 2 years, so we figured that it must have been some problem in the shipping department at the other vendor when they were re-packaging the drives in individual quantities. That is why we started buying in case quantities from this vendor, figuring that it would likely result in the drives arriving as they were originally packed at Hitachi.

Thank you.

Evelyn Mitchell
