Why do Drives Fail?

Disk drive failures are terrifying. Drives that haven’t failed yet, arguably, are just as terrifying as failed drives—they’ll fail eventually!

Some companies use lots of disks; Google is one of them. They surveyed their drive population over time, analyzing failures and applications. The observations are recorded in a paper called "Failure Trends in a Large Disk Drive Population". The study investigates almost 3500 drives that failed while being used at Google’s data centers.

Several of the paper's conclusions seem to contradict commonly held beliefs about disk drives. One is that there’s no link between a drive's activity level, or the heat in its environment, and its failure rate. A cooler drive doesn’t necessarily enjoy more longevity, and a drive that’s used heavily doesn’t necessarily fail sooner.

Surprisingly, the paper also fails to conclude that the age of a drive determines its failure rate. This is partially because drive technology changes so quickly: the older drives in the population are mostly of a different design or model than the newer ones. You’d expect either a steady, linear rise in failure rate with age, or an upswing as drives wear out. There is an upswing in failure rates as drives age, but there’s also a hump and then a decline among mid-life drives. I don’t think that the paper uses enough data to reach a hard conclusion, but I think it’s still notable that the conclusion can’t be reached even though it might have seemed quite apparent.

Further, the paper explains that SMART doesn’t help much in predicting drive failure. This is rather shocking, considering how much the drive industry has invested in SMART and the support around it. The drives in Google’s servers are monitored frequently, and the SMART data is recorded. Even so, SMART gave no clear indication of impending failure.

Internet forums are full of people who proclaim that a certain brand of drive is unreliable because they’ve had a lot of them fail, and this is very dubious logic. Almost all of these authors are drawing on very small samples, and the anecdotal result is nothing to be relied upon. The fact remains that drives fail, and I don’t believe any particular model or line to be more or less reliable than another. I was prepared for that much, but I wasn’t really expecting that SMART wouldn’t be useful for monitoring drive health.
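
To put a rough number on the small-sample problem, here is a minimal sketch in Python. It is not from the Google paper; the drive counts are made up for illustration, and the function name `wilson_interval` is my own. It computes an approximate 95% Wilson score interval for a failure rate, which shows how little a handful of drives can tell you.

```python
import math


def wilson_interval(failures, drives, z=1.96):
    """Approximate Wilson score interval for a failure proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if drives <= 0:
        raise ValueError("drives must be positive")
    if not 0 <= failures <= drives:
        raise ValueError("failures must be between 0 and drives")
    p = failures / drives
    denom = 1 + z * z / drives
    center = (p + z * z / (2 * drives)) / denom
    half = z * math.sqrt(p * (1 - p) / drives + z * z / (4 * drives * drives)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical numbers: a forum poster with 4 drives versus a fleet of 1000,
# both seeing half of their drives fail.
for failures, drives in [(2, 4), (500, 1000)]:
    low, high = wilson_interval(failures, drives)
    print(f"{failures}/{drives} failed: plausible failure rate {low:.0%} to {high:.0%}")
```

Both cases observe the same 50% failure rate, but the interval for four drives is many times wider than the one for a thousand. Two dead drives out of four is consistent with a model that is quite reliable or one that is terrible, which is why I don't put any stock in forum anecdotes.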

One response to “Why do Drives Fail?”

  1. GregM

    Very interesting to read that continuous/high levels of activity have no real link to failing drives . . . something which I always thought to be the case myself.

    I have yet to have a drive fail on me . . . but I guess one just has to accept that there is a level of bad luck surrounding failures!
