Mean Time Between Failure: A probability question

I was talking to some people about setting up some computer hard drives in a RAID-0 configuration (where two drives are linked to appear and act as one to the computer but have no fault tolerance).

Their claim was that this was a bad idea because you halve the expected lifetime of the hard drives. Essentially they seem to think that because you are “rolling the dice” on a chance of failure twice rather than once you double the chances a failure will occur (or halve the expected time between failures).

However, I am not so sure this would be the case. Say each hard drive has a Mean Time Between Failure (MTBF) of 50,000 hours. Any given drive may run longer or shorter than that, since it is an average, but I should expect to get somewhere in the neighborhood of 50,000 hours out of each drive; I would not expect one to quit at 25,000 hours.

Certainly by running two drives the chances of a failure are increased, but by just how much? Even if a specific answer can’t be had, is there enough info here to get a ballpark figure?

Not thinking too hard, but it seems like you need more than the mean. Consider these cases:

  1. A drive always goes to exactly 50,000 hours, then dies.
  2. 50% chance of lasting 25,000 hours, 50% chance of lasting 75,000 hours.
  3. Same as above, with 0 hours and 100,000 hours.
    etc…

Each one would lead to different outcomes, so I think we need more info: the variance, or the whole distribution curve.
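One way to see it: here is a little Python sketch (using just the three hypothetical cases above) that draws two independent drive lifetimes from each distribution and takes the minimum, since a RAID-0 set dies as soon as its first drive does. Every case has the same 50,000-hour mean for a single drive, but the two-drive sets come out very different.

```python
import random

random.seed(0)
N = 100_000  # simulated two-drive RAID-0 sets per case

# Three single-drive lifetime distributions, all with a 50,000-hour mean
cases = {
    "always exactly 50,000 h": lambda: 50_000,
    "50/50: 25,000 h or 75,000 h": lambda: random.choice([25_000, 75_000]),
    "50/50: 0 h or 100,000 h": lambda: random.choice([0, 100_000]),
}

for name, draw in cases.items():
    # A RAID-0 set dies when its first drive dies: min(lifetime1, lifetime2)
    set_lifetimes = [min(draw(), draw()) for _ in range(N)]
    print(f"{name}: mean RAID-0 set lifetime ~ {sum(set_lifetimes) / N:,.0f} h")
```

You should see roughly 50,000, 37,500, and 25,000 hours respectively - the same single-drive mean, three different answers for the pair.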

To further complicate matters, I once read that MTBF figures do not apply to specific physical drives at all. If the MTBF is 50 years, you would still have to replace that drive with the exact same model of drive many times over those 50 years to generate the failure curve they are really talking about.

This Wikipedia entry describes MTBF much better than I could manage. Look especially at the last section called “MTBF and life expectancy” for a better understanding of what I think Shagnasty is trying to say. A quick summary, though, is that MTBF doesn’t describe life expectancy for any one individual.

For the OP, I run RAID-1 specifically because of the MTBF numbers. Granted, that won’t help if what you’re really after is larger volume sizes.

What muttrox said. A different probability distribution function will change the results.

That being said, if we assume that an MTBF of 50k hours corresponds to a constant failure rate (so that the drive has the same chance of failing during Hour 1 as it does during Hour 100,000), then that works out to a 1/50,000 chance of failure during any given hour, or 0.00002.

If you’ve got two drives, and the failure of one drive doesn’t affect the probability that the second drive fails, then the probability of at least one drive failing during any given hour is:

p(d1) + p(d2) - p(d1)*p(d2)

= 0.00002 + 0.00002 - (0.00002)*(0.00002)
= 0.0000399996

Which works out to an MTBF of 1/0.0000399996, or about 25,000.25 hours.

So the two drive set as a whole is basically half as reliable as a single drive.
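If you want to check that arithmetic, here’s a minimal Python sketch of the same calculation, under the same constant-failure-rate and independence assumptions:

```python
# Same constant (memoryless) failure-rate assumption as above
p_single = 1 / 50_000  # chance that one drive fails in any given hour

# Chance that at least one of two independent drives fails in that hour
p_either = p_single + p_single - p_single * p_single
mtbf_pair = 1 / p_either

print(f"hourly failure chance for the pair: {p_either:.10f}")  # 0.0000399996
print(f"MTBF of the pair: {mtbf_pair:,.2f} hours")             # about 25,000.25
```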

Right reasoning, wrong start. MTBF isn’t really relevant. What you’ve got to look at is the probability of each drive failing and the effect of that failure. A RAID 0 set will fail if either drive fails, whereas a RAID 1 (mirror) set will only fail if both drives fail.

Suppose a drive has a 10% chance of failure per year. Your RAID 0 set will fail 20% (10% + 10%) of the time whereas your RAID 1 set will only fail 1% of the time (10%x10%).

With current high drive capacities and low prices, there’s little reason not to go RAID 1.

Or if you can afford more than two disks, RAID 5. That gets you the high performance of striping across many disks, with parity data so you can survive one disk blowing up.

Actually, the RAID 0 fails only 19% of the time (there’s a 1% chance both drives will fail). With two identical independent drives, and a p chance of failure of each drive over a given time period, the chance of either failing is p+p-(p*p) [as Valgard said already].

For a 50% chance of individual failure before the MTBF, the RAID 0 has a 75% chance of failing by the MTBF.
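To put numbers on both configurations (same independence assumption, purely illustrative):

```python
def raid0_fail(p: float) -> float:
    """RAID 0 fails if either of two independent drives fails."""
    return p + p - p * p

def raid1_fail(p: float) -> float:
    """RAID 1 (mirror) fails only if both drives fail."""
    return p * p

for p in (0.10, 0.50):
    print(f"single-drive failure chance {p:.0%}: "
          f"RAID 0 fails {raid0_fail(p):.0%}, RAID 1 fails {raid1_fail(p):.0%}")
```

That prints 19% and 1% for the 10% example, and 75% and 25% at the 50% mark.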

And, to nitpick further, in the real world the drives’ failures aren’t going to be completely independent, since some failures are going to be encouraged by the environment (voltage shocks, heat stress, etc.). So I’d expect the hypothetical RAID 0 to fail slightly less often than predicted by the calculation, but not so much as to make it close to a single drive.
The bottom line is that RAID 0 is indeed much less reliable than a single drive.

Failure curves for hard drives tend to be bathtub curves. (Relatively) high rates of failure early on, then a long period of very low failure rates, then spiking failure rates again as the device gets old enough to wear out. Depending on how high the sides of the bathtub are, the MTBF is probably somewhere in the middle of the low part.

The calculation for RAID-0 vs 1 is still the same, but in the middle years, the chance of a single drive failing is relatively low.
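As a toy illustration (the rates below are invented, not real drive data), here’s a piecewise “bathtub” hourly failure rate and what the same either-drive-fails combination does to it in each phase:

```python
# Toy piecewise "bathtub" hazard; the rates are made up for illustration
def hourly_failure_rate(hour: int) -> float:
    if hour < 1_000:      # infant mortality: early defects shake out
        return 1 / 10_000
    elif hour < 40_000:   # useful life: long stretch of low, roughly constant risk
        return 1 / 100_000
    else:                 # wear-out: the mechanics age
        return 1 / 5_000

for hour in (500, 20_000, 50_000):
    p = hourly_failure_rate(hour)
    p_raid0 = p + p - p * p  # same either-drive-fails combination as before
    print(f"hour {hour:>6}: single drive {p:.6f}/h, two-drive RAID 0 {p_raid0:.6f}/h")
```

The RAID-0 set is roughly twice as likely to fail in every phase, but in the long flat middle of the tub that is still a very small number per hour.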

I guess this is dependent upon what you define as the failure characteristics. In a pure drive failure sense, the calculation is most certainly the same, but you would still have a functional RAID system with missing drives, and the ability to recover data. Unless you’re in that tiny percentage where both drives fail (crosses fingers).

The problem here is that people are confusing failures (of a drive) with a system failure (or outage). As has already been shown, adding more components reduces MTBF, but can increase the time to an outage, and so is a good thing.

Don’t forget that no particular drive is ever going to actually be in service for 50 years. I’m actually in the middle of doing some MTTF (mean time to fail - the component I’m measuring doesn’t get replaced) calculations right now. You compute the total power-on hours for all components in the field, collect the total number of failures, and divide. We actually report in FITs (Failures in Time), which is failures per billion power-on hours. For us that is a more useful measure, but it will be a pessimistic one, in that many components that get removed for other reasons might have lived much longer.
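Here’s a sketch with made-up field numbers, just to show the arithmetic described above:

```python
# Made-up field data, purely to illustrate the MTTF and FIT arithmetic
total_power_on_hours = 2_500_000_000  # summed over every unit in the field
total_failures = 120

mttf_hours = total_power_on_hours / total_failures
fit = total_failures / total_power_on_hours * 1e9  # failures per billion power-on hours

print(f"MTTF ~ {mttf_hours:,.0f} hours")
print(f"FIT  ~ {fit:.0f} failures per billion power-on hours")
```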

Yeah, that’s what I was supposed to have been pointing out. I forgot to mention that it was in the context of RAID-1 versus RAID-0. Even though the MTBF is the same calculation for both configurations’ individual drives, it’s not necessarily accurate for the system, since a functioning RAID-1 system can consist of only a single drive (not useful, but still a functioning system). Once you lose a single unit of a RAID-0, though, the RAID system has failed, and so the MTBF isn’t the same.

Of course, even though technically you have a functioning RAID-1 system with a single drive, you may operationally classify this as a system failure, in which case the MTBFs are the same again.

MTBF is normally calculated over an expected service life, say five years for a hard disk. So a 50,000 hour MTBF tells you something about the expected failure rate over the service life, not the ultimate longevity of the hard disk.

This should help you understand it more - and yes, that would technically double the probability of a failure (of course there are other variables to consider).

I agree that it doesn’t tell you anything about longevity (especially when repair is possible) but I don’t understand how you would factor in service life. You can compute estimated MTBF from the estimated MTBFs of the components of a system, or you can compute it based on field data. You of course don’t collect data from systems past their service lives.

It’s true that MTBF has problems. One of our statisticians likes an example of how it breaks down for life expectancies - you can’t predict how long a 90-year-old will live given average life expectancies for an entire population. That’s what hazard functions are for.
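A rough simulation of that point (toy lifetime models, not real actuarial data): under a memoryless model a 90-year-old has just as much expected life left as anyone else, while under an age-dependent model they clearly don’t.

```python
import random

random.seed(1)
N = 200_000

# Two toy lifetime models with roughly the same ~78-year mean, purely illustrative
memoryless = [random.expovariate(1 / 78) for _ in range(N)]         # constant hazard
age_dependent = [max(0.0, random.gauss(78, 12)) for _ in range(N)]  # hazard rises with age

def mean_remaining_life(lives, age):
    survivors = [t - age for t in lives if t > age]
    return sum(survivors) / len(survivors)

for name, lives in (("memoryless (exponential)", memoryless),
                    ("age-dependent (normal)", age_dependent)):
    print(f"{name}: population mean ~ {sum(lives) / N:.1f} years, "
          f"mean remaining life at 90 ~ {mean_remaining_life(lives, 90):.1f} years")
```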