I’m looking at purchasing internal hard drives for a server. I’m comparing drives and some of them claim, for example: “1 million hours MTBF…”. MTBF is Mean time between failures.
Regardless of how the manufacturer arrived at this number, can this be used to compare the longevity of one product’s model against another? After all, 1 million hours is more than a century. Does 1 million vs 1.4 million matter to compare two products, or is anything with 1 million hours of MTBF so off the scale this measurement doesn’t matter?
MTBF is a bad measure for determining probable life of an item. It considers the possibility of failure only during a product’s projected useful life, which in the case of a hard drive certainly isn’t a million hours. Maybe the projected life is 43,830 hours (about five years). The engineered life is perhaps eight years, so that eight years falls at the top of the reliability bell curve (e.g., half could fail before eight years and the other half after eight years, but enough will last more than five years for warranty purposes). In five years, about 23 drives will accumulate one million hours, or if they’re using my example of eight years, about 14 drives are enough to reach roughly a million hours.
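The drive-count arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes every drive runs 24/7 for its whole life.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

def drives_needed(total_hours, years_per_drive):
    """How many drives, each running continuously for years_per_drive,
    it takes to accumulate total_hours of combined run time."""
    return total_hours / (years_per_drive * HOURS_PER_YEAR)

for years in (5, 8):
    print(f"{years} years per drive: {drives_needed(1_000_000, years):.1f} drives")
```

The point is that a million hours is a statement about a population of drives, not about any one of them.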
Hard drive failure rates are around 1-3% per year under heavy usage. Instead of worrying about failure rate, focus on redundancy. Two hard drives with the worst failure rate will still perform better than one hard drive with the best failure rate.
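To put rough numbers on that, here is a minimal sketch. It assumes drive failures are independent and that a failed drive in the mirrored pair is not replaced during the year, both of which are simplifications. The function names are mine.

```python
def single_drive_loss(annual_failure_rate):
    """Chance of losing the data in a year with one unmirrored drive."""
    return annual_failure_rate

def mirrored_pair_loss(annual_failure_rate):
    """Chance that both drives of a mirror fail in the same year,
    assuming independent failures and no replacement of the first failure."""
    return annual_failure_rate ** 2

worst, best = 0.03, 0.01  # the 1-3% per year range quoted above
print(f"one drive at 1%/yr:  {single_drive_loss(best):.4%}")
print(f"two drives at 3%/yr: {mirrored_pair_loss(worst):.4%}")
```

In practice the failed drive would be replaced long before the year is out, so the mirrored figure is, if anything, pessimistic. Correlated failures (same batch, same power event) push the other way.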
Beyond that, in my experience (~15 years as an IT pro), hard drives aren’t as likely to just fail randomly in use, as they are to fail after being power cycled, or very shortly after spinning up a brand new disk.
By that, I mean that in general, drives in a powered-up machine were unlikely to fail, and that the most likely times for older drives to fail were after those drives were powered down and brought back up for whatever reason. New drives were most likely to fail very shortly after you spun them up for the first time, or they were not likely to fail for a long time.
So MTBF is kind of a metric that sounds good for marketing purposes, because they’re able to use statistical analysis to produce an absurdly large number of hours between failures, but that isn’t even remotely handy in everyday use, as the failure patterns don’t really fit into an MTBF kind of framework.
Maybe for Google or Facebook that kind of metric is useful across however many tens of thousands of hard drives they have. The difference between a 900,000 hour MTBF and a 1,000,000 hour MTBF might be relevant to them, but to an average home user or non-huge IT shop, it’s kind of meaningless.
Two hard drives instead of one might seem uneconomical, since you’re paying twice as much. But you can do much better than that. It’s possible, for instance, to store N hard drives’ worth of data on N+1 hard drives, in such a way that if any single hard drive fails (as long as it’s a detectable failure), no data will be lost. More complicated arrangements are also possible, to allow multiple drive failures, or to protect against failure modes that are more difficult to detect, or to provide redundancy benefits while also speeding up data access when everything is going right. Collectively, these sorts of arrangements are called RAID (Redundant Array of Independent Disks).
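One common way to get the N+1 arrangement is XOR parity. The sketch below is a toy illustration of the idea, not how any particular RAID implementation lays out data; the "drives" are just short byte strings of equal length.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data "drives" plus one parity "drive" (N = 3, so N+1 = 4 drives total).
data = [b"disk-one", b"disk-two", b"disk-3!!"]
parity = xor_blocks(data)

# Drive 1 dies. XOR the survivors with the parity block to rebuild it.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

This only works because we know which drive is missing, which is the "detectable failure" condition mentioned above.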
Going beyond hard drives, MTBF is used by military logistics planners to determine the level of spares required for systems. Although minimum MTBF ratings are typically specified in a contract, modules with lower numbers will have a higher number of spares made available.
Being able to say “We tested 40,000 hammers and only 1 of them broke the very first time we used it” is not the same at all as saying “You can probably use this hammer 40,000 times before it breaks.” MTBF is more like the former. They connect 1,000 hard drives to a testing machine and run them for 1,000 hours. If they get one failure during the test, they can say the MTBF is 1 million hours (1 million total drive-hours divided by one failure). But that’s not nearly the same thing as running a single drive continuously for 1 million hours (114 years), let alone powering it up and running it for one hour then powering it down again 1 million times. I doubt that anyone in their right mind seriously thinks a hard drive can possibly still be working 114 years after it was built.
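As a rough sketch of the arithmetic, an MTBF estimate from a test like this is just total accumulated drive-hours divided by the number of failures. The numbers below (1,000 drives, 1,000 hours each, one failure) are illustrative, and the function name is mine.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

def mtbf_from_test(drives, hours_each, failures):
    """Total drive-hours on test divided by observed failures."""
    if failures == 0:
        raise ValueError("no failures observed; this estimate is undefined")
    return drives * hours_each / failures

estimate = mtbf_from_test(drives=1000, hours_each=1000, failures=1)
print(f"MTBF estimate: {estimate:,.0f} hours "
      f"(about {estimate / HOURS_PER_YEAR:.0f} years)")
```

Note that no drive in this test ran for more than about six weeks, yet the headline number is over a century.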
The problem is, if they only tested it with MTBF, that’s the only real test data available for predicted use of the specific drive that’s currently being shipped. Even articles which talk about drives made 2-3 years ago don’t have a lot of meaning, because the companies could be changing the product specifications and manufacturing process in ways that improve the drives or introduce more problems.
Thanks to everyone who contributed to this thread. The issue is not about whether to use RAID (disk mirroring) systems or simply do regular backups, but whether the MTBF claim has any validity. Does a 1.4 million hour MTBF indicate that a drive will work longer without failure than one that claims a 1 million hour MTBF? From the responses here it sounds like it’s a meaningless number. I guess feedback/reviews from purchasers of the drives are a better indication, but when I read feedback on the same product ranging from someone who bought 10 drives that are all working perfectly to the other extreme, where someone says they bought 10 drives and several of them failed within a few months, you have to question if there is another metric to use that has any meaning. I guess the solution is to look for 5 year warranty drives for critical applications and replace the drives before the warranty expires.
It doesn’t mean that a single drive will last longer. It does indicate that when you have lots and lots of drives, fewer of them will malfunction within their designed operating life. This is a point that bears repeating: MTBF only makes sense for a product’s designed operating life. It has nothing at all to do with how long a product will last.
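Here is a hedged sketch of what that means for a fleet. It assumes a constant failure rate during the designed operating life (the exponential model that MTBF figures usually imply), which ignores early failures and wearout. The fleet size of 1,000 is hypothetical.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

def annual_failure_rate(mtbf_hours):
    """Fraction of drives expected to fail in a year of 24/7 use,
    assuming a constant failure rate of 1/MTBF."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def expected_failures_per_year(fleet_size, mtbf_hours):
    return fleet_size * annual_failure_rate(mtbf_hours)

for mtbf in (1_000_000, 1_400_000):
    print(f"MTBF {mtbf:,} h: {annual_failure_rate(mtbf):.2%} per year, "
          f"about {expected_failures_per_year(1000, mtbf):.1f} of 1,000 drives")
```

For one drive the difference between the two ratings is a fraction of a percent per year. Across a thousand drives it is a couple of extra replacements a year.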
Drives warranted for five years probably have a designed operating life of at least five years. Absent other supplier information, the length of the warranty is pretty much the only indicator of the operating life.
On the other hand, you might assume the operating life is longer, but there’s no data to support how much longer. Maybe a rough guess can be made from the MTBF. Internally (and the manufacturers will never tell us) they know that if they target a five year warranty, a certain design will fail at a certain percentage before five years. If this percentage is too high, they will use better materials, components, etc., until they have a design that fails only an acceptable percentage of time (and this is never 0%). The offset to this is a very large percentage will operate beyond the five years. The reason for this explanation is in advance of…:
Maybe not a bad idea if you don’t have budgetary constraints. Note, however, that a certain percentage of drives are expected to fail before five years, and so the warranty per se is meaningless if you value the data on the drive. Thinking economically and given that most of the drives will last significantly longer than five years, it might be considered wasteful to dispose of them just as the warranty expires.
The above explanation is really why so many people are proposing RAID solutions for your critical applications. If you are already considering replacing drives in five years, it may be worth it to perform a simple cost-benefit analysis versus RAID setups. Don’t forget the cost of lost data should one of the drives fail in a non-RAID setup.
RAID doesn’t even have to be expensive. Aside from pricey RAID NAS boxes, some motherboards have built-in RAID support, and many operating systems support software RAID. The only delta cost may be the purchase of additional drives for the array.
I owned field reliability calculations for our products for a while (not hard disks.) The misunderstanding of MTTF here is why we used FITS, failures per billion hours of use in the field. You can compute this by estimating when the product was turned on after shipment, knowing the use cycle (in our case 24/7), computing the total power on hours, and adding up the failures. In our products this was information we had.
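The FITS calculation described above boils down to something like the following. This is a simplified sketch with made-up numbers, not our actual tooling: it assumes every unit has the same power-on hours, whereas in practice you would sum hours per unit from ship dates.

```python
def fits(failures, units, power_on_hours_per_unit):
    """Failures per billion device-hours of field use."""
    total_hours = units * power_on_hours_per_unit
    return failures * 1e9 / total_hours

# Hypothetical field data: 10,000 units running 24/7 for a year, 20 failures.
print(f"{fits(failures=20, units=10_000, power_on_hours_per_unit=8766):.0f} FITS")
```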
Clearly no product is going to run for a million hours, and in any case whether it can or not has absolutely nothing to do with MTTF. If a product is going to fail after ten years, but is always going to be replaced after 8, and seldom fails before that, you’ll get a good FITS rate anyway - as well you should. Actual failure curves aren’t step functions, of course.
FITS also doesn’t say much about failure profiles. You can get the same rate if some number fail after a year in the field, and a smaller number fail right away while the rest never fail. You need to look at infant mortality numbers as well as FITS.
While much of what you say is correct, this is wrong, wrong, wrong. Warranties might be an indicator of operating life, though not in the way you say, and they may also be marketing tools. When Detroit realized that the Japanese were kicking their butts in quality, they upped their warranties, and there is no way they had time to improve their reliability.
Anyone offering a five year warranty for a product with a five year operating life is going to go bankrupt real quickly. As I just said, failures are not a step function, and you need to compute the operating life of the product that will give you an acceptable less than five year failure rate. You might run qualification tests for a five year limit, but that is far different than designing for a five year lifetime.
It’s not meaningless. If the numbers are accurate, it means that the field failure rate of the 1 million hour MTTF part is worse than that of the 1.4 million hour one. But as I said, that doesn’t say much about when the failures happen. In my field a million hour MTTF is 1,000 FITS and that is awful, but we’re not stuck with mechanical assemblies.
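The conversion between the two figures is a simple reciprocal, assuming a constant failure rate. A minimal sketch (function name is mine):

```python
def mttf_to_fits(mttf_hours):
    """Failures per billion hours implied by a given MTTF, constant-rate assumption."""
    return 1e9 / mttf_hours

for mttf in (1_000_000, 1_400_000):
    print(f"MTTF {mttf:,} h = {mttf_to_fits(mttf):.0f} FITS")
```

So the 1.4 million hour part comes out at roughly 714 FITS against 1,000 FITS for the 1 million hour part.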
There is another thing to consider. All the discussion here assumes random reliability type failures. A high early failure rate might mean a test escape during manufacturing which the customer sees, a bug in the build and delivery process causing damage, or poor component screening. I’ve shown (and published) information showing that the high early failure rate for semiconductors doesn’t come from the traditional early life reliability fails but from use in new conditions which is like a new test. Which is good since we burn-in our parts and so should be removing weak ones. I don’t know if disk drives are the same, but it is likely.
In this case, even with a high FITS rate if your disk operates after a thousand hours you will be good for a long time.
And the reason this makes sense is that low MTTF (high FITS) means more field failures. Though I was on a panel once at a defense electronics conference with a guy who made radar systems with MTTFs of like 100 hours. And this is without anyone shooting at them.
You’re absolutely right about their potential as marketing tools, and I should have more clearly pointed out that they are typically marketing department driven (although I did imply that engineering is responsible for meeting warranty figures that marketing defines and finance finds acceptable).
On the other hand, the hard drive market is a mature, stable market populated with big players that have mostly solid reputations. If some new “Yingpan Company” suddenly released a new hard drive with a promised 10 year warranty, it would be much more like the auto market of the late 1970’s. And certainly outside the scope of hard drives it’s reasonable to be skeptical, too.
I didn’t properly distinguish between the consumers’ expectations of operating life versus the suppliers’. To “be safe” one might assume a five year operating life based on the warranty, due to lack of information from the supplier, but also suggesting a figure of eight years might be appropriate because, like you say, the hard drives aren’t designed to last five years; they’re designed so that only an acceptable number fail at five years. Failures over time generally approximate a bell curve, and the top of the bell curve is what we might decide to call the operating life (but, we consumers don’t have that information and so can only guess). The leading edge of the bell curve represents the early failures, and keeping this as low as possible without introducing higher manufacturing costs will result in the lowest warranty costs. (This explanation isn’t for Voyager, of course, but for others following along.) And just for fun, the trailing edge of the bell curve is what’s going to cause our grandchildren to say, “Samsung is awesome. My circa 2015 hard drive inherited from grandpa is still going strong!”
Given that we have no other data from the manufacturers, using the warranty length (for hard drives from big name companies) doesn’t seem unreasonable from a conservative standpoint. But I also think it’s unreasonable to replace the hard drives at five years, because the (private, secret, company’s) operating life is unknown, and the drives will likely last much longer for the reasons we’re both saying (and I’m sure Voyager and I are saying the same thing, just in different ways).
First, I agree that the warranty period is a reasonable signal of the actual expected life - but trying to find field failure information is even better. Cars all have the same warranty periods, pretty much, but have very different reliability levels.
When you say bell curve of failures, of course you mean inverse bell curve - the famous bathtub curve, showing a reasonably high level of early life fails, then a small random failure rate during product lifetime, then a rising failure rate as we hit wearout of electronic and mechanical parts. We use acceleration methods to move a part to the right of this curve before shipment, so the early life fails fail in the factory and not for the customer. We stop tracking our parts (since they are out of warranty and obsolete) long before they get to the right hand side of the curve. So our grandchildren are likely to say “a one terabyte hard drive. How quaint. That is 1 / 1000th of my new memory stick. I wonder if it can hold even one 4d high res virtual movie.”
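For anyone who wants to see the bathtub shape in numbers, one common way to model it is as the sum of three failure rates: a decreasing Weibull hazard for early life, a constant random rate, and an increasing Weibull hazard for wearout. The parameters below are invented purely to produce the shape and are not measurements of any real drive.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t (hours, t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    early = weibull_hazard(t, shape=0.5, scale=200_000)   # shape < 1: rate falls over time
    random_rate = 1 / 1_000_000                            # constant, like a 1M hour MTBF
    wearout = weibull_hazard(t, shape=5.0, scale=80_000)   # shape > 1: rate rises over time
    return early + random_rate + wearout

for hours in (100, 20_000, 100_000):
    print(f"{hours:>7,} h: {bathtub_hazard(hours):.2e} failures per hour")
```

With these made-up parameters the rate is high at 100 hours, lowest in mid-life, and climbing again by 100,000 hours. Burn-in amounts to running the part past the steep left-hand wall before it ships.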
I’ve been doing IT long enough to remember when the cost of the drives and being able to get a free replacement were the primary concern for most applications. Now the drives are so cheap in comparison that it seems foolish not to get whatever has the best spec, a proven history, or at least a high expectation. It might seem wasteful to replace a drive based just on its age, but again the cost is low. It is equally wasteful to use a mirrored disk system (RAID), since it increases the number of disks, but drives are cheap and data is not. The concern in 2015 if a drive fails before its time is more about the inconvenience of having to replace it, since everything is time and materials.
We have a nice light fixture in the kitchen, but it’s such a pain to replace the light bulbs in it that when one blows I replace them all, because usually another blows shortly after that. I do this just to save me time and the headache of getting out the ladder and having to juggle the parts. Then LED lights came out, which are much more expensive, but they promised to last much longer. They do: instead of being replaced every 4 months or so, the LED lights have been running without replacement for about 18 months so far. Was I being wasteful replacing bulbs that had not blown yet? Perhaps, but it’s the personal labor cost which I wanted to avoid.