Computer component failure rates?

My little work group runs the computer systems that control many manufacturing plants. As we revamp these systems, we’ll have to decide how much redundancy to build into them.

When these systems go down, so do the plants. But with redundancy comes extra cost, especially when multiplied by the number of installations.

Figuring out if, for example, RAID drives would be economically worthwhile to have wouldn’t be too difficult, if I knew what the mean time between failures on hard drives was. Which I don’t.

Does anyone know of a source that has failure-rate estimates for different computer components? I know this will vary with manufacturer, model, and age, but since we’re dealing with odds and probabilities, I think generic component data will do just fine.

No offense, but if you don’t know the answer to your question somebody else should be designing the system.

At a minimum, if you want fault tolerance, look into RAID and redundant power supplies. Both are capable of being “hot-swapped,” meaning that when one unit fails, operation continues on the other and you simply pop out the bad one and replace it without shutting anything down.

And, definitely, power the whole works through a UPS.

The manufacturer’s web site will list technical specs such as the MTBF.

My somewhat educated SWAG for failures is:

  1. human error
  2. SW crash
  3. disk crash
  4. power supply

As much as you can afford.
What does it cost when the plants go down for an hour? How about a day? Is it more or less than the cost of having redundant systems?
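Just to put a rough shape on that comparison, here’s a minimal break-even sketch in Python. Every number in it (outage hours, cost per down hour, redundancy cost) is a made-up placeholder; only the structure of the calculation is the point.

```python
# Rough break-even sketch; all dollar figures and hours are made-up placeholders.
MTBF_HOURS = 100_000        # advertised hard drive MTBF (per the posts below)
HOURS_PER_YEAR = 8_760
OUTAGE_HOURS = 8            # assumed plant downtime per failure without redundancy
COST_PER_DOWN_HOUR = 5_000  # assumed cost of a stopped plant, per hour
REDUNDANCY_COST = 2_000     # assumed extra cost of mirrored drives per server

failures_per_year = HOURS_PER_YEAR / MTBF_HOURS
expected_downtime_cost = failures_per_year * OUTAGE_HOURS * COST_PER_DOWN_HOUR

print(f"Expected drive failures per year:  {failures_per_year:.3f}")
print(f"Expected downtime cost per year:   ${expected_downtime_cost:,.0f}")
print(f"One-time redundancy cost:          ${REDUNDANCY_COST:,}")
print("Redundancy pays for itself within a year:",
      expected_downtime_cost > REDUNDANCY_COST)
```

Plug in your real figures and the answer usually falls out pretty quickly one way or the other.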

With a mission-critical system you really want what gotpasswords was talking about: a good RAID system and hot-swappable power supplies.
Also, don’t forget to build in a good backup system (tape drives, most likely) and make sure that it’s used, and used correctly. Even with the most redundant systems you should plan for the worst case: complete system destruction (fire, whatever). You want your backup data to be safely off site.

yoyodyne, you sound like your middle school teachers brainwashed you very well. In “real life” it is expected that there are some things certain people will not know, but it is acceptable to expect them to find out. Early education seems to be running on the assumption that you need to know everything off the top of your head; that for some reason it will be impossible to do research. “Well, if you don’t know the answer at this very moment you must be inferior! I’ll go hire a consultant who is obviously far more intelligent than you…”

yoyodyne, please understand I mean no offense by this gripe. You just touched on a subject I have particularly strong feelings about.

MTBFs of hard drives are typically advertised at 100,000 hours. If you have multiple hard drives in a system, you will need to do a binomial expansion to determine the probability of any drive in the RAID array having a failure. Of course, the whole point of RAID is that a single component failure is supposed to be survivable.
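To make the binomial point concrete, here’s a quick sketch (assuming independent drives and a constant failure rate, and taking the advertised 100,000-hour figure at face value):

```python
import math

MTBF_HOURS = 100_000   # advertised spec, not a measured figure
HOURS_PER_YEAR = 8_760

# Probability that one drive fails at some point during a year,
# assuming an exponential (constant failure rate) model.
p_single = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)

# With N independent drives, P(at least one failure) = 1 - P(no failures).
for n_drives in (1, 2, 4, 8):
    p_any = 1 - (1 - p_single) ** n_drives
    print(f"{n_drives} drive(s): P(at least one failure in a year) = {p_any:.1%}")
```

With eight drives you’re already around a coin flip per year for at least one failure, which is exactly why the array has to tolerate a single loss.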

Motherboard MTBFs are sometimes advertised as 50,000 hours. I was just looking at server power supplies which were advertised at 100,000 to 120,000 hours MTBF.

It’s hard to base things off of the experience of one person. And you need to take into account the mode of operation. For example, I see CD-ROM drives with MTBFs of about 50,000 hours - but that’s only power-on time, not burn time. In my experience, CD-ROMs tend to have a burn-time MTBF of about 40 hours, but I think that’s because something is horribly wrong with the PC I use. So that experience isn’t valid.

yoyodyne, was the no offense part of your post just there to excuse your rudeness? I don’t think what you said helped anyone at all.

Well, I think this probably isn’t the best place to go if you’re designing multi-million-dollar mission-critical systems. Certainly it shouldn’t be the only place you go.

Anyway, MTBF always seems to be pumped up a fair bit by the HD manufacturers, and it’s a simulated figure anyway. HD makers don’t put drives through 100,000 hours’ worth of stress testing. I think your best bet would be to talk to a sysadmin for a huge server farm to get a better estimate of how often HDs fail.

But, in reality, RAID 1 with 2 HDs is enough for all but the most mission-critical or most paranoid data. The chances of 2 HDs failing within a 24-hour period seem to be pretty slim (although it did happen to a guy I know about 2 months ago - his desktop HD and laptop HD died simultaneously). But, let’s face it, IDE storage is cheap and SCSI storage is getting there. Chuck in another 1 or 2 drives so you can sleep better at night.
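For what it’s worth, the “pretty slim” bit can be roughed out. With RAID 1 you only lose data if the second drive dies before the first one is replaced and the mirror rebuilt, so a back-of-the-envelope estimate (assuming independent failures and the advertised 100,000-hour MTBF, both generous assumptions) looks something like:

```python
import math

MTBF_HOURS = 100_000
HOURS_PER_YEAR = 8_760
REPLACEMENT_WINDOW_HOURS = 24   # assumed time to notice the failure and rebuild

# Chance that either of the two mirrored drives fails during a year:
p_first = 1 - math.exp(-2 * HOURS_PER_YEAR / MTBF_HOURS)

# Given one drive is dead, chance the survivor also dies before the swap:
p_second = 1 - math.exp(-REPLACEMENT_WINDOW_HOURS / MTBF_HOURS)

p_array_loss = p_first * p_second
print(f"P(losing the whole mirror in a year) ~ {p_array_loss:.5%}")
```

That comes out to a few thousandths of a percent per year, which is why correlated failures (bad batches, shared power, fire) end up mattering more than the drives themselves.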

Oh, don’t be too hard on him. The poor kid hasn’t yet realized that there’s a difference between memorization of figures and the understanding of fundamentals required to use such numbers (once they’ve been looked up) in a design.

Here’s a little clarification. In another system that is also “mission-critical”, we have redundant everything, including a second server that runs in parallel. If one computer dies, the system switches over to the other one–kind of like a RAID computer (although it didn’t work the one time it should have).

Redundant everything is easy to justify in this system even though it was horribly expensive. If there’s a failure, the company grinds to a halt.

However, our process control system is different. Instead of one server, we’ll have 80, one in each plant. In some of the high-production plants, a failure will result in a substantial loss in production. In other plants a failure that requires an IBM service call may result in delayed production–meaning one or two people have to work a couple hours of overtime. In the few plants that typically work an 8 hour day, a failure (that is fixable in a few hours) has a 50/50 chance of not affecting production. Given that plus the internal rate of return and time value of money, not every redundancy will be economically justified.
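If it helps frame the decision, that plant-by-plant call boils down to a small expected-value comparison. All of the numbers below are invented placeholders (the failure rate, costs, and plant categories) just to show the shape of the calculation before discounting or IRR even enters into it:

```python
# Hypothetical per-plant expected annual failure cost vs. the incremental
# cost of redundancy. Every figure here is a made-up placeholder.

SERVER_FAILURES_PER_YEAR = 0.2        # assumed rate for the non-redundant parts
REDUNDANCY_COST_PER_SERVER = 1_500    # assumed extra hardware cost, annualized

plants = {
    # plant type: (cost when a failure hits production, P(failure hits production))
    "high-production":    (50_000, 1.0),
    "delayed-production": (500,    1.0),   # a couple hours of overtime
    "8-hour-day":         (500,    0.5),   # 50/50 chance it matters at all
}

for name, (cost_per_hit, p_hits_production) in plants.items():
    expected_loss = SERVER_FAILURES_PER_YEAR * p_hits_production * cost_per_hit
    justified = expected_loss > REDUNDANCY_COST_PER_SERVER
    print(f"{name:20s} expected loss/year ${expected_loss:8,.0f}   "
          f"redundancy justified: {justified}")
```

With numbers like these, the high-production plants justify the hardware easily and the 8-hour plants don’t, which matches the intuition above.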

Complicating the economic analysis is the fact that we do have IBM service with a guaranteed response time of 4 hours. Furthermore, we have 40 of the servers already, and they have little redundancy (due to our last-generation operating system, which choked on things like RAID). Retrofitting them with, say, redundant power supplies will be more expensive than upgrading new orders from a one-power-supply box to a two-power-supply box.

Since many of these plants are out in the sticks, they can’t run out to Fry’s to get more parts. So, we have to look at each part, not just drives and power supplies. We have to decide what gets redundancy, what has spare parts available, if we have entire spare computers available, and which parts we leave up to IBM to replace, all on a plant-by-plant basis.

But to do this analysis, I need MTBFs. I’m sure server farms have good data on this, but they probably consider it proprietary info. But surely someone keeps track of it in a statistical way and publishes it, maybe even computer magazines. My google searches resulted in nothing good.

Anthracite, the motherboards you saw had a shorter MTBF than HDs? My limited, anecdotal experience has been much different. Haven’t you found the same?

engineer_comp_geek, I’ve never seen this info listed by computer companies, especially component-by-component (although I’ve never looked before). I scoured IBM’s site and couldn’t find this kind of info. Is it typical, or do I just suck at searches?

Anecdotally, you can point at the IBM Deskstar “Do Not Run For More Than 75 Hours” drive failures. It all depends on what you’re trying to do. For a new machine, RAID 1 (mirrored) is a darn good idea on a heavy-use, non-mission-critical device. Considering it’s a manufacturing plant, the power supplies and UPS are good, too… because sometimes REALLY weird stuff happens with power supplies, like someone splicing the wrong cable. UPSes can be put in without replacing the computers, too.

Most hard drive manufacturers list an MTBF. Personally, though… I watch for failure at Week One, Month Eight, and then again at Year Three. After that, they’re more or less good till Year Ten, by which point you really should have replaced them anyway.

IIRC, IBM doesn’t make drives. You have to go to the actual drive manufacturer. For example, I clicked on Seagate’s web site and followed the links for one of their drives (just clicked on a drive at random) and I ended up here:

It has reliability info about halfway down.

The pdf spec (click on the link on the left side of the page) has all the gory technical specs, including the MTBF (600,000 hours for that particular drive).

IBM does in fact brand hard disk drives.
They market them as the IBM DeskStar (desktop models) and they also sell laptop and SCSI drives with different model branding.
This being said, Hitachi now owns a share of IBM’s hard drive business.
IBM drives are most notable for a recent debacle in which large batches of drives failed, all from the 75GXP series and all from one particular manufacturing plant in Eastern Europe.

Not sure whether these would help you specifically, but you could take a look:

http://www.e-reliability.com/

http://www.relexsoftware.com/

http://www.t-cubed.com/

Hitachi hard drive information with MTBF.

I can also say, anecdotally speaking, that we’ve seen a sudden rash of failures with IBM 20GB and 30GB laptop drives in the “X” series laptops. :(

Phage - I haven’t looked very extensively at motherboard MTBFs, so the figures I have are old. I think in the article I saw on Tom’s Hardware, a “failure” was not a “breakage” but a “non error-corrected data failure incident”, which is not nearly as serious. By that definition, the MTBF of Windows XP is about 32 seconds.

Errr…glilly, that is.

My point was not that one should have MTBF rates memorized, but that using them to justify the design and cost of mission-critical systems is a serious mistake, one that someone who is qualified to spec such systems would not make.

The “average” calculated failure rate of tens of thousands of drives has no bearing on the failure rate of those installed in your system.

IBM brands hard drives, but they don’t make hard drives. Sometimes the company that brands a drive may have technical specs; sometimes you have to go to the actual manufacturer. After poking around on IBM’s site, it looks like they fall into the latter category (although they do provide a nice link straight to Hitachi’s site).

Curiously, the only reliability specs they publish are the data reliability specs (i.e. the chance of an unrecoverable read error). I couldn’t find an MTBF for any of the drives I looked at, although I did only look at three. Makes me wonder if their drives have crappy MTBF specs.