Oh, don’t be too hard on him. The poor kid hasn’t yet realized that there’s a difference between memorization of figures and the understanding of fundamentals required to use such numbers (once they’ve been looked up) in a design.
Here’s a little clarification. In another system that is also “mission-critical”, we have redundant everything, including a second server that runs in parallel. If one computer dies, the system fails over to the other one, kind of like RAID for the whole computer (although the failover didn’t work the one time it should have).
Redundant everything was easy to justify for that system, even though it was horribly expensive: if it fails, the whole company grinds to a halt.
However, our process control system is different. Instead of one server, we’ll have 80, one in each plant. In some of the high-production plants, a failure will mean a substantial loss of production. In other plants, a failure that requires an IBM service call may only delay production, meaning one or two people have to work a couple hours of overtime. And in the few plants that typically work an 8-hour day, a failure that’s fixable within a few hours has a 50/50 chance of not affecting production at all. Factor in the internal rate of return and the time value of money, and not every redundancy will be economically justified.
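To give a feel for how I’m framing the decision, here’s a rough sketch of the comparison for one component in one plant: annualize the expected downtime cost from the part’s MTBF, the repair time, and that plant’s cost of lost production, then weigh it against the annualized cost of the redundancy at our internal rate of return. Every number in it is made up purely for illustration.

# Rough per-plant, per-component comparison. All figures are hypothetical.
HOURS_PER_YEAR = 8760  # assume the server runs around the clock

def expected_downtime_cost(mtbf_hours, hours_down_per_failure, cost_per_down_hour,
                           chance_failure_hits_production=1.0):
    """Expected annual cost of failures for one component."""
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return (failures_per_year * hours_down_per_failure * cost_per_down_hour
            * chance_failure_hits_production)

def annualized_capital_cost(upfront_cost, discount_rate, years):
    """Spread a one-time cost over its life (capital recovery factor)."""
    crf = discount_rate / (1 - (1 + discount_rate) ** -years)
    return upfront_cost * crf

# Example: a power supply with a made-up 150,000-hour MTBF, a 4-hour IBM
# response plus a 2-hour swap, in a plant losing $2,000 per hour of downtime.
risk = expected_downtime_cost(150_000, 4 + 2, 2_000)
cost = annualized_capital_cost(800, 0.12, 5)  # $800 second supply, 12% rate, 5 years
print(f"expected downtime cost ${risk:,.0f}/yr vs. redundancy ${cost:,.0f}/yr")

In that made-up case the second power supply pays for itself; drop the downtime cost to a few hundred dollars an hour and it no longer does.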
Complicating the economic analysis is the fact that we already have IBM service with a guaranteed 4-hour response time. Furthermore, 40 of the servers are already in place with little redundancy (our last-generation operating system choked on things like RAID), and retrofitting them with, say, redundant power supplies will cost more than upgrading new orders from a single- to a dual-power-supply box.
Since many of these plants are out in the sticks, they can’t run out to Fry’s for more parts. So we have to look at every part, not just drives and power supplies, and decide what gets redundancy, what gets spare parts on the shelf, whether we keep an entire spare computer on site, and which parts we leave to IBM to replace, all on a plant-by-plant basis.
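In other words, the plant-by-plant decision is the same comparison as above run over a handful of strategies per part, picking the cheapest one the plant can live with. The strategy names, repair times, and costs below are all invented just to show the shape of it.

# Compare sparing strategies for one part in one plant; all numbers invented.
MTBF_HOURS = 150_000
COST_PER_DOWN_HOUR = 2_000
FAILURES_PER_YEAR = 8760 / MTBF_HOURS
CRF = 0.12 / (1 - 1.12 ** -5)      # spread upfront cost over 5 years at 12%

strategies = {
    # name: (upfront cost, effective hours down per failure)
    "redundant in the box":    (800, 0),      # hot spare, no downtime
    "spare part on the shelf": (400, 2),      # we swap it ourselves
    "whole spare server":      (5000, 1),     # move the load, repair at leisure
    "leave it to IBM":         (0, 4 + 2),    # 4-hour response plus the repair
}

for name, (upfront, hours_down) in strategies.items():
    annual = upfront * CRF + FAILURES_PER_YEAR * hours_down * COST_PER_DOWN_HOUR
    print(f"{name:25s} ~${annual:,.0f}/yr")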
But to do this analysis, I need MTBFs. I’m sure the big server farms have good data on this, but they probably consider it proprietary. Surely someone tracks it statistically and publishes it, though, maybe even the computer magazines. My Google searches have turned up nothing useful.
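Part of why even one published figure per component would help: with 80 servers running around the clock, a long-sounding MTBF still turns into regular failures across the fleet. The MTBF below is made up just to show the arithmetic.

# Fleet-wide failure rate from a single (hypothetical) MTBF figure.
mtbf_hours = 300_000          # made-up motherboard MTBF
servers = 80
failures_per_year = servers * 8760 / mtbf_hours
print(f"roughly {failures_per_year:.1f} motherboard failures per year across all plants")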
Anthracite, the motherboards you saw had a shorter MTBF than HDs? My limited, anecdotal experience has been much different. Haven’t you found the same?
engineer_comp_geek, I’ve never seen this info listed by computer companies, especially component by component (although, admittedly, I’ve never looked before). I scoured IBM’s site and couldn’t find anything like it. Is that typical, or do I just suck at searching?