Statistics Question

Where would I look for guidance on the following math problem?
This is not homework. This is for my job. I am enrolled in no classes at any level.
There are 84 machines at a facility.
The machines have a maximum lifespan of 36 months.
The failures of these machines are evenly distributed in such a fashion that each machine has a 1 in 36 chance of failing each month.
How can I make a chart that will show the odds of a given number of failures per month, year, etc?
I’m thinking of one of those charts that shows standard deviations, etc.
This makes me wish I’d taken more math in college.

PS - The machines are actually hard drives in file servers, and I don’t actually think my lifespan or failure rates above are exactly accurate, but I’m trying to develop some models for failure rates so that I know how to chart this stuff properly. I can fiddle with the failure rates later.

A simplistic way would be to view it as a binomial distribution: 84 coins being flipped each month, with heads coming up only 1 time in 36 on each coin. This is not really accurate, because a coin could be heads anywhere between zero and 36 times, whereas a machine is guaranteed to come up heads once and only once. However, it is probably good enough for a first simple run. Given that your basic assumptions about the failure distribution are probably completely wrong, it ought to serve well enough. If your result comes out to an expected value of 1 machine failing each month, you’re in good shape.

http://cnx.rice.edu/content/m11024/latest/
http://faculty.vassar.edu/lowry/binomialX.html

Argh, your expected value should be 84/36, not 1! D’oh!

How many times can I post? The missing piece, to make this a more realistic model, is the distribution of failure rates. After a few minutes of googling, it looks like it’s either Gaussian or Poisson. For Poisson check out:
http://src.alionscience.com/cgi-src/formdraw.pl?Change2=2

Hope this helps.
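To see how close a Poisson model comes to the binomial one for the monthly failure count, here is a minimal Python sketch. It assumes the Poisson rate is simply set to the binomial mean (84/36); the function and constant names are made up for illustration.

```python
from math import comb, exp, factorial

N_MACHINES = 84      # from the original question
P_FAIL = 1 / 36      # monthly failure chance per machine, as posed

def binomial_pmf(k, n, p):
    """Probability of exactly k failures among n independent machines."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return exp(-lam) * lam**k / factorial(k)

# Assumption: match the Poisson rate to the binomial mean.
lam = N_MACHINES * P_FAIL

for k in range(8):
    b = binomial_pmf(k, N_MACHINES, P_FAIL)
    q = poisson_pmf(k, lam)
    print(f"{k:2d}  binomial={b:.4f}  poisson={q:.4f}")
```

With a small per-machine probability and a fairly large number of machines, the two columns should stay close, which is why either model is a reasonable first pass here.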

It is accurate if broken machines are replaced with identical ones immediately. Which seems likely for a hard drive array.

But I’m confused about the 36-month lifespan. A given machine has (35/36)^36=36% chance of surviving 36 months. Perhaps you meant they’ll be replaced at 36 months regardless of whether they are broken, but that’s not really relevant to the question of how many drives fail per month.
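The survival figure above is easy to check. A minimal sketch, assuming each month is an independent 1-in-36 chance of failure (variable names are illustrative):

```python
P_FAIL = 1 / 36   # monthly failure chance, as posed in the question
MONTHS = 36

# A machine survives 36 months only if it avoids failure 36 times in a row.
survival = (1 - P_FAIL) ** MONTHS
print(f"P(survive {MONTHS} months) = {survival:.3f}")
```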

By the way, a uniform distribution is an inaccurate way to model failure rates of machines. Machines tend to have a very high failure rate in the beginning, and those that survive tend not to break down for a long time.

It looks more like a proportion problem to me. The information “each machine has a 1 in 36 chance of failing each month” can be interpreted as “one machine in 36 will fail every month”.

You can set the ratio of failures to total machines in the statistical model - 1 in 36 or 1/36 - equal to the unknown number of failures in a month - x - to the total number of machines in your facility - 84. Therefore:

1/36 = x/84

Cross-multiplying:

36x = 84

Dividing by 36:

x = 2 1/3 machines

So in a given month, 2 or 3 machines will probably break down, or about 7 in a given 3-month period (quarter).

Muttrox,

That link you gave me was working nicely for a minute, then apparently their CGI processor blew up on me.
The irony is busy slaying me.
Thanks for your assistance in this matter.

It can be interpreted that way, but it’s a mistake to do so. Having 84 machines with a 1/36 failure rate means the number of failures per month is binomial (84, 1/36), and that gives the mean number of failures as 7/3, or slightly more than 2. Some months there will be no failures, and some months there could be 5 or 6.
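A quick simulation makes the month-to-month spread visible. This is only a sketch under the assumptions already stated (84 machines, independent 1-in-36 monthly failure chance, failed machines replaced immediately); the number of simulated months and the seed are arbitrary choices.

```python
import random

N_MACHINES = 84
P_FAIL = 1 / 36
N_MONTHS = 10_000          # arbitrary simulation length (assumption)
rng = random.Random(0)     # fixed seed so the run is repeatable

def failures_in_one_month(rng):
    """Count how many of the machines fail in one simulated month."""
    return sum(rng.random() < P_FAIL for _ in range(N_MACHINES))

counts = [failures_in_one_month(rng) for _ in range(N_MONTHS)]

mean = sum(counts) / N_MONTHS
share_zero = counts.count(0) / N_MONTHS
share_five_plus = sum(c >= 5 for c in counts) / N_MONTHS

print(f"average failures per month: {mean:.2f}")
print(f"share of months with no failures: {share_zero:.3f}")
print(f"share of months with 5 or more failures: {share_five_plus:.3f}")
```

The average should land near 7/3, while individual months range from zero failures up to the occasional 5 or 6.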

btw, I get that the mean lifetime of a machine is 35 months, not 36.
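The 35-versus-36 difference comes down to what you count. A sketch, assuming a constant 1-in-36 monthly failure chance (a geometric lifetime); the names are illustrative:

```python
from math import ceil, log

P_FAIL = 1 / 36   # constant monthly failure chance, as posed

# Mean number of months up to and including the month of failure.
mean_including_failure_month = 1 / P_FAIL

# Mean number of full months survived before the failure month.
mean_full_months_survived = (1 - P_FAIL) / P_FAIL

# First month by which at least half of the machines have failed.
median_failure_month = ceil(log(0.5) / log(1 - P_FAIL))

print(f"mean, counting the failure month: {mean_including_failure_month:.1f}")
print(f"mean full months survived:        {mean_full_months_survived:.1f}")
print(f"median failure month:             {median_failure_month}")
```

Note that under this model the median is well below the mean, because a long tail of survivors pulls the average up.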

scr4,

Thanks so much for your input. A few points related to your post follow.

  1. Yes, failed machines (drives) in question will be replaced with identical hardware of the same design and even the same vintage… just with zero hours of post-factory runtime.
  2. An analysis I received from a gentleman who’d studied a similar (but larger) population of drives in a similar application found that our drives were averaging 2-3 years of lifespan. It is true that in similar applications similar drives have made it to 4 or even 5 years, but they’re relatively uncommon.
    I was saying “1 in 36” to simulate a world in which the drives’ average lifespan is 36 months. On reflection, I suspect that in an overly simple model such as mine, saying “1 in 72” would get me an average failure date around 36 months.
    With our current process, drives are not replaced unless they fail. The systems they are in tend to be replaced due to upgrades between 36 and 60 months of age, but as you’ve noted, that is wholly irrelevant in this analysis.
  3. I realize that machine failure rates are not a uniform distribution. I think I’m looking for a bathtub curve. I am considering trying to plug a bathtub curve into my model later, but I have yet to find an industry-specific curve I can model my failures against, and guessing at one sounds even more foolhardy than pretending I have a uniform distribution.

Indeed, as posed, you’re dealing with a binomial distribution with n = 84, p = 1/36. Here are the probabilities of x failures in a month (to 10 decimal places):

x probability
0 0.0938222105
1 0.2251733052
2 0.2669912048
3 0.2085074171
4 0.1206364342
5 0.0551480842
6 0.0207461840
7 0.0066049076
8 0.0018163496
9 0.0004382304
10 0.0000939065
11 0.0000180496
12 0.0000031372
13 0.0000004964
14 0.0000000719
15 0.0000000096
16 0.0000000012
17 0.0000000001

For 18 to 84 failures, the probabilities are < 10^-10. The mean of the distribution is 2.3333333333 and the standard deviation is 1.5061601902. Now go draw your charts.
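For anyone who wants to regenerate the table with different failure rates, here is a minimal Python sketch using only the standard library. It assumes the same binomial model (n = 84, p = 1/36); change the two constants to fiddle with the rates.

```python
from math import comb, sqrt

N_MACHINES = 84
P_FAIL = 1 / 36

def binomial_pmf(k, n=N_MACHINES, p=P_FAIL):
    """Probability of exactly k failures in one month."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(" x  probability")
for k in range(18):
    print(f"{k:2d}  {binomial_pmf(k):.10f}")

# Closed-form mean and standard deviation of a binomial distribution.
mean = N_MACHINES * P_FAIL
std_dev = sqrt(N_MACHINES * P_FAIL * (1 - P_FAIL))
print(f"mean = {mean:.10f}, standard deviation = {std_dev:.10f}")
```

The printed probabilities can be pasted straight into a spreadsheet and charted as a bar graph.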

Hmmmm. I was thinking 11 drive failures in 13 months (the real-life number) was unlikely. Looking at these numbers, it seems like it’s not at all noteworthy.

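If anything, 11 failures in 13 months is good performance. Per month that’s about 0.85 failures, roughly one single-month standard deviation below the mean of 2.33. Over the whole 13 months, though, the model expects about 30 failures with a standard deviation of about 5.4, so a total of 11 is more than three standard deviations low. That is far rarer than 1 chance in 6, and it may mean the 1-in-36 rate is too pessimistic for these drives.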
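One way to check a multi-month total against the model is to pool the months. This sketch assumes failed drives are replaced immediately, so 13 months of 84 drives is treated as 13 × 84 independent drive-months, each with a 1-in-36 chance of failure; the names are illustrative.

```python
from math import comb, sqrt

N_MACHINES = 84
P_FAIL = 1 / 36
MONTHS = 13
OBSERVED = 11   # real-life failure count reported in the thread

# Assumption: every drive-month is an independent trial.
n_trials = N_MACHINES * MONTHS

mean = n_trials * P_FAIL
std_dev = sqrt(n_trials * P_FAIL * (1 - P_FAIL))
z = (OBSERVED - mean) / std_dev

# Probability of seeing OBSERVED or fewer failures under the model.
p_at_most = sum(
    comb(n_trials, k) * P_FAIL**k * (1 - P_FAIL) ** (n_trials - k)
    for k in range(OBSERVED + 1)
)

print(f"expected failures in {MONTHS} months: {mean:.1f} (sd {std_dev:.1f})")
print(f"observed {OBSERVED} is {z:.2f} standard deviations from the mean")
print(f"P(at most {OBSERVED} failures) = {p_at_most:.2e}")
```

A very small probability here points at the assumed failure rate rather than at luck, which fits the original poster's remark that the rates were placeholders.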

The Weibull Model is a good approach for modelling mechanical or other faults.
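For a feel of what the Weibull model does, here is a minimal sketch of its hazard (instantaneous failure rate) and survival functions. The shape and scale values below are hypothetical, picked only to show the three regimes, and are not fitted to any drive data.

```python
from math import exp

def weibull_survival(t, shape, scale):
    """Probability that a unit is still working at age t."""
    return exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Failure rate at age t, given survival up to t."""
    return (shape / scale) * (t / scale) ** (shape - 1)

SCALE_MONTHS = 36.0   # hypothetical characteristic life, in months

# shape < 1: early failures dominate; shape = 1: constant rate;
# shape > 1: wear-out, the rate climbs with age.
for shape in (0.7, 1.0, 2.0):
    rates = ", ".join(
        f"{weibull_hazard(t, shape, SCALE_MONTHS):.4f}" for t in (1, 12, 36)
    )
    s36 = weibull_survival(36, shape, SCALE_MONTHS)
    print(f"shape={shape}: hazard at 1/12/36 months = {rates}; S(36) = {s36:.3f}")
```

With shape = 1 the hazard is a constant 1/scale, which is the continuous-time cousin of the flat 1-in-36 monthly assumption used earlier in the thread.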

If you still have this analysis, that’s a good place to start. If you have access to the raw data, you have real information about the real distribution. You don’t need to assume uniform or whatever; you can just see what really happens. I don’t think you need to get very sophisticated. Simply plotting a graph in Excel will probably show you which distribution is appropriate, and then you can plug it into one of the tools suggested here.

Simply untrue, and it assumes the conclusion. If this were right, there would be no need to do an analysis; it’s safe to assume he can divide 84 by 36. (Even though I forgot to in my second post.)

Unfortunately, the previous analysis was done several years back by my boss, and it was done “on the fly” in response to a small crisis. I believe it wasn’t retained.

Modeling a machine’s lifetime as a series of discrete Bernoulli trials is a grossly stupid idea. The Bernoulli/binomial method requires the assumption of independence, which means disregarding the concepts of frailty and aging in a real-life machine.

Proper use of statistical methods requires more than exploratory data analysis: gratuitous Excel abuse is not useful.

Pony up some money and do the problem correctly. Six Sigma-type people know this stuff, and departments of statistics or industrial engineering have lots of students who need applied projects…

Excel is often sufficient, depending on what the end product will be used for. Proper statistical methods also take money, time, and effort, which may not be available to the OP.

Proper methods and software are required to get informative, valid answers…

Why accept a subpar approach?