No, this is not a homework assignment. It’s been a long time since I took stat. I’m trying to figure out a probability distribution that will fit the following data:
I was always more of a pure mathematician, but I think you need to make an explicit distribution.
p(1) = 1/66
p(2) = 1/66
p(60) = 2/66 = 1/33
etc.
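If it helps, here is a minimal sketch of writing that explicit empirical distribution down mechanically. The values in `data` are placeholders (the real 66 observations aren't posted), so swap them out:

```python
from collections import Counter

# Placeholder observations -- substitute the real 66 data points here.
data = [1, 2, 10, 10, 10, 60, 60]

counts = Counter(data)
n = len(data)

# Empirical probability mass function: p(x) = count(x) / n
for x, c in sorted(counts.items()):
    print(f"p({x}) = {c}/{n} = {c / n:.4f}")
```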
But perhaps what you are looking for is a common distribution that will approximate the data. Make a histogram and look at the shape - if it is bell-shaped, look at a Gaussian curve; if it's a box, it's uniform; etc.
How will you handle the gaps? For example, there are no 3’s.
Some context would help. It’s not an awful fit to an exponential, but there is a sort of spike in the data near 8-10. Is that expected given the context? Relevant? 'Cause an exponential (for example) eliminates that.
With such sparse data, you’ll do best with something motivated by the nature of the problem. “Best” isn’t defined otherwise, except perhaps via what SaintCad describes, namely just turning the numbers directly into a distribution (which I’m guessing isn’t what you’re after.)
ETA: In other words – While it’s a lousy fit to a Gaussian, it is presumably a much better fit to, say, five Gaussians together. Where you draw the line in complexity depends on the problem you’re tackling.
The mean and standard deviation of your data are very nearly equal, so an exponential distribution might work well, if it makes any theoretical sense. What are you measuring?
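For a quick sanity check of that, something like the sketch below compares the mean and standard deviation (which are equal for an exponential) and fits an exponential pinned at zero. The numbers in `data` are stand-ins, not your actual observations:

```python
import numpy as np
from scipy import stats

# Placeholder data -- swap in the real observations.
data = np.array([1, 2, 5, 8, 10, 10, 12, 20, 30, 60], dtype=float)

mean, std = data.mean(), data.std(ddof=1)
print(f"mean = {mean:.2f}, std = {std:.2f}")  # roughly equal suggests exponential

# Fit an exponential; floc=0 pins the lower bound at zero.
loc, scale = stats.expon.fit(data, floc=0)
print(f"fitted rate lambda = {1 / scale:.4f}")
```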
After thinking about things in light of the comments here, I’m pretty confident that the data can be modeled as coming from 2 probability distributions. A larger one that is exponential, and a smaller one that is Gaussian.
I believe I remember enough stat to approximate the parameters of the distributions.
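For what it's worth, one way such a two-component fit could go is a straight maximum-likelihood fit of an exponential-plus-Gaussian mixture with scipy. The synthetic data and starting guesses below are just assumptions to make the sketch runnable, not your actual numbers:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Synthetic stand-in: an exponential "background" plus a smaller Gaussian bump near 10.
data = np.concatenate([rng.exponential(scale=20, size=50),
                       rng.normal(loc=10, scale=2, size=16)])

def neg_log_likelihood(params):
    w, lam, mu, sigma = params  # mixture weight, exponential rate, Gaussian mean and sd
    if not (0 < w < 1 and lam > 0 and sigma > 0):
        return np.inf
    pdf = (w * stats.expon.pdf(data, scale=1 / lam)
           + (1 - w) * stats.norm.pdf(data, mu, sigma))
    return -np.sum(np.log(pdf))

# Rough starting guesses; tune these to the real data.
result = minimize(neg_log_likelihood, x0=[0.7, 1 / 20, 10, 2], method="Nelder-Mead")
w, lam, mu, sigma = result.x
print(f"weight={w:.2f}, lambda={lam:.3f}, mu={mu:.1f}, sigma={sigma:.1f}")
```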
I’m sorry for not revealing where the data comes from or what it signifies, but it’s a little personal.
Okay, I usually charge $50/hour for this, but 'tis the season and all that.
Looking at a histogram of the data with a bin size of 25 (i.e. bar one includes 1-25, bar two 26-50, etc.), the distribution looks exponential. However, if you decrease the bin size to 5, it no longer seems exponential: if it were, you should see a spike in the first bar, and there isn’t one.
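If you want to reproduce that comparison, a rough sketch (again with placeholder data rather than the real observations) is:

```python
import numpy as np

# Placeholder data -- substitute the real 66 observations.
data = np.array([1, 2, 8, 9, 10, 10, 10, 15, 22, 30, 40, 60], dtype=float)

for width in (25, 5):
    # Bins roughly matching 1-25, 26-50, ... for width 25, and 1-5, 6-10, ... for width 5.
    edges = np.arange(0, data.max() + width, width)
    counts, _ = np.histogram(data, bins=edges)
    print(f"bin width {width}:")
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"  {lo:>3.0f}-{hi:>3.0f}: {'#' * int(c)}")
```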
It looks to me that what you have is a unimodal (single-peaked) distribution which is highly skewed to the right. However, if you lose a few of those "10"s then the peak goes away and the distribution from 1 to 40 looks almost uniform. I’ve learned from experience that if your conclusion is based on the position of a handful of data points, you need to rethink things.
My professional opinion: Eh, I dunno. Might be unimodal with a peak around 10. Probably not exponential. In any case it’s highly skewed to the right. Go back and take more observations. If you’re looking to put a name on the distribution, forget about it. Real world examples rarely work out that nicely.
This is a situation where knowing what you are measuring is very important. If these numbers represent counts you analyze them one way. If they are survival time (time-to-event) measurements then you analyze them another. Need more info!
One last thing: The fact that most of the numbers end in zero indicates to me that there’s some selective rounding going on. (This is why all those "10"s bug me.) If all data are not being measured and recorded in exactly the same way, that’s a problem. It actually tells you more about the researchers than whatever is being researched. I once analyzed some morgue record data from a Southern city from the 1930s, and while the dead white folk tended to have rather detailed data, the dead black people usually had ages, heights, and weights that ended in 0 or 5. Try analyzing a distribution that has a spike every five units!
I hope this helps. Please dump a bunch of bucks in the nearest Salvation Army bucket as payment. Thanks!
Regarding the two distribution idea, if you have reason to believe that there are two processes at work creating two different distributions, then that’s a great idea to pursue. However, if you are grasping at straws in order to make your data conform to some ideal, then you’re only fooling yourself.
To be honest I’ve never seen a mixture of an exponential and a Gaussian (doesn’t anybody say “normal” anymore?) outside of a textbook. A distribution is a reflection of some process, and the processes that create these two distributions are generally very dissimilar from each other.
Anyway, one of them can’t really be a Gaussian distribution, because your data have a hard lower limit of zero. The normal distribution is unbounded, and you have too much data close to zero for the part of the curve below zero to be negligible.
It’s not too uncommon a combination. All you need are two competing processes, one giving a peaked “signal” on top of an exponentially falling “background”.
Another thing to check out is a power law, which shows up in many real-world scenarios.
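A quick-and-dirty diagnostic for that is to look at the empirical tail on log-log axes, where a power law comes out roughly straight. Here is a sketch with placeholder data; a serious power-law test would use maximum-likelihood methods rather than this straight-line fit:

```python
import numpy as np

# Placeholder data -- swap in the real observations (all positive).
data = np.sort(np.array([1, 2, 8, 9, 10, 10, 10, 15, 22, 30, 40, 60], dtype=float))
n = len(data)

# Empirical complementary CDF: P(X >= x) for each sorted observation.
ccdf = 1.0 - np.arange(n) / n

# On log-log axes a power law is roughly linear; for a pure power-law tail
# with pdf proportional to x**(-alpha), the slope is about -(alpha - 1).
slope, intercept = np.polyfit(np.log(data), np.log(ccdf), 1)
print(f"log-log CCDF slope ~ {slope:.2f}  (implied alpha ~ {1 - slope:.2f})")
```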