No, this is not a homework assignment. It’s been a long time since I took stat. I’m trying to figure out a probability distribution that will fit the following data:
I was always more of a pure mathematician, but I think you need to make an explicit distribution.
p(1) = 1/66
p(2) = 1/66
p(60) = 2/66 = 1/33
etc.
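If it helps, here is a minimal sketch of writing that explicit empirical distribution down mechanically. The values in `data` are placeholders (the real 66 observations aren't posted), so swap them out:

```python
from collections import Counter

# Placeholder observations -- substitute the real 66 data points here.
data = [1, 2, 10, 10, 10, 60, 60]

counts = Counter(data)
n = len(data)

# Empirical probability mass function: p(x) = count(x) / n
for x, c in sorted(counts.items()):
    print(f"p({x}) = {c}/{n} = {c / n:.4f}")
```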
But perhaps what you are looking for is a common distribution that will approximate the data. Make a histogram and look at the shape - if it is bell-shaped, look at a Gaussian curve; if it's a box, it's uniform; etc.
How will you handle the gaps? For example, there are no 3’s.
Some context would help. It’s not an awful fit to an exponential, but there is a sort of spike in the data near 8-10. Is that expected given the context? Relevant? 'Cause an exponential (for example) eliminates that.
With such sparse data, you’ll do best with something motivated by the nature of the problem. “Best” isn’t defined otherwise, except perhaps via what SaintCad describes, namely just turning the numbers directly into a distribution (which I’m guessing isn’t what you’re after.)
ETA: In other words – While it’s a lousy fit to a Gaussian, it is presumably a much better fit to, say, five Gaussians together. Where you draw the line in complexity depends on the problem you’re tackling.
The mean and standard deviation of your data are very nearly equal, so an exponential distribution might work well, if it makes any theoretical sense. What are you measuring?
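For a quick sanity check of that, something like the sketch below compares the mean and standard deviation (which are equal for an exponential) and fits an exponential pinned at zero. The numbers in `data` are stand-ins, not your actual observations:

```python
import numpy as np
from scipy import stats

# Placeholder data -- swap in the real observations.
data = np.array([1, 2, 5, 8, 10, 10, 12, 20, 30, 60], dtype=float)

mean, std = data.mean(), data.std(ddof=1)
print(f"mean = {mean:.2f}, std = {std:.2f}")  # roughly equal suggests exponential

# Fit an exponential; floc=0 pins the lower bound at zero.
loc, scale = stats.expon.fit(data, floc=0)
print(f"fitted rate lambda = {1 / scale:.4f}")
```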
After thinking about things in light of the comments here, I’m pretty confident that the data can be modeled as coming from 2 probability distributions. A larger one that is exponential, and a smaller one that is Gaussian.
I believe I remember enough stat to approximate the parameters of the distributions.
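For what it's worth, one way such a two-component fit could go is a straight maximum-likelihood fit of an exponential-plus-Gaussian mixture with scipy. The synthetic data and starting guesses below are just assumptions to make the sketch runnable, not your actual numbers:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Synthetic stand-in: an exponential "background" plus a smaller Gaussian bump near 10.
data = np.concatenate([rng.exponential(scale=20, size=50),
                       rng.normal(loc=10, scale=2, size=16)])

def neg_log_likelihood(params):
    w, lam, mu, sigma = params  # mixture weight, exponential rate, Gaussian mean and sd
    if not (0 < w < 1 and lam > 0 and sigma > 0):
        return np.inf
    pdf = (w * stats.expon.pdf(data, scale=1 / lam)
           + (1 - w) * stats.norm.pdf(data, mu, sigma))
    return -np.sum(np.log(pdf))

# Rough starting guesses; tune these to the real data.
result = minimize(neg_log_likelihood, x0=[0.7, 1 / 20, 10, 2], method="Nelder-Mead")
w, lam, mu, sigma = result.x
print(f"weight={w:.2f}, lambda={lam:.3f}, mu={mu:.1f}, sigma={sigma:.1f}")
```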
I’m sorry for not revealing where the data comes from or what it signifies, but it’s a little personal.
Okay, I usually charge $50/hour for this, but 'tis the season and all that.
Looking at a histogram of the data with a bin size of 25 (i.e. bar one includes 1-25, bar two 26-50, etc.), the distribution looks exponential. However, if you decrease the bin size to 5, it no longer seems exponential: if it were, you should see a spike in the first bar, and there isn’t one.
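If you want to reproduce that comparison, a rough sketch (again with placeholder data rather than the real observations) is:

```python
import numpy as np

# Placeholder data -- substitute the real 66 observations.
data = np.array([1, 2, 8, 9, 10, 10, 10, 15, 22, 30, 40, 60], dtype=float)

for width in (25, 5):
    # Bins roughly matching 1-25, 26-50, ... for width 25, and 1-5, 6-10, ... for width 5.
    edges = np.arange(0, data.max() + width, width)
    counts, _ = np.histogram(data, bins=edges)
    print(f"bin width {width}:")
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"  {lo:>3.0f}-{hi:>3.0f}: {'#' * int(c)}")
```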
It looks to me that what you have is a unimodal (single-peaked) distribution which is highly skewed to the right. However, if you lose a few of those "10"s then the peak goes away and the distribution from 1 to 40 looks almost uniform. I’ve learned from experience that if your conclusion is based on the position of a handful of data points, you need to rethink things.
My professional opinion: Eh, I dunno. Might be unimodal with a peak around 10. Probably not exponential. In any case it’s highly skewed to the right. Go back and take more observations. If you’re looking to put a name on the distribution, forget about it. Real world examples rarely work out that nicely.
This is a situation where knowing what you are measuring is very important. If these numbers represent counts you analyze them one way. If they are survival time (time-to-event) measurements then you analyze them another. Need more info!
One last thing: The fact that most of the numbers end in zero indicates to me that there’s some selective rounding going on. (This is why all those "10"s bug me.) If all data are not being measured and recorded in exactly the same way, that’s a problem. It actually tells you more about the researchers than whatever is being researched. I once analyzed some morgue record data from a Southern city from the 1930s, and while the dead white folk tended to have rather detailed data, the dead black people usually had ages, heights, and weights that ended in 0 or 5. Try analyzing a distribution that has a spike every five units!
I hope this helps. Please dump a bunch of bucks in the nearest Salvation Army bucket as payment. Thanks!
Regarding the two distribution idea, if you have reason to believe that there are two processes at work creating two different distributions, then that’s a great idea to pursue. However, if you are grasping at straws in order to make your data conform to some ideal, then you’re only fooling yourself.
To be honest I’ve never seen a mixture of an exponential and a Gaussian (doesn’t anybody say “normal” anymore?) outside of a textbook. A distribution is a reflection of some process, and the processes that create these two distributions are generally very dissimilar from each other.
Anyway, one of them can’t really be a Gaussian distribution, because your data have a hard lower limit of zero. The normal distribution is unbounded, and you have too much data close to zero for the part of the curve below zero to be negligible.
It’s not too uncommon a combination. All you need are two competing processes, one giving a peaked “signal” on top of an exponentially falling “background”.
Another thing to check out is a power law, which shows up in many real-world scenarios.
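A quick-and-dirty diagnostic for that is to look at the empirical tail on log-log axes, where a power law comes out roughly straight. Here is a sketch with placeholder data; a serious power-law test would use maximum-likelihood methods rather than this straight-line fit:

```python
import numpy as np

# Placeholder data -- swap in the real observations (all positive).
data = np.sort(np.array([1, 2, 8, 9, 10, 10, 10, 15, 22, 30, 40, 60], dtype=float))
n = len(data)

# Empirical complementary CDF: P(X >= x) for each sorted observation.
ccdf = 1.0 - np.arange(n) / n

# On log-log axes a power law is roughly linear; for a pure power-law tail
# with pdf proportional to x**(-alpha), the slope is about -(alpha - 1).
slope, intercept = np.polyfit(np.log(data), np.log(ccdf), 1)
print(f"log-log CCDF slope ~ {slope:.2f}  (implied alpha ~ {1 - slope:.2f})")
```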