Statistics--more than meets the eye

My local surplus food store has these

https://allcitycandy.com/products/hersheys-treasure-surprise-mc-kisses-transformers-64z-case-of-5?variant=32605846601794

for $1.00 per case. While $16.25 for 20 Hershey’s Kisses and 5 tiny robot figurines is insane, $1.00 isn’t bad at all, and I’ve bought them twice. The first time I bought 5 cases (so 25 figures). I was curious whether all the figures were produced in the same numbers, so I kept count of them. In the first lot there were:

6 Bumblebee (24%)
3 Grimlock (12%)
5 Megatron (20%)
3 Optimus Prime (12%)
8 Starscream (32%)

(with even production being, of course, 20% each).

A small sample, but it hinted, at least, that there could be more Starscreams than Grimlocks or Optimus Primes, for example. But the next time I went to the store I bought 3 more cases (15 more figures). This time the numbers were completely different:

1 Bumblebee (6.67%)
6 Grimlock (40%)
1 Megatron (6.67%)
3 Optimus Prime (20%)
4 Starscream (26.67%)

With the total of both groups being:

7 Bumblebee (17.5%)
9 Grimlock (22.5%)
6 Megatron (15%)
6 Optimus Prime (15%)
12 Starscream (30%)

What I’m curious about is how many of the figures (as a minimum) I would need to buy before I had a reasonable certainty of the production ratio? Or would you need to know how many had been produced altogether to determine that?

That depends on what the actual ratio is, and what your default assumption is. If your default assumption is that all five toys are equally likely, and if that is in fact the case, then you’ll never know it. Suppose that you bought 100, and got 21 Starscreams and 19 Optimi Prime: That is, of course, consistent with the proportions being equal for each… but it’s also consistent with the true proportions being 21% Starscream and 19% Optimus Prime.
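
To put numbers on “consistent with”: a rough 95% interval for that hypothetical 21-out-of-100 count (my own back-of-the-envelope, using the standard normal approximation) runs from about 13% to 29%, comfortably covering both 20% and 21%:

```python
import math

# Rough 95% confidence interval for a proportion (normal approximation).
n, k = 100, 21                                # 21 Starscreams out of 100
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)       # standard error of the estimate
print(p_hat - 1.96 * se, p_hat + 1.96 * se)   # about 0.13 to 0.29
```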

That depends on how you define “reasonable certainty”: What’s your margin of error (like, do you need to know what percent are Bumblebees just to within ±10%, or 5%, or 1%, or 0.5%, or what)? And what’s your confidence level (like, is it enough to be 95% sure that the actual percentage is within that margin of error? do you have to be 99% sure? 90% sure)?

Once you know those things, there’s a formula you can use that gives you the sample size needed, or you can use one of the many web-based sample-size calculators to do it for you.
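
The formula those calculators implement is the standard sample-size one for a single proportion, n = z^2 * p * (1 - p) / E^2, usually evaluated at the worst case p = 0.5. A minimal sketch, treating each figure as a yes/no question like “is it a Bumblebee?”:

```python
import math

def sample_size(margin, z=1.96, p=0.5):
    """Sample size for estimating a single proportion.

    Normal approximation n = z^2 * p * (1 - p) / margin^2, using the
    worst-case p = 0.5. For a five-figure multinomial problem like this
    one, apply it per category ("Bumblebee or not?").
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size(0.10))  # +/-10% at 95% confidence -> 97 figures
print(sample_size(0.05))  # +/- 5% at 95% confidence -> 385 figures
print(sample_size(0.01))  # +/- 1% at 95% confidence -> 9604 figures
```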

That’s why in statistics we talk about the p-value as measuring evidence against the null hypothesis. You can never prove the null (which, in this case, is that all toys are produced in equal numbers); you can only fail to reject it, or reject it when the p-value gets small enough, typically below 0.05.

The problem the OP is asking about can be mapped onto the problem of proving a die is fair. If you treat each candy you open as a roll of a five-sided die, you can ask whether the die is evenly weighted. Blog posts like this one go over some of the mathematics of testing dice for fairness using the Kolmogorov-Smirnov goodness-of-fit test.
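
(For discrete counts like these, Pearson’s chi-square goodness-of-fit test is the more usual tool than Kolmogorov-Smirnov; here’s a minimal sketch run against the 40 figures counted above, assuming scipy is available.)

```python
from scipy.stats import chisquare

# Counts from the OP's first two purchases (40 figures), in the order
# Bumblebee, Grimlock, Megatron, Optimus Prime, Starscream.
observed = [7, 9, 6, 6, 12]

# With no expected frequencies given, chisquare assumes equal ones
# (8 of each here), which is exactly the "even production" null.
stat, p = chisquare(observed)
print(stat, p)   # chi2 = 3.25, p around 0.52: no evidence of unevenness
```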

Yeah; in the context of the OP’s example, the p-value would mean something like: Suppose the figures were all produced in the same numbers as each other. In that case, what are the chances that I would get results as unevenly distributed as what I saw in my sample?
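
That verbal definition translates almost directly into a simulation. Here’s a sketch that generates many 40-figure lots under even production and checks how often they come out at least as uneven as the sample above:

```python
import numpy as np

rng = np.random.default_rng(0)

observed = np.array([7, 9, 6, 6, 12])   # the OP's first 40 figures
n, k = observed.sum(), len(observed)
expected = n / k                        # 8 of each under even production

def chi2_stat(counts):
    # How uneven a lot is, relative to equal expected counts.
    return ((counts - expected) ** 2 / expected).sum()

# Simulate many 40-figure lots under the "all equal" hypothesis and see
# how often they are at least as uneven as the real sample was.
sims = rng.multinomial(n, [1 / k] * k, size=100_000)
sim_stats = np.apply_along_axis(chi2_stat, 1, sims)
print((sim_stats >= chi2_stat(observed)).mean())   # should land near 0.5
```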

On a related note (which I’m fascinated by): when the actual distribution is heavily biased (say 1% Megatron and 99% Optimus Prime), a small sample will likely give you wildly varying estimates of the rare figure’s rate, at least relative to its true value; in a lot of 15 you might see zero, one, or occasionally two Megatrons, suggesting anywhere from 0% to 13%. If the true population is 50/50, a small sample will probably give you estimates that are fairly consistent relative to the truth.
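
A quick simulation (with toy numbers of my own) makes the contrast concrete; the spread of the estimates, measured relative to the true rate, comes out roughly ten times larger for a 1% figure than for a 50% one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate one figure's rate from lots of 15, under two true rates.
for true_p in (0.01, 0.5):
    estimates = rng.binomial(15, true_p, size=10_000) / 15
    print(true_p, estimates.std() / true_p)   # spread relative to the truth
```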

Interestingly, in one of the cases today 4 out of 5 figures were Grimlock. (And they all recommended sugarless gum.)

IIRC, the formulas behind these web-based apps are all derived under the assumption that you’ll be obtaining a Simple Random Sample (or something close to it). When ordering from an online store, what gets delivered to your doorstep is usually what minimized the distribution costs for the seller (e.g., the cases that were closest to the aisle when the forklift robot went through to satisfy orders). Unless you know that enough randomization happened between manufacturing and stocking the warehouse, it might not be safe to make the SRS assumption.

A similar caveat was made by the researchers behind the Stockton guaranteed income experiment. Although rates of full-time employment improved more for the group randomly assigned to receive $500 a month than for the control group that received nothing, they made no pretense of having demonstrated anything about a population parameter: the difference in “not-FTE”-to-“FTE” transition rates between all unemployed Stockton residents given an income guarantee from Feb 2019 to Feb 2020 and all unemployed Stockton residents not given one over the same period. The necessary random selection wasn’t there at the start! To infer anything about that parameter, they would have had to take random samples of unemployed Stockton residents, rather than random samples of all Stockton residents (some of whom just happened to be less than full-time employed when the experiment began, and had their status recorded as such).

The first response by Chronos hints at treating the problem in a Bayesian manner instead of aiming for a specific confidence level. Formulate some prior idea of what the distribution of figures is, and then take a weighted average between this distribution and what you observe when you receive your lot. The exact weights you attach to “background” and “observation” (to use the terminology of weather forecasters) will encode how uncertain you are about your prior. With repeated orders, you generally reduce the background uncertainty and gain more confidence in your weighted average; but quantifying this confidence level is not as straightforward as the p-value approach to inference (which, again, assumes simple random samples).

Perhaps an actual practitioner of Bayesian methods can weigh in and say how the confidence level might be quantified if we took such an approach to this problem.
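
Not a practitioner’s verdict, but the standard conjugate sketch looks like this: put a Dirichlet prior on the five proportions (the prior pseudo-counts below are my invention); the posterior after observing the counts is just Dirichlet(prior + counts), its mean is exactly the weighted average described above, and a credible interval can be read straight off posterior samples.

```python
import numpy as np

rng = np.random.default_rng(2)

# First 40 figures: Bumblebee, Grimlock, Megatron, Optimus Prime, Starscream.
counts = np.array([7, 9, 6, 6, 12])
alpha_prior = np.full(5, 10.0)   # prior = 10 "imagined" figures of each kind
                                 # (an assumption; larger values mean a
                                 # stronger pull toward the equal split)

alpha_post = alpha_prior + counts

# Posterior mean: a weighted average of the prior mean (20% each) and the
# observed proportions, weighted by pseudo-counts vs. sample size.
print(alpha_post / alpha_post.sum())

# 95% credible interval for Starscream's share, from posterior samples.
samples = rng.dirichlet(alpha_post, size=100_000)
print(np.percentile(samples[:, 4], [2.5, 97.5]))
```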

P-values are good for medical or biological statistics, where the idea is that the randomness of the system really shouldn’t be changing arbitrarily and rapidly.

Here, the system is that someone has to walk along in front of the bins, grab a handful of one type, then of the next type, etc. Now, to avoid emptying one bin while the others still have plenty, he grabs less from the depleted bin and more from the non-depleted bins. Then, when they are all nearly empty, he gets them all refilled.

But if it’s near the end of the shift, he might not even bother to refill empty bins, and so you get more of whatever is remaining.

So it’s up to the timing of the packing… you have to evaluate the required sample size to ensure that the sample represents enough different times of day.

From a Bayesian point of view, you end up with some pretty complicated priors. To illustrate: If you found, after many, many trials, that one toy is three times as common as another, that doesn’t seem too surprising: that might mean the marketing department did a little research, found that one character was more popular, and made more of that toy. If you found that one toy was a hundred times more common than another, that doesn’t seem too surprising either, because they might have wanted one to be super-rare, in the hopes that it would become a collector’s item. And all of them being equally common, of course, also wouldn’t be surprising, because that’s the simplest distribution.

But one of them being, say, 10% more common than another… That does seem surprising. You wouldn’t expect the company to do that: if some decision process led them to those numbers, you’d expect them to say “Eh, all of them being equal is close enough”. In other words, your prior would have a narrow spike right at all of them being equal, with almost no probability mass on them being almost-but-not-exactly equal.
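
One way to encode that shape, a spike at exactly equal plus diffuse mass on big lopsided ratios, is a mixture prior. A toy sketch, where all the mixture weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_prior():
    """One hypothetical production ratio from a spike-plus-slab prior.

    Toy numbers: a 50% chance the factory made all five exactly equal,
    and a 50% chance of a deliberately lopsided split (a low-concentration
    Dirichlet puts most of its mass on very uneven proportions).
    """
    if rng.random() < 0.5:
        return np.full(5, 0.2)               # the spike: exactly equal
    return rng.dirichlet(np.full(5, 0.3))    # the slab: lopsided ratios

draws = np.array([sample_prior() for _ in range(10_000)])

# "Almost equal but not quite" (every share between 18% and 22%) gets
# essentially no mass outside the exact spike, matching the post above.
near_equal = ((draws > 0.18) & (draws < 0.22)).all(axis=1).mean()
print(near_equal)   # comes out near 0.5, nearly all from the exact spike
```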

… and then, when they started manufacturing, they found that 20% of one model came out of the press with no face. And rather than spend $50,000 on modifying the mold, they said, “Eh, close enough”.

So, I keep buying more packs of these each month when I visit this surplus store. (I figure $1.00 for 20 Hershey’s Kisses is a good enough deal anyway, and I can consider the five tiny robots a free bonus.) At this point I’m up to:

1st purchase (25)

6 Bumblebee (24%)
3 Grimlock (12%)
5 Megatron (20%)
3 Optimus Prime (12%)
8 Starscream (32%)

2nd purchase (15)

1 Bumblebee (6.67%)
6 Grimlock (40%)
1 Megatron (6.67%)
3 Optimus Prime (20%)
4 Starscream (26.67%)

3rd purchase (15)

1 Bumblebee (6.67%)
7 Grimlock (46.67%)
3 Megatron (20%)
2 Optimus Prime (13.33%)
2 Starscream (13.33%)

4th purchase (25)

5 Bumblebee (20%)
6 Grimlock (24%)
3 Megatron (12%)
7 Optimus Prime (28%)
4 Starscream (16%)

Total (80)

13 Bumblebee (16.25%)
22 Grimlock (27.5%)
12 Megatron (15%)
15 Optimus Prime (18.75%)
18 Starscream (22.5%)

(I’ll probably keep buying a few packs a month as long as they have them.)
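
For what it’s worth, feeding that 80-figure total into the same chi-square test sketched earlier still gives no reason to reject even production:

```python
from scipy.stats import chisquare

# The 80-figure running total: Bumblebee, Grimlock, Megatron,
# Optimus Prime, Starscream (16 of each expected if production is even).
stat, p = chisquare([13, 22, 12, 15, 18])
print(stat, p)   # chi2 = 4.125, p around 0.39: still consistent with
                 # even production at this sample size
```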