Say I’m paying to get the prizes from a Gashapon machine. Think of it as a gumball machine with toys inside instead of gumballs.
This particular machine I’m playing has its contents hidden and there’s no convenient sign with pictures like most gacha machines have. I don’t know how many different toys there are to be had, but I’m pretty sure there are no “rares”; this vendor doesn’t do that. I’m assuming the gacha is stocked with an equal amount of each toy.
I’d like to collect at least one of every toy from this machine. I can’t peek inside, but I can look at the collection I’ve already purchased and count them up.
It’s obvious to me that if I have 5 of prize A, 3 of prize B, and 1 of prize C, there is a good chance that there’s still an unknown prize D or even more still in there I haven’t seen yet.
How do I calculate the probability that I’ve received at least one of every toy based on my current collection? How do I calculate how many more times I’ll need to pay to have an X% confidence I’ve gotten them all?
Assume the machine’s supply is effectively infinite. I never have the funds to totally empty it during any one visit. I assume it’s being refilled when I’m not there. I don’t speak Japanese, so asking someone from the owner’s company or other players is Right Out. I never observe anyone else opening their prizes; all I have to go on is my own collection.
I’ve only taken basic stats, but I think you can estimate a required sample size if you assume that the machine is always reloaded with equal proportions of toys:
You can guesstimate the population by dividing the volume of the machine by the volume of a ball (to see how many might fit in it). The “real” population is presumed infinite, but you’re also assuming that the machine gets refilled with a representative sample of that population every time.
That said, it’s probably more effective to learn some basic Japanese or make some English-speaking Japanese friends who can ask for you.
Since you assume that the number in each class is constant, the only information obtained by you is the number of classes observed. Call this b; let t be the total number of observations; let n be the unknown total number of classes. In your example
b = 3
t = 9
To solve the problem you will have to impose some a priori distribution on n. For this example I’ll assume n=1, 2, 3, …, 19, 20 are the only possibilities, each with 5% a priori probability. I hope one of the mathematicians will be along to simplify my formulae, but if you’re in a hurry you can just run the following code through a bc -type calculater:
b = 3
t = 9
scale = 6
define csub (x, y, r) {
if (y == 0) return r;
return csub(x-1, y-1, r*x/y);
}
define c (x, y) {
if (y == 0) return 1;
return csub(x-1, y-1, x/y);
}
define fd (b, t) {
auto w, z;
if (b == 1) return 1;
z = b^t;
for (w = b - 1; w; w--)
z -= c(b, w) * fd(w, t);
return z;
}
define fp(n, b, t) {
return c(n, b) * fd(b, t) / n^t;
}
define thep() {
auto n, y;
for (n = 1; n < 21; n++)
y += fp(n, b, t);
print "y is ", y, "
";
for (n = 1; n < 21; n++) {
# print "Prob. of observing ", b, " cls in ", t, " trials ";
# print "if there are ", n, " cls is ", fp(n, b, t), "
";
# print "Therefore ";
print "The prob. of exactly ", n, " classes ";
print "is ", fp(n, b, t) / y, "
";
}
}
thep()
The above code will print
The prob. of exactly 3 classes is .677568
The prob. of exactly 4 classes is .203499
The prob. of exactly 5 classes is .068282
The prob. of exactly 6 classes is .026467
…
There’s a 2/3 chance you’ve already gotten all the toy types.
Well, that, and the total number of objects obtained.
Also, there comes a point where you should question your assumptions. If, for instance, I totaled up my history and found that I had gotten 103 of toy A, 97 of toy B, 101 of toy C, and 2 of toy D, I would seriously doubt that there was really an equal mix of all toy types. But determining when you should start doubting this depends on just why you were assuming an equal mix to begin with, and so cannot be calculated without further information.
If I flip a coin thrice and get heads once, I learn that 1 out of 3 tosses were heads. Yes I also “learn” that I decided to flip thrice. I also “learn” that I’m wearing a red shirt and that today is Sunday.
In my OP, I tried to simplify the situation to just the important basics. I’m playing several of these machines. With some I can divine more info than I gave, (those pictures on the side are usually most helpful) but here I covered the worst case type where I have the least information.
My basic question I’m trying to answer is: When should I stop playing a given machine? Obviously, that answer will change depending on what confidence level I seek. That in turn usually relates to the price I have to pay to play, which varies from machine to machine.
Thanks for that link. That calculator might help me decide a maximum number of times I should play before I start, but it doesn’t look like it will help me better refine a stopping point after I have a sample.
The real population (no scare quotes required) is what’s being stored back at the warehouse. I have no confidence that the machines always get an exact equal proportion every time the delivery person refills them. But I’m playing the long game. For most of these machines, I think they get refilled more often than I can visit and play them.
There are other irrelevant details I’m leaving out that preclude that option. I’m interested in a mathematical answer to my question.
Thank you, this is exactly the kind of info I was looking for. Your code/formula is a LOT more complicated than I thought the answer would turn out to be.
How does this answer change as the distribution of n changes? In the particular example I gave in the OP, that machine is extremely unlikely to have 20 different prizes in it. However, I’ve seen a few machines with dozens of different prizes.
And I’m probably going to want to play that machine more to make sure. Usually I will want at least a .9 confidence level, and .99 for the cheaper machines. Something lower for the really expensive ones.
I’m looking for a simple formula that will tell me how many more times I should play. I expect that answer will change as my collection increases, and perhaps sometimes turn into an abrupt STOP! even if I had calculated a larger number earlier.
I think septimus’s code included that variable.
I’m discounting any rares that might be in the mix. I can’t afford to keep playing to collect all the rares, if they exist. I’m happy just collecting the common types. In your example, my gut intuition would have stopped me playing that machine long before the first type D showed up unless I happened to get very very lucky. And even having a D in hand, I’d still ignore its presence when deciding to stop.
I’m basing the assumption that the proportions are equal from my past history of playing other machines from this vendor. I probably played them way more than I really needed to, and now I’m looking to calculate stopping points.
This is not an incredibly simple problem. You should look into Good-Turing frequency estimation for techniques. I don’t know how to incorporate the idea that there are no rare items; perhaps some Bayesian version is appropriate.
OK, “ignoring rare items” makes this an even more complicated question, because now we have to decide just how rare counts as “rare”, and that decision probably depends in turn on how many different prizes there are (a prize that’s 2% of the population probably counts as a rare prize in a four-prize machine, but it wouldn’t in a 100-prize machine, for instance).
Really, I think it’s irrelevant. Rare items almost by definition won’t effect the mean-time-to-encounter a non-rare item very much. You can generate a model including rare items, and then only calculate whatever you want for the non-rare members.
Another approach is to treat this as a classification problem on Gachapon Machines. That is, let’s say you have a set of machines you’re reasonably confident you know some approximate distribution for. What elements seem to affect this? Company/manufacturer? Cost of playing it? Type of toy?
Train some model on the known Gachapon machines, then classify your new ones based on that. In the easy case, you have a finite number of classes and know the exact distribution based on some discrete classification. In the worst case, you end up with some prior belief where you can now tune hyperparameters about a categorical distribution (e.g. a Dirichlet or something).
BTW, biology scientists use a similar idea for estimating the population of wildlife in a given region, using a technique called “capture / re-capture”.
The idea is, you stake out a certain region, capturing animals, tagging them, and then releasing them. By and by, you will start to notice that some of the animals you capture are already tagged. If you keep this up long enough, you will start to notice that many, maybe even most, of the animals you capture have already been tagged.
The analogy to your toyball machine is this: After you start to see enough repeat animals, you can begin to guess that you’ve captured most of them; but well before that happens, you can use statistical methods with this data to estimate the total population. In the toyball machine, once you start to see enough “repeat” toys and fewer “originals”, you can begin to guess that you’ve got at least one of everything, and you can develop your statistical probability of that.
Also BTW: Did anybody notice that news item a few weeks ago, that there was a toy machine like this (somewhere in the US) with rings that had a swastika and other Nazi symbols on them?
ETA: I worked on a project, some years ago, that did this with humpback whales. The field researchers staked out various bays in Alaska, and took pictures of whale tails. It turns out that all humpback whale tails are unique just like fingerprints. We had hundreds and hundreds of photos. Then they had to sift through and compare them so find the “repeats”; that was their analog for capture / re-capture. They used this to estimate the whale populations.