Balls and Buckets: a probability/simulation brainteaser

Bob has “n” identical buckets, each with “x” compartments. He has “y” balls, that he rolls down a ramp. The ramp randomly distributes each ball among the various buckets. But depending on the angle of the ramp when Bob starts his experiment, a certain percentage “z” of balls bounce off the ramp and don’t end up falling into a bucket. However, after a ball falls into a bucket, it always settles into a random compartment within that same bucket. Also, once a ball settles in a compartment, no other ball can fall into that same compartment. If a box is completely full, the ball will bounce out of the bucket and back onto the ramp for redistribution.

  • buckets: (n can range from 1000 to 10^20)
  • compartments: (x can range from 1 to 100)
  • balls: (y can range from 2n to 20n)
  • % balls lost: (z can range from 2% to 50%)

After all the balls are rolled down the ramp, and the number of balls in each box is counted, is there an easy formula to find out what percentage of boxes is expected to have exactly 0 balls? (what about exactly 1, 2, 3, …20,…, or 40 balls?)

If the above doesn’t lend itself to a simple formula, is it possible to calculate the result using only the following numbers?:

  • n = 10^6 buckets
  • x = 25 compartments per bucket
  • y = 5n balls (Note: I’m interested specifically in the separate cases where y = 2n, up to y = 10n.)
  • z = 50% balls lost

Perhaps the counts over multiple Monte Carlo simulations can be used to determine the average distributions of buckets containing exactly zero, 1, 2, or up to 20 balls?

Note: this puzzle relates to a real-world scenario I’m trying to solve, but I think framing it this simpler way is easier. Interested in any thoughts on how to solve this. I never studied programming, or I would probably write a simulation.

A clarification: if a ball falls into a partially filled bucket, you say it falls into a “random” compartment. Does that mean the ball chooses a compartment randomly from all x compartments and then tries to fall into it? What happens if that compartment is already filled? On the other hand, if the ball chooses randomly from only the unfilled compartments in that bucket, it seems to me that choosing the compartment randomly doesn’t matter. The ball can just fall into the first unfilled compartment and you’ll get the same result.

The ball selects randomly from the empty compartments only, and never tries to fall into one that is already filled. If all are filled, it rolls back onto the ramp for re-distribution. And, yes, I believe you are 100% correct–it probably doesn’t matter that it is random for the sake of this puzzle. (in my real world problem, it is actually a random process where the ball goes, but I don’t think it will affect what I’m trying to determine here.)

Ok, so we can forget the compartments and just say each bucket can hold a maximum of x balls.

Also, I don’t think the value of z matters. Higher values of z just make the process take longer, without affecting the result.

It does matter; for example, if z = 1, then 100% of the balls disappear so for certain every bucket will contain zero balls.

More clarification:

Your question asks about boxes, but your scenario refers only to buckets and compartments. Are the boxes buckets or compartments?

Moving forward under the assumption that you are interested about the probability that a compartment is empty, otherwise the compartments don’t matter since all you are concerned about is whether or not you hit a bucket.

The scenario you proposed has very large numbers which means that can cut some corners and still get fairly accurate results with asymptotic assumptions.

Since all the buckets are the same its simplest just to talk about a single bucket, and ignore other things that can happen to a ball. The probability that a single ball hits a single bucket is p=(1-z)/n, so on average you expect to get y(1-z)/n balls in each bucket.

Officially that number of balls that go into that bucket is binomial, but since there are so many buckets and so many balls, you can probably get away with assuming that it follows a Poisson distribution. With mean equal to the number of balls

So the probability of getting k balls into a given bucket is approximately

(\lambda^k e^{-\lambda})/k! where \lambda=y(1-z)/n is the expected number of balls in a given bucket.

If you have k balls in a bucket with x compartments, that fill up randomly but don’t overlap, then the probability at an given compartment in empty is 1-k/x, for k<x and is 1 for k>=x.

So the final answer will be

\sum_{k=1}^{x}{\left(1-\frac{k}{x}\right)(\lambda^k\ e^{-\lambda})/k!}

There may be a way to simplify this further, but since x is pretty managable (no more than 100) you should be able to automate this sum without too much difficulty.

Note that this formula is easily generalized to unequal number of bins per bucket and uneven probability of hitting a bucket. All you have to do is to change the 1/n in the \lambda equation to the probability of hitting that bucket, and x to the specific number of bins in that bucket.

Also if your question was just about there being an empty bucket, you could skip the second part and just ask what is the probability that k=0, which is approximately e^{-\lambda}

ETA: I noticed a couple of mistakes in my previous write up:

should have been

“with mean equal to the expected number of balls”

should have been

“then the probability at an given compartment in empty is 1-k/x, for k<x and is 0 for k>=x.”

Agreed, thanks for the insight.

I considered the case where z=1, but the OP says z is restricted to 2% to 50%. I don’t think it matters in that range, and in fact I don’t think it matters for any z strictly less than 1. It just affects how many trials you need to do to use up all the balls.

Thanks for that clarification. I think that higher values of z are equivalent to simply starting with a smaller number y of balls to begin with, given my allowed ranges. Therefore, I agree it can be ignored. The process itself should not take any longer with higher z values, because a “lost” ball is not redistributed. (Only balls that bounce out of full buckets are going to be put back onto the ramp.) The z variable will affect the average number of balls per bucket (a number I’m also interested in) due to the reduction in total balls distributed. However, to simplify the problem going forward, I agree that we can ignore z.

I apologize for my typo above, but I was unable to edit my post. Please replace the incorrect word “boxes” with “buckets”. (I guess these words are filed too close together in my brain.)

@Buck_Godot, thanks so much for the detailed reply, and for pointing me to the Poisson distribution. The formulas that you provided for the expected number of balls in a single bucket, and for the probabilities of getting exactly k balls into a given bucket, seem to check out and give reasonable values against my real-world scenario (and you were even able to include the z factor!).

Although I wasn’t particularly interested in that question, your reply definitely has me intrigued me to think about that, especially since your formulas worked very well for all the steps leading up to that.

Can you explain what you mean by “the final answer”, and what this formula is calculating?

Just curious, but are you assuming here that we have exactly k balls in the bucket to simplify the math? (sorry if I misunderstood what you meant–based on my original puzzle some buckets would have more balls than others, so I guess that’s where I’m not following along.)

By the way, I really like how you were able to display the equations…is there a tutorial that you can point me to where I can learn the notation?

Thanks again for the thorough answer!

I started writing a long post in response, but I then realized that I missed a crucial point of your scenario. When a bucket is full, a rolling ball will be redistributed, my model assumed that if a ball hit a filled bucket they just fell off and disappeared.

If the vast majority of your buckets aren’t full at the end so that few balls are recycled then the Poisson method is probably the way to go. But if you are filling a lot of buckets then Your problem is more complicated. The value for Z is very important as it makes things much more complicated since as more baskets get filled and balls bounce to the top later balls have a higher chance of multiple runs and so a higher chance of getting knocked off by z. I’ll have to think further, and the answer if I can find it would probably not make for a good message board post. I’ll PM you if I feel I can get somewhere.

As for your other questions…

This formula (if I fixed my typo so the sum started at k=0 and went up to X-1 instead of starting k=1 and going up to X) was for calculating the probability that a given compartment had no balls in it (based on my incorrect reading that balls landing in filled bins were eliminated instead of recycling.).

The way it was calculated was considering all possible number of balls in a basket from zero to x-1 and working out the probability that a given compartment was empty in that scenario.

probability that the compartment is empty = (Probability that the compartment is empty and the basket had 1 ball) + (Probability that the compartment is empty hits and the basket had 2 balls) + … (Probability that the compartment is empty and the basket had X-1 balls)

if the basket has X balls then there are no empty comparments so I the probability is zero and I don’t need to include that term.

for any k<x
(Probability that the compartment is empty and the basket had k balls) = (Probability that a given compartment is empty assuming the basket has k balls)*(the probability the basket has k balls)

(Probability that a given compartment is empty assuming the basket has k balls) = {\left(1-\frac{k}{x}\right)}

while
(the probability the basket has k balls) =(\lambda^k\ e^{-\lambda})/k!

So multiply these two components and then add them together for every k from 0 to x-1 and get your answer.

As far as the math for displaying formulas, I write a formula using the equation editor in Microsoft word, then copy and paste it here, and finally put $ signs on both sides of it and it seems to work.

With no z I think limiting the capacity of each bucket is not that much worse: instead of nice exponentials you have truncated exponentials, so if we want to know the distribution of let’s say the number of empty buckets after throwing in 5n balls, we could introduce a variable u and figure out/approximate the coefficient of x^{5n} in the exponential generating function

\left(u+x+\frac{x^2}{2}+\cdots+\frac{x^{25}}{25!}\right)^n.

Lost balls may be regarded as falling into a distinguished infinite-capacity bucket…

Ok I had some more thoughts last night and have it reduced to the solution of an intractable differential equation, but one that can probably put some pretty reasonable decent bounds on to get a good bench mark estimate, and I think its simple enough to be posted here.

Part 1:
I want to introduce a new variable L. This is the number of times a ball hits the bottom of the ramp. This is not quite the same as y, it could be smaller (since due to z some balls don’t make it all the way down) and it could be larger than y since a certain number of balls will hit a filled basket and so have multiple runs down the ramp.

If we have L shots at a bucket and there are n buckets to choose from then number of times a bucket gets hit will follow a Poisson distribution with an average of L/n balls hitting the bucket, so defining

P(k,L)=Probability bucket gets hit k times given L shots on goal = ({L/n})^k e^{-L/n}/k!

Now what I am particularly interested in is the probability that a bucket gets full (i.e has x or more hits) lets call it q(L)

q(L)=\sum_{k=x}^{\infty}{{(L/n)}^k\ e^{-L/n})/k!}

or more easily calculated one minus the probability that you had fewer than x hits

q(L)=1-\sum_{k=1}^{x-1}{{(L/n)}^k\ e^{-L/n})/k!}

Part 2:
So all we need to do is calculate L, the number of times a ball reaches the bottom. To make things more complicated I’m going to introduce a new notation L(t) is the number of times a ball has reached the bottom after t of the original y balls have completed their journey. L(y) is going to be the final number of hits on bottom.

To calculate L(t), lets look at what happens to ball t+1 as it starts its journey down the ramp:
With probability z it gets knocked off course and gets zero hits
With probability (1-z) it makes it to the bottom and so counts as one hit on bottom.
Then if it did hit bottom there is also probability of about q(L(t)) that it hits a filled bucket and gets knocked back up to the top.

If we let E(L(t)) be the expected number of hits on bottom from this one ball

E(t) = z*0 + (1-z) + (1-z)*q(L(t))*E(t)

The first term is for getting knocked off the ramp. The second term is for getting the single hit on the bottom while the third term is for reaching the bottom and then getting kicked up to the top at which point by definition it has an average of E(t) shots on goal.

Doing some algebra we get,

E(t)=\frac{(1-z)}{1-(1-z)q(L(t))}

Now we have so many balls rolling down and so many buckets to fill that we are firmly in the area where the law of large number kicks in and we can more or less assume relatively speakdng that the expected number of balls, is the actual number of balls. So that the difference between the number of hits down stream after t original balls vs after t+1 original balls is E(t).

In other words L(t+1) - L(t) = E(t)

now again since we have so many balls rolling down the discreteness of L isn’t much of an issue and we can just assume that this difference equation actually represents a differential equation, or in other words

\frac{dL(t)}{dt}\approx\frac{L(t+1)-L(t)}{1}=E(t)=\frac{(1-z)}{1-(1-z)q(L(t))}

Part 3:

From here we would ideally start with the boundary condition that L(0)=0, solve the differential equation for L(y), and then plug that value in for L in the Poisson in part 1 and get your answer.

But differential equation is non linear and very intractable given that q(L(t)) itself is a complicated sum so we won’t be able to solve it explicitly

However it is a continuous monotonic function of q(L(t)), and q(L(t)) is itself continuous, monotonic and bounded between 0 and 1, so it is probably possible to solve it numerically. Or more simply just make reasonable ball park assumption about how large q(t) is likely to get. to bound your solution. Unfortunately differential equations is not my forte so someone else will have to chime in to carry this further.

Huh, that implies that Microsoft is storing equations internally as LaTeX or something like it. That’s uncharacteristically sensible of them.

If you click the equation tab of Office, you have the option of storing the equation as Unicode, Latex or plain text.