Another probability question (not homework!)

I have been asked to evaluate a proposed change to an existing computer-based exercise that generates a “random” sequence of N values; let's call them red and blue marbles. I would appreciate input from the more knowledgeable members here on the validity of my analysis.

The existing system uses a digital coin-toss routine (adapted from the text “Numerical Recipes” by W. H. Press et al., Cambridge University Press, 1986) and chooses a red marble for true (heads) or a blue marble for false (tails). N such coin tosses produce the sequence.

The proposed change first populates an array of length 2N with alternating values (red, blue, red, blue, etc.) and then does a random shuffle of the array. The first N elements of the resulting shuffled array then provide the random sequence.
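If I've understood the proposal correctly, a minimal Java sketch (Java since that's the application's language, per later in the thread; the class and method names here are my own invention, not from the actual application) might look like this:

```java
import java.util.Random;

public class MarbleShuffle {
    // Build an array of length 2N with alternating values
    // (true = red, false = blue), Fisher-Yates shuffle it,
    // and return the first N entries as the "random" sequence.
    public static boolean[] proposedSequence(int n, Random rng) {
        boolean[] marbles = new boolean[2 * n];
        for (int i = 0; i < 2 * n; i++) {
            marbles[i] = (i % 2 == 0);   // red, blue, red, blue, ...
        }
        // Fisher-Yates (Knuth's Algorithm P): walk backward, swapping
        // each position with a uniformly chosen position at or before it.
        for (int i = marbles.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);  // uniform on 0..i inclusive
            boolean tmp = marbles[i];
            marbles[i] = marbles[j];
            marbles[j] = tmp;
        }
        boolean[] sequence = new boolean[n];
        System.arraycopy(marbles, 0, sequence, 0, n);
        return sequence;
    }
}
```

Collections.shuffle would do the same job in one call, but the explicit loop shows the Fisher-Yates step the thread goes on to discuss.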

My question is this: to analyze the new method, I have assumed it is equivalent to selecting N marbles from a set of N blue and N red marbles (i.e. tossing that many marbles into a bag, shaking it up, and then blindly drawing N of them out). If we can assume the shuffling of the array does an adequate job of randomizing the order, is my assumption valid?

I can work out the resulting probability distributions for the two methods, and as expected they are not the same, but I am not as confident that my approach for the new method is sound.

They’re not the same, because under the new method you’re sampling from a population of 2N without replacement, while the old method is sampling from a population of 2N (or any even number, with equal reds and blues) with replacement.

A simple way to see this: if you use the new method and then show me N−1 of the balls, I would almost always have a better-than-even chance of guessing the last ball. The only time I would not is if the N−1 shown balls contain an equal number of blue and red balls (which requires N−1 to be even).

For example, if N = 2, then you’d randomize 2 red and 2 blue balls, selecting two of them. If you show me one of the selected balls picked at random, then the chance the other ball is the same color is only 1/3, not 1/2 as it would be for coin flips.
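A quick Monte Carlo check of this in Java (a sketch; the class and method names are mine) bears out the 1/3 figure:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SameColorCheck {
    // Estimate P(second drawn ball matches the first) when drawing
    // 2 balls from a bag of 2 red and 2 blue. The exact answer is 1/3:
    // once one ball is drawn, only 1 of the 3 remaining matches it.
    public static double sameColorFraction(int trials, long seed) {
        Random rng = new Random(seed);
        int same = 0;
        for (int t = 0; t < trials; t++) {
            // 0 = red, 1 = blue; shuffle the bag and take the first two
            List<Integer> bag = new ArrayList<>(List.of(0, 0, 1, 1));
            Collections.shuffle(bag, rng);
            if (bag.get(0).equals(bag.get(1))) same++;
        }
        return (double) same / trials;
    }
}
```

Over 100,000 trials the fraction lands within a percent or so of 1/3, not 1/2.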

If you show me one ball, I always have a better chance than 1/2 of guessing the next.

Thanks for the reply, and yes I arrived at this conclusion as well, for the reason you state. My concern however is whether choosing the first N elements from a shuffled array of length 2N is in fact the same as randomly selecting N marbles from a bag of N red and N blue. Maybe I’m over-thinking this…

Yes they are the same provided your shuffle is unbiased.

Fisher-Yates is a very simple algorithm to do an unbiased shuffle.

Thanks for the responses, and yes, the new method will use the Fisher-Yates-based “Algorithm P” as mentioned in your reference. The new application will be in Java, so I guess the question becomes how “good” the built-in random number generator is.
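For what it's worth: java.util.Random is a 48-bit linear congruential generator, fine for casual use but statistically weak for demanding work. Since Collections.shuffle (which itself implements Fisher-Yates) accepts any Random subclass, a stronger source such as SecureRandom can be dropped in directly. A sketch (class and method names are mine):

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffleWithBetterRng {
    // Shuffle the integers 0..n-1 using a cryptographically strong
    // random source instead of the default java.util.Random.
    public static List<Integer> shuffledRange(int n) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < n; i++) values.add(i);
        // SecureRandom extends Random, so it plugs straight in.
        Collections.shuffle(values, new SecureRandom());
        return values;
    }
}
```

SecureRandom is slower than Random, so it's a trade-off; whether it matters depends on how much rides on the statistical quality of the sequence.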

My biggest concern is that your new method looks more complicated than the old one, and people who add complications into random number generation schemes usually do so because they don’t understand the schemes, and not understanding the schemes can lead to all sorts of other problems.

Oh, and the algorithm in that link isn’t actually O(n) unless you consider “generate a random number” to be equally difficult regardless of the size range of the number. If you count random bits, it’s at best O(n log n), and might be worse (shuffling algorithms are subject to the same inherent constraints as sorting algorithms).

I’m wondering why the OP’s proposed change is to an array of length 2N? Why not 22N or 200N or 314159265N? The bigger the multiple of N, the closer the result is (or at least ought to be) to the original algorithm.

The really interesting question (to me at least) is how the biases in the actual shuffling algorithm compare to the biases in the actual coin-tossing algorithm. All modulo the biases in the underlying random number generator on the target system. That’s what’s really going to determine whether the new method is “better” than the old.

Which leads to the question: what’s “broken” about the current method that is proposed to be “fixed” by the new?

When I read something like this I immediately know that the person proposing it is clueless about random sequences.

Why interleave the values??? Start with all red then all blue followed by a random shuffle and you get the same result. There is no need to specify any initial order.

Also, it’s very easy to mess up a random shuffle. Take just randomizing n distinct items. It may seem obvious that going through the sequence with i from 1 to n, picking a random number j from 1 to n, and swapping items i and j produces a random mix. It doesn’t! Counting arguments show that. This procedure has n^n equally likely execution paths, but there are n! possible distinct orderings, and n! doesn’t divide n^n (for n > 2). So some orderings are more likely than others.
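The counting argument can be checked exhaustively for n = 3: the naive swap-with-anywhere shuffle has 3^3 = 27 equally likely branches, and 27 can't distribute evenly over the 3! = 6 permutations. A small Java enumeration (names mine; zero-indexed where the text above uses 1 to n):

```java
public class NaiveShuffleBias {
    // Enumerate all n^n branches of the biased "swap position i with a
    // random position j in 0..n-1" shuffle for n = 3, counting how
    // often each of the 6 permutations of {0,1,2} is produced.
    public static int[] permutationCounts() {
        int[] counts = new int[6];
        for (int a = 0; a < 3; a++)
            for (int b = 0; b < 3; b++)
                for (int c = 0; c < 3; c++) {
                    int[] p = {0, 1, 2};
                    int[] picks = {a, b, c};  // the three random choices
                    for (int i = 0; i < 3; i++) {
                        int tmp = p[i];
                        p[i] = p[picks[i]];
                        p[picks[i]] = tmp;
                    }
                    counts[rank(p)]++;
                }
        return counts;
    }

    // Map a permutation of {0,1,2} to a distinct index 0..5.
    private static int rank(int[] p) {
        return p[0] * 2 + (p[1] > p[2] ? 1 : 0);
    }
}
```

The six counts sum to 27, so they cannot all be equal: some permutations come up more often than others, exactly as the counting argument predicts.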

This is where the method in the OP fails. The number of possible orders isn’t a multiple of the number of orderings of red and blue marbles.

If you can’t figure out the Math of this, then don’t pretend you’re actually doing anything truly random.

I don’t understand this objection. In a shuffling algorithm, the random numbers you are generating are array indexes. If you make the entirely reasonable assumption that the array you’re shuffling has fewer than 1.8 x 10[sup]19[/sup] elements (FAIRLY reasonable, since that exceeds the memory capacity of any existing computer by a factor of several million), then 64-bit indexes suffice and each random number you generate will be exactly 64 bits. In real life, of course, 32-bit indexes are usually sufficient. The size of the items being shuffled is irrelevant to the algorithmic complexity.

A question disguised as a quibble if I may …

The OP is dealing with what’s ultimately a sequence of bits. Every 1-bit is equivalent to every other 1-bit. Ditto the 0-bits are individually not distinguishable.

You’re certainly right that the number of possible orders of N individually distinguishable objects is indeed vastly larger than the number of orderings of N red and blue marbles. But that’s not the question the OP asked.

I’d argue that the number of distinguishable orderings of drawing N red and blue marbles (assuming N of each are available to draw from) is exactly 2^N, which is exactly the number of distinguishable bit-patterns in an N-bit string.

Which, if I read him rightly, *is* the question he asked.
To be sure, in his list-of-length-2N algorithm, the likelihood of the various orderings is very far from even. As a simple example, drawing the last red ball to extend an all-red sequence of length N−1 into an all-red sequence of length N is very unlikely, since by then the blue/red mix to draw from is very lopsided in favor of blue.

But that’s a different objection to the list of 2N algorithm.

You’re right that about [log(n)-1] bits are necessary (and sufficient with adequate care), on average, for each shuffling step, but this is a small cost. If n = 4 billion (how often are arrays this large shuffled? :eek: ) then [log(n)-1] is just 31 bits.

If you’re going to quote computational complexity as a function of n and accommodate hugely large n, then many simple operations treated as O(1), e.g. multiplication or even addition, have costs higher than O(1).

[off-topic] I recall a paper on computational geometry in which the complexity was carefully quoted with a factor of A[sup]-1[/sup] … where A[sup]-1[/sup] is the inverse Ackermann function! :eek:

This was my thought, too. The old method is perfectly random (to the extent of your random number generator). It’s mathematically impossible to do any better by changing it and in fact, the proposal is less random.*

In any way I can imagine the new proposal is also less efficient.
Unless there’s some unstated goal here (maybe you don’t actually want perfect randomness), I can’t see an advantage to the new proposal. And again, it worries me that whoever proposed this might screw something else up because of their lack of understanding of randomness.

Now, if the goal is to be less random – maybe you want a more even mix of Yes and No than true randomness gets you – this proposal might work. But even then, I think it’s better to have a procedure that’s much more explicit and clear about what you’re trying to do. After all, someone looking at the proposal might think it is perfectly random.

  • With the proposal, for any run-through, every time you get a Yes it’s a little more likely you’ll get a No the next time. If you don’t believe this, try it with N = 2. There are six equally likely possibilities, so you’ll get Y,Y one time in six, Y,N one time in three, N,Y one time in three, and N,N one time in six. Truly random would be one in four for each possibility.

septimus, would you say that the sorting problem can be solved in O(n)? Why or why not? After all, you’re rarely sorting lists large enough for that log(n) to contribute much, either.

Chronos, would you say that sorting requires O(n log(n)[sup]2[/sup])? Why or why not? After all, comparison and swap are each operations that require O(log(n)) time.

… But to answer your question: if you think O(n) is even remotely a fair estimate of sorting time, then it’s certain you’ve only sorted sets with, comparatively speaking, minuscule n.

Of course you already knew this. The essential point is that some O(log(n)) activity (e.g. that associated with sorting) really will mean you call the key functions log(n) times as often. Other operations, e.g. simple arithmetic, will not cause performance to suffer significantly until log(n) > 31 on typical machines. In our example this means shuffling an array sized bigger than 4 billion but less than 16 quintillion may take twice as long as otherwise predicted! :smack: But I do not think this will be your major concern when you try to shuffle arrays sized in trillions or quadrillions!

Regarding “log n”: there are bit operations, and then there are constant factors.

(Let n here mean the number of items in question. Array size or whatever.)

I taught the rule of thumb: standard operations on numbers of size log n bits take constant time. So adding 1 to an index into an array is constant time.

2^64 is really, really big. No one is ever going to have anywhere close to 2^64 items.* Even if you only have 32-bit hardware, doing 64-bit arithmetic is still a constant. It’s only when you get to bit vectors whose size is unrelated to log n that you have to work that math out.

As to the difference between something like O(n) and O(n log n). Yeah, it makes a difference in a lot of situations. If you don’t care about it when sorting, then you’re not sorting anything of decent size.

As to PRNG efficiency, a now classic example is the Mersenne Twister. An O(n) startup initialization. O(1) per call usually. But another O(n) fill every nth call. An amortized cost of O(1).

  • That’s size. For time, I consider 2^80 as basically infinity.

Big-O notation refers to the number of elementary operations required for execution of an algorithm. Chronos seems to be quibbling about what an elementary operation is. For example, analysis of almost all algorithms assumes that addition and multiplication are elementary operations, requiring O(1). However, as septimus noted, in the most general case when using numbers of unlimited size, addition is O(n) and multiplication is O(n[sup]2[/sup]). If the algorithm isn’t going to deal with values larger than 2[sup]64[/sup], then we don’t need to worry about this and we can assume that these arithmetic operations are O(1) and just count the number of such operations required. Similarly, in the analysis of Fisher-Yates, we assume that it requires O(1) to generate a random array index. We don’t need to worry about the case where the array being shuffled has more elements than the number of elementary particles in the universe.

This is what I truly appreciate about the Straight Dope, thanks everyone for your insights.

To answer the one question, why the change to the method? I coded the original application almost 12 years ago and it has survived as a desktop application with only minor upgrades all this time. Bowing to the inevitable, it is being updated to be web-based and I am supervising the process, but no longer coding anything. The fill-an-array-and-shuffle method was proposed by the new development team, I get to approve it or insist that they use the original method. I just wanted to make sure that I was on firm ground when I reject their proposal.

As an aside, while I was working out the probability distribution for drawing N marbles from the 2N set of N red and N blue, I first churned through the algebra for an array of arbitrary multiple length mN, with the one restriction that mN be even, then substituted m = 2 into the result, and used Excel to chart the two different probability distributions for the cases N = 12 and N = 25.

What delighted me was that I found (and I don’t know if this is already a well known result) a neat identity,

Sum(k = 0 to n) [nCk]^2 = (2n)! / (n!)^2 — that is, the sum of the squares of n choose k, for k from 0 to n, equals 2n factorial (not 2 times n factorial) divided by n factorial squared.
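For what it's worth, this is indeed a well-known result: the right-hand side, (2n)!/(n!)^2, is the central binomial coefficient C(2n, n), and the identity is a special case of the Vandermonde convolution. A quick Java check using exact BigInteger arithmetic (class and method names are mine):

```java
import java.math.BigInteger;

public class BinomialIdentity {
    // Verify that sum_{k=0}^{n} C(n,k)^2 == C(2n, n) == (2n)!/(n!)^2.
    public static boolean identityHolds(int n) {
        BigInteger sum = BigInteger.ZERO;
        for (int k = 0; k <= n; k++) {
            BigInteger c = choose(n, k);
            sum = sum.add(c.multiply(c));
        }
        return sum.equals(choose(2 * n, n));
    }

    // Compute C(n, k) via the product formula; each intermediate
    // value is itself a binomial coefficient, so division is exact.
    static BigInteger choose(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }
}
```

It holds for the N = 12 and N = 25 cases charted above, and for every other n one cares to try.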