I think the short answer is that random sampling/random assignment is a necessary but not sufficient condition for comparability. That is, we are not in fact allowed to assume that two samples are comparable just because the members were randomly drawn and assigned; more information is needed, specifically, the sample size and the population variance on the characteristic you’re comparing them on. If the population is water molecules, samples of size N = 1 will probably suffice for most purposes, because water molecules don’t vary a lot on many characteristics (but if you’re interested in comparing them on some characteristic on which they do vary, a larger sample will be needed). If the population is humans, a larger sample is probably needed, because humans do vary on a lot of characteristics. However, even if your population is all objects in the world, you may or may not need a large sample, depending on what characteristic you’re comparing them on. If the characteristic is “ability to travel backwards in time”, then a sample of size 1 is (as far as we know) sufficient, because the population has zero variance on that characteristic – no matter what two objects you pick, no matter how different they are on other characteristics, they will have the same value on that characteristic (no).
So what principle/proof/theory would suggest that two samples of N = 30 (or even N = 900) humans are comparable on a psychological dimension? If I wanted to run a psychological experiment, how -- on the theoretical level I am describing in this post -- would I know what the sample size needs to be? If humans are composites of thousands of (latent) characteristics and we don’t know how each of these characteristics affects the variable being tested, it seems like there is no way to establish what an acceptable sample size is (if there is one at all).
I’m probably just going to re-hash everything that everyone else said, but let me see if I can break it down into easier terms with a concrete example.
Let us say we are interested in the effect of prenatal classical music on IQ. So we start by recruiting 200 pregnant mothers. Now, we know that there are lots of other factors that go into IQ, and which probably have a larger effect on IQ than prenatal music: education, socio-economic class, genetics, and so on. At this point there is only one group; if we let the babies in this group grow up without any experimentation, they would have a wide variety of IQ values, due in part to all of the underlying factors and in part to luck. But for a given baby, you could also see its value on each of the underlying factors as being determined by luck. So that, overall, a baby’s IQ is just the luck of the draw.
Now we take our 200 mothers and divide them into two groups of 100 babies each. Each baby in each of these groups will have an underlying IQ that is random, based on the luck of the draw as defined above. We then subject the first group to womb Mozart 24/7, while we leave the other group alone. Ten years later we measure the IQ of each of the children in the two groups, and record the average IQ of each group. At this time we also measure the variability of the IQs within each group. As you say, we don’t know that the two groups are necessarily equal. In fact I can guarantee that they aren’t. One group will be slightly richer than the other, one group will have slightly better education, etc. If our musical torture of the infants has no effect on IQ, then the only difference we observe between the two groups will be due to these random-draw differences. We don’t know what those random draws are, but we do have a clue. If our 200 families include 100 low-income and 100 high-income households, it is much more likely that these will divide between the two groups into something like 53/47 rich to poor in group 1 and 47/53 in group 2, than 95/5 rich to poor in group 1 and 5/95 in group 2.
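To put a number on that intuition, here is a minimal sketch that computes how likely various splits are under purely random assignment (the 200 families, the 100/100 income split, and the two example counts are just the made-up figures from this example):

```python
from math import comb

TOTAL = 200    # families in the study
RICH = 100     # of which 100 are high-income
GROUP1 = 100   # size of the first (Mozart) group

def prob_rich_in_group1(k):
    """Probability that exactly k of the high-income families land in
    group 1 when the 200 families are split into two groups at random."""
    return comb(RICH, k) * comb(TOTAL - RICH, GROUP1 - k) / comb(TOTAL, GROUP1)

# A near-even split happens routinely; an extreme one is vanishingly rare.
print(f"P(53 rich in group 1) = {prob_rich_in_group1(53):.4f}")  # roughly 0.08
print(f"P(95 rich in group 1) = {prob_rich_in_group1(95):.2e}")  # effectively zero
```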
Here is where the statistics comes in. Statisticians tell us that if we know the variability of each member of a group and know the size of the group, and if the group is large enough, we can estimate the variability of the group averages. This is true regardless of the source of the variability. So if the difference we observe between the two means is so large that the statistics tell us random draws would only give us a difference this big 1 time in 100, we assume that there must be some effect beyond randomness that gave us that result, and since the only non-random difference between the two groups is the Mozart inquisition, that must be the cause of the difference. We may be wrong: it could be that we were unlucky and all of the smart kids went into one group while all of the dumb kids went into the other. But this is very unlikely (one chance in 100), so we feel confident in our claim.
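That "how often would random draws alone give us a difference this big?" logic can be made concrete with a permutation test, which literally re-shuffles the children into two groups over and over. This is only an illustrative sketch; the IQ numbers and group sizes are invented (and simulated here with no true Mozart effect at all):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are the measured IQs of the two groups at age ten.
group1 = rng.normal(100, 15, size=100)   # Mozart group
group2 = rng.normal(100, 15, size=100)   # control group
observed_diff = abs(group1.mean() - group2.mean())

# How often does a purely random re-split of the 200 children into two
# groups of 100 produce a difference at least as large as the one observed?
pooled = np.concatenate([group1, group2])
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = abs(pooled[:100].mean() - pooled[100:].mean())
    if diff >= observed_diff:
        count += 1

print(f"observed difference: {observed_diff:.2f}")
print(f"fraction of random splits at least that extreme: {count / n_shuffles:.3f}")
# If that fraction is tiny (say below 1/100), chance alone becomes an
# implausible explanation and we credit the treatment instead.
```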
Now, if you really do know beforehand what factors are important in IQ, you can take that into account and look at the differences in that factor between the groups when you do the analysis. This has the effect of reducing the unaccounted-for random variability of the two groups, and so makes it easier to detect a small IQ effect of death by Mozart. But that is more complicated than can be covered in this post.
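For the curious, here is roughly what "taking a known factor into account" looks like in practice: a regression that includes the covariate alongside the treatment. This is a minimal sketch with invented data and variable names, using statsmodels as one library that will fit it:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# Invented data: income drives IQ strongly, Mozart exposure not at all.
df = pd.DataFrame({
    "mozart": np.repeat([0, 1], n // 2),          # 0 = control, 1 = womb Mozart
    "income": rng.normal(50, 15, size=n),          # household income, in thousands
})
df["iq"] = 100 + 0.3 * (df["income"] - 50) + rng.normal(0, 10, size=n)

# Including income as a covariate soaks up that source of "random draw"
# variability, leaving a cleaner estimate of the mozart coefficient.
adjusted = smf.ols("iq ~ mozart + income", data=df).fit()
print(adjusted.params)
```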
More directly to your point, the theorem you are looking for is the central limit theorem (probably the most fundamental theorem in all of statistics). What it says is that if you have a population with mean M and variance V, and you randomly select N members from it and take their average, you will end up with a random value, because you made a random selection. If this were repeated, you would find that the distribution of averages has mean M and variance V/N; further, if N is large enough (say, around 30) and you plotted the averages on a histogram, they would follow a normal (bell-shaped) distribution.
As a corollary of this, if I take two samples drawn at random from the same population (each of size N) and take the difference of their averages, that difference will be approximately normally distributed with mean 0 and variance 2*V/N. Note that as N gets very large the variance gets very small, indicating that the difference between the averages will tend to be very close to 0.
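Both claims in the last two paragraphs (variance V/N for one sample average, 2*V/N for the difference of two) are easy to check by simulation. A minimal sketch, using an arbitrary skewed population just to show that the starting distribution does not need to be normal:

```python
import numpy as np

rng = np.random.default_rng(42)

# An arbitrary, decidedly non-normal population: exponential with mean 2.
M, V = 2.0, 4.0          # mean and variance of Exponential(scale=2)
N = 30                   # sample size
reps = 100_000           # number of repeated samples

means_a = rng.exponential(scale=2.0, size=(reps, N)).mean(axis=1)
means_b = rng.exponential(scale=2.0, size=(reps, N)).mean(axis=1)

print("mean of the sample averages:   ", means_a.mean())              # close to M
print("variance of the sample averages:", means_a.var())              # close to V/N
print("variance of the difference:     ", (means_a - means_b).var())  # close to 2*V/N
print("theory says:", V / N, "and", 2 * V / N)
```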
Once you know the distribution in your population, there are all sorts of ways to determine how big a sample you need to see a given difference. This is the realm of statistical power calculations. With those, you can calculate that (say) you will detect a 20% difference in averages between two groups of 100 subjects over 95% of the time. (I’ll, uh, leave the determination of the underlying distribution as an exercise for the reader…) If your population has greater variability, you’ll need a larger sample size to see the same average difference.
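As an illustration of what such a power calculation looks like, here is a minimal sketch; the effect size and the 80% power target are placeholder numbers, and TTestIndPower from statsmodels is one standard implementation for a two-sample t-test:

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Effect size is in standard-deviation units (Cohen's d), which is where the
# population's variability enters: a noisier population shrinks d for the
# same raw difference, pushing the required sample size up.
n_per_group = power_analysis.solve_power(effect_size=0.4, alpha=0.05,
                                         power=0.80, alternative="two-sided")
print(f"subjects needed per group: {n_per_group:.0f}")

# Conversely: with 100 subjects per group, how often would we detect d = 0.4?
achieved = power_analysis.power(effect_size=0.4, nobs1=100, alpha=0.05)
print(f"power with 100 per group: {achieved:.2f}")
```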
There’s a reasonably intuitive explanation of the central limit theorem here.