# Using random "data" to test selective algorithms

I'm teaching myself programming as a hobby, and sometimes I write programs that require a random number generator. For most hobby work, it doesn't matter if the RNG has undesirable qualities. In my reading, I've found that there are all kinds of RNGs meant to target different kinds of pseudo-randomness.

The problem I face is that I have a selective algorithm, that is, one that attempts to assign positive and negative weights to real data based on some selection criteria. The concern with such an operation is that noise and a poor algorithm can combine to create a pattern or suggest a phenomenon when none is actually present. If I suspected my algorithm had this flaw, I would test it with a fake dataset of random numbers and see whether it still "finds" information that I know shouldn't be there.

Is any particular RNG good for this, or does it not really matter?

If you want truly random numbers, you can download a table of random numbers, like these from the RAND Corporation. You can set up your program to access these. Analyses and descriptions of how the table was generated are found here. Random number tables are pretty much the Gold Standard for researchers who need truly random numbers for simulations and such.
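As a sketch of what "set up your program to access these" could look like, here is a Python snippet that turns the digits of a downloaded table into usable numbers. The in-code sample stands in for the downloaded file, and the five-digit grouping is my assumption (though the RAND table is in fact published as rows of five-digit groups):

```python
def digits_to_uniforms(text, width=5):
    """Group the table's digits into width-digit blocks and map each
    block to a float in [0, 1)."""
    digits = "".join(ch for ch in text if ch.isdigit())
    blocks = [digits[i:i + width]
              for i in range(0, len(digits) - width + 1, width)]
    return [int(b) / 10 ** width for b in blocks]

# Stand-in for a couple of lines read from a downloaded table file:
sample = """
10097 32533 76520 13586
34673 54876 80959 09117
"""
print(digits_to_uniforms(sample)[:4])  # -> [0.10097, 0.32533, 0.7652, 0.13586]
```

Since the table is fixed, you also get reproducibility for free: rerunning the test reads the same numbers.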

Bear in mind that the nature of randomness is such that patterns do show up from time to time. So just because your algorithm is finding stuff doesn't mean it's flawed; that stuff might really be there.

There are other issues associated with random data testing. Can you give us a high-level overview of what you’re doing?

We have a set of data that represents the combined effects of several uncontrolled variables, only one of which is of interest to us. The algorithm compares acquired data to a standard (which represents only what we're interested in) and weights the data to distill out the variable of interest. It decreases the weight of data that fails to act like the standard, and increases the weight of data that does. Some are suggesting to me, "Hey, it works." My response is, "I would never suggest otherwise. Of course if you only count the data you like, you'll find the pattern you're looking for." And so the discussion goes.
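To make the discussion concrete, here is a toy Python version of that kind of up/down weighting. The closeness threshold and the factor-of-two adjustments are invented for illustration; they are not the actual algorithm:

```python
def reweight(data, standard, threshold=1.0):
    """Toy weighting scheme: boost points that track the standard,
    penalize points that don't. Threshold and factors are made up."""
    weights = []
    for x, s in zip(data, standard):
        if abs(x - s) < threshold:
            weights.append(2.0)   # acts like the standard: weight up
        else:
            weights.append(0.5)   # fails to track it: weight down
    return weights

data     = [1.1, 5.0, 2.2, 9.9]
standard = [1.0, 1.0, 2.0, 2.0]
print(reweight(data, standard))  # -> [2.0, 0.5, 2.0, 0.5]
```

The objection in the thread is exactly that a scheme like this, applied to pure noise, will still hand back something that resembles the standard.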

What is sticking in my craw is that there are ways to weight data to account for effects like noise and other uninteresting phenomena. My position is that, unless we test this algorithm with data we are sure doesn't indicate anything, we won't know whether it is begging the question or operating as intended. My idea was to test several standards against several random datasets (each standard against all of the random datasets), but then it occurred to me that there might be a preferred kind of pseudorandomness for tests like this.
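That test plan (every standard run against every random dataset) might be sketched like this in Python. The `score` function is only a placeholder for the real weighting algorithm, and the Gaussian noise is my assumption:

```python
import random

def run_null_tests(standards, n_datasets=5, n_points=100, seed=0):
    """For each standard, score several datasets of pure noise. If the
    algorithm is honest, scores should hover near chance level."""
    rng = random.Random(seed)

    def score(data, standard):
        # Placeholder metric: fraction of points within 1.0 of the standard.
        return sum(abs(x - s) < 1.0 for x, s in zip(data, standard)) / len(data)

    results = {}
    for name, standard in standards.items():
        runs = []
        for _ in range(n_datasets):
            noise = [rng.gauss(0, 1) for _ in range(n_points)]
            runs.append(score(noise, standard))
        results[name] = runs
    return results

standards = {"flat": [0.0] * 100, "ramp": [i / 50 for i in range(100)]}
for name, runs in run_null_tests(standards).items():
    print(name, [round(r, 2) for r in runs])
```

The fixed seed keeps the "random" datasets reproducible, so a suspicious run can be re-examined.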

Hopefully that was a little clearer.

As ultrafilter says, random data might not be what you want. You might be better off carefully constructing a data set to have exactly the properties you wanted and then testing it to make sure you get the expected result.
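One way to construct such a set in Python is to plant a signal of known strength in noise and check that the algorithm recovers roughly that much and no more. The sine shape, strength, and seed below are arbitrary choices of mine:

```python
import math
import random

def make_test_dataset(n=1000, signal_strength=0.5, seed=42):
    """Known signal plus Gaussian noise. With signal_strength=0 this is
    a pure-noise null dataset: any 'discovery' in it is a false positive."""
    rng = random.Random(seed)
    return [signal_strength * math.sin(i / 10.0) + rng.gauss(0, 1)
            for i in range(n)]

planted = make_test_dataset(signal_strength=0.5)   # should be detected
null    = make_test_dataset(signal_strength=0.0)   # should not be
```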

If you do want random data, there are lots of freely available implementations of pseudo-random generators available. Almost any crypto library will have good pseudo-random algorithms for key generation.
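In Python, for instance, `random.SystemRandom` draws from the same OS entropy source that crypto libraries use, so it sidesteps the weaknesses of the default seeded generator (at the cost of reproducibility):

```python
import random

sysrand = random.SystemRandom()          # OS entropy (e.g. /dev/urandom)
dataset = [sysrand.random() for _ in range(1000)]
print(len(dataset), min(dataset), max(dataset))
```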

How can you be sure your random data won’t look like it indicates something?

A correctness argument is much more appropriate here.

What about the distribution? Uniform, Gaussian? What exactly is meant by “random”?
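The choice matters: "random" says nothing about shape. A quick Python contrast between a uniform stream and a Gaussian one:

```python
import random

rng = random.Random(1)
uniform = [rng.random() for _ in range(10000)]     # flat on [0, 1)
gauss   = [rng.gauss(0, 1) for _ in range(10000)]  # bell curve around 0

# Same "randomness", very different shapes:
print(min(uniform), max(uniform))   # stays inside [0, 1)
print(min(gauss), max(gauss))       # negative values and long tails
```

An algorithm validated against the wrong null distribution can still be fooled by the right one.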

True, and I think that instead of using random data, we could probably just use different standards on real data and achieve what I'm looking for.

Unless I have been catastrophically unclear, that is my question.

“random” means “not repeatable” in this context.

I would never recommend this. The human brain is lousy at picking out “typical” or “anti-typical” cases, and if you try to, you’re almost guaranteed to put patterns in the data of a sort which you don’t want. Use random input, and just make sure that you understand the statistics of what you’re seeing, and that you have enough input for the stats to be useful.

It’s pretty trivial to generate data sets you can be sure of. They don’t have to look random. For example, if you’re writing code to compute an average, one of the pre-set tests could be a data set containing all the same number. This will frequently identify where a coder has made an error in a summation or roundoff situation. If you’re testing a statistical program, you can generate a data set using a formula for a perfect distribution to make sure your algorithm handles that case. Creating data sets to test very specific functions is a critical part of any mathematical programming. I’m not saying it replaces testing with random data or other sets, but it can easily expose fundamental flaws you won’t find otherwise.
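The all-same-number test from the post above, sketched in Python against a stand-in `mean` function:

```python
def mean(xs):
    """Stand-in for the implementation under test."""
    return sum(xs) / len(xs)

# A constant data set has a known exact answer; accumulation or
# round-off bugs show up as a result that drifts away from it.
assert mean([7.0] * 1000) == 7.0
# A tiny hand-checkable case:
assert mean([1, 2, 3, 4, 5]) == 3.0
print("degenerate-case tests passed")
```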

Back to the OP,

Java’s pseudo-random generator will repeat after 2^48 values, which is a lot. For considerably smaller sequences (I doubt you will need more than 2^24 or so) the output is going to be essentially random. That is, depending on the seed, you could end up with just about any sequence.
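For reference, `java.util.Random` is a 48-bit linear congruential generator, which is where the 2^48 period comes from. A minimal Python re-creation of its state update (constants as documented for `java.util.Random`; yielding unsigned values is a simplification of mine, since Java interprets the 32-bit output as signed):

```python
MULT, INC, MASK = 0x5DEECE66D, 0xB, (1 << 48) - 1

def java_lcg(seed):
    """Yield the 32-bit outputs of java.util.Random's next(32)."""
    state = (seed ^ MULT) & MASK          # Java's setSeed scrambling
    while True:
        state = (state * MULT + INC) & MASK
        yield state >> 16                 # top 32 bits of the 48-bit state

gen = java_lcg(42)
print([next(gen) for _ in range(3)])
```

The same seed always reproduces the same sequence, which is handy for rerunning a test that turned up something odd.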

Unless your application uses its own linear congruential formula, and it just happens to be the same one Java uses, it is basically impossible for the algorithm to find a legitimate pattern in the random data. It’s a good test; don’t worry about it.