# Using random "data" to test selective algorithms

I'm teaching myself programming as a hobby, and sometimes I write programs that require a random number generator. For most hobby work, it doesn't matter if the RNG has undesirable qualities. In my reading, I've found that there are all kinds of RNGs meant to target different kinds of pseudo-randomness.

The problem I face is that I have a selective algorithm, that is, one that attempts to assign positive and negative weights to real data based on some selection criteria. The concern with such an operation is that noise and a poor algorithm can combine to create a pattern or suggest a phenomenon when none is actually present. If I suspected my algorithm had this flaw, I would test it with a fake dataset of random numbers and see whether it still "finds" information that I know shouldn't be there.

Is any particular RNG good for this, or does it not really matter?

If you want truly random numbers, you can download a table of random numbers, like these from the RAND Corporation. You can set up your program to access these. Analyses and descriptions of how the table was generated are found here. Random number tables are pretty much the Gold Standard for researchers who need truly random numbers for simulations and such.
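As a sketch of what "set up your program to access these" could look like, here is a Python snippet that turns the digits of a downloaded table into usable numbers. The in-code sample stands in for the downloaded file, and the five-digit grouping is my assumption (though the RAND table is in fact published as rows of five-digit groups):

```python
def digits_to_uniforms(text, width=5):
    """Group the table's digits into width-digit blocks and map each
    block to a float in [0, 1)."""
    digits = "".join(ch for ch in text if ch.isdigit())
    blocks = [digits[i:i + width]
              for i in range(0, len(digits) - width + 1, width)]
    return [int(b) / 10 ** width for b in blocks]

# Stand-in for a couple of lines read from a downloaded table file:
sample = """
10097 32533 76520 13586
34673 54876 80959 09117
"""
print(digits_to_uniforms(sample)[:4])  # -> [0.10097, 0.32533, 0.7652, 0.13586]
```

Since the table is fixed, you also get reproducibility for free: rerunning the test reads the same numbers.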

Bear in mind that the nature of randomness is such that patterns do show up from time to time. So just because your algorithm is finding stuff doesn't mean it's flawed; that stuff might really be there.

There are other issues associated with random data testing. Can you give us a high-level overview of what you’re doing?

We have a set of data that represents the combined effects of several uncontrolled variables, only one of which is of interest to us. The algorithm compares acquired data to a standard (which represents only what we're interested in) and weights the data to distill out the variable of interest. It decreases the weight of data that fails to act like the standard, and increases the weight of data that does. Some are suggesting to me, "Hey, it works." My response is, "I would never suggest otherwise. Of course if you only count the data you like, you'll find the pattern you're looking for." And so the discussion goes.
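To make the discussion concrete, here is a toy Python version of that kind of up/down weighting. The closeness threshold and the factor-of-two adjustments are invented for illustration; they are not the actual algorithm:

```python
def reweight(data, standard, threshold=1.0):
    """Toy weighting scheme: boost points that track the standard,
    penalize points that don't. Threshold and factors are made up."""
    weights = []
    for x, s in zip(data, standard):
        if abs(x - s) < threshold:
            weights.append(2.0)   # acts like the standard: weight up
        else:
            weights.append(0.5)   # fails to track it: weight down
    return weights

data     = [1.1, 5.0, 2.2, 9.9]
standard = [1.0, 1.0, 2.0, 2.0]
print(reweight(data, standard))  # -> [2.0, 0.5, 2.0, 0.5]
```

The objection in the thread is exactly that a scheme like this, applied to pure noise, will still hand back something that resembles the standard.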

What is sticking in my craw is that there are ways to weight data to account for effects like noise and other uninteresting phenomena. My position is that, unless we test this algorithm with data we are sure doesn't indicate anything, we won't know whether it is begging the question or operating as intended. My idea was to test several standards against several random datasets (each standard against all of the random datasets), but then it occurred to me that there might be a preferred kind of pseudorandomness for tests like this.
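That test plan (every standard run against every random dataset) might be sketched like this in Python. The `score` function is only a placeholder for the real weighting algorithm, and the Gaussian noise is my assumption:

```python
import random

def run_null_tests(standards, n_datasets=5, n_points=100, seed=0):
    """For each standard, score several datasets of pure noise. If the
    algorithm is honest, scores should hover near chance level."""
    rng = random.Random(seed)

    def score(data, standard):
        # Placeholder metric: fraction of points within 1.0 of the standard.
        return sum(abs(x - s) < 1.0 for x, s in zip(data, standard)) / len(data)

    results = {}
    for name, standard in standards.items():
        runs = []
        for _ in range(n_datasets):
            noise = [rng.gauss(0, 1) for _ in range(n_points)]
            runs.append(score(noise, standard))
        results[name] = runs
    return results

standards = {"flat": [0.0] * 100, "ramp": [i / 50 for i in range(100)]}
for name, runs in run_null_tests(standards).items():
    print(name, [round(r, 2) for r in runs])
```

The fixed seed keeps the "random" datasets reproducible, so a suspicious run can be re-examined.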

Hopefully that was a little clearer.

As ultrafilter says, random data might not be what you want. You might be better off carefully constructing a data set to have exactly the properties you wanted and then testing it to make sure you get the expected result.
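One way to construct such a set in Python is to plant a signal of known strength in noise and check that the algorithm recovers roughly that much and no more. The sine shape, strength, and seed below are arbitrary choices of mine:

```python
import math
import random

def make_test_dataset(n=1000, signal_strength=0.5, seed=42):
    """Known signal plus Gaussian noise. With signal_strength=0 this is
    a pure-noise null dataset: any 'discovery' in it is a false positive."""
    rng = random.Random(seed)
    return [signal_strength * math.sin(i / 10.0) + rng.gauss(0, 1)
            for i in range(n)]

planted = make_test_dataset(signal_strength=0.5)   # should be detected
null    = make_test_dataset(signal_strength=0.0)   # should not be
```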

If you do want random data, there are lots of freely available implementations of pseudo-random generators available. Almost any crypto library will have good pseudo-random algorithms for key generation.
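In Python, for instance, `random.SystemRandom` draws from the same OS entropy source that crypto libraries use, so it sidesteps the weaknesses of the default seeded generator (at the cost of reproducibility):

```python
import random

sysrand = random.SystemRandom()          # OS entropy (e.g. /dev/urandom)
dataset = [sysrand.random() for _ in range(1000)]
print(len(dataset), min(dataset), max(dataset))
```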

How can you be sure your random data won’t look like it indicates something?

A correctness argument is much more appropriate here.

What about the distribution? Uniform, Gaussian? What exactly is meant by “random”?
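The choice matters: "random" says nothing about shape. A quick Python contrast between a uniform stream and a Gaussian one:

```python
import random

rng = random.Random(1)
uniform = [rng.random() for _ in range(10000)]     # flat on [0, 1)
gauss   = [rng.gauss(0, 1) for _ in range(10000)]  # bell curve around 0

# Same "randomness", very different shapes:
print(min(uniform), max(uniform))   # stays inside [0, 1)
print(min(gauss), max(gauss))       # negative values and long tails
```

An algorithm validated against the wrong null distribution can still be fooled by the right one.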

True, and I think that instead of using random data, we could probably just use different standards on real data and achieve what I'm looking for.

Unless I have been catastrophically unclear, that is my question.

“random” means “not repeatable” in this context.

I would never recommend this. The human brain is lousy at picking out “typical” or “anti-typical” cases, and if you try to, you’re almost guaranteed to put patterns in the data of a sort which you don’t want. Use random input, and just make sure that you understand the statistics of what you’re seeing, and that you have enough input for the stats to be useful.

It’s pretty trivial to generate data sets you can be sure of. They don’t have to look random. For example, if you’re writing code to compute an average, one of the pre-set tests could be a data set containing all the same number. This will frequently identify where a coder has made an error in a summation or roundoff situation. If you’re testing a statistical program, you can generate a data set using a formula for a perfect distribution to make sure your algorithm handles that case. Creating data sets to test very specific functions is a critical part of any mathematical programming. I’m not saying it replaces testing with random data or other sets, but it can easily expose fundamental flaws you won’t find otherwise.
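The all-same-number test from the post above, sketched in Python against a stand-in `mean` function:

```python
def mean(xs):
    """Stand-in for the implementation under test."""
    return sum(xs) / len(xs)

# A constant data set has a known exact answer; accumulation or
# round-off bugs show up as a result that drifts away from it.
assert mean([7.0] * 1000) == 7.0
# A tiny hand-checkable case:
assert mean([1, 2, 3, 4, 5]) == 3.0
print("degenerate-case tests passed")
```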

Back to the OP,

Java’s pseudo-random generator will repeat after 2^48 values, which is a lot. For considerably smaller sequences (I doubt you will need more than 2^24 or so) the output is going to be essentially random. That is, depending on the seed, you could end up with just about any sequence.
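For reference, `java.util.Random` is a 48-bit linear congruential generator, which is where the 2^48 period comes from. A minimal Python re-creation of its state update (constants as documented for `java.util.Random`; yielding unsigned values is a simplification of mine, since Java interprets the 32-bit output as signed):

```python
MULT, INC, MASK = 0x5DEECE66D, 0xB, (1 << 48) - 1

def java_lcg(seed):
    """Yield the 32-bit outputs of java.util.Random's next(32)."""
    state = (seed ^ MULT) & MASK          # Java's setSeed scrambling
    while True:
        state = (state * MULT + INC) & MASK
        yield state >> 16                 # top 32 bits of the 48-bit state

gen = java_lcg(42)
print([next(gen) for _ in range(3)])
```

The same seed always reproduces the same sequence, which is handy for rerunning a test that turned up something odd.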

Unless your application uses its own linear congruential formula, and it just happens to be the same one Java uses, it is basically impossible for the algorithm to find a legitimate pattern in the random data. It’s a good test; don’t worry about it.