# Correlating sequences of (pseudo-)random numbers

First off, it’s not a homework question; I’m trying to develop some software and have run up against the limits of my already limited stats knowhow.

I’m trying to generate a sequence of random numbers whose PDF is normal with some arbitrary mean and variation. I can get this far. However, I’d also like for each number to bear some relation to the one before, such that if one number is significantly above the mean, the next number is more likely to be so as well. I seem to have trouble describing this so have drawn an outstandingly amateurish picture to try and illustrate what I mean:

Art, possibly.

Does this make any sense at all? The vertical red line is the last number I generated, and the red distribution is what I’d like to use to generate the next number; I just have no idea what the maths behind it might be. As an example, if the previous number were exactly the mean of the overall distribution, the next number would be unaffected. I just want a slower return to the mean value after outliers, without affecting the overall distribution after many numbers have been generated. So the question really is; what on earth function did I just draw? Does it exist outside my brain?

Any help would be greatly appreciated…

Isn’t that autocorrelation?

You aren’t looking for independently distributed random numbers, right? Googling for “autocorrelation” should return what you are looking for, to the extent that you can use the info returned. If I’m recalling correctly, it’s autocorrelation because it correlates with itself.

You might be able to achieve this with an error term that is random, so that y[sub]1[/sub] equals some random number, and then y[sub]2[/sub] = y[sub]1[/sub] + x, where x is a randomly generated number. That won’t give you a skewed pdf for the second number, but it will off-set its distribution in the appropriate direction. (Note that your second pdf isn’t skewed, IIRC, but just has a different mean.) So, let’s call the random term “e,” and it is just a random number drawn with each step in the process and from whatever distribution you wish, then,

y[sub]1[/sub]=e
y[sub]2[/sub]=y[sub]1[/sub]+e
y[sub]3[/sub]=y[sub]2[/sub]+e

and so on. This will take the red normal distribution in your figure and put its mean right at the value for your first number.

If you want you can make a running average, for example, that will slow down the movement some, so that, say, y[sub]4[/sub]=[(y[sub]1[/sub]+y[sub]2[/sub]+y[sub]3[/sub])/3]+e. That would add the random term to the average of the three previous terms and smooth out the movement.

Hope that helps.

Oh, I forgot, I don’t think that you can have what you want and have a normal distribution because the random numbers aren’t independent.

.

Somewhere you have the median point (the center of your bell curve) defined and use it to generate your data. Call this P. The data point you’ve just generated we’ll call D. How far the curve should move we’ll call Z.

P1=((D-P)/Z)+P

Use P1 to generate the next D, and recalculate P1.

You’ll probably want Z to be pretty high in comparison to P, else the whole thing could run away on you.

As js_africanus says, though, you won’t have a true Gaussian distrubution anymore if you do this.

::Frantically jumps into back seat of taxicab::

Thanks for the help guys, I think you’ve put me on the right track. Since I’m not after a mathematically rigorous gaussian, I’ve adapted js_africanus’s suggestion and am taking series averages of numbers generated from the original distribution, e.g. if the last 5 points were x1, x2, x3, x4 and x5 and a freshly generated number is “normal(mean,var)” then the new value is taken as

x6 = 0.5*normal(mean,var) + (x1+x2+x3+x4+x5)/10

which seems to give me a nicely autocorrelating series that still looks like a bell curve over very large numbers of samples, and doesn’t run away on me (I think). I’ll probably weight the thing differently though, to give more significance to recent values.

I think adjusting the mean, even if only slightly, is going to cause drift on the values, and working out a more mathematically rigorous description of the skew doesn’t sound tempting. This is one of those times I’m glad I’m not a real mathematician :).

Funnily enough, I’m simulating lap times for F1 cars, so I don’t think the taxi’s going to cut it…

If you generate enough numbers you will have chunks that have correlations and so on. As mentioned before the moment you start making scores dependent on one another the over all distribuition won’t be normal. Since it seems that you want an overall normal distribuition with what amounts to streaks I recomend you cheat. Take the random numbers and then have the program find number sets of what ever definition you want and order the data that way. The order doesn’t effect the distribuition and is what gives the appearance of slow return after outliers. I’m not sure if that will produce enough of what you want.

Alternatively, you could give each car a normal distribution with a different mean and variance.

If I’m doing my algebra correctly, you can adapt Mort Furd’s formula quite easily to something that might be easier. First, for the current step call your random number p[sub]x[/sub] and for the previous step p[sub]x-1[/sub], and call the median m. Using the z in the same manner and re-arranging terms gives:

p[sub]x[/sub]=[p[sub]x-1/subm]/z

What you can do now, since (z-1)m/z is a constant, is to just re-name it k, let’s say, and I think that gives you:

p[sub]x[/sub]=p[sub]x-1[/sub]k.

Once you’ve drawn your p[sub]1[/sub], that isn’t random at all, is it? I may be wrong, but that’s going to either give you something that grows or shrinks monotonically depending on whether k is greater than or less than zero, respectively.

Hmmm. I’ll assume I’ve made a mistake that someone will be kind enough to correct.

I’ve only had time to skim, but it looks like D is nonconstant through iterations of Mort Furd’s formula. That could be what’s screwing you up.

Perzackly.

D is the last point that your Gaussian random number generator kicked out. The formula I gave makes the center point of your Gauss curve follow (very slowly) the data points that you are generating. If you get a run of numbers that are above the center of the curve, then the center of the curve will move upwards. Likewise for number below the center of the curve.

If your source of chaos is good, then the center of the curve will stay put over long data sequences even though it will tend to wander (in the short term) under the influence of the numbers you’ve already generated.

Z must be large compared to the maximum you expect for |P-D| otherwise the system will walk off and keep producing numbers further and further from P.

Is D what I called p[sub]x-1[/sub] (to make it easier for me to read)?

Oh, shoot. Pronoun trouble? I saw “call this P” and thought you were referring to the median point, but by “this” you meant the data output of the random number generator, right? Is that correct?

Cool.

You need a couple equations to specify this fully. If I’m reading you correctly, you have p[sub]n+1[/sub] = (d[sub]n[/sub] - p[sub]n[/sub])/z + p[sub]n[/sub], and d[sub]n+1[/sub] = p[sub]n[/sub]. Am I reading you right?

You had it right the first time. P is the median point of the bell curve.

D is the number generated by the last iteration through the random number generator.

Lessee.

Where F is the function used to generate the random numbers based on the median of the bell curve P.

D[sub]n[/sub]=F((D[sub]n-1[/sub]-P)/Z)

Phooey. I only meant to preview, not post.

I don’t know what Dead Badger is using for F. All I know is that he has some random number generator that produces number with a gaussian distribution around some median point that he can define.

F((D[sub]n-1[/sub]-P)/Z) should be read as F of ((D[sub]n-1[/sub]-P)/Z) not F * ((D[sub]n-1[/sub]-P)/Z)

Damn, damn, damn. I really meant to preview. That should be D[sub]n[/sub]=F((D[sub]n-1[/sub]-P)/Z + P)

That goes for the following post, too.

I get it. But just for the heck of it, and cause I’ve already done the work…

Since this is a second-degree recurrence relation, you’ll need two initial values, p[sub]0[/sub] and p[sub]1[/sub]. Given those, p[sub]n[/sub] = (p[sub]0[/sub] + zp[sub]1[/sub] + z(p[sub]0[/sub] - p[sub]1[/sub])(-1/z)[sup]n[/sup])/(z + 1). It might be interesting to code that up and see what happens.

Not quite.
p[sub]n+1[/sub] = (d[sub]n[/sub] - p)/z + p, and d[sub]n+1[/sub] = F(p[sub]n[/sub])

It sounds like you want to generate a correlated sequence of normal random variables: that is, a sequence X[sub]1[/sub],X[sub]2[/sub],… where each X[sub]n[/sub] is N(m,S[sup]2[/sup]) (i.e., normal, with the same mean and variance) and E{X[sub]n-1[/sub]X[sub]n[/sub]} != 0. This is not hard to do. (Contrary to what some have stated, the individual X[sub]n[/sub] in this sequence can have normal distributions; it’s just not a white noise process.)

Let’s start with zero-mean random variables for now (it’s easy to get nonzero-mean RVs from these). Suppose we recursively define (for some real a with |a|<1)
X[sub]n[/sub] = N(0,s[sup]2[/sup]) + a X[sub]n-1[/sub]
where N(0,s[sup]2[/sup]) is a zero-mean Gaussian random variable with variance s[sup]2[/sup] (these are the independent samples generated by your Gaussian RNG). It is easy to verify that
[ol]
[li]if X[sub]n-1[/sub] has mean zero, then so does X[sub]n[/sub]; thus if we initialize with X[sub]0[/sub]=0 (or with any other zero-mean RV), then all X[sub]n[/sub] have zero mean.[/li][li]if X[sub]n-1[/sub] has variance s[sub]n-1[/sub][sup]2[/sup], then X[sub]n[/sub] has variance s[sub]n[/sub][sup]2[/sup] = s[sup]2[/sup] + a[sup]2[/sup] s[sub]n-1[/sub][sup]2[/sup]. Inductively this gives us a geometric sum (oh, how I long for TeX formatting!)[/li]s[sub]n[/sub][sup]2[/sup] = (sum)[sub]k=1[/sub][sup]n[/sup] a[sup]2k[/sup] s[sup]2[/sup]
for the variance; this converges rapidly (as n grows, for |a| not too close to 1) to S[sup]2[/sup] = s[sup]2[/sup] / (1 - a[sup]2[/sup]).
[li]the covariance E{X[sub]n[/sub]X[sub]n-1[/sub]} = a s[sub]n-1[/sub][sup]2[/sup] is nonzero, so consecutive elements (and in fact all elements) of the sequence are correlated.[/li][/ol]

So by generating a sequence in the above way (draw an independent zero-mean Gaussian with variance s[sup]2[/sup] and add to it a times the previous value) you converge to a sequence with mean zero and variance s[sup]2[/sup] / (1 - a[sup]2[/sup]). If you initialize this with a normal RV X[sub]0[/sub] then, since the sum of normal RVs is a normal RV, all of the resulting random variables in the sequence are normal. If you initialize with X[sub]0[/sub] = N(0,S[sup]2[/sup]) then the process is stationary: all of the X[sub]n[/sub] will have identical marginal distributions N(0,S[sup]2[/sup]) and covariance E{X[sub]n[/sub]X[sub]n-k[/sub]) = a[sup]|k|[/sup] S[sup]2[/sup].

To make random variables with nonzero mean m, simply define Y[sub]n[/sub] = m + X[sub]n[/sub].

This is probably off base, so flame away… but I’ve found that Excel’s autofill function is really good at figuring out pseudo-random sequences.