Probability of finding a short sequence within a longer one

quelquechose · May 16, 2006, 2:50am

Suppose there is a sequence of symbols, each of which is independently chosen from k possibilities with probability 1/k. How can I evaluate the probability that a shorter sequence appears within such a sequence of length n?

For example, if a sequence of 100 digits (each of which is one of 0,1,2,…, 9) is generated at random with each digit equally likely to occur, what is the probability that somewhere in that sequence the digits “12345” appear, in that order?

Dr_Paprika · May 16, 2006, 3:01am

If your sequence of length n appears once out of N possible symbols, there are (N-n) other numbers, and your sequence can appear starting in position 1 to (N-n).

So if N=100, n=5, for sequence 12345:

The sequence could be at position 1 followed by 95 other random numbers, this gives 10^95 possibilities.

Or in position 2, with 1 random number before and 94 after.

Up to position 95. So there are 95*10^95 ways of picking the sequence out of 10^100 possible sequences.

Or (n-n)*k^(N-n) / k^N.

But my combinatorics is rusty. Shouldn’t you be doing your own homework? Anyway this doesn’t account for multiple sequences.

Dr_Paprika · May 16, 2006, 3:14am

Hmm, make that (since the last sequnce is in position #96… N-n+1)

(N-n+1)*k^(N-n) / k^N.

Until someone smarter comes along.

quelquechose · May 16, 2006, 4:04pm

This isn’t homework, it’s just a problem I came up with. Originally I was trying to figure out the probability that a given length of random DNA has a stop codon, and this is just a generalization of that problem. As a related question, how could one calculate the expected distance between occurrences of the short sequence?

Omphaloskeptic · May 16, 2006, 5:44pm

The substring s=“12345” you chose has the nice property that it has no “self-overlaps”: i.e., none of its suffixes are the same as its prefixes. This means that in a string of length n there can never be more than floor(n/5) occurrences of s. (Contrast this with the string “12121”, for example, or “11111” for the extreme case.) This makes the counting easier. (The analysis below requires modifications for overlappable strings.)

For such a non-overlapping string s, of length m, in an alphabet of size k, let N(n) be the number of strings of length n which contain no instances of s. (This is the number you want.) Now we can write a difference equation for N(n):
N(n) = kN(n-1) - N(n-m) .
(Consider all kN(n-1) strings of length n containing no instances of s in the first n-1 characters. Clearly this set contains all of the strings we’re interested in; but it also contains some strings with s occurring at the end: namely, all of the N(n-m) strings of length n-m with string s postfixed.) Of course, we have the “initial” conditions N(n) = k[sup]n[/sup], for n<m.

The difference equation is straightforwardly reduced to a polynomial of degree m; for large m, finding analytic solutions will probably be nontrivial.

Omphaloskeptic · May 18, 2006, 2:00am

The expected length of a string with no occurrences of s except for one at the very end is just k[sup]-m[/sup], so the expected distance between occurrences of s (not counting the characters of s) should be k[sup]m[/sup]-m.

Topic		Replies	Views
A semi-random sequence Factual Questions	1	1443	August 1, 2011
Several probabilities questions Factual Questions	7	1814	November 27, 2012
How would I calculate the probability of a random 6 digit number repeating? Factual Questions	10	2460	May 21, 2021
A mathematics/probability question Factual Questions	14	1142	February 14, 2007
How long a monotonic subsequence must a random permutation of an alphabet contain? Factual Questions	12	2134	November 15, 2009

Probability of finding a short sequence within a longer one

Related topics