# Probability of finding a short sequence within a longer one

Suppose there is a sequence of symbols, each of which is independently chosen from k possibilities with probability 1/k. How can I evaluate the probability that a shorter sequence appears within such a sequence of length n?

For example, if a sequence of 100 digits (each of which is one of 0,1,2,…, 9) is generated at random with each digit equally likely to occur, what is the probability that somewhere in that sequence the digits “12345” appear, in that order?

If your sequence of length n appears once out of N possible symbols, there are (N-n) other numbers, and your sequence can appear starting in position 1 to (N-n).

So if N=100, n=5, for sequence 12345:

The sequence could be at position 1 followed by 95 other random numbers, this gives 10^95 possibilities.

Or in position 2, with 1 random number before and 94 after.

Up to position 95. So there are 95*10^95 ways of picking the sequence out of 10^100 possible sequences.

Or (n-n)*k^(N-n) / k^N.

But my combinatorics is rusty. Shouldn’t you be doing your own homework? Anyway this doesn’t account for multiple sequences.

Hmm, make that (since the last sequnce is in position #96… N-n+1)

(N-n+1)*k^(N-n) / k^N.

Until someone smarter comes along.

This isn’t homework, it’s just a problem I came up with. Originally I was trying to figure out the probability that a given length of random DNA has a stop codon, and this is just a generalization of that problem. As a related question, how could one calculate the expected distance between occurrences of the short sequence?

The substring s=“12345” you chose has the nice property that it has no “self-overlaps”: i.e., none of its suffixes are the same as its prefixes. This means that in a string of length n there can never be more than floor(n/5) occurrences of s. (Contrast this with the string “12121”, for example, or “11111” for the extreme case.) This makes the counting easier. (The analysis below requires modifications for overlappable strings.)

For such a non-overlapping string s, of length m, in an alphabet of size k, let N(n) be the number of strings of length n which contain no instances of s. (This is the number you want.) Now we can write a difference equation for N(n):
N(n) = kN(n-1) - N(n-m) .
(Consider all kN(n-1) strings of length n containing no instances of s in the first n-1 characters. Clearly this set contains all of the strings we’re interested in; but it also contains some strings with s occurring at the end: namely, all of the N(n-m) strings of length n-m with string s postfixed.) Of course, we have the “initial” conditions N(n) = k[sup]n[/sup], for n<m.

The difference equation is straightforwardly reduced to a polynomial of degree m; for large m, finding analytic solutions will probably be nontrivial.

The expected length of a string with no occurrences of s except for one at the very end is just k[sup]-m[/sup], so the expected distance between occurrences of s (not counting the characters of s) should be k[sup]m[/sup]-m.