First, here’s a good introductory page on Bayes’s formula and its applications: An Intuitive Explanation of Bayesian Reasoning. It has lots of interactive examples.
The most basic form of Bayes’s formula is
p(A | B) = p(B | A) * p(A) / p(B).
To see what this means, and why it’s true, imagine that you have a dart board, with the area of the dart board equal to exactly 1 area-unit. Suppose that you are going to throw a dart at the board. Let’s assume that you aren’t going to aim the dart at any particular part of the board, but the dart is guaranteed to land somewhere on the board. That is, it’s definitely not going to land outside the board, but it is no more likely to land on any one part of the board than on any other part of the same size.
It follows from these assumptions that, if you shade in a region A on the board, the probability that your dart will land in A is exactly the area of A.
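To make this concrete, here is a minimal simulation sketch. It assumes the board is the unit square and that A is a rectangle; the name `estimate_hit_probability`, the rectangle coordinates, and the number of throws are all made up for illustration. If the reasoning above is right, the fraction of throws landing in A should come out close to the area of A.

```python
import random

def estimate_hit_probability(region, throws=200_000, seed=0):
    """Estimate the chance that a uniformly thrown dart lands in `region`.

    `region` is a hypothetical rectangle (x0, y0, x1, y1) inside the
    unit square, which stands in for a dart board of area 1.
    """
    rng = random.Random(seed)
    x0, y0, x1, y1 = region
    hits = 0
    for _ in range(throws):
        # A throw with no aim: every point of the board is equally likely.
        x, y = rng.random(), rng.random()
        if x0 <= x < x1 and y0 <= y < y1:
            hits += 1
    return hits / throws

region_a = (0.1, 0.2, 0.6, 0.8)  # width 0.5, height 0.6
area_a = (region_a[2] - region_a[0]) * (region_a[3] - region_a[1])
estimate = estimate_hit_probability(region_a)
print(f"area of A = {area_a:.3f}, estimated p(A) = {estimate:.3f}")
```

The two printed numbers should agree to about two decimal places; more throws tighten the agreement.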
Some notation: We denote the area of a region A by p(A). We use the letter “p”, of course, because, as mentioned, this is the probability that the dart will land in A. Also, suppose we draw two regions on the board, A and B. Then we denote the region of their overlap by A & B. Hence, p(A & B) is just the probability that the dart will land both in A and in B – that is, in their overlap.
Now consider the following situation: For whatever reason, we are only interested in those cases where the dart lands in region B. Furthermore, we want to know, given a dart-throw landing in B, what is the probability that it also lands in A? We denote this value by p(A | B).
Observe the difference between p(A | B) and p(A & B). The value p(A & B) is the proportion of dart throws that land in A & B out of all throws, including those throws that miss B altogether. The value p(A | B), on the other hand, is the proportion of throws that land in A & B out of only those throws that hit B.
So, to recap, p(A | B) is the probability that the dart hits A, given that it hits B. But we saw above that we can think of probabilities as the areas of regions on a dart board of area 1. So, to compute p(A | B), we can forget everything outside of B and just think of B itself as an entire dart board (though perhaps an oddly-shaped one). Then A & B is a region inside this new board.
This means that p(A | B) is the area of A & B after re-scaling so that B has area 1. That is, we now compute areas in B by taking their “original” areas and dividing by the area of B (which, you should note, makes the re-scaled area of B itself equal to 1, as desired).
As an equation, the content of that last paragraph is
p(A | B) = p(A & B) / p(B) … (1).
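Equation (1) can be checked with a small simulation. The sketch below again assumes a unit-square board with two rectangular regions; the rectangles `A` and `B` and the helper names are hypothetical. It computes p(A | B) twice: once by throwing away every dart that misses B and counting how many of the rest hit A, and once as the ratio of areas p(A & B) / p(B).

```python
import random

def inside(point, rect):
    """Return True if `point` (x, y) lies in the rectangle (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = rect
    return x0 <= x < x1 and y0 <= y < y1

def area(rect):
    return (rect[2] - rect[0]) * (rect[3] - rect[1])

def overlap_area(r, s):
    """Area of the overlap of two axis-aligned rectangles (0 if disjoint)."""
    width = min(r[2], s[2]) - max(r[0], s[0])
    height = min(r[3], s[3]) - max(r[1], s[1])
    return max(width, 0.0) * max(height, 0.0)

A = (0.0, 0.0, 0.5, 0.6)    # area 0.3
B = (0.25, 0.2, 0.75, 1.0)  # area 0.4; overlaps A in a region of area 0.1

rng = random.Random(1)
throws = [(rng.random(), rng.random()) for _ in range(200_000)]

# Keep only the throws that hit B, then see how many of those also hit A.
in_b = [t for t in throws if inside(t, B)]
in_a_and_b = [t for t in in_b if inside(t, A)]
simulated = len(in_a_and_b) / len(in_b)

# Equation (1): re-scale the overlap area by the area of B.
by_formula = overlap_area(A, B) / area(B)

print(f"simulated p(A | B) = {simulated:.3f}, p(A & B) / p(B) = {by_formula:.3f}")
```

Discarding the throws that miss B is the simulation's version of treating B as the whole dart board.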
But we could just as well have carried out this reasoning with A and B switched. We would then have found that
p(B | A) = p(B & A) / p(A).
Of course, the area of the overlap of A and B is the same whether you put A or B first. That is, p(B & A) = p(A & B). So this last equation becomes
p(B | A) = p(A & B) / p(A) … (2).
Now we have p(A & B) in two equations. Solving for it in (2) gives p(A & B) = p(B | A) * p(A), and plugging this into (1) gives us
p(A | B) = p(B | A) * p(A) / p(B).
And this is Bayes’s formula.
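As a sanity check, here is a short sketch that plugs numbers into equations (1) and (2) and confirms that p(A | B) = p(B | A) * p(A) / p(B) holds. The areas are made-up values chosen for illustration, and exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical areas on a dart board of total area 1.
p_a = Fraction(3, 10)        # area of A
p_b = Fraction(2, 5)         # area of B
p_a_and_b = Fraction(1, 10)  # area of the overlap A & B

p_a_given_b = p_a_and_b / p_b  # equation (1)
p_b_given_a = p_a_and_b / p_a  # equation (2)

# Right-hand side of Bayes's formula.
bayes_rhs = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes_rhs)  # prints: 1/4 1/4
```

Any other choice of areas works the same way, as long as the overlap is no larger than either region and p(A) and p(B) are nonzero.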