It used to be that processors were fairly simple. The processor had some internal registers, and some control logic that made it all work. The processor would go through several stages to decode and execute each instruction. First it would fetch the instruction from memory. Then it would decode the operands needed by that instruction, which often required additional memory fetches to get the values (in other words, if the instruction is to add A and B, the processor has to go fetch A and B from memory at that point). Once all of the values were fetched, the processor would run them through the arithmetic logic unit (ALU). The final stage was to write the result to its destination, which could be a write back to a memory location, a write to an internal register, or a write to a control register such as the program counter in the case of a branch or jump instruction.
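To make those stages concrete, here is a very rough sketch in C of that fetch/decode/execute/write-back loop. Everything in it (the Instr format, the memory[] array, the little register file) is made up purely to illustrate the steps; it is not how any real processor is implemented.

    #include <stdio.h>

    /* Made-up instruction format: an opcode plus three register numbers. */
    typedef struct { int op, dst, src1, src2; } Instr;
    enum { OP_ADD, OP_SUB, OP_HALT };

    int main(void) {
        Instr memory[] = {                    /* a tiny made-up program     */
            { OP_ADD, 2, 0, 1 },              /* r2 = r0 + r1               */
            { OP_SUB, 3, 2, 0 },              /* r3 = r2 - r0               */
            { OP_HALT, 0, 0, 0 },
        };
        int regs[4] = { 5, 7, 0, 0 };
        int pc = 0;

        for (;;) {
            Instr inst = memory[pc++];        /* fetch (and advance the PC) */
            if (inst.op == OP_HALT) break;    /* decode: what kind of op?   */
            int a = regs[inst.src1];          /* operand fetch              */
            int b = regs[inst.src2];
            int result = (inst.op == OP_ADD)  /* execute in the "ALU"       */
                           ? a + b : a - b;
            regs[inst.dst] = result;          /* write back to a register   */
        }
        printf("r2=%d r3=%d\n", regs[2], regs[3]);    /* prints r2=12 r3=7  */
        return 0;
    }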
The type of processor in a PC is called a CISC, or Complex Instruction Set Computer. The way you make these types of computers “better” is to give them more powerful control circuits and ALU circuits so that they can do more with each instruction. A newer idea is called RISC, or Reduced Instruction Set Computer. The idea behind a RISC machine is that you make a very simple instruction set and you build dedicated hardware in the processor for each step the processor goes through. This is generally referred to as the “RISC pipeline”. You have a dedicated area of the processor that does the instruction fetch, another area that does the decode, another area that does the execute, and another area that does the write back. So you fetch one instruction and pass it down the pipeline to the decode section of the processor. The decode section decodes that instruction while the fetch section goes and gets another instruction. Then they each pass their instructions on to the next stage of the pipeline, so that the first instruction is being executed, the second is being decoded, and a third is being fetched. The end result is that with this (oversimplified) four-stage pipeline, you are working on four instructions at once.
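To picture the overlap, here is a toy C program that just prints which instruction each of the four stages is holding on each clock cycle. It doesn’t model a real CPU at all; it only shows how, once the pipeline is full, every stage is busy with a different instruction on every cycle.

    #include <stdio.h>

    int main(void) {
        const char *stage[4] = { "fetch", "decode", "execute", "write-back" };
        int n_instr = 5;                        /* pretend program of 5 instructions */

        for (int cycle = 0; cycle < n_instr + 3; cycle++) {
            printf("cycle %d:", cycle + 1);
            for (int s = 0; s < 4; s++) {
                int i = cycle - s;              /* which instruction sits in stage s */
                if (i >= 0 && i < n_instr)
                    printf("  %s=I%d", stage[s], i + 1);
            }
            printf("\n");
        }
        /* From cycle 4 onward the output shows all four stages occupied at
           once, each with a different instruction -- the point of the pipeline. */
        return 0;
    }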
Now caches start to become really important, because you’ve got different parts of the processor that all want to access the RAM at the same time.
PC processors are CISC processors, not RISC processors, but the people designing them looked at that pipeline idea and said hey, we can do that too. Maybe we won’t be as efficient at it as a RISC processor, but we can still speed things up quite a bit. So back in the days of the 386, you started to see this sort of pipelining being used.
One feature of RISC processors is that all of the instructions are exactly the same size. This is not true of CISC processors. Instructions in a PC can be as small as one byte, or can take several bytes. The wider the memory bus, the more bytes get fetched at once. So you do one memory access, but you may pick up 3 or 4 instructions. The processor stores these instructions in what it calls its “prefetch queue”.
Now we finally get to talk about the OP’s question, caches.
How the processor fills up these caches is very complicated. Processors started out with a single pipeline, but as they got more advanced, the pipelines got more advanced as well. Pentium processors started doubling up instructions, executing two instructions at once in a dual pipeline arrangement. They also had a separate floating point pipeline.
One problem you run into with these pipelines: what if the first instruction is ADD AX to BX and store the answer in BX (AX and BX are registers inside the processor), and the next instruction is SUBTRACT BX from CX? You can’t execute the second instruction until the first one completes, because it uses a value that is calculated by the previous instruction. If you are executing these in a pipeline, this causes what is called a “pipeline stall”, where the second instruction just sits there and waits for the first one to complete. Once the first instruction completes, the second instruction continues on and the pipeline fills up again.
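Written out as C, the dependency looks something like this (the variables just stand in for the registers; the actual stall happens in hardware, not in the C code):

    #include <stdio.h>

    int main(void) {
        int ax = 2, bx = 3, cx = 10;

        bx = ax + bx;   /* instruction 1: ADD AX to BX, answer stored in BX   */
        cx = cx - bx;   /* instruction 2: SUBTRACT BX from CX -- it needs the */
                        /* new BX, so a simple pipeline stalls right here     */
                        /* until instruction 1 has written BX back            */

        printf("bx=%d cx=%d\n", bx, cx);   /* bx=5 cx=5 */
        return 0;
    }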
So the things a modern processor tries to do to make things go faster are to reduce pipeline stalls as much as possible, and to use as big an instruction prefetch queue as possible so that the instruction decode stages of the processor don’t have to fight with the fetch stage as much.
Like I said, in a modern processor, this is where things start getting really complicated. You’ve got the fetch stage trying to fill up the instruction prefetch queue as much as it can. Modern processors even go as far as to re-order instructions in the queue if it will help speed things up. For example, if instruction 1 is ADD A to B (storing the result in B), instruction 2 is SUBTRACT B from C, and instruction 3 is PUSH D onto the stack, you’ve got a pipeline stall because instruction 2 has to wait for the result of instruction 1. Instead of letting the pipeline sit there doing nothing, a modern processor will move instruction 3 ahead of instruction 2 in the queue and execute it while instruction 2 waits.
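Here is that same three-instruction example as C, with made-up values, just to show which instruction depends on which and why the third one is safe to move up:

    #include <stdio.h>

    int stack[16], sp = 0;
    void push(int v) { stack[sp++] = v; }   /* stand-in for a PUSH instruction */

    int main(void) {
        int a = 1, b = 2, c = 10, d = 42;

        b = a + b;    /* instruction 1: produces the new B                     */
        c = c - b;    /* instruction 2: needs that new B, so it would stall    */
        push(d);      /* instruction 3: touches neither B nor C, so the        */
                      /* processor can run it while instruction 2 is waiting,  */
                      /* as long as the final results come out the same        */

        printf("b=%d c=%d top=%d\n", b, c, stack[sp - 1]);   /* b=3 c=7 top=42 */
        return 0;
    }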
The instruction fetches are mostly going to be sequential, so the processor is just going to grab line after line of memory and shove it into the prefetch queue. This causes most of the executable part of the program to end up in the cache memory, which is good, because it means that once the program has been running for a bit, the processor can just go to the cache and grab the next instructions, which is nice and fast. Any data that the program uses often will also be cached.
But here’s a problem when you start filling up your instruction cache: what do you do about branch instructions? In other words, if an instruction says “if A is less than 5, go execute this subroutine, otherwise keep running this main branch of code”, do you load the subroutine into the cache, or do you keep loading the main routine into the cache? This is called “branch prediction”, and it is so important to processor performance that folks like Intel don’t like to say exactly how they do it. They have some secret mystical algorithm that they use to decide whether to fetch the subroutine or the main code and shove it into the cache. If the prediction is right, then the instructions are already in the cache and the processor keeps humming along at top speed. If they predict wrong, then the entire instruction prefetch queue from the branch onward gets flushed, and everything grinds to a halt while the processor goes out to memory and fetches the instructions it actually needs.
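In code terms, the thing the predictor is guessing about is just an ordinary conditional; which side the fetch stage starts pulling into the cache ahead of time is the guess (the names and values here are made up for illustration):

    #include <stdio.h>

    /* Stand-in for "go execute this subroutine". */
    static void subroutine(void) { printf("took the branch\n"); }

    int main(void) {
        int a = 3;   /* the fetch stage may not know this value yet when it */
                     /* has to decide which instructions to pull in next    */

        if (a < 5) {
            subroutine();                                /* one path it could   */
                                                         /* speculatively fetch */
        } else {
            printf("fell through to the main code\n");   /* ...or this one      */
        }
        return 0;
    }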
So how the processor fills up the cache comes down to a very complex algorithm for predicting which instructions it is going to need, along with fetching whatever data the processor is going to need as the instructions march down the pipelines.
And it gets even more complicated than that, because not all memory can be cached. If you are writing out to video memory, for example, you can’t just store the data in the processor’s cache, because that data actually determines what the user sees. It needs to go all the way out to the video RAM when it is written. And you’ve also got devices like Ethernet chips that shove data into RAM without the processor’s help, so as they change RAM values, the data that is cached inside the processor has to be marked as invalid and fetched again by the processor when it needs it.
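As a rough illustration of the video memory case, low-level code that writes to a memory-mapped device usually goes through a pointer the compiler is told not to optimize away, and the firmware/OS marks that range of addresses as uncacheable. The address below is completely made up for the example; this is a sketch of the idea, not working driver code.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical memory-mapped framebuffer address, made up for this
       sketch. On a real system the firmware/OS maps the device and marks
       the region uncacheable (or write-combining) in the page attributes. */
    #define FRAMEBUFFER_ADDR 0xA0000000u

    void fill_screen(uint32_t color, size_t n_pixels) {
        volatile uint32_t *fb = (volatile uint32_t *)(uintptr_t)FRAMEBUFFER_ADDR;
        for (size_t i = 0; i < n_pixels; i++)
            fb[i] = color;   /* each store has to actually reach video RAM,
                                not just sit in a CPU cache */
    }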