Help me understand cache memory

Note: I’m pretty sure this is within the rules. It is a question about something I’m studying in class, but this is not homework. I’m just trying to get a solid understanding.

So I’m taking an advanced computer organization class and the first topic we’re studying is cache memory. Specifically the cache that goes between the registers in the CPU and main memory. We’ve yet to get into issues like mapping strategies and what you do with misses, but I’m not really concerned about that right now. I’d like to outline how I currently think it works and maybe get some feedback on errors or points of confusion.

As I understand it, the aim of cache memory is to minimize the amount of time the processor is idle while waiting for data to be read from or written to main memory.

This is accomplished by using a relatively small amount of “fast memory” (does “fast memory” mean registers or what? What is the cache made of?) to “bridge the gap.” Our professor explained that for large enough hit ratios, this would decrease the effective memory access time to near the level of fast memory.
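
For concreteness, here’s one common textbook way to write that claim down, with made-up timings just to show the shape of it (a single cache level, effective time = h * t_cache + (1 - h) * t_memory):

```c
/* A minimal sketch with made-up timings:
   effective access time = h * t_cache + (1 - h) * t_memory, one cache level. */
#include <stdio.h>

int main(void) {
    double t_cache  = 1.0;     /* hypothetical cache access time, in ns       */
    double t_memory = 100.0;   /* hypothetical main-memory access time, in ns */
    double hit_ratios[] = {0.50, 0.90, 0.95, 0.99};

    for (int i = 0; i < 4; i++) {
        double h = hit_ratios[i];
        double t_eff = h * t_cache + (1.0 - h) * t_memory;
        printf("hit ratio %.2f -> effective access time %6.2f ns\n", h, t_eff);
    }
    return 0;
}
```

So with these illustrative numbers, a 99% hit ratio gives an effective time of about 2 ns against 100 ns for a straight trip to RAM, which is why the hit ratio matters so much.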

So I think I’m good up to this point (please correct any mistakes). This is where I kind of lose it, and I struggled with the Wikipedia page. I don’t really understand how it’s possible to get a high enough hit ratio.

She explained that the cache is divided into blocks, and each block holds multiple values from main memory. Now, clearly you can’t just have a full copy of RAM in the cache, because then there would be no need for RAM. So that means only some values from memory are stored in the cache. So, how does the system choose which data from main memory to store in the cache? I imagine there’s some sort of algorithm based on previous memory usage? Say, for example, the cache is one tenth the size of main memory. If data were picked at random, the hit ratio would be about 0.1. Obviously hit ratios are much higher than this. How is data stored in the cache? Does each block correspond to a sequence of memory values?

Sorry that this is kind of rambling. I just feel like there’s some relatively simple underlying concept here that I need to understand before I can get the big picture. If you made it this far, I thank you for your time and would greatly appreciate any input at all.

The cache is a small amount of very fast memory (static RAM, not the CPU’s general-purpose registers) on the same chip as the CPU.

When the CPU needs something from memory, it checks to see if the data is in the cache first. If it’s not, it loads a whole *block* of data from RAM into the cache. This block will contain the address it needs, but also a range of other nearby addresses.

Since memory accesses tend to cluster together, it’s likely that other memory accesses in the near future will come from the same block. For example, the instructions of a program are mostly stored in a contiguous sequence, so if you get a cache miss on one instruction and load a 16-word block, you’ll probably get cache hits on the next 15.

If you’re writing high-performance software, one of the things you try to do is arrange your code and data so that it uses the caching system efficiently. For example, if there are a set of values that will get used for the same computation, you want to make sure they are stored nearby in memory so they all get loaded in the same block.
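
A hedged sketch of what that looks like in practice (illustrative C, not from any real codebase): the same matrix summed in two orders, where the row-by-row order reuses each loaded block and the column-by-column order mostly doesn’t.

```c
/* Illustrative sketch only: summing the same matrix two ways.
   Row-by-row touches memory sequentially, so each cache-block fill serves many
   later reads; column-by-column jumps COLS*8 bytes between reads and gets far
   less reuse out of each block it loads. */
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double m[ROWS][COLS];   /* stored row-major: m[r][0..COLS-1] are adjacent */

double sum_row_major(void) {
    double s = 0.0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += m[r][c];          /* consecutive addresses: cache-friendly */
    return s;
}

double sum_column_major(void) {
    double s = 0.0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += m[r][c];          /* strides of COLS doubles: many more misses */
    return s;
}

int main(void) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            m[r][c] = 1.0;
    printf("%f %f\n", sum_row_major(), sum_column_major());
    return 0;
}
```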

I have a (potentially interesting) story about cache:
I was working for a small company that made wide-format laser plotter controllers. At the time, we could outperform any other controller on the market. We did this by very, very carefully optimizing the code and hand-writing critical sections in assembly. One of the things we discovered was that the dimensions of the bitmap used to store the plot data were critical: if the dimensions were exactly the size of the page, drawing a vector would often result in constant cache misses, because the previous vector had evicted that data from the cache. By adjusting the bitmap dimensions, we were able to speed up vectorization dramatically (we were something like 60x as fast as some of the plotters’ embedded controllers).

First, thanks for your answer.

What you describe sounds like what my teacher called “locality of reference” but the way you explained it makes more sense.

How does the CPU decide what data to store in the cache initially? Is it essentially random?

I’m picturing it working something like this:

Load random data into cache
Attempt to read from memory:
If the data you need is in the cache, just take it straight from there and bypass main memory altogether.
If the data is not in the cache, go get it from memory, along with some number of adjacent pieces of data, load them all into one block of the cache (overwriting something that was already there), then pass the requested data on to the CPU
Is that basically how it works? Or does the cache start out empty, only taking on blocks of data after the program misses?

Would it be correct to say that the cache is essentially a container for chunks of adjacent memory cells? That is to say, the chunks are loaded into the cache as one block, rather than separately? If so is the “offset” portion of the block used to figure out where the target piece of data resides within the large block?

The cache starts out empty.
As main memory is accessed, the cache gets loaded with those values.
If your data access is non-localized, then the cache can actually slow you down, because you waste time loading the cache and checking it when the data isn’t there.
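
And to answer the “offset” question directly: yes. In a typical design each address is split into a tag, an index that selects a cache line, and an offset that selects the byte within that line’s block. Here’s a minimal direct-mapped sketch (the block size and line count are made up for illustration):

```c
/* Minimal sketch of a direct-mapped cache lookup; the sizes here
   (64-byte blocks, 1024 lines) are made up for illustration. */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 64          /* bytes per block  -> 6 offset bits  */
#define NUM_LINES  1024        /* lines in the cache -> 10 index bits */

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE]; /* a copy of one whole block of RAM */
};

static struct cache_line cache[NUM_LINES]; /* everything starts invalid (empty) */

/* Returns the requested byte; fills the whole block on a miss. */
uint8_t read_byte(uint32_t addr, const uint8_t *ram) {
    uint32_t offset = addr % BLOCK_SIZE;                 /* where in the block  */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES;   /* which cache line    */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_LINES);   /* which block of RAM  */

    struct cache_line *line = &cache[index];
    if (!line->valid || line->tag != tag) {              /* miss: load the block */
        uint32_t block_start = addr - offset;
        memcpy(line->data, ram + block_start, BLOCK_SIZE);
        line->tag = tag;
        line->valid = true;
    }
    return line->data[offset];                           /* hit path: no RAM access */
}

int main(void) {
    static uint8_t ram[1 << 20];                         /* pretend 1 MB of RAM */
    ram[12345] = 42;
    printf("%d\n", read_byte(12345, ram));               /* miss, block gets cached */
    printf("%d\n", read_byte(12346, ram));               /* adjacent byte: a hit    */
    return 0;
}
```

The key point for the question above is that the block comes into the cache as one unit, and the offset just picks the requested byte or word out of it.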

It used to be that processors were fairly simple. The processor had some internal registers, and some control logic that made it all work. The processor would go through various stages to decode and execute every instruction. First it would fetch the instruction from memory. Then it would decode the operands needed by that instruction, which often required additional memory fetches to get the value (in other words, if the instruction is to add A and B, then you need to go fetch A and B from memory at that point). Once you had all of the values fetched, the processor would run these values through the arithmetic logic unit (ALU). The final stage was to write the value to its destination, which could be a write back to a memory location, a write to an internal register, or a write to a control register such as the program counter in the case of a branch or jump instruction.

The type of processor in a PC is called a CISC, or Complex Instruction Set Computer. The way you make these types of computers “better” is you give them more powerful control circuits and ALU circuits so that they can do more with each instruction. A newer idea is called RISC, or Reduced Instruction Set Computer. The idea behind a RISC machine is that you make a very simple instruction set and you build dedicated hardware in the processor for each step that the processor does. This is generally referred to as the “RISC pipeline”. You have a dedicated area of the processor that does the instruction fetch, another area that does the decode, another area that does the execute, and another area that does the write back. So you fetch one instruction and pass it down the pipeline to the decode section of the processor. The decode section decodes that instruction, while the fetch goes and gets another instruction. Then they each pass their instructions to the next stage of the pipeline, so that the first instruction is being executed, the second is being decoded, and a third is being fetched. The end result is that with this (oversimplified) four stage pipeline, you have four instructions in flight at once.
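
A toy way to see that overlap (purely illustrative, nothing like a real pipeline model): on any given clock cycle, stage s is holding the instruction that was fetched s cycles earlier.

```c
/* Toy illustration only: prints which instruction occupies each of the four
   stages on every clock cycle of an idealized pipeline (no stalls, no hazards). */
#include <stdio.h>

int main(void) {
    const char *stages[] = {"FETCH", "DECODE", "EXECUTE", "WRITEBACK"};
    int num_stages = 4;
    int num_instructions = 6;    /* pretend program of six instructions: I1..I6 */

    for (int cycle = 0; cycle < num_instructions + num_stages - 1; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int s = 0; s < num_stages; s++) {
            int instr = cycle - s;   /* stage s holds the instruction fetched s cycles ago */
            if (instr >= 0 && instr < num_instructions)
                printf("  %s=I%d", stages[s], instr + 1);
        }
        printf("\n");
    }
    return 0;
}
```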

Now caches start to become really important, because you’ve got different parts of the processor that all want to access the RAM at the same time.

PC processors are CISC processors, not RISC processors, but they looked at that pipeline idea and said hey, we can do that too. Maybe we won’t be as efficient at it as a RISC processor, but we can still speed things up quite a bit. So back in the days of the 386 and 486, you started to see this sort of pipelining being used.

One feature of RISC processors is that all of the instructions are exactly the same size. This is not true of CISC processors. Instructions in a PC can be as small as one byte, or can take several bytes. The wider the memory bus, the more bytes get fetched at once. So you do one memory access, but you may pick up 3 or 4 instructions. The processor stores these instructions in what it calls its “prefetch queue”.

Now we finally get to talk about the OP’s question, caches.

How the processor fills up these caches is very complicated. Processors started out with a single pipeline, but as they got more advanced, the pipelines got more advanced as well. Pentium processors started doubling up instructions, executing two instructions at once in a dual pipeline arrangement. They also had a separate floating point pipeline.

One problem you run into with these pipelines is what if the first instruction is ADD AX to BX and store the answer in BX (AX and BX are registers inside the processor) and then the next instruction is SUBTRACT BX from CX. You can’t execute the second instruction until the first one completes, because it uses a value that is calculated in the previous instruction. If you are executing these in a pipeline, this causes what is called a “pipeline stall” where the second instruction just sits there and waits for the first one to complete. Once the first instruction completes, the second instruction continues on and the pipeline fills up again.

So the things a modern processor tries to do to make things go faster are to reduce pipeline stalls as much as possible, and to use as big an instruction prefetch queue as possible so that the instruction decode stages of the processor don’t have to fight with the fetch stage of the processor as much.

Like I said, in a modern processor, this is where things start getting really complicated. You’ve got the fetch stage trying to fill up the instruction prefetch queue as much as it can. Modern processors even go as far as to re-order instructions in the queue if it will help speed things up. For example, if instruction 1 is ADD A+B, instruction 2 is SUB B from C, and instruction 3 is PUSH D onto the stack, you’ve got a pipeline stall because instruction 2 has to wait for the result from instruction 1. Instead of sitting there doing nothing, a modern processor will move instruction 3 ahead of instruction 2 in the queue and execute it while instruction 2 waits.

The instruction fetches are mostly going to be sequential, so the processor is just going to grab line after line of memory and shove it up into the prefetch queue. This is going to cause most of the executable part of the program to be fetched into the cache memory, which is good, because that means that once the program really starts running for a bit, the processor can just go out to the cache and grab the next instructions, which is nice and fast. Any data that the program uses often will also be cached.

But here’s a problem when you start filling up your instruction cache. What do you do about branch instructions? In other words, if an instruction says “if A is less than 5, go execute this subroutine, otherwise keep running this main branch of code”, do you load the subroutine into the cache or do you keep loading the main routine into the cache? This is called “branch prediction” and it is so important to processor performance that folks like Intel don’t like to say exactly how they do it. They have some secret mystical algorithm that they use to decide whether to fetch the subroutine or the main code and shove it into the cache. If their algorithm is right, then the instructions are already in the cache and the processor keeps humming along at top speed. If they predict wrong, then the entire instruction prefetch queue from the branch onward gets flushed, and everything grinds to a halt while the processor goes out to memory and fetches the instructions it actually needs.
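
If you want to see the effect in ordinary code, here’s a hedged sketch (timings will vary wildly by CPU and compiler, and an optimizer may remove the branch entirely): the same loop run once with effectively random branch outcomes and once with nicely predictable ones.

```c
/* Illustrative sketch: the same loop over the same values, but in one case the
   branch outcome is effectively random and in the other it is predictable
   (after sorting, the comparisons go "false, false, ..., true, true").
   On many CPUs the predictable version is noticeably faster; the exact
   difference depends entirely on the processor and compiler. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

static int cmp(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

static long long sum_big(const int *v, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)        /* this branch is what the predictor has to guess */
            s += v[i];
    return s;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long long a = sum_big(v, N);          /* random branch outcomes      */
    clock_t t1 = clock();
    qsort(v, N, sizeof *v, cmp);
    clock_t t2 = clock();
    long long b = sum_big(v, N);          /* predictable branch outcomes */
    clock_t t3 = clock();

    printf("unsorted: %lld in %ld ticks\n", a, (long)(t1 - t0));
    printf("sorted:   %lld in %ld ticks\n", b, (long)(t3 - t2));
    free(v);
    return 0;
}
```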

So how the processor fills up the cache is basically a very complex algorithm for predicting what instructions it is going to need, as well as fetching what data the processor is going to need as the instructions march down the pipelines.

And it gets even more complicated than that, because not all memory can be cached. If you are writing out to video memory, for example, you can’t just store the data in the processor’s cache because that data actually determines what the user sees. It needs to go all the way out to the video RAM when it is written. And you’ve also got devices like ethernet chips that shove data into RAM without the processor’s help, so as they change RAM values the data that is cached inside the processor has to get marked as invalid and has to be fetched again by the processor when it needs it.

I hope I am not muddying up the picture here, or re-stating the obvious, but let me emphasize that there are often two different caches: an instruction cache and a data cache.

The benefit of a cache of adjacent instructions is most obvious if you have a “do-loop” of instructions: the same group getting executed over and over.

The benefit of the data cache shows up when your program is working through a block or table of data; bring 'em all in and the next one will be handy.
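
A rough back-of-the-envelope sketch of that second point (assuming 64-byte blocks and 8-byte values, which are just illustrative numbers): the larger the stride through the table, the fewer of your reads each block fill actually serves.

```c
/* Quick arithmetic sketch (assumed 64-byte blocks, 8-byte doubles):
   when a loop walks a table with a given stride, roughly how many of its
   reads are served by each block that gets loaded? */
#include <stdio.h>

int main(void) {
    int block_size = 64;   /* assumed cache-block size in bytes */
    int elem_size  = 8;    /* sizeof(double)                    */

    for (int stride = 1; stride <= 16; stride *= 2) {
        int elems_per_block = block_size / elem_size;   /* 8 doubles per block */
        int reads_per_fill  = elems_per_block / stride;
        if (reads_per_fill < 1) reads_per_fill = 1;
        printf("stride %2d: about %d read(s) per block fill, miss ratio ~ %.3f\n",
               stride, reads_per_fill, 1.0 / reads_per_fill);
    }
    return 0;
}
```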

^Wow, thank you for the detailed post engineer_comp_geek. That’s very enlightening. I was really caught up with the issue of exactly how the computer decides what data to move into cache. It just seems like such a complicated problem, so it’s good to hear that it actually IS really complicated. I don’t think those complicated algorithms (especially the top secret Intel ones) are going to be part of this class, so I feel a lot better.

Actually, I hadn’t realized that there are two separate caches. That makes a lot of sense.

Actually, it’s even more complicated than that. In a typical dual core processor, for example, each core will have its own instruction and data cache. Then there will be a second level of general cache between the cores and the main RAM.

Great post and a really good summary of the information. Having graduated fairly recently with an EE degree and a senior project in computer architecture, I can confirm that what you’ve said here is spot on and a thorough summary of the basics.

I did have a question here about what you mean with your statement. Are you talking about direct memory access (DMA)? And how does the information get to the processor so it knows which cache line to mark as invalid in such a case?

Don’t forget L3 cache: 24 MB on Nehalem-EX and 32 MB on POWER7.

One way to investigate this is to ask the question in reverse. Ask instead how the system chooses which data to no longer store in the cache. Most if not all data that gets loaded gets cached, and at some point what’s in the cache needs to be replaced by something else. Maybe the data has been invalidated. Maybe execution has simply moved on. But the processor has to decide which data can be replaced. No longer valid data is easy and obvious. But then you get into issues of age since last use, proximity, and more.
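
To put a name on “age since last use”: the textbook replacement policy is LRU (least recently used), where the entry that has gone longest without being touched is the one that gets replaced. A minimal sketch for a single 4-way set (illustrative only; real hardware approximates this with a few status bits rather than full counters):

```c
/* Minimal LRU sketch for a single 4-way cache set. Illustrative only. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct way { bool valid; uint32_t tag; unsigned last_used; };

static struct way set[WAYS];
static unsigned now;                 /* a simple logical clock */

/* Returns true on a hit; on a miss, replaces the least recently used way. */
bool access_tag(uint32_t tag) {
    now++;
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {      /* hit: refresh its age */
            set[w].last_used = now;
            return true;
        }
        if (!set[w].valid ||                          /* prefer an empty way, */
            set[w].last_used < set[victim].last_used) /* otherwise the oldest */
            victim = w;
    }
    set[victim].valid = true;                         /* miss: evict and refill */
    set[victim].tag = tag;
    set[victim].last_used = now;
    return false;
}

int main(void) {
    uint32_t pattern[] = {1, 2, 3, 4, 1, 2, 5, 1};    /* 5 evicts tag 3, the oldest */
    for (size_t i = 0; i < sizeof pattern / sizeof pattern[0]; i++)
        printf("tag %u: %s\n", pattern[i], access_tag(pattern[i]) ? "hit" : "miss");
    return 0;
}
```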

I hope this points you in the right direction.

It can be DMA. It can be other things as well. The PCI bus for example has a mechanism by which a PCI slave can temporarily take control of the bus (this is sometimes used by disk interfaces). In either case, the cache controller has to watch over the bus and any time a write is done to the system RAM, the cache has to invalidate any entries it has that contain those memory locations.

The term for this (if you want to google) is “bus snooping”.

I was a TA for 6.004 (undergraduate computer architecture) at MIT a few semesters ago. Here are some notes I wrote up for my students on caches.

It is by no means guaranteed to be free of errors, nor free of oversimplifications made to ease the task of teaching the material to a bunch of overworked undergrads.

http://etulas.s3.amazonaws.com/caches.pdf

I have to admit that this is the first time I have actually understood anything about how processors work. It was the first thing I finally decided was beyond me. Y’all are doing good!