OK, a quick lesson in processor architecture.
Modern processors have become a great deal faster over the years - much faster than the rest of the computer system. In particular the processor core - the bit that does the actual computation - has become much faster than the memory it uses. Many years ago there wasn't a great deal of difference between the time it took to fetch data from memory and the time it took to, say, add two numbers together. Now the difference is of the order of a hundred times. Clearly this is bad. Indeed, if there wasn't a fix, computers would be very substantially slower than they are today.
The big win is to notice that programs have temporal locality. That is, things they have used recently tend to get used again pretty soon. They also have spatial locality: things you need soon are often close by in memory to things you recently used. Also, it is possible to make very, very fast memory that is close in speed to modern processor cores. But it is much less dense, and much more power hungry, than ordinary memory. Plus, in order to access it quickly you really need to have it physically right next to the processor core, on the same chip. This very fast but small memory is called cache.

The other trick is that we make the cache operate automatically, by adding a heap of extra logic which keeps watching the memory requests the processor makes and short circuits them if the data needed is already in cache. And we make the cache operate in small blocks, so that the data held is in small blocks of memory addresses that are next to one another. Thus we can take advantage of both temporal and spatial locality. If we fetch a data item from memory, the cache controller will ask the memory system for the entire cache block surrounding that item and store it. If either that item or a close neighbour is needed again soon, it is ready and waiting to be delivered much faster than if the processor had to go all the way to memory. This is a massive win.
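To make spatial locality concrete, here is a minimal sketch in C (the array size, types and the exact timings are just illustrative and will vary from machine to machine). It sums the same big 2-D array twice: row by row, so consecutive accesses sit in the same cache block, and then column by column, so nearly every access drags in a whole new block and wastes most of it.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096   /* 4096 x 4096 ints = 64 MB, much bigger than any cache */

    int main(void)
    {
        int (*a)[N] = malloc(sizeof(int[N][N]));
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1;

        long sum = 0;
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)          /* row by row: consecutive addresses */
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)          /* column by column: 16 KB jump per step */
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        clock_t t2 = clock();

        printf("row-major:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        printf("(sum = %ld)\n", sum);        /* stops the loops being optimised away */
        free(a);
        return 0;
    }

On most machines the column-wise pass comes out several times slower, even though it does exactly the same amount of arithmetic.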
As time has gone on, we find that the balance between really fast performance and the amount of data we can hold in cache starts to fail: we really wish we had a much bigger cache, but we can't make it fast enough. So the easy answer is to add another layer of cache. We end up with one cache that is really fast but not very big, and another that is pretty fast and much bigger. These are the level 1 and level 2 caches. Very high end processors have three levels of cache. Expect even more.
The problem, of course, is that eventually a cache gets full. Then the cache controller throws away some data (writing it back to memory if it needs to) and reuses the space so freed for new data. Clearly it might have to throw away data that was actually still in use, and then the program pays a penalty to get that data back again. In the worst case you can get cache thrashing behaviour.
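A rough way to feel that penalty: do the same number of random reads over a working set that fits in cache and over one that doesn't, as in the sketch below (the sizes and read count are guesses, not tuned for any particular machine). In the large case the controller has to keep evicting and refetching blocks, so the same work takes far longer.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time a fixed number of pseudo-random reads over a buffer of n_ints ints. */
    static double random_reads(size_t n_ints, long reads)
    {
        int *buf = malloc(n_ints * sizeof *buf);
        for (size_t i = 0; i < n_ints; i++)
            buf[i] = (int)i;

        volatile long sink = 0;                    /* stops the loop being optimised away */
        unsigned int idx = 12345;
        clock_t t0 = clock();
        for (long i = 0; i < reads; i++) {
            idx = idx * 1103515245u + 12345u;      /* cheap pseudo-random step */
            sink += buf[idx % n_ints];
        }
        clock_t t1 = clock();

        free(buf);
        (void)sink;
        return (double)(t1 - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        long reads = 50 * 1000 * 1000;
        /* ~64 KB: sits comfortably in cache once loaded */
        printf("small working set: %.2f s\n", random_reads(16 * 1024, reads));
        /* ~64 MB: far bigger than any cache, so blocks keep getting evicted */
        printf("large working set: %.2f s\n", random_reads(16 * 1024 * 1024, reads));
        return 0;
    }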
So, as a rule, the more cache you have, the more data is held close to the processor core, and the faster the machine will be.
The difference in access speeds is pretty impressive. I haven't looked at the latest x86 offerings, but a typical set of numbers would be: access to data already in the processor - some fraction of a clock cycle; access to level 1 cache - say three cycles; access to level 2 cache - say 25 cycles; access to memory - a hundred-plus cycles. That 3MB versus 6MB of cache is the level 2 cache. For completeness, it is worth noting that the level 1 caches are per-core caches - and there are two per core, one for data and one for program code. The level 2 cache is shared between the two cores and can hold both code and data.
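If you want to put rough numbers on your own machine, a pointer chase is the usual trick: link a buffer into one long, randomly shuffled cycle and follow it, so each load depends on the previous one and the prefetcher can't help. The sketch below does that for a few working-set sizes (the sizes and step count are illustrative, not tied to any particular chip); the time per load should step up roughly in line with the figures above as the set outgrows each level of cache.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Follow a randomly shuffled cycle of n_ptrs pointers for 'steps' hops,
       returning the average nanoseconds per load. */
    static double chase(size_t n_ptrs, long steps)
    {
        size_t *next = malloc(n_ptrs * sizeof *next);
        size_t *perm = malloc(n_ptrs * sizeof *perm);

        for (size_t i = 0; i < n_ptrs; i++)
            perm[i] = i;
        for (size_t i = n_ptrs - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        for (size_t i = 0; i < n_ptrs; i++)            /* link into one long cycle */
            next[perm[i]] = perm[(i + 1) % n_ptrs];

        volatile size_t p = 0;
        clock_t t0 = clock();
        for (long i = 0; i < steps; i++)
            p = next[p];                               /* each load depends on the last */
        clock_t t1 = clock();

        free(next);
        free(perm);
        return (double)(t1 - t0) / CLOCKS_PER_SEC / steps * 1e9;
    }

    int main(void)
    {
        size_t kb[] = { 16, 2048, 65536 };             /* roughly L1-sized, L2-sized, memory */
        for (int i = 0; i < 3; i++) {
            size_t n = kb[i] * 1024 / sizeof(size_t);
            printf("%6zu KB working set: %5.1f ns per load\n",
                   kb[i], chase(n, 20 * 1000 * 1000));
        }
        return 0;
    }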