FinFETs are coming, so there is still a big call for new transistor technology. And I can assure you that there are plenty of companies who would be happy with extra speed, if just to have bragging rights for marketing. And there is plenty of research going on; fabs are still competitive, after all.
There was a startup making a processor with about 100 cores - but it didn’t make it to real production. But there are lots of processors that support tens of threads today.
Looking at parallel processing, especially its use in big compute, where you see enormous numbers of processors, sheds some light on things.
Computers built to be fast have always had some element of parallel computation in their design, even going back to the Cray-1 - which is arguably the first design intended to be a supercomputer, as opposed to merely the latest, and thus fastest, computer. (Before the advent of the minicomputer there is some truth to the idea that all computers were supercomputers - the speed jumps were huge between successive models.) The Cray-1 and its successors were vector machines, and speed was dependent upon the ability of the processor to compute over vectors of data with vector operations. The ISA (Instruction Set Architecture) had vector registers, which held a vector of data, not one item, and data was loaded and operated on in units of these vectors. Paths to memory were wide, allowing fetches to occur quickly. The point being that there could not be any interdependency between items in a vector register, as operations were logically parallel on them. So something like
float A[n]
for i = 1 to n - 1
A[i] = A[i-1]
would not run using vector registers, because each element depends on the one computed just before it. Although curiously
for i = 16 to n - 1
A[i] = A[i-16]
might, assuming there were 16 elements in a vector register.
Machines were very good at parallel matrix operations, and pretty soon languages (say Fortran 95) allowed expressing them easily.
A = x * B + C
is perfectly good F95 where A, B and C are matrices. Better, most Fortran compilers will automatically parallelise such code on modern multi-core boxes, or across distributed systems (using, say, the MPI communication library).
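For anyone who hasn't met whole-array notation, here's roughly what that one line expands to, written in C with an OpenMP pragma standing in for what the Fortran compiler does behind the scenes. It's only a sketch - the array size, names and the scalar are made up - but it shows why the compiler is free to parallelise it: every element is computed independently.

#include <stdio.h>
#define N 1024

/* A = x*B + C written as an explicit loop: every element is independent,
   so the compiler (or an OpenMP pragma) is free to vectorise the body and
   spread the iterations across cores. Names and sizes are illustrative. */
int main(void)
{
    static float A[N][N], B[N][N], C[N][N];
    const float x = 2.0f;

    /* fill B and C with something */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            B[i][j] = (float)(i + j);
            C[i][j] = 1.0f;
        }

    #pragma omp parallel for collapse(2)   /* ignored harmlessly if OpenMP is off */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = x * B[i][j] + C[i][j];

    printf("A[0][0] = %f, A[N-1][N-1] = %f\n", A[0][0], A[N-1][N-1]);
    return 0;
}

Compile with something like gcc -O2 -fopenmp and the iterations get spread over the cores; without the flag the pragma is simply ignored and you get the plain serial loop.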
Vector machines were brilliant at handling regular meshes and matrices. This covered a huge part of the demand for high performance at the time. I always used to say that between the use of arrays for large regular problem descriptions and linear algebra (which was dominated by the numerical solving of ODEs - which was little more than populate a big matrix, invert, substitute, rinse and repeat) you could cover two thirds of science and engineering. With care lots of fluids, aerodynamics (incompressible fluids mostly, but moving into compressible and thermal stuff as they went supersonic) and finite element analysis could be run over regular arrays of data, even when the physical system was over an irregular mesh. Vector machines went parallel with the Cray X-MP in the early 80’s. Attacking these sorts of problems was the backbone of high performance and parallel programming for at least two decades. Compilers got very good at spotting how to vectorise the operations, although in the early days they needed a lot of hand guidance. Even on modern x86 machines, especially the 64 bit versions, you see the legacy of this in the MMX and successor instructions, with an entire separate set of registers devoted to operations that are a mix of SIMD and vector.
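To make that legacy concrete, here's a minimal sketch using the SSE intrinsics (the successors to MMX, present on any x86-64 compiler). The arrays and values are invented; the point is just that one instruction loads, adds or stores four floats at a time - a pocket-sized version of the old vector registers.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers, 4 floats at a time */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* Process 4 floats per iteration - one load, one add, one store
       per 128-bit chunk, with no dependencies between the lanes. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}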
The 90’s saw such machines as the CM-5, which included four 64-bit wide vector units per processor node. A single cabinet CM-5 with 128 nodes could harness 512 vector units in parallel, to achieve some serious peak speeds. However the CM-5 was a distributed memory machine, and exemplified the direction and problems that would follow the big iron vector machines.
As the systems got bigger and bigger, with more and more processors, rather than the individual processors getting faster and faster, the memory hierarchy became the dominant speed bottleneck. The Cray vector machines didn’t have caches. They didn’t need them. Data came directly to the vector registers from memory, and went back. But this requires very fast, expensive memory and very regular, predictable access patterns (so much so that the memory fetch engine knew about the stride patterns your program had). Modern needs had to break the memory up both into a multi-level hierarchy - via caches - and also by distributing it. SGI brought us the ccNUMA architectures (Cache Coherent Non-Uniform Memory Architecture), which had a unified memory address space, but distributed across the processing nodes. Most other large systems were distributed memory, with local address spaces and a range of libraries providing a way of communicating data between nodes at the program level. And this has continued to this day. In the middle you can still build a large single memory space machine with a reasonable number of processor cores. But they don’t scale. Eventually you have to break them up into separate nodes. Still, you can build a dual socket machine with 24 processor cores (32 if you use AMD) and that will sit on your desk. 20 years ago that would have been a Top-100 supercomputer. Quad socket machines can get staggering performance. Areas like gene and protein sequence research love very large single memory address space machines. A terabyte of memory is common. Then they just mine away.
But the question of making a large problem parallel is interesting. In reality some problems really don’t parallelise, but they are fewer than many realise. It is often very important to go back to first principles in the problem statement. Too often the mathematics is converted to an algorithm much too early in the process, and the algorithm is mistaken for the problem from then on. Mathematics doesn’t just refer to traditional things, but must include automata theory, graph theory, and the like. A problem processing a file is really automata theory, and parallelisation would want to start with a clear definition of the grammar in a suitable form. A massive number of computational problems are just tree walks, and a huge number of the rest are graph walks. But so often this is obscured by a mess of other design or ad-hoc algorithm development.
One thing is that parallel algorithms often do need to perform more work than the single threaded one. But the price can be worth paying if doing twice as much work lets you run it across 4 or 8 cores. Here, though, second order effects come into play. The cost of synchronising data transfer can be very high - so you need to do it as little as you can and make it as effective as possible when you do. In distributed memory systems especially, the ratio between the amount of data on a node versus the amount to be exchanged becomes critical. For, say, a problem that represents a 3D data space, the space will be divided into sub-cubes, and communication typically occurs across adjacent faces. The data communicated scales with the square of a sub-cube’s edge length, whilst the data held on a node scales with its cube, so as the number of nodes increases (and the sub-cubes shrink) the communication costs become ever more dominant, and eventually you start to go backwards. But countering this can be cache locality. As the data on a node decreases in size more and more of it fits in cache, eventually reaching the point where all of it does. This can mean very significant increases in speed. So much so that across a range of sizes, adding more nodes means even greater increases in performance - super-linear speed up.
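Here's a back-of-the-envelope sketch of that surface-to-volume argument. The grid size and node counts are made up; it just prints how the per-node data and the per-node halo exchange shrink at different rates as you add nodes.

#include <stdio.h>
#include <math.h>

/* Halo-exchange back-of-envelope: an N^3 grid split into P cubic sub-domains.
   Each node holds N^3/P cells but must exchange its six faces each step,
   so communication shrinks only as P^(2/3) while compute shrinks as P. */
int main(void)
{
    const double N = 1024.0;                    /* cells along one edge (illustrative) */
    for (int P = 8; P <= 32768; P *= 8) {
        double edge  = N / cbrt((double)P);     /* edge of one sub-cube        */
        double cells = edge * edge * edge;      /* cells held per node         */
        double halo  = 6.0 * edge * edge;       /* cells exchanged (six faces) */
        printf("P=%6d  cells/node=%12.0f  halo/node=%10.0f  ratio=%.4f\n",
               P, cells, halo, halo / cells);
    }
    return 0;
}

The ratio of communication to computation grows as the cube root of the node count - exactly the creeping overhead described above.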
The tools for writing parallel code remain pretty primitive, but it isn’t an insuperable problem in many cases to get quite nice performance jumps on even quite crufty old code.
Yeah, but this thread is about increasing the max speed of a single threaded process - read the OP. FinFET as I understand it lets them fit more transistors in the same space with lower power. In theory they could use this to increase clock speeds a bit, but they’re more likely to use it to fit more and more cores on the same size die, or to make chips more power efficient.
Masterful. Thank you. My rusty knowledge moved forward 20 years right there.
Francis (and others who may know),
I’ve read that the L1 cache on consumer-level CPUs hasn’t increased much with time. The L2 and L3 have. Any reason for that difference in CPU cache increases?
Does there tend to be a roughly equal trade-off between the latency and the bus of memory?
Voyager:
By “yield” you mean the percentage of wafers that are actually usable in a chip? I hear it’s usually around 50%.
Would it be possible to cut a wafer into very small sections (so the yield goes up) and then place the good sections in close proximity to form a chip - a patchwork chip?
I brought up FinFETs to show that people are still doing lots of research. As I said, the main reason we’re not doing more complex cores is design time. Reuse is not just for software any more. It would be nice to have faster chips, and we are increasing speeds on ours, but nothing dramatic. And the benefits of the last few process nodes have been less than earth shattering in terms of speed.
You have it backwards. A wafer, which is a standard size, consists of a number of chips. The bigger the chip, the fewer that fit on a wafer and since wafer processing costs are reasonably constant, the more expensive the chip.
When the wafer is finished there is a test called wafer probe, where each chip is tested still attached to the others. Chips that pass and ones that fail are entered into a database, and a wafermap is produced showing the packaging people which chips to package. Packaged parts get tested further.
Yields are highly proprietary, so I ain’t talking. They depend on how big the chip is - bigger ones have more room for defects and thus more fails. Small chips like the ones that go into consumer products have yields way above 50%.
As I’ve mentioned, memories with failing bits can be repaired by replacing bad rows or columns with spares. That is very common. In some cases you can sell parts with fewer than the maximum number of cores for less money, so you can sell parts with failing cores. The 100 core start-up planned to have spares, and actually sell chips with fewer cores than that as their standard offering.
Cache architecture is a whole area in itself.
L1 caches are tied to the processor core in a very intimate manner. They need to be able to deliver an operand to the fetch unit in only a couple of clock cycles. The pipeline can stall whilst waiting, making L1 data cache performance critical for speed.
Now you get to the trade-offs in cache design.
Caches work well when they store data in contiguous lumps (lines) as this allows the system to exploit locality in the program whilst limiting the number of memory operations to fill the cache. Memory bus data width tends to be chosen to fit the cache line size, and memory access tends to be done in terms of a cache line. Choosing the line size is thus already critical. Too narrow and you have lots of memory cycles and the system slows, too wide, and all sorts of things start to go wrong, or cost too much.
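A crude way to see line-sized locality in action, assuming 64-byte lines and 4-byte floats (the timings are only indicative and very machine dependent):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)   /* floats: 256 MB, well beyond any cache */

/* Crude illustration of line-sized locality: a stride-1 sweep touches each
   64-byte line 16 times for one memory fetch, while a stride-16 sweep pays
   a full line fetch for every single float it actually uses. */
int main(void)
{
    float *a = malloc(SIZE * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < SIZE; i++) a[i] = 1.0f;

    double sum = 0.0;
    clock_t t0 = clock();
    for (size_t i = 0; i < SIZE; i++) sum += a[i];          /* stride 1  */
    clock_t t1 = clock();
    for (size_t i = 0; i < SIZE; i += 16) sum += a[i];      /* stride 16 */
    clock_t t2 = clock();

    printf("stride 1: %.3fs   stride 16: %.3fs   (sum %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}

The stride-16 loop does one sixteenth of the arithmetic but is nowhere near sixteen times faster, because it still drags an entire line in from memory for every float it uses.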
So the next problem is that caches are a finite size, and if you fetch something into one, you will need to evict something else. A big problem is how to manage this in a manner that tends to avoid evicting other live data, and thus thrashing the cache. The tactics that work best change throughout the cache hierarchy.
What really makes life hard is that you need to be able to find a line in the cache from its memory address very quickly. The easiest way is to simply hack a pile of bits off the address, leaving just enough bits to address a location in the cache. Say a cache has 64 lines, and each line is 64 bytes wide: bits [5:0] are the byte offset within the line, and you would use bits [11:6] of the address to pick the slot. This is known as a direct mapped cache. It has the advantage of being really fast. It has the disadvantage that a line can only go in one slot; if there is another line with the same set of address bits denoting its location, and that line is also hot, the cache will thrash.
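In code, that direct-mapped lookup is nothing more than bit slicing. A tiny sketch using the numbers above (the example addresses are invented):

#include <stdio.h>
#include <stdint.h>

/* Direct-mapped lookup for the example above: 64-byte lines and 64 slots.
   Bits [5:0] are the byte offset within the line, bits [11:6] pick the slot,
   and everything above that is the tag that must match for a hit. */
#define LINE_BYTES   64
#define NUM_SLOTS    64

static unsigned slot_of(uint64_t addr) { return (addr / LINE_BYTES) % NUM_SLOTS; }
static uint64_t tag_of(uint64_t addr)  { return  addr / LINE_BYTES / NUM_SLOTS;  }

int main(void)
{
    /* Two addresses exactly 4 KB apart land in the same slot with different
       tags - the kind of conflict that makes a direct-mapped cache thrash. */
    uint64_t a = 0x12340, b = 0x12340 + LINE_BYTES * NUM_SLOTS;
    printf("a: slot %u tag %#llx\n", slot_of(a), (unsigned long long)tag_of(a));
    printf("b: slot %u tag %#llx\n", slot_of(b), (unsigned long long)tag_of(b));
    return 0;
}

Alternate accesses between a and b and this cache thrashes even though the other 63 slots sit idle - which is exactly the direct-mapped weakness.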
The opposite is to have a directory attached to the cache that looks up the slot the line occupies from the line’s full address. This allows the line to be placed anywhere in the cache, and so the cache will avoid thrashing until it is completely filled with live hot data. The downside is that the directory lookup takes longer than simply chopping some bits out of the address. The directory is a slab of associative memory (you look up the memory by its contents and get the address back, rather than the other way around). A fully associative cache can have a directory that is as large as the cache memory. And the time taken for a lookup tends to go up with the number of entries. This last bit is the key to the question.
There is an intermediate form of cache, which uses a section of address bits for a line’s index in the cache, but which stores more than one line in a slot. These are set associative caches, and they work very well for the L2 and L3 caches.
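A toy version of the set associative lookup, just to show where the work goes: the index bits pick a set as before, but the tag then has to be compared against each of the ways in that set (sizes and addresses invented):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy 4-way set associative lookup: the index bits pick a set, then the tag
   is compared against (only) the four ways in that set. Sizes are invented. */
#define LINE_BYTES  64
#define NUM_SETS    64
#define WAYS        4

struct line { bool valid; uint64_t tag; };
static struct line cache[NUM_SETS][WAYS];

static bool lookup(uint64_t addr)
{
    unsigned set = (addr / LINE_BYTES) % NUM_SETS;
    uint64_t tag =  addr / LINE_BYTES / NUM_SETS;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;                 /* hit in one of the four ways        */
    return false;                        /* miss: pick a victim way and refill */
}

int main(void)
{
    uint64_t addr = 0x7f00;
    unsigned set  = (addr / LINE_BYTES) % NUM_SETS;
    cache[set][2].valid = true;          /* pretend this line was filled earlier */
    cache[set][2].tag   = addr / LINE_BYTES / NUM_SETS;
    printf("0x%llx -> %s\n", (unsigned long long)addr, lookup(addr) ? "hit" : "miss");
    return 0;
}

Four comparisons instead of one, but a hot line now has four places it can live, which is usually enough to stop the worst of the thrashing.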
L1 data caches tend to be fully associative, in order to avoid thrashing as much as possible, but their useful size is limited because the lookup time must be brutally short to avoid slowing the processor core. Also they must be right next to the core, as every little bit of distance the signals travel adds latency. So there comes a balancing act. You might like a larger L1 cache, but making it bigger slows it down. Somewhere in the middle you find a sweet spot. As you scale the processor with smaller and smaller design regimes, this trade-off doesn’t tend to move much, so no matter which generation of processor you look at, given the nature of the program mix, the L1 cache stays about the same size. Making it bigger actually slows things down.
It is important to realise that the sweet spot is determined by the expected compute mix. And the design of the lower levels of the cache hierarchy. This means the designs are benchmarked with a range of compute jobs, and the cache hierarchy design is chosen to best serve the expected jobs. Running a word processor is going to have a very different memory access pattern to a game or a job that does huge FFTs. Designs directed at servers may be designed with a different cache hierarchy simply because the job mix is so different to a desktop.
In addition to the data cache it is usual to have an instruction cache (incorrectly termed a Harvard architecture) and other caches for important data - such as the data structures for virtual memory translation (the translation lookaside buffer - TLB) and tables that keep track of branches taken to allow better prediction of branch direction based upon the previous paths taken by the code - the branch prediction table. The layout of all of these tends to be different. Instruction caches are often direct mapped, and so on. Lower levels of cache are usually unified - containing both instructions and data in one cache.
The reason you have further levels of memory cache is to buffer the latency of memory access to the L1 caches. The lowest level cache is designed to talk to memory, and its parameters are partly guided by the memory bus characteristics.
Once you go multi-processor, caches get more complex, as you need to deal with consistency problems with shared data items. Cache coherency is another whole area. Caches need to know if they have the only copy or a shared copy of a line, and they need to cooperate with one another to ensure that programs always see a consistent view. Or they need to provide a view of memory that can be sensibly reasoned about, but which loosens the coherency model (allowing better straight line performance, but requiring that compilers or higher levels work within the model).
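For the curious, the classic textbook protocol here is MESI. A much simplified sketch of the per-line state machine follows - not the protocol of any real processor, and it tracks only the states, with the bus traffic and write-backs left as comments:

#include <stdio.h>

/* Textbook-style MESI sketch: the state a cache keeps per line, and how it
   reacts to its own core's accesses and to requests snooped from other caches. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

static const char *name[] = { "Invalid", "Shared", "Exclusive", "Modified" };

/* other_copies says whether some other cache already holds the line
   (needed only when we fetch a line we don't yet have). */
static mesi_t next_state(mesi_t s, event_t e, int other_copies)
{
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? (other_copies ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:
        return MODIFIED;                  /* from S or I, an invalidate is broadcast first */
    case SNOOP_READ:
        return (s == INVALID) ? INVALID : SHARED;   /* M also writes the dirty data back */
    case SNOOP_WRITE:
        return INVALID;                   /* another cache wants ownership of the line */
    }
    return s;
}

int main(void)
{
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);  printf("after local read : %s\n", name[s]);
    s = next_state(s, LOCAL_WRITE, 0); printf("after local write: %s\n", name[s]);
    s = next_state(s, SNOOP_READ, 0);  printf("after snoop read : %s\n", name[s]);
    s = next_state(s, SNOOP_WRITE, 0); printf("after snoop write: %s\n", name[s]);
    return 0;
}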
A zillion years ago, when I taught computer architecture as a CS subject, one of the exercises the students were set was to design a cache hierarchy to get best performance out of a given test program, and to do this on a set of three different die sizes (where the increase in die allowed more cache). It was interesting to see how wide a variation of answers were possible when only this one test code was used. In reality, if a really good job mix was used the design would tend to settle down.
An example from the past - Sony’s PlayStation 3 used the Cell processor, which was designed with eight SPE cores (alongside the PPE), but the PlayStation console only guaranteed seven operational SPEs. Higher end uses of the Cell, mostly sold by IBM as computational acceleration boards, all had eight operational SPEs, and came at a significantly stiffer price.
Caches tend to be designed in blocks, and if there is a failure in a block it is possible to simply disable the entire block. As caches tend to dominate the area of processor dies, especially the higher end ones, the chance that any included defect hits a cache rather than processor operational bits is quite high, so simply binning parts based upon the amount of operational cache, and pricing these accordingly, presents an easy way of increasing the effective yield, and filling different price points.
Do you have an example of cache repair being done on a block level? Our caches all get repaired using spare rows and/or columns as I mentioned, and that has been the standard procedure as far as I can tell. Memory defects (and I’ve seen lots of bit maps) are often column based, and so replacing a block would be inefficient. Plus, it is relatively easy to implement row and column repair in the addressing logic.
In fact caches are the things you repair. We have hundreds of smaller memories, which have low failure rates due to their small size and so are not worth adding repair logic to.
Sorry I can’t say what our repair yield improvement is.
The density of the mask on the wafer is far higher than the density that can be achieved via off-chip interconnects. So it would take a lot of planning and be time-consuming in production, for little gain. The effort would be better spent trying to reduce the number of defects per square unit of surface.
The last chips that I know of that had multiple dies in the same package were the Pentium Pro (cache die, same speed as the CPU). The “Slot 1” Pentium II had separate packages for the CPU and cache (half-speed, multiple chips) which were mounted on a printed circuit board inside the Slot 1 plastic case.
The technology is called System in Package (SiP). Here is an example. I haven’t seen much about this recently. Today people are doing 2.5D or 3D packaging. In 3D, something like a memory chip is put on top of the processor, and interconnect goes directly from one to the other. In 2.5D the chips are connected through a silicon interposer. The late Gene Amdahl had a company called Trilogy trying to do Wafer Scale Integration in the 1980s. It failed. Others didn’t try this much but did try 2.5D. They failed also, but it seems to be coming back.
I seem to remember one of the big problems they had with the wafer scale stuff was thermal. Long time ago however. Wafer scale was a big deal for a while there.
Apple are having a great time with dropping the memory on top of the processor in the iPhone and iPad.
That’s right. This was before low power devices. When I did research on Trilogy for my column I found that they claimed to have gotten it working - but it was way too late. They were also plagued by disasters of a Job-like quality, for instance the clean room under construction flooded, and after it was done they found they had mold in the walls.
I was peripherally involved in a 2.5D effort at Bell Labs - went to some meetings, made up excuses to avoid going to others. They eventually sold the technology to Alcoa of all people.
My column was telling younger 2.5D and 3D researchers today that they should read papers from 25 years ago, which are usually outside the event horizon of researchers today.
AFAIK Intel still mounts CPUs and GPUs together in the same package.
Do you know if they use standard interconnect or a silicon interposer?
I think they just wire-bond the two die to the same ceramic substrate.
Here’s a reference:
From here.
This article is a few years old, but I think they are still doing multi-die packages.
(Note that the article talks about the DRAM being a separate die, but I think they do separate GPUs, too.)
The last image in your link looks like IBM SLT from the 60’s.
It has been years since I’ve done anything in that field, but IIRC the goal was to metallize the two chips together, like a PCB plated-thru hole on a much smaller scale.
Never saw one of them, but I have seen [url=https://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV2137.html]IBM Thermal Conduction Modules[/url].
3D stuff today is connected by Through Silicon Vias from one chip to the one sitting on it. We don’t do this kind of stuff, but I seem to have become the go-to reviewer for papers on this submitted to an IEEE Transactions.