Why is it necessary to use such a ridiculously huge number of elements in a logical device? I understand that there’s cash memory that would take at least one element per cell, so that would amount at best to 500-1000 million or so, but what accounts for the remaining billions? And how the heck doesn’t all this mess of unfathomably small chunks of matter, tightly packed in a tiny piece of stone, just diffuse and lose its structure at high temperature while in operation?
What I found supported the transistor count, though none of it was sourced and Intel isn’t talking. The top-of-the-line model has 28 cores, which accounts for a lot of transistors right there. The caches don’t seem to be excessively big. I don’t know which type of memory is used for each cache, but L3 is the biggest and is presumably SRAM, which takes 6 transistors per cell, so it could account for a lot as well.
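Using the 6-transistors-per-bit figure, a back-of-envelope estimate is easy to sketch. The 38.5 MB L3 size below is my own assumption for a 28-core part, not a sourced figure:

```python
# Back-of-envelope transistor count for an SRAM cache array,
# at 6 transistors per bit (the standard 6T SRAM cell).
TRANSISTORS_PER_BIT = 6

def sram_transistors(cache_bytes):
    # Counts only the storage array; decoders, sense amps,
    # and tag arrays would add more on top.
    return int(cache_bytes) * 8 * TRANSISTORS_PER_BIT

# Assumed 38.5 MB L3 for a 28-core part (illustrative, not sourced):
l3_bytes = int(38.5 * 1024 * 1024)
print(f"{sram_transistors(l3_bytes) / 1e9:.2f} billion transistors")
```

So even a cache of that size only accounts for around 2 billion transistors of the total, leaving plenty for the cores themselves.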
I’ve seen processors with really impressive heat sinks, and I bet this one has an even bigger one. I’m sure there is lots of logic inside monitoring heat and shutting off parts of the chip not in use, so not all of these 7 billion transistors are going to be switching at the same time. If they were, the thing would melt down; I’ve heard of that happening, though quite a while ago and with smaller chips.
The i9 series is not unique. The IBM Power9 CPU has eight billion transistors, the Apple A12X CPU in the iPad Pro has 10 billion, and the Oracle Sparc M7 had 10 billion back in 2015: https://en.wikipedia.org/wiki/Transistor_count
The “die shots” of modern CPUs show the layout: https://en.wikichip.org/wiki/intel/core_i9/i9-9900k
The above die shot doesn’t break down the logic vs cache, but the Power9 die shot does:
The Oracle Sparc M7 annotated die shot contains a break-out of a single core:
A standard floating-point benchmark is Linpack. The original Cray-1 did about 100 MFLOPS and consumed 115,000 watts: http://www.extremetech.com/wp-content/uploads/2014/10/cray-1-nersc-disassembled.jpg
Today’s Intel E7-8894 v4 Xeon consumes about 165 watts while doing about 3 TFLOPS which is 30,000 times faster than the Cray-1. The E7-8894 has about 7 billion transistors.
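The arithmetic behind that comparison, using only the figures quoted above:

```python
# Cray-1 vs. E7-8894 v4 Xeon, using the figures quoted in the post.
cray_flops, cray_watts = 100e6, 115_000   # ~100 MFLOPS at 115,000 W
xeon_flops, xeon_watts = 3e12, 165        # ~3 TFLOPS at 165 W

speedup = xeon_flops / cray_flops
efficiency_gain = (xeon_flops / xeon_watts) / (cray_flops / cray_watts)
# ~30,000x faster, and roughly 21 million times more FLOPS per watt
print(f"{speedup:,.0f}x faster, {efficiency_gain:,.0f}x the FLOPS per watt")
```

The efficiency gain is the more striking number: per watt, the Xeon does about twenty million times the work.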
If you want the computing performance that modern society depends on, it takes lots of transistors on the CPUs. Fortunately the global semiconductor industry spends about $50 billion per year on R&D so this is possible.
7 billion transistors is small beans today. That’s the transistor count of the Xbox One X CPU/GPU. Modern CPUs go up to 19 billion transistors, GPUs to 21 billion, and you can get an FPGA with 50 billion.
These things are really, really small: they’re measured in nanometers. The size of the chip, by contrast, is measured in millimeters, a million times larger, so theoretically you could fit on the order of a trillion transistors in the area. Of course, much of the area is consumed by other circuitry.
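The area arithmetic works out like this (both sizes below are illustrative round numbers, not any particular product’s dimensions):

```python
# How many transistor-sized squares fit on a die, to first order.
die_side_nm = 10e6   # assume a 10 mm die edge, i.e. 10 million nm
feature_nm = 10      # assume a "10 nm-class" transistor footprint

cells_per_side = die_side_nm / feature_nm   # a million per edge
total_cells = cells_per_side ** 2
print(f"{total_cells:.0e}")   # on the order of 1e12, i.e. a trillion
```

A million times larger in each linear dimension means a trillion times more area, which is why a few tens of billions of actual transistors still leaves most of the die for wiring and everything else.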
As for heat, they’ve got really good at cooling.
Is “cash memory” the kind you have to pay extra for?
It’s when your dad says his mom used to give him a dime to go buy a loaf of bread.
So much is dedicated to graphics and media. Does a graphics card free that up for processing, or does it just become dead space?
Generally, it’s just wasted.
I suppose that it could be used for some GPU-assisted calculations, but there are few applications that make this worthwhile. Password cracking or Bitcoin mining, perhaps.
You can use BOINC to make use of the idle CPU/GPU resources to help the scientific community process data. Note: running your computers 24/7 wide open might increase your electric bill slightly.
While it is possible to create gates with one transistor, or flip-flops (memory) with two, and that’s how it was done in early concept computers when discrete components were manually connected on large boards, logic elements today are generally far more complex, since each element’s design is assembled by computerized tools from templates. A simple gate or bit of memory can have a dozen transistors or more. The goal is speed, and designers even add complexity to increase speed: there are levels of cache, look-ahead, and pre-execution (fetching and executing both choices of a branch while waiting for the comparison to complete). All these extra pieces come with a massive number of transistors. Ditto for extremely fast cache memory, and for the gates that drive data onto the bus. And, as mentioned earlier, there are quite a number of cores, so all of these elements are duplicated over and over.
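To put rough numbers on “a dozen transistors or more,” here is a toy tally using typical textbook transistor budgets for static CMOS standard cells. Real cell libraries vary with drive strength, scan logic, and so on, so these counts are illustrative:

```python
# Rough transistor budgets per standard cell (typical textbook figures;
# real libraries vary with drive strength and scan features).
CELL_TRANSISTORS = {
    "inverter": 2,
    "nand2": 4,
    "nor2": 4,
    "xor2": 10,         # a common static-CMOS XOR implementation
    "d_flip_flop": 24,  # master-slave, with reset and clock buffering
}

def budget(netlist):
    """Sum transistors for a dict of {cell_name: instance_count}."""
    return sum(CELL_TRANSISTORS[c] * n for c, n in netlist.items())

# A hypothetical 64-bit register with a little surrounding logic:
print(budget({"d_flip_flop": 64, "nand2": 128, "inverter": 64}))
```

Even this tiny hypothetical structure lands well above two thousand transistors, and a modern core contains millions of such pieces.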
The primary answer is bandwidth - how fast can data be moved between functional blocks (CPU to cache, for example). Functional blocks that are on the same chip can use higher speed and wider data buses: speed is not limited by the need to drive signals from one chip to another through the motherboard, and data bus width is not limited by a finite pin count on any reasonable package. Buses can be unidirectional, so there is none of the overhead that is involved in switching direction on a bidirectional bus.
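The width advantage is easy to quantify with the simplest bandwidth model, peak rate = bus width × clock rate. The widths and clocks below are made-up illustrative values, not any specific product’s numbers:

```python
def bandwidth_gbs(width_bits, clock_hz):
    """Peak bandwidth in GB/s for a parallel bus, assuming one
    transfer per clock cycle (the simplest possible model)."""
    return width_bits / 8 * clock_hz / 1e9

# Illustrative comparison: a wide on-die bus vs. a pin-limited
# external one (both figures are assumptions for the example).
on_chip  = bandwidth_gbs(512, 2.0e9)   # 512 bits wide at 2 GHz
off_chip = bandwidth_gbs(64, 1.6e9)    # 64 bits wide at 1.6 GHz
print(on_chip, off_chip)               # 128.0 vs 12.8 GB/s
```

An order of magnitude of bandwidth, gained just by not having to squeeze the bus through package pins.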
Secondary considerations are increased cost of assembly for multiple die vs. a single die, decreased complexity of the motherboard (fewer chips = smaller board with less interconnect), and reduced EMI (electro-magnetic interference) since the very high speed buses are constrained to be on one chip.
The on-chip temperature (junction temperature) is generally limited to 125 °C to get an acceptable level of reliability. While this is hot for a human, it is far below the temperature required to cause the dopants which form the transistors to start diffusing to any significant degree. The metal on chip can be damaged by too high a current at such temperatures (electromigration), but this is mitigated by careful circuit design as well as by choosing metal compositions that are more immune to electromigration (copper vs. aluminum, for example).
The M7, which I worked on, didn’t have much in the way of graphics, being meant for servers. I don’t know what you mean by media. Most of the I/O processing and bus protocol processing is done by dedicated processors that sit on the chip.
I have next to me the M6 processor and die photo on the little thing they gave to everyone who worked on the project. Here are the specs from it:
4.27 billion transistors
12 sparc cores
48 MB L3 Cache. (Called L3$, so the cash - cache pun is standard terminology.)
A big crossbar switch in the middle to route data between the caches and the cores.
The L3 caches (there are 4) are just about as big as the 12 CPUs in area.
It got announced in September 2013, but I don’t remember when it taped out or started shipping.
I never had to count transistors, though I did count flip-flops, and there were nearly a million of them.
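A quick sanity check on the numbers above, using my own arithmetic rather than any official breakdown: at 6 transistors per SRAM bit, the 48 MB L3 storage array alone accounts for over half of the M6’s 4.27 billion transistors.

```python
# How much of the M6's transistor budget could the L3 storage
# array alone account for, at 6 transistors per SRAM bit?
# (My arithmetic on the figures quoted above, not an official breakdown.)
l3_bits = 48 * 1024 * 1024 * 8
l3_transistors = l3_bits * 6
total = 4.27e9
print(f"{l3_transistors / 1e9:.2f}B of {total / 1e9:.2f}B "
      f"= {l3_transistors / total:.0%}")
```

That lines up with the die photo, where the four L3 caches take about as much area as the twelve cores.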
True about bringing functionality on-chip for efficiency and to use some of the transistors made available by new process nodes. But the days of wide parallel data buses between chips are long past. High-speed serial I/O is used today, which is a lot faster than the old way, is self-clocked, and does not have the problem of crosstalk between data signals. The cost is very complex SerDes (serializer/deserializer) blocks on chip to decode all this.
These damn things were always causing problems.
You are talking about the i9-9900K die shot. Yes, in that case a significant amount of die area is for graphics and media. However, with the Intel Xeon and IBM Power9, there is no space devoted to those items. Those are considered high-end workstation or server-class CPUs, where (a) an integrated GPU is not needed or (b) an external GPU would be used.
That was the conventional design wisdom that kept Intel’s Xeon from having an integrated GPU and the associated media logic: in other words, why waste all those transistors or let them sit idle?
In actual practice it was possibly a mistake, especially on workstations. Those frequently handle compressed video including H264, H265, Google’s VP8, VP9 and AV1. Those are all highly compute-intensive and a regular GPU using normal APIs like CUDA or OpenCL cannot meaningfully accelerate video encode or decode of those formats. This created situations where Xeon-powered workstations could be outperformed by i5-powered laptops on common video tasks.
If you ever watch any video on your computer - whether that is Youtube, Vimeo, Facebook, Instagram, or almost any web site - it is usually using a compressed “long GOP” codec like H264. In all those cases the CPU functional block labeled “graphics and media” is not wasted but actively employed. A general-purpose external GPU cannot take over those functions, because they require dedicated “fixed function” hardware logic.
Some external GPUs have similar video acceleration features in a totally separate functional block, accessed by a totally separate API. Nvidia’s are called NVDEC and NVENC; AMD’s are called UVD and VCE. None of those work as well as Intel’s Quick Sync, and they are not widely used by developers.
Not just that, but individual chips are optimized to be just hot enough to meet timing requirements.
The higher the voltage you use, the faster the chip will run (within limits, of course), but the more power you use and the more heat you generate. Chips in the fast corner of a process are called hot. So what you do is take a chip during testing and keep reducing the voltage until it fails a timing test. You write the last good voltage somewhere on-chip, and when the chip is in the system, the programmable power supply provides just enough voltage for the speed required. So 8 chips on a CPU board might all be running at different voltages.
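The tester loop described above can be sketched like this. It is a toy model: real production test flows, step sizes, and guard-banding are far more involved, and `passes_timing` here stands in for the actual tester callback:

```python
def find_min_voltage_mv(passes_timing, start_mv=1100, step_mv=10,
                        floor_mv=600):
    """Step the supply down in millivolt increments until the timing
    test fails; return the last voltage that passed (the value that
    would be recorded on-chip).  `passes_timing` is a tester callback
    taking a voltage in millivolts and returning True/False."""
    mv = start_mv
    while mv - step_mv >= floor_mv and passes_timing(mv - step_mv):
        mv -= step_mv
    return mv

# A pretend die that needs at least 840 mV to meet timing:
print(find_min_voltage_mv(lambda mv: mv >= 840))  # -> 840
```

A faster die passes at lower voltages and gets a lower recorded setting, which is exactly why neighboring chips on the same board end up running at different supply levels.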