Are microprocessors approaching a speed limit?

Microprocessors are getting faster, but the rate of increase is dropping rapidly. Is there any new technology on the horizon that will significantly speed up my next computer?

Some of the applications that I write are CPU intensive. Back in the days of uniprocessors, they would keep the CPU at over 95% for several minutes. Now, with four CPUs (i.e. cores), my CPU usage is pegged at 25% when I run them. Since my programs can only run single-threaded, adding more processors has little effect. I tie up one processor, and I only need one other processor to handle the mundane stuff (like Firefox, Word, compilers, etc.).

Between 1988 and 2003, processor throughputs were increasing by over 40% per year. Between my 2003 desktop and my 2011 desktop, I realized a gain of 15% per year in processor throughput. Since 2011, the best I can do is 7% per year. Are we approaching zero?
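To put those percentages in perspective, here's a quick compounding calculation (just the rates I quoted above, nothing measured):

```cpp
// Rough sanity check of the growth rates quoted above (illustrative only).
// Compound annual improvement over a 10-year span at each rate.
#include <cmath>
#include <cstdio>

int main() {
    const double rates[] = {0.40, 0.15, 0.07};    // ~1988-2003, ~2003-2011, post-2011
    for (double r : rates) {
        double tenYear = std::pow(1.0 + r, 10.0); // total speedup after 10 years
        std::printf("%2.0f%% per year -> %.1fx in a decade\n", r * 100.0, tenYear);
    }
    return 0;
}
// Output: 40% -> ~28.9x, 15% -> ~4.0x, 7% -> ~2.0x per decade.
```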

Disk storage, computer memory, and cache memory all seem to be increasing at better than 20% per year; why can’t processor speeds keep up?

I think this is your problem - processors have become faster by going to multi-core designs, but you aren’t taking advantage of it.

There are a few things going on.

First, processors have generally been getting faster lately by adding more cores. Most applications take advantage of the multiple cores, and therefore most applications run a lot faster. This has made computers in general a lot faster. It’s only the rare single-threaded programs like yours that haven’t seen much benefit.

Second, tablets and phones have been where the money is for the past several years, so a lot of the R&D money has gone into making those faster, and less effort has been made to make desktop computer processors faster.

Most computers run more than fast enough for what most folks do with them. It’s mostly games that need more speed, and they get increased speed from both multiple cores and from better graphics processors.

There are also folks that do a lot of high-end number crunching. A lot of those folks have switched to using graphics cards as vector processors. You can make a pretty cheap super-computer of sorts by taking a fairly normal PC and stuffing a bunch of graphics cards into it.

Is there some reason you can’t make your program multi-threaded?

Basically heat. You can’t drive transistors fast without generating more heat, and eventually you can’t get rid of the heat fast enough to avoid damaging your chips. Also, with smaller transistors, current can leak through the gates if the applied voltage is too high, as it would need to be for faster switching times.

As others have said, they are keeping up, but they are doing so by increasing core counts and pipeline width. To take advantage of this, you need to rewrite your applications to be multi-threaded.
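As a rough sketch of what that rewrite looks like (this is a generic example, not the OP's actual workload, and the crunch() function is just a made-up stand-in), standard C++ lets you split independent iterations across cores:

```cpp
// Minimal sketch of splitting a CPU-bound loop across cores with std::thread.
// Assumes the work items are independent (no shared mutable state between them).
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one unit of CPU-intensive work.
double crunch(std::size_t i) { return std::sqrt(static_cast<double>(i)) * std::sin(i); }

int main() {
    const std::size_t n = 10'000'000;
    std::vector<double> results(n);

    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;                    // fallback if the count is unknown

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < cores; ++t) {
        workers.emplace_back([&, t] {
            // Each thread handles a contiguous slice of the index range.
            std::size_t begin = n * t / cores;
            std::size_t end   = n * (t + 1) / cores;
            for (std::size_t i = begin; i < end; ++i)
                results[i] = crunch(i);
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```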

You can’t just increase the core clock speed, as was done in the 80s and 90s. As you increase clock speed, you need to find a way to keep your processor from overheating. You can decrease the transistor sizes and voltages. But as you lower sizes and voltages, quantum effects such as leakage current start to become significant and reduce the reliability of your transistors. Alternatively, you can increase the size of your processors to spread the transistors over a larger area. However, semiconductor wafers have a certain density of imperfections per unit area, so if you make your CPU larger, you increase the likelihood of a serious imperfection appearing in any given processor you manufacture, and you have to toss out a larger percentage of your processors when they fail testing.
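To put a number on the yield point, the simplest textbook model treats defects as randomly scattered over the wafer, so the fraction of good dies falls off exponentially with die area. The defect density below is an invented figure, purely for illustration:

```cpp
// Idealized Poisson yield model: yield = exp(-defect_density * die_area).
// The defect density here is an invented illustrative figure, not real fab data.
#include <cmath>
#include <cstdio>

int main() {
    const double defectsPerCm2 = 0.2;                 // hypothetical defect density
    for (double areaCm2 : {1.0, 2.0, 4.0}) {          // small, medium, large die
        double yield = std::exp(-defectsPerCm2 * areaCm2);
        std::printf("%.0f cm^2 die -> %.0f%% of dies are defect-free\n",
                    areaCm2, yield * 100.0);
    }
    return 0;
}
// With these numbers: 1 cm^2 -> ~82%, 2 cm^2 -> ~67%, 4 cm^2 -> ~45%.
```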

Heat is the first reason. Faster, very small transistors are leaky, and use more power as well as generate more heat. What we call hot lots are faster but also draw more current. It is common practice, for servers at least, to regulate the voltage of each processor so it is just fast enough for the system. If the system can only handle a 2.5 GHz processor, running it at 2.8 GHz does not help much, so you can reduce the voltage, which slows it down and decreases heat and power.

Second is design complexity. Since the easy ways of improving speed through lookahead and pipelining have already been done, big improvements will require bigger and more complex cores. More complex engines mean more area and, more importantly, more design time and more debug and bring-up time. Being slightly slower than a competitor is bad; being six months or a year late to market is a disaster. So instead of redoing the core of the processor, you bring chipsets onto the chip, you do some minor tweaks, and you may add some support for commonly used operations. Much safer.

For a single CPU we’re hitting some practical limits. That doesn’t mean there won’t be continued improvement, it’s just for now that increasing the number of CPUs on a chip gives a better return for the money. Once software is better adapted to take advantage of multi-processing there won’t be much need for more throughput through a single processor. Right now, and as always, computers are more limited by I/O than CPU.

engineer comp geek covered a lot of it well. Most processor-intensive applications benefit to some degree, often a significant degree, from parallelization. Most modern games are designed to put more of the load on the GPU rather than the CPU, and what use they make of the CPU is still typically parallelized to some degree. Obviously, any computationally intensive problems, like simulations, are designed in such a way as to make use of as many processors as one is willing to grant them. I’m unsure what the state is now, but I do know that there were compilers in development several years ago that could do some advanced analysis and start doing some optimization for multiple processors.

So I have to echo his question: what about your problem makes it so it HAS to be a single thread? Even if a significant portion of the problem is serial (say an 80-20 split between parallelizable and serial work), you may still be able to see some noticeable speed-up by parallelizing the portions of your code that you can. For example, if you’re crunching a lot of data, have threads specifically designed to manage reading and writing the data so that your primary processing thread doesn’t have to wait on disk reads. You may also benefit from more specialized processors that are designed for your purposes, or perhaps some other hardware would help, like faster RAM, an SSD, etc.
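The 80-20 case is exactly what Amdahl's law describes; here's the arithmetic with the 80% parallel fraction from above and some example core counts (nothing about your actual program assumed):

```cpp
// Amdahl's law: speedup = 1 / ((1 - p) + p / n), with p the parallel fraction.
// Uses the 80/20 split mentioned above; core counts are just example values.
#include <cstdio>

int main() {
    const double p = 0.80;                        // fraction of the work that parallelizes
    for (int n : {2, 4, 8, 16}) {
        double speedup = 1.0 / ((1.0 - p) + p / n);
        std::printf("%2d cores -> %.2fx speedup\n", n, speedup);
    }
    return 0;
}
// Roughly: 2 cores -> 1.67x, 4 -> 2.5x, 8 -> 3.33x, 16 -> 4x (capped at 5x as n grows).
```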
And also like he mentions, the consumer market in general is pushing toward multiple cores for a lot of reasons. Having a handful of slower cores rather than a couple of really fast ones has a lot of advantages: they tend to be cheaper, draw less power, and generate less heat for comparable processing power. All of this is imperative considering the focus on mobile processing, where power and heat are HUGE problems. Also consider that a typical user is generally either running a few applications that are easily parallelized or forgiving about responsiveness, or running a lot of applications at once and thus easily taking advantage of many cores.
And to address the question about a speed limit: aside from that refocus, we are approaching the theoretical maximum of what Moore’s law models. First, the speed of light is a problem for system designs. Even as fast as transistors are, as clock speeds go up, those signals can only go so far. Consider that the speed of light is ~3x10^8 m/s, so even with a SLOW processor at 1 GHz, light can only travel ~0.3 m, or about 1 ft, per clock cycle. As clock speeds tend to be higher than that, you now have issues of needing multiple cycles just to send and receive signals to and from other components. One can design a system to mitigate this, but having all those high-speed components near each other increases the heat problems again.
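If you want to see the distance budget at other clock speeds, it's a one-liner (vacuum speed of light; real signals on a board or chip are slower still):

```cpp
// How far light travels in one clock cycle at various clock rates.
// Uses the vacuum speed of light; signals in copper/silicon are noticeably slower.
#include <cstdio>

int main() {
    const double c = 3.0e8;                         // speed of light, m/s (approx.)
    for (double ghz : {1.0, 3.0, 4.0}) {
        double metersPerCycle = c / (ghz * 1.0e9);
        std::printf("%.0f GHz -> %.1f cm per cycle\n", ghz, metersPerCycle * 100.0);
    }
    return 0;
}
// 1 GHz -> 30 cm, 3 GHz -> 10 cm, 4 GHz -> 7.5 cm per cycle.
```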

The second, and probably more pertinent, issue is how small transistors have become and the quantum effects that come with that. Transistors are getting smaller and smaller, and as they do, they start to reach a point where electrons can actually tunnel across the transistor and thus cause major issues with reliability. As they get smaller, we need new materials to maintain reliability, and these are getting increasingly difficult to find and manufacture. As it is, a transistor is something like 50 atoms across IIRC, so we’re approaching the theoretical minimum size anyway, even without accounting for quantum effects. The other obvious issue is that as they get smaller and smaller, they get more and more difficult to build, test, and design. A modern CPU can have on the order of 5 billion transistors. Moore’s law is already bending; it’s going to break down completely in the not too distant future.
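A back-of-the-envelope check on that "50 atoms" figure, using the silicon-silicon bond length of roughly 0.235 nm (and keeping in mind that modern node names no longer correspond exactly to a physical dimension):

```cpp
// Back-of-the-envelope: how many silicon atoms span a given feature width.
// Atomic spacing is the approximate Si-Si bond length; "node names" are only labels.
#include <cstdio>

int main() {
    const double atomSpacingNm = 0.235;             // ~Si-Si bond length, nm
    for (double featureNm : {14.0, 10.0, 7.0}) {
        std::printf("%.0f nm feature -> roughly %.0f atoms across\n",
                    featureNm, featureNm / atomSpacingNm);
    }
    return 0;
}
// 14 nm -> ~60 atoms, 10 nm -> ~43, 7 nm -> ~30 (order of magnitude only).
```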

Another possibility for speed-up, even if one doesn’t want to get into writing multi-threaded code, is to improve locality of reference by putting variables that are used together closer together in memory. Since core clock speeds have stopped going up, cache sizes and depths have gone up considerably, to the point where re-arranging memory so the thing you need next is already in the cache is a significant speed up.
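A small sketch of what "putting variables that are used together closer together" can look like in practice (the struct and field names are invented for the example): split the data a hot loop actually touches from the cold data it rarely needs, so each cache line fetched is full of useful bytes.

```cpp
// Illustrative hot/cold data split (types and fields are invented for the example).
// The hot loop only touches position and velocity, so packing those together
// means each cache line fetched is full of data the loop actually uses.
#include <vector>

struct ParticleHot {            // fields used every timestep, kept contiguous
    float x, y, z;
    float vx, vy, vz;
};

struct ParticleCold {           // rarely touched metadata, stored separately
    char name[64];
    long creationTime;
};

void step(std::vector<ParticleHot>& hot, float dt) {
    for (auto& p : hot) {       // streams through memory sequentially, cache-friendly
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}
```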

Or write in assembler. :slight_smile: I’m betting that for almost all people the cache logic is going to do a better job.

Sometimes you just need good single-thread performance. A professor I know didn’t even want free computers from us because, though they had multiple cores, he said our single-thread performance sucked. Which it did.

BTW, we have a few process nodes left before we run out of steam. But due to economics, a lot of companies are skipping nodes and building parts on the previous node rather than the latest and greatest one, where you have to work with your fab to get everything more or less right during bring-up. The longer time between new process nodes these days is more business than technical.

I kind of wonder if the speed-up curves of the various things over time are just skewed early or late. I’d argue that processor speeds accelerated way out of proportion to everything else in the box, due to the clock speed being used as a marketing tool.

So things like hard disks, memory, and I/O methods are finally catching up now that processors have hit a sort of wall. That’s what it seems like to me- I can’t remember a time in the past couple of decades when I was seriously processor-bound, until relatively recently, oddly enough, and that’s mostly when I run image processing and conversion software.

Usually it was some combination of not quite enough RAM (when it was expensive) coupled with slow hard disks, resulting in use of virtual memory that bogged the whole shebang down. Later on, it was outmoded GPUs or video cards. In either case, RAM/disk or GPU inadequacy, the CPU was generally waiting on them, not the other way around.

I normally write in C++, and have encountered situations where thinking about locality gave significant performance gains. Just stuff like using arrays when it would make more sense to use linked lists otherwise, or passing things by value instead of by reference even though it is a bit more copying, so when there is a big batch operation stuff isn’t spread all over memory.
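For the array-versus-linked-list point, a minimal comparison looks like this (no timing numbers claimed; the gap depends entirely on the machine and data size): the vector walks memory sequentially while the list chases pointers scattered across the heap.

```cpp
// Contiguous vs. pointer-chasing traversal; the work is identical, only layout differs.
// No timing numbers claimed here -- the gap depends on the machine and the data size.
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

template <typename Container>
double timedSum(const Container& c) {
    auto start = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> ms = stop - start;
    std::printf("sum=%lld in %.2f ms\n", sum, ms.count());
    return ms.count();
}

int main() {
    const int n = 5'000'000;
    std::vector<int> vec(n, 1);                    // contiguous: prefetch-friendly
    std::list<int>   lst(vec.begin(), vec.end());  // node per element: scattered in memory
    timedSum(vec);
    timedSum(lst);
    return 0;
}
```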

The point is, since clock speeds stopped going up, putting more cores on a chip is one thing that processor-makers have been doing to eat up the extra transistors that Moore’s law keeps providing. The other, just as significant but not as well-known thing, is increasing the size and complexity of caches.

There are a few cases where writing in assembler makes sense too. But not everyone is going to do better. Malloc’ing tiny chunks is going to hurt.

I am a processor maker. New process nodes do not mean you automatically blow up cache size. Big caches have a cost - they increase chip size which means fewer per wafer, and lower yield, which drives up the cost. Big caches increase manufacturing test time. Big caches mean that you need more redundant rows and/or columns for repair. Is the added cost worth it for your target market?

Same thing with cores. We all know how performance levels off as more processors are added (with a few exceptions.) If you are designing a processor for the cloud which will be processing lots of simple Google search requests you might want more cores/threads than if you are building one for consumer use.
It’s why the architects do so many simulations.

I can see a future consideration of cache vs. cores for larger machines breaking out of single CPU chips with dedicated memory buses. Some size of cache will keep some number of cores operating at full capacity, so when that core limit is reached and CPU modules need multiple CPU chips, the real estate may be better used for cache than for more cores. But that assumes there’s a practical economy in that much tightly coupled parallel processing to start with.

ETA: “dedicated memory buses” was just a random thought there. Actually, that would decrease some of the cache requirements if the memory bus isn’t hitting maximum bandwidth.

Basically, microprocessors are successions of logic gates (transistorized ANDs, ORs, flip-flops, etc.). With each clock cycle, the contents of a particular logic circuit may cause the next gate to change state. (Very simplified.)

So the limiting factors are fairly obvious - how fast can the clock go? How fast can a transistor change state? How long is the distance from one logic circuit to the next? (Resistance can slow electricity, but the theoretical upper maximum is of course the speed of light.)

I once saw Admiral Grace Hopper (“Grandma Cobol”) on the Dave Letterman show. One of the items she brought was picoseconds. Actually, it was a packet of pepper, but her point was - this is how far electricity travels in a picosecond.

So-

Transistor speed is getting faster as transistors get smaller. But obviously, the theoretical limit is the minimum size of a transistor - too small, and there are not enough atoms for the transistor effect. (Also, too small and random cosmic ray particle collisions can flip the transistor.) In fact, the real problem is the technique for fabricating transistors this small. The separation distance between elements on a chip is on the order of nanometers (IIRC, there was an item about 7 nm fabrication in the news recently).

Smaller transistors means shorter distance between elements. Plus, part of the design is to move the elements - registers, bus, gates to bus, ALU (arithmetic logic unit) etc. - so that the typical paths are as short as possible. Shorter distance - less delay in transferring data between elements.

As more of the components are moved from separate support chips to the single chip, this eliminates a lot of fairly long (many centimeters) path that the signal has to traverse - and also means the signal does not need as much power (voltage, current) to function.

We also read about stacked or multi-layer chips, where the pieces of the processor are layered in 3 dimensions rather than plain 2 dimensions. But then, fabrication processes and removing heat are issues.

Each time a transistor changes state from on to off, it generates heat. The determining element is how much power is involved. IIRC, the defining factors are the size, the amount of current flowing, and the speed of the change of state. Hence the answers above about heat: change a transistor’s state (on-off or off-on) too many times a second, and even a low-power transistor will generate heat.
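The textbook relation behind this is that switching power grows roughly with the switched capacitance, the square of the supply voltage, and the clock frequency. A toy calculation with made-up numbers shows why frequency is expensive:

```cpp
// Toy calculation of dynamic switching power: P ~ alpha * C * V^2 * f.
// All numbers are invented for illustration; real chips add leakage and other terms.
#include <cstdio>

int main() {
    const double alpha = 0.1;        // activity factor: fraction of transistors switching
    const double cap   = 1.0e-7;     // effective switched capacitance, farads (hypothetical)
    const double volts = 1.0;        // supply voltage
    for (double ghz : {2.0, 3.0, 4.0, 5.0}) {
        double watts = alpha * cap * volts * volts * ghz * 1.0e9;
        std::printf("%.0f GHz -> %.0f W (at the same voltage)\n", ghz, watts);
    }
    return 0;
}
// With these made-up numbers: 2 GHz -> 20 W, up to 5 GHz -> 50 W. Power grows linearly
// with frequency here, and in practice higher frequency also demands higher voltage,
// which enters as V^2, so heat grows much faster than speed.
```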

Notice that chip speed has stalled at about 4GHz or less? That seems now to be a practical limit for heat and switching speed.

So - future enhancements - still smaller transistors, more dense packing, multi-layer; fitting more and more of the entire PC on one chip.

Open task manager in Windows, see how many parallel processes are running at once. There’s a real advantage to having multiple cores, because many of these processes are relatively independent of each other. Rather than waiting a turn for one processor, why not have several?

This is the same route older computers took. IBM mainframes and the DEC VAX both hit this wall: there was only so fast processors could go, and any further increase came in very small increments, but adding multiple processors or cores made the computer that much faster.

Another trick is to offload tasks to other chips - disk I/O, graphics, networking. This is essentially a variant on parallel processing.

So have we hit a limit? No, but this particular configuration of PC processors is reaching its technical limit; the rate of increase, as noted in the OP, is slowing down. My current PC is 5 years old. 10 years ago, or 20 years ago, a 5-year-old machine was barely usable with modern software. Now, it’s not unusual. I have helped some clients who are just now dumping XP and still run Windows Server 2003.

That’s kind of mind blowing really. So, in the time that light takes to travel the few feet from my monitors to my eyeballs, the processor has cycled a few times. Have I got that right?

If you mean external memory, there are definitely dedicated external memory buses, usually SerDes, and chip sets to manage the communication between the processor and the memory. This logic is one of the things that sometimes gets imported to the chip.

A bit too simplified. The important thing is capturing a transition at the next flip-flop. The path between two flops consists of multiple gates (the size is called combinational depth) and the speed of the chip depends on how fast you can get a transition from one flop to another over the slowest path.
Before you tape out you estimate this using static timing analysis, and there is a massive effort at finding slow paths and closing timing. During bring-up you see how fast the chip actually goes, and for processors which are pushing the state of the art you find the actual slow paths using a variety of methods. I gave a paper on this last month. I can go into more detail if anyone cares.
BTW sometimes the path is too long, and you can define multicycle paths where you don’t expect the transition for two clocks. Not encouraged but sometimes necessary.
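For anyone curious how combinational depth turns into a clock limit, here's a rough illustration (all the delay values are invented round numbers, not anything from a real cell library):

```cpp
// Rough illustration of how path depth limits clock frequency.
// All delay values are invented round numbers, not figures from any real cell library.
#include <cstdio>

int main() {
    const double clkToQ_ps = 50.0;    // clock-to-Q delay of the launching flop
    const double gate_ps   = 20.0;    // delay per gate along the path (assumed uniform)
    const double setup_ps  = 40.0;    // setup time of the capturing flop
    for (int depth : {10, 20, 40}) {  // combinational depth: gates between the two flops
        double cycle_ps = clkToQ_ps + depth * gate_ps + setup_ps;
        double fmax_ghz = 1.0e3 / cycle_ps;      // 1000 ps per ns -> GHz
        std::printf("depth %2d -> min cycle %.0f ps -> fmax ~ %.2f GHz\n",
                    depth, cycle_ps, fmax_ghz);
    }
    return 0;
}
// depth 10 -> 290 ps -> ~3.4 GHz; depth 20 -> 490 ps -> ~2.0 GHz; depth 40 -> 890 ps -> ~1.1 GHz.
```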

Heh. I once spoke opposite her at a National Computer Conference. I told my (small) audience that I understood they’d rather be listening to her. I’d rather be listening to her.

That’s in the lab. 18 nm is more common, and 15 is beginning. And this is feature size, not separation. Separation is governed by design rules, which try to improve yield by making sure that expected process imperfections don’t kill the design.

Layout of a chip is called floorplanning. You need to balance long global signals with shorter local ones.

Stuff you move inside usually sits at the edge of the chip - you save on I/O, but they are not all that close to other stuff.

Well, you have experienced a couple of clock cycles. With pipelining it is not all that clear what cycling the processor means, but it usually takes a lot more than three or four clocks for an instruction to get executed.

BTW I have a die for a processor I worked on in a little plaque on my desk. (Non-working, I hope!) It is almost an inch across - but we make big dies.