- Your experience is not typical
- I’m a very impatient person. 10 seconds is too long. It is technically possible to make Photoshop load in under 1 second if it were architected differently and the user’s computer had a multicore CPU and a high-performance SSD.
My experience isn’t typical because:
- I use OS X
- I know how to use my machine.
Most users are still stuck in the '90s, where quitting, relaunching, and rebooting were the norm.
Not me.
Does OS X have reboot-free updates? That’s been technically possible with Linux for decades now. Don’t get too smug - from what I’ve seen, software ported to Mac tends to be slower because less money is spent optimizing it.
Not for OS updates, but so what?
Spending 30 seconds to reboot my machine every 6 months is no big deal.
It’s waiting for images to load from an SD card, or files to download from the Internet that is the real time waster, and neither of those is CPU-bound.
Do you mean OS reboot times or program load times? Yes, doing updates without rebooting is very possible - my old Sun workstation had to be rebooted twice a decade or so.
I’d suspect initial program load times are a function of the initial working set of the program and how much initialization it does, as well as the OS.
When the program is running, the processor’s instruction cache keeps the code that is currently running (usually a small set) in the processor itself. If you start on a totally new section, the iCache gets effectively flushed and things take longer. For games I’d bet it is loading new stuff from disk that really takes forever.
Actually, the application is the main bottleneck. I’ve worked on some which are easy to parallelize - the sub-jobs are relatively independent - but they hit the wall because one sub-job might take significantly longer than the others, which limits the entire run to the speed of that piece. You can go much faster not by improving your parallelization techniques but by screening out the problem sub-jobs.
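To put rough numbers on that, here’s a minimal sketch (the sub-job times are invented purely for illustration): seven quick sub-jobs and one slow one, spread over eight processors.

```c
#include <stdio.h>

int main(void) {
    /* Eight sub-jobs on eight processors: seven quick ones and one slow one.
       These times are made up for illustration. */
    double subjob_seconds[8] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0};
    double serial = 0.0, wall = 0.0;

    for (int i = 0; i < 8; i++) {
        serial += subjob_seconds[i];      /* one processor: run them all back to back */
        if (subjob_seconds[i] > wall)
            wall = subjob_seconds[i];     /* eight processors: wait for the slowest */
    }

    printf("serial   : %.1f s\n", serial);                       /* 17.0 s */
    printf("parallel : %.1f s\n", wall);                         /* 10.0 s */
    printf("speedup  : %.2fx on 8 processors\n", serial / wall); /* ~1.7x  */
    return 0;
}
```

Screen out (or split up) that one 10-second sub-job and the same eight processors get you close to an 8x speedup.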
You also have a problem with communication and synchronization. We’d love to do circuit simulation in parallel, one processor for each chunk of the design. But the communication costs kill you, so instead of running one big test case on 1,000 machines each with 1/1000th of the design, you run 1,000 test cases on 1,000 machines each with the entire design.
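A back-of-envelope comparison makes the point; all the per-step costs below are invented round numbers, since the real figures depend entirely on the simulator and the network.

```c
#include <stdio.h>

int main(void) {
    double steps        = 1e6;   /* timesteps per test case (assumed)                       */
    double step_compute = 1e-3;  /* seconds to simulate the whole design for one step       */
    double comm_latency = 1e-4;  /* seconds of boundary exchange per step when partitioned  */
    int    machines     = 1000;

    /* Option A: partition one test case across all machines.
       Each step costs 1/1000th of the compute plus a communication phase. */
    double partitioned = steps * (step_compute / machines + comm_latency);

    /* Option B: replicate the whole design and run 1000 independent test cases,
       one per machine, with no communication at all. */
    double replicated = steps * step_compute;

    printf("partitioned: one case in %.0f s, so 1000 cases take %.0f s\n",
           partitioned, partitioned * machines);
    printf("replicated : all 1000 cases done in %.0f s\n", replicated);
    return 0;
}
```

With these made-up numbers the communication phase dominates the partitioned run, and replication gets roughly 100 times the throughput.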
If you have ever taken a course on parallel architectures, you know the way the processors communicate is vital. That’s been true since I took one from Dave Kuck 40 years ago.
Isn’t another technical factor affecting performance, independent of transistor density, the use of RISC vs CISC instruction sets? I.e., fewer and smaller instructions (lots used) vs more and bigger instructions (fewer used). At least I hope I got that mostly right.
Apparently, RISC won more than a decade ago. Intel chips are only CISC on the surface - they actually translate everything to an internal RISC machine. This is why Intel chips are not as power-efficient as ARM, apples to apples: the translation logic consumes power, while ARMs are natively RISC.
In theory, a clock-less microprocessor could out-perform any current design by many times. But, like fusion reactors, it always seems to be the technology of the future.
Some discussion here.
Well, Intel still uses microcode - and RISC came out of the microcode world. IBM, which truly invented RISC, had major microcode work, and Dave Patterson did his PhD on microcode verification. But VLIW also came out of the microcode world.
In fact, RISC machines like SPARC don’t use microcode. One of the advantages of RISC is that the instruction set was made simple enough to implement directly in hardware. But today’s RISC machines are hardly simple, since you still need lots of logic to optimize instruction sequencing and your pipeline.
Ah, the CISC vs RISC thing is mitigated by throwing more transistors at it.
Moore’s Law was about the practical speed of transistors, and the number of transistors, on the same silicon.
Your point is more about transistor efficiency; you could measure that as teraflops per million transistors, or something like that. Intel mitigates the performance issues of 80x86 CISC by translating to RISC and then running the RISC instructions in parallel, so the core can start, work on, and complete 2, 3, or 4 instructions per cycle. (I don’t mean the instructions all start and finish in ONE cycle - they take a number of cycles - but in any given cycle up to 4 are starting and up to 4 are finishing.)
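You can feel that effect from ordinary code. Here is a rough little experiment of my own (nothing Intel-specific; results vary with compiler and CPU, and you’d build it with something like cc -O2 ilp.c): the same number of additions runs noticeably faster when they are independent enough for the core to keep several in flight at once.

```c
#include <stdio.h>
#include <time.h>

#define N 100000000L   /* number of additions in each test */

int main(void) {
    clock_t t0, t1;

    /* One long dependency chain: every add must wait for the previous result. */
    double a = 0.0;
    t0 = clock();
    for (long i = 0; i < N; i++)
        a += 1.0;
    t1 = clock();
    printf("1 accumulator : %.3f s (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, a);

    /* Four independent chains: the core can have several adds in flight per cycle. */
    double b0 = 0.0, b1 = 0.0, b2 = 0.0, b3 = 0.0;
    t0 = clock();
    for (long i = 0; i < N; i += 4) {
        b0 += 1.0; b1 += 1.0; b2 += 1.0; b3 += 1.0;
    }
    t1 = clock();
    printf("4 accumulators: %.3f s (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, b0 + b1 + b2 + b3);
    return 0;
}
```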
So anyway, suppose Intel switched to RISC and thereby improved the efficiency of its transistor deployment - what would the saved transistors be used for?
The CPU already has many transistors in the caches, which means the transistors saved from the actual CPU core would not amount to a large increase in cache - and the law of diminishing returns applies: even massive increases in cache provide only mediocre returns.
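A toy average-memory-access-time calculation shows the diminishing returns; the hit time, miss penalty and miss rates below are invented round numbers, not measurements of any particular chip.

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 4.0;    /* cycles for a cache hit (assumed)      */
    double miss_penalty = 200.0;  /* cycles to go to main memory (assumed) */

    /* Hypothetical miss rates: a big cache already catches most accesses,
       so doubling it again shaves off only a sliver of the misses. */
    double miss_big     = 0.05;
    double miss_doubled = 0.04;

    printf("average access, big cache    : %.1f cycles\n",
           hit_time + miss_big * miss_penalty);       /* 14.0 */
    printf("average access, doubled cache: %.1f cycles\n",
           hit_time + miss_doubled * miss_penalty);   /* 12.0 */
    return 0;
}
```

Doubling an already-large cache moves the average from 14 cycles to 12 in this example - real, but mediocre for the transistors spent.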
You could think they’d make 8-core, 16-core, etc. CPUs? Well, it turns out that’s fine for video and DSP - but that’s already done. How can Windows make use of that?
It may be that ARM will include 80x86 compatibility some day and wean us off CISC, because we could choose to use 80x86 or RISC; or it may be that Intel adds ARM machine code to the 80x86. That would solve the two-worlds issue currently at hand.
The *performance loss* from CISC is mitigated by translating to and running RISC microcode. Translation may increase instruction latency, but the difference is obviously not much.
However, the translation circuits eat power. This is why Intel wasn’t chosen for the smartphone and mobile device wars. ARM chips are natively RISC and thus are inherently more power-efficient - also, you can integrate ARM processor IP on the same die with memory and GPUs and other ICs to cram everything into a single chip, saving space in the smartphone.
Moore’s Law is arguably feeling rather poorly right now. Intel just announced that their 10nm process was being delayed for a second time, now until 2017. That is a big slip from their earlier hopes. IBM’s 7nm process has managed a few functioning transistors, which is quite some years from the few billion needed per chip.
The RISC versus CISC thing gets confused in the retelling.
First thing to realise is that the clock rate is not the rate at which instructions are executed. There is a link, but even on simple RISC designs it is not absolute. On CISC designs the difference can be very wide. Things like floating-point operations on older designs could take tens of clock cycles to perform a single divide.
There were arguably RISC machines right back in the glory days. The CDC 6400 and 173 machines were very close to the ideal.
The big win has been pipelining, and this has been true for both CISC and RISC. Even if an instruction takes, say, five cycles to execute (the classic RISC stages: instruction fetch, instruction decode, execute, memory access, write-back), if the instructions are sufficiently independent you can have each stage of the pipeline doing its work on five separate instructions - when this happens you get an instruction per clock cycle. With two parallel pipelines, both full, you can get two instructions per cycle.
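Here is that arithmetic in its idealized form, ignoring the stalls and hazards that the rest of this post is about:

```c
#include <stdio.h>

int main(void) {
    int  stages       = 5;        /* fetch, decode, execute, memory access, write-back */
    long instructions = 1000000;

    /* No pipelining: every instruction occupies the whole machine for 5 cycles. */
    long unpipelined = instructions * stages;

    /* One full pipeline: after the 5-cycle fill, one instruction completes per cycle. */
    long one_pipe = stages + (instructions - 1);

    /* Two full, parallel pipelines: two instructions complete per cycle. */
    long two_pipes = stages + (instructions / 2 - 1);

    printf("unpipelined  : %8ld cycles (%.2f cycles per instruction)\n",
           unpipelined, (double)unpipelined / instructions);
    printf("one pipeline : %8ld cycles (%.2f cycles per instruction)\n",
           one_pipe, (double)one_pipe / instructions);
    printf("two pipelines: %8ld cycles (%.2f cycles per instruction)\n",
           two_pipes, (double)two_pipes / instructions);
    return 0;
}
```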
The old Intel NetBurst architecture had a very long pipeline of many simple stages - and could run with a fast clock rate because of that simplicity. But any individual instruction took at least the length of that pipeline to execute, and complex evil instructions could take a very long time. Interestingly, the newer Core-based designs have shorter pipelines.
Instruction decode was always a big problem with CISC - especially when some instructions were of variable length - and the processor didn’t even know how long an instruction was until it had decoded it to its end. The VAX was replete with such instructions. One of the early wins was to simply tell the compiler writers to stay well away from the poorly performing instructions. Indeed, you can only get the benefit of the newer IA32 designs by avoiding some of the ridiculous stuff. The basic mantra that drove early RISC was “Make the common case fast, make the uncommon case correct.” This works for any design, CISC as well.
Caches are the big win. As clock rates increase, the time a cache has to deliver its result drops, and the complexity of a cache has an impact on its ability to deliver. The multiple cache levels are an attempt to cope with this - with a different cache architecture at each level, each balancing efficiency against access time. One of the amusing things that has gone in favour of CISC is that it has better code density, so the instruction caches work better for the same work encoded. It has been common for a while that a process shrink is used as much to increase the area devoted to caches as anything else.
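The same point is easy to demonstrate on the data side. This is a standard sort of experiment, sketched from memory rather than taken from anywhere in particular: walk a matrix much bigger than any cache in an order that uses each cache line fully, then in an order that wastes most of it. Timings vary a lot by machine; build with something like cc -O2 cache_demo.c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* 4096 x 4096 doubles = 128 MB, far larger than any cache */

int main(void) {
    double *m = malloc((size_t)N * N * sizeof(double));
    if (!m) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1.0;

    clock_t t0, t1;
    volatile double sink;
    double sum;

    /* Row-major walk: consecutive addresses, every byte of each cache line used. */
    sum = 0.0;
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[(size_t)i * N + j];
    t1 = clock();
    sink = sum;
    printf("row-major   : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Column-major walk: each access jumps N*8 bytes, wasting most of every line. */
    sum = 0.0;
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[(size_t)i * N + j];
    t1 = clock();
    sink = sum;
    printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    (void)sink;
    free(m);
    return 0;
}
```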
There are caches for lots of other things as well. Not all are called caches, but the effect is there. Branch prediction tables keep a record of the direction a branch instruction took, so next time the processor will start speculatively executing down that branch - and thus keep the pipelines full. There are memory write-back buffers too, with various ways of snarfing the contents back to the processor.
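Branch prediction is the one of these you can feel most easily. This is my version of the well-known sorted-versus-unsorted experiment (the array size and threshold are arbitrary): the branch is identical in both runs, but on sorted data the predictor almost always guesses right. One caveat: an aggressive compiler may turn the branch into a conditional move and flatten the difference, so build with modest optimization, e.g. cc -O1 branch_demo.c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Time the same branchy loop over whatever order the data happens to be in. */
static double sum_big_values(const int *v) {
    long long sum = 0;
    clock_t t0 = clock();
    for (int pass = 0; pass < 10; pass++)
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)       /* this is the branch the predictor must guess */
                sum += v[i];
    clock_t t1 = clock();
    printf("(sum %lld) ", sum);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    int *v = malloc(N * sizeof(int));
    if (!v) return 1;
    srand(1);
    for (int i = 0; i < N; i++) v[i] = rand() % 256;

    printf("unsorted: %.3f s\n", sum_big_values(v));  /* ~50/50 branch, hard to predict */
    qsort(v, N, sizeof(int), cmp);
    printf("sorted  : %.3f s\n", sum_big_values(v));  /* long predictable runs          */

    free(v);
    return 0;
}
```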
In the end the difficulty isn’t in making a core execution engine fast - it is in keeping it fed. Things bog down when the data isn’t available to be worked on. Which means keeping it in registers as much as possible. So more and wider registers. Bigger and faster caches. An interesting experiment is to take the same code, compile it for the IA32 32-bit and 64-bit instruction sets, and run it on the same machine. Just letting the compiler use the extra (and wider) registers can get you a dramatic performance lift. The RISC guys worked that out a long time ago. However, the increase from 32-bit to 64-bit addresses means the caches are less efficient, so you may not always win as big. Depends upon the code. (Or you can try a hybrid 64-bit target ISA with the 64-bit instructions and 32-bit addresses.)
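One way to try that experiment, if you have a multilib GCC handy: build the identical source three ways and time it. The kernel below is just an arbitrary register-hungry loop I made up for the purpose; -m32, -m64 and -mx32 are standard GCC flags, though the x32 build also needs the x32 runtime libraries installed.

```c
/* Build the same file three ways and compare:
 *   gcc -O2 -m32  regs.c -o regs32    (classic IA32: 8 general registers, 32-bit pointers)
 *   gcc -O2 -m64  regs.c -o regs64    (x86-64: 16 wider registers, 64-bit pointers)
 *   gcc -O2 -mx32 regs.c -o regsx32   (x32 ABI: the 64-bit registers with 32-bit addresses)
 */
#include <stdio.h>
#include <time.h>

#define ITERS 200000000L

int main(void) {
    /* Enough live values that the 8-register IA32 target has to spill to the stack. */
    long a = 1, b = 2, c = 3, d = 4, e = 5, f = 6,
         g = 7, h = 8, p = 9, q = 10, r = 11, s = 12;

    clock_t t0 = clock();
    for (long n = 0; n < ITERS; n++) {
        a += b; c += d; e += f; g += h; p += q; r += s;
        b ^= a; d ^= c; f ^= e; h ^= g; q ^= p; s ^= r;
    }
    clock_t t1 = clock();

    printf("checksum %ld, %.3f s\n",
           a + b + c + d + e + f + g + h + p + q + r + s,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}
```

The checksum will differ between the 32-bit and 64-bit builds (long changes width); it is only printed so the compiler cannot delete the loop.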
Getting more cores onto a chip has been the way more speed has come about, but there are limits. The Intel Xeon Phi puts 61 Pentium cores on a single chip. For the right job this is great, but it isn’t trivial to program, and for the number of transistors used, isn’t always the fastest answer.
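That limit has a standard back-of-envelope form (Amdahl’s law). The 90% parallel fraction below is just an illustrative guess, not a measurement of any real workload:

```c
#include <stdio.h>

int main(void) {
    double p = 0.90;                       /* fraction of the job that parallelizes (assumed) */
    int cores[] = {1, 2, 4, 8, 16, 61};    /* 61 as in the Xeon Phi mentioned above           */

    for (int i = 0; i < 6; i++) {
        double n = cores[i];
        printf("%3d cores: %.2fx speedup\n", cores[i], 1.0 / ((1.0 - p) + p / n));
    }
    /* Even with unlimited cores the speedup can never exceed 1/(1-p) = 10x here. */
    return 0;
}
```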
What do you mean by RISC microcode? RISC machines are not microprogrammed, since they are simple enough that it isn’t necessary. And there is no translation procedure, since the microcode interprets and executes the CISC instruction set.
Microcode is, I suppose, RISC by definition, but vertical microcode is simpler than horizontal microcode.
Now, there can be translation. The earliest Itanics, at least, translated x86 instructions, which took die area and was not fast.
And of course direct logic implementation is faster than a microprogrammed implementation.