A few things about CISC and RISC. The question of instruction set architecture (ISA) design goes back much further than processor “chip” design. You could make a case that the first RISC-like machine was the CDC 6600, designed by Seymour Cray. The first really popular CISC machines were the IBM System/360 family. The 6600 wiped the floor with the 360 in terms of raw compute, but there were lots of good reasons for the IBM system design. The 360 was one of the first microcoded machines, and the ability to make instruction-set-compatible machines over a large price and performance range, plus the ability to keep older IBM code running, brought IBM total dominance of the industry for over a decade.
The canonical CISC machine is the DEC VAX. The VAX 11/780 was the definition of the one million instructions per second machine.
The first generally recognised RISC machine was the IBM 801, built in 1980, but only as a research project.
After the VAX, DEC went on to create what is arguably the best RISC design we saw, the Alpha. In a seminal paper the VAX and Alpha designers wrote this: “By the mid-80’s there was a general consensus in DEC that for a given amount of CPU logic and a given technology, a RISC processor would achieve (at least) three times the performance of a CISC processor.”
There are a couple of very important points in there. This was at a time when transistors were still limited. Caches were off chip, and the amount of logic that could be devoted to the processor proper was constrained. Nowadays transistors are so plentiful that for the most part designers have run out of things to use them for.
The Alpha died when HP bought Compaq (which had previously bought DEC), and the remaining people and IP went into the sadly doomed Itanium. Many believe that the Alpha was killed by company politics - and that it could and should still have been king of the hill.
There is a big difference between the clock rate of a processor and the instruction speed. CISC machines can require a very large number of cycles to perform a single instruction. Indeed some instructions can take a hilariously long time. Many RISC machines were designed so that the machine would retire an instruction every clock cycle, even with only one pipeline.
The usual metrics are cycles per instruction (CPI) and the instruction count (IC) for a given problem. For instance, a big CISC machine like a VAX could take an average of six times as many cycles to complete an instruction as the little MIPS RISC processor. But the instructions are much richer on the VAX, so you need fewer of them. However it turns out the MIPS only needed twice as many instructions, and so was, for the same clock rate, three times faster.
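To make that arithmetic concrete, here is a tiny sketch in C of the classic performance equation, time = IC x CPI / clock rate. The absolute numbers are invented purely for illustration; only the ratios (six times the CPI, twice the instruction count) come from the comparison above.

    #include <stdio.h>

    /* Classic performance equation: time = instruction_count * CPI / clock_rate.
       The numbers below are illustrative only, chosen to match the ratios in
       the text: the CISC machine takes ~6x the cycles per instruction, the
       RISC machine needs ~2x the instructions. */
    int main(void) {
        double clock_hz = 100e6;     /* same clock rate for both machines      */

        double cisc_ic  = 1.0e9;     /* instructions for the whole problem     */
        double cisc_cpi = 6.0;

        double risc_ic  = 2.0e9;     /* twice as many, but simpler, instructions */
        double risc_cpi = 1.0;

        double cisc_time = cisc_ic * cisc_cpi / clock_hz;
        double risc_time = risc_ic * risc_cpi / clock_hz;

        printf("CISC time: %.1f s\n", cisc_time);            /* 60.0 s */
        printf("RISC time: %.1f s\n", risc_time);            /* 20.0 s */
        printf("Speedup:   %.1fx\n", cisc_time / risc_time); /* 3.0x   */
        return 0;
    }

The factor of three falls straight out of the ratios: six times fewer cycles per instruction, paid for with only twice as many instructions.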
The big marketing lie was brought to us by Intel, who managed to convince an entire generation of users that clock rate was equivalent to performance. The x86 chips had astounding clock rates because they did so little on each cycle, and needed lots of cycles to complete an instruction. Clock rate is like quoting the maximum RPM of an engine and omitting to mention the engine’s capacity.
Something that is often missing from discussions of RISC is the maxim that drove the designs: “Make the common case fast, make the uncommon case correct.” In order to do this you need to know what is actually going on, and measure it. The answers were surprising, and drove the design. For instance, procedure calls were very common, but usually with few or no parameters. The run of straight-line code executed before hitting a jump was shorter than people expected. Measuring register use was critical. It became apparent that many tasks could execute within a medium-sized set of registers, avoiding the cost of constantly moving values in and out of memory. That number of registers was much larger than what CISC machines provided, but was doable. Lastly, the cost of instruction decode on CISC was horrendous. It wasn’t just that the instructions were complex, but that they had variable lengths, some of which varied with the operand encoding, and you didn’t know where an instruction ended until you had fully decoded it. This made the processor a huge mess to design and slowed it down.
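To see why variable-length decode is such a pain, here is a toy sketch in C. The encoding is entirely made up (this is not x86 or any real ISA): with fixed 4-byte instructions the decoder knows where every instruction starts up front, while with a variable-length format it has to partially decode each instruction just to find where the next one begins.

    #include <stddef.h>
    #include <stdint.h>

    /* With fixed 32-bit instructions, the start of instruction i is simply
       4*i, so several instructions can be fetched and decoded in parallel. */
    size_t fixed_next(size_t pc) { return pc + 4; }

    /* With this made-up variable-length encoding, the length depends on the
       opcode and on an addressing-mode byte, so you cannot locate instruction
       i+1 until you have at least partially decoded instruction i - a serial
       dependency right at the front of the pipeline. */
    size_t variable_next(const uint8_t *code, size_t pc) {
        uint8_t opcode = code[pc];
        size_t len = 1;
        if (opcode & 0x80) {                 /* hypothetical: has a mode byte   */
            uint8_t mode = code[pc + 1];
            len += 1;
            len += (mode & 0x03);            /* hypothetical: 0-3 operand bytes */
            if (mode & 0x04) len += 4;       /* hypothetical: 32-bit immediate  */
        }
        return pc + len;
    }

    int main(void) {
        /* two made-up instructions: a 1-byte one followed by a 3-byte one */
        const uint8_t code[] = { 0x01, 0x81, 0x01, 0x00 };
        size_t pc = variable_next(code, 0);   /* -> 1 */
        pc = variable_next(code, pc);         /* -> 4 */
        return (int)pc;
    }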
So - critical ideas in RISC were things like fixed-length instructions, avoiding the costly business of working out how long an instruction is while trying to decode it. RISC designs also tend to provide a lot of registers. The SPARC and related machines provided so many registers that a chain of procedure calls could proceed entirely within the register set, with parameters passed in registers.
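For intuition, here is a very rough C model of the register-window idea, under the simplifying assumption of a flat circular register file with 16 visible registers per window and no overflow handling (real SPARC also has global registers, locals, and window overflow/underflow traps, none of which are modelled here).

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of SPARC-style register windows, for intuition only. */
    #define NWINDOWS 8
    #define TOTAL    (NWINDOWS * 8)   /* physical registers, used circularly */

    typedef struct {
        uint64_t phys[TOTAL];
        int      cwp;                 /* current window pointer */
    } RegFile;

    /* Each window sees 16 registers: 0-7 are its "ins", 8-15 its "outs".
       Adjacent windows overlap by 8, so the caller's outs are the callee's ins. */
    static uint64_t *reg(RegFile *rf, int r) {         /* r in 0..15 */
        return &rf->phys[(rf->cwp * 8 + r) % TOTAL];
    }

    static void call(RegFile *rf) { rf->cwp = (rf->cwp + 1) % NWINDOWS; }
    static void ret_(RegFile *rf) { rf->cwp = (rf->cwp + NWINDOWS - 1) % NWINDOWS; }

    int main(void) {
        RegFile rf = {0};
        *reg(&rf, 8) = 42;            /* caller puts a parameter in an out reg */
        call(&rf);                    /* no store to a stack frame in memory   */
        printf("callee sees %llu\n", (unsigned long long)*reg(&rf, 0)); /* 42 */
        ret_(&rf);
        return 0;
    }

The point is that a call simply advances the window pointer; the parameter never touches memory.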
Another point about the RISC designs was that they were designed so that the clock rates could be fast. Many older ISA designs had instructions with annoying interdependencies inherent in their semantics, and these made it hard to produce a design that could be freely clocked as fast as the logic could go. The Alpha, when it came out, clocked about three times as fast as the equivalent x86 made on the same silicon process. This was a result of the clean slate and very careful choices in the ISA. This is actually a critical problem that is still with us. It isn’t enough just to have lots of transistors. Getting signals across the chip is not instantaneous, and the time it takes to get a signal even to a nearby logic block becomes a constraint on the clock speed. This is getting worse, not better, as transistor densities increase. Famously, the mega-chip of all CISC efforts, the Intel iAPX 432, was doomed by just this problem.
Another point about the difference between RISC and CISC is that almost every trick and feature we see in x86 machines can be applied to RISC as well. The MIPS R10000 had out-of-order instruction issue in 1996. As more transistors became available, RISC processors provided greater intrinsic parallelism, becoming superscalar, with multiple instructions in flight in parallel.
Similar techniques were used in the x86. What is interesting is that the NetBurst architecture decoded the x86 instructions and translated them into an intermediate instruction set - one that Intel never revealed, but which was rumoured to look a lot like MIPS. It maintained a cache of these instruction translations. NetBurst also had a very long pipeline to execute the internal instructions; that long pipeline was how they got the clock rate up. However it was very power hungry and there were issues getting it to go much faster. The Core series machines actually have a shorter pipeline (nearly half the length) and a much simplified instruction decode.
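As a sketch of what “cracking” a CISC instruction into internal RISC-like operations might look like, here is a toy example in C. Intel has never published its internal micro-op format, so the instruction chosen, the micro-op names and the layout here are all invented.

    #include <stdio.h>

    /* Toy sketch of cracking a CISC instruction into RISC-like micro-ops.
       Everything here is made up; it is not Intel's internal format. */
    typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } UopKind;

    typedef struct {
        UopKind kind;
        int     dst, src, addr;   /* register / temporary numbers */
    } Uop;

    /* Something like "add [mem], reg" (a read-modify-write on memory) becomes
       a little three-instruction RISC program using an internal temporary. */
    static int crack_add_mem_reg(int addr_reg, int src_reg, Uop out[3]) {
        int tmp = 100;                                   /* internal temp reg */
        out[0] = (Uop){ UOP_LOAD,  tmp, 0,       addr_reg };  /* tmp = mem[a] */
        out[1] = (Uop){ UOP_ADD,   tmp, src_reg, 0        };  /* tmp += src   */
        out[2] = (Uop){ UOP_STORE, 0,   tmp,     addr_reg };  /* mem[a] = tmp */
        return 3;
    }

    int main(void) {
        Uop uops[3];
        int n = crack_add_mem_reg(5, 6, uops);
        printf("cracked into %d micro-ops\n", n);
        return 0;
    }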
This brings us to a critical point about x86, speed, and CISC. The balance between compiler and ISA has not gone away. The x86 designers adopted the maxim “make the common case fast, make the uncommon case correct.” For x86 this means making the common instructions, and just as importantly the common addressing modes, fast, and simply making the rest of the huge instruction set correct, in case there is some old code that needs it. The compiler writers get to know that simple instructions and simple addressing modes go really fast, and to avoid the complex stuff. And this sets up a feedback loop. The common case in code is the simple instructions, so making them really fast is the right answer, and as the uncommon instructions become even less common, the designers can devote more resources to the common cases without worrying that the uncommon stuff matters. Looking at the Intel x86 manuals can be enlightening - some instructions are clearly to be avoided.
Finally of course the x86 has grown hugely, in two ways. It has grown a massive SIMD bolt-on - an entire new ISA that operates totally differently from the base x86 ISA. Indeed it is much more closely aligned with something Seymour Cray would have created. Secondly, the x86 went to 64 bits - not just making the registers wider, but doubling their number, making the ISA much more RISC-like. In fact the performance improvement gained from just the additional registers is about 50% on many codes. The additional data width is actually a performance penalty, as you lose a lot of cache to pointers that have doubled in width. So much so that there is an effort (the x32 ABI) to create a code standard with 64-bit data but only 32-bit-wide addresses, which can take advantage of the extra registers while maintaining the cache utilisation advantages of 32-bit addresses.
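The cache cost of wider pointers is easy to see with a pointer-heavy data structure. A minimal C sketch, assuming a typical LP64 ABI and 64-byte cache lines (exact sizes depend on the ABI and padding rules):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustration of the cache cost of wider pointers; the node layouts
       below are hypothetical. */
    struct node32 {             /* a list node with 32-bit pointers          */
        uint32_t next;          /* pointer stored as a 32-bit address/offset */
        uint32_t value;
    };                          /* 8 bytes: 8 nodes per 64-byte cache line   */

    struct node64 {             /* the same node with native 64-bit pointers */
        struct node64 *next;    /* 8 bytes                                   */
        uint32_t value;         /* 4 bytes of tail padding keep it aligned   */
    };                          /* 16 bytes: only 4 nodes per cache line     */

    int main(void) {
        printf("32-bit-pointer node: %zu bytes\n", sizeof(struct node32));
        printf("64-bit-pointer node: %zu bytes\n", sizeof(struct node64));
        return 0;
    }

The node doubles from 8 to 16 bytes, so half as many fit in each cache line - which is exactly the loss the 32-bit-address scheme tries to claw back.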
The x86 does gain in some interesting ways. Eventually enough logic has been thrown at the instruction decode problem that it isn’t a bottleneck, so now the x86 gains the performance advantages inherent in its more compact program encoding (mostly instruction-cache locality).