Explain why the ARM architecture is still around

So, in the 1980s, I had a BBC Micro Model B which I inherited from my brother. It had 32KB of memory and could display 16 colors (but not more than eight at any one time, IIRC.) It could run surprisingly complex games, but in modern terms it was a piece of shit. I’m pretty sure I’ve written shopping lists which had more computing power.

Today, I have a Nexus 4 phone with a 1.7 GHz quad-core processor that can generate 1080p HD video and run games whose sound effects require a hundred times more processing power than the Beeb could muster.

Here’s the thing: that quad-core processor is based on a processor that was developed using the Beeb, for its successor, the Archimedes… In fact, almost every smartphone is built around an ARM chip.

Why? Was the ARM design a masterstroke we haven’t been able to improve on? Or is it just an accident of history? I assume it has nothing to do with market domination, since Acorn computers barely made a dent in markets outside the UK.

ETA: I read the Wikipedia page and it doesn’t really explain why the ARM chip is still around, unless it was mixed in somewhere with the stuff about instruction sets and other things I don’t understand.

Why is the x86 still around? The original 8086 went to market in 1978. Because they keep making the buggers faster and more powerful, and it’s a lot less work to keep using software designed for a known instruction set than to rewrite everything every time a new processor model comes out.

The ARMs have found their niche in low-power, high-speed devices like smart phones.

Hell, they’re still making the classic MOS 6502; the same chip that powered your first-gen Nintendo and the Apple II. You can find them in control circuitry for ovens and washing machines and whatnot. They work and are time-proven, so why not?

ARM chips have always been about power optimization, something which is very important in mobile devices.

No answer is an answer without saying “RISC”.

These days it’s best said like this:
The reduced instruction set allows a lower-power-consumption CPU to be built, everything else being equal, because it uses fewer gates.
You could spend the gates left over on more cache, extra cores, or better execution of the instructions (e.g. pipelining, more execution units, stall avoidance…), but there are diminishing returns…

Yes, the point is that the original Acorn ARM was the first successful RISC chip, and RISC represents a whole different approach to chip design and function compared to the CISC chips that were used back then, and still today, in PCs.

CISC (Complex Instruction Set Computer) CPUs have some relatively complex computational operations built into the chip, so that each one can be carried out in a single instruction. RISC (Reduced Instruction Set Computer) CPUs have only a bare minimum of operation types built into the chip, and must do more complex operations in software, as a sequence of simpler instructions. This means that RISC chips can be smaller (i.e., with fewer transistors) and faster at the operations they do have, but the more complex operations that are built into a CISC chip must, when needed (which is not all the time, by any means), be done in software (down at the OS or compiler level, I guess) over several instructions. There is a complex tradeoff between computational power, speed, chip complexity and power consumption, and, no doubt, many other factors. It has turned out that RISC chips (originally designed by Acorn for raw speed, I believe, but now mainly preferred where small size and low power consumption are the main issues) are particularly appropriate to devices like phones, even though CISC has continued to dominate the PC market, where computing power is a more important consideration than size and energy consumption (and the consequent heat generation).
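
To make the difference concrete, here’s a toy sketch in Python (an invented ISA, not any real chip): the same “add one memory location into another” job done CISC-style as one rich instruction, and RISC-style as a load/load/add/store sequence.

```python
# Toy illustration only: a CISC-style memory-to-memory add as one instruction,
# versus the RISC-style equivalent, where only loads and stores touch memory.

memory = {0x10: 5, 0x14: 7}
regs = {"r1": 0, "r2": 0}

def cisc_add_mem(dst_addr, src_addr):
    # One complex instruction: read-modify-write directly on memory.
    memory[dst_addr] = memory[dst_addr] + memory[src_addr]

def risc_sequence(dst_addr, src_addr):
    # Four simple instructions: arithmetic happens only between registers.
    regs["r1"] = memory[dst_addr]         # LOAD  r1, [dst]
    regs["r2"] = memory[src_addr]         # LOAD  r2, [src]
    regs["r1"] = regs["r1"] + regs["r2"]  # ADD   r1, r1, r2
    memory[dst_addr] = regs["r1"]         # STORE r1, [dst]

cisc_add_mem(0x10, 0x14)   # one instruction on the CISC
memory[0x10] = 5           # reset, then do the same work the RISC way
risc_sequence(0x10, 0x14)  # four instructions on the RISC
print(memory[0x10])        # 12 either way
```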

I don’t suppose current ARMs have much more in common with the original Acorn ARM than current multi-core Pentium whatevers have with the 8088 chips used in the original IBM PCs. It is just that an ARM is a RISC chip and Pentiums and other, earlier PC CPUs are CISC. It just so happens that the acronym ARM has stuck around to refer to modern RISC chips. Originally it stood for Acorn RISC Machine, but now, I believe, is said to stand for Advanced RISC Machine, if anyone bothers to ask. In the same way, “Pentium” has stuck around as a brand through several generations of very different architectures of Intel CISC CPUs for PCs.

Another major factor in the dominance of the ARM chip for embedded systems was their liberal licensing terms - they worked with a variety of partners to implement ARM cores for low power and embedded System On Chip (SOC) use. It was this multilicensing that established ARM in so many products.

I think this is the key point.

A company with a need for a low power chip can license the IP and either stop there or modify it to their custom needs. They certainly can’t go to Intel and do that, so ARM proliferated.

This makes perfect sense. Thanks (and to everyone else who answered.)

If RISC was such a wonderful thing, why is ARM the only remaining “RISC” architecture worth discussing? RISC was about reducing complexity; i.e. letting the compiler try to optimize code rather than doing it at run time in silicon. Problem is, it never panned out. x86 has continually increased its performance and power efficiency while most of the classical RISC architectures stagnated and are left to niche roles. Alpha is dead, MIPS is dead apart from some legacy embedded stuff, SPARC and PowerPC are clinging desperately to life, PA-RISC is effectively dead. Intel even repeated the same mistake with EPIC/Itanium; compilers simply aren’t as good at optimizing the instruction stream as the CPU’s reordering engine is.

ARM is so popular right now because ARM Ltd. started out with small, low-power cores and (most importantly) licensed those cores to anyone who wanted to build a CPU with one.

To be honest, the failure of RISC was as much a failure of the software houses and compiler writers as it was of the chip manufacturers. I remember running x86 Windows software on Alpha - the DEC-supplied dynamic recompiler could run Word faster than a top-of-the-line x86 processor of the time could. But the compiler writers never produced a compatible Alpha compiler (specifically, Microsoft C), and Microsoft didn’t push to develop Alpha (or MIPS) releases of their flagship products. It was that decision that killed RISC.

At the time, there was talk of a semi-compiled executable standard - code would be compiled to a generic reduced form (what we now call bytecode) and then, during installation/execution, compiled to hardware-specific machine code (i.e. a JIT compiler). These days, we think nothing of using such schemes, but in the early 90s it was too far ahead of the curve - if Microsoft had adopted such a scheme when they had MIPS and Alpha NT on the go and delivered their apps (and the ability for 3rd-party developers to do the same) on RISC platforms, I believe the CPU landscape would be completely different now.
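
For what it’s worth, the scheme is easy to sketch. Here’s a toy version in Python (the opcodes and names are invented purely for illustration): a generic “bytecode” program is translated once, at install time, into host-native routines - Python closures standing in here for real machine code on x86, Alpha or MIPS.

```python
# Toy sketch of the "semi-compiled executable" idea: ship one generic bytecode,
# translate it to the local machine's code at install time, run it natively after.

GENERIC_BYTECODE = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PRINT",)]

stack = []

def translate_for_host(bytecode):
    """The 'installer': lower the generic form to host-native code (faked as closures)."""
    native = []
    for op, *args in bytecode:
        if op == "PUSH":
            native.append(lambda v=args[0]: stack.append(v))
        elif op == "ADD":
            native.append(lambda: stack.append(stack.pop() + stack.pop()))
        elif op == "PRINT":
            native.append(lambda: print(stack.pop()))
    return native

program = translate_for_host(GENERIC_BYTECODE)  # done once, at install time
for instruction in program:                     # thereafter it runs "natively"
    instruction()                               # prints 5
```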

Licensing the instruction set to others for a good price is a big reason the ARM instruction set is being used. When I started at my current company about 16 years ago, my first project was to develop a board which had our cell phone modem in debug mode connected to an ARM7 test chip. This board allowed the software team to port the phone software to ARM while the ASIC team developed a chip with ARM instead of what was then being used, the Intel 186. We did this for a few reasons: 1) licensing cost; 2) production cost - the ARM7 was smaller than the 186; 3) power.

Can you realistically call it “failure” if the goal was improbable or impossible? Intel went with out-of-order execution (with deep re-order buffers) because they figured the CPU could better figure out how to keep its pipelines full than a compiler could. Sure, they had to burn transistors to do so, but as overall transistor count has increased, that sacrifice plays a smaller and smaller role. It’s also worth noting that even now, the higher-performance ARM cores (Cortex A8, A9, A15, Tegra, Krait, and Swift) all use out-of-order execution so they’re no longer purely “RISC” either.
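
A toy model of why that matters (single-issue, invented latencies, nothing like a real scheduler): an in-order pipeline has to sit behind a load that misses in cache, while an out-of-order one can run independent work in the meantime - something a compiler can’t fully plan for, because it doesn’t know at compile time which loads will miss.

```python
# Each instruction: (name, latency in cycles, registers it reads, register it writes).
program = [
    ("load r1", 10, set(),  "r1"),   # cache miss: long latency
    ("add r2",   1, {"r1"}, "r2"),   # depends on the load
    ("mul r3",   1, set(),  "r3"),   # independent work
    ("mul r4",   1, {"r3"}, "r4"),   # independent of the load chain
]

def run(program, out_of_order):
    ready_at = {}                    # cycle at which each register becomes available
    pending = list(program)
    cycle = 0
    while pending:
        for i, (name, lat, deps, dst) in enumerate(pending):
            if all(ready_at.get(r, 0) <= cycle for r in deps):
                ready_at[dst] = cycle + lat   # issue one instruction this cycle
                pending.pop(i)
                break
            if not out_of_order:
                break                # in-order: cannot look past a stalled instruction
        cycle += 1
    return max(ready_at.values())

print(run(program, out_of_order=False))  # 13 cycles: the mults queue behind the stalled add
print(run(program, out_of_order=True))   # 11 cycles: the mults run during the load's miss
```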

Hardly fair, the Alpha 21064 was wider and a lot faster even in terms of raw clockspeed than anything Intel had to offer. The 21064 was launched at 192 MHz, the Pentium launched months later at 66 MHz. Alpha was literally a beast when it arrived.

Windows NT ran on Alpha; when I was at university the CAD labs were using Alpha workstations running a mix of NT 3.51 and NT 4.0.

And I take issue with your assertion that any decision by Microsoft “killed RISC.” RISC killed RISC. Y’know why? Because ideological purity must give way to the real world.

Microsoft, like any public corporation, is obligated to go where the money is. The money was with Intel. They had no obligation to prop up architectures relegated to $30,000 workstations and servers just for the hell of it.

This is certainly one of (if not the) major reason.

It’s why your payroll check, and your bank account, and your credit card authorization today, etc. are still running on software written for the IBM 360 instruction set, developed in 1963, a full half-century ago. A large, complex software system is much harder to change than hardware.

The POWER line of RISC chips has quite a few design wins, too.

This is indeed the answer. ARM by brilliance or luck was the leader in SoC processors, when IP really took off. Low power helped, but that was a part of going to an IP core model.
The OP seems to imply that the ARM architecture isn’t good. I haven’t studied it, but I did study 8086 versus Motorola 6800, and the 8086 was a piece of crap in comparison. (I wrote a simulator for the 8086 in grad school, so I know.) I’ve seen an exhibit at the Intel Museum about selling to IBM, and I got the distinct impression that they were under no illusions about winning on technical merit.

The Itanium group considered Alpha as the real competitor - and notice Intel bought that part of DEC. Alpha and Pentium were not really in the same market at the time.

The x86 instruction set dates back to the 8086 and must maintain perfect backward compatibility with its legacy. Over the course of 35 years, many new instructions have been added, some useful but many misguided.

The problem is, even if an instruction is 20 years old and never seen in a modern program, modern chips still need to be able to decode and execute it to comply with x86. As a result, x86 processors need a lot more transistors compared to the cleaner ARM instruction set. On desktops this isn’t much of a problem, as transistors are cheap, but it became a big deal on mobile, where power is the biggest concern.

The biggest coup for ARM was that the iPhone broke backwards compatibility and motivated developers to write all-new programs. As a result, it gave the entire industry a clean slate to start from, with a relatively uncontaminated instruction set. However, as ARM evolves, it’s falling into exactly the same problems as x86, with the ARM instruction set becoming more and more complex and harder to decode.

It may be that in 30 years’ time we’ll break away from crufty old ARM into an entirely new instruction set, for exactly the same reasons.

Microsoft spent a great deal of time and money producing NT as a platform-agnostic OS - it ran on x86, Alpha, PowerPC and MIPS initially. But the only major apps that were ported natively to these platforms were high-end CAD systems (AutoCAD and the like) and specialised design tools. The only reason you could run Microsoft Office on Alpha was that DEC (who had expertise in this area from the PDP/VAX migration) provided a dynamic recompiler that mostly worked but wasn’t suitable for all applications. Given that most Windows applications at the time were written in Microsoft C using the Windows SDK, the failure of Microsoft to provide compatible compilers for PowerPC, MIPS and Alpha relegated those platforms to specific applications like CAD and similar (which were mostly portable because they were also available on Unix workstations). Microsoft didn’t even port their own server products to these platforms, thus abandoning them and all the portability work they had done in the first place.

Of course, it has come back to haunt them - they have had to keep reinventing Windows for new platforms: Windows CE/Windows Mobile/Windows RT/Windows for Itanium. All of these could have been built on a core OS that was platform-agnostic and around a common compiler suite, saving time and money for developers and detaching MS from Intel, allowing alternatives to flourish. But it didn’t happen.

Oh, it wasn’t just Microsoft - DEC/Compaq/Intel all had a part to play in the demise of Alpha. MIPS was its own worst enemy, and IBM had a different agenda for PowerPC. The hardware was too expensive initially, as these Big Iron companies did not understand commodity pricing. But the failure to capitalise on the opportunity for a multiplatform OS/development environment rests with Microsoft (IMHO). Maybe in that sort of open environment, RISC would still have failed. But it would have had a better chance.

True, but they didn’t really give the market time to decide, after investing large amounts of money (and PR) on the promise of a platform-agnostic OS. I just think it was a shame.

A few things about CISC and RISC. The question of instruction set architecture (ISA) goes back much further than processor “chip” design. You could make a case that the first RISC-like machine was the CDC 6600, designed by Seymour Cray. The first really popular CISC machines were the IBM 360 family. The 6600 wiped the floor with the 360 in terms of raw compute, but there were lots of good reasons for the IBM system design. The 360 was one of the first microcoded machines, and the ability to make instruction-set-compatible machines over a large price and performance range, plus the ability to provide legacy compatibility with older IBM code, brought IBM total dominance of the industry for over a decade.

The canonical CISC machine is the DEC VAX. The VAX 11/780 was the definition of the one million instructions per second machine.

The first generally recognised RISC machine was the IBM 801, built in 1980, but only as a research work.

After the VAX, DEC went on to create what is arguably the best RISC design we saw, the Alpha. In a seminal paper the VAX and Alpha designers wrote this: “By the mid-80’s there was a general consensus in DEC that for a given amount of CPU logic and a given technology, a RISC processor would achieve (at least) three times the performance of a CISC processor.”

There are a couple of very important points in there. This was at a time when transistors were still limited. Caches were off chip, and the amount of logic that could be devoted to the processor proper was constrained. Nowadays transistors are so plentiful that for the most part designers have run out of things to use them for.

The Alpha died when HP bought Compaq (which had bought DEC previously) and the remaining people and IP went into the sadly doomed Itanium. There is a lot of belief that the Alpha was killed by company politics - and that it could and should still have been king of the hill.

There is a big difference between the clock rate of a processor and the instruction speed. CISC machines can require a very large number of cycles to perform a single instruction. Indeed, some instructions can take a hilariously long time. Many RISC machines were designed so that the machine would retire an instruction every clock cycle, even with only one pipeline.

The usual metrics are cycles per instruction (CPI) and instruction count (IC) for a given problem. For instance, a big CISC machine like a VAX could take an average of six times as many cycles to complete an instruction as the little MIPS RISC processor. But the instructions are much richer on the VAX, and you need fewer of them. However, it turns out the MIPS only needed twice as many instructions, and so was, for the same clock rate, three times faster.
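
Spelled out with the rough ratios quoted above (illustrative numbers, not measurements, and an arbitrary clock):

```python
# Execution time = instruction count (IC) x cycles per instruction (CPI) / clock rate.
clock_hz = 100e6               # same clock assumed for both machines

cisc_ic, cisc_cpi = 1.0, 6.0   # VAX-ish: richer instructions, ~6x the cycles each
risc_ic, risc_cpi = 2.0, 1.0   # MIPS-ish: ~2x as many instructions, ~1 cycle each

cisc_time = cisc_ic * cisc_cpi / clock_hz
risc_time = risc_ic * risc_cpi / clock_hz

print(cisc_time / risc_time)   # 3.0 - the RISC machine comes out three times faster
```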

The big marketing lie was brought to us by Intel, who managed to convince an entire generation of users that clock rate was equivalent to performance. The x86 chips had astounding clock rates because they did so little on each cycle, and needed lots of cycles to complete an instruction. Clock rate is like quoting the maximum RPM of an engine and omitting to mention the engine’s capacity.

Something that is often missing from discussions of RISC is the maxim that drove the designs: “Make the common case fast, make the uncommon case correct.” In order to do this you need to know what is actually going on, and measure it. The answers were surprising, and drove the design. For instance, procedure calls were very common, but with few or no parameters. The runs of code executed between jumps were shorter than people expected. Measuring register use was critical. It became apparent that many tasks could be executed within a medium-sized set of registers, avoiding the cost of constant movement in and out of memory. This number of registers was much larger than on CISC machines, but was doable. Lastly, the cost of instruction decode in CISC was horrendous. It wasn’t just that the instructions were complex, but that they had variable lengths - some instructions had a length that varied with the operand encoding, and you didn’t know where an instruction ended until you had fully decoded it. This made the processor a huge mess to design and slowed it down.
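
Here’s a toy sketch of that decode problem in Python (invented encodings, nothing to do with real x86 or ARM): with variable-length instructions you can’t even find the next instruction until you’ve decoded the current one, while fixed-length instructions can be sliced off the stream - and handed to parallel decoders - without looking inside them.

```python
# Variable-length toy encoding: first byte is the opcode, and the opcode alone
# determines how many operand bytes follow.
OPERAND_BYTES = {0x01: 0, 0x02: 1, 0x03: 4}

def decode_variable(stream):
    pc, instructions = 0, []
    while pc < len(stream):
        opcode = stream[pc]
        length = 1 + OPERAND_BYTES[opcode]       # unknown until this instruction is decoded
        instructions.append(stream[pc:pc + length])
        pc += length                             # only now do we know where the next one starts
    return instructions

def decode_fixed(stream, width=4):
    # Every instruction is `width` bytes, so all the boundaries are known up front.
    return [stream[i:i + width] for i in range(0, len(stream), width)]

print(decode_variable(bytes([0x01, 0x02, 0xAA, 0x03, 1, 2, 3, 4])))  # 3 instructions
print(decode_fixed(bytes(range(8))))                                 # 2 instructions
```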

So - critical ideas in RISC were things like fixed-length instructions, avoiding the costly business of working out how long an instruction is while trying to decode it. RISC designs also tend to provide a lot of registers. The SPARC and related machines provided so many registers that procedure calls could be made entirely within the register set, with parameters passed in registers.

Another point about the RISC designs was that they were designed so that the clock rates could be fast. Many older ISA designs had instructions with annoying interdependencies inherent in their semantics, and these made it hard to produce a design that could be clocked as fast as the logic could go. The Alpha, when it came out, clocked at about three times the speed of the equivalent x86 made on the same silicon process. This was a result of the clean slate and very careful choices in the ISA. This is actually a critical problem that is still with us. It isn’t enough just to have lots of transistors. Getting signals across the chip isn’t instantaneous, and the time it takes to get a signal even to a nearby logic block becomes a constraint on the clock speed. This is actually getting worse, not better, as transistor densities increase. Famously, the mega-chip of all CISC efforts, the Intel iAPX 432, was doomed by just this problem.

Another point about the difference between RISC and CISC is that almost every trick and feature we see in x86 machines can be applied to RISC as well. The MIPS R10000 had out-of-order instruction issue in 1996. As more transistors became available, RISC processors provided greater intrinsic parallelism, becoming superscalar - with multiple instructions in flight in parallel.

Similar techniques were used in the x86. What is interesting is that the NetBurst architecture decoded the x86 instructions and translated them into an intermediate instruction set - one that Intel never revealed, but which was rumoured to look a lot like MIPS. It maintained a cache of these instruction translations. NetBurst also had a very long pipeline to execute these internal instructions; this long pipeline was how they got the clock rate up. However, it was very power hungry and there were issues getting it to go much faster. The Core series machines actually have a shorter pipeline (nearly half) and a much simplified instruction decode.

This brings us to a critical point about x86, speed, and CISC. The balance between compiler and ISA has not gone away. The x86 designers adopted the maxim “make the common case fast, make the uncommon case correct.” For x86 this means: make the common instructions, and just as importantly the common addressing modes, fast, and simply make the rest of the huge instruction set correct, in case there is some old code that needs it. The compiler writers get to know that simple instructions and simple addressing modes go really fast, and to avoid the complex stuff. And this sets up a feedback loop. The common case in code is the simple instructions, so making them really fast is the right answer, and as the uncommon instructions become even less common, the designers can devote more resources to the common cases without worrying that the uncommon stuff matters. Looking at the Intel x86 manuals can be enlightening - some instructions are clearly to be avoided.

Finally, of course, the x86 has grown hugely, in two ways. It has grown a massive SIMD bolt-on - an entire new ISA which operates totally differently from the base x86 ISA. Indeed it is much more closely aligned with something Seymour Cray would create. Secondly, the x86 went to 64 bits - not just making the registers wider, but doubling their number, making the ISA much more RISC-like. In fact the performance improvement gained from just the additional registers is about 50% on many codes. The additional data width is actually a performance penalty, as you lose a lot of cache to pointers that have doubled in width. So much so that there is an effort to create a code standard with 64-bit data but only 32-bit-wide addresses, which can take advantage of the extra registers but maintain the cache utilisation advantages of 32-bit addresses.
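
The cache arithmetic behind that last point is simple enough (assuming a typical 64-byte cache line for illustration):

```python
# How many pointers fit in one cache line, 32-bit vs 64-bit addresses.
cache_line_bytes = 64

print(cache_line_bytes // 4)   # 16 pointers per line with 32-bit addresses
print(cache_line_bytes // 8)   # 8 pointers per line with 64-bit addresses

# Pointer-heavy code (lists, trees, object graphs) gets half as many links per
# line - which is what the 64-bit-data / 32-bit-address scheme mentioned above
# tries to claw back, while keeping the extra registers.
```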

The x86 does gain in some interesting ways. Eventually enough logic has been thrown at the instruction decode issue that it isn’t a bottleneck, so now it gains the performance advantages inherent in a more compact program encoding (mostly cache locality).
