In the '90s I was writing assembly for small 8-bit Motorola microcontrollers. Due to memory constraints, using a compiler wasn’t as efficient as writing directly in assembly (I needed to implement a bit of recursive code in only a few bytes). I would assume that modern compilers are used more extensively these days.
Yes. All those chips that control things like refrigerators, washing machines, and TV remote controls need to be programmed in assembly or something close to it.
One big manufacturer of these chips is Microchip (maker of the PIC family).
With these chips, you need to get down to the hardware level: set registers in the chip, control timers and serial devices, and access the chip’s hardware exactly - “almost” or “sort of” does not work!
Also, some of the higher-level programming languages used for programming these chips do not always do exactly what you want, so you MUST write portions in assembly.
Following is a 500-page programmer’s manual for one of their chips. If you look through it, you will notice that specific bits can be set/unset for ALL SORTS of stuff! (Best done with assembly.)
That is an excellent point. Modern CPUs are multi-core and so fast they can plow through a lot of sub-optimal single-threaded code. However, if locality of reference is poor, they can get choked on memory access: Locality of reference - Wikipedia
https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
This was a key disadvantage of RISC machines vs x86 since their code density was lower, which meant they required larger caches to compensate. Instruction or data cache misses would then translate to higher required memory bus bandwidth.
In the multi-core case (which virtually all CPUs are now), even a small fraction of serial, non-parallelizable code will bottleneck all threads per Amdahl’s Law: Amdahl's law - Wikipedia
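To put rough, purely illustrative numbers on it: if 5% of the work is inherently serial, Amdahl's law caps the speedup at 1 / (0.05 + 0.95/N); with 16 cores that is about 9.1x, and even with an unlimited number of cores it can never exceed 20x.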
For this reason Intel has extensive advice for tuning software to run optimally on heavily cached multi-core systems: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/?wapkw=smart+cache
There are also post-link optimization tools which can improve locality somewhat without any programming changes: https://www.research.ibm.com/haifa/projects/systems/cot/fdpr/papers/fddo2.pdf
Increasingly CPUs have heterogeneous instruction sets, including specialized execution resources for vector operations: http://www.nasoftware.co.uk/home/attachments/NASoftware_AVX_CS_r02.pdf
And video transcoding: using Intel’s Quick Sync there can improve performance by 400-500 percent: Intel Quick Sync Video - Wikipedia
Even though some of these (like the AVX vector instructions) are exposed at the assembler level, unlike in previous eras they would typically be accessed not via assembler but via a library or Software Development Kit.
As you said, nowadays it is often better to explore areas like this than to optimize by writing assembler.
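To make that concrete, here is a minimal sketch of what using AVX from C looks like via intrinsics (the function and array names are mine, and it assumes the length is a multiple of 8); the compiler emits the vector instructions, so no hand-written assembler is involved:

```c
#include <immintrin.h>   /* Intel AVX intrinsics; compile with -mavx */

/* Add two float arrays eight elements at a time using 256-bit AVX registers.
   Assumes n is a multiple of 8, purely to keep the sketch short. */
void add_arrays(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);               /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&out[i], _mm256_add_ps(va, vb)); /* vector add + store */
    }
}
```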
There’s rarely (if ever) something that can be done in assembler that can’t be done in C (assuming the compiler has support for the particular device you are working on). There are “pragma” directives and keywords that tell the compiler which specific feature on the chip you are targeting (memory, Flash, registers, I/O, etc.). The only time that “some assembly is required” is when a software timing loop is necessary, or instructions need to be emitted in an exact sequence (like for a Flash lock-and-key). Even then, C can usually do it.
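As a sketch of that last point (every register name, address, and key value here is invented for illustration; a real part’s datasheet would supply them), a lock-and-key style unlock sequence can be written in C with volatile pointers so the compiler performs each write, in order, and doesn’t optimize any of them away:

```c
#include <stdint.h>

/* Hypothetical memory-mapped flash-controller registers. The addresses and
   key values are made up; 'volatile' forces every write to actually happen,
   in source order. */
#define FLASH_KEY_REG   (*(volatile uint8_t *)0x4000)
#define FLASH_CTRL_REG  (*(volatile uint8_t *)0x4001)

void flash_unlock(void)
{
    FLASH_KEY_REG  = 0x55;   /* first key byte                      */
    FLASH_KEY_REG  = 0xAA;   /* second key byte, immediately after  */
    FLASH_CTRL_REG = 0x01;   /* hypothetical write-enable bit       */
}
```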
In my experience, some C constructs don’t fit into microcontrollers very well. I used a simple “switch/case” statement in an interrupt routine, and it ended up exceeding the number of cycles I had available*. When I looked at the code, I found that the compiler generated an enormous amount of setup and tear-down code, laying the foundation for a very generalized switch statement. I re-wrote the code as a series of if/then statements (roughly along the lines of the sketch below) and saved a huge number of cycles.
ETA:
- This was for a DMX controller, which communicates at 250 kbaud. The micro was being interrupted 250,000 times a second, and needed to do a bunch of processing to get the next value to emit. Even at a clock speed of 24.5 MHz, I only had 98 cycles to do what I needed to (and I had stuff to do in the foreground, too).
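The kind of rewrite would look roughly like this (state names and registers are invented; the point is only the shape of the code): a chain of ifs, ordered with the most frequent case first, compiles to a few compare-and-branch instructions instead of the compiler’s generalized switch machinery.

```c
#include <stdint.h>

enum { DMX_BREAK, DMX_MAB, DMX_DATA };   /* hypothetical ISR states */

static volatile uint8_t state;
static volatile uint8_t uart_txbuf;      /* stand-in for the UART data register */

/* Interrupt handler sketch: test the overwhelmingly common state first. */
void dmx_isr(void)
{
    if (state == DMX_DATA) {
        uart_txbuf = 0;          /* ...emit the next slot byte... */
    } else if (state == DMX_BREAK) {
        state = DMX_MAB;         /* ...handle the break timing... */
    } else {                     /* DMX_MAB */
        state = DMX_DATA;
    }
}
```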
I’m an old fart that argues that a decent programmer should have spent some time coding in assembly.
I say this because I have encountered too many examples of horrendously inefficient high level code that was written that way because the programmer didn’t know (and in some cases didn’t WANT to know) what the resulting machine code would involve.
YES, code needs to be readable and maintainable, and high level languages facilitate this. AND faster processors and tons of memory mean that many applications work fine with appallingly inefficient coding.
Efficient code means you can wring more performance out of existing hardware, or use cheaper hardware, which is the name of the game if you are building devices in quantity.
This is one real example:
The processor was the 8051 mentioned in posts above (well, an 8031… same thing, with external code ROM). This was to serve as a serial communications interface, read some pushbuttons, and drive some lights on an operator console that also had a plasma video screen. The screen was run by a TI 34010 graphics processor. The '010 ran its program from video RAM. At boot, the 8031 needed to read 32 megabytes of '010 code from ROMs and spoon-feed it to the '010 via an 8-bit parallel port (the '010 had no means to access the ROMs itself).
Now, the 8031 only has a 16-bit address bus, so it can only address 64K of external memory. So I set up a paging scheme that gave us as many 16K pages as we needed. We could have used something besides an 8031, but we had the experience and tools (including ICE) for the '51, and it would work, so there we were.
I was called in to troubleshoot, because the 8051 was locking up when they tried to load the code for the 34010.
Except it wasn’t locking up. They just never allowed the 20 or so hours that would have been needed to completely load all the code.
The programmer had used one long int (32 bits) to hold the full ROM address, incrementing it for each byte. Now only the lower 14 bits needed incrementing, but OK, and this wasn’t the main problem.
He would then find the low-order address using a modulo operation:
LowerAddr=(address % 16384);
For every byte, he would then separately test the address to see if he was at a page boundary. He did this using a modulo 16K operation:
if ((address % 16384) == 0) { /* increment page register */ }
AND this was in a loop that needed to happen 32 million or so times.
When I looked at this, I said “You don’t need to be doing these two 32 bit division operations in this loop.”
“It’s not division, it’s modulus”
So basically this programmer didn’t know that you have to do division to get the modulus (remainder). Every assembly language programmer knows this.
He also didn’t know that doing 32-bit operations on an 8-bit processor was very, VERY costly. Every assembly language programmer who has done any multi-word math knows this.
Every assembly language programmer knows that you can divide by a power of two with a quick shift, and get the remainder with a bit mask.
He also didn’t know that even though an 8051 has an 8 bit division instruction, it is the slowest one in the instruction set (4 machine cycles IIRC)…because division is slow.
I had him rewrite the code as a couple of nested loops, with the inner one counting down, figuring the address via bit shifting, and we got the boot time down to well under a minute. We did all this still in C. It was about 10-12 lines of code vs. the three or so he had originally written. He marveled at how my “much less efficient” code with two nested loops could be so much faster than his compact and elegant single-loop implementation. I don’t think that guy ever did understand what the problem was.
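The rewrite would have been roughly shaped like this (all the hardware hooks here - the page-select routine, the ROM-window read, and the parallel-port routine - are invented stand-ins for the board-specific code):

```c
#include <stdint.h>

#define PAGE_SIZE  16384U    /* 16K window in the 8031's external address space */
#define NUM_PAGES  2048U     /* 32 MB of '010 code / 16K per page */

/* Hypothetical hardware hooks -- the real ones were board-specific: */
extern void    select_rom_page(uint16_t page);    /* write the paging latch   */
extern uint8_t read_rom_window(uint16_t offset);  /* read the mapped 16K page */
extern void    send_byte_to_34010(uint8_t b);     /* 8-bit parallel port      */

void load_34010_code(void)
{
    uint16_t page, count, offset;

    for (page = 0; page < NUM_PAGES; page++) {
        /* One page-register update per 16K bytes, instead of a modulo test per byte.
           (Even if you kept a single 32-bit ROM address, the page is just addr >> 14
           and the offset is addr & 0x3FFF -- a shift and a mask, never a division.) */
        select_rom_page(page);
        offset = 0;
        for (count = PAGE_SIZE; count != 0; count--) {
            /* All 16-bit arithmetic; the countdown gives a cheap test-for-zero. */
            send_byte_to_34010(read_rom_window(offset++));
        }
    }
}
```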
I’m pretty sure I could have gotten it down under 10s in assembly, but it was deemed not to be worth the effort.
So you can tell programmers this stuff until you are blue in the face, but they won’t internalize it unless they have lived it. Yes, compilers can be amazing, but if your algorithms are the problem, they don’t help.
I once was introduced to a friend-of-a-friend. He was a programmer, and I guess my friend had told him that I had hardware experience. He turns to me and says “How would you generate a 256-step sine wave?” Without hesitating I said “pre-calculated look-up table.”
He said “It’s clear you’ve worked on tiny devices.”
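For anyone curious, the lookup-table answer is only a few lines. This is a generic sketch; on a real micro you would bake the table values in as constants at build time rather than computing them with math.h:

```c
#include <stdint.h>
#include <math.h>

static uint8_t sine_table[256];   /* one byte per step, centered on 128 */

/* Fill the table once (or pre-compute it and store it as a const array). */
void init_sine_table(void)
{
    for (int i = 0; i < 256; i++) {
        sine_table[i] = (uint8_t)(127.5 * (1.0 + sin(2.0 * 3.14159265358979 * i / 256.0)));
    }
}

/* Generating the wave is then just an indexed read per sample; the uint8_t
   phase index wraps around at 256 all by itself. */
uint8_t next_sample(uint8_t *phase)
{
    return sine_table[(*phase)++];
}
```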
We used to teach students on a small robot driven by a larger PIC. They used to write the entire thing in C. C is perfectly good at bit twiddling and messing with device registers etc. Sadly the class on assembly language programming was dropped decades ago. (I did the class on a CDC Cyber 172 - now that was a seriously interesting ISA. It was taught on VAXen for years after. I don’t think the kids learning the VAX instruction set knew how easy they had it.)

The thing about messing with device registers is that the compiler needs to somehow be told that the address range is special (compiler pragmas, or keywords like “volatile”). Many device registers behave quite dreadfully. Reading **or** writing to one would cause a different action in the device, and many had quite different functions on read and write. Those that triggered an action on write, and delivered a totally different value (i.e. status) on read, were a favourite. The compiler must be prevented from attempting read-modify-write cycles on those locations. Microcontroller compilers will have such support.
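A tiny sketch of the kind of thing that goes wrong there (register address and bit values invented): for a register that triggers an action on write but returns status on read, you keep a shadow copy in RAM instead of letting the compiler read-modify-write the hardware location.

```c
#include <stdint.h>

/* Hypothetical control register: writing it commands the device, reading it
   returns status -- not what was last written. The address is made up. */
#define DEV_CTRL  (*(volatile uint8_t *)0x00A0)

static uint8_t dev_ctrl_shadow;   /* software copy of the last value written */

void enable_device_interrupt(void)
{
    /* DEV_CTRL |= 0x08; would read STATUS, OR in a bit, and write the result
       back as a command -- almost certainly not what you meant. Instead: */
    dev_ctrl_shadow |= 0x08;
    DEV_CTRL = dev_ctrl_shadow;
}
```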
But there are PICs and there are PICs. My favourite example at the low end was the remote control that sat on the cord of a pair of earbud headphones for an iPod. The remote had four buttons and talked to the iPod. Years ago I had a high-end Sony Walkman that had a very similar remote. (It really sounded fantastic for the time.) It used a couple of resistors in the remote to encode which switch was pushed. The Walkman noticed the appropriate voltage drop on the control wire and did the right thing. The iPod instead uses a serial link with a five-byte payload, and the controller had a PIC the size of a grain of rice - whose entire job was to watch the buttons and send the five-byte packet. As I explained to my computer architecture class, the PIC was actually the cheaper solution. I bet it was never coded in anything other than assembler.
I suspect that the PIC family may now be the most ubiquitous CPU on the planet.
I used to teach the computer architecture students how a CPU could be built up from logic gates and the machine I would arrive at was essentially a PIC. I think they were all a bit stunned at how simple it was.
A friend of mine that I used to work with (who sadly is no longer with us) used to make fun of me because he said I wanted to put a PIC into everything. There’s probably a certain amount of truth to that. I like intelligent hardware bits that can self-diagnose to some degree. They may not tell you what’s wrong, but they’ll at least tell you what they think is wrong, which is a lot more than what you get from dumb circuits.
While C compilers for PICs aren’t all that bad, all of the PIC code I have ever written has been in assembly.
What’s amazing to me is how cheap micros are. Oftentimes it’s cheaper to use a microcontroller in place of a dedicated IC (like a divide-by-N counter). I don’t know how they do it (volume, I suppose).
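For the divide-by-N case, the firmware really is only a few lines; this is a generic sketch with an invented pin-pulse routine and an arbitrary ratio:

```c
#include <stdint.h>

#define DIVIDE_RATIO 10               /* N -- whatever the design calls for */

extern void pulse_output_pin(void);   /* hypothetical: emits one pulse on the output GPIO */

static uint8_t edge_count;

/* Hypothetical external-interrupt handler, fired on each input edge:
   one output pulse for every N input edges. */
void input_edge_isr(void)
{
    if (++edge_count >= DIVIDE_RATIO) {
        edge_count = 0;
        pulse_output_pin();
    }
}
```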
I do about 99% of my professional programming in assembly on a very large project. Lots of people in control systems and intelligent adapters still use assembly to increase performance and reduce cost. Every transistor used on a chip increases cost, heat and (especially) power which is often the difference between you and a competitor. Using assembly lets you optimize down to the minimum number of gates, RAM, and instruction cycles. Of course, at home I use C# and try to be as lazy as possible.
I literally cringed when I saw that line
I play around with Raspberry Pis so I come across references to ASM programming for the Pi quite a bit. There are several books, etc.
A quick Googling shows lots of info about it.
It does appear that most of it is oriented toward chunks of ASM code called from C, etc.
The Pi has GPIO connectors (to hook up motors, sensors, etc.) and some other accessible low-level hardware that can be more precisely controlled via ASM. Efficiency of code is not always the goal, but making sure certain events happen at precise times and in a certain order, especially regarding whatever hardware is connected to the pins.
As to the OP’s question about machine code: notice we’re not really talking about that. While once in a while you have to dive into the actual bits and bytes of your data, the numeric values of your instructions are rarely something you have to look at, let alone write by hand.
Note that first with compiler-compilers and then with cross-compilers, the bootstrap software development process of the classic era is long gone. (I.e., write a simple assembler in machine code, use that assembler to write a real assembler. Use that to write a simple compiler. Use that to write a better compiler, and then a better one and a better one…)
They may have had it “easy” relative to some weird ISA, but the truth is that the VAX (and the PDP-10) were among the relatively sophisticated (for the day) breed of machines whose large instruction sets and myriad addressing modes were quite a bit more of a chore to learn than the relatively simple instruction sets of older machines, where you basically had a few register manipulation instructions, some test-and-branch stuff, and that was pretty much it. The PDP-8 had literally 8 instructions, although the 8th was a rather clever combo of accumulator test and manipulation micro-instructions. The PDP-11 hit about the right balance of power and relative simplicity. The VAX and PDP-10 instruction sets made comp sci students’ heads spin. When you have a machine like the VAX that has a specific pair of instructions to save and load process context to facilitate the work of a timesharing OS, you know that you’re into a seriously non-trivial instruction set.
Don’t demoscene programmers do a lot of assembly? Hard to see how someone could write a 4K program with animation and music in a high level language.
C is by no means high-level, because C has the same approach to memory management as assembly: Let the programmer handle it! And, yes, it’s just as inefficient as assembly, because humans are typically crap at manual memory management, and leak and misuse memory left and right unless it’s practically trivial to manage things correctly. It’s true that, in principle, it’s always possible to fix the leaks, but it’s also true that programmer time is almost always more valuable than that.
C is more of a middle-level language, because it can be very portable, but it punts a lot of the hard problems to individual programmers, which high-level languages by definition do not do. C++, with RAII and smart pointers, is high-mid-level, kinda, but it still has a lot of manual memory management, and, therefore, a lot of leaks in practical contexts.
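As a contrived illustration of the point (the names are mine): in C, every malloc() has to be paired with a free() on every path, and nothing but programmer discipline enforces the hand-off of ownership.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated upper-cased copy of s; the CALLER owns it. */
char *duplicate_upper(const char *s)
{
    size_t len = strlen(s);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        copy[i] = (char)toupper((unsigned char)s[i]);
    copy[len] = '\0';
    return copy;                 /* ownership silently passes to the caller... */
}

int main(void)
{
    char *p = duplicate_upper("hello");
    if (p != NULL) {
        printf("%s\n", p);
        free(p);                 /* ...and if the caller forgets this, it leaks */
    }
    return 0;
}
```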
My friends and I referred to it as “scraping the metal”. And back in the '70s and '80s we did it all the time. Sometimes to speed up processes, sometimes to access hardware features that you just couldn’t access any other way at the time, and sometimes just for bragging rights.
(“Work” projects only involved the first two reasons, of course.)