When it comes to performance there is a lot more to it than how good the assembler listing looks. Modern processor implementations perform a great deal of optimisation that is not reflected in the machine code.
I burned a few years of my life coding a mixed machine code, C++, and Python system. I wrote a discrete event simulator kernel in machine code that was used by a simulator library written in C++, where the simulations themselves were created, manipulated, visualised, and debugged in Python. It ran on both Windows and Linux, so there were some amusing things to learn about the deep parts of each, and one became intimately familiar with the systems and their tradeoffs. One thing is that C++ does not play well with the Windows exception model. So much so that Microsoft advised that they would prefer you didn't use it.
One of the most important aspects of speed is that it isn't about the instructions. You need to keep the pipeline full, and that means avoiding stalls. That is mostly about the data: getting data to and from the processor pipeline is the dominant place where you lose time, and caches are what make things run fast. Then there is anything else that can stall your pipeline, which mostly means jumps. Branch prediction tables and speculative execution help. None of this is obvious from inspection of the machine code, but modern compilers can perform some very deep optimisations if given the chance. Things like partial execution of code can yield insights the compiler can use to choose how blocks are structured or branches are taken. Hints can be inserted for the preferred branch direction, and given enough information vector instructions can be used for tight critical loops. It can be pretty amazing what can be done, but also depressing when it doesn't actually work.
Data layout can kill your performance. On a 64-bit ISA, alignment can mean huge wins or losses. But pressure on the I-cache can be an issue too.
If you have a language that can express the semantics well, it gives the compiler a lot more freedom to generate good code. Writing code where you try to optimise things at the language level can be counter-productive: the compiler can almost always do a better job of tracking dependencies in the code, working out what to evaluate when, and using register allocation to best effect. A lot depends upon the sort of code you are writing. Numerically intensive code has significantly different trade-offs, and the level of intrinsic parallelism possible similarly so.
There are lots of places where optimisations occur that are not obvious at the code level in higher-level languages. Java has used just-in-time compilation for decades.
Compilers can perform lots of storage allocation optimisations. If a compiler can statically prove that an object allocated inside a routine is not visible outside it, it can allocate the object on the stack and avoid leaving it for the GC. This can lead to significant gains in unexpected places. Some languages have missed a few tricks with how they lay out objects.
There has been an argument that using C++ lets the compiler produce really good code. In my opinion and experience, the language is still too messy to allow really good optimisation. Believe it or not, a modern FORTRAN will wipe the floor with it for many HPC codes. For really low-level tight operations, getting down to the metal can yield big gains. The simulator kernel I wrote was tweaked down to the cycle level and we profiled it to death, understanding where every cycle went, especially watching for cache and branch prediction gains. The core dispatcher could not be written in anything but assembler because it needed to bridge the gap between the OS process model and the call standard of the compiler. On Windows we needed to access segment registers as well, and manage both the C++ and Windows exception models. The main simulator was byte coded, using a threaded interpreter. Again, one needed to understand exactly what the compiler generated; you could lose a huge amount of time with naive code structures, and understanding what each compiler did was critical. The Microsoft and GNU C++ compilers do not generate similar code, and any idea that you could assume good code from either without going deeper was sadly wrong. This perhaps underlines the problem with assuming that any high-level language gets you close enough to the machine to always craft fast code. With the same source code, compare the output of different compiler back-ends and see what you get. You can get some remarkable surprises.
Oh you mean the fun stuff! That was the sort of thing that kept me enjoying the game. There is little more satisfying than tracking down these sorts of bugs and fixing them.