Two things that have come up specifically in my work:
[ul]
[li]Modern processors run so much faster than system memory that cache misses often cost more execution time than a slightly faster instruction sequence can save. Reorganizing data to improve locality is therefore generally a better investment than dropping to assembly to speed up the code itself.[/li][li]In many places where we would previously have resorted to assembly for speed, we now emit LLVM IR and let LLVM handle the low-level code generation. It probably can’t replace every use of assembly, but it covers a lot of them.[/li][/ul]
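To make the first point concrete, here is a minimal sketch in C. It is not taken from my actual work: the function names and the flat row-major layout are assumptions chosen for illustration. Both functions do exactly the same additions and return the same sum; the only difference is the order in which they touch memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration. The matrix is assumed to be a
   flat, row-major array holding rows * cols ints. */

/* Visits elements in address order, so consecutive accesses tend to
   fall in the same cache line. */
long long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same arithmetic, but each step of the inner loop jumps ahead by
   cols * sizeof(int) bytes, so on a large matrix consecutive accesses
   usually land in different cache lines. */
long long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

If you time the two on a matrix much larger than your last-level cache, the second version is typically the slower one, though by how much depends on the hardware, the matrix dimensions, and whether the compiler manages to interchange the loops. The point is that no amount of hand-tuning the add instruction helps the second loop; changing the traversal order does.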