I wouldn’t use the number of instructions as any sort of metric. Many general-purpose CPUs have a small number of simple instructions; that is what defined RISC. x86 exists despite its ISA shortcomings and insane instruction set.
The other thing about GPUs is that there are a whole lot of things they are bad at that don’t matter for parallel computing of arithmetic.
The core driver is the maxim: make the common case fast, make the uncommon case correct.
Most CPUs spend their time doing stuff other than arithmetic and are optimised for their common case. A general purpose CPU benefits from caches, branch prediction logic along with speculative execution, and needs to cope with all the cruft an operating system brings, including interrupts, exceptions and security, in an efficient manner. This burns a lot of silicon to make fast. Especially the caches, but even all the extra junk attached to the main pipeline increases area massively. Adding virtual memory is another headache, with translation lookaside buffers to manage, more exceptions, and yet more junk in the way between computation and memory.
A GPU, at least in the basic geometry or raster engine, needs to do linear-algebra operations and bit blitting on a large, regular dataset. This typically involves doing the same thing to each data item. It is rare to need to branch, and there is no need for complicated memory access control. Usually GPUs have their own optimised memory.
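The “same thing on each data item” pattern is easy to sketch in NumPy, used here purely as a stand-in for what the GPU hardware does across its lanes (the vertex data is made up for illustration):

```python
import numpy as np

# A hypothetical batch of 2D vertices: the large, regular dataset.
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# One linear-algebra operation -- a 90-degree rotation -- applied
# identically to every vertex, with no per-item branching.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Same instruction stream, many data items.
rotated = vertices @ rotation.T
```

The point is not the maths but the shape of the work: one operation, broadcast over a regular array, which is exactly what a GPU’s lanes are built for.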
This leads to an architecture with memory control mechanisms that are good at striding across large amounts of memory in regular patterns, which allows streamlining of memory control and pipelining of accesses. There are no caches: no need, and they would just get in the way. The regular access patterns mean operands are always ready. Similarly, branching is rare, so the instructions can be crafted to execute without contention for internal resources, with no hazards to cause a stall, and without worrying about needing to back out.
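What makes such striding streamlinable is that every address is a pure function of the loop index, so the memory system can compute the whole access sequence ahead of time. A conceptual sketch, not a model of any real memory controller:

```python
# A strided walk over a flat buffer: element i of row r lives at
# r * row_stride + i, so every address is known before the loop runs.
row_stride = 8
buffer = list(range(4 * row_stride))

# Gather one column: a regular stride-8 pattern, trivially prefetchable.
column = [buffer[r * row_stride + 2] for r in range(4)]
```

Contrast this with pointer-chasing code on a CPU, where the next address is only known after the previous load completes.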
Some general-purpose CPUs did gain additional capability for GPU-like operations, such as MMX and its ilk, or the i860. But these still existed in the environment of the CPU, and you didn’t get the scale possible with a GPU. (A graphics system that filled entire boards with i860 processors was a thing.)
A modern GPU is thus a very specialised processor. Cutting down what it needs to be good at to a tightly constrained set means a huge amount of the cruft needed by a general-purpose CPU can be eliminated. The result is vastly smaller in terms of device count and thus chip area. Where a top-end CPU chip might struggle to place 64 cores on a die, since a huge fraction is taken up with caches and the CPU cores themselves are bloated out, a GPU die can be filled with legions of tightly designed cores and highly efficient memory controllers feeding them. Small can also mean higher clock rates. It all adds up.
Early GPUs also took advantage of the limited precision needed for graphics. You don’t need 64-bit IEEE floating point when all you are doing is calculating the location of a line on a screen. Even early GPUs with limited floating-point capability were good enough for some physics, lattice gauge QCD for instance. The GPU manufacturers realised there was value in adding the extra precision of proper floating point to address these new markets.
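A toy illustration of why narrow floats are fine for screen coordinates: single precision carries roughly 7 significant decimal digits, plenty for pixel positions, but it silently drops bits once integer values pass 2**24 — which is why scientific workloads pushed vendors toward proper wide floating point:

```python
import numpy as np

# Screen-scale values survive the trip through 32-bit floats untouched.
pixel_x = np.float32(1919.5)

# But float32 has only a 24-bit significand: 2**24 + 1 cannot be
# represented, so it rounds to the nearest representable value, 2**24.
big = np.float32(2**24 + 1)
```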
AI on GPUs really means large-scale matrix manipulation, so it addresses the deep-learning transformer paradigm. That seems to be the current hotness in AI, so it is a big deal. Blockchain in the abstract often requires some metric of computational work done to limit chain extension (although not always). Bitcoin just involves calculating hashes, so it is trivially parallel, and each calculation involves grinding along a large slab of data, so it is a great fit for each individual calculation as well.
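The Bitcoin case can be sketched with the standard library alone: each candidate nonce gets an independent double SHA-256, so the trials are embarrassingly parallel. This is a toy, not Bitcoin’s real block-header format; `difficulty_bytes` and the header bytes are illustrative:

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    # Bitcoin's proof-of-work hash: SHA-256 applied twice.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, difficulty_bytes: int) -> int:
    # Grind nonces until the hash starts with enough zero bytes.
    # Every trial is independent of every other, which is why real
    # miners run vast numbers of them in parallel on GPUs (and ASICs).
    nonce = 0
    while True:
        digest = double_sha256(header + nonce.to_bytes(8, "little"))
        if digest[:difficulty_bytes] == b"\x00" * difficulty_bytes:
            return nonce
        nonce += 1

found = mine(b"toy block header", difficulty_bytes=1)
```

With one leading zero byte required, the expected search is only a few hundred hashes; real Bitcoin difficulty demands astronomically more, which is the whole point of the parallel hardware.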