I was talking to a friend the other day and we somehow got onto the topic of integrated circuit (‘IC’, aka ‘chip’) yields. Background: when they make a chip, they’re basically projecting a microscopic image onto a photoresist, etching away the stuff that did (or didn’t) get exposed, laying down a few atoms of something or other, and repeating that twenty or thirty times. Please don’t jump in at this point and dump on me, I actually do understand the process; I’m just simplifying here for the sake of the discussion - thanks.
Anyway, obviously if there is any tiny misalignment or odd bit of dust/dog hair/snot/etc. in the process, the twenty-plus layers are not going to be what you expected and the chip is gonna be dysfunctional. Further, they make several identical copies of the chip on a big wafer of silicon, so one could be munged while its neighbor is okay. The ratio of the number of functional chips to the attempted number is the manufacturing yield.
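To put rough numbers on that, here’s a toy sketch (every number in it is made up, purely to illustrate): scatter some random defects across a wafer of identical dies and count how many dies escape untouched.

    import random

    # Toy illustration of yield: scatter random defects across a wafer of
    # identical dies; any die hit by a defect is bad. All numbers are made up.
    random.seed(1)
    rows, cols = 10, 20          # 200 candidate dies on the wafer
    num_defects = 40             # specks of dust, dog hair, misalignments, etc.

    # Each defect lands on one die at random; defects don't care about die
    # boundaries, so a munged die can sit right next to a perfectly good one.
    bad = {(random.randrange(rows), random.randrange(cols))
           for _ in range(num_defects)}

    good = rows * cols - len(bad)
    print(f"{good} good dies out of {rows * cols} "
          f"-> yield of {good / (rows * cols):.0%}")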
Okay, so the friend and I are talking and he asked ‘How do they test the chips to see which ones are good?’. I tossed off a flip answer (because I’d seen this somewhere a long time ago) that they have an octopus-like machine with hundreds of microscopic gold fingers that connects to each chip wannabe, powers it up, and sees if it actually works. If I remember correctly, they would squirt a bit of dye onto the ones that didn’t respond correctly and then garbage them later in the packaging process.
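If it helps to see the shape of what I mean, here’s a toy sketch of that probe-and-ink idea (the test vectors, the ‘dies’, and the little XOR circuit they’re supposed to implement are all invented for illustration):

    # Toy sketch of wafer probe: feed each die a handful of test vectors,
    # compare its answers with the known-good ones, and ink the failures so
    # they get tossed at packaging time. Everything here is invented.

    # Each vector is (inputs, expected_output); this pretend chip is supposed
    # to compute inputs XOR 0b0110.
    TEST_VECTORS = [
        (0b0011, 0b0101),
        (0b1100, 0b1010),
        (0b1111, 0b1001),
    ]

    def probe_die(die):
        """True if the die answers every vector correctly."""
        return all(die(inputs) == expected for inputs, expected in TEST_VECTORS)

    def ink_bad_dies(dies):
        """Return the indices of the dies to mark as bad."""
        return {i for i, die in enumerate(dies) if not probe_die(die)}

    good_die = lambda x: x ^ 0b0110     # fabricated correctly
    munged_die = lambda x: 0            # got a dog hair in layer 12

    print(ink_bad_dies([good_die, munged_die, good_die]))   # -> {1}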
With the full benefit of hindsight, I’m regretting this explanation. For simple chips this might work, but suppose you’re trying to make Pentium IIIs - how many million individual transistors is that? You can give it power and a clock and see if it looks for external memory, but what does that tell you? Okay, so simulate some external memory and a boot program and see if it executes it. Good, good. You’d have to test just about every individual opcode, though, wouldn’t you? Seems to me that the dreaded ‘Hello, world’ program isn’t going to exercise every logic gate in the thing.
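That hunch can actually be made concrete. Test people talk about fault coverage: pretend one wire at a time is stuck at 0 or 1 and ask whether your test patterns would even notice. Here’s a toy sketch on a one-bit adder (the circuit, net names, and patterns are all just illustrative, nothing Intel-specific):

    from itertools import product

    # Toy fault-coverage sketch on a 1-bit full adder: inject single stuck-at
    # faults on each net and count how many a given set of test vectors
    # actually detects. Purely illustrative.

    def simulate(a, b, cin, fault=None):
        """Evaluate the adder; fault = (net_name, stuck_value) or None."""
        nets = {}

        def drive(name, value):
            # Apply the stuck-at fault (if any) the moment the net is driven.
            if fault and fault[0] == name:
                value = fault[1]
            nets[name] = value

        drive("a", a); drive("b", b); drive("cin", cin)
        drive("x1", nets["a"] ^ nets["b"])
        drive("sum", nets["x1"] ^ nets["cin"])
        drive("a1", nets["a"] & nets["b"])
        drive("a2", nets["x1"] & nets["cin"])
        drive("cout", nets["a1"] | nets["a2"])
        return nets["sum"], nets["cout"]

    NETS = ["a", "b", "cin", "x1", "a1", "a2", "sum", "cout"]
    FAULTS = [(net, value) for net in NETS for value in (0, 1)]

    def coverage(vectors):
        """Fraction of stuck-at faults that the vectors would catch."""
        caught = sum(
            any(simulate(*v) != simulate(*v, fault) for v in vectors)
            for fault in FAULTS)
        return caught / len(FAULTS)

    lazy_test = [(0, 0, 0)]                        # the 'Hello, world' of tests
    full_test = list(product((0, 1), repeat=3))    # exhaustive for 3 inputs
    print(f"1 vector:  {coverage(lazy_test):.0%} of faults caught")
    print(f"8 vectors: {coverage(full_test):.0%} of faults caught")

Scale that from half a dozen gates up to a few million transistors and you see the problem I’m asking about: exhaustive testing clearly isn’t an option, so what do they actually do?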
So, the question is: how do they test complex chips? How many marginally defective chips get through the first test, and is there a second level of testing? How many actually defective chips (ignore the whole floating-point error problem - that was a design problem, not a manufacturing one) get delivered to end users? Suppose you buy a new machine and it runs fine for a few days and then mysteriously crashes - that could be marginal software, a random cosmic ray, or a lurking hardware bug. How do you tell?
Hmmmm, not just one but several related questions … bonus. Any Hardware Gurus here who can help me understand the process?