Chip (IC) manufacturing yields

I was talking to a friend the other day and we somehow got onto the topic of integrated circuit (‘IC’, aka ‘chip’) yields. Background: when they make a chip, they’re basically projecting a microscopic image onto a photoresist, etching away the stuff that did (or didn’t) get exposed, laying down a few atoms of something or other; repeat twenty or thirty times. Please don’t jump in at this point and dump on me, I actually do understand the process; I’m just simplifying here for the sake of the discussion - thanks.

Anyway, obviously if there is any tiny misalignment or odd bit of dust/dog hair/snot/etc in the process, the twenty plus layers are not going to be what you expected and the chip is gonna be dysfunctional. Further, since they make several identical copies of the chip on a big wafer of silicon, one could be munged and its neighbor could be okay. The ratio of the number of functional chips to the attempted number is the manufacturing yield.
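To put a number on that ratio, here’s a trivial back-of-the-envelope sketch (the die counts are invented):

```python
# A trivial sketch of the bookkeeping described above; the numbers are made up.
def manufacturing_yield(good_die: int, attempted_die: int) -> float:
    """Fraction of attempted die that actually work."""
    return good_die / attempted_die

# Say a wafer had 212 candidate die and 187 of them turned out functional:
print(f"yield = {manufacturing_yield(187, 212):.1%}")   # -> yield = 88.2%
```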

Okay, so the friend and I are talking and he asked ‘How do they test the chips to see which ones are good?’. I tossed off a flip answer (because I’d seen this somewhere a long time ago) that they have an octopus-like machine with hundreds of microscopic gold fingers that connects to each chip wannabe, powers it up and sees if it actually works. If I remember correctly, they would squirt a bit of dye onto the ones that didn’t respond correctly and then garbage them later in the packaging process.

With the full benefit of hindsight, I’m regretting this explanation. For simple chips this might work, but suppose you’re trying to make Pentium IIIs - how many million individual transistors? You can give it power and a clock and see if it looks for external memory, but what does that tell you? Okay, so simulate some external memory and a boot program and see if it executes it. Good, good. You’d have to test about every individual op code, though, wouldn’t you? Seems to me that the dreaded ‘Hello, world’ program isn’t going to exercise every logic gate in the thing.

So, the question is: how do they test complex chips? How many marginally defective chips get through the first test and is there a second level of testing? How many actually defective chips (ignore the whole floating point error problem, that was a design problem, not a manufacturing one) get delivered to end users? Suppose you buy a new machine and it runs fine for a few days and then mysteriously crashes - this can be: marginal software; a random cosmic ray; or a lurking hardware bug. How do you tell?

Hmmmm, not just one but several related questions … bonus. Any Hardware Gurus here that can help me understand the process?

I believe all of what I say below is standard general practice for all chip manufacturers.

At several steps in the process, they inspect the wafers. Typically, this is done with machines that take pictures of each chip and compare the picture to a known good reference and/or other chips; the bad chips will, of course, look different than the good ones. Other tools scan lasers across the wafer’s surface and mark points where the light scatters as defects. The laser-type tools are good for picking up things like dust on the surface of the wafers, but don’t pick up flat defects, such as dried residue. In addition to these tools, they also use electron microscopes and plain, old-fashioned optical microscopes (although the technology is all but beyond the resolution of white light optical imaging; some scope manufacturers are moving to ultraviolet light to get better imaging.)
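Conceptually, the die-to-die comparison boils down to something like this sketch (not any vendor’s actual algorithm, just the idea): subtract a reference image from the image of the die under inspection and flag pixels that differ by more than the noise floor.

```python
import numpy as np

def find_defect_pixels(die_image: np.ndarray, reference: np.ndarray,
                       threshold: float = 30.0) -> np.ndarray:
    """Return (row, col) coordinates where the die differs from the reference."""
    diff = np.abs(die_image.astype(float) - reference.astype(float))
    return np.argwhere(diff > threshold)

# Hypothetical example: two identical die images except for one bright "particle".
ref = np.zeros((512, 512))
die = ref.copy()
die[100, 200] = 255.0                  # pretend a speck of dust landed here
print(find_defect_pixels(die, ref))    # -> [[100 200]]
```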

The tools are also checked on a regular basis using wafers that mock up the process layers (metal on oxide on polysilicon on etc… on the original wafer) but don’t have a pattern to confuse the defect machines. Pattern alignment is measured by tools or people that read the alignment of features on the product wafers or on test wafers patterned with these features.

The machines that test the finished chips *are* getting pretty complex, and getting expensive as a consequence. IIRC, memory is pretty easy to check, but logic is still a RPITA to test. Some guys are using testing programs that only test some of the opcodes on each chip and change which ones are tested from chip to chip. The idea is that if the wafer tested good in the factory, then the chips on the wafer should test similarly.
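As a rough illustration of that rotating-subset idea (purely made up; real production test programs are far more involved), each chip could get a different slice of the opcode map so the full set gets covered across the wafer:

```python
OPCODES = [f"OP_{i:02X}" for i in range(256)]      # pretend ISA with 256 opcodes

def opcodes_for_chip(chip_index: int, subset_size: int = 32) -> list[str]:
    """Each chip tests a different window of opcodes; 8 chips cover the whole set."""
    start = (chip_index * subset_size) % len(OPCODES)
    return [OPCODES[(start + i) % len(OPCODES)] for i in range(subset_size)]

print(opcodes_for_chip(0)[:4])   # ['OP_00', 'OP_01', 'OP_02', 'OP_03']
print(opcodes_for_chip(1)[:4])   # ['OP_20', 'OP_21', 'OP_22', 'OP_23']
```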

Finally, everything in the making of the chip that is known to affect the final product is tightly controlled. Any variation from the original design, good or bad, is addressed by the chip or process designers or the guys in the fab.

Thanks cornflakes, but suppose that one obscure gate off in the corner is nonfunctional - say the chip can’t do floating point divides if the exponent is greater than 64 and odd, something subtle like that - will a visual (or automated visual) check detect that?

I understand that the process has to be tightly controlled and ultra clean - we’re talking features on the scale of, what now, .28 microns? Just seems to me that you’re going to get some bad chips going out - as in the example above. Thinking about it, memory would be easy to test - just write and read a couple of bit patterns (like we watch the assembled machine do as it boots).
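Something like this toy sketch, I imagine - write a few patterns to every address and make sure they read back (a real test hammers the actual silicon, of course, not a Python list):

```python
def test_memory(mem: list[int], patterns=(0x55, 0xAA, 0x00, 0xFF)) -> list[int]:
    """Return the addresses that fail to read back any of the test patterns."""
    bad_addresses = []
    for addr in range(len(mem)):
        for pattern in patterns:
            mem[addr] = pattern
            if mem[addr] != pattern:   # on real hardware this is where a stuck bit shows up
                bad_addresses.append(addr)
                break
    return bad_addresses

print(test_memory([0] * 1024))   # -> [] for this (perfect) simulated memory
```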

  • Probably, but that’s part of the challenge; I’m not entirely sure how they do it. I think that the main problem is getting the instruction to execute reliably, yet fast enough, and that is easier to determine statistically. I wish I could describe it a little better.

The subject makes me think of something a neighbor once mentioned. He has a degree in programming. When he went back to school, he discovered that there’s a big push to set up programs as formulas so they can be mathematically proven. Apparently, the software side has gotten pretty complex as well.

If there’s one thing I know, it’s silicon test - I’ve been doing it for more than 15 years. Optical scan techniques are used to inspect the masks, but I am unaware of anyone using optical techniques to inspect production die… It’s too expensive, too slow, and not particularly effective. Optical scanners are used, though, to ensure that the packaged chips all have straight, planar leads and that their markings are good.

There are a number of techniques used to test chips. The most common is with the use of ATE (Automatic Test Equipment) to provide inputs and sample outputs of each chip. Try searching on the likes of Teradyne, LTX, Schlumberger, and Advantest. Their web sites usually have some pretty good pictures of the machines involved. These machines force and measure voltages and currents for some of the basic tests and apply digital test vectors for the logic tests. A vector is a sequence of input and output values for a given instant in relative time for the chip. It is not within the scope of this forum to give a complete explanation. We have techniques for determining the optimum number of vectors to sequence to ensure that the chip will function correctly under nearly all circumstances. With modern circuit complexities it is impossible to check every possible input/output combination sequence. We have to settle for a statistical confidence level, for instance 99% confidence or fault coverage.
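In spirit, a vector-based test reduces to something like the sketch below: force the inputs, sample the outputs, and compare against the expected values for that time step. The toy “device” and all the names here are mine, not anything from a real ATE.

```python
from typing import Callable, Sequence

Vector = tuple[dict, dict]   # (inputs to force, outputs expected) at one time step

def run_vectors(device: Callable[[dict], dict], vectors: Sequence[Vector]) -> bool:
    """Apply each vector's inputs and compare the device's outputs; fail on first mismatch."""
    for inputs, expected in vectors:
        if device(inputs) != expected:
            return False
    return True

# Toy "device under test": a 1-bit AND gate.
and_gate = lambda pins: {"y": pins["a"] & pins["b"]}
vectors = [({"a": 0, "b": 0}, {"y": 0}),
           ({"a": 1, "b": 0}, {"y": 0}),
           ({"a": 1, "b": 1}, {"y": 1})]
print(run_vectors(and_gate, vectors))   # -> True
```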

There are variants: scan testing (which requires special test circuitry to be added to the chip), IDDQ testing (which I equate to taking the chip’s temperature), self test (for chips with embedded microprocessors), and others.
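For what it’s worth, the IDDQ variant is conceptually as simple as this sketch (the current limit below is invented): a defect-free CMOS chip draws almost no supply current when nothing is switching, so an unusually high quiescent current flags a defect.

```python
def iddq_pass(measured_ua: float, limit_ua: float = 50.0) -> bool:
    """Pass if the quiescent supply current (in microamps) is under the assumed limit."""
    return measured_ua < limit_ua

print(iddq_pass(3.2))     # True  - healthy part
print(iddq_pass(870.0))   # False - something is leaking where it shouldn't
```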

Typically chips are tested at wafer level then packaged and tested again, possibly at environmental extremes, depending on the target application (i.e. chips that go into cheap watches are tested less rigorously than chips that go into airbag controllers).

If you’re really interested in details, I’ll do some searching around the web and see if I can find some resources that are reliable. I’m also willing to answer questions, as long as they don’t lead me into proprietary areas.

Smock up, go in the fab and ask the metrology guys how the KLA works.

I think I adequately described how KLA Tencor’s main products work. Maybe the newer ones are different, but that’s how I had the tools described to me. Besides, if you don’t have completed circuits (or completed transistors, for that matter,) how else are you going to check for defects on the chip?

Perhaps my description of the final test procedure was incorrect, but it still remains that not every possible function is tested on every chip. It’s just as important to keep things clean and in control on the process floor as it is to test them after the die are built. I stand by my original response.

SuperNerd, here’s the KLA link, if you want it: http://www.kla-tencor.com/site_frame.html

Wow - thanks for the link, it answers just about everything. So, it can be done optically, and it sounds like it’s pretty reliable and fast (40 wafers per hour). If I wasn’t so lazy, I could probably work that out to a number of processors per hour. Well, lazy and I probably have better things to do.
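Okay, fine, here’s the lazy back-of-the-envelope version (the die count and yield are pure guesses - they vary wildly with die size and wafer diameter):

```python
wafers_per_hour = 40        # the KLA figure quoted above
die_per_wafer   = 150       # assumed, for illustration only
yield_fraction  = 0.85      # assumed

good_die_per_hour = wafers_per_hour * die_per_wafer * yield_fraction
print(good_die_per_hour)    # -> 5100.0
```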

Thanks again, guys …

Much of the fab part can be done optically; other tests (sheet resistance, film stress, etc.) are done by electrical or other means. The test guys have the final say, and they do it (mostly) electronically.

Dumb tangent, and not particularly relevant, but I once heard someone say “Let’s play machine shop. I’ll build something and you snort and reject it!” By minimizing the number of physical defects on the wafers, the folks in the fab help control the individual bad gates (or metal lines, etc…) and build quality in. The Test guys check the finished product to see whether we did it right (and occasionally snort and reject it.) :wink:

I work as a Design Test Engineer for a semiconductor manufacturer, which means my primary job is initial functional testing of new designs and characterization, so I am somewhat removed from production sort and yield issues. Since we are primarily an EEPROM and DCP company, PQR has the luxury of being able to test pretty much all of the chip. Still, test time is money (and a lot of it), so one of the first things that happens is we start to look at correlations: if chip X passes Test A, it is 99% likely to pass Test B, so just do Test A, that sort of thing. As the die get smaller, well-thought-out vectors become more important, as do things such as ET data. In fact, most design programs will pretty much generate test exercise vectors automatically; these are vectors that will exercise the largest number of gates with the fewest number of vectors. Give me a day or so, and I’ll ask for a more detailed explanation when I meet with one of the PQR people tomorrow to go over char data.
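That test-correlation trick amounts to something like this sketch - estimate P(pass B | pass A) from historical data and drop Test B once the conditional pass rate is high enough. (The history below is fabricated for illustration.)

```python
def conditional_pass_rate(history: list[tuple[bool, bool]]) -> float:
    """Estimate P(pass Test B | pass Test A) from (passed_A, passed_B) records."""
    b_results_given_a = [passed_b for passed_a, passed_b in history if passed_a]
    return sum(b_results_given_a) / len(b_results_given_a)

# 1000 parts passed Test A; 990 of those also passed Test B.
history = [(True, True)] * 990 + [(True, False)] * 10 + [(False, False)] * 25
print(f"{conditional_pass_rate(history):.1%}")   # -> 99.0%
```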


>>Being Chaotic Evil means never having to say you’re sorry…unless the other guy is bigger than you.<<

—The dragon observes

The plot thickens a bit - if I’m reading this right, then there’s a 1%, probably smaller, chance it’ll fail test B? This is not meant in any way as a criticism of the semiconductor industry, you guys all do some amazing things and it’s a miracle anything works at all. But … what I’m hearing here is that there’s a chance that defective chips could get sold.

Actually, this could explain that cheapo memory I bought and had to return - the sales associate (I use the term in the most derogatory manner possible) gave me that ‘Have you ever heard of static’ look. Of course I have, you idiot - one of the things just flat out didn’t work.

I suppose if you’re Intel and expecting that I’m going to fork over $1,000+ for the latest Pentium flavor, you could afford to boot it up and seriously stress test it? If you’re marketing a simple chip that retails for $3.99 at Radio Shark, you can either: completely test each one (could actually be feasible); or do your best but assume that one out of ten thou are going to be bad and take your lumps.

I work at a company that makes IC testing equipment, primarily probe cards. We’re currently developing a couple of tight-pitch, Multi-DUT (Device Under Test) probe cards. For more info, see www.cerprobe.com . This site contains a lot of good information about IC testing.

cornflakes,

Perhaps it’s just a matter of semantics, but what you are describing as optical test is what I would call optical process control. The difference being that, unless a wafer is rejected due to high defectivity, all of the die are passed on in the production process. The purpose of this kind of inspection is not the removal of defective units from the population, but detecting when something goes awry in the fab process.
SuperNerd asked:

Absolutely, however there is a bit more to the story. All chips have a number of specifications that spell out the absolute worst case operating conditions and performance characteristics. The test environments usually “push” these conditions past the extremes - we call it guardbanding. The test environment is the chip’s worst nightmare. The application designer works in sort of the opposite direction. He/she will choose components based on specs that far exceed his/her expected requirements.

However, sometimes defects still slip through our nets. It is not uncommon for chips to have latent defects that look OK when we’re testing them, but after a bit of use they change characteristics and begin to function in some substandard manner. We even have tests to try and reduce these risks, called reliability tests. A process called burn-in is one of the most common ways to do this. Thousands of chips are placed on special boards which are then put into ovens and the chips are set about on some activity. There’s this somewhat morbid term called infant mortality, and the associated theory predicts that the failure rate of a population of chips will resemble the curve of a bathtub. There will be a measurable number of chips that fail early in their life, then a long period of relatively few rejects, until you start to reach the wear-out region where the incidence of rejects starts to climb again (imagine a bathtub that is cut lengthwise - it’s high on the two ends, but low and flat in the middle). The purpose of burn-in is to truncate the bathtub on the left side and eliminate the infant mortality rejects.
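If it helps to visualize the bathtub, here’s a toy failure-rate curve with the three regions I described; every parameter is invented purely to draw the shape.

```python
import math

def failure_rate(hours: float) -> float:
    """Toy bathtub curve: infant mortality + constant random failures + wear-out."""
    infant   = 5.0 * math.exp(-hours / 100.0)                 # dies off quickly
    random   = 0.01                                           # flat floor during useful life
    wear_out = 0.001 * math.exp((hours - 80_000) / 2_000)     # climbs at end of life
    return infant + random + wear_out

for t in (1, 500, 10_000, 90_000):
    print(t, round(failure_rate(t), 4))   # high, falling, flat, rising again
```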

No customer worth their salt is going to accept untested material. All chips are tested to some level. Generally (but not always) the cost of test is proportional to the complexity of the chip. We test chips that sell for only 50 cents each - it just doesn’t cost as much to test these as it does to test a Pentium II. There are occasions where the test costs are not proportional to the chip complexity. In these cases we have to elevate the chip costs accordingly… One factor in the overall chip cost has to be your cost of test, and this is a function of the time to perform the tests and the cost of the equipment required to perform the function. BTW, your typical ATE (chip tester) will set you back about a million damn bucks (depending on configuration). Plus, you need another piece of equipment called a handler that presents chips one at a time to the tester. Handlers can set you back anywhere from $200K to $500K (and above). A reasonably sized production test floor will have a few dozen of these tester/handler configurations. We’re talking some serious dollars that chip manufacturers invest in testing their products.
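Just to sketch how those dollars turn into a per-chip cost of test (every number below is an assumption except the rough tester price I quoted):

```python
tester_cost      = 1_000_000            # dollars, "about a million bucks"
handler_cost     =   300_000            # assumed, within the $200K-$500K range
depreciation_yrs = 5                    # assumed
utilization      = 0.8                  # assumed fraction of time actually testing
test_time_s      = 2.0                  # assumed seconds of tester time per chip

seconds_available = depreciation_yrs * 365 * 24 * 3600 * utilization
cost_per_second   = (tester_cost + handler_cost) / seconds_available
print(f"capital cost of test per chip ~= ${cost_per_second * test_time_s:.3f}")
# -> roughly two cents of tester time per chip, under these assumptions
```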

Probably more than you wanted to know… sorry, I got carried away.

Strainger,

You guys make some good products. I’ve been using them for years! We’re about to start using your Vi product - I hope it’s more robust than Wentworth’s Cobra matrix…

JoeyBlades, you’re absolutely right. Test is test, and what we do in the fabs keeps things clean. I originally read the OP as a question about how the entire process is maintained.

If I might add a tangent, how forgiving are PCs and software to the minimal faults that get through? For example, I seem to recall that the multimedia instruction sets are basically shortcuts that take the place of common routines. If one of these instructions failed, would my PC notice it and try it again another way?

The cutting edge of this sort of thing is called JTAG, though I do not know what the acronym means. There are special shift registers (memory locations) that record the state of a large number of the gates of the chip when in a special testing mode. The JTAG register is then read serially and you can tell where there may be a problem.

I know that the Motorola PPC’s support it, but I do not know about the Pentiums.

Actually, JTAG is old and it is mostly used to test the interconnections of chips on boards.

That said, lots of people use the JTAG pins to also test the chips via a method called scan. I believe that most everybody uses scan to test complicated logic chips because otherwise they would take forever (literally) to test.
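The gist of scan, in toy form (this is just the shift-register idea, not any real implementation): the chip’s flip-flops are chained into one long shift register, so you can shift a test state in through a pin, clock the logic once, and shift the captured result back out.

```python
def scan_shift(chain: list[int], serial_in: list[int]) -> list[int]:
    """Shift serial_in into the chain, one bit per clock; return the bits pushed out."""
    shifted_out = []
    for bit in serial_in:
        shifted_out.append(chain[-1])     # last flip-flop drives the scan-out pin
        chain[:] = [bit] + chain[:-1]     # everything moves one stage down the chain
    return shifted_out

flops = [0, 0, 0, 0]                      # pretend these are the chip's flip-flops
print(scan_shift(flops, [1, 0, 1, 1]))    # old state comes out: [0, 0, 0, 0]
print(flops)                              # new test state loaded: [1, 1, 0, 1]
```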

I just have to jump in here and mention that you folks are beating the crap out of ignorance on this thread – I’m learning a ton. Keep up the good work!

Where in the process does one encounter implant metrology, and how does that work? I understand that they do this non-destructively, but simple white-light optics doesn’t intuitively seem up to the task.


Livin’ on Tums, Vitamin E and Rogaine

JTAG - Joint Test Action Group
(AKA IEEE 1149.1)
(AKA Boundary Scan)

It would only be used as one of several in a suite of tests. JTAG, on its own, would probably not yield very comprehensive coverage. There are a number of different varieties of scan and I think they are all sometimes lumped colloquially under the JTAG banner because they share some of the same architectural features, though I see that changing rapidly in the industry. Mostly everyone just says “scan” and then you have to get into details as to whether it’s JTAG, LSSD, TAP, PRPG, GPTR, MISR, ABIST, LBIST, and let’s not forget MBIST, or combinations of some of the above, or others that I may have failed to mention due to their proprietary nature, etc… *(I hope you’re all suitably impressed with my hardcore knowledge of scan test acronyms)*

BTW, with respect to the Motorola PPC, the chip has features that support JTAG testing of the “target system” (i.e. application). It may also be true that Motorola uses these features in some way to augment the testing of the chips, but I don’t think they would advertise that…

Oops, thanks for clearing that up about JTAG, I just wanted to point out that there are ways of testing the chip that do not involve an “application-like” test.

The chips I work with are just one big transistor, actually a bunch of transistors in parallel, which simplifies testing a lot.

The trick comes when you test it at 4 GHz when it is generating 150 W of power.

Yikes! We are from different worlds, you and I…