The company I work for sells a family of embedded devices. I was the operating system guy working on a new prototype motherboard when one of our hardware engineers asked for my help with something. One of the stress tests we run on new boards involves sending network traffic at the board at high rates. The device is expected to echo all of the traffic right back out, and we verify that no packets are dropped or corrupted. Some boards were failing this test: they were resetting quite early in the test with no explanation as to why. The hardware guy tried replacing various components, and eventually verified that the failure followed the CPU: no matter which board he put the CPU in, that board would reset during the test. Even more disconcerting, he found several CPUs that failed in exactly the same way. He thought the CPUs might be bad, but I knew the problem had to be on our end somehow.
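For context, the verification side of a test like this is simple in principle: tag each packet with a sequence number and a checksum, fire it at the device, and confirm every packet comes back intact. Here is a toy sketch of that idea in Python (our real test rig was nothing like this; the UDP echo server standing in for the device, the packet format, and the lockstep send/receive loop are all my invention for illustration):

```python
import hashlib
import socket
import threading

def run_echo_server(sock, n_packets):
    # Stand-in for the device under test: echo every datagram
    # straight back to whoever sent it.
    for _ in range(n_packets):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)

def stress_echo_test(n_packets=200):
    """Send numbered, checksummed packets to an echo endpoint and
    verify that nothing comes back corrupted. Returns the number of
    distinct sequence numbers successfully echoed."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    addr = server.getsockname()
    threading.Thread(target=run_echo_server,
                     args=(server, n_packets), daemon=True).start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)
    seen = set()
    for seq in range(n_packets):
        # Packet layout (invented): 4-byte sequence number,
        # filler payload, 16-byte MD5 digest of everything before it.
        body = seq.to_bytes(4, "big") + b"x" * 64
        client.sendto(body + hashlib.md5(body).digest(), addr)

        data, _ = client.recvfrom(2048)  # raises on a dropped packet
        echoed_body, digest = data[:-16], data[-16:]
        if hashlib.md5(echoed_body).digest() != digest:
            raise AssertionError("corrupted packet")
        seen.add(int.from_bytes(echoed_body[:4], "big"))
    return len(seen)
```

The lockstep loop here sidesteps the "high rate" part entirely; a real stress test would blast packets asynchronously and reconcile sequence numbers afterward.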
He showed me the failure. The first thing I noticed was that the system became completely unresponsive for about 30 seconds before it finally got reset. It took me a couple of minutes, but I finally remembered that we had a last-ditch failsafe built into the hardware that would reboot the machine if it was unresponsive for 30 seconds. I disabled the failsafe, reproduced the failure, and verified that the system now hung indefinitely instead of rebooting. OK, that explained the reset, but it was really bad that this failsafe was triggering at all. This was the absolute last-resort recovery method. For it to be kicking in, the earlier failsafes, the ones capable of producing some debugging information about the failure, were not even triggering.
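The failsafe was a hardware watchdog: the system has to "kick" it periodically, and if the kicks stop arriving for the timeout period, the hardware forces a reset. A toy software model of the idea (this is purely illustrative; the `WatchdogTimer` class and its callback are my invention, and on real hardware the expiry action is a board reset, not a Python callback):

```python
import threading
import time

class WatchdogTimer:
    """Software model of a hardware watchdog: if kick() is not called
    within `timeout` seconds, on_expire fires. On the real board, the
    equivalent of on_expire was a full reset of the machine."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None

    def start(self):
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def kick(self):
        # A healthy system kicks the watchdog periodically, cancelling
        # the pending expiry and re-arming the timer from scratch.
        self._timer.cancel()
        self.start()

    def stop(self):
        self._timer.cancel()

# A responsive system kicks well inside the timeout and nothing fires;
# a hung system stops kicking and the expiry action runs.
fired = []
wd = WatchdogTimer(timeout=0.1, on_expire=lambda: fired.append("reset"))
wd.start()
for _ in range(5):
    time.sleep(0.05)
    wd.kick()
healthy = list(fired)   # still empty: kicks kept arriving in time
time.sleep(0.3)         # simulate the hang: no more kicks
hung = list(fired)      # the "reset" has fired
```

On Linux this same contract is exposed through `/dev/watchdog`: a daemon writes to the device periodically, and silence past the timeout reboots the machine.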
Fortunately this particular CPU had some hardware-based debugging capabilities via a JTAG port. Basically you could attach a special piece of hardware to the CPU that would allow you to single-step through instructions or examine memory and register contents from another machine. So we got that set up and tried to break into the CPU when it hung. Nothing doing. The CPU was almost completely unresponsive even over the JTAG port, which meant that it had gotten into a horribly bad state. What this did tell me was that the software could not be at fault. Something was going wrong with the hardware itself. I told the hardware guy all this, and theorized that there was some power problem on the board. Maybe one of the power rails went marginal when the network card was stressed, and certain CPUs were more sensitive to out-of-spec voltage than others.
After a couple of days he gets back to me. He swears up-and-down that the power is absolutely perfect: everything is well within spec when the failure happens. Also, he’s discovered some more CPUs that fail. The first batch that he’d found would get hung within seconds of starting the test. Some of the new failures would only fail after minutes. And some other ones would only fail after hours of testing. Overall, about 10% of the CPUs that we’ve received fail in this way.
He’s really starting to believe that the problem is bad CPUs. I’m still rolling my eyes, but he’s insistent so I humour him and come up with a test that can rule out the actual CPU being the problem. It’s pretty easy in concept: we have a reference motherboard designed and manufactured by the maker of the CPU. All we have to do is run the same test we run on our hardware on the reference board. If the failure happens there, with all of our hardware removed from the equation, the problem must be the CPU. Now it’s not quite that easy: it takes me about half a day to write a new test that mimics the essentials of our network test, but run on the reference board. We try it and, surprise, surprise, it doesn’t fail.
The hardware guy goes off to investigate more. Eventually he comes back and says that he’s not entirely sure which CPU we tested on the reference board. It might have been one of the ones that didn’t fail immediately, but only after minutes or hours. He wants to re-run the test with a CPU that he knows will fail in seconds.
Well, now that the test has already been written that’s easy enough to do, so we set up the test. Within five seconds the reference board was completely hung. My jaw hit the floor. Honestly, I think I could go the rest of my career without ever running into another CPU bug like this.
So we write up a bug report for the manufacturer and send them our reference board test setup. Honestly, I was expecting a little more reaction from the company. Here we had proved that if you sent network traffic at high rates at a system with their CPU, it stood a good chance of freezing completely. We got one support-level engineer looking at it for 2-3 months. Then they finally escalated it to the design team. Within a week they announced to us that another company using that CPU had run into the same issue, so now it was the highest priority in their division.
I guess that other company must have been a big one.
I don’t really have a good resolution to the story. There was a race condition in the memory controller that our test stood a good chance of losing. Really, I just wanted to one-up the “I found a bug in a compiler” story.