Computer people - share your stranger bug fixes

What are some of the stranger bug fixes that you’ve made?

I have an application that tracks people’s weight over the course of 6 years. They come in for a checkup, get weighed, then their weight gets entered into the app and we can see how much they’re losing. We can run a report and look at every single patient, and apply various filters to narrow the data down.

One user came back to me with the useful information “It doesn’t work.” So I tried it, and indeed, I was getting an odd error. Never got it before, so I decided to filter by doctors. It was, indeed, one doctor who was making the error happen.

Long story short, it was one patient who was blowing up the system. And the error I was getting was “Cannot convert numeric to float.” WTF? I looked at all of the patient’s weight entries and there was nothing odd. She never weighed 0 pounds, or -27.3 pounds, or ‘jujubee’ pounds. Just nice normal numbers.

One thing we track is percent excess body weight lost. If a patient is supposed to weigh 150, and they weigh 200, then they have 50 pounds excess body weight. If they then lose 20 pounds, then they’ve lost 40% of that weight.

Pretty simple math, right?

This particular patient is supposed to weigh 134, and she started out at 140. If she just lost 6 pounds, that would put her at 100%. Instead she went up to 204. That’s a loss of -1066.666666666667%. The decimal data type wasn’t happy about that. I changed it to float, and now it works.
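For anyone who wants the arithmetic laid out, here’s a minimal sketch of the calculation (in Java, purely as a worked example; the real app was a database report, so the method and variable names here are my own invention):

```java
// Worked example of the "percent excess body weight lost" math from the story.
public class ExcessWeightLoss {

    // excess  = startWeight - idealWeight
    // lost    = startWeight - currentWeight
    // percent = 100 * lost / excess
    static double percentExcessLost(double idealWeight, double startWeight, double currentWeight) {
        double excess = startWeight - idealWeight;
        double lost = startWeight - currentWeight;
        return 100.0 * lost / excess;
    }

    public static void main(String[] args) {
        // The normal case: supposed to weigh 150, started at 200, lost 20 pounds.
        System.out.println(percentExcessLost(150, 200, 180)); // 40.0

        // The patient who broke the report: supposed to weigh 134, started at 140,
        // went up to 204. Only 6 pounds of excess against a 64-pound gain.
        System.out.println(percentExcessLost(134, 140, 204)); // -1066.666...
    }
}
```

With only 6 pounds of excess in the denominator, a modest gain turns into a four-digit percentage, which is presumably what the narrowly sized decimal column couldn’t swallow but a float could.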

But it was WEIRD.

And someone get that woman a treadmill.

It wasn’t exactly a code fix, but it was definitely a strange bug fix…

We use a proprietary database for our alumni and giving records. One year, I started running our annual report. Back in the days before the company mercifully upgraded the reporting module, this was a process that could take four or five hours at a time, so I left it to run in the background while I did other work. About four hours later, I finally heard the “ding” that signaled the end of the report…only the report wasn’t done; it had crashed after four hours of running. The error message simply read “Numeric Stack Overflow”, with no indication of where among the 30,000+ records something might have gone wrong. I figured it was a blip, and before I left for the day I launched the report again. I came in the next morning to the same error message.

For the next few days I ran successively smaller segments of the database until I finally identified the record that was causing the overflow. I opened the record, but nothing seemed particularly unusual about it…until I looked at the giving history. The first gift for the record was dated 1/1/1751. A quick look through the doorstop manual confirmed that “(the database) does not support gift dates before the year 1801.” Great: the database allowed the user to enter a date prior to 1801, but anyone unfortunate enough to run a report that included that record would be tearing their hair out trying to find it. Way to go, guys.

Postscript: a bit of checking around the office revealed that the gift date was not put in by mistake, but was intentionally changed to 1/1/1751, for reasons unknown.

That is weird. I could see it being 1753, but 1751 was before the switch to the Gregorian calendar.

A little over a decade ago I was running a financial report for the company I worked for. It was a basic Word doc with a table full of dollar amounts, summed up at the bottom. For some reason it was always coming out $2000 over. We’d add up the amounts by hand, or through Excel, and it always came out right.

After days of troubleshooting, I came to a realization. Had I run that report a week earlier – in late December – it would have been off by $1999.

Again, not a straight bug fix, but a most illustrative story -

Working on a radar data processor (RDP), which was basically a big card rack for proprietary circuit cards hand built just down the road at another branch of the company. Suddenly, the RDP is blowing up cards - we’d get the new, tested cards, plug them into the RDP, they wouldn’t work, and when retested back at the production facility, they’d be shorted out in random ways. Testing all the voltages, etc. in the RDP turned up nothing useful. Finally, after a few weeks (and many expensive cards), the head tech got a new card, opened the box, closed it back up, and sent it back. The card was bad.
When the tech went to the production facility and demanded to see how they tested the cards, he was shown how the cards were run through a circuit emulator box and then stamped ok when they passed.
The tech stuck an ohmmeter on the surface of the ink pad used to stamp the cards as ok, and found that the testing group had saved a few cents on a cheaper brand of ink pad. The ink was conductive. Every time the card tester stamped the back of a card as ok, they shorted out a random set of traces on the circuit board…

That reminds me of a story a former coworker told.

For a while he had his own business.

One client kept complaining that the application wouldn’t install. They spent weeks trying to figure out why.

It was their practice to send their applications to clients on a set of floppy disks. All of the disks in a set would be in black casings, but disk 1 would have a yellow casing. There were clear instructions to use the yellow disk first. But the client kept throwing them away because “It looked like it went bad.”

A bit old school, circa the late ’90s, working with mid-’90s-level stuff.

Had a project that required a specific computer with a specific program on it running an early version of Windows that could go to DOS mode. That computer was a bit flakey and so was the program. I had seen both crash several times in the recent past.

Well, it’s Friday afternoon and it looks like I’ll be pulling an all-weekender, so I ask the keeper of the software to leave it where I can get to it if I need to. He refuses to do so (jerk). Well, sure enough, somebody else pulling a weekender crashes that damn computer late Friday.

I return early Saturday with my vast personal collection of DOS tools on floppies and my crazy mad DOS skilllzz. I eventually get the computer back alive and then the needed program. But, when I run it to process my data, it comes back with a “subroutine/file XYZ not found” and stops. I spend hours trying to resurrect/find that damn subroutine/file and nothing works.

There was just no practical way to process the data manually. I HAD to HAVE that program working. Finally, after much poking about in DOS directories, looking at subroutine names and the program itself I realized there was a good chance the subroutine it couldn’t find was not something the program would actually use to process my data.

Then I had a bright idea: just take some other file, make a copy of it, and rename it to whatever the program was looking for. I did it and wham, bam, thank you ma’am, it worked. Got my data processed and proceeded to work on the project from there.

When Monday came around I did a proper reinstall and double checked my data (and it was fine, no change).

There’s a game design class at my university, and there’s a group of us making a Hero Defense game. So we’re getting basic functionality implemented, and finally move on to our first spell, a little fireball “bullet.” It seems to be working fine: you prime the spell, click, and it fires. It gets to the spot you clicked and stops, cool, that’s intended. If it hits something it plays its animation and deals damage. Excellent. If you cast another spell or move, it homes in on the new location. Wait… what?

Now, we’re using Slick2D in Java, and in our UI layer, using their Vector2f class. So we pass the Vector2f into the spell so it knows its destination, and store it. Wait… Vector2f is an object, and Java passes a reference to that object. So whenever you click, the click location in the UI layer gets changed. So therefore the vector we passed gets changed. >.< It was fixed by cloning the vector.
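Here’s a minimal, self-contained sketch of that aliasing bug and the cloning fix. The Vector2f below is a tiny stand-in rather than the real Slick2D class, and the Fireball class is made up, so treat it as an illustration of the idea, not our actual project code:

```java
// Stand-in for Slick2D's Vector2f, just enough to show the aliasing problem.
class Vector2f {
    float x, y;
    Vector2f(float x, float y) { this.x = x; this.y = y; }
    Vector2f(Vector2f other)   { this(other.x, other.y); } // defensive copy
    void set(float x, float y) { this.x = x; this.y = y; }
    public String toString()   { return "(" + x + ", " + y + ")"; }
}

class Fireball {
    private final Vector2f destination;

    Fireball(Vector2f clickLocation) {
        // Buggy version: this.destination = clickLocation;
        // Storing the caller's object means the fireball "homes in" whenever
        // the UI layer reuses that same vector for the next click.
        // Fix: clone the vector so the spell owns its own copy.
        this.destination = new Vector2f(clickLocation);
    }

    Vector2f getDestination() { return destination; }
}

public class AliasingDemo {
    public static void main(String[] args) {
        Vector2f click = new Vector2f(100, 100); // the UI layer's reused click vector
        Fireball fireball = new Fireball(click);

        click.set(300, 50); // the player clicks somewhere else

        // With the clone, the fireball still heads for (100.0, 100.0);
        // without it, this would print (300.0, 50.0) and the fireball
        // would chase the new click.
        System.out.println(fireball.getDestination());
    }
}
```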

The curveball homing-fireball swarms were kind of cool though, so for a while we hacked the same behavior back in for other instances of the fireball spell using a static variable, but when we moved to our new component-based spell system we ended up having to remove it completely. Shame.

(re: conductive ink) That’s almost better than the tale of a hard drive manufacturer that was seeing batches of bad drives. They scratched their heads and started analyzing the life history of the bad drives, only to find one assembler who’d insert the platters and head assembly into the frame, then whack the unit against the bench to “seat” the parts. :smack:

Back in the early 80s I was converting a finite element design app from CDC 7600(?) FORTRAN into FORTRAN IV for the IBM 370/MVS. This was 20K lines of spaghetti code written & maintained by civil engineers over a decade. IOW a real nuclear-powered mess.

It had all sorts of cool stuff using EQUIVALENCE to overlay different-shaped data structures into the same RAM locations. Which was OK enough except the CDC was a 36-bit 6-byte word and the IBM was a 32-bit 4-byte word. And one system laid out arrays in row-major order while the other was column-major. Getting that reworked was a huge headache. But isn’t the topic of this war story.
The real craziness started when I first got it clean enough to even compile & start to run. I’d submit our job & nothing would come back. Typical turn-around was 3-ish hours, but I’d get nothing, not even a compiler report. So I tried again about 4 hours later. Eventually I added an extra job step at the end, set to run regardless of the success or failure of the steps ahead of it, to at least produce something. Got nothing. Again. And again. Started to wonder if my account had been disabled or there was something hinky in my JCL. But it all tested good.

The 10th-ish time I tried this I found a handwritten note in my output tray saying to call the head systems programmer ASAP. It didn’t look like he was happy.

Turns out I had been crashing MVS. Not just blowing my partition, but taking the machine all the way to stone dead. In modern terms, we BSODed an MVS mainframe. For modern PC users that doesn’t sound like a big deal, but BSODing a mainframe is/was rare enough to have IBM itself real interested from the start, and even more so when it continued to happen. The PTB were not happy & initially accused me of vandalism, since the term hacking hadn’t been invented yet.

[Insert about a week of frenzied analysis by our & IBM’s serious propeller heads.]

Turns out the issue was that the initial entry point routine of the code did some error handling setup, then called the inner routine which did the actual work. Some clever engineer back in CDC world had named that inner routine “Main”. And unbeknownst to all of us, the IBM FORTRAN compiler automatically names the outer entry point routine “Main” without saying so. And somehow neither the compiler nor the linker saw the duplicate name issue.

So what happened when the code ran was in essence an infinite recursion loop. The call which should have gone from the initial entry point routine into the inner Main was routed instead back to the initial entry point Main. Which should have just led to an out of memory error & the code being shut down by the OS.

Instead, our entry point routine didn’t allocate any memory itself. But what it did do was call into the OS’s master exception handler setup logic (STAE for you legacy IBMers out here). And when that happened, the OS went into the IBM equivalent of kernel mode and allocated a new exception management data block & wired all the necessary linkages together. And that OS code was not written to deal with an out of memory condition when it tried the allocation.

Which would happen after a few hundred recursions, IOW after a second or so of task startup.

So once the OS figured out it had clobbered kernel memory it just halted. The IBM OS engineering support folks were flat amazed this was happening.
The fix? Rename that one routine, called from just that one place, from “Main” to “Main1”. Voilà.
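For readers who don’t speak FORTRAN or MVS, here’s a loose Java analogy of the shape of that bug. Everything in it (the names, the depth guard, the printouts) is invented just to show how a name collision can turn a hand-off into unbounded recursion; the real failure mode was the OS’s exception-handler setup, not a language-level stack overflow:

```java
// Loose analogy only: the real code was FORTRAN on MVS, and the real failure
// was kernel-side memory exhaustion during exception-handler setup.
public class DuplicateMainAnalogy {

    private static int depth = 0; // guard so this sketch terminates

    // What the engineers thought they had: a thin entry point that sets up
    // error handling, then hands off to the worker routine named "Main".
    // Because the compiler silently gave the entry point the same name,
    // the call bound right back to the entry point itself.
    static void mainRoutine() {
        setUpErrorHandling(); // the STAE-style setup from the story
        if (++depth < 5) {    // the real code had no guard: unbounded recursion
            mainRoutine();    // meant to reach the inner worker, resolves to itself
        } else {
            System.out.println("...and so on, until something runs out of memory");
        }
    }

    static void setUpErrorHandling() {
        System.out.println("exception-handler setup #" + depth);
    }

    public static void main(String[] args) {
        mainRoutine();
    }
}
```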

But the systems guys there hated my guts until I left.

I’m sure I have some, but not many are springing to mind at the moment. One that I can recall: I was a tech support person at the med school. A doctor called from one of the labs complaining that his computer was pink.

As I recall, the computer was a Mac IIcx. Sure enough, the monitor was profoundly pink. Okay, check the software, all is well, check the monitor, all is well. Time to open up the case.

Inside, I see the dustiest computer I have seen, before or since. Most of the motherboard isn’t even visible under the thick layer of dust. Yikes. A not-so-quick trip into the hallway with a can of compressed air followed, as did a small, extremely localized indoor dust storm.

And that was it. The dust had choked or shorted something on the video such that it was only displaying the red channel; problem solved with a brief cleaning. Since then, I’ve seen other computers have problems with dust, but they usually result in broken components, not a jolly pink display.

1066? Clearly she lost the Battle of Fastings.

Not mine, but I enjoyed the “More Magic” story.

Not a bug fix, but a fun bug cause.

We were rolling out thin clients to remote offices and there was some location specific data that needed to be configured at the time of installation.

To support this we created an image that during the first boot up ran a script that searched the network for some specific information and reconfigured the thin client.

We then contracted a certain technical company to install the provided images on 5000 clients and then have their techs install them onsite. After the first two installs, where 50% of the terminals required reimaging on site, we began to dig into why.

Turns out the vendor had been testing the terminals before shipping them to the site. 50% of them. They worked great in the factory and performed as expected :slight_smile: Of course, that factory power-up counted as the first boot, so the one-time configuration script had already fired on the wrong network long before those terminals ever reached the site.

Between the time I wrote the first reply in this thread and now our proprietary database has thrown up another doozy.

The database program is made up of several modules, two of which are called “Query” and “Export”. You’re probably thinking you can only export from the “Export” module, right? Of course not…you can export from the “Query” module just fine. Even better, you can export certain fields only from the Query module. I wrote a program in VBA that takes advantage of this strange state of affairs. But today, when I ran said program, the last few lines of the export to Excel had disappeared.

A quick analysis revealed that while the “Export” module allows you to export up to 1 million lines to Excel, the Query module is stuck in time, and only allows the old limit of 32K lines. The reason? The company decided not enough people would complain about the Query export limitations to make the necessary upgrades.

This was over 30 years ago.

I was having weird and inconsistent errors with my program. After hours of debugging, I was able to isolate the section of code that was corrupting registers. After several more hours, I was able to prove that the code was “bullet proof”: it should not have been possible for the code to have errors!

After over a day of debugging for ONE error, I finally realized that the compiler had a bug, and was compiling my code incorrectly! The fix was simple, once I realized what the compiler bug was. Instead of performing a calculation within a loop condition, I created a temporary variable to store the calculation, and then tested the value of that variable in the loop condition.
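The workaround pattern looks something like this (a hedged sketch in Java; the original was 30-year-old code on a long-gone compiler, so the loop and names here are invented):

```java
// Sketch of the workaround: hoist a calculation out of the loop condition.
public class LoopConditionWorkaround {
    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6};

        // Original style: the calculation lives inside the loop condition.
        // This is the form the buggy compiler got wrong.
        int i = 0;
        while (i < data.length / 2 + 1) {
            i++;
        }

        // Workaround: compute the value once into a temporary variable and
        // test that variable instead, sidestepping the miscompiled pattern.
        int limit = data.length / 2 + 1;
        int j = 0;
        while (j < limit) {
            j++;
        }

        System.out.println(i + " " + j); // both print 5 on a correct compiler
    }
}
```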

I was torn between feeling angry for wasting so much time over one error that wasn’t even in my code, and elation at finding such an obscure and unreported error in the compiler.

I watched a co-worker do this one, quite a while ago. We were using Tandy TRS-80 computers with audio cassette tape decks, in lieu of disk drives, on the desk next to them. She got a data error somewhere near the middle of the dataset. So she calculated what fraction of the cassette’s read time it should take to reach the bad record, played the cassette that many seconds from the start of the file, then hit the red Record button and wrote one record. I tried to explain to her why that wouldn’t work, but she wouldn’t listen. Neither would the computer; it worked fine.

In the late 80’s I was leading a development project for an inventory system. All Cobol code with a green-screen IMS interface. During final user acceptance testing we’d have end users bang on it from 9:00 to 12:00, then we’d stop it from 12:00 to 13:00 to install updates, and let them back at it for the rest of the day. The production rollout went well except for one user occasionally complaining they couldn’t access the system on Wednesdays. We’d check all the available logs and performance data and everything was fine. We finally sent someone out to sit with this user to see what was happening. It turns out that on Wednesdays this user used a different desk. One that had a terminal that was used for the acceptance testing. One that had a Post-it stuck to it that said “Don’t use between 12:00 and 13:00”. Removing the Post-it fixed the ‘bug’ for this user.
Not a software bug but a process bug:
One heard from a friend who had done PC support in the early days. A small company had a single PC that the big boss used to store the company’s important data. The boss delegated the backing up of the data to his PA. The PC support guy wrote up explicit instructions for the PA on how to do the backups. One day the PC’s hard drive dies, so my friend replaces it and goes to restore the data from the last backup. The diskette is blank. Try the previous week’s diskette, blank too. Try several at random and they’re all blank. The boss is about to fire the PA on the spot for not doing the backups. The support guy asks her to show him how she did the backup. She pulls out her cheat sheet, which reads something like:

  1. Select a new diskette
  2. Put in drive
  3. type ‘format a:’

She says, “See, I do all these steps religiously every time!” The support guy takes the cheat sheet from her hand, turns it over, and it says:

  4. type ‘c:\backup.bat’

Finally, one from an ACM journal from the early 80’s. A user would occasionally find they were getting ‘bad password’ errors from a relatively simple password. The strange bit was that if they were sitting down they could get it right every time, but get it wrong every time while standing up. The ‘bug’ was that 2 key tops on the keyboard were switched. The user was a touch typist and didn’t look at the keyboard while sitting, but did one-finger hunt and peck while standing.

I’ve posted this before, but about 3ish years back, I was working on our desktop computer. I had to switch from my son’s user account to mine, and something blipped on the display.

And when it unblipped… it was a mirror image. As in, everything read right to left, as if we were looking in a mirror.

Now, I don’t mean that trick where you hit ctrl-alt-uparrow or whatever to rotate the display. That was the first thing I thought of: that I’d somehow hit that key sequence by mistake. So trying it did nothing.

Nor did uninstalling / reinstalling the video driver. Nor tweaking all the monitor settings, nor talking to tech support for the computer. Nor emailing tech support for the monitor.

Finally, we borrowed a monitor - and saw that it worked perfectly. So clearly the problem was the monitor.

Well, as it turned out, the damn monitor had to be UNPLUGGED, not just turned off… and it reset itself.

Oh - and indirectly related to that monitor:

We bought a used computer from a colleague. Took it home, set it up… and it went through bootup and we started to use it and it froze. Sometimes it would display the BSOD.

Other times, it would not even make it through the full bootup cycle.

The colleague took it back home and tried to figure out what was wrong… and it worked perfectly. Several cycles of this and the same damn thing happened. He replaced the hard drive. Same damn thing.

Finally we thought “interference”… we had one of those “make a fake phone outlet” things that used house wiring, and the computer was using that for internet, so we removed that. It worked fine.

Then we moved things around to put the computer on the desktop as we didn’t want people tripping over it.

It crashed again.

We’d noticed the monitor display (an old CRT from 1994 or so) was not the greatest… so we put two and two together and realized that the damn thing failed every time it was physically too close to the monitor.

Yep, the monitor was emitting some RF or something that was confusing the hell out of the circuit board.

So we put the computer back on the floor, told the kids to be careful around it… and it worked fine.

Needless to say we replaced the monitor (with the LCD one that went into mirror-image mode).

We’d been going NUTS - thought the new computer didn’t like us, or we had bad feng shui, or something… when all along it was a wonky monitor and a badly shielded case.

The company I work for sells a family of embedded devices. I was the operating system guy working on a new prototype motherboard, when one of our hardware engineers asked for my help with something. One of the stress tests we run on new boards involves sending network traffic at high rates to the board. The device is expected to echo all of the traffic right back out, and we verify that no packets are dropped or corrupted. Some boards were failing this test because they were resetting quite early in the test with no explanation as to why. The hardware guy tried replacing various components, and eventually verified that the failure followed the CPU: no matter what board he put the CPU in, that board would reset during the test. Even more disconcerting, he found several CPUs that failed in the exact same way. He thought that the CPUs might be bad, but I knew that the problem had to be on our end somehow.

He showed me the failure. The first thing that I noticed was that the system actually became completely unresponsive for about 30 seconds before it finally got reset. It took me a couple minutes, but I finally remembered that we had a last-ditch failsafe built into the hardware that would reboot the machine if it was unresponsive for 30 seconds. I disabled the failsafe and reproduced the failure, and verified that the system now hung indefinitely instead of being rebooted. Ok, that explained the reset, but it was really bad that this failsafe was triggering. This was the absolute last-resort recovery method. For this failsafe to be kicking in, some other failsafes that were able to produce some debugging information about the failure were not even triggering.

Fortunately this particular CPU had some hardware-based debugging capabilities via a JTAG port. Basically you could attach a special piece of hardware to the CPU that would allow you to single-step through instructions or examine memory and register contents from another machine. So we got that set up, and tried to break into the CPU when it hung. Nothing doing. The CPU was almost completely unresponsive even over the JTAG port, which meant that it had gotten into a horribly bad state. What this did tell me was that the software could not be at fault. Something was going wrong with the hardware itself. I told the hardware guy all this, and theorized that there was some power problem on the board. Maybe one of the power rails went marginal when the network card was stressed, and certain CPUs were more sensitive to out-of-spec voltage than others.

After a couple of days he gets back to me. He swears up-and-down that the power is absolutely perfect: everything is well within spec when the failure happens. Also, he’s discovered some more CPUs that fail. The first batch that he’d found would get hung within seconds of starting the test. Some of the new failures would only fail after minutes. And some other ones would only fail after hours of testing. Overall, about 10% of the CPUs that we’ve received fail in this way.

He’s really starting to believe that the problem is bad CPUs. I’m still rolling my eyes, but he’s insistent so I humour him and come up with a test that can rule out the actual CPU being the problem. It’s pretty easy in concept: we have a reference motherboard designed and manufactured by the maker of the CPU. All we have to do is run the same test we run on our hardware on the reference board. If the failure happens there, with all of our hardware removed from the equation, the problem must be the CPU. Now it’s not quite that easy: it takes me about half a day to write a new test that mimics the essentials of our network test, but run on the reference board. We try it and, surprise, surprise, it doesn’t fail.

The hardware guy goes off to investigate more. Eventually he comes back and says that he’s not entirely sure which CPU we tested on the reference board. It might have been one of the ones that didn’t fail immediately, but only after minutes or hours. He wants to re-run the test with a CPU that he knows will fail in seconds.

Well, now that the test has already been written that’s easy enough to do, so we set up the test. Within five seconds the reference board was completely hung. My jaw hit the floor at this point. Honestly, I think that I could probably go the rest of my career without ever running into another CPU bug like this again.

So we write up a bug report for the manufacturer and send them our reference board test setup. Honestly, I was expecting a little more reaction from the company. Here we had proved that if you sent network traffic at high rates at a system with their CPU, it stood a good chance of freezing completely. We got one support-level engineer looking at it for 2-3 months. Then they finally escalated it to the design team. Within a week they announced to us that another company that used that CPU had run into the same issue, so now it was the highest priority in their division.

I guess that other company must have been a big one.
I don’t really have a good resolution to the story. There was a race condition in the memory controller that our test stood a good chance of losing. Really, I just wanted to one-up the “I found a bug in a compiler” story. :slight_smile: