Programmers: Your weirdest bug stories

We had a customer for which we had written a custom extension for our AR Invoice Posting process. They had a recurring but infrequent bug where they had partial batches posted to the G/L.

We checked every time it happened, and it always looked like the final step of the post was running out of sequence. This should have been impossible, because the job was executing from the JobQ, with each step putting the next step in the queue so it would run when the current step finished.

We checked and checked, and couldn’t figure out how that one step was getting started early.

Finally the customer’s IT manager noticed something that gave us a clue. It only happened when certain terminals initiated the job: those workstations whose terminal ID began with the letter E! (On the old IBM System/36, all workstations had a two-character terminal ID, which could be any two numbers or letters.)

So our program always worked on terminal W1 or X2, but (sometimes) screwed up when run from terminal E2 or E4 (or E anything). Very strange!

With this in mind, we scanned through the code for references to the terminal ID.

We finally found the problem, not with the terminal ID itself, but with the data structure that just happened to be immediately before it in memory. This structure held the name of the day of the week, which we displayed on the screens for the convenience of the operator.

The programmer who had set up the date data structure had somehow not noticed the fact that “Wednesday” has 9 letters, while all the other days of the week have 8 letters or fewer. The data structure was one character short, and on Wednesdays was pushing the Terminal ID down one byte in memory when it overflowed.

This wouldn’t have mattered, except that the data structure that happened to be on the other side of the terminal ID was a flag byte we used to tell the program whether to execute from the JobQ, or whether it should run immediately (be “evoked” in System/36 terms). If the flag was an E, the job evoked another occurrence of itself, which began to run immediately, then died. Presto! Job step out of sequence.

So not only did the bug only happen when the job ran from workstations with an ID that began with the letter E, but also only if the job was initiated on Wednesdays.

That was a bugger to find.
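For anyone who wants to see the general mechanism, here is a minimal sketch in Python (standing in for the original System/36 code, which isn't shown). The field offsets, sizes, and function names are all assumptions made up for illustration. It only demonstrates the one-byte spill out of an 8-character day-name field into its neighbor, not the exact way the real record got shifted.

```python
# Hypothetical fixed-width record layout. The real offsets are not given in
# the story, so these numbers are assumptions for illustration only.
DAY_OFF, DAY_LEN = 0, 8      # day name: sized for 8 characters (the bug)
TERM_OFF, TERM_LEN = 8, 2    # terminal ID sits right after the day name
FLAG_OFF = 10                # one-byte JobQ/evoke flag follows the terminal ID
REC_LEN = 11


def new_record(terminal_id: str, flag: str) -> bytearray:
    """Build a blank record holding a terminal ID and a flag byte."""
    rec = bytearray(b" " * REC_LEN)
    rec[TERM_OFF:TERM_OFF + TERM_LEN] = terminal_id.encode("ascii").ljust(TERM_LEN)[:TERM_LEN]
    rec[FLAG_OFF] = ord(flag[0])
    return rec


def set_day_unchecked(rec: bytearray, day: str) -> None:
    """Store the day name with no length check, like the buggy structure."""
    for i, b in enumerate(day.encode("ascii")):
        rec[DAY_OFF + i] = b  # a 9th character lands on TERM_OFF


def set_day_checked(rec: bytearray, day: str) -> None:
    """Store the day name truncated and padded to the field width."""
    rec[DAY_OFF:DAY_OFF + DAY_LEN] = day.encode("ascii")[:DAY_LEN].ljust(DAY_LEN)


def terminal_id(rec: bytearray) -> str:
    """Read the terminal ID back from its fixed offset."""
    return rec[TERM_OFF:TERM_OFF + TERM_LEN].decode("ascii")


if __name__ == "__main__":
    for day in ("Tuesday", "Wednesday"):
        rec = new_record("E2", "J")
        set_day_unchecked(rec, day)
        print(day, "->", repr(terminal_id(rec)))
```

Every day except Wednesday fits in the field, which is why a bug like this can hide for so long: six days out of seven the neighboring bytes are never touched.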

Our programming teacher (also a former Royal Marine Commando) showed us a file containing what on first glance looked like complete gibberish. No returns (carriage return, line feed) so almost no whitespace.

It was working, compilable code! I think the lecturer’s intention was to show us ‘unreadable’ code.

That was the first question I asked after looking it over.

It was f77 in the distant past. But it wasn’t run through any automated Fort->C process. The aforementioned coder was (apparently) hired to re-write the whole thing in C (from scratch). All he did was change every statement to a rough C equivalent. I think, for some bizarre reason, he got worried about having element [0] unused in all these arrays.

I want to rewrite in C++ (this task just cries out for OO), but there’s no budget for it.

Probably pulled from here.

I’ve worked in several industries, over coughmumblecough years. Dinaroozie, the first computer I ever used professionally was a slooooooow unix box. And we were programming COBOL that required all upper case. jjjjjjjjjjjjj “Oh f%&# I had caps lock on”* Oh, and the compiler name was “rmcobol”.

At this same place - Tandy for reference - I was testing the special order function. I looked through the database and found a computer that could be special ordered, called “Priam”. It was $10,000. I “ordered” as many as I could on the screen.

It broke every system it went through. :smiley:

One place I worked at, my group did physical security for government installations. Have you heard of “two-man rule rooms”? I worked on code that enforced that. Once and only once (at least that we knew about), one person was trying to get in a 2MR room and another person was trying to get out. Somehow, both terminals worked (when one person starts a transaction, the other terminal is supposed to lock up). But the strange part is that the two people involved in the transaction had their PINs reversed. I looked at the code occasionally but I never did see any case where it would even get remotely near doing something like that. Did I mention that the two people involved were the two people from the company that supervised us?

Worst screwup? Typing rm -fr from root. Logged in as the super-user. :eek:

I managed to screw up Word a long time ago. I think I managed to make a loop of different styles, so that the first one ended up referencing the last. At least, I could never open the file again.

One that happened to a co-worker - he was working a project for Martin Marietta in the deep south. The programmers hit one of these “will never happen” errors, so they put a swear word in the error message. It happened. To a rather conservative employee. Martin Marietta required them to prove that there were no other swear words, period, in the code, in addition to fixing the one that was found.

Oracle weirdness - I once was trying to pinpoint some sort of data in Oracle. I had one table, and two conditions in the where clause. Worked fine. Added one more where condition and I got exponentially more rows. And yes, I was using AND to join the conditions. I’ve never seen that since.

And then there was the person who decided to write object-oriented code in C. And didn’t comment to say he was doing so. He was also dyslexic so he wrote at least one program with “falg” in the name. Drove me nuts, but I still love him.

My fourth job I didn’t find any bugs because I blew off the job most of the time.

My current job I don’t think I’ve found any really fun bugs, but I’m working on it!

Finally, I have what is probably an urban legend. One of the Tandy employees in R&D was writing code for Tandy’s version of unix. He got to a hardware error that “would never happen”. So the error message he wrote for that condition was “Shut her down, Scotty, she’s sucking mud” :stuck_out_tongue:

* For those who don’t know vi, a lower case j goes down one line. An upper case J joins the line and the next line together.

A few years ago, one of my reports told me that, only on her machine, the process wouldn’t report status. I took a quick look, and told her to reboot, and bet her that the problem would go away. She did, and I was right: no problem.

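It turns out that the process status was using the Win32 GetTickCount() counter to report status. That counter is a 32-bit count of milliseconds since boot, so it wraps to 0 after roughly 49.7 days of uptime (or goes negative after about 24.9 days if it is stored in a signed variable), and when it turned over our code broke.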

The programmer hadn’t rebooted in more than a month, so she hit that bug. It was an easy fix, and a lucky guess on my end, but I think she still thinks I’m clairvoyant to this day.
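For the curious, this is roughly what that kind of bug and its usual fix look like. It is a sketch in Python standing in for the Win32 code, which isn't shown here, so the function names are hypothetical. The only fact taken from the Win32 documentation is that GetTickCount() returns an unsigned 32-bit millisecond count, which wraps after about 49.7 days.

```python
TICK_MASK = 0xFFFFFFFF            # GetTickCount() is an unsigned 32-bit value
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap() -> float:
    """Uptime, in days, at which a 32-bit millisecond counter rolls over."""
    return (TICK_MASK + 1) / MS_PER_DAY


def elapsed_naive(start_ticks: int, now_ticks: int) -> int:
    """Plain subtraction: goes wrong once the counter has wrapped."""
    return now_ticks - start_ticks


def elapsed_wrap_safe(start_ticks: int, now_ticks: int) -> int:
    """Subtraction modulo 2**32: correct across a single wraparound."""
    return (now_ticks - start_ticks) & TICK_MASK


if __name__ == "__main__":
    start = 0xFFFFFF00            # sampled just before the counter wraps
    now = 0x00000100              # sampled just after it wraps
    print("days until wrap:", round(days_until_wrap(), 1))
    print("naive elapsed ms:", elapsed_naive(start, now))
    print("wrap-safe elapsed ms:", elapsed_wrap_safe(start, now))
```

In C the masking happens for free if both values and the difference are kept in unsigned 32-bit variables; the trouble usually starts when the tick count is compared directly or stored in a signed type.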

Quick background info:

(1) Delphi has a built-in string type, which can hold either a pointer to reference counted string data, or a null pointer to represent an empty string. The compiler handles all reference counting automatically when you assign a new value to a string variable. To avoid crashing when you first assign a value to a string variable, the compiler initializes string variables to zero (i.e. empty strings), even though most other variables are not automatically initialized.

(2) Functions in Delphi return a value by assigning it to Result, which is (or at least appears to be) a local variable of the same type that the function returns.

So it took me forever to track down the bug illustrated here:


function MakeXs(num: Integer): string;
var
  i: Integer;
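begin
  // Result is never initialized here. Delphi passes a string result as a
  // hidden var parameter, so it can still hold the previous call's value.
  for i := 1 to num do
    Result := Result + 'X';
end;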

procedure Test;
var
  str: string;
begin
  str := MakeXs(5);
  Writeln(str);  // displays 5 Xs
  str := MakeXs(10);
  Writeln(str);  // BUG: displays 15 Xs
end;

One project I’ve worked on is a database report-writing tool, written in Visual FoxPro. We had one bug where, every once in a while, a column on a report would display the path to our executable on every line, rather than the database field data itself.

The program allows the user to apply a “Picture” format mask to their fields. For example they might enter “$999,999.99” for a sales amount field. Somewhere down in our code we were assigning this to a property called Picture on a field object. Now in VFP, “Picture” is also the name of the property of the image class for storing the file location of an image. When you assign a filename, VFP will add the path to the filename to that property. Under some set of circumstances, VFP was mistaking our Picture property for the one from the image class, and adding a path to the start of it.

The strangest thing was when we finally had this isolated in the debugger, the line oField.Picture = “” would result in oField.Picture having the value “c:\Program Files…”

He probably got it from here.

I put that link on my course web pages for my students, too. :slight_smile:

Earlier in my career I wrote and maintained a lot of code. Finding and fixing bugs was actually kind of fun (since they weren’t really my bugs). One time I encountered a bug that was generating a wrong numerical result; the values were in the millions and had to be precise to the penny. Usually, those were among the easiest to correct. I checked the code and logic, traced computations at every step, but couldn’t find the error. It was quite frustrating. Finally, I traced the error to a hardware problem (don’t ask for details; it was too long ago). I ended up rewriting the code to get around the hardware limitation. Yep, the old “work around” fix.

My least favorite (and somewhat weird) bug bit me in the rear while I was working on my senior project some 3 years ago. We were building (or at least trying to build) a Bluetooth wireless VoIP phone.

We put together a 68000-based system with all the fixings (built around the DragonBall EZ processor), but we left out one very important detail. We only had one serial port to connect to the board.

There were lots of interesting challenges involved in getting our board to work. We ran across problems like the memory repeating every 16 bytes (had to do some resoldering), a flashing tool that wasn’t compatible with our flash (I had to write a home-made flash-writing tool), and just getting uClinux running (that was a major pain); the list goes on.

Well, finally, we had the operating system working on the board, and it was ready for me to add the bluetooth. At this point, the operating system seemed to be fairly stable, and the driver I had for the bluetooth module was working correctly on my laptop. However, any time I tried to get the bluetooth module on the device to connect, the device would go bye-bye.

Back to the original important detail: we only had one serial port on the board. Forget the fact that we were miles and miles away from even remotely getting a debugger up and running on the board. No, that would not have done any good, because the bluetooth module connected to the board through (yup, you guessed it), the serial port. The only serial port.

Up to this point, all the debugging had been done through the kernel’s printf (modified to write to the serial port’s Tx register). So the problem: talking to the bluetooth module causes the board to go bye-bye, but writing to the serial port (and peeking) caused the driver to fail, such that the board didn’t go bye-bye.

I finally hit on creating a circular debug buffer in RAM on the board. I would initialize the bluetooth module, wait for the board to hang, then reset the board and use a utility to read the debug buffer I had written to. I traced the problem down to a kernel configuration problem (basically, there were some problems with the #defines used while compiling the kernel for our processor, and so memcpy was compiled such that it didn’t worry about memory access alignment problems).
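The circular-buffer trick can be sketched like this. It is written in Python for readability, whereas the real thing would have been C inside the kernel. The header layout, magic number, and class name are all assumptions made up for illustration; the bytearray stands in for a RAM region that keeps its contents across a reset.

```python
import struct

# Hypothetical header: magic marker, next write position, wrapped flag.
HEADER = struct.Struct("<III")
MAGIC = 0x52414D4C  # arbitrary marker meaning "this region holds a valid log"


class RamLog:
    """Circular text log stored in a fixed region that survives a reset."""

    def __init__(self, region: bytearray):
        self.region = region
        self.capacity = len(region) - HEADER.size
        magic, _, _ = HEADER.unpack_from(region, 0)
        if magic != MAGIC:
            # Cold boot: the region holds garbage, so start an empty log.
            HEADER.pack_into(region, 0, MAGIC, 0, 0)

    def write(self, text: str) -> None:
        _, pos, wrapped = HEADER.unpack_from(self.region, 0)
        for b in text.encode("ascii", "replace"):
            self.region[HEADER.size + pos] = b
            pos += 1
            if pos == self.capacity:   # reached the end: overwrite the oldest bytes
                pos, wrapped = 0, 1
        HEADER.pack_into(self.region, 0, MAGIC, pos, wrapped)

    def dump(self) -> str:
        """Return the surviving text, oldest byte first."""
        _, pos, wrapped = HEADER.unpack_from(self.region, 0)
        data = self.region[HEADER.size:]
        ordered = data[pos:] + data[:pos] if wrapped else data[:pos]
        return bytes(ordered).decode("ascii")


if __name__ == "__main__":
    ram = bytearray(HEADER.size + 32)      # stand-in for the reserved RAM region
    RamLog(ram).write("init bt; open port; memcpy hdr; ")
    RamLog(ram).write("HANG?")             # last message before the "crash"
    print(RamLog(ram).dump())              # read back after the "reset"
```

Because the write position lives in the region itself, a fresh reader after the reset can tell where the newest message ends without any help from the code that crashed.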

Oh, and couple this with the fact that the OS image for the board was over a megabyte, and it took probably 10 minutes to download (after a while, I got the actual kernel binary and the filesystem split into two, so I could update the kernel in around 5 or 6 minutes). Every change: recompile kernel, build image to download, download image to board RAM, copy image from RAM to flash. Every single change.

We never did get audio working, so we ended up with a bluetooth wireless web server, which was still pretty nifty.