Programmers: bugs which have driven you out of your mind

IANA(pro)P, but just came up with this thread idea today, so posted it even tho the technical details may be beyond the ken of some of the folks here. Fire away in any event.

Back in the MS-DOS days I was trying to debug an automated process. This process, which moved various files around, would write out its status to a file (call it STATUS.TXT). To view the results, at the DOS prompt I would type:
TYPE STATUS.TXT
which should have displayed the contents of the file on the screen. Instead, it produced:
File not found.

I did a DIR and verified the file was there, and it was. I repeated the command, and again got:
File not found.

Fast forward about 20 minutes of me going increasingly crazy until I finally figured out what is probably obvious to most of you: the contents of the file STATUS.TXT were the text “File not found.”

I wrote some multi-platform (limited to Windows, OS X, and Linux) code last year to handle remote communication via ssh. Since I was handling arbitrary commands, I had no way of knowing how long the return output – which I wanted to capture – would be, nor whether commands were guaranteed to be executed in order (they weren’t). The easiest way of handling that was to perform an echo command after every execution, allowing me to parse out the echoed tag to separate output.

I got it working in interactive mode, then moved to testing automated commands. It seemed great at first – Linux to Linux machines worked, Windows to Linux machines worked, but then…Linux to Windows machines…didn’t. Sometimes, I’d see superfluous blank lines. Other times, my parsing code would lose its place. Not that I identified the issue that clearly at the time – that only happened after days of false leads, delving into the SSH protocol, and designing/implementing exhaustive tests to isolate the behavior.

Fucking Microsoft and their damned CRLF.* Gah, I hate developing for Windows. :mad:

*For any non-programmers reading: Microsoft uses two characters to signify an end-of-line (EOL): carriage return (CR) and line feed (LF). Besides sending an extra byte for every EOL (throwing off my parsing and confusing byte counts), of course there’s also a (literal) visibility problem – neither CR nor LF characters display on the screen. Unix systems use just a LF.
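For anyone curious, here is a minimal sketch of the echo-a-tag approach with line endings normalized before parsing. It is in Python, not the language of the original code, and the tag string and function names are invented for illustration; the real implementation surely differed.

```python
TAG = "__END_OF_COMMAND__"  # hypothetical marker echoed after each command


def normalize_newlines(raw):
    """Decode bytes and turn CRLF (Windows) and bare CR into a plain LF."""
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_on_tag(raw, tag=TAG):
    """Return the output of each command, using the echoed tag as a separator."""
    chunks = []
    current = []
    for line in normalize_newlines(raw).split("\n"):
        if line == tag:
            chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return chunks


unix_stream = b"hello\n__END_OF_COMMAND__\nworld\n__END_OF_COMMAND__\n"
windows_stream = b"hello\r\n__END_OF_COMMAND__\r\nworld\r\n__END_OF_COMMAND__\r\n"

print(split_on_tag(unix_stream))     # ['hello', 'world']
print(split_on_tag(windows_stream))  # ['hello', 'world']
```

Without the normalization step, every line from the Windows side would end in a stray CR, so a comparison like `line == tag` would never match.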

I was maintaining an old program written in Delphi 3 on top of dBase files - every once in a while, the thing would halt, reporting that the disk was full. (it wasn’t anywhere near full)

Turned out to be something in the Borland Database Engine. Internally, the variable type must have been too small to hold the number of free bytes on a modern hard disk, so some sort of binary truncation was happening: any time the free space on the drive was particularly close to a multiple of 4 GB, the BDE interpreted this as near zero and halted.

I could find no patch for this, so I had to lash up a horribly dirty process that would not allow the free space to stray close to a multiple of 4 GB (by copying or deleting large text files).
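A rough illustration of the suspected truncation, assuming (as guessed above) that the free-space figure was squeezed into an unsigned 32-bit value. This is a Python model of the arithmetic only, not the BDE's actual code, and the function name is made up.

```python
GIB = 1024 ** 3


def truncate_to_uint32(n):
    """Keep only the low 32 bits, as a too-small unsigned variable would."""
    return n & 0xFFFFFFFF


for actual_free in (10 * GIB, 8 * GIB + 5000, 12 * GIB + 1):
    print(actual_free, "->", truncate_to_uint32(actual_free))

# 10737418240 -> 2147483648
# 8589939592 -> 5000
# 12884901889 -> 1
```

Under this model a disk with a little over 8 GB free looks like it has 5000 bytes left, which would be enough to trip a "disk full" check.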

A friend of mine was working on an old Burroughs system as the night shift computer operator. He got an error message that basically said “ERROR xxxx - This error should never occur. Please call tech support immediately.”

Gotta love it.

nevermind

Two hours debugging a getter. When I was debugging it it would just fire seemingly at random, at a time when I was not even getting the getter. When the data it relies on is not even loaded.

Turns out that when you put a call to a getter in the watch window and forget about it, the watch window remembers.

Ah, yes, errors that should never occur…

This is going back about 35 years. I was writing a pretty massive interactive FORTRAN system for a DEC 20 system. The DEC 20 had a number of useful extensions to the language including error exits in function/subroutine calls. I don’t remember the exact syntax, but it was something like “a = flagellate(b, c, *100)” where the function would exit normally to the next statement, but if it encountered an error (say, it detected that it was about to divide by 0) it would exit instead to line 100.

The problem was that if you’d written the function to handle error exits, then any statement that called it had to include an error exit. Now we all know that there are times when you call a function just knowing that it’s not going to error exit and I was in a hurry, so I loaded the system with error exits that simply called a generic “oops()” routine that I wrote that took no arguments and simply printed out the word “oops” and stopped the system. I mean, “oops()” was never going to get called, so why bother making it do anything else?

For years, the following scenario would regularly occur: A coworker would walk into my office and announce “I got an oops”. I’d ask “what caused the oops?” and he/she’d say “I dunno - I was working along and suddenly the program said “oops” and stopped running”.

I became known as the oops guy - and it was all my own fault.

Yes, I’ve been in that situation also.
I was part of a team of COBOL programmers. This was back in the time of batch processing using tapes.

One person was writing a program to massage some data. She had to do a kind of decision tree where, if this condition exists, do this, and if that condition exists, do that.

So, she went to our manager and asked what to do if the program encountered a condition that was not programmed for. The manager went to the analysts and the answer he got was, “Oh, that will never happen”.

So, he told my colleague, “If you find something unexpected, simply output ‘This can never happen’ in the catch all field”.

The output of her program was then input to my program, a program that was merging the data with the previous master file.

Upon testing my program, I get an abend. I go into debugging mode, start reading the core dump, and find, in a field that should be numeric, the string ‘This can never happen’.

I now go see our manager, tell him what I found, he starts laughing and tells me what he told the other programmer to do.

“Ok, what do I do now?” I asked.

“Simply ignore the record” he tells me.

So I did, but I placed a statement in the code to display a message every time it happens.

Years later, when I was no longer working on that application, I had a chance to see some output and the message was still being displayed numerous times per run.

Previous thread, with my prize story in the OP.

The first system I worked on didn’t have a debugger. I had to use display statements and if statements to find bugs. Later I was on a system with a full screen debugger. What a luxury! Being able to examine & deposit into variables, break points etc. That’s the life. :wink:

I actually introduced an error in my first job. I took my COBOL classes on a Vax. Our teacher was very anal about using optional clauses. He thought it added readability. I got in the habit of using the after advancing clause on all my write statements.

For my first project in my new job, I had to modify a program to write a data work file that would be read by another program.

So, I did something like: WRITE OUT-REC AFTER ADVANCING 1 LINE.

A week later my supervisor called me in. She’d spent a day tracking down an error. My file wasn’t reading right. Turns out, after advancing creates a print file. On a Vax it didn’t matter because it’s still a regular sequential file. On a Honeywell a print file has a printer control column. That extra column was throwing off my data alignment.

Good thing my Supervisor had a lot of patience. Otherwise my career would have ended the first week. I never used after advancing on a data file again.
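To show why one extra column does so much damage, here is a small sketch of fixed-width parsing. It is Python standing in for COBOL; the record layout (6-character ID, 8-digit amount) is invented, and modeling the printer control column as a single leading space is an assumption about how the Honeywell file looked.

```python
def parse_record(line):
    """Split a fixed-width record: 6-character ID, then an 8-digit amount."""
    return line[0:6], int(line[6:14])


data_record = "AB123400001250"     # what the reading program expects
print_record = " " + data_record   # same data behind an assumed control column

print(parse_record(data_record))   # ('AB1234', 1250)
print(parse_record(print_record))  # (' AB123', 40000125)
```

Nothing crashes in the second case. Every field is just read one column off, which is the kind of thing that takes a day to track down.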

Can a kind soul please explain this bug to me? I even ran it (with slight modifications) as a Java program and a C# program but can’t get it to display the supposed results.

Well, this is not my own… I’m sure I’ve done much worse; I’ll post when they come to mind.

I paid my extra mortgage principal, online, with US Bank.

They ask you for the last 4 digits of your SSN for confirmation. I mistyped the SSN. I was immediately greeted with a pop-up that said, “Your SSN is wrong” (or something along these lines). Well, this was before any data was submitted to the server – so I looked at View Source, and yup, saw this:
function validateSSN1()
{
    if (document.Form1.txtssn.value != "1234" && document.Form1.txtssn.value != "    ")
    {
        document.Form1.txtssn.value = ""
        document.Form1.txtssn.focus();
        alert("Your entry did not match our records. Please enter the last four digits of your social security number.");
        return false;
    }
    else
    {
        return true;
    }
}

(replace “1234” with the actual last 4 digits of SSN … and, yes, the “security system” does accept 4 blank spaces as a valid SSN)

I sent them a nastygram about client-side checking. So they changed their system to do the checking server-side, but if you look at the view source, you will still see the SSN being passed in as a hidden input field, along with one of blank spaces…

<input name="txthidSSNNum" type="hidden" id="txthidSSNNum" value="123-45-6789" />
<input name="txthidSSNNum2" type="hidden" id="txthidSSNNum" value="    " />

Yup, 4 spaces are still accepted as input. If you’ve got a US Bank mortgage – go ahead, make an extra principal payment online, view the HTML source, feel a part of yourself die.
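For contrast, a minimal sketch of what a server-side check could look like, where the expected digits never appear in the page at all. This is Python with a made-up in-memory record store and function name; it says nothing about how the bank's real backend works.

```python
import hmac

# Hypothetical server-side records; the browser never sees these values.
RECORDS = {"acct-001": "6789"}


def ssn_last4_matches(account_id, submitted):
    """Compare the submitted digits against the stored ones on the server."""
    expected = RECORDS.get(account_id)
    if expected is None:
        return False
    # Reject anything that is not exactly four ASCII digits, blanks included.
    if len(submitted) != 4 or not (submitted.isascii() and submitted.isdigit()):
        return False
    # Constant-time comparison, so timing does not leak how close a guess was.
    return hmac.compare_digest(expected, submitted)


print(ssn_last4_matches("acct-001", "6789"))  # True
print(ssn_last4_matches("acct-001", "    "))  # False
print(ssn_last4_matches("acct-001", "1234"))  # False
```

The page only needs to send what the user typed and get back yes or no; there is no reason for the full SSN, or a magic blank value, to be in a hidden field.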

Have a seat. This will take a while.

In the late 90’s, I got hired by a company that had a program that was very near the end of its life. It was a communication layer that could read data from a mainframe database server to a Windows or Mac client. Newer protocols had pretty much made it obsolete, but there were still some dedicated customers, mostly in Europe. The company was trying to develop something new, but was happy to have the maintenance contract revenue for the old software. So they released one more minor version, which was Y2K compliant and included some other bug fixes. I got hired very late in the process, so I had very little to do with developing it.

The program gets released, it runs, maintenance fees are still coming in, everybody’s happy.

And then a customer discovered a bug. It was a crash on the server side. It didn’t happen all the time, even when they did exactly the same series of steps. We told them to turn the logging option on, so we could see what was happening at the time of the bug. With the log turned on, the bug didn’t happen.

As a stopgap, we did a special build for them with the log file directed to /dev/null. They let me log in remotely to their server. I even had root access. I would be logged in to their database server and running a debugger, while talking to them on the phone so they could run the series of operations on the client that triggered the bug.

Couldn’t reproduce it.

So my company decided to send me on-site to the customer, in Munich, Germany. I’d never been overseas before, so I had two weeks to get a passport. Thankfully, I live in the same state where I was born, so I could get a certified copy of my birth certificate. (I’d never even seen a copy before.) You can get your passport processing expedited if you have proof that you’ll be traveling on short notice, or at least you could in 2000. It still took five trips to the federal building in Boston to get it right. After the third trip, I was on the way back to the office before I noticed that they had spelled my name wrong. It didn’t occur to me until later that I could have changed my identity with no questions asked.

So, I flew to Munich with one week to fix this bug. The first thing they did was show me the steps on the client that would cause it. It turned out they weren’t doing it exactly the same each time. Their application would produce a database query that would be handled by our app, and part of that query was a range of dates that came from a side-scrolling calendar in a dialog box. There was a bug in that dialog that if you selected one date, scrolled, and shift-clicked on another, it wouldn’t remember the correct location of the first click. Once I found that, the bug became easier to reproduce.

I tried the debugger again, but it was pretty clear that wasn’t going to help. I knew the variable that was causing the bug, but it had been corrupted by something else before the actual crash. When I put a watch point on that variable to see when it changed, the application slowed down so much it would have taken weeks to get to the crash. And since there were signs that the bug was dependent on some random factor, like timing, I couldn’t be sure that would even work.

I started putting output statements in the code to write the values of certain variables to a file. Of course, we didn’t want the client to have access to our source code, so I’d log in remotely to my desktop back in the U.S., copy the file to Germany, add the statements I needed, copy it back home, run a build, copy the executable back to Germany, install it, reproduce the bug, then look at the output file to see the history of the variables I was watching.

The name of my company, and the domain I had to log in to, had a ‘y’ in it. The ‘y’ and ‘z’ are reversed on German keyboards.

I was having a great time. Munich is a fascinating place, and I was there during Fasching, the local Mardi Gras. The only trouble was, the street party seemed to shut down at 5:00. I got to the Marienplatz and there was a band dressed like the Blues Brothers except they all had huge, plastic animals on their heads. I heard them do one song, then everybody packed up and went home, and workers started cleaning up all the broken beer glasses and champagne bottles with little snowplows.

Don’t order a pepperoni pizza in Germany.

I was making good progress, tracking variables through different functions. When I saw something interesting, I’d add a new variable to the list. The bug stopped happening.

At this point, I was dealing with a heisenbug (one which alters behavior when you try to isolate it) and a mandelbug (one whose underlying causes are sufficiently complex as to make it appear random). It had looked like timing issues, such as network lag, could affect it; now it looked like the code size was changing the way dynamic memory was being allocated, and that affected it, too. I had to take out some of my tracking statements for variables that I had eliminated from consideration.

Then, I found it.

This was a database communication package. The client would assemble an SQL query and send it over the network to the server. Someone had defined a maximum statement length of 2300 characters. There was a buffer on the server side that would read from the network, and if it didn’t have a full statement, it would read again. That buffer was twice the maximum statement length. But the test was messed up. It would read text, then read some more, but it didn’t catch that it had finished the first statement and could remove it from the buffer for processing, and the second statement wasn’t complete either, so it would do a third read. That would only happen if the timing was just right for the buffer to get those partial statements, and if the statements were long enough, that third read would overwrite the end of the buffer and corrupt whatever data was there.

I fixed the bug by changing a ‘2’ to a ‘3’, and I had one free day left in Munich.
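The arithmetic behind that one-character fix can be sketched like this. It is a Python model of the buffer sizes only: the read sizes are invented, the function name is hypothetical, and it assumes no single network read returns more than the maximum statement length.

```python
MAX_STATEMENT = 2300  # maximum statement length from the story


def bytes_past_end(read_sizes, factor):
    """How many bytes land beyond a buffer of factor * MAX_STATEMENT if every
    read is appended without first removing a completed statement."""
    capacity = factor * MAX_STATEMENT
    return max(0, sum(read_sizes) - capacity)


three_partial_reads = [2000, 2000, 2000]  # made-up sizes for illustration

print(bytes_past_end(three_partial_reads, factor=2))  # 1400
print(bytes_past_end(three_partial_reads, factor=3))  # 0
```

With the buffer at twice the maximum, a third read of long partial statements runs 1400 bytes past the end and tramples whatever lives there. At three times the maximum, three reads of up to 2300 bytes each still fit.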

Funny, I thought about starting a thread like this just the other day.

Here’s one that drove me nuts a while back. On one of the projects I work on, we dynamically generate reports for the customer in Excel spreadsheet format. Each report has five or six tabs, the first tab being a summary tab and the other tabs being more detailed. One of our testers noticed that every report was opening up on the second tab rather than the first, so I got assigned to fix it.

Anybody who has worked with Excel knows that when you open an Excel file, it opens on the tab & cell that was selected when the file was last saved. So the first thing I looked for was to see if there was a way to tell Excel to open a file on the first tab, no matter where it was when saved. No luck there. (Maybe there is a way, but I never found it.)

As I said, the reports are generated dynamically. So the next thing I did was to look at the code that creates the spreadsheet file and populates the data. Nothing out of the ordinary that I could see, but I did try things like adding code to select the first tab, or cell A1 on the first tab, before saving the file. That didn’t work either. Every report I generated still opened on the second tab, cell B10.

I spent nearly an entire day looking at the documentation for the toolkit we use for generating the Excel files, trying to select different tabs and cells, trying to save the file multiple times or at different points in the code. No matter what I tried, every report file still opened on the second tab, cell B10.

Then, for some reason, I scrolled up a couple of pages from the section of code where I had been focusing for most of the day. And there I saw that the very first thing the code did was to load a template file. Huh, I didn’t know there was a template file - I thought the entire file was being generated dynamically. (As it turned out, the template file had the tabs and a couple of header rows already set up, but was empty otherwise.) I tracked down the template file, opened it up – and it opened on the second tab, cell B10. :smack:

So I selected the first tab, put the cursor in cell A1, saved and closed the template file, then reran the report, and everything worked just fine after that.

In the days of punchcards I wrote a FORTRAN program that wouldn’t compile. The error made no sense, it being centered on a DO loop. I spent two days modifying this program to no avail. Only after I was given privileged access to a monitor did I see the problem. Turns out the font of the printer was such that the zero and O characters were indistinguishable. On screen, the display font put a slash through the zero, so I saw that I had erroneously typed D-zero rather than D-oh.

D’oh!

About a decade ago I was working on a dynamic Word document. Essentially it was a blank document that populated from some pretty extensive VBA code. The code made a call to a SQL database and pulled up revenue amounts. Word then added these amounts to give a total revenue for the fiscal year.

The problem was that the total never matched the sum of the amounts. It was always off. We added it by hand, and it was just not coming out right. Very weird, because we had a couple of entries for $2,000, but it was counting like there were 3 of them. Why would it count $2,000 three times?

No matter what, it was constantly over by that amount.

I would have figured out the answer a lot faster had we run this a couple of weeks earlier, before the holidays. Because then the overage would have been $1,999.

I was just about to post a story very similar to this one! I had the same problem wherein I had a zero instead of an oh in one of my variable names.

When I was reading Robot Arm’s post about halfway through I thought “buffer overflow” or “uninitialized variable” problem. I’ve seen way more of those than I can remember and they always drive me crazy. At least back in the old days they were extremely hard to track down.

My classmate started up Dreamweaver and a dialog box saying ‘No error has occurred’ appeared. He clicked ok and went on with his life.

From the ‘high school students discovering programming for the first time’ files, we were supposed to be doing one of those ‘loop from 1 to 100 and show the output’ sort of things. My friend managed to get an infinite loop printing out ‘infinite’. That took some skillz.