Computer people - share your stranger bug fixes

My husband has had some weird ones involving time zones and daylight saving time. One was a Java bug; another involved some weird time zone in Brazil whose offset wasn't a whole number of hours.
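I don't know which zone his was, but offsets that aren't a whole number of hours really exist (India is UTC+5:30, Nepal UTC+5:45), and the classic trap is code that assumes otherwise. A hypothetical C sketch of the bug class:

/* Hypothetical sketch: code that assumes a UTC offset is always a whole
 * number of hours silently throws away the minutes part. */
#include <stdio.h>

int main(void)
{
    long offset_sec = 5 * 3600 + 30 * 60;        /* e.g. UTC+5:30 */

    long naive_hours = offset_sec / 3600;        /* truncates to 5 */
    long lost_min    = (offset_sec % 3600) / 60; /* the dropped :30 */

    printf("treated UTC+5:30 as UTC+%ld:00, losing %ld minutes\n",
           naive_hours, lost_min);
    return 0;
}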

Years ago I had dial-up from Starpower, which was eventually bought by RCN. RCN proved incompetent – after I had paid for 3 years of service, with a billing error each year when the annual service charge came due, I called them and was informed I must be mistaken because they did not offer service in my state. This was the same state their headquarters was in.

So I dumped them and got Earthlink for a month.

After that, I signed up with Comcast, ditching dial-up at last. I personally removed Earthlink from my computer. Comcast techs came out and set up Comcast service.

But they could not get it to work. Three visits and I still could not access “Comcast high-speed internet.” Not even for an instant. The techs called their support lines endlessly, and finally told me my computer was “different” and could not be fixed.

A friend of mine who troubleshoots these things came over to help. He looked at the system for a few hours and finally found the problem.

Earthlink offered a security package that cost extra. If you followed a simple shortcut Earthlink put on your desktop, you’d be sent to a page advertising the OPTION to download and install Earthlink’s firewall, antivirus, and other malware-protection doohickeys.

This desktop shortcut was preventing Comcast from working – just as effectively as if it were an installed firewall.

Never mind that no one had ever followed the desktop shortcut, let alone paid for, installed, or run the security package software. Despite the fact that nothing had been executed from this shortcut, and that Earthlink had been uninstalled and the directories deleted, the mere presence of the shortcut on the desktop somehow locked out Comcast.

Dragging the shortcut icon to the trash had me up and running in seconds.

This made (and makes) no sense at all to me.

I did call Comcast and try to explain it to them, in case they ran into the problem again.

So remember, kids, when you’re having mysterious problems, it might be that software you never installed or used, being run from the shortcut you never clicked, that takes you to an ad page you never looked at.

Back in the mid-1980s, I was the technical representative for company “B” when company “S” was porting their operating system to company “B”'s hardware. We were coming down to the final week when we started running into data corruption problems during the installation. The installation was an OS image file that was split across 15 disks or so, and somewhere in the mess, a few random bytes here and there would be flipped around.

We went into massive debugging mode, pulling in developers from all over company “S” so that multiple pairs of eyes and brains could be used to find the issue.

Of course, putting in printf() statements every so often to dump out status messages made the problem disappear. This typically means that a buffer someplace is being overwritten. We had at least 2 teams of programmers who went over the installation program trying to find the buffer problem, but everything was squeaky-clean.

Finally, the guy who wrote the installation program said, “It’s a real pain in the backside to debug this threaded process. Do you mind if I comment out the fork() call so it only does serialized reads and writes?”

We did that, and the problem went away.

It turned out that we were reading and writing data so fast that the IO chip was overheating, causing the data corruption.

We ended up putting the threading back in and putting in a printf(".") every now and then just to cool off the chip.

One oddball thing that I just remembered was a user who clobbered an Oracle system when they left the company.

One of our many old standard ID naming formats is a combination of the first five letters of a user’s last name followed by their first and middle initials. That’s fine and dandy until a person named something like Charles T. Conner comes by, and ID CONNECT is created for them by the LAN/Email people. This person gets set up in an Oracle-based application and they’re doing whatever it is they do with no problems.
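Just to illustrate the rule (my own sketch, not the actual provisioning code):

/* Hypothetical illustration of the naming rule above: first five letters
 * of the last name plus first and middle initials. "Conner, Charles T."
 * comes out as CONNECT - the same name as Oracle's built-in CONNECT. */
#include <ctype.h>
#include <stdio.h>

static void make_id(const char *last, char first, char middle, char *out)
{
    int n = 0;
    while (last[n] && n < 5) {
        out[n] = (char)toupper((unsigned char)last[n]);
        n++;
    }
    out[n++] = (char)toupper((unsigned char)first);
    out[n++] = (char)toupper((unsigned char)middle);
    out[n] = '\0';
}

int main(void)
{
    char id[8];
    make_id("Conner", 'C', 'T', id);
    printf("%s\n", id);    /* prints CONNECT */
    return 0;
}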

Some months later, this person left the company and a request to drop their IDs was sent in. Dropped from email and LAN with no problems. Dropped them from the Oracle system and all hell broke loose five minutes later. Nobody could log in. We still don’t know how, but dropping that user’s ID somehow dropped the CONNECT role and paralyzed a production system. Fortunately, a couple of DBAs were logged in at the time and were able to put Humpty Dumpty back together again.

Oops…

I installed software with a spray paint can.

I had just purchased my first PC (an AT&T 6300), and bought a software package called Ability that came on a 5 1/4" floppy (maybe more than one).

Popped the floppy into the drive, and got a “no disk” type error. Tried some other floppies; they worked fine. Called the Ability hotline. After determining my PC type, he had me color the corner of the green floppy black – apparently, the drive sensor that detected whether a disk was present was optical and blind to green.

My first job out of college was with a well-known electronics retail chain with the initials R.S. Yeah, that one.

One story I heard, but didn’t see, came from R&D. I’ve seen this on the web, so I’ll go with that version. When RS was developing their version of Unix, there was a particular error condition that was almost impossible to reach. The developer put in a special error message for testers who managed to create that condition… “Shut her down Scotty, she’s sucking mud again”

I was tangentially involved in this one - my group worked on the store operating system for the retail outlets. Our software was designed to be very strict in its requirements for store managers’ actions. They were required to make a backup every week and validate the previous week’s backup. If they didn’t do this, they couldn’t record the day’s sales, and if they didn’t do that, they got a call from the regional manager. We had one manager in another state whose previous backup was corrupt every single time he validated it. Our group did everything it could to diagnose the problem remotely, and finally flew one of my co-workers out to personally watch this manager make the backup. Note that this was in the days of 5 1/4" floppies, when floppies really were floppy.

So with my co-worker watching, the manager made the backup, pulled the floppy out of the drive, placed it against a wall, and proceeded to place a magnet against the floppy to hold it against the wall. :smack:

And this guy was a manager of a chain that sold computer and audio supplies?

This one I caused - we were testing a new version of our software and how it played with the other systems that we sent data to, and I was testing the special order function. Although we were supposed to be testing normal functionality, I made an order for the max number of the most expensive thing the managers could special order - a microcomputer called Priam that retailed for $10,000.

That order broke every single system that it hit. For a couple of weeks, I had someone coming to my desk proclaiming, “You broke my program!” I got in trouble for that, but it was worth it.

A weird problem that was never fixed - I was working on a rewrite of a quartet of nearly identical programs, trying to consolidate them into one program. I nearly had it, IIRC, but there was one computer that it simply would not run on. We ran out of time to determine just what was causing the problem, so they scrapped my code and did a quick rewrite.

Okay, the backup story reminded me :slight_smile:

The very first IT project I worked on, we were rolling out servers to a brokerage firm’s offices to support some new software. Networking and servers were all new to them; they had worked on standalone PCs, using floppies to transfer data, up to this point, so in most offices we had to tuck the servers into closets hastily fitted with locks. Security was… light.

We had major issues with one site. The rollout went perfectly, but every night at a random time between 9pm and midnight the server would shut down and restart almost 30 minutes later. By the time our techs arrived it would be sitting at the login screen. After a week of this the business team was livid that we hadn’t figured it out, so we sent out a small team of specialists to investigate. I was sitting there with the LAN expert, the server expert, and the software rep (my job: keep them from killing each other), and we’re chattering away waiting for the server to fail so we can investigate. I get the call from the monitoring group: the server has gone down. I yell over the sound of the cleaning lady’s vacuum to the team that we need to go to the server closet and investigate.

I’m sure the answer to this is clearly obvious now, but it actually took us a few minutes to figure out that the power cord hadn’t quite reached the UPS, so an extension cord had been plugged in that ran outside the server closet. The cleaning lady was unplugging the server every night to plug in her vacuum.

I fixed a Y2K+1 issue between dBase and MS Access.

The xBase database format expects the second byte of the file to contain the number of years from the last turn of the century to when the file was last modified (so in 2005, it would contain 5).

In some of their dBase drivers, Microsoft appear to have interpreted this as the number of years since 1900 (so in 2005, it would contain 105).

So the place I worked had an Access database that exported files in dBase format to feed into a legacy system with a dBase III back end. In January of 2001, the files just stopped working in the legacy system. I discovered the cause by exporting two copies of the same file (the second one after setting the system date back) and comparing them. Then I wrote a little function to hack the second byte of the file to the appropriate value - and we had to run this against the exported files. It worked perfectly.
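The fix itself was tiny. A minimal C sketch of that kind of one-byte patch (file name made up; this is not the original code):

/* Byte 1 of a .dbf file holds the year of last update. dBase III expects
 * years since the turn of the century (5 for 2005); the exporting driver
 * wrote the full years-since-1900 (105 for 2005). Patch the byte down. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("export.dbf", "r+b");
    if (!f)
        return 1;

    fseek(f, 1, SEEK_SET);          /* second byte of the header */
    int yy = fgetc(f);
    if (yy >= 100) {                /* e.g. 105 -> 5 */
        fseek(f, 1, SEEK_SET);
        fputc(yy % 100, f);
    }
    fclose(f);
    return 0;
}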

Some years ago I had finished compiling an application and was left with “myapp.exe”
It worked just fine, and all was good with the world.

A few days later I was tweaking some stuff, so I decided to name the file “myapp(2).exe” or some such parenthetical name. It crashed and burned.
I spent some days scratching my head on this one. The problem was that it just wasn’t a serious bug—if it doesn’t work with parens in the name, then don’t rename it. But I wasn’t satisfied with that answer.

And I couldn’t debug the problem because when running in debug mode, I wasn’t running the executable.

Finally I hit on the odd idea of renaming the executable for MS Visual Studio and seeing if that would cause the problem. Sure enough, the app failed. But this time I could debug it and find out exactly where it failed.

The problem happened when trying to make an Oracle database connection. And the strange error message I got from Oracle hinted at the cause.

This is what happened:
Oracle has a convoluted way of naming database connections that describes them with loads of nested parentheses, as if it were a LISP program. The TNSNames file format is an example of this.
When a database connection is being made, apparently that string is passed over the wire from the client to the database server.

It turns out that Oracle embeds the name of the currently running executable inside this packet of information that goes to the database. So, by putting parentheses inside my application name, I broke the nested-parentheses syntax of their little data packet.

Something like this was going over the wire:
((dbname=bozo)((user=smithj)(client=myapp(2).exe)))

And it was choking the parser at the database server end.
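Just to illustrate the failure (this is a toy scanner I made up, nothing like Oracle's real parser): in a grammar where a value simply runs until the next parenthesis, an unescaped “(2)” in the value closes things in the wrong places and leaves junk behind.

/* Toy parser for a nested (key=value) grammar. A value stops at the next
 * '(' or ')', so "myapp(2).exe" is read as value "myapp", a nested
 * element "(2)", and then the leftover ".exe" is a syntax error. */
#include <stdio.h>
#include <string.h>

static const char *element(const char *p)
{
    if (*p != '(')
        return NULL;
    p++;
    p += strcspn(p, "=()");             /* key */
    if (*p == '=') {
        p++;
        p += strcspn(p, "()");          /* value: runs to '(' or ')' */
    }
    while (*p == '(') {                 /* nested elements */
        p = element(p);
        if (!p)
            return NULL;
    }
    return (*p == ')') ? p + 1 : NULL;  /* NULL means syntax error */
}

int main(void)
{
    const char *ok  = "((dbname=bozo)((user=smithj)(client=myapp.exe)))";
    const char *bad = "((dbname=bozo)((user=smithj)(client=myapp(2).exe)))";

    printf("ok:  %s\n", element(ok)  ? "parses" : "syntax error");
    printf("bad: %s\n", element(bad) ? "parses" : "syntax error");
    return 0;
}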

This was a cool little Oracle bug that affected every Windows application, including Oracle’s own client applications. I tried an experiment and renamed the SQL*Plus executable to a few other things to see if it would still work. Choices like “bozo.exe” worked fine, while “bozo(1).exe” crashed.

They have fixed this.

Once I added a fairly major feature to a network card driver. Everything was working great, but after a couple of weeks we discovered that extremely rarely, it appeared that network traffic was being corrupted. At first it appeared that the corruption was limited to NFS traffic. I started investigating and I discovered that NFS traffic over UDP is prone to silent data corruption over high-rate links. I switched us to using TCP and everything seemed fine for about a month, when the corruption re-occurred. I started digging again and discovered a minor bug in my driver where I didn’t clear out a pointer that I should have, which could potentially lead to double-frees and other fun stuff. Again, the bug appeared to go away, but then came back again after several weeks. Overall, I think that I “fixed” the corruption bug 3-5 times over the course of nine months, each fix more speculative than the last.

Then, less than a week away from our planned release date – a date that had very little leeway left in it, as we needed to ship the software before fiscal year-end – the corruption returned with a vengeance. Our operating system started crashing because of random memory corruption. It was horrible; the crashes were frequent enough that we couldn’t even think about releasing, but happened rarely enough that actually reproducing the problem on demand was impossible. I looked into one of the crashes we’d seen, and discovered that the contents of a broadcast packet had been written over top of a completely unrelated structure.

For some time my suspicions had been that the root of the problem was in how my driver was handling network buffers. I thought that I was doing something illegal like double-freeing a buffer in some corner case. So I went into the allocator and wrote some code that would track every network buffer in the system. I don’t exactly remember what my debugging code did, but I do know that after a network buffer was allocated, my debug code allocated a separate piece of memory to help track the buffer. Imagine my surprise when all of a sudden, the memory corruption started crashing my test system almost every time it tried to boot. I quickly discovered that my buffer tracking memory was getting hit by the corruption almost every time. Once again, a broadcast packet had been splatted on top of what should have been there.

“A clue!” I said. “Clearly the corruption always affects the memory that gets allocated just after a network buffer.” With the problem all of a sudden being so easy to reproduce, I started looking into it. Unfortunately, I ended up spending over a day on a red herring caused by a minor bug in one of the kernel’s debugging facilities. Once I finally got past that I started working on narrowing down where the bug was coming from.

It was very interesting to me that the corruption always ended up involving a broadcast packet overwriting memory. When I added that feature to the network card driver, I had to make broadcast packets be handled specially. Basically, in order to receive packets, the driver allocated network buffers and put them in a queue controlled by the NIC hardware. The NIC would write a packet into a network buffer and raise an interrupt, and the driver would replace that buffer with a fresh one and send the packet to the upper layers of the network stack. For reasons that aren’t particularly relevant, I had to split that up into two queues. Broadcast packets would be received into one queue, and unicast packets would be received into a second queue. As an experiment, I tried disabling the special broadcast queue entirely. That broke my feature, but it also eliminated the crash. Now I was definitely getting somewhere: this confirmed that the bug lay somewhere along my special broadcast-handling path in the driver.

I methodically worked backwards from the point in the driver where broadcast packets were handled in the stack. I kept removing more and more code that did anything with broadcast packets, but still the system kept crashing with a broadcast packet overwriting something. Finally, I got it to the point where the driver never did anything with broadcast packets at all. It still enabled the special queue for broadcast packets and put buffers in it for the hardware to fill, but I never took packets out of the queue – and still I saw a broadcast packet splatted over memory somewhere. If I never turned on that queue, I wouldn’t crash.

Now that told me an awful lot. Nothing in software could ever write a broadcast packet to anywhere in memory, and yet I was still seeing that damned broadcast packet written to where it shouldn’t have been. The only possible explanation was that the NIC itself was writing the packet there. And I was quite certain that the hardware would only write a packet to a place that I told it to, so how could this be happening?

In order to continue to explain, I have to get into the details of how you told the hardware to write a packet to a buffer. The queue is a fixed-size ring buffer. Each entry in the queue is a pointer to a network buffer. When a packet is received, the hardware first writes a packet into the first buffer in the queue, and then it writes some information about the packet where the pointer used to be. The most important piece of information is the length of the packet, but there are some other things like “did the Ethernet CRC match and were the checksums correct?”. It also sets a bit that indicates to the driver that there is now a valid packet waiting there.
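If it helps to picture it, here is a rough C sketch of that kind of ring entry – field names invented, and real NICs differ in the details:

/* Sketch of a receive ring entry. The driver fills in a buffer address;
 * when the hardware delivers a packet it overwrites the same entry with
 * write-back info (length, status, DONE bit). After write-back, the
 * entry no longer contains a valid pointer. */
#include <stdint.h>

union rx_desc {
    uint64_t buf_addr;      /* written by the driver                 */
    struct {
        uint32_t length;    /* written back by the hardware          */
        uint32_t status;    /* CRC/checksum results, DONE bit, etc.  */
    } wb;
};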

So I wrote some code that went in and periodically checked that the entries in the queue always pointed to the correct network buffer. Sure enough, this found that at some point during boot, the interface was being reinitialized and now there were some entries in the queue that were pointing to random pieces of memory (actually, not so random, as I was to discover very soon).

It took a good 15 minutes of thinking, but finally the light bulb went off. When the interface was reinitialized, the hardware reset its internal state for the receive queues and started back at the beginning of them. The bug happened when the interface happened to be reinitialized while there were some packets in the queue that were still waiting to be received by the driver. The queue entries for those packets no longer contained pointers to network buffers. Instead they contained the information that the hardware had written back, like the packet length. When the hardware got back to that entry, it treated the length and whatnot like a pointer and wrote the new packet to some random piece of memory. The fix was simple: before turning the queue on, the driver was supposed to go through and make sure that every entry contained a valid pointer, but I forgot to do that for the broadcast queue. It was a freaking one-line fix for a bug that plagued me for 9 months.
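In code, the shape of that fix would be something like this – again my own sketch with made-up names, not the actual driver:

/* On reinit, rewrite every entry of the ring with a valid buffer address,
 * so stale write-back data from an unreceived packet can never be
 * misread as a pointer by the hardware. */
#include <stdint.h>

union rx_desc {
    uint64_t buf_addr;
    struct { uint32_t length, status; } wb;
};

static void rx_ring_reinit(volatile union rx_desc *ring,
                           const uint64_t *buf_addrs, int n)
{
    for (int i = 0; i < n; i++)
        ring[i].buf_addr = buf_addrs[i];
}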

Now, if you’ve been paying very close attention, you’re probably thinking “but I thought that the corruption always hit memory that was allocated right after a network buffer.” And this is why I am the luckiest programmer in the whole world. The corruption, of course, could hit any piece of memory at all. However, it just so happened that on the network I was testing with, pretty much the only broadcast packets that you would see were ARP packets. The key thing here is that ARP broadcasts always have exactly the same length. This ended up meaning that the NIC pretty much always wrote back the same integer value to the broadcast queue. So, as it happened, the memory corruption was not affecting a random piece of memory. Instead it always hit the exact same piece of memory. It was pure fluke that my buffer tracking memory always got hit by the corruption: it just so happened that my buffer tracking memory got allocated early enough during the boot process that it always got the same piece of memory, and that piece of memory was the one piece out of 8GB that always got hit by the corruption. :eek:

A couple of weeks ago at work, several people come up to me and tell me that a particular program I’m responsible for is broken. It didn’t entirely surprise me – I’d had a couple of reports that the program didn’t work when compiled for 64-bit, even after I thought that I had fixed the issue. What did surprise me was that everybody was emphatic that it was the build that was done that week that had introduced the breakage. As far as I knew that was impossible – literally nothing had been changed in several weeks.

From the symptoms that people were reporting, I was pretty sure that this was another instance of a bug I had fixed a while ago. The program in question interacts with a particular bit of hardware via some memory-mapped registers. There’s one very important rule for accessing those registers: your accesses must be 4 bytes wide or smaller. If you break this rule, you read back all-ones instead of a valid value. The previous bug that I had quashed happened because I got a bit too clever in how I pulled data out of those registers, and when compiled for 64-bit the program ended up trying a 64-bit (8-byte) read on a register. I made some changes to the program at that point to ensure that all accesses to the memory-mapped registers went through a single pair of read/write functions, and wrote those functions to try and guarantee that all accesses would be of a valid width.
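Those accessors would have looked something like this sketch (names and details invented). Worth noting for what follows: it is the volatile fixed-width access that actually forbids the compiler to merge or widen reads – a plain inline helper guarantees nothing once the optimizer inlines it.

/* Sketch of width-constrained MMIO read helpers (invented names). */
#include <stddef.h>
#include <stdint.h>

static inline uint8_t read_register8(const volatile void *base, size_t off)
{
    return *(const volatile uint8_t *)((const volatile char *)base + off);
}

static inline uint32_t read_register32(const volatile void *base, size_t off)
{
    return *(const volatile uint32_t *)((const volatile char *)base + off);
}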

Knowing all that I had a really good starting point, but I was still mystified as to how things had been broken in the previous week. I quickly confirmed that build X-1 worked while build X would always fail. But I also verified that they were built from exactly the same code. WTF? The first thing that I did was jump right into the disassembly of the two builds. I wanted to verify that I was actually issuing the read with the right width. I knew exactly which read was failing so it was easy to track it down in the disassembly in both cases. This is where I started to get very confused, though: the assembly that did the read was identical up to renaming of the registers and the offset at which a particular stack variable lived.

Ok, so the next step is to single-step through the failing code. I set a breakpoint right after the instruction that read from the memory-mapped register. gdb blew right past it and the program completed. Damn, I must have been looking at the wrong bit of disassembly after all. I set a breakpoint on a function that gets called right after the failing read so that I can figure out what I’m actually supposed to be looking at. I do hit this breakpoint, but when I look backwards from where I am I see the exact same read instruction that I set the first breakpoint on – but I never hit it! Ok, this time I’ll set the breakpoint just before the read, and single-step to it. I set the breakpoint, and pretty soon gdb hits a branch that I can’t make heads or tails of. It doesn’t correspond to any if statement that I can see in the code. I ignore it and step past it, and the program jumps away from what I believed to be the failing read instruction. Ok, that explains why my breakpoint wasn’t triggering, but why isn’t the program reading from the hardware? I keep stepping until gdb gets to an SSE instruction, and that’s when the penny finally dropped.

Later versions of gcc try to use SSE instructions for accessing pieces of memory 16 bytes wide or larger. However, gcc only ever emits an SSE instruction that accesses memory aligned to a 16-byte boundary – the aligned version of the instruction is way more efficient. Now, the C code that was failing looked like this:



uint8_t array[count];

for (i = 0; i < count; i++)
    array[i] = read_register(REGISTER_ARRAY_BASE + i);


gcc was clever enough to inline read_register, and after doing that it realized that I was copying from one contiguous piece of memory to another. So it unrolled the loop and used 16-byte SSE memory instructions to do the copy. But there’s a catch: gcc only issues the aligned instructions, and it can’t guarantee that the destination array is properly aligned. So that mysterious branch that I couldn’t figure out? gcc was testing the alignment of the destination array: if it was properly aligned, it jumped to the unrolled loop with SSE instructions. If it wasn’t aligned, it jumped to a fallback loop that only used 1-byte accesses. This fallback loop was where I kept setting that breakpoint that never got hit. So the reason that one build consistently worked while the other didn’t was that the stack layout changed subtly between the two builds for whatever reason. In the failing build, the destination array happened to be aligned to the 16-byte boundary needed to use the SSE instructions, and everything went to hell.

Let me tell you, I got some really funny reactions from some people at work when I told them that the new build was broken because the stack layout happened to change.

I was working on some special-purpose graphics hardware which ran on Intel i860 processors running in big-endian mode attached to a SunOS workstation. Certain colors weren’t looking quite right. An inspection of the code showed that a lookup table mapping the integers 0 to 255 to their floating-point values had some typos in it. However, for some reason (I forget exactly why) I wasn’t able to build the executable from the source code. So I opened up the executable in emacs, turned on hexl-mode, and went searching for the hexadecimal representation of the floating-point numbers. But I couldn’t find them.

After more investigation, it turned out that the i860 code image was in a very strange format. Normally, big-endian 32-bit values have the high byte at index 0 and the low byte at index 3. For some strange reason, the image had the high byte at index 1, the next-highest byte at index 0, the next at index 3, and the low byte at index 2. After finally puzzling it out, I corrected the erroneous binary values, saved the file, and was good to go.
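In C, decoding a 32-bit value from that image format would look like this (my reconstruction from the byte positions above – effectively big-endian 16-bit halves with each half byte-swapped):

/* Rebuild a 32-bit value from the strange image byte order. */
#include <stdint.h>

static uint32_t from_image_order(const uint8_t b[4])
{
    return ((uint32_t)b[1] << 24)   /* high byte at index 1          */
         | ((uint32_t)b[0] << 16)   /* next-highest byte at index 0  */
         | ((uint32_t)b[3] <<  8)   /* next byte at index 3          */
         |  (uint32_t)b[2];         /* low byte at index 2           */
}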

I remember, while doing my thesis on computer vision, programming a grid that solved a variational calculus problem. The grid was full of unit needles that were supposed to keep the same length throughout the process while changing their slant and tilt. Because of some mistake, one time I ran it the needles started to grow little by little out of the plane, and they grew so much that the computer crashed! The screen was full of needles all over; it looked like a smashed windshield. I had to laugh at such a nerdy situation!

I’ve had two weirdnesses in my career.

When I was in college, we had a DECsystem-10. It had dumb terminals, all identified as TTYnn. I discovered that the DEC Fortran compiler had a command that would mount an external device, and lo and behold - it would accept the TTYnn designation. I promptly wrote a program that would latch onto an open terminal; when a newbie freshman came in and logged in, it would log their userid and password, which we would promptly use to log in and play Super Star Trek until their account timed out.

I wrote it up as part of a paper that I did on MIS security. We gave a copy to the DEC guys and a couple of weeks later, there was a compiler upgrade and we could no longer attach TTY devices.

The second was when I was on Kwajalein in 87-88. We had the first generation PC-ATs with the 20mb hard drives that would fail after a year or so. I had one fail and the user couldn’t boot. I tried everything to get the machine to boot, and finally gave up. As a joke, I put one hand on the computer, raised the other hand and chanted, “Lord - HEEAALLLL this machine of its afflictions!”.

Damn thing booted up and I was able to back up their data. :smiley:

Reminds me of plenty of old-days PC and microcomputer stuff. I musta had dozens and dozens, if not hundreds, of times where something wouldn’t boot up right or a floppy wouldn’t read right.

Most people with a bit of sense would give up after a few tries. Not me. I’d do it for hours, if not days on end if I had to. I never had a scenario where perseverance didn’t eventually pay off :slight_smile:

I remember in the 80s I was working on a very slow BASIC-based computer. When I programmed, I finished the programs by typing “quit”. But if you repeated the same instruction at the command line, the computer shut down! I made that mistake twice, and shut down the company’s computer!

This thread is reminding me of an auto repair column I read (or maybe a Click & Clack show) about a car that didn’t like vanilla ice cream. The family would go out every Sunday to the grocery store to buy ice cream, and if they got vanilla, the car wouldn’t start properly. But if they got any other kind, they wouldn’t have any problem.

It turned out that the vanilla ice cream was in a freezer in the front of the store, whereas the rest of the ice cream was in the back. So that small difference in the amount of time it took to go to the back of the store vs. not going to the back made a difference in the engine temperature, and that was causing a problem with the motor starting.

Here you go - UL on snopes.

Although it’s classified as legend, I really like that story because of the dogged, thorough approach the engineer takes: instead of accepting the weirdness of flavour influencing the car, he takes all data available, with patience enough to track down the error and find a logical, instead of weird, explanation.

That’s exactly how it should be done, and a good way to tell a story to illustrate the scientific method, I think.

[QUOTE=Folacin]
I installed software with a spray paint can.

I had just purchased my first PC (an AT&T 6300)…
[/QUOTE]

I’m so sorry. Those AT&T PCs had some of the strangest internals I’ve ever met. IIRC, they were designed by Bell Labs but built by Olivetti, which made for some interesting characteristics, like the power supply leads being bolted to the motherboard instead of using the standard plug, and their own unique drive rails, which made adding drives a challenge. After trying every PC store in town (this was in the pre-Web era, so it was impossible to just type “6300 WGS drive rail” into a search engine), I wound up whittling some bits of wood to fit.

I remember when my workplace first started playing around with the C language. After using somebody’s personal copy to fool around a bit, the bosses okayed us buying an official copy for office use.

We could NOT get the old programs to compile. It gave ALL sorts of strange error messages. We started going line by line, comparing commands from the older version to the newer version, which seemed reasonable given that none of us really knew C. We would “fix” things that seemed like they might be a problem, but nothing worked. After many wasted man-hours, it occurred to somebody that our programs’ extensions might be the problem. The old files were named stuff like XYZ.C. Our new copy of C was C PLUS PLUS. We renamed the old programs with new names like XYZ.CPP and all was fine!
Duh!!
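For what it’s worth, most compilers still pick the language from the file extension, so a rename alone can completely change the errors you get. A hypothetical example of the flip side – code that is valid C but fails when compiled as C++:

/* Valid C; as C++ it fails, because C++ forbids the implicit conversion
 * from void * (what malloc returns) to int *. */
#include <stdlib.h>

int main(void)
{
    int *p = malloc(sizeof *p);   /* OK in C; error in C++ */
    if (p) {
        *p = 42;
        free(p);
    }
    return 0;
}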