Programmers: Best bugs

I’m working in Go now, and have been getting into concurrency. I’ve been writing a few slightly complex component-based systems – for practicality I decided to make them concurrent in various places. Obviously this means dealing with all the shit concurrency makes you deal with.

First, a tiny bit of syntax background, since most of you haven’t used Go. A goroutine is essentially a routine that operates concurrently with other goroutines (including the main routine). They’re not threads at the OS level (though they can run on separate threads), but for a lot of cases you can basically think of them as threads, since they operate in similar ways. To spawn a goroutine you type “go” in front of a function call. You can sort of think of a function spawned as a goroutine as a unix command launched with & – all it means is that your current goroutine/thread doesn’t wait for it to return.

Another thing to understand is the “defer” keyword which will call a function after the current function returns. Useful for closing files and unlocking mutexes.

I am so happy my test cases caught this, because I wasn’t testing for it. They just caught it by accident.



// Without too much fluff, this just means "this function belongs to a pointer to an instance of an object of type System"
func (sys *System) DoSomething() {
    go sys.functionWithSideEffect() // Spawn a new goroutine with a side effect
}

func (sys *System) functionWithSideEffect() {
  sys.RWMutex.Lock() // lock a mutex for writing
  defer sys.RWMutex.Unlock() // unlock mutex after functionWithSideEffect returns
  sys.someVariable = something
}

func (sys *System) GetSomething() VarType {
  sys.RWMutex.RLock() // lock a mutex for reading
  defer sys.RWMutex.RUnlock() // Unlock after return
  return sys.someVariable
}


Seems okay, maybe.

Here were my tests (well, the minimal version)



sys := NewSystem()
sys.DoSomething()

a := sys.GetSomething()
if a == nil {
  t.Errorf("a wasn't set. Oh no!!")
}


This, being a bug thread, was obviously calling the error function.

What’s the problem? Well… I had a race condition. With mutexes. Specifically, would the reader or writer lock it first?

I figured that it “made sense” to lock the mutex in the new goroutine. This was a bad idea. You see apparently whatever overhead goes into spawning a new goroutine happens concurrently, so DoSomething() returned before the code ever entered functionWithSideEffect().

However, the GetSomething function was not concurrent and did not have that overhead, so it got the lock first (i.e. didn’t wait for the function we called DoSomething for to do its thing), and thus when it returned the desired variable, it wasn’t set yet because the writer was waiting on the reader to unlock the mutex.

This was easily fixed by locking the mutex before spawning the goroutine, but the reason I put the mutex where it was in the first place was to try to head off race conditions (and concurrent modification problems, obviously).

I’m sure anybody who’s worked with concurrency and multithreaded environments before is laughing at me and how obvious a bug it was. And to be fair, I fixed it in about 10-20 minutes, I just found it annoying and a bit funny that there was a race condition about who was going to lock the thing meant to prevent a race condition.
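For anyone who wants to see the “lock before spawning” fix spelled out, here is a minimal sketch. The System type, its string field and the value parameter are stand-ins I made up for illustration, not the real code. The one thing it relies on is that Go allows one goroutine to lock a mutex and a different goroutine to unlock it.

```go
package main

import (
	"fmt"
	"sync"
)

// System is a hypothetical stand-in for the real type.
type System struct {
	mu           sync.RWMutex
	someVariable string
}

// DoSomething takes the write lock in the caller's goroutine, so a reader
// that runs right after DoSomething returns has to wait for the write.
func (sys *System) DoSomething(value string) {
	sys.mu.Lock()
	go sys.functionWithSideEffect(value)
}

// functionWithSideEffect assumes the write lock is already held by the
// caller and releases it once the write is done.
func (sys *System) functionWithSideEffect(value string) {
	defer sys.mu.Unlock()
	sys.someVariable = value
}

// GetSomething blocks on RLock until the pending write has unlocked.
func (sys *System) GetSomething() string {
	sys.mu.RLock()
	defer sys.mu.RUnlock()
	return sys.someVariable
}

func main() {
	sys := &System{}
	sys.DoSomething("set")
	fmt.Println(sys.GetSomething()) // prints "set"
}
```

The catch is that the caller now holds the lock while the goroutine starts up, so anything on the caller’s own stack that already holds that mutex will deadlock.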

I once wrote some JavaScript to position an element at the bottom of a webpage – actually the bottom of the portion of the page that’s visible in the browser window. Because of an off-by-one error, it actually positioned it one pixel too low, so the element was slightly cut off, but not in a very noticeable way. However on particularly short pages, this meant that the element would cause a vertical scroll bar to appear (longer pages would of course already have the scroll bar). And if the width of the page was initially just shy of what would require a horizontal scroll bar, then the appearance of this vertical scroll bar would cause the horizontal scroll bar to appear.

Of course the horizontal scroll bar pushed up my element, so it was no longer extending beyond the visible region, and the vertical scroll bar was no longer needed. Without the vertical scroll bar taking up space, there was enough room side-to-side that the horizontal scroll bar was no longer needed. Then my element was able to move back to its original position, causing the vertical scroll bar to reappear, causing the horizontal scroll bar to reappear, and so forth. On each disappearance and reappearance of the scroll bars the content of the page would shift slightly to accommodate them.

The bug report said simply: “Causes the website to shake violently back and forth.”

A race condition with mutexes is pretty funny, I’ll grant you.

I do more debugging than writing, myself. I’m much better at troubleshooting code than at writing it from scratch. As a result, my “best bugs” are mostly things I’ve found in other people’s code.

Sadly, the very best bug I’ve found would be both highly recognizable and somewhat embarrassing to people in two large corporations who could make things uncomfortable for me, so I won’t go into too much detail here. I can tell you this much: if a customer started a session in the last 13 seconds of one particular minute in late afternoon, something very dramatic (though fortunately not disastrous) happened to their account…and there were enough customers that this happened at least a few times every day. It was, of course, also a race condition. Sort of. The immediate cause was someone doing something quite clever with a config file without knowing an obscure bit of undocumented trivia hiding in over a million lines of code.

The best bug I’ve personally created was in a class in college. It was a microprocessors class, and we were coding in assembly for 80386 in the lab. Most of it was pretty easy, but my team ran into a bizarre bug when we wrote a keyboard interpreter for an assignment. It worked fine…except for one key. If you hit the “w” key, it produced an error that our test system reported as “Missing Math Coprocessor”. We never did figure out what it was doing, and when the prof admitted defeat after several hours of poking at it, we gave up.

Well…there was also a bug that caused a large, fuzzy spider prop to unexpectedly fall on my teammate’s head. That one was pretty good, too. :P

I just thought of another one. I think I might have posted it here before, but I can’t recall.

I was in a Game Design class, and I was doing the AI. For path finding we tried A* and Steering Behaviors. Neither worked, and we had several professors and our instructor look it over. Nobody could figure out the bug. Unfortunately, to this day I have no idea what that bug was, so you will not be hearing about that one.

Instead I decided to move to potential fields (these ended up being the solution we used in the final product, but it was still a little buggy).

Anyway, to create a potential field you essentially divide up your game world into a bunch of little boxes (ours were 5x5 units, a unit ~= 1 pixel big), and for each box calculate the field. Because of the cost of recalculating the whole field, a box would only be recalculated if the value was needed.

I can’t recall exactly what I did, but I needed to set or check some value every update cycle, but only for certain boxes. What I did – and mind you I knew this was a bad idea at the time, but it was a band-aid fix I meant to optimize later – was iterate through all 100000 elements every update cycle (an update is supposed to only take a small fraction of a second) to check which ones needed to be changed. This made each update take a very long time, meaning the game was unplayably laggy. In fact, this game had waves of enemies, and after <x> seconds (or all the enemies were killed) it would go to a between-wave level-up screen until you started the next wave. The updates took so long that before your input could be registered, the next wave screen would show (partially due to a time-elapsed handling bug in the engine we were using).

I couldn’t figure out what was wrong though. Why? Because Java’s JVM optimizer is stupid-smart. You see, my group mates on Mac and Linux were yelling at me because I broke the build, but I couldn’t see the bug. It would lag on the first update, but then be fine. My group-mate on Windows started having the problem too. But it didn’t happen on my Windows laptop (my Mac was in the shop), or my home Windows machine.

Until I unplugged the charging cable from my laptop.

You see, the JVM’s JIT optimizer/compiler does some cool stuff – but only on Windows. On top of that, it seems to have some way of checking if you’re plugged in or not, and turns off some of the more power-intensive optimizations when you’re unplugged. One of these features is the best damn loop unrolling mechanism I’ve ever seen. After just a couple of calls to this function, it managed to unroll the loop to a point where it was only taking a fraction of a millisecond to execute – without that optimization it was taking 1-2 seconds.

So there was a bug I couldn’t see because an environment dependent, platform dependent JVM optimization feature was too good.

Oh god, that bug I mentioned in the OP was even worse.

You see, part of the reason I put the mutex inside the new goroutine (and in fact, the entire reason that side effect is in a goroutine in the first place) was to keep a thread from deadlocking itself: if the function is called further down the stack during a call to the (non-concurrent) Update method, that mutex is already locked. So putting it outside the goroutine is pointless.

I had to fix the mutex race condition to guarantee the order of Reading and Writing with a waitgroup. Sigh
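Roughly what the waitgroup version looks like, as a sketch only. The System type, the pending field and the string value are names I invented for the example, and it assumes a single caller that calls DoSomething and then GetSomething in order. The important part is that Add happens before the “go”, not inside the goroutine, otherwise you just get the same race back.

```go
package main

import (
	"fmt"
	"sync"
)

// System is a hypothetical stand-in for the real type.
type System struct {
	mu           sync.RWMutex
	pending      sync.WaitGroup // writes that were spawned but have not finished
	someVariable string
}

// DoSomething registers the write before spawning it, so a later reader
// knows there is something to wait for. No lock is held by the caller.
func (sys *System) DoSomething(value string) {
	sys.pending.Add(1)
	go sys.functionWithSideEffect(value)
}

// functionWithSideEffect locks inside the goroutine, as in the original.
// Deferred calls run last-in first-out: Unlock first, then Done.
func (sys *System) functionWithSideEffect(value string) {
	defer sys.pending.Done()
	sys.mu.Lock()
	defer sys.mu.Unlock()
	sys.someVariable = value
}

// GetSomething waits for every registered write before taking the read lock.
func (sys *System) GetSomething() string {
	sys.pending.Wait()
	sys.mu.RLock()
	defer sys.mu.RUnlock()
	return sys.someVariable
}

func main() {
	sys := &System{}
	sys.DoSomething("set")
	fmt.Println(sys.GetSomething()) // prints "set"
}
```

Since the caller never holds the mutex itself, code further down its stack that already has the lock can still call DoSomething without deadlocking on the spot.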

Now this is a simple and dumb bug that isn’t particularly impressive, but to put it into context, it was written by a developer with 20+ years of experience who should know a lot better. Mistakes like this you get out of your system in the first year or two of programming.

Trying to set the default date of a control to be the end of the month:



_jcbAuditReportToDate.ComboBox.SelectedValue = new DateTime(now.Year, now.Month + 1, 1).AddDays(-1);


All sorts of problems in December, when now.Month + 1 is 13 and the DateTime constructor throws. Even better, this is called at start-up, so for a whole month we were spammed with error messages whenever anyone opened the application.

About a thousand years ago I was lead developer for a Cobol based, green screen mainframe application. During the final QA phase we’d have real users based in different locations around the site do testing for us. We’d bring the system up in the morning and afternoon but would shut it down over lunch to drop in any patches we had ready.

After we went to production it was all running pretty smoothly, except that one user would complain that the system was down. Everyone else was working fine. After the user reported the same thing every day, we sent someone out to their workspace to see them in action. He turned up at the user’s desk around 11:45 and the user pointed at the terminal and said “see it’s down!”. There was a post-it stuck to the terminal that said “system down 11:30 - 12:30 daily”. My colleague pulled off the post-it, tore it up and said “there - bug fixed”.

This is a different wait group from the one mentioned before but… uh, if you want to prevent deadlocks when passing waitgroups around to different concurrent routines, make sure you pass a POINTER to the waitgroup instead of passing by value. I can’t even believe I made that mistake. (To be fair, there’s some funkiness with pointers and interfaces in Go that makes sense if you think about it, but confuses me sometimes).
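A minimal sketch of the pointer version, with worker and runWorkers as invented names. If worker took a sync.WaitGroup by value, each Done() would decrement a private copy and the Wait() in the caller would never return. As far as I know, go vet’s copylocks check flags that kind of copy.

```go
package main

import (
	"fmt"
	"sync"
)

// worker takes a *sync.WaitGroup so that Done() decrements the caller's
// counter rather than a copy of it.
func worker(id int, wg *sync.WaitGroup, results []int) {
	defer wg.Done()
	results[id] = id * id // each worker writes only its own slot
}

// runWorkers starts n workers and waits for all of them to finish.
func runWorkers(n int) []int {
	var wg sync.WaitGroup
	results := make([]int, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go worker(i, &wg, results) // pass the address, not the value
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(runWorkers(4)) // prints [0 1 4 9]
}
```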

My favorite was posted by somebody else somewhere, maybe in the other threads. A program was misbehaving and he had it write a debugging message into a text file. Then he tried to read the text file and got the message “File not found.” He went crazy for a while, because he could see the file listed right there. Then it finally occurred to him that “File not found” was the text his program had written into the file, not an error generated when he tried to read it.

A simple shell function:



function test_file()
{
    local line

    grep "$1" "$2" | while read -r line
    do
        if some_condition "$line"
        then
            return 0
        fi
    done

    return 1
}

This function was always returning 1. The bug was not in some_condition. It was incredibly obscure and the only reason I found it was because I’d seen a similar bug elsewhere. I’d be surprised if anybody at the SDMB even knows the piece of “trivia” that causes the bug:

The problem is that because we are piping into that while loop, the entire while loop runs in its own subshell. When you run return 0 inside the subshell, it only exits the subshell, so execution continues on the line after the done keyword, and hence the function always returns 1. I hate shell scripting.




I’ve worked on machines where both hardware and software had bugs; that led to some real doozies. I’ve already mentioned some of the more interesting, on other fora, under another alias. If you wish I’ll post details of how to detect a single-bit (“correctible”) storage error on one S370 model (due to, arguably, a firmware bug), without any Diagnose instruction or such.

An interesting set of bugs came to my attention when a computer manufacturer had unexplained poor disk performance running Unix. A “disksort()” was intended to insert incoming requests for seek efficiency, but ended up having the opposite effect. One of the reasons for problems was that, although a BSD struct buf is a large object of over 100 bytes, someone chose to save space by trying to let one field do double duty:



#define b_cylinder b_resid              /* Cylinder number for disksort(). */

The detailed failure on one disk driver would be tedious to explain, but during sequential overwrite of a large file, every 3rd op would be a distant seek to overwrite the file’s indirect block – an unnecessary write except that the seek-ordering flaw led to internal buffer exhaustion.

The same vendor had another disk driver, which also had a (completely different) bug in its interface to disksort(), and completely different (and worse!) associated performance problem. In each case, the bugs were overlooked since they affected only speed, not “functionality.”