Holy shit, did I really do that? Or: stupid programming mistakes

friedo · October 25, 2009, 3:30am

So I’ve got this process. It’s a long-running process. It runs for months and months and months, daemonized, pulling messages off a queue and executing the jobs contained therein.

I was proud of my process. It was simple, elegant, and did its job without complaint, merrily humming away, without any human interaction at all for 98 days.

Then one of the things it does stopped working. Oh well, I can’t complain, this thing has worked perfectly for 98 days, something was bound to screw up sooner or later. So the first thing I do is tail the log file to see what broke.

“Hmmm, this is weird,” I thought to myself. The problem only appeared today, but the last entries in the logfile are from a month ago. WTF? What’s it been doing for the past month?

So I check everything that was supposed to happen in the past month, and it all happened. So clearly something is screwy with the log file. Oh wait! The partition must be full of logfile cruft, so it got truncated. Nope. 12% full. Not even close.

So I run the module that broke manually so I can see what happens. Wait a minute – now there are new entries in the log, but dated last month? Wait…it can’t be…

Yes, ladies and gentlemen, my super-awesome, multi-buffered, multiple-filehandle-caching logging function. It gets the current timestamp. And I forgot that the standard time struct returns the month zero-indexed. EVERY log file EVER generated by this thing over the past 98 days has the month off-by-one.

I blame those confusing summer months, during which I wrote this thing, when I never know what number month it is.

BTW the thing that broke was really easy to fix. So was the log function.

Rysto · October 25, 2009, 4:16am

In fairness, numbering the months from 0 is one of the stupidest things I’ve ever heard of.

Do they number days of the month from 0, too?

friedo · October 25, 2009, 5:28am

No, the days are numbered from 1. The zero-indexed month thing is a standard libc “feature.” The tm struct from time.h has a month field, tm_mon, which ranges from 0 to 11. Perl’s localtime function is just a wrapper around the C version, so it returns a zero-indexed month, too.

The worst part is that I’ve been bitten by this exact thing before, in both C and Perl, but I never seem to learn my lesson.
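For anyone who wants to see the trap spelled out, here is a minimal sketch of a log-date formatter. The function names `log_date` and `log_date_strftime` are made up for illustration and are not from the actual logging code; the point is only that `tm_mon` needs a `+ 1` (and `tm_year` a `+ 1900`) while `tm_mday` does not.

```cpp
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper: format a broken-down time as YYYY-MM-DD.
// tm_mon is 0-11 and tm_year counts years since 1900, so both need an
// offset. tm_mday is already 1-31 and is used as-is.
std::string log_date(const std::tm& t) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%04d-%02d-%02d",
                  t.tm_year + 1900, t.tm_mon + 1, t.tm_mday);
    return buf;
}

// Same result via strftime, which applies the offsets itself.
std::string log_date_strftime(const std::tm& t) {
    char buf[64];
    std::strftime(buf, sizeof buf, "%Y-%m-%d", &t);
    return buf;
}
```

Letting `strftime` do the formatting is probably the safer habit, since there is no offset to forget.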

Jragon · October 25, 2009, 6:38am

The days of a month are 1 based
The days of a week, if represented as integers, are 1 based
The years are 1 based
The months are 0 based

I think it’s a C-based language thing; Java’s built-in calendar classes do it too, and I want to shoot whoever thought it was a good idea.

Edit:
If it makes you feel better, friedo, I’m the absolute king of off-by-one errors. I don’t think I’ve ever written a program where my self-written test suite hasn’t had a failure because I typed a for loop that goes to 6 instead of 7 or similar. I recently had to program an underlying model for a battleships game for a class and managed to program it to make all boats one space too big…

Stealth_Potato · October 25, 2009, 7:35am

I found a recent example of my stupidity through the simple expedient of popping open one of my IRC logs and doing a search for my nickname followed by “I’m an idiot.”

I was doing some work on a compiler I’m writing in C++. Specifically I was building a module whose job is to take expressions from the parsed syntax tree and produce minimal unambiguously parenthesized expression strings. E.g., ((a + (b * c)) * d) would become just (a + b * c) * d. (It’s a dynamic language, and converting already parsed code back into easily human-readable program text which is still re-parsable is a built-in feature.)

It was a simple enough task: the AST representation of all expressions includes the binding power of the associated operator, so all I had to do was basically add a ToString method for each expression node class that checks the binding power of its subexpressions’ operators and inserts parentheses when there would be an ambiguity. So I slam out a base class method and overrides for the various classes and run the unit tests. I immediately start getting erroneous results. It’s not like it just explodes, though: sometimes it works, and sometimes it doesn’t.

I spent about half an hour poring over the code for every override of ToString, satisfying myself that it must be correctly identifying ambiguity (it’s a pretty simple check), and then suddenly I realized the real cause of the problem: I had forgotten to mark a base class’s ToString definition “virtual.”

All along, half of the override methods I had written were just never getting called, and I hadn’t even thought to run them through the debugger. :smack:
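A minimal sketch of that failure mode, using made-up node classes (`ExprNoVirtual`, `AddNoVirtual`, `Expr`, `Add` and `show` are all hypothetical, not the compiler's real types). Without `virtual`, a call through a base reference is resolved statically, so the derived `ToString` is hidden rather than overridden.

```cpp
#include <string>

// Base class WITHOUT virtual: derived versions only hide this method.
struct ExprNoVirtual {
    std::string ToString() const { return "base"; }
};
struct AddNoVirtual : ExprNoVirtual {
    std::string ToString() const { return "a + b"; }  // never reached via base
};

// Base class WITH virtual: calls dispatch on the dynamic type.
struct Expr {
    virtual ~Expr() = default;
    virtual std::string ToString() const { return "base"; }
};
struct Add : Expr {
    // 'override' makes the compiler reject this if the base is not virtual.
    std::string ToString() const override { return "a + b"; }
};

// Calls through a base reference, the way a tree walker would.
std::string show(const ExprNoVirtual& e) { return e.ToString(); }
std::string show(const Expr& e) { return e.ToString(); }
```

In C++11 and later, marking every intended override with `override` turns this kind of slip into a compile error instead of a half-working test run.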

It’s a fairly understandable slip, all things considered (though I don’t even write code in Java all that often, so it’s not like I have that excuse). I just felt dumb for not figuring out the problem more quickly. :o

Consistency. Gotta love it. :dubious:

friedo · October 25, 2009, 5:48pm

I don’t understand the purpose of virtual methods. In most OOP languages, any function from a parent class can be overridden by simply naming it in the derived class. But then, I have learned to stay as far, far, far away from C++ as humanly possible.

Stealth_Potato · October 25, 2009, 6:40pm

Well, the idea in C++ is that, since dynamic dispatch is more computationally expensive than static dispatch, all class methods are non-virtual by default, and you have to specify which ones you want to be overridable.

It’s a performance decision made for broadest applicability: if there’s even one conceivable application of C++ where dynamic dispatch overhead would be significant, they figure it’s better to make it optional. Of course, maybe it would have been better to treat methods as virtual by default and just have the compiler decide which functions can be called by static dispatch. But then, the design philosophy of C++ also values not constraining compiler writers (hence the broad areas of undefined behavior in the language), which is perhaps an ironic decision considering just what a mess the language is to parse.

It’s just one of those things C++ programmers will have to deal with until the language finally goes the way of the dinosaur. :smack:

ftg · October 25, 2009, 7:15pm

Waaay back when, I was the local Conway’s Game of Life expert. I had figured out a way of programming it that saved a huge chunk of memory back when 8K was typical. So a friend took me to his place of work on a Saturday so I could program it on a minicomputer there. Had to translate the code, etc. It didn’t work. Something was clearly wrong. The original code was fine, but the new code wasn’t. Spent all afternoon and got nowhere.

Saved the code and stuffed it into a box. I’d come across it from time to time and I’d try to figure out what it was. It bugged (!) me that much. Finally, maybe 15-20 years later, I saw it. It had two nearly identical lines. I had duped the first and made changes, but I had goofed and not changed everything. It was absolutely crystal clear. Can’t believe I missed it.

A classic example of the mind seeing what it thinks should be there rather than what is actually there.

Re: months and 0. This makes sense if you need to do arithmetic on dates. Add six months to a month, take quotient and remainder. Simple. Never need to do that with years. Should never do it with days of the month. Just don’t ask me why days of the week are like that.
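A rough sketch of that quotient-and-remainder arithmetic, assuming a made-up `YearMonth` struct and an `add_months` helper that only handles non-negative offsets:

```cpp
// Hypothetical type for illustration; month is 0-11 as in struct tm.
struct YearMonth {
    int year;
    int month;
};

// Add n months (assumes n >= 0). With zero-based months the carry into
// the year is a plain quotient and the new month a plain remainder.
// With 1-12 months the same step would need (month - 1 + n) % 12 + 1.
YearMonth add_months(YearMonth ym, int n) {
    int total = ym.month + n;
    return YearMonth{ ym.year + total / 12, total % 12 };
}
```

So October 2009 (month 9) plus six months gives a total of 15, which is year 2010, month 3, i.e. April.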

in_hiding · October 25, 2009, 7:44pm

I was playing around with the Buddhabrot, and though something fractal-shaped came out, it wasn’t really what I expected. While looking for something unrelated, I came across this:



p_r = p_r * p_r - p_i * p_i + c_r;   //real part
p_i = 2 * p_r * p_i + c_i;           //imaginary part

Hm. :dubious: argh :smack: I guess I was really tired.

I hope.
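For anyone squinting at those two lines: the second one reads `p_r` after the first has already overwritten it. A minimal sketch of the step with temporaries, next to the version as posted (the `Point` struct and the function names are assumptions for illustration):

```cpp
// Hypothetical pair type holding real and imaginary parts.
struct Point {
    double r;
    double i;
};

// One step of z -> z*z + c. Both new parts are computed from the OLD
// values before either is stored.
Point step(Point p, Point c) {
    double new_r = p.r * p.r - p.i * p.i + c.r;
    double new_i = 2 * p.r * p.i + c.i;
    return Point{ new_r, new_i };
}

// As posted: p.r is updated first, so the imaginary part is computed
// from the new real part and the old imaginary part.
Point step_buggy(Point p, Point c) {
    p.r = p.r * p.r - p.i * p.i + c.r;
    p.i = 2 * p.r * p.i + c.i;
    return p;
}
```

Starting from p = 1 + 2i with c = 0.5 + 0.25i, the correct step gives -2.5 + 4.25i, while the posted version gives -2.5 - 9.75i, which is why something vaguely fractal still comes out.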

arjee · October 25, 2009, 7:54pm

One time waaaaaaaaaaay back in my days as a batch job scheduler, I started receiving many, many messages, one after another, regarding the abend of the same user-submitted job. I recognized the user id, so called the programmer, just as the abend messages stopped.

He was very sheepish. He had been using a product called ESP to submit his job. ESP understands normal language, so if you type in to run the job “every Monday at 22:13,” or “once an hour starting December 10th and ending January 17th,” it will. The programmer wanted his job to run that day (which was a Wednesday), then every second Wednesday after that (i.e., it would run this Wednesday, but not next Wednesday, but would again the following Wednesday, etc.). So, he had typed in, “run every second Wednesday at 12:00.”

It was after noon, so ESP submitted the job. Every second. On Wednesday.

MindWanderer · October 25, 2009, 7:56pm

I was rolling out some updates for software used by Quality Control Techs at my company. It takes in measurement data for our products and validates that it is ok, and so forth. The updates were causing the software to flag everything in certain types of measurements as invalid.

I was sure it was an error in the database itself (we had moved to a new schema), but after a lot of investigation I found that the error was in my SQL statement.

Each numeric measurement has an upper bound and lower bound, and these values are stored in the database and retrieved to check user input. I had an algorithm that took all the bounds that applied for a measurement for a specific product, and compacted them together, giving the highest of the lower bounds, and the lowest of the upper bounds.

Some measurements were receiving bounds that were not for the product in question. With the extra bounds included in the compactifier, the result was measurements flagged as invalid.

I had assumed that, because only measurements used in the product were included, it would not be a problem, but I had to add an extra check to make sure each Bound belonged to the right product.

All this trouble, delaying the launch of the updates significantly because of one missing part of the WHERE clause.
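The compaction step and the missing filter can be sketched roughly like this. The real schema and SQL are not shown in the post, so the `Bound` row shape, the `Range` type and the `compact` function are all assumptions; the product check stands in for the missing part of the WHERE clause.

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

// Hypothetical row shape, not the real schema.
struct Bound {
    std::string product;
    std::string measurement;
    double lower;
    double upper;
};

struct Range {
    double lower;
    double upper;
};

// Highest of the lower bounds and lowest of the upper bounds, taken
// only over rows for this measurement AND this product.
Range compact(const std::vector<Bound>& rows,
              const std::string& product,
              const std::string& measurement) {
    Range r{ -std::numeric_limits<double>::infinity(),
             std::numeric_limits<double>::infinity() };
    for (const Bound& b : rows) {
        if (b.measurement != measurement) continue;
        if (b.product != product) continue;  // the check that was missing
        r.lower = std::max(r.lower, b.lower);
        r.upper = std::min(r.upper, b.upper);
    }
    return r;
}
```

Drop the product check and a single stray row from another product can push the lower bound above the upper bound, at which point every input looks invalid.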

Stealth_Potato · October 25, 2009, 8:01pm

Hilarious! Also a brilliant example of why “natural language” interfaces are unequivocally a terrible idea.

beowulff · October 25, 2009, 8:05pm

Here’s the stupidest one I’ve ever made:

I designed a box that used Apple’s ADB (Apple Desktop Bus) protocol. I wrote the code in 8051 assembler. Everything worked great, and we burned hundreds of EPROMs with the code.
Then, a few days before the shipping deadline, one of the developers (who were developing software on the Mac that used the ADB box) said “The box locked up - but I reset it and it seems fine.” So, I did what any good hardware guy would do, and ignored it.
Then, he came to me the next day, and said “The box locked up again - I was away from my desk for a few hours, and it was locked up when I returned. This is going to be an issue.” At this point I decided that it was probably a real bug, and started to try to replicate the problem. I worked for hours, but everything seemed fine. Then, I managed to get it to lock up, by messing with the ADB signals. I finally determined that if the bus got reset more than four times, the box locked up.
So, I looked at what was happening, and it sure looked like the stack was getting clobbered. I thought to myself “what could cause the stack to get whacked? Something like jumping out of a function call, without resetting the SP.” But, was I doing that? Yes - every time the bus got reset, I simply LJMP’d back to the start of the code, where I had never explicitly set the SP. So, simply setting the SP at startup fixed the problem.
But, I wondered about why the box ran so well in testing. It turns out that Apple reset the bus very infrequently during idle times - like once an hour. So, the thing would work fine if you were actually using the machine, but lock up if you left it sitting there. So, I had to disassemble all of the boxes and erase and reprogram all the EPROMs, but the final shipping device worked to everyone’s satisfaction.

friedo · October 25, 2009, 8:19pm

That’s awesome. It’s no wonder good ol’ cron has survived largely unchanged for 40 years.

I’ve seen a lot of scheduling systems that add a huge amount of complexity trying to handle weird cases like “every Friday except the first Friday of the month or the second Friday if the day is less than 9.” (That is a real example from a previous job.)

I’ve long since resolved that the best way to handle cases like that is just schedule a cron for every Friday and have the program itself decide if it’s the right Friday to run. If not, just die. It’s not like you’re wasting huge amounts of resources to fire up some weekly batch job for a tenth of a second and then not run it.
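A minimal sketch of that approach, using a deliberately simpler rule than the one quoted above: run every Friday except the first Friday of the month. The rule and the `should_run` name are hypothetical; the job would call it on the current local time and exit immediately when it returns false.

```cpp
#include <ctime>

// Hypothetical rule: every Friday except the first Friday of the month.
// cron fires the job every Friday (for example "0 6 * * 5 /path/to/job",
// where the path is a placeholder) and the program decides for itself.
// tm_wday is 0-6 with Sunday = 0, so Friday is 5; tm_mday is 1-31.
bool should_run(const std::tm& t) {
    if (t.tm_wday != 5) return false;    // not a Friday at all
    bool first_friday = t.tm_mday <= 7;  // first Friday falls on day 1-7
    return !first_friday;
}
```

The calendar logic then lives in ordinary code that can be unit tested, rather than in scheduler syntax.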

beowulff, I’m glad I’ve never had to do any hardware hacking. I have no idea how you guys do that stuff. ADB was a nice protocol though.

Rysto · October 25, 2009, 8:53pm

The company I work for maintains an internal fork of the FreeBSD operating system. Maintaining the fork mostly works fine, but it means that we get further and further behind as new versions of the OS get released. This bit us pretty hard over a year ago – our developers needed a newer version of some third-party library they used, but the version of FreeBSD we were running was just too old to support the new library. We had to put together a crash program to get a later version of FreeBSD and merge our own changes into that version. All in all, it went quite well, with one small problem: our systems would always hang when trying to mount their root filesystem. The strange thing was that we hadn’t touched the disk subsystem at all. Suspicion immediately fell upon a set of changes I had made to the PCI subsystem involving enabling memory-mapped access to the PCI configuration space registers. Thanks to some idiocy in our revision control software, it took a lot longer than it should have for me to confirm that was indeed the problem. When I finally found it I was infuriated with myself:


register_addr = base_addr | register_offset;

You see, I had based my code on code for a different CPU architecture, where they were dealing with physical addresses. As it happens, the hardware guaranteed that the low bits of the base address would be zero, so the code was perfectly valid (if perhaps a bit too clever). When I ported the code to our architecture, it never for a moment occurred to me that I was dealing with a virtual address, and there was no telling what the low bits were going to be. As it happens, on the older version of FreeBSD that I had done the work on, the virtual address of memory-mapped hardware registers always had the same alignment as the memory-mapped physical address, so the code worked just fine for a year. But the new version of FreeBSD had improvements to how memory-mapped hardware registers were mapped into the virtual address space, and as a side effect the lower bits of the virtual address could now be anything. And so, because some overly clever programmer used ‘|’ where he really meant ‘+’, and I stupidly parroted him, I spent a week trying to track this down.
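The difference is easy to see with small numbers. A minimal sketch, with hypothetical function names and made-up addresses: OR and addition agree only while the base has zeros in every bit the offset can set.

```cpp
#include <cstdint>

// Only correct when the low bits of base are guaranteed to be zero.
std::uintptr_t reg_addr_or(std::uintptr_t base, std::uintptr_t offset) {
    return base | offset;
}

// Correct for any base alignment.
std::uintptr_t reg_addr_add(std::uintptr_t base, std::uintptr_t offset) {
    return base + offset;
}
```

With base 0x1000 and offset 0x10 both give 0x1010. With base 0x1010 the OR version still gives 0x1010, silently pointing at the wrong register, while the sum gives 0x1020.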

Otanx · October 25, 2009, 8:59pm

Another off-by-one guy here. Apparently I cannot count rows or columns in Excel. I had a script that would connect to Active Directory, open an xls file, and then write the updated address for the location to all the users in AD. The updated address was based on the old one. If the user was at 123 A st., then it was supposed to get the address from Sheet1 Row6. If the user’s old address was 456 B st., then the new address was on Sheet2 Row6. I miscounted which row I needed, and because it was such a simple script I didn’t test it. I updated 800 users, and then found out that I had just overwritten all the address information with the header row. Now all my users had the exact same address - Street: Street, City: City, etc.

Had to get a list from HR of what user was at what address, and write a new script to compare user names to the HR list, and update the address again. I tested it this time. Guess what? First time I ran the test I had my rows right but the columns were off by 1 so City was in Street, etc.

-Otanx

Jragon · October 26, 2009, 3:39am

Eh, nevermind

Jragon · October 26, 2009, 8:02am

Yay, I messed up again!

When doing networking in Java, make the objects that go over the tubes Serializable. >.<

si_blakely · October 26, 2009, 10:08am

For the OP, at least you didn’t use your log files to do this.

As for simple programming errors, I was doing a course based around programming a Z8000 VME bus dev system (second year CompSci course, I got accelerated entry during first year). We had to demonstrate binary multiplication using assembler. I was a bit late doing my test run, as were many others, and when I got to the labs, the test room was locked, and the only access to the dev systems was by telnet (no access to the small red reset switch). That would be fine if everyone knew how to terminate a simple loop and handle arithmetic overflow, but not everyone did :smack:

There were 8 dev systems, and when I got in, all of them were locked up with stupid bits of endless assembler. I managed to catch a dev system as it reset (probably from a stack overflow), loaded my code up and saw it work first time. I then spent a couple of hours reviewing other people’s code before letting them have a go on the VME system, to keep as many systems running as we could. Eventually, though, there were no running systems and some upset assembly language programmers.

I loved that course - the lecturer was a great communicator, a hard marker, and I (a few years later) married his daughter and passed his network communications course (I won’t tell you which was the more difficult of the two ;)).

Si

Crowbar_of_Irony_3 · October 26, 2009, 11:56am

Here’s how zero became my nemesis today.

I am writing an application which measures how far an object travels in a movie. The user is able to draw a line to set how many pixels make one metre. So the unit length is naturally the distance in metres divided by the pixel length of the line. I set the default unit length to 1. So the actual world distance travelled is the distance in pixels multiplied by the unit length.

There is a funny error which enraged me for 2 hours. After I load a second movie, the distance is always reported as zero. I’ve pulled my hair out, screamed at the monitor in impotent rage and sworn, when finally something struck me. What could happen to make everything become zero?

I check the re-initialization method that is called when a new movie is loaded. It sets the unit length to 0, whereas the constructor sets it to 1. Which is why the first movie loaded always works, but every one that follows reports a distance of 0…
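One way to make that class of bug harder to write is to keep the default in exactly one place and have the constructor go through the same reset path. A minimal sketch with a made-up `Tracker` class (none of these names come from the actual application):

```cpp
// Hypothetical class for illustration only.
class Tracker {
public:
    // The constructor uses the same reset path as a movie reload,
    // so the two can never disagree about the default.
    Tracker() { reset(); }

    // Called whenever a new movie is loaded.
    void reset() { unit_length_ = kDefaultUnitLength; }

    // The user draws a line of known length: metres per pixel.
    void calibrate(double metres, double pixels) {
        unit_length_ = metres / pixels;
    }

    // World distance is the pixel distance times the unit length.
    double world_distance(double pixels) const {
        return pixels * unit_length_;
    }

private:
    static constexpr double kDefaultUnitLength = 1.0;
    double unit_length_ = kDefaultUnitLength;
};
```

With the default named once, a reload after calibration goes back to 1 rather than 0.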
