We had a customer for whom we had written a custom extension to our AR Invoice Posting process. They had a recurring but infrequent bug in which partial batches were posted to the G/L.
We checked every time it happened, and it always looked like the final step of the post was running out of sequence. This should have been impossible, because the job was executing from the JobQ, with each step putting the next step in the queue so it would run when the current step finished.
We checked and checked, and couldn’t figure out how that one step was getting started early.
Finally, the customer’s IT manager noticed something that gave us a clue. It only happened when the job was initiated from a workstation whose terminal ID began with the letter E! (On the old IBM System/36, every workstation had a two-character terminal ID, which could be any combination of two numbers or letters.)
So our program always worked on terminal W1 or X2, but (sometimes) screwed up when run from terminal E2 or E4 (or E anything). Very strange!
With this in mind, we scanned through the code for references to the terminal ID.
We finally found the problem, not with the terminal ID itself, but with the data structure that just happened to be immediately before it in memory. This structure held the name of the day of the week, which we displayed on the screens for the convenience of the operator.
The programmer who had set up the date data structure had somehow missed that “Wednesday” has nine letters, while every other day of the week has eight or fewer. The field was one character short, so on Wednesdays the day name overflowed and pushed the terminal ID down one byte in memory.
This wouldn’t have mattered, except that the data structure that happened to be on the other side of the terminal ID was a flag byte we used to tell the program whether to execute from the JobQ, or whether it should run immediately (be “evoked” in System/36 terms). If the flag was an E, the job evoked another occurrence of itself, which began to run immediately, then died. Presto! Job step out of sequence.
So the bug happened only when the job ran from a workstation whose ID began with the letter E, and even then only when the job was initiated on a Wednesday.
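For anyone who wants to see the first half of that failure in miniature, here is a rough Python sketch. It is not the original System/36 code. The record layout, the field offsets, the function names and the "J" flag value are all made up for illustration; only the eight-character day field and the two-character terminal ID come from the story. It shows an unchecked store of a nine-letter day name spilling into the neighbouring field, and it does not try to reproduce exactly how the E ended up in the flag byte.

```python
# Hypothetical flat record: day name, then terminal ID, then a flag byte.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

DAY_LEN = 8    # the undersized field: one byte too short for "Wednesday"
TERM_LEN = 2   # two-character terminal ID
FLAG_LEN = 1   # assumed one-byte JobQ/evoke flag

DAY_OFF = 0
TERM_OFF = DAY_OFF + DAY_LEN
FLAG_OFF = TERM_OFF + TERM_LEN
RECORD_LEN = FLAG_OFF + FLAG_LEN


def unchecked_store(record: bytearray, offset: int, value: str) -> None:
    """Copy value into the record with no length check, like the buggy field."""
    data = value.encode("ascii")
    record[offset:offset + len(data)] = data


def build_record(day: str, terminal: str, flag: str) -> bytearray:
    """Fill the terminal ID and flag first, then store the day name unchecked."""
    record = bytearray(b" " * RECORD_LEN)
    unchecked_store(record, TERM_OFF, terminal)
    unchecked_store(record, FLAG_OFF, flag)
    unchecked_store(record, DAY_OFF, day)  # may run past DAY_LEN
    return record


def read_terminal(record: bytearray) -> str:
    """Read the terminal ID back from its fixed position."""
    return record[TERM_OFF:TERM_OFF + TERM_LEN].decode("ascii")


# Which day names do not fit in the field?
too_long = [d for d in DAYS if len(d) > DAY_LEN]

if __name__ == "__main__":
    print("Days that overflow the field:", too_long)
    for day in ("Tuesday", "Wednesday"):
        rec = build_record(day, "E2", "J")
        print(day, "-> terminal ID reads as", repr(read_terminal(rec)))
```

Sizing the field from the longest value it will ever hold (for example `max(len(d) for d in DAYS)`) rather than from a typical one is the obvious fix.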
That was a bugger to find.