Killing integer; a clerk's nightmare (weird till incident)

I work in a small corner shop. Today, both our tills stopped working. And it baffles me.

My co-worker was behind the counter when it happened. He tried to scan an item; the till changed its display color as if it were in the wrong mode, then died. Before I had the chance to say anything, he scanned it on the other till. That one went dead after a second.

I think I know what caused it: he tried to scan a SIM card. Next to the usual 13-digit bar code, there’s a printout of the phone number, PIN, PUK… and a lengthy 20(ish)-digit bar code.

I went through the manual, internet, tried all the “secret” troubleshooting tricks to restart the till. Nothing.

I managed to open the drawer, and continued to write down each and every transaction until we closed the shop.

I called my boss. He wasn’t too happy. Repair guy is coming tomorrow. Apparently this is not the first time.

So, my question is: what the hell? How could a string of numbers on input kill the device, be it an electronic cash register or any computer’s cousin with a microprocessor? I would expect an error beep, a “not found” message or an automatic restart. Worst case scenario, frozen until you turn it off and on again. Hell, even a destroyed operating system and garbled screen. I’m not IT-knowledgeable, but could a “bit overflow” or whatever that was kill a device? And shouldn’t registers have some kind of protection against such occurrences?

(It’s a professional Casio till, less than 10 yrs old)

I’m a software developer, although not a firmware* developer.

This doesn’t surprise me at all. Two reasons:

  1. “overflow” errors are very common in a number of low level languages. It’s one of the top three most common errors even experienced developers make in C. It’s easy to do in assembly also. If the software for your register didn’t have good QA, or product decided it wasn’t worth developer time to add the ability to handle longer codes, or management decided it was better to ship with a known bug on time and handle any errors as they came, this would happen. To understand why it could brick your register, you need to understand a bit about how computers handle code. At a very high level:

Code is ordered into logical groups called functions. A function might be something like “calculate tax given the amount and the tax rate” or “add all the numbers in this list together” or “make the screen blink three times”. Functions can call other functions; for example, “print_receipt” might call “print_header”, then call “print_item_line” a bunch of times, and then call “print_footer”. When a function is called, the computer uses what’s called a stack to pass arguments to a function - for example, for the calculate_tax function, it might put the amount on the stack, and then put the tax rate on the stack, and the function knows that the first thing on the stack is the rate, and the second thing is the amount, and that’s how the function gets that information.

BUT, a key thing to note is that the computer needs to know where it left off when a function was called. So, something called the “return pointer” is put on the stack before any arguments. The function reads its arguments (say, the rate and amount from earlier), does its magic, and then reads the return pointer so the computer knows to go back to where it left off. However, if the computer expected that a certain argument would be a certain length, and it was longer, that argument can overwrite the return pointer, which means when the computer tries to follow that pointer, it will go to some (usually unhelpful) place and usually crash hard. (Doing this on purpose is called “smashing the stack” and was a common method early hackers used to break into systems.)

Modern languages have safeguards against things like this, but low-level languages like the ones probably used to program your register don’t. And if the return pointer happens to point to somewhere in the base code of the register, important stuff can get overwritten and cause the register not to even boot up. (Most modern systems ALSO have safeguards that don’t let you write over that part.)
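To make the return-pointer overwrite a bit more concrete, here is a toy model in Python rather than real firmware code. It assumes a 13-byte barcode buffer sitting directly in front of a 4-byte return pointer; the sizes, the address and the function names are all invented for illustration, and nobody in this thread knows the register's actual memory layout.

```python
# Toy model only: a flat block of "memory" holding a 13-byte barcode
# buffer, immediately followed by a 4-byte saved return pointer.
# BUF_SIZE, MEM_SIZE and the address below are invented for illustration.
BUF_SIZE = 13
MEM_SIZE = 64
RETURN_POINTER = (0x08001234).to_bytes(4, "little")


def new_memory() -> bytearray:
    """Build fresh memory: empty buffer, then the saved return pointer."""
    memory = bytearray(MEM_SIZE)
    memory[BUF_SIZE:BUF_SIZE + 4] = RETURN_POINTER
    return memory


def saved_return_pointer(memory: bytearray) -> bytes:
    """Read back the 4 bytes where the return pointer is stored."""
    return bytes(memory[BUF_SIZE:BUF_SIZE + 4])


def unchecked_copy(memory: bytearray, data: bytes) -> None:
    """Copy with no length check, the way a careless C routine might."""
    for i, byte in enumerate(data):
        memory[i] = byte  # happily writes past the end of the buffer


def checked_copy(memory: bytearray, data: bytes) -> bool:
    """Copy only if the data fits in the buffer; otherwise reject it."""
    if len(data) > BUF_SIZE:
        return False
    memory[0:len(data)] = data
    return True


if __name__ == "__main__":
    long_code = b"89441000301234567890"  # made-up 20-digit scan

    memory = new_memory()
    unchecked_copy(memory, long_code)
    print("unchecked copy kept return pointer:",
          saved_return_pointer(memory) == RETURN_POINTER)

    memory = new_memory()
    accepted = checked_copy(memory, long_code)
    print("checked copy accepted input:", accepted)
    print("checked copy kept return pointer:",
          saved_return_pointer(memory) == RETURN_POINTER)
```

With the unchecked copy, digits 14 to 17 of the long code land where the return pointer used to be, which is the "jump to some unhelpful place" case. The checked copy simply refuses the input and leaves everything else alone.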

  2. Some developers are just idiots.

* Firmware is software that is semi-permanently written to a device and very rarely if ever updated. Most embedded devices, things that aren’t attached to the internet, or things that are expected to go through some rough use, get firmware. Your register is somewhat like that. Firmware is often written with lower-level languages that have fewer of the “nice” safety features of other languages, usually for performance/memory reasons, so frequently when things go wrong there, they REALLY go wrong.

I know that could possibly be it, but it’s a very remote possibility that it actually went and wrote to firmware. It probably can’t even update its own firmware; that would be a JTAG job. It seems to me it’s failing to start merely because the stored transactions are corrupted (perhaps with the SIM codes still in there). It’s surely a DATA issue, not corrupt software. Maybe it’s the server refusing to talk to the offending till… which means the server has “data” that says (on purpose or accidentally) the till is to be ignored…

Thanks for the in-depth answer, it sheds some light, although I won’t pretend I understood everything that you wrote there.

I’m still blown away because the device is totally dead; no power supply buzz, no fan spinning.

Does it mean, in theory, a virus could crush a PC down to the hardware level??

Your home PC has WAY more safeguards for stuff like that. The operating system restricts access to the sector of the hard drive that holds “the important stuff”, so that’s generally not possible with any OS released in the last two decades.

BTW, the technical term for hosing Flash to the point where the machine won’t boot is “bricking” it.

I can see a buggy program crashing a (crappy) operating system, but frying it so that the power supply is toast and the fan does not spin? :eek:

Sometimes, actually quite frequently, the fan is controlled by software so the speed can be adjusted based on temperature. I’m dubious about the power supply though. Either the OP is mistaken about the power supply really being dead, or the failure didn’t really happen due to reading invalid data; the invalid input was just coincidental and the actual problem is a hardware fault. The latter seems more likely to me. It is pretty unusual for a well tested product to be brickable by an invalid input. I’ve worked in firmware development for many years and while I’ve seen it happen, I’m mildly skeptical that a Casio product could be bricked by just scanning an invalid barcode. That would require rewriting firmware, or storing some erroneous value in nonvolatile memory that the firmware reads during boot and that makes it crash hard. While it’s not impossible, I’m betting that you’ll find that the power supply failed due to a hardware issue and the invalid input is entirely unrelated.

In addition to the OS, which does a good job of protecting the hardware, modern PC hardware itself has firmware which will protect it from external fiddling. It’s hard to impossible to convince most PC hardware to damage itself using only software at this point.

Wasn’t always thus. Back in the 1990s, the old CRT monitors were largely dumb as rocks. If you misconfigured your OS, it was possible, although probably unlikely, to come across a combination of configuration settings which would cause the monitor to physically damage itself trying to obey commands it couldn’t cope with. Back in the 1980s, before personal computers had fully functional OSes, when any text editor could interface with hardware directly, some computers could be physically damaged by any piece of software you ran, if it did the wrong thing.

Again, these days, if you try to tell a modern monitor to do something stupid, the monitor will say “buzz off” and put an error message on its screen. Monitors are also smart enough to outright tell the OS what they can do, which is a great convenience.

To expand on that, every device you connect to your PC is actually its own limited computer, validating inputs and emitting output in addition to its primary operations.

I’m guessing that when the system was tested, they tried a wide variety of valid inputs, found that it handled them all correctly, and concluded that the system worked, and that nobody even thought to test it on invalid inputs.

Highly unlikely for a major company like CASIO.

But it’s impossible to test every possible 13-digit code, let alone every possible combination of 13 digits + arbitrary number of extraneous digits.

As the scanners just use serial communications and/or emulate a keyboard, this is just poor coding though. They don’t have to test every combination; they just need to normalize the input.

In this case I would bet that they probably used raw malloc rather than some sane, safer modern programming method. As UPC is a well-defined standard, there is no excuse for this type of bug in a modern system; any function that does not perform bounds checking and input validation should fail a code review with extreme prejudice.
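For what it’s worth, bounds checking and input validation on a barcode can be very little code. A minimal sketch in Python, assuming the 13-digit code is an EAN-13 (the thread doesn’t say which symbology the till actually expects) and using made-up function names: anything that isn’t exactly 13 ASCII digits with a matching check digit gets rejected before it reaches the rest of the system.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit from the first 12 digits.

    Digits are weighted 1, 3, 1, 3, ... from the left; the check digit
    brings the weighted sum up to the next multiple of 10.
    """
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(scanned: str) -> bool:
    """Accept only a 13-digit code whose last digit is the right check digit."""
    code = scanned.strip()
    if len(code) != 13:
        return False  # too short or too long, e.g. a 20-digit SIM barcode
    if not (code.isascii() and code.isdigit()):
        return False  # letters, control characters, odd Unicode digits
    return ean13_check_digit(code[:12]) == int(code[12])


if __name__ == "__main__":
    for scan in ["4006381333931", "4006381333932", "89441000301234567890"]:
        print(scan, "->", "accepted" if is_valid_ean13(scan) else "rejected")
```

The length check comes first on purpose: an over-long scan is thrown away before any other code touches it.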

I know that the world runs on questionable quality code, but I need to strongly point out that this is not a problem-complexity problem but a code-quality problem.

Note if you can hard lock a system with a scan you can probably also figure out how to make it execute arbitrary code.

I realize that this is past the point of your post, but if any current, aspiring or new programmers read this, this type of bug is not tech-debt, it is pure negligence.

This is not just an opinion, we live in a connected world and POS systems have high value PII and PCI data, and checking for/dropping unknown or extra input is a hard requirement.

Another possibility:
3. Many scanners are designed to be easily set up by average people. They get a book of special barcodes to scan to ‘program’ the machine. Like scan this barcode to set it for messages in English, scan this one to record currency in dollars rather than pounds or euros, etc., and finally one barcode to shut off programming mode and switch the machine back into normal operating mode.

It’s possible one of those other barcodes on that SIM card triggered the machine into a programming mode, but then didn’t make sense to it, or didn’t ever shut that off and switch back to normal operation.

[This is still an error by the system designer – it ought to be programmed so that if no valid programming commands are input for X number of minutes, the machine automatically switches back to normal operating mode.]
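That timeout idea could look roughly like the sketch below. Everything in it is hypothetical: the control codes ("CFG-ENTER" and so on), the 120-second limit and the class name are placeholders, not anything from a real scanner manual. The clock is passed in as a plain number so the behaviour is easy to follow.

```python
# Hypothetical control barcodes; real scanners define their own.
ENTER_PROGRAMMING = "CFG-ENTER"
EXIT_PROGRAMMING = "CFG-EXIT"
COMMANDS = {"CFG-LANG-EN", "CFG-CURRENCY-USD"}


class Scanner:
    """Scanner with a programming mode that expires when left idle."""

    def __init__(self, timeout_s: float = 120.0):
        self.timeout_s = timeout_s
        self.mode = "normal"
        self.last_command_at = 0.0

    def handle(self, code: str, now: float) -> str:
        # If programming mode has been idle too long, drop back to normal.
        if (self.mode == "programming"
                and now - self.last_command_at > self.timeout_s):
            self.mode = "normal"

        if code == ENTER_PROGRAMMING:
            self.mode = "programming"
            self.last_command_at = now
            return "entered programming mode"

        if self.mode == "programming":
            if code == EXIT_PROGRAMMING:
                self.mode = "normal"
                return "left programming mode"
            if code in COMMANDS:
                self.last_command_at = now
                return f"applied {code}"
            # Unknown codes are ignored and do NOT extend the timeout.
            return "ignored unknown programming code"

        return f"sale item {code}"
```

Only valid commands reset the timer, so a stray barcode that happens to switch the scanner into programming mode can’t keep it there indefinitely.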

Yeah, I was going to suggest accidental interpretation as a scanner control code, but the OP’s description of the problem appears to go a bit further than just the scanner itself behaving oddly after the scan.

I agree it sounds like an overflow issue.
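“Overflow” can also mean a number that is too big for the fixed-size slot it is stored in, which may be what the OP meant by “bit overflow”. A small Python sketch of that, with invented helper names and a made-up 20-digit number; Python’s own integers never overflow, so the fixed width is simulated here with a bit mask.

```python
def fits_unsigned(value: int, bits: int) -> bool:
    """True if value can be stored in an unsigned integer of this width."""
    return 0 <= value < (1 << bits)


def wrap_unsigned(value: int, bits: int) -> int:
    """What is left if the value is silently truncated to this width."""
    return value & ((1 << bits) - 1)


if __name__ == "__main__":
    ean13 = 4006381333931             # a 13-digit code
    long_code = 89441000301234567890  # made-up 20-digit code

    for name, value in [("13-digit", ean13), ("20-digit", long_code)]:
        for bits in (32, 64):
            print(name, f"fits in {bits} bits:", fits_unsigned(value, bits))
        print(name, "truncated to 32 bits:", wrap_unsigned(value, 32))
```

A 13-digit code already needs more than 32 bits, and a 20-digit code can exceed even 64 bits, so software that stores barcodes as plain fixed-width numbers has to check the size before converting.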

We’ve got to distinguish two scenarios:

  1. Invalid input causes the system to crash.
    This is certainly possible. Poor quality control, but certainly possible.

  2. Invalid input bricks the system.
    Something happens in processing the invalid input that makes the system permanently unable to boot. This is a far less likely scenario. The invalid input has to somehow cause the system to modify nonvolatile memory in a way that prevents booting. And, according to the OP, actually prevents the fan and power supply from starting up. I have serious doubts about whether this is really what happened.

I guess there’s option 1.5, where invalid input causes the system to behave in a way it shouldn’t, and this behaviour ends up bricking the system - sort of a cascade of events, such as:
  • Invalid input causes a crash
  • Crash is of some sort of unterminated loop nature, ramping up CPU to max
  • Software monitoring of temperature is not executed because the system is stuck in a loop
  • Fan shuts down (or just doesn’t start up when it should)
  • Something overheats and toasts itself beyond recovery

Of course there are ways to quite easily design scenarios like that out of existence, but not everybody bothers - there are enough cases of cheap, expedient designs that are flawed in some way such as this.
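One common way to design out the “stuck in a loop” step is a watchdog timer: a countdown that the main program has to keep resetting, and that restarts the device if it ever reaches zero. That is my example, not something the post above names, and the following is only a Python simulation with invented names and tick counts; on real hardware the watchdog is usually a separate circuit.

```python
class Watchdog:
    """Countdown that forces a reset unless it is kicked in time."""

    def __init__(self, timeout_ticks: int):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self) -> None:
        """Called by healthy firmware to say 'still alive'."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True if the watchdog forced a reset."""
        self.remaining -= 1
        if self.remaining > 0:
            return False
        self.resets += 1
        self.remaining = self.timeout_ticks
        return True


def run(total_ticks: int, hang_at, watchdog: Watchdog) -> int:
    """Simulate firmware that hangs at tick hang_at (or never, if None).

    Returns how many times the watchdog had to reset the device.
    """
    hung = False
    for t in range(total_ticks):
        if hang_at is not None and t == hang_at:
            hung = True          # e.g. bad input sends us into an endless loop
        if not hung:
            watchdog.kick()      # normal main-loop housekeeping
        if watchdog.tick():
            hung = False         # the reset gets the firmware running again
    return watchdog.resets


if __name__ == "__main__":
    print("resets without a hang:", run(10, None, Watchdog(3)))
    print("resets with a hang at tick 4:", run(10, 4, Watchdog(3)))
```

The point is that a hang then costs a restart instead of leaving the temperature monitoring dead until something cooks.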

For a sec there, I read pos as “piece of …”

Assuming it is bricked, vs powers up but doesn’t turn on the screen. And assuming you were able to do a hard reset (if it has batteries, it might’ve only done a soft one). Then I give 60% hardware problem, maybe power surge. 39% a botched firmware update, 1% a barcode problem.

The thing is, embedded devices have to keep their program in a flash ROM or similar, and reload from there every power-on. Bricking would require writing bad info to a critical address, and then successfully doing an EPROM write (so two kinda opposite failures). Even going into setup mode should be fixed by a hard reset. Now, a POS has to have some external connection to process credit cards etc, so Casio is probably able to push out a firmware upgrade. Depending on the method, there is a brief window in the middle of writing the new image to EPROM where an error could corrupt the ROM.
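A common way to close that window is to keep two firmware slots: write the new image into the spare one, verify it, and only then switch over. The Python sketch below simulates the idea with invented names and a `fail_after` parameter that fakes an interrupted write. It is an illustration of the general technique, not a claim about how Casio does updates.

```python
import zlib


class Device:
    """Simulated device with two firmware slots (A/B update scheme)."""

    def __init__(self, image: bytes):
        self.slots = [image, b""]
        self.active = 0  # index of the slot the device boots from

    def update(self, new_image: bytes, expected_len: int, expected_crc: int,
               fail_after=None) -> bool:
        """Write to the spare slot; switch over only if the copy verifies.

        fail_after (hypothetical test hook) cuts the write short after
        that many bytes, as if power was lost mid-update.
        """
        spare = 1 - self.active
        written = new_image if fail_after is None else new_image[:fail_after]
        self.slots[spare] = written

        stored = self.slots[spare]
        if len(stored) != expected_len or zlib.crc32(stored) != expected_crc:
            return False      # bad copy: keep booting the old image
        self.active = spare   # the single switch-over step
        return True

    def boot_image(self) -> bytes:
        return self.slots[self.active]


if __name__ == "__main__":
    new = b"v2-firmware"
    device = Device(b"v1")
    ok = device.update(new, len(new), zlib.crc32(new), fail_after=3)
    print("interrupted update applied:", ok, "boots:", device.boot_image())
    ok = device.update(new, len(new), zlib.crc32(new))
    print("clean update applied:", ok, "boots:", device.boot_image())
```

With this layout an interrupted write only ever damages the slot that isn’t in use, so the old firmware still boots.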

Tangent, but some devices handle power buttons, fans, etc with dedicated simple circuits; no code & no direct link to the micro. And JTAG is only required to do first programming (assuming the device has some sort of net or PC connectivity). Been there, done that.

Did the store’s net connection die? I can think of a couple scenarios where no connection could brick it.

No, the connection was there. Lottery, credit card and payzone machines were working fine. I don’t think the tills are connected to the net in any way; they’re not even interconnected.

To recap:

  1. tills were repaired/reprogrammed next morning. Can’t tell any details, I wasn’t there. But: till 1 was fine, till 2 went dead again after 30 mins, so maintenance guy had to come back.

  2. Similar thing happened a few years ago.

  3. Memory (all prices/items) was deleted (not sure if all of it). Storage is an SD card (permanently attached).

Are the machines purchased or rented? If the latter, has the bill been paid?

I remember one incident of a combine harvester or tractor dying in the middle of a field because the payment had failed to go through.