Xerox copiers can't copy correctly

Xerox copiers have been found to mix up numbers when copying, due to the JBIG2 image compression they use.

Huh? Our character recognition is so good that a copier can recognise characters? And it’s reliable enough that it’s being used to compress copies? How is CAPTCHA still working, then?

With CAPTCHAs, the letters are often highly distorted and set against background patterns. Even so, computers have had a good success rate in getting past CAPTCHAs for years.

JBIG2 is a bit like a combination of OCR and compression - it recognises similar groups of pixels as symbols, which can then be very economically compressed.

So, yeah, when it goes wrong, it could mix up numbers.
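
To make the "similar groups of pixels" idea concrete, here is a rough sketch in Python. It is not the real JBIG2 matcher: the 5x7 bitmaps, the `pixel_distance` and `looks_same` helpers, and the threshold of 4 are all made up for illustration. It only shows how a "6" and an "8" can land within a small pixel difference of each other while a "1" does not.

```python
# Toy 5x7 glyph bitmaps, invented for this example ('#' = ink, '.' = paper).
SIX = [
    ".###.",
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    ".###.",
]
EIGHT = [
    ".###.",
    "#...#",
    "#...#",
    ".###.",
    "#...#",
    "#...#",
    ".###.",
]
ONE = [
    "..#..",
    ".##..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    ".###.",
]


def pixel_distance(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def looks_same(a, b, threshold=4):
    """Hypothetical matcher: treat bitmaps as one symbol if few pixels differ."""
    return pixel_distance(a, b) <= threshold


six_vs_eight = pixel_distance(SIX, EIGHT)
six_vs_one = pixel_distance(SIX, ONE)
print(six_vs_eight, looks_same(SIX, EIGHT))
print(six_vs_one, looks_same(SIX, ONE))
```

With a tolerance that loose, the 6 and the 8 count as "the same symbol", which is exactly the kind of mix-up being reported. A real encoder works on noisy scanned pixels at much higher resolution, so take the numbers here as illustration only.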

Heck, the Wikipedia article even mentions this as a flaw.

It seems like OCR might actually help avoid this, as it would know not to equate symbols known to be similar yet have different meanings.

Optical character recognition (OCR), in which the computer interprets the scanned image and outputs a text or Word doc, is commonplace, but most users know that they will need to proofread the results to make sure the computer didn’t misread the original document. The same is not true for simple image scans: you know the resulting image file may have a noise pixel here or there, but you don’t expect to have to proofread entire tables of numbers or bodies of text to make sure all the numbers and letters are unchanged.

In the case of the Xerox scanners, it’s not deliberately recognizing alphanumeric characters. Instead, it sees a block of pixels that looks very similar to a block of pixels it saw somewhere else in the document, so instead of recording that new block of pixels, it just records a reference to the old block of pixels. That’s fine for random lines and corners and stuff, but when it’s replacing entire letters with other letters that look “similar” but in fact are entirely different, the implications are frightening.

The guy’s blog gives some examples of the substitutions, and ponders the possible effects, including potentially lethal prescriptions and construction mistakes. Xerox ought to be wetting their pants right now.

Interesting. I don’t think I was aware that any compression system in use analyzed and replaced content on this basis. Could lead to some disastrous results for pages of critical financial information or other data.

I don’t get it, why are the copiers changing stuff around? Don’t they just copy what they see and then reprint it? It seems they are now trying to scan it, read it, then reproduce it. That seems like a lot of work, and it’s now prone to errors. What does it gain by doing this?

Reduced memory usage/file size. Check out the guy’s blog for complete details about what’s happening and why.

These copiers are typically used to scan pages directly to a fileserver (or email) in an office. If you configure them to use image compression you can end up with mistakes like this, due to compression errors. It’s not an OCR error.

If you’re scanning important stuff with lots of figures, turn off image compression.

I think you’d be hard pressed to identify a business document that was not regarded as important and didn’t have any characters or numbers in it. Which means the “normal” setting (which is the setting where this problem occurs) is pretty much useless. Seriously, in which documents would you be willing to tolerate massive numerical errors? Payroll? I-beam dimensions? Radiation dosimeter readings? Meeting times? Production quantities?

The scanner does provide a warning when you select the “normal” setting (see article here, scroll about halfway down), but the warning only appears for whoever changes the setting; it doesn’t show up for subsequent users. Moreover, one would not reasonably expect a “normal” image quality setting to be swapping digits around in your documents; that’s something you might expect from a “double extra shitty” image quality setting.

This is behavior that most end users do not expect from an image scanner.

Do the copiers allow for a choice of compression, such as JPEG or CCITT Group 4? Any completely image-based compression relying on schemes like RLE should be no problem. This notion of trying to replicate blocks of the image gives me the willies for exactly the kind of reasons above.
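
For comparison, here is a toy run-length coder in Python. It is only a sketch of the general RLE idea, not CCITT Group 4 or anything a copier actually runs, and `rle_encode` / `rle_decode` are names I made up. The point is that the round trip gives back exactly the pixels that went in, so there is nothing for it to "substitute".

```python
def rle_encode(row):
    """Encode a string of pixels as a list of (pixel, run_length) pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            # Same pixel as the current run: extend it by one.
            runs[-1] = (pixel, runs[-1][1] + 1)
        else:
            # Different pixel: start a new run.
            runs.append((pixel, 1))
    return runs


def rle_decode(runs):
    """Rebuild the original pixel string from the run list."""
    return "".join(pixel * count for pixel, count in runs)


row = "....####....#..."  # one invented scanline, '#' = ink
encoded = rle_encode(row)
decoded = rle_decode(encoded)
print(encoded)
print(decoded == row)
```

The file may not shrink as much as with pattern matching, but every pixel you get out is a pixel the scanner actually saw.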

I read through some of it, but didn’t fully understand it. I can understand why one would want compression for scanning the document, but not for a simple copy. Unless I missed where it’s not really happening when copying and only when scanning.

Meeting times. Definitely meeting times.

It’s going to be very interesting to see how Xerox handle this. This is the sort of incident that can break a company. Just think how many scanned documents are now going to have to be manually checked.

Changing a 6 to an 8 is almost understandable.

But I was looking at the original blog post and in one of the examples, it changed 21.11 to 14.13. That is pretty bad.

It did that because it had already seen the 14.13 elsewhere in the image (check out his test scans on his blog; 14.13 was first observed in “place 1” on the blueprint). It scanned in the pixels that spell out 21.11, analyzed it, and said to itself “hmm…that pixel pattern is a lot like a pixel pattern I’ve seen before (the pattern that spells out 14.13); I’ll just use that old pixel pattern again instead of trying to remember this new one.” It’s not that it’s got a permanent inventory of things that look like other things and are therefore acceptable substitutions; it’s adaptive, looking for patterns in the document at hand and then using those patterns to reduce file size later on.
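
Here is a rough Python sketch of that adaptive behaviour, under some loud assumptions: the pixel strings are invented, the labels are just what a human would read off the page, and `encode`, `decode` and the threshold of 2 are hypothetical stand-ins rather than the copier's real algorithm.

```python
def distance(a, b):
    """Number of positions where two equal-length pixel strings differ."""
    return sum(x != y for x, y in zip(a, b))


def encode(blocks, threshold):
    """Store each block as an index into a dictionary built from this page only.

    A block is added to the dictionary only if nothing already stored is
    within `threshold` differing pixels; otherwise the old entry is reused.
    """
    dictionary, refs = [], []
    for block in blocks:
        for i, seen in enumerate(dictionary):
            if len(seen) == len(block) and distance(seen, block) <= threshold:
                refs.append(i)  # "close enough": reuse the earlier pattern
                break
        else:
            dictionary.append(block)  # genuinely new pattern: remember it
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decode(dictionary, refs):
    """Rebuild the page from the dictionary and the list of references."""
    return [dictionary[i] for i in refs]


# (what a human reads, flattened pixels); the pixel strings are made up.
page = [
    ("14.13", "#.##.#..##.#.##."),
    ("17.42", ".##.#.##..#.#..#"),
    ("21.11", "#.##.#..##.#.#.."),
]
label_of = {pixels: label for label, pixels in page}

dictionary, refs = encode([pixels for _, pixels in page], threshold=2)
readable = [label_of[pixels] for pixels in decode(dictionary, refs)]
print(refs)
print(readable)
```

The third patch differs from the first by a single pixel, so it never makes it into the dictionary and the decoded page reads 14.13 where the original said 21.11. Which number "wins" depends only on which one showed up first on that particular page.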

From his blog, it looks like it happens when scanning hardcopies into PDF files, so presumably it does not happen when making a photocopy.

I have to say, this rather horrifies me. We (the general We) use .pdf files to archive documents, and this compression scheme is able to alter the very meaning of the document when archiving it? That is insanity. How could a machine designed to scan documents allow this compression scheme to be used on a text document?

Anybody know who still makes carbon paper? I may decide to invest.

How could a hammer designed for construction be used to break a window? Different tools have different uses. If you don’t want your copier applying image compression to your documents, use OCR mode. Recognizing characters is what it’s for. That’s why it’s called OCR.

Admittedly, these things could be documented better, but I don’t think it’s constructive to blame Xerox for the mistakes of people who incorrectly use their tools. Image compression is for scanning images. OCR is for documents. Hammers are for nails. Or cockroaches. I hate cockroaches.

I don’t mind image compression, I mind the scanner changing the numbers on my document. This isn’t me smashing a window with a hammer, this is the hammer head flying off into the window, and getting “you should have bought a more expensive hammer” as the excuse.

If the head flies off, I don’t care if it looks like a hammer, you shouldn’t sell it as a hammer. If the scanner changes the text of the document, you shouldn’t sell it as a document scanner. Or, at a minimum, don’t ship it with “document altering” as the factory default quality setting.

OCR is great for having the computer read a hardcopy consisting of pure text and spit out an editable Word document. If I want a digital version of a hardcopy that contains a combination of graphics and text (e.g. a building plan or a CAD drawing with dimensions/specs on it), OCR is not helpful.

Don’t you think a compression setting called “normal” should be expected to not be swapping out big blocks of pixels in your image? If they’re going to give me a small file size, I’d rather have a blurry image than a perfectly legible one consisting of wrong numbers; the former is easier to spot, the latter requires a very time-consuming proofread. One expects to proofread an OCR document, but not an image scan.