My scanner+OCR reads p. 105 as jibberish. ONLY p. 105! What gives?

stuyguy · February 22, 2006, 10:38pm

I’m in the process of converting a 400-page typewritten manuscript into a Word doc. I’m using a scanner equipped with OCR software. It’s a slow, brain-numbing process, because the OCR program makes a lot of errors that must be corrected by hand, but I’m grateful that it can do the heavy lifting so that I don’t have to retype the thing myself. (I’m a miserable and slow one-fingered typist.)

Anyway, things are chugging along splendidly until I get to page 105. In a batch of about six sequential pages that all render fine, page 105 comes out as utter jibberish.

“Huh!” I say. Maybe I placed the page on the scanner upside down. So I try again. Jibberish.

One more time. Again jibberish!

So I try a different page (108). It comes out fine!

What is it about page 105 that makes my scanner and OCR program choke?

Q.E.D · February 22, 2006, 11:07pm

Hold on while I warm up my super clairvoyant powers, and I’ll tell you.

Shagnasty · February 22, 2006, 11:20pm

Maybe the paper is different. Try putting a blank piece of paper behind that one.

Q.E.D · February 22, 2006, 11:28pm

Or maybe fingerprints? Coffee stains? Smudges? Ribbon ran low on ink? In other words, there are a lot of possibilities. Might help if we can get a good scan of the page in question, and perhaps of a good page, for purposes of comparison.

stuyguy · February 23, 2006, 12:10am

Okay, I think I’ve got it somewhat figured out.

First of all, page 105 is stain-free and the paper is identical to all the other pages, so those ideas, I figured, were non-starters.

But I examined the text up close and noticed that there was a mistyped character in one of the words that rendered like some weird foreign letterform on the page. I thought that maybe this one misshapen letter might be fooling the software into thinking that I had a page filled with non-English text. So, I covered the word containing the bad letter and tried again. This time it came out perfect!

Just to double-check my theory, I uncovered the faulty word and scanned it one more time. Gibberish!*

*Yes, I finally figured out how to spell “gibberish.” Sorry I got it wrong in my first post.

Q.E.D · February 23, 2006, 1:28am

Just out of curiosity, what kind of OCR software are you using that’s so easily thrown off?

stuyguy · February 23, 2006, 5:31am

ABBYY FineReader.

Tim_T-Bonham.net · February 23, 2006, 10:25am

I suspect you have left the default value set to ‘automatically detect the language’. If you are generally scanning only English text, you can set it to ‘English’ to avoid situations like this.

stuyguy · February 23, 2006, 4:08pm

Oddly, no. The program was set for English. Curious, no?
