While reading an article today about going paperless, I realized that I periodically photograph pages or documents to keep a record, but never convert those photos to text using an OCR program.
What is the best program for converting, say, JPEGs to text files? Are multiple-column pages a problem?
Last week I was talking to a blind guy I know, and apparently OCR has improved a LOT in the last few years. It handles tables and columns and everything; you just have to give it a high-enough-quality image. I’d like to find out more, but he was describing the equipment he got through CNIB. I have no idea what regular civilian OCR is like these days.
Acrobat, by the way, requires a minimum resolution of 144 dpi for OCR to work. If i’m scanning something for OCR, i usually scan at 300 dpi on my garden variety CanoScan 4400F.
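For photos rather than scans, a rough way to check whether an image clears that 144 dpi minimum is to divide its pixel dimensions by the physical page size. This is only a sketch: it assumes the page fills the whole frame, and the function name and the example photo dimensions are made up for illustration.

```python
def effective_dpi(pixels_wide, pixels_high, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical pixels-per-inch.

    Assumes the page fills the entire frame; any margin around the page
    in the photo makes the real figure lower than this estimate.
    """
    return min(pixels_wide / page_width_in, pixels_high / page_height_in)


MIN_OCR_DPI = 144  # the Acrobat minimum quoted above

# Hypothetical photo: a 1200 x 1600 pixel shot of a letter-size page.
dpi = effective_dpi(1200, 1600, 8.5, 11.0)
usable = dpi >= MIN_OCR_DPI

print(f"about {dpi:.0f} dpi, usable for OCR: {usable}")
```

By this estimate a 1200 x 1600 photo of a letter-size page falls just short of the minimum, while a 300 dpi scan of the same page (2550 x 3300 pixels) clears it easily.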
Here’s a page from a pdf that i scanned for my class the other day, and here’s what the first paragraph looks like after selecting, copying, and pasting. I’ve done no cleanup, and made no corrections:
The line breaks are “hard” breaks, so if you paste it into Notepad or Word or something, the line breaks will come with the text. There may be a way to avoid this line break issue and get continuous text, but i’m not sure how.
Anyway, as you can see, the OCR did a great job of turning the scanned page into text and punctuation. Just one error, at the beginning of the second sentence, where the OCR added a superfluous “l”.
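On the hard line breaks: one way to get continuous text is to run the pasted text through a small script that joins the lines of each paragraph. A minimal sketch, assuming paragraphs are separated by blank lines (the `unwrap` name is just for illustration):

```python
import re


def unwrap(text):
    """Join hard-wrapped lines into paragraphs.

    Blank lines are treated as paragraph separators and are preserved.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Rejoin words hyphenated across a line break ("recog-\nnition").
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Replace the remaining line breaks with single spaces.
        para = re.sub(r"\s*\n\s*", " ", para)
        joined.append(para)
    return "\n\n".join(joined)


sample = "The OCR did a great job of recog-\nnizing the text\non this page.\n\nSecond paragraph\nhere."
print(unwrap(sample))
```

One caveat: the hyphen rule will also glue together a genuinely hyphenated compound that happens to fall at the end of a line, so the output still needs a quick read-through.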
Of course, a JPG can be converted to TIF, but if the JPG was heavily compressed, its compression artifacts carry over into the TIF and will decrease the OCR accuracy.
After a bit of digging, I found this site which reviews several programs. The one that seemed most appropriate for my use (in particular, converting JPGs) was TopOCR. After playing around briefly with the program, I’m not as enthusiastic as the author of the review, but it seems adequate.
I tested it out on three photos. This one converted perfectly.
This one is more similar to most of my photos, and was not converted very well.
And a photo on my hard drive of a highway historical marker plaque converted quite well, especially after I changed it to grayscale, jacked up the contrast, and inverted the color so that it was black text on white background. It was nice to be able to fiddle with the image within the program, rather than having to save the doctored image in Photoshop every time.
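The same three steps (grayscale, more contrast, invert) can be scripted. The sketch below is not what TopOCR or Photoshop does internally; it is a plain-Python illustration on a tiny grid of RGB pixels, with made-up function names, and a real workflow would more likely use an imaging library.

```python
def to_gray(rgb):
    """Convert one (r, g, b) pixel to a 0-255 gray level (BT.601 luma weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def stretch_contrast(gray_rows):
    """Linearly stretch gray levels so the darkest becomes 0 and the lightest 255."""
    flat = [v for row in gray_rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in gray_rows]
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in gray_rows]


def invert(gray_rows):
    """Swap light and dark so pale lettering becomes dark text."""
    return [[255 - v for v in row] for row in gray_rows]


def prepare_for_ocr(rgb_rows):
    """Grayscale, then boost contrast, then invert, in that order."""
    gray = [[to_gray(p) for p in row] for row in rgb_rows]
    return invert(stretch_contrast(gray))


# Tiny stand-in for a plaque photo: pale lettering on a dark background.
plaque = [
    [(40, 40, 40), (200, 200, 200)],
    [(200, 200, 200), (40, 40, 40)],
]
cleaned = prepare_for_ocr(plaque)
print(cleaned)
```

After these steps the pale "lettering" pixels come out black and the dark background comes out white, which is the black-text-on-white form that OCR programs generally handle best.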
Also, I tried Adobe Acrobat Pro 6.0 on the first two photos. On one of them, the resolution was outside the acceptable range. On the other, for whatever reason, the import step turned the photo into a picture that resembled the original as viewed through glass blocks, so naturally OCR was useless on that.