Character recognition software for image PDFs?

Can anyone recommend some good character recognition software that will work on image PDFs? Free, preferably, :slight_smile: but I might be willing to pay a small amount (like up to about $20) if it bought me markedly more accurate recognition or noticeably more speed. The ability to output to a PDF too would be nice, but is not really necessary. I don’t crave any other bells or whistles.

I do have a copy of ABBYY FineReader 5.0 Sprint Plus, which came with my old AIO printer/scanner, but it seems only to be able to read regular image files and not PDFs (certainly not multi-page ones).

[I keep wanting to say “OCR software,” but, of course, what I want to do does not actually involve anything optical.]

I’m on Windows XP Pro SP3, by the way.

One of the PDF converter products from Nuance might work.
The cheapest one is $49.95, however.

I have the Enterprise version, so I can use the “redaction” feature to remove data from PDFs and not suffer the embarrassment of the Department of Homeland Security, who drew some easy-to-remove boxes on some PDFs and put them out on the web recently.

From their website, it is not at all clear that Nuance can handle image PDFs, as opposed to ones that contain real text. But $50 is beyond my budget anyway.

I am talking about PDFs that have been scanned from a paper document, so that they are really a collection of page images. I know this sort of conversion can be done, because Google does it for certain image PDFs that are on the web. You cannot search the text of an image PDF in Acrobat Reader or any other PDF viewer, but Google can and does search and index within them.

PDF Converter Enterprise will take an image PDF and do OCR on it. I just tested it on a PDF scan, and it did OCR both when I selected the “make searchable” option and also when I exported to a Word document.

It looks like PDF Converter Pro can do the same thing. Not clear about PDF Converter.

They do have a free version that uses a Web-hosted service for conversion. I expect they charge for that, but I could not tell for sure.

Do note that these things tend not to work very well at any resolution lower than 300 dpi, and most images are saved at around 100 dpi.

Acrobat Pro (not the free Reader) will do this, then put out a new PDF with the text on top of the image. It’s not free (or cheap), but you may already have it as part of a suite of Adobe software, or through a site license.

Thank you for thinking to post the OS. I opened the thread to outline a way to do this on Linux; I’ll continue even though it’s probably not gonna help you at all.

I did some freelance work last year OCRing some legal docs and putting them into a database. For Windows, the package that was recommended to me was ReadIris – supposedly very good, but I couldn’t get it to install. On Linux, I ended up using the convert program (part of the ImageMagick package, IIRC) to manipulate the scanned PDF documents, then used tesseract to OCR them. Since I was creating database files, I didn’t need to generate PDFs, but that would’ve been easy enough via any number of tools…though I would’ve had to do something about re-formatting them, now that I think about it.
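In case it helps, here is a minimal sketch of that two-step pipeline in Python, just shelling out to the same tools. The file names, the 300 dpi figure, and the page-numbering scheme are my own choices for illustration; it assumes convert and tesseract are on your PATH:

```python
import glob
import subprocess

PDF = "scan.pdf"  # hypothetical input file

# Step 1: burst the image PDF into one TIFF per page at 300 dpi
# (OCR accuracy drops off sharply below that resolution).
subprocess.check_call([
    "convert", "-density", "300", PDF,
    "-depth", "8",
    "-compress", "none",   # older tesseract builds want uncompressed TIFF
    "page-%03d.tif",
])

# Step 2: OCR each page image with tesseract and collect the text.
# tesseract takes an output *base* name and appends .txt itself.
with open("scan.txt", "w") as out:
    for tif in sorted(glob.glob("page-*.tif")):
        base = tif[:-len(".tif")]
        subprocess.check_call(["tesseract", tif, base])
        with open(base + ".txt") as page:
            out.write(page.read())
```

Not production code, obviously, but the whole job really is just those two commands in a loop.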

Anyway, just thought I’d mention my experience; maybe that IRIS program would be worth looking into.

Wow, from what people have said so far (and the general lack of response), it looks like the only way to do this would be to spend lots of money I do not have. :frowning:

I must say that I am rather shocked that, in this day and age, when all sorts of utilities (including ones for reading, manipulating, and even creating PDFs, and including OCR software), not to mention entire Office suites and operating systems, are available for free, there is nothing like this that does not cost lots of $$$. It is not as though image PDFs are all that rare. For one thing, the older archives on JSTOR are full of them, and I am sure that lots of other impoverished academics, like myself, would love to be able to create searchable versions.

tesseract, as mentioned in the post above yours, is free. As is ImageMagick.

The reason no one else is doing it for free is the same reason a lot of stuff isn’t free: since it is mostly businesses that need it done, developers can make a pretty penny off of it. In the capitalistic world, that higher demand actually justifies higher prices.

Well, according to that post, they are both for Linux only, although your edit line seems to be saying otherwise.

But in any case, neither of them seems to do what I want. I already have, or know where to get, free programs that would enable me to convert an image PDF to text by using several programs in succession. It would, however, necessitate processing each page separately through several steps, and would be very tedious and time-consuming. What I want is simply to load a PDF into a program and have it spit out a text version. Free (and even open-source) software for doing each stage of this is out there, so it should be no trick for some programmer to put the pieces together into a package that would do this, which I am sure many people besides me would find useful.

If it has not been done, I guess it hasn’t. Maybe I am just out of luck. But I find it surprising. (Maybe it is just too trivial a programming problem to interest the sort of people who write code for fun. I don’t know.)

I do not understand your final point about business needs and high demand leading to higher prices (or a lack of free stuff). How do you account for the existence of Open Office, then? Certainly it is not just businesses that need to do this task. I am not a business, and, as I mentioned, I think it is highly likely that there are lots of academics who would like to be able to do it on an occasional basis (and probably lots of other people too)…

Well, I’m a Linux guy. So that’s the OS on which I ran them. According to the links above, both are available for Windows. As I’ve never used them on Windows, I can’t say anything about their use on that platform.

Yes, I did this via a Perl script. I forget now how much of it was non-generic (for instance, I programmatically cropped each page via convert, because the 3-hole-punch artifacts were being read as ‘O’s). There was also a subroutine I used to make ‘manual’ changes, which turned out to be way larger than I expected, and a state/zip code lookup routine, etc., etc. See, it was a one-off project, so I just handled things as they came up. I think it’s pretty organized for spaghetti code, but it’s spaghetti code nonetheless.
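The original was Perl, but the cropping idea looks something like this in Python; the 150-pixel margin and the file names here are invented for illustration, not what the actual script used:

```python
import subprocess

def crop_punch_margin(src, dst, margin=150):
    """Shave a fixed strip off the punched edge of a page image,
    so the hole-punch circles don't get OCRed as letter 'O's.
    The 150-pixel default is a made-up figure for illustration."""
    subprocess.check_call([
        "convert", src,
        "-gravity", "West",            # anchor the chop at the left edge
        "-chop", "%dx0" % margin,      # remove a margin-wide vertical strip
        dst,
    ])

crop_punch_margin("page-000.tif", "page-000-cropped.tif")
```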

Well, open-source/free software is generally written by people with an itch to scratch. Like the script I mention. You’re welcome to it if you think it would be helpful (I think you’d be better off finding a turnkey solution – that IRIS software I mentioned had a free 30-day trial). PM me if you want more direct contact.

I think Snag It might grab text from a PDF image.

I use Snag It a lot to capture text from web sites that don’t allow text to be selected/copied with the mouse.

I’d suggest installing the free trial version. Set the Capture Mode to text (instead of Image). It’s very easy to use.

There’s a video demo of text capture here.

I just did a quick search (“pdf character recognition”, I think) on download.com and saw this one, but haven’t checked it out…

http://download.cnet.com/Docsmartz-PDF-Converter-Pro/3000-2079_4-10290777.html

My question is why people are creating scanned/image-based PDFs of documents that were created recently, obviously with a word processor, such that they are not searchable. And wouldn’t a text-based PDF be a lot smaller than storing it all as graphics? If it’s something old with no word-processor file available, I can understand the need to scan, but not with modern-day docs.

I would guess that the person making the PDF doesn’t have access to the original file.

Or they’re worried about people plagiarizing significant portions of the book, so they want to disable copy and paste. That’s one way to do it.