I have a PDF that contains a mix of text and text images. I need to be able to search for bits of text that are images, but obviously can’t.
I don’t know how it was created, but it started out as all text so it’s as clean as possible (i.e. it’s not like a captcha or anything). I tried using Adobe’s (I have Acrobat Pro 9) OCR on it, but for many pages I get the “can’t OCR when there is readable text on a page” error.
Is there an easy way around this? I’m tempted to print and scan the damn thing but that’s a lot of work. Would one of the free Adobe readers work? Is there some other function besides OCR that I’m overlooking? A setting?
It’s a public document so I can upload it somewhere if it’ll help see what I’m talking about.
Acrobat has always worked well for me in OCRing text, even imperfect scans of books in archaic fonts. What precisely is causing the error you’re seeing - what type or combination of material?
Do a search for freeware OCR programs. I’m sure there are at least a few open source ones available.
The commercial ones, like OmniPage from Nuance, will give you the option of retaining the original format with images and embedding the OCR’ed text to make a searchable PDF.
I don’t think so, but I’m not entirely sure what you’re dealing with.
“Text image” is rather vague, but I believe you mean a picture that has text in it, let’s say a pic of a church sign that says a phrase or what the sermon will be.
Just because you can read it, doesn’t mean a computer can, because it’s not technically ‘text’. It’s letters that have been converted to many pixels.
There are too many variables to give a definitive answer.
This is text that was horribly converted to a PDF. As in text-text, not pictures of text and not a scanned document of text. It would have started as either a Word or WordPerfect document (or some other word processor). Earlier versions do not have this problem and are entirely searchable. It’s not intentional or some security measure (among other things there’s a list of preferred spellings - why would that not be searchable?). I’ve contacted the communications office but have not heard back.
I put the file on WikiSend if looking at it will help.
Off to look at PDF converters and free OCR programs…
If you have a mix of renderable text and images with text in them, Acrobat won’t do OCR, as you’ve learned. Rather than print and scan, save the PDF as a TIFF, then re-import it into Acrobat. That will convert the text to images as well, and you can then run your OCR on the whole thing.
Try using the Acrobat option to remove hidden information. That should get rid of the old OCR information and then allow you to tell Acrobat to re-OCR the whole thing.
Yeah, that is horrible. The system used to produce the PDF (probably some sort of printer driver conversion) has turned most of the text into glyphs compressed into streams - probably due to the word processor kerning the characters individually (yes, I have seen PostScript where that has happened - it was Word[im]perfect on VMS, but it was 20 years ago) or a font issue (some of the bold text is selectable, all the normal text seems to be nonselectable glyphs).
Render to bitmap and OCR is your only chance, if you cannot get a better version of the PDF.
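For anyone who would rather script the render-and-OCR route than click through Acrobat, here is a rough sketch. It assumes the Poppler `pdftoppm` tool and the Tesseract command-line program are installed; neither is mentioned above, they just stand in for Acrobat's own rendering and OCR. The function names, file names and the 300 dpi default are made up for illustration. The code only builds the command lines, it does not run them.

```python
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Command that rasterizes every page of the PDF to PNG files.

    pdftoppm writes one file per page, named <out_prefix>-<page number>.png
    (the page number may be zero-padded for longer documents).
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Command that OCRs one page image into a searchable one-page PDF.

    Tesseract appends ".pdf" to the output base, so page-1.png -> page-1.pdf.
    """
    image = Path(image_path)
    out_base = image.with_suffix("")  # same name, extension dropped
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(render_command("styleguide.pdf", "page")))
    print(" ".join(ocr_command("page-1.png")))
```

Each list can be handed to `subprocess.run`. The per-page PDFs that Tesseract writes would still need to be merged back into one file, which is left out here.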
Ah. That’s nasty.
I hadn’t looked at the doc because I didn’t want to be downloading dodgy PDFs at work.
It sounds like **si_blakely** has it. The only solution in this case is to run OCR.
Some years ago I was trying to text mine tens of thousands of scientific documents produced by a version of Adobe Distiller that generated perfectly valid PDF with that same kind of one-way mapping from displayed characters to font elements.
The hallmark of this problem is that you can copy and paste text from the document, but it comes across as garbage.
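That copy-and-paste check can be roughly automated once you have the extracted text as a string (from whatever extraction tool you use). The sketch below is only a crude heuristic of my own, not what we ran back then: it assumes English text and treats a token as plausible if it is made of letters and contains a vowel. The function name and the 0.5 threshold are arbitrary choices.

```python
import re

# A "word-like" token: letters, optional inner apostrophe/hyphen,
# optional trailing punctuation mark.
WORDLIKE = re.compile(r"^[A-Za-z]+(?:['’-][A-Za-z]+)*[.,;:!?]?$")
VOWELS = set("aeiouyAEIOUY")


def looks_like_garbage(text, threshold=0.5):
    """Guess whether extracted PDF text is unusable.

    Returns True when fewer than `threshold` of the whitespace-separated
    tokens look like English words. Assumption: the document is English prose.
    """
    tokens = text.split()
    if not tokens:
        return True  # nothing extractable at all
    plausible = sum(
        1 for t in tokens if WORDLIKE.match(t) and VOWELS & set(t)
    )
    return plausible / len(tokens) < threshold
```

Pages that come back flagged are the candidates for rendering to bitmap and OCR. Pages full of numbers or tables will also trip it, so treat the result as a hint.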
The short version: We ended up licensing the ABBYY FineReader OCR toolkit for Linux and scripting the OCR process on all 10,000 document pages.
Imperfect, but it did the job.
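To give an idea of what scripting the OCR over a pile of files can look like, here is a minimal sketch. It is not the ABBYY setup described above (I am not reproducing that toolkit's interface); it swaps in the open source `ocrmypdf` command and assumes it is installed. Its `--force-ocr` option rasterizes each page and OCRs it even when the page already contains text, which is the situation in this thread. The directory layout and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_batch(src_dir, dst_dir, lang="eng"):
    """Return one ocrmypdf command per PDF found directly in src_dir.

    PDFs that already have an output file in dst_dir are skipped, so the
    script can be re-run after an interruption.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    commands = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / pdf.name
        if out.exists():
            continue  # already processed on an earlier run
        commands.append(
            ["ocrmypdf", "--force-ocr", "-l", lang, str(pdf), str(out)]
        )
    return commands


def run_batch(commands):
    """Run each command in turn; return the ones that exited non-zero.

    Assumes the ocrmypdf executable is on PATH; subprocess.run raises
    FileNotFoundError if it is not.
    """
    failed = []
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd)
    return failed
```

Typical use would be `run_batch(plan_batch("in", "out"))`, then checking the returned list for files that need a second look.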
Thanks. This file is beyond obnoxious—more so because it’s a freakin’ style guide. You want preferred spelling but you’re not going to make the list searchable? :rolleyes:
The conversion thing worked (thanks!) but it’s ugly on the eyes–it’s clearly an image with artefacts here and there, a slightly choppy font and noise around each letter. Bah.
I have a colleague who has worked with PDF documents for many years - it’s essential to his line of work and communication methods. Despite many patient discussions, he does not appear to understand that you can PRINT to PDF; he is locked into the method of printing the documents out and SCANNING them into PDF. I can’t get across to him that this is about like building cars by bending all the metal with your bare fingers. He is far from unintelligent, just… locked. I suspect a lot of people use PDF this badly and that’s the reason so many people dislike it.
The problem actually can be Print to PDF - when this happens, the output is wholly dependent on the interaction of the source app, the Windows GDI renderer and the printer driver (with a pile of font-based madness thrown in).
In the worst case, the output will consist of individually pathed glyphs placed in random order on the page, with no logical context. In the best cases, the printer driver/app loads font tables and kerning rules into the PostScript, then the words are placed in order on the page and rendered correctly into a searchable text PDF. And at the bottom of the PDF, a pretty animated pig will fly across the footer. Usually the result is somewhere in between. This document is on the bad end of the scale.
Sad to say, the best results are generated from proper PDF tools like Acrobat, as they read the source document and preserve the contextual data implicit in the source. Of course, it is the most expensive option, and it is no wonder some people resort to PDF print drivers. LibreOffice/OpenOffice has a pretty good PDF export, with good context preservation, and supports a good (but not perfect) range of import formats.
Why don’t you write the UNDP and ask for a better version? Their 2008 and 2002 editions were in regular, real-text PDFs… so somebody in the office knew how to make them at some point.