Making a PDF with imaged text searchable

Rhythmdvl · June 13, 2013, 7:11pm

I have a PDF that contains a mix of text and text images. I need to be able to search for bits of text that are images, but obviously can’t.

I don’t know how it was created, but it started out as all text so it’s as clean as possible (i.e. it’s not like a captcha or anything). I tried using Adobe’s (I have Acrobat Pro 9) OCR on it, but for many pages I get the “can’t OCR when there is readable text on a page” error.

Is there an easy way around this? I’m tempted to print and scan the damn thing but that’s a lot of work. Would one of the free Adobe readers work? Is there some other function besides OCR that I’m overlooking? A setting?

It’s a public document so I can upload it somewhere if it’ll help see what I’m talking about.

Whack-a-Mole · June 13, 2013, 8:00pm

You might give Google Drive a try: Convert PDF and photo files to text - Computer - Google Drive Help

No idea if it will work but worth a shot (note it says it only scans the first 10 pages of a PDF so if longer you may need to break it up).
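If the 10-page limit applies, the splitting can be planned up front. This is a minimal sketch that only works out the page ranges. The function name `page_chunks` and the default chunk size are illustrative assumptions, and the actual splitting would still need a PDF tool of your choice.

```python
def page_chunks(total_pages, chunk_size=10):
    """Return inclusive, 1-based (first, last) page ranges covering the document.

    chunk_size defaults to 10 only because of the limit mentioned above;
    it is an assumption, not a documented constant.
    """
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # Print the ranges for a hypothetical 37-page document.
    for first, last in page_chunks(37):
        print(f"{first}-{last}")
```

Each range can then be exported as its own file and uploaded separately.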

Amateur_Barbarian · June 13, 2013, 8:03pm

Acrobat has always worked well for me in OCRing text, even imperfect scans of books in archaic fonts. What precisely is causing the error you’re seeing - what type or combination of material?

deltasigma · June 13, 2013, 8:03pm

Do a search for freeware OCR programs. I’m sure there are at least a few open source ones available.

The commercial ones like OmniPage from Nuance will give you the option of retaining the original format with images and implanting the OCR’ed text to make a searchable PDF.

HipGnosis · June 13, 2013, 8:03pm

I don’t think so, but I’m not entirely sure what you’re dealing with.
“Text image” is rather vague, but I believe you mean a picture that has text in it, let’s say a pic of a church sign that says a phrase or what the sermon will be.
Just because you can read it, doesn’t mean a computer can, because it’s not technically ‘text’. It’s letters that have been converted to many pixels.
There are too many variables to give a definitive answer.

deltasigma · June 13, 2013, 8:22pm

OCR programs are quite good at recognizing text; after all, OCR stands for optical character recognition.

Rhythmdvl · June 13, 2013, 8:29pm

Here’s the exact error message:

This is text that was horribly converted to a PDF. As in text-text, not pictures of text and not a scanned document of text. It would have started as either a Word or WordPerfect document (or some other word processor). Earlier versions do not have this problem and are entirely searchable. It’s not intentional or some security measure (among other things there’s a list of preferred spellings–why would that not be searchable?). I’ve contacted the communications office but have not heard back.

I put the file on WikiSend if looking at it will help.

Off to look at PDF converters and free OCR programs…

deltasigma · June 13, 2013, 8:50pm

Have no idea what any of that means and don’t have an acct on WikiSend. Lo-tech solution is print, scan, OCR.

Twoflower · June 13, 2013, 8:54pm

If you have a mix of renderable text and images with text in them, Acrobat won’t do OCR, as you’ve learned. Rather than print and scan, save the PDF as a TIFF, then re-import it into Acrobat. That will convert the text to images as well, and you can then run your OCR on the whole thing.

minor7flat5 · June 13, 2013, 8:56pm

Try using the Acrobat option to remove hidden information. That should get rid of the old OCR information and then allow you to tell Acrobat to re-OCR the whole thing.

Reply · June 14, 2013, 2:57am

You could try to export the PDF to images and then make another PDF out of the images. Or use the XPS image printer that comes with Office.

si_blakely · June 14, 2013, 11:06am

Yeah, that is horrible. The system used to produce the PDF (probably some sort of printer driver conversion) has turned most of the text into glyphs compressed into streams - probably due to the word processor kerning the characters individually (yes, I have seen PostScript where that has happened - it was Word[im]perfect on VMS, but it was 20 years ago) or a font issue (some of the bold text is selectable, all the normal text seems to be nonselectable glyphs).

Render to bitmap and OCR is your only chance, if you cannot get a better version of the PDF.
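For anyone wanting to script the render-then-OCR route, here is a rough sketch. It assumes Poppler's `pdftoppm` and a Tesseract build with PDF output are installed and on the PATH; neither tool is mentioned in this thread, and the helper names (`render_command`, `ocr_command`, `ocr_pdf`) are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that writes one PNG per page as <out_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, out_base, lang="eng"):
    """Build the tesseract call; the trailing 'pdf' asks for <out_base>.pdf,
    i.e. the page image with a hidden, searchable text layer."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "pdf"]


def ocr_pdf(pdf_path, work_dir):
    """Render every page to a bitmap, OCR each one, return the per-page PDFs."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    subprocess.run(render_command(pdf_path, work_dir / "page"), check=True)

    outputs = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")  # page-01.png -> page-01
        subprocess.run(ocr_command(image, out_base), check=True)
        outputs.append(out_base.with_suffix(".pdf"))
    return outputs
```

A higher `dpi` usually helps recognition at the cost of larger files. The per-page PDFs would still need merging with whatever PDF tool is at hand.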

minor7flat5 · June 14, 2013, 1:44pm

Ah. That’s nasty.
I hadn’t looked at the doc because I didn’t want to be downloading dodgy PDFs at work.

It sounds like **si_blakely** has it. The only solution in this case is to run OCR.

Some years ago I was dealing with trying to text mine tens of thousands of scientific documents that had been generated by a version of Adobe Distiller that generated perfectly valid PDF with that same kind of one-way mapping from displayed characters to font elements.

The hallmark of this problem is that you can copy and paste text from the document, but it comes across as garbage.
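That hallmark can be checked automatically once the text has been copied or extracted by some other means. The following is a crude heuristic sketch, not a method anyone in this thread used: it assumes English text, and the word list, the threshold, and the name `looks_garbled` are all illustrative assumptions.

```python
import re

# A handful of very frequent English words; real prose is full of them,
# while text from a broken character mapping almost never contains any.
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "for", "it",
    "on", "with", "as", "be", "are", "this", "by", "or",
}


def looks_garbled(text, min_common_ratio=0.15):
    """Return True when too few alphabetic tokens are common English words.

    min_common_ratio is an arbitrary starting point and should be tuned
    on known-good and known-bad samples.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return True  # nothing but symbols or control characters
    common = sum(1 for word in words if word in COMMON_WORDS)
    return common / len(words) < min_common_ratio


if __name__ == "__main__":
    print(looks_garbled("You can copy and paste text from the document."))
    print(looks_garbled("Brx fdq frsb dqg sdvwh whaw iurp wkh grfxphqw."))
```

Pages flagged this way are the ones worth sending through OCR; very short snippets will give unreliable answers.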

(here is my original writeup of the struggle)

The short version: We ended up licensing the ABBYY FineReader OCR toolkit for Linux and scripting the OCR process on all 10,000 document pages.
Imperfect, but it did the job.

Rhythmdvl · June 14, 2013, 1:57pm

Thanks. This file is beyond obnoxious—more so because it’s a freakin’ style guide. You want preferred spelling but you’re not going to make the list searchable? :rolleyes:

The conversion thing worked (thanks!) but it’s ugly on the eyes–it’s clearly an image with artefacts here and there, a slightly choppy font and noise around each letter. Bah.

Amateur_Barbarian · June 14, 2013, 2:18pm

I have a colleague who has worked with PDF documents for many years - it’s essential to his line of work and communication methods. Despite many patient discussions, he does not appear to understand that you can PRINT to PDF; he is locked into the method of printing the documents out and SCANNING them into PDF. I can’t get across to him that this is about like building cars by bending all the metal with your bare fingers. He is far from unintelligent, just… locked. I suspect a lot of people use PDF this badly and that’s the reason so many people dislike it.

si_blakely · June 14, 2013, 2:39pm

The problem actually can be Print to PDF - when this happens, the output is wholly dependent on the interaction of the source app, the Windows GDI renderer and the printer driver (with a pile of font based madness thrown in).

In the worst case, the output will consist of individually pathed glyphs placed in random order on the page, with no logical context. In the best cases, the printer driver/app loads font tables and kerning rules into the PostScript, then the words are placed in order on the page and rendered correctly into a searchable text PDF. And at the bottom of the PDF, a pretty animated pig will fly across the footer. Usually the result is somewhere in between. This document is on the bad end of the scale.
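To make the "random order, no logical context" case concrete, here is a toy sketch of putting glyphs back into reading order purely from their positions. It assumes each glyph's character is already known along with its coordinates, which is exactly what is missing when the font mapping is broken. The `Glyph` type, `reading_order`, and both tolerances are hypothetical.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # left edge of the glyph
    y: float  # baseline, measured downward from the top of the page (assumption)


def reading_order(glyphs, line_tolerance=2.0, space_gap=4.0) -> List[str]:
    """Group glyphs into lines by baseline, then sort each line left to right.

    A jump of more than space_gap between left edges is treated as a word break.
    """
    lines = []
    for glyph in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            if cur.x - prev.x > space_gap:
                text += " "
            text += cur.char
        result.append(text)
    return result


if __name__ == "__main__":
    scrambled = [
        Glyph("o", 13, 10), Glyph("k", 3, 30), Glyph("h", 0, 10),
        Glyph("y", 10, 10), Glyph("o", 0, 30), Glyph("i", 3, 10),
    ]
    print(reading_order(scrambled))
```

Real pages with columns, rotated text, or variable glyph widths would defeat something this simple, which is part of why OCR on a rendered bitmap ends up being the practical route.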

Sad to say, the best results are generated from proper PDF tools like Acrobat, as they read the source document and preserve the contextual data implicit in the source. Of course, it is the most expensive option, and it is no wonder some people resort to PDF print drivers. LibreOffice/OpenOffice has a pretty good PDF export, with good context preservation, and supports a good (but not perfect) range of import formats.

Rhythmdvl · June 14, 2013, 2:40pm

I… I… words fail.

What does he say when you show him the option on the print menu?

Reply · June 14, 2013, 7:12pm

Why don’t you write the UNDP and ask for a better version? Their 2008 and 2002 editions were in regular, real-text PDFs… so somebody in the office knew how to make them at some point.

deltasigma · June 14, 2013, 7:57pm

Do you have access to OmniPage Pro? That will preserve the image and make it searchable.

edit: It’s one option anyway.
