Adobe Word Search Error???

I did a word search in an Adobe (pdf) document. The results made me suspicious, so I scouted about and found an occurrence it missed. However, the occurrence was across a page break.

Is this a major limitation of text searching within an Adobe file? Is the page break like an embedded character between my (otherwise) adjacent words?

Does Adobe know about this…? :confused:

  • Jinx

I’ve seen similar cases where Adobe outright reported no instance of a string, yet the string was right there in the middle of a sentence I was reading.

Since I don’t have any PDF authoring tools, I always assumed that the author had to preprocess the document somehow in order to make it “searchable”.

Perhaps a PDF author will chime in.

I’m no Adobe expert, but I do have Adobe Pro on my computer.

If my understanding is correct, when a pdf is made from a file such as a Word document, it is searchable by default, and it should be possible to find all the words in the document by searching.

Where this sometimes falls down is if the converted document was originally a jpeg or a scanned image or some other type of non-text file. If you create a document from such files, and you want to make it searchable, you need to use Adobe’s “Paper Capture” function, which “looks” at the image and applies optical character recognition (OCR) technology to identify the characters in the file. It then creates a searchable text file which is invisible to the user, and sits “behind” the bitmap image that the user sees. When you search for a word or phrase, Adobe searches the text file and highlights the appropriate section/s, and the user sees this highlighting on the main bitmap image.
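A rough way to picture that hidden text layer, as a toy Python sketch: each recognized word is kept together with its position on the page image, the search looks at the text, and the position tells the viewer where to draw the highlight. The names and coordinates below are made up for illustration; this is not how Acrobat actually stores things internally.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str    # what the OCR step thinks the word is
    box: tuple   # (x0, y0, x1, y1) on the page image, hypothetical units


def find_highlights(words, query):
    """Return the boxes of words that match the query, ignoring case
    and any punctuation stuck to the end of the word."""
    q = query.lower()
    return [w.box for w in words if w.text.lower().strip(".,;:!?") == q]


# A made-up text layer for one line of a scanned page.
layer = [
    OcrWord("The", (10, 10, 40, 25)),
    OcrWord("wider", (45, 10, 95, 25)),
    OcrWord("public.", (100, 10, 160, 25)),
]

print(find_highlights(layer, "public"))  # box to highlight on the bitmap
```

If the OCR step misreads a word, the text layer holds the wrong string and the search has nothing to match, even though the word is plainly visible in the image.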

The problem with technology such as OCR is that it is not infallible, and it also relies heavily on the quality of the original image. Acrobat requires that scanned images be at least 200 dpi resolution, and 300+ is usually preferable. If the book or other imported scan/image is of poor quality, or if there are gaps or smudges in the original, then this can fool the OCR technology and result in not all the characters being recognized and included in the searchable text.

As for the issue of page breaks, it does seem that a page break will prevent Acrobat from finding a search string. And this seems to be the case whether the Acrobat file is created from a text document like MS Word, or from a scan/jpeg etc. I tried to find cross-page strings in a bunch of documents, and couldn’t.
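One way to picture what seems to be happening, as a toy Python sketch: if the search runs over each page’s text separately, a phrase split across two pages never shows up in any single page’s text. The page contents below are made up, and this is only a model of the behaviour I observed, not Acrobat’s actual search code.

```python
def search_per_page(pages, phrase):
    """Return the 1-based numbers of pages whose own text contains the phrase."""
    p = phrase.lower()
    return [n for n, text in enumerate(pages, start=1) if p in text.lower()]


# Hypothetical extracted text: "wider public" is split across pages 1 and 2.
pages = [
    "The report was released to the wider",
    "public in the spring. The public response was mixed.",
]

print(search_per_page(pages, "public"))        # single word: found on page 2
print(search_per_page(pages, "wider public"))  # split phrase: no page contains it
```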

As an experiment, I also tried to scan for a single word in one of my documents (the word “public”), knowing that on one occasion the word was the very last word on a page. In all other instances, the search results showed the word “public” along with the words that followed it; but in the instance where it was the last word on the page, the search results showed “…public.” as the last piece of the search result, with none of the following words.

If it’s not quite clear what I mean by this, you can see a screenshot of the search results here. The first instance, where it says “wider public,” is the one where “public” was the last word on the page. In all the others, you can see the following words. This suggests that Adobe sees the end of a page as a break, across which it cannot/will not search. Now, as I said at the outset, I’m no Acrobat expert, and it’s possible that there’s a way to circumvent this problem, but I don’t know what it is.
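For what it’s worth, if the text of each page can be pulled out of the PDF with some other tool (that part is an assumption; I’m not naming one here), one possible workaround is to join the pages into one long string, search that, and then work out which page each hit starts on. A minimal Python sketch with made-up page text:

```python
import re
from bisect import bisect_right


def search_across_pages(pages, phrase):
    """Search the joined text of all pages and return, for each hit,
    the 1-based number of the page on which the hit starts."""
    # Record the offset at which each page begins in the joined text.
    starts, pos = [], 0
    for text in pages:
        starts.append(pos)
        pos += len(text) + 1  # +1 for the space used to join the pages
    joined = " ".join(pages)

    # Allow any run of whitespace between the words of the phrase.
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    return [
        bisect_right(starts, m.start())
        for m in re.finditer(pattern, joined, re.IGNORECASE)
    ]


# Hypothetical extracted text: "wider public" is split across pages 1 and 2.
pages = [
    "The report was released to the wider",
    "public in the spring.",
]

print(search_across_pages(pages, "wider public"))  # hit starts on page 1
print(search_across_pages(pages, "public"))        # hit starts on page 2
```

Joining the pages with a single space is itself an assumption. A word hyphenated across the page break would still be missed, and so would a phrase interrupted by a running header or footer.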