Extracting text from a PDF file

Some people in this thread are missing something important: if the PDF is an image, the only way to convert it to text is OCR.

You really need to understand the difference between text and graphics.
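If you want to check which kind of PDF you have before reaching for OCR, here is a minimal sketch. The function name `looks_scanned` and the `min_chars` threshold are my own assumptions, not anything from a particular tool: it just looks at the text a PDF library extracted from each page and guesses "image only" when there is essentially nothing there.

```python
def looks_scanned(page_texts, min_chars=20):
    """Guess whether a PDF is image-only from the text extracted per page.

    page_texts: iterable of strings, one per page (None counts as empty).
    min_chars:  assumed threshold; a page with at least this many
                non-whitespace characters counts as having real text.
    Returns True when no page reaches the threshold.
    """
    for text in page_texts:
        # Drop all whitespace so blank lines and spaces do not count as text.
        visible = "".join((text or "").split())
        if len(visible) >= min_chars:
            return False
    return True


# Possible usage with the third-party pypdf package (assumed installed):
#   from pypdf import PdfReader
#   reader = PdfReader("scan.pdf")  # hypothetical file name
#   texts = [page.extract_text() for page in reader.pages]
#   print("needs OCR" if looks_scanned(texts) else "has a text layer")
```

This is only a heuristic. A scan that already went through OCR will have a text layer and come back as "not scanned", even if that text is garbage.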

Now that I’m home and in front of my Mac…

If the “full version of Acrobat” you speak of is the product called “Adobe Acrobat Professional” then do the following:

  1. Make a copy of the file to work on
  2. Go to Document / **OCR Text Recognition** / Recognize Text Using OCR…
  3. Check All Pages and click OK, assuming the settings are good

If the result is still munged, try this:

  1. Go to Document / Examine Document…
  2. Check all of the checkboxes you can
  3. Click Remove All Checked Items
  4. Repeat steps 1-3

Hope it works out for you!

If you have Microsoft OneNote, you can choose “Print to OneNote” as a printing option in Adobe, and it will do the OCR for you. After that, you just need to copy and paste the text wherever you want it.

I did it just earlier today with a PDF that had no searchable text in it.

If it’s letting you select text, then it’s text, not a bitmap. I’ve had this problem, too, with old PDFs. The usual solution for me is to take them home and do a copy and paste on my Mac. Problem solved. Of course this only works for l33t people (like myself) who have used Macs since 1986, or have since become converts. So…

When I have problems like this (and not just with PDFs), I set up a text printer and print to file. Add a new printer, select Generic as the manufacturer, and, using common sense, pick whatever the equivalent of a text or line printer is (there are only a few choices). Now when you print, make sure you select “Print to file” in the printer dialog, specify a name when asked, and there you go: a perfectly good text file with the .prn extension. Import it into Excel, open it in your favorite text editor, whatever. You have pure text.
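One caveat: the .prn file may carry printer leftovers such as form feeds at page breaks and Windows line endings. Here is a rough cleanup sketch. The function name `clean_prn` and the cp1252 encoding are assumptions on my part; the right encoding depends on your system and driver.

```python
import re

# Control characters other than tab (\x09) and newline (\x0a).
_CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def clean_prn(raw, encoding="cp1252"):
    """Turn the raw bytes of a text-printer .prn file into plain text.

    encoding is an assumption; undecodable bytes are replaced, not fatal.
    """
    text = raw.decode(encoding, errors="replace")
    # Normalize Windows and bare carriage-return line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A form feed marks a page break; treat it as a line break.
    text = text.replace("\x0c", "\n")
    # Strip any remaining control characters (escape codes and the like).
    return _CONTROL.sub("", text)


# Possible usage (file names are hypothetical):
#   with open("output.prn", "rb") as f:
#       cleaned = clean_prn(f.read())
#   with open("output.txt", "w", encoding="utf-8") as f:
#       f.write(cleaned)
```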

Bummer. Okay, here is a page describing OpenOffice 3.0’s PDF reading capability. You can download it from OpenOffice.org. I haven’t tried it yet, and it is a beta, but it might be a solution. I hear the download is pretty big, btw.

I do my ps stuff in Solaris. I’ve never tried to play with it under Windows.