Copying and editing from a scanned PDF?

Work problem here. I’ve got an approximately 160-page document that was scanned as a PDF. I need to update it and edit the format. Is there any way I can copy the text into a Word or text file so I don’t have to have the entire thing retyped?
It’s a good quality scan, so it seems to me that there should be some way to select and copy the text, and then paste it into a document.


::whimpers at the thought of retyping the whole thing::

Acrobat’s a read-only file type. If you own the full version (not Reader), you can make limited text edits and such, but I’ve never heard of a way (bar jumping through hoops with cracking software and such) of copying and pasting the text from a PDF into an editable format - that’s not what PDF is intended for, so apparently they decided it was unnecessary.

Hopefully someone will prove me wrong, but that’s been my experience in three years working with the damn things.

When you say “scanned”, I am taking that literally as “optically read by a device, then rendered as a raster image (e.g. JPEG, TIFF, BMP, PNG, etc.).”

The answer is no – you don’t have text in the PDF, you have a picture of grouped letters. There’s no way to copy and paste.

There is software available that can interpret a scan of a printed page and render semi-useful editable text (which usually needs heavy editing).

Can you trace the original 160-page doc upstream to see if anyone (maybe the author) has a digital version? I know this is not always an option, but it’s worth checking out.

PDFs are intended to be un-copy-and-paste-able. They are intended to be “secure”, un-editable documents so you can give them to your customer and the customers cannot foobie them up.
Either take **bordelond’s** advice to find the author or hire a temp. You have my deepest sympathies.

Try this program out. It has a 15 day trial that may do the trick (assuming it’s not crippleware).

It claims to be able to do OCR from PDF files, which is exactly what you’re looking for as far as I can tell.

Disclaimer - I don’t know this software from a hole in the ground. I just googled for PDF capable OCR and this came up. YMMV etc.

To do this, you want to research Optical Character Recognition (OCR) software. There are a lot of products that do this, but OmniPage is one of the better ones out there. As I recall, it can take scans you have already made and stored on your PC and convert them to text, but you have to have them saved in some picture format (TIFF, as I recall).

The first thing you should do is try to get the original document the PDF was made from. Then ignore the PDF and edit that file.

If the text is in the PDF as text, use the Text Select tool to select it. The icon for the Text Select tool is a capital T. That should work in Reader as well as the full version.

However, if the original was really scanned to make a PDF, what you’ve got is a big graphic. It isn’t text, it’s a picture of some text. In that case, your only hope is to OCR the picture. This will probably involve using the Graphic Select tool (next to the Text Select tool) to select each full-page graphic, then copy-paste it into a program that can do Optical Character Recognition. If the original scans were good enough, and they didn’t JPEG-compress them too much when they made the PDF, you may get the text back.

If neither of these work, then you are definitely going to have to retype.

I wouldn’t hold much hope of getting the layout of the page right with either of these methods. You’ll just have to start over on that part.
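
For what it's worth, here is a rough way to guess which of the two cases above you have before opening anything. It is only a sketch: the function name `guess_pdf_kind` is made up for illustration, and it assumes the PDF's object dictionaries are stored uncompressed, which many newer files do not do. A file with an OCR text layer on top of a scan will also report "text". Treat the answer as a hint, not a verdict.

```python
import re


def guess_pdf_kind(data: bytes) -> str:
    """Crude guess at what a PDF holds: 'text', 'scanned', or 'unknown'.

    Assumption: the raw bytes expose the /Font and /Subtype /Image
    dictionary keys. Compressed object streams will hide them.
    """
    # Pages that draw real text reference at least one font resource.
    has_font = re.search(rb"/Font\b", data) is not None
    # Scanned pages are stored as image XObjects.
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None
    if has_font:
        return "text"
    if has_image:
        return "scanned"
    return "unknown"
```

You would call it as `guess_pdf_kind(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name. If it says "scanned", the Text Select tool probably has nothing to select and OCR is the route to take.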

Something to be made clear is that there is a big difference between:

  1. A scan of a printed page made into (say) a JPEG, and

  2. A Word document version of that same page

The first is a raw collection of dark and light pixels. In no way are individual alphanumeric characters encoded.

The Word doc does encode the characters, as well as font info, formatting, page size, and a host of other specifics.
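
To make the pixels-versus-characters point concrete, here is a toy sketch. The three-letter "font", `render`, and `recognize` are all invented for illustration. Real OCR software is far more sophisticated than this exact-match lookup, but the idea is the same: the scan holds only dark and light dots, and something has to guess the letters back out of them.

```python
# Tiny 3x5 "font": each glyph is a grid of dark (#) and light (.) pixels.
GLYPHS = {
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "I": ("###", ".#.", ".#.", ".#.", "###"),
}


def render(text):
    """'Scan' the text: return rows of pixels, with no character codes."""
    return ["".join(GLYPHS[ch][row] for ch in text) for row in range(5)]


def recognize(pixels):
    """Toy OCR: cut the image into 3-pixel-wide cells and match each one
    against the known glyphs. Shapes that match nothing become '?'."""
    lookup = {rows: ch for ch, rows in GLYPHS.items()}
    width = len(pixels[0])
    out = []
    for x in range(0, width, 3):
        cell = tuple(row[x:x + 3] for row in pixels)
        out.append(lookup.get(cell, "?"))
    return "".join(out)
```

`recognize(render("TIL"))` gives back `"TIL"`, but change a single pixel in the image and that letter no longer matches. That is why even a good scan usually needs proofreading after OCR.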

Not quite the case. PDFs can certainly be secured against copy-paste, if that’s what the originator wants, but the format can also be freely copied from if the originator allows it. That’s why there are selection tools in Reader.

Saltire is correct. That’s why I made the opening caveat in my first post to this thread – my understanding is that copying and pasting text is absolutely not an option at this point.

If it is a bunch of page scans in a PDF, then you need to extract the images (the full version of Acrobat is easiest, but Illustrator can do it too, one page at a time) and then run them through OCR software. What will come out is a Word file that should be mostly correct as far as the text content goes, but the formatting might need help. How big is the file now?
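
If you would rather script the extract-then-OCR steps than click through 160 pages, something like the following might work. Note the assumptions: it swaps in two free command-line tools that nobody in this thread has mentioned, Poppler's `pdftoppm` and Tesseract, in place of Acrobat and a commercial OCR package. Both must already be installed. The function names and the `ocr_out` folder are made up for illustration, and the result is plain text, not a Word file.

```python
import shutil
import subprocess
from pathlib import Path


def render_cmd(pdf, prefix, dpi=300):
    """Command to render each PDF page to a PNG with Poppler's pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, out_base, lang="eng"):
    """Command to OCR one image with Tesseract; writes out_base + '.txt'."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pdf(pdf, workdir="ocr_out"):
    """Render every page, OCR each one, and return the combined text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    subprocess.run(render_cmd(pdf, work / "page"), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work.glob("page-*.png")):
        base = image.with_suffix("")
        subprocess.run(ocr_cmd(image, base), check=True)
        texts.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` (a placeholder filename) would give you one long string to paste into a document. As noted above, expect to redo the formatting by hand.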

...or someone who has the software already might offer to do it for you, since a clean scan shouldn’t give the OCR software many errors. But they wouldn’t be able to do that if your email was disabled, would they?