Extracting text from a PDF file

In theory, if you have the full version of Adobe Acrobat, you can select text, copy it, and paste it elsewhere. Normally, this works pretty well. I currently have a PDF that was created back around 1999 and I can’t extract the text. What I get is this:

e,)+? :@
:@
< :@ @ Aa;!

This isn’t very useful. Does anyone have any idea how to extract useful data from an ancient PDF?

If all else fails, you could OCR it.

Modern OCR works pretty well but you need access to a good scanner. I had a hard copy of some family history that I fed into a heavy duty copier at work and it read, converted, and e-mailed me a new version of 20 pages within 5 minutes with near 100% accuracy. Home scanners may take longer.

Ghostscript, Ghostview and GSview might be worth trying. http://pages.cs.wisc.edu/~ghost/

Why on earth would you need a scanner when you can print the PDF to an image (TIFF) file directly?
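For what it's worth, here is a rough sketch of one way to go from PDF to TIFF without touching paper, assuming Ghostscript (mentioned above) is installed. The function names, the output file pattern and the 300 dpi default are my own choices for illustration, and the executable is called gs on Linux/macOS but usually gswin32c or gswin64c on Windows.

```python
import subprocess


def ghostscript_tiff_command(pdf_path, out_pattern="page_%03d.tif", dpi=300):
    """Build a Ghostscript command that renders each PDF page to its own TIFF."""
    return [
        "gs",                            # assumed executable name; differs on Windows
        "-dNOPAUSE", "-dBATCH",          # run non-interactively and exit when done
        "-sDEVICE=tiffg4",               # black-and-white Group 4 TIFF, a common OCR input
        f"-r{dpi}",                      # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",   # %03d is replaced by the page number
        pdf_path,
    ]


def render_pdf_to_tiff(pdf_path, **kwargs):
    """Run the command; raises CalledProcessError if Ghostscript reports a failure."""
    subprocess.run(ghostscript_tiff_command(pdf_path, **kwargs), check=True)
```

Calling render_pdf_to_tiff("old.pdf") (a made-up file name) should leave page_001.tif, page_002.tif and so on in the current directory, ready to feed to whatever OCR program you have.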

Another thing you can do is to print it to a PostScript file, and then clean it up. I have had cause to do something like this. I use a collection of sed and Perl scripts to clean up the crud, which gets me most of the way there. If your PDF is fairly simple, this might work pretty well.
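To give an idea of what that kind of cleanup looks like, here is a minimal sketch in Python rather than sed/Perl. It is not the actual scripts described above. It assumes the PostScript file holds its text as literal strings followed directly by one of the standard show operators, and it only undoes the \(, \) and \\ escapes. Files that define their own shorthand for show, use hex strings, or use a custom font encoding will need more work.

```python
import re

# A literal PostScript string "( ... )" immediately followed by a show-family operator.
PS_STRING = re.compile(
    r"\(((?:\\.|[^\\()])*)\)\s*(?:show|ashow|widthshow|awidthshow)\b"
)
# Backslash escapes for parentheses and the backslash itself.
PS_ESCAPE = re.compile(r"\\([()\\])")


def extract_ps_text(ps_source):
    """Return the text of every string that is painted with a show operator."""
    pieces = []
    for match in PS_STRING.finditer(ps_source):
        pieces.append(PS_ESCAPE.sub(r"\1", match.group(1)))
    return pieces
```

If the strings that come out are still gibberish, the problem is in the font encoding rather than in the crud around the text, and no amount of cleanup will help.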

I’m a bit confused. The plain old free Adobe Reader has the ability to extract text from PDF files (at least modern ones). This has been true back to at least version 5 of the freebie. While it’s not great for dozens of pages, it’ll work OK if you just need to grab a couple of pages.

What program & what process is the OP using to get the gibberish he/she is getting now?

PDF can store text, but it can also store bitmaps, gifs, jpgs and other image types. It is possible that you have a PDF that is showing several scanned pages (images). You would not be able to extract text from an image, and as others have pointed out, you would need to perform an OCR on them.
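A crude way to check which case you are in, without any PDF library, is to count image and font objects in the raw bytes of the file. This is only a sketch under some assumptions: the function name is made up, and it relies on the object dictionaries being stored uncompressed, which is likely for a file from around 1999 but not guaranteed for newer PDFs that pack objects into compressed streams.

```python
import re


def summarize_pdf_bytes(data):
    """Count image and font dictionaries visible in the raw bytes of a PDF."""
    return {
        # Embedded pictures are XObjects with /Subtype /Image.
        "image_objects": len(re.findall(rb"/Subtype\s*/Image\b", data)),
        # Font dictionaries have /Type /Font (the \b skips /FontDescriptor).
        "font_objects": len(re.findall(rb"/Type\s*/Font\b", data)),
    }
```

Usage would be something like summarize_pdf_bytes(open("old.pdf", "rb").read()), where old.pdf stands in for your file. Images and no fonts suggests scanned pages, so OCR is the way to go. Fonts present but gibberish on extraction points more toward a font encoding problem.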

I would be willing to take a look at it, if the file isn’t too big. You may use the address in my profile.

The full version of Acrobat

There’s no OCR in the office

Ah, the joys of confidential documents.

Looks like I’ll have a busy few weeks typing. I’ve nothing better to do :smack:

Really, print it to a Postscript file and open it up with an editor. What have you got to lose?

The postscript file doesn’t contain any readable text :frowning:

Three ideas for you:

1. Well… just as a thought, it might not have anything to do with the actual capturing (it’s getting something) but rather with the font conversion. You might try the extraction process on a system with a more complete install of font files.

2. Assuming you can “see” the text in the old PDF file, try printing it as a PDF file. This will force saving and conversion to a newer PDF/font format and perhaps enable text capture.

3. This free program looks like it might solve your problem

If full Acrobat can’t retrieve the text, I’m gonna bet you really have pictures of text, i.e. embedded gif/jpg files, rather than actual text. If so …

When you say you have no OCR …

Assuming you’re on Windows, not Mac or ??? …

If you have Office 2003 or later, and you have MS Document Imaging (look in the start menu under Office >> Office Tools), then you have all you’ll need for OCR.

All we need now is to convert the PDF itself to an image format file.

I don’t know if full Acrobat can do save as tiff, or save each page as a jpg. I’d try that first. If not …

If you have, or can buy for $15, a fax modem, you can print the PDF to paper, fax it to the fax modem, which yields a TIFF file.

If your office has a multi-function printer with scanning, that’d be the better choice. Or if any of your co-workers who’re cleared for seeing the document have one at home.

Once you do get the scan file, then drop that in MS Document Imaging & voilà.

If the document is less than, say, 15 pages, you could probably retype it more quickly, unless you get lucky. But if the doc is 200 pages then you can afford to spend a day or more trying to extract the text before it gets cheaper to re-keystroke it.

Heck, if it’s gonna take a man-week to re-keystroke, it’d probably be cheaper to go buy a scanner/multi-function printer & be done with it.
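To put rough numbers on that trade-off, here is a back-of-envelope sketch. Every figure in it is an assumption picked purely for illustration (two hours of fiddling to get OCR working, ten minutes to retype a page, two minutes to proofread an OCR'd page), so plug in your own.

```python
def break_even_pages(setup_minutes, retype_minutes_per_page, proofread_minutes_per_page):
    """Page count above which setting up OCR beats retyping everything by hand."""
    saved_per_page = retype_minutes_per_page - proofread_minutes_per_page
    if saved_per_page <= 0:
        raise ValueError("OCR never pays off if proofreading is as slow as retyping")
    # The one-off setup cost is recovered a few minutes at a time, page by page.
    return setup_minutes / saved_per_page


# Illustrative figures only: 120 min setup, 10 min/page retyping, 2 min/page proofreading.
print(break_even_pages(120, 10, 2))  # 15.0
```

With those made-up inputs the break-even lands at 15 pages. A slower typist or a cleaner scan pushes it lower, and a fussier OCR setup pushes it higher.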

The TIFF would be only pictures of the pictures of the words. In order to get an electronic copy of the text that can be edited, one needs to bring it into a text editor. If all one has is a picture file (GIF, TIFF, whatever), then one “simple” way to move the data (rather than the images) is to get it into a printed document, then use a scanner that will convert the images to text.

How about converting the PDF to another format? Try this free program that can convert the PDF to HTML or other formats… it may or may not successfully preserve the text.
http://www.mobipocket.com/en/DownloadSoft/ProductDetailsCreator.asp

I normally use it to convert PDFs for reading on my Kindle, and it works somewhat, well… haphazardly. But it’s worth a try; you’ve got nothing to lose.

Not sure what you mean by the “Full Version” of Acrobat.

I use Acrobat Pro 8 on my Mac and it has full OCR capabilities; I use it all the time to OCR documents that I have scanned in.

In addition, you can use Acrobat Pro to delete any metadata from the file so that you can start fresh. Presumably the document is a scanned image, with embedded metadata containing old corrupt text information. If you strip out the bogus metadata and run a fresh OCR on the doc, you might get it.
(Of course, use a copy of the document; keep the original file somewhere else).

Acrobat Reader can extract text from files that are converted/published to .pdf, but not from files scanned to .pdf. You need Acrobat (the full version that you pay for) to do that.

You can do this without wasting paper by just OCRing the picture directly (if necessary, by first converting it via a virtual printer to a compatible format). See LSLGuy’s post.

If you have MS Word, you can use PDFtoWord. You can download it from VeryPDF.com. If you cannot get it there, let me know.

I think you missed the intention of my post. Someone said to print it to paper and then scan it, and I am saying there is no need for that, as you can print to an image much faster and with better quality. That’s my point: printing to paper and then scanning makes no sense whatsoever.