Best method to convert PDF to text

Acrobat 9 Pro sucks donkey balls at converting PDFs to text. I vaguely recall that a more recent version promised better conversion, and I know there are more Acrobat alternatives out there today.

Do any of them (updated Acrobat or other software) do a reasonably good job of conversion? It’s for our business so it doesn’t necessarily need to be free or low-cost (not that we want to spend money).

Did the PDF originate as text, or was it a scanned document? I think you’re thinking of the latter, in which case you might look at standalone OCR programs. If it’s the former, I’ve usually just copied and pasted the text into another document, although you lose formatting this way.

PDFs come from all directions. We do editorial and graphic design work, so it could be a source document, a document for a client, etc. Most PDFs would have started as a Word document, gone to a designer who used either Quark or InDesign to do the layout, then converted to a PDF for publishing on the web. These aren’t our documents, so we don’t have access to the interim files.

Another alternative would be a reader that does a better job of selecting, copying and pasting. Right now, if I select and copy text from a PDF with a 2-column layout, I have to paste it into Word and then run through a series of steps to make it usable (e.g. removing all the extra paragraph breaks). I use a few macros to make it easier, but it’s still a giant PITA when I have to copy several paragraphs (can’t just convert all the extra breaks to spaces) or deal with manual hyphenation and indented bullets.
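For what it’s worth, that kind of cleanup can be scripted instead of done with Word macros. Here is a rough Python sketch, not a finished tool. It assumes a blank line marks a real paragraph break and a single line break is just a wrapped line, which won’t hold for every PDF. The function name clean_pasted_text is made up for this example, and indented bullets are not handled.

```python
import re


def clean_pasted_text(raw: str) -> str:
    """Tidy text pasted from a multi-column PDF.

    Assumption: a blank line is a real paragraph break, and a single
    newline is only a wrapped line. Adjust if your PDFs paste differently.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    # Split on blank lines so real paragraph breaks survive.
    for para in re.split(r"\n\s*\n", text):
        # Rejoin words hyphenated across a line break: "col-\numns" -> "columns".
        # Caveat: this also fuses genuine compounds such as "two-\ncolumn".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Any remaining newline is a wrap, so turn it into a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        # Collapse runs of spaces and tabs left over from the layout.
        para = re.sub(r"[ \t]+", " ", para).strip()
        if para:
            cleaned.append(para)

    return "\n\n".join(cleaned)
```

You would paste the copied text into a file (or read it from the clipboard), run it through clean_pasted_text, and paste the result into Word. Words that were genuinely hyphenated at a line end still need a manual check.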

Scanned documents are their own problem, but I’m mostly looking to convert published, laid-out documents.

I use FreeOCR.
As stated you will lose a lot of the formatting but it beats the hell out of typing out a large document.

It’s difficult to find a good converter. Usually, it treats each line of text as a separate sentence, or unfamiliar text as a graphic.

We get a lot of PDFs, and most of our documents are in numerical outline format (A.1.a.1.a.i format), and none of that is done with automatic numbering. It’s rough to try and fix all of it.

In my experience, if the document is under three or four pages in length, it’s just best to retype it. If it’s longer than that, ask the originator if it’s possible to get an original copy.

So, no, Acrobat 10 or 11 doesn’t do a better job.

If it’s a scanned document, forget it. Retype it.

Crap.

I’ve had better luck with ABBYY FineReader. At least it recognizes structure (columns, etc.) and the OCR portion is pretty good too. Trial version available to see if it works better for you.

http://finereader.abbyy.com/

As an example, here’s a screenshot of it converting a two-column magazine article to Word.

All I did after scanning was tell Word to make it one column instead of two. No manual re-pagination or whitespace cleanup was needed.

And for super complicated documents, the free tool Briss lets you recrop multi-column PDFs into a single column document with more pages (and ignore headers, footers, etc. in the process). The output of that is a lot easier to work with in any version of Acrobat. It only takes a minute or so to set up per document, because it shows you all the pages intelligently overlaid and then you just drag boxes around the main text body columns.

Thanks. Fortunately, 99 percent of the files I work with are professionally produced and the text is actual text (unlike this bastard, but OCR made it searchable, which I believe you helped with too). Here is an example of what I’m talking about. Say I want to pull a subsection and paste it into a Word doc that I’m working on. I don’t need perfection, but the multi-step adjusting for extra paragraph returns and whatnot can be a pain.

ETA: Sorry, I didn’t refresh before hitting submit–going to give Briss a try.

I wish I had that tool 10 years ago when I was converting some scanned two-column documents with illustrations. A major pain.

Turns out that if you save the PDF as plain text from within Acrobat, it gives you pretty neat paragraphs that you can then copy and paste. There’s something wacky with the capitalization in headers, but I think that’s just a matter of how that particular document was typed (maybe the guy was randomly holding down the shift key). Most of the text body turned out fine.
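If you’d rather script the same idea than click through Acrobat, something like the sketch below might do it. This is not what Acrobat does internally. It swaps in the third-party pypdf library (pip install pypdf), and the function names pages_to_text and pdf_to_text are just made up for this example. It only works on PDFs that contain actual text, not scans, and it makes no attempt to untangle columns.

```python
from typing import Iterable, Protocol


class HasText(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def pages_to_text(pages: Iterable[HasText]) -> str:
    """Join the text of each page, separated by a blank line."""
    # "or ''" guards against a page that yields no text at all.
    return "\n\n".join((page.extract_text() or "").strip() for page in pages)


def pdf_to_text(path: str) -> str:
    """Extract the embedded text from a PDF file using pypdf."""
    # Imported here so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return pages_to_text(reader.pages)
```

Calling pdf_to_text on a file path returns one big string you can write out to a .txt file. How well paragraphs and reading order come out depends on how the PDF was built, so compare the result against Acrobat’s own plain-text export before relying on it.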

FineReader was also able to extract and consolidate the columns, but the OCR is not as accurate as extracting actual text from PDF (which Acrobat does).

So I guess you have to decide which is more valuable – getting the text or the formatting.

I don’t think Briss would work too well for a document this complex, because it’s hard to manually overlay and crop the columns unless you remove all the in-between picture-pages too. For something simpler, like an academic paper without fancy colors and images, Briss works extremely well. This particular document seems to go way overboard with the pretty formatting.

One last option you might consider if you have a lot of extra time is something called calibre, an open-source e-book converter and PDF wrangler. Basically it can take PDF or other text files and play with line spacing, paragraph removal, line-unwrapping, etc. through like 10,000 different parameters. I did not have the patience to make it work right, but in theory it might be able to do something like what you’re looking for. When I tried it on your document, the output wasn’t as good as just saving as text from Acrobat.

Tell me about it! This one tool has been more useful to me than even Acrobat itself. I use it to crop PDFs down to Kindle size or for my tablet and it makes reading much more pleasant.

Gack, I hate the short edit window here. I forgot to paste an example of the saved text file:
https://www.dropbox.com/s/1u90n4ndcv0sjyn/GEF-ADAPTION%20STRATEGIES%20(plain).txt

The last PDF I opened had an option to export to Word or Excel online. Sorry, I didn’t try it, so I can’t tell you if it works.

Ah… It’ll cost you https://www.acrobat.com/exportpdf/en_GB/pricing.html?trackingid=KFUPE