Best method to convert PDF to text

Rhythmdvl · March 14, 2014, 9:03pm

Acrobat 9 Pro sucks donkey balls at converting PDFs to text. I vaguely recall that a more recent version promised better conversion, and I know there are more Acrobat alternatives out there today.

Do any of them (an updated Acrobat or other software) do a reasonably good job at conversion? It’s for our business, so it doesn’t necessarily need to be free or low-cost (not that we want to spend money).

Dewey_Finn · March 14, 2014, 9:30pm

Did the PDF originate as text, or was it a scanned document? I think you’re thinking of the latter, in which case you might look at standalone OCR programs. If it’s the former, I’ve usually just copied and pasted the text into another document, although you lose formatting this way.

Rhythmdvl · March 14, 2014, 9:43pm

PDFs come from all directions. We do editorial and graphic design work, so it could be a source document, a document for a client, etc. Most PDFs would have started as a Word document, gone to a designer who used either Quark or InDesign to do the layout, then converted to a PDF for publishing on the web. These aren’t our documents, so we don’t have access to the interim files.

Another alternative would be a reader that does a better job of selecting, copying, and pasting. Right now, if I select and copy text from a PDF with a 2-column layout, I have to paste it into Word and then run through a series of steps to make it usable (i.e., remove all the extra paragraph breaks). I use a few macros to make it easier, but it’s still a giant PITA when I have to copy several paragraphs (I can’t just convert all the extra breaks to spaces) or deal with manual hyphenation and indented bullets.
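For anyone who would rather script that cleanup than run Word macros, here is a minimal Python sketch of the same idea. It is only a heuristic under stated assumptions, not what any particular tool does: the function name `unwrap`, the `short_ratio` threshold, and the bullet and sentence-end patterns are all made up for illustration. It treats a short line ending in sentence punctuation as the end of a paragraph, starts a new paragraph at each bullet, and rejoins words split by an end-of-line hyphen.

```python
import re

# Hypothetical patterns: a leading bullet or "1." / "1)" marker, and
# sentence-ending punctuation optionally followed by a closing quote/bracket.
BULLET = re.compile(r"^\s*(?:[•\-*–]|\d+[.)])\s+")
SENTENCE_END = re.compile(r"[.!?:][\"')”’]*$")


def unwrap(text, short_ratio=0.8):
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    lines = [ln.rstrip() for ln in text.splitlines()]
    widths = [len(ln) for ln in lines if ln.strip()]
    if not widths:
        return ""
    full_width = max(widths)  # assumed width of a fully wrapped line

    paragraphs = []
    current = ""
    for ln in lines:
        stripped = ln.strip()
        if not stripped:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if BULLET.match(ln) and current:
            # A bullet marker starts a new paragraph.
            paragraphs.append(current)
            current = ""
        if current.endswith("-") and stripped[0].islower():
            # Rejoin a word hyphenated across the line break.
            # Assumption: this also merges genuine compounds like "well-known".
            current = current[:-1] + stripped
        elif current:
            current += " " + stripped
        else:
            current = stripped
        # A short line that ends a sentence probably ends the paragraph.
        if SENTENCE_END.search(stripped) and len(ln) < short_ratio * full_width:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

You would paste the copied text into a file or string, call `unwrap(text)`, and paste the result into Word. Expect to tune `short_ratio` per document, and expect it to guess wrong when a paragraph happens to end on a full-width line.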

Scanned documents are their own problem, but I’m mostly looking to convert published, laid-out documents.

zoid · March 14, 2014, 9:47pm

I use FreeOCR.
As stated, you will lose a lot of the formatting, but it beats the hell out of typing out a large document.

Noelq · March 14, 2014, 9:49pm

It’s difficult to find a good converter. Usually, it treats each line of text as a separate sentence, or unfamiliar text as a graphic.

We get a lot of PDFs, and most of our documents are in numerical outline format (A.1.a.1.a.i format), and none of that is done in automatic format. It’s rough to try and fix all of it.

In my experience, if the document is under three or four pages in length, it’s just best to retype it. If it’s longer than that, ask the originator if it’s possible to get an original copy.

So, no, Adobe 10 or 11 doesn’t do a better job.

If it’s a scanned document, forget it. Retype it.

Rhythmdvl · March 14, 2014, 10:15pm

Crap.

Reply · March 14, 2014, 10:43pm

I’ve had better luck with ABBYY FineReader. At least it recognizes structure (columns, etc.) and the OCR portion is pretty good too. Trial version available to see if it works better for you.

http://finereader.abbyy.com/

Reply · March 15, 2014, 3:44am

As an example, here’s a screenshot of it converting a two-column magazine article to Word.

All I did after scanning was tell Word to make it one column instead of two. No manual re-pagination or whitespace cleanup was needed.

Reply · March 15, 2014, 3:47am

And for super complicated documents, the free tool Briss lets you recrop multi-column PDFs into a single-column document with more pages (and ignore headers, footers, etc. in the process). The output of that is a lot easier to work with in any version of Acrobat. It only takes a minute or so to set up per document, because it shows you all the pages intelligently overlaid and then you just drag boxes around the main text body columns.
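To make the recropping idea concrete, here is a rough Python sketch of the geometry only: given a page size, it computes one crop box per column, skipping the header and footer bands. This is not Briss's code or algorithm (Briss has you drag the boxes by hand); the function name `column_boxes` and every default margin are assumptions for illustration, and the boxes would still have to be applied to the PDF with a separate tool.

```python
from typing import List, Tuple

# (left, bottom, right, top) in PDF points, origin at the bottom-left corner
Box = Tuple[float, float, float, float]


def column_boxes(
    page_width: float,
    page_height: float,
    columns: int = 2,
    margin_x: float = 36.0,       # assumed left/right margin
    margin_top: float = 54.0,     # assumed header band to drop
    margin_bottom: float = 54.0,  # assumed footer band to drop
    gutter: float = 18.0,         # assumed gap between columns
) -> List[Box]:
    """Return one crop box per text column, left to right."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    usable = page_width - 2 * margin_x - gutter * (columns - 1)
    if usable <= 0 or page_height <= margin_top + margin_bottom:
        raise ValueError("margins leave no room for text")
    col_width = usable / columns
    boxes = []
    for i in range(columns):
        left = margin_x + i * (col_width + gutter)
        boxes.append((left, margin_bottom, left + col_width, page_height - margin_top))
    return boxes
```

Each box would become its own page in the output, so a two-column document ends up with twice as many single-column pages. Real layouts rarely have perfectly even columns, which is why dragging the boxes over the overlaid pages tends to work better than fixed numbers like these.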

Rhythmdvl · March 15, 2014, 5:29am

Thanks. Fortunately, 99 percent of the files I work with are professionally produced and the text is actual text (unlike this bastard, but OCR made it searchable, which I believe you helped with too). Here is an example of what I’m talking about. Say I want to pull a subsection and paste it into a Word doc that I’m working on. I don’t need perfection, but the multi-step adjusting for extra paragraph returns and whatnot can be a pain.

ETA: Sorry, I didn’t refresh before hitting submit–going to give Briss a try.

Voyager · March 15, 2014, 6:06am

I wish I had that tool 10 years ago when I was converting some scanned two-column documents - with illustrations. A major pain.

Reply · March 15, 2014, 9:41am

Turns out that if you save the PDF as plain text from within Acrobat, it gives you pretty neat paragraphs that you can then copy and paste. There’s something wacky with the capitalization in headers, but I think that’s just a matter of that particular document being typed (maybe the guy was randomly holding down the shift key). Most of the text body turned out fine.

FineReader was also able to extract and consolidate the columns, but the OCR is not as accurate as extracting actual text from PDF (which Acrobat does).

So I guess you have to decide which is more valuable – getting the text or the formatting.

I don’t think Briss would work too well for a document this complex (because it’s hard to manually overlay and crop the columns unless you remove all the in-between picture-pages too). For something simpler, like an academic paper without fancy colors and images, Briss works extremely well. This particular document seems to go way overboard with the pretty formatting.

One last option you might consider if you have a lot of extra time is something called calibre, an open-source e-book converter and PDF wrangler. Basically it can take PDF or other text files and play with line spacing, paragraph removal, line-unwrapping, etc. through like 10,000 different parameters. I did not have the patience to make it work right, but in theory it might be able to do something like what you’re looking for. When I tried it on your document, the output wasn’t as good as just saving as text from Acrobat.

Tell me about it! This one tool has been more useful to me than even Acrobat itself. I use it to crop PDFs down to Kindle size or for my tablet and it makes reading much more pleasant.

Reply · March 15, 2014, 9:48am

Gack, I hate the short edit window here. I forgot to paste an example of the saved text file:
https://www.dropbox.com/s/1u90n4ndcv0sjyn/GEF-ADAPTION%20STRATEGIES%20(plain).txt

madrabbitwoman · March 15, 2014, 9:59am

The last PDF I opened had an option to export to Word or Excel online. Sorry, I didn’t try it, so I can’t tell you if it works.

madrabbitwoman · March 15, 2014, 10:03am

Ah… It’ll cost you https://www.acrobat.com/exportpdf/en_GB/pricing.html?trackingid=KFUPE
