PDF interpreter?

Oy · February 25, 2005, 12:17am

Does anyone know of a really good tool that takes a PDF as input and produces HTML or text that actually resembles the page organization of the original PDF? Adobe Reader’s convert-to-text or HTML is no good: fields that appear to be columns in a table seem to be randomly scattered throughout the text.

I need to use PDFs as input for software, but the organization/layout of the data in the original document provides information. If you are looking at a table containing three columns (Goats, Dogs, and Ducks), it’s no good to get text that may have Newfoundland, Nubian, Basset Hound, Basenji, Teal, Pygmy, etc. in apparently completely random order.

Anyone know of a good tool?

Thanks!

Khadaji · February 25, 2005, 1:33am

Try Planet PDF and see what they’ve got. I’m trying to think; there is another site too, but it isn’t coming to me.

Are you a developer?

Oy · February 25, 2005, 3:08am

Yes, I am a developer - these days, in Oracle. Among several projects, I am dealing with loading data from multiple sources, and unfortunately most of these sources are PDFs.

Thanks so much for the site - I’ll have a look first thing in the morning.

Khadaji · February 25, 2005, 1:37pm

The other place to look is PDFZone, which I remembered while looking for an open source toolkit that I have used to dump PDF text. Of course, I can’t remember the name of that now either. But perhaps it will come to me later.

Khadaji · February 25, 2005, 5:44pm

XPDF is a good tool, and I was able to make it compile under Borland C++ Builder in about 5 days or so of tinkering (alongside other work). It did a great job of extracting to text. Maybe with a little more tinkering, you could make it do HTML. One thing that is different from when I did it (almost 5 years ago) is that I am pretty sure Adobe introduced flowable PDF (or some such) in Acrobat 6, so it might not extract as well. I haven’t done anything with PDF since about 2000.

Hope all this helps and good luck!
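
For anyone who would rather script the extraction than compile Xpdf themselves, here is a minimal Python sketch. It assumes a `pdftotext` binary (shipped with Xpdf, and also with Poppler) is already installed and on the PATH; the `-layout` option asks it to keep the physical page layout, and `-` as the output file sends the text to standard output. The function names and the file name `report.pdf` are made up for illustration, and how well the columns survive will depend on the PDF.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text.

    Assumes the pdftotext binary is installed and on the PATH.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Calling `extract_text("report.pdf")` would then return the page text with columns padded out by spaces, which is usually easier to post-process than the scrambled order described above.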

Oy · February 25, 2005, 6:16pm

Thanks so much. It’s looking as if what I’m going to get out of this is not a file I can load directly, but a text or Excel file that can be manipulated via cut and paste into a file I can load directly. That’s primarily due to the layout of these files, and the fact that the HTML conversion I’ve seen thus far doesn’t include text formatting such as font and color, so my web mining tool can’t tell the difference between column headings and column data.

Still, it beats the heck out of what Adobe Acrobat had produced, or entering the data by hand from a printout!

Thanks again for your help. Further suggestions are welcome.
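
One way to cut down on the cut-and-paste step, sketched in Python under some loud assumptions: the extracted text keeps its columns separated by runs of two or more spaces, no cell is empty, and the first non-blank line holds the column headings. The sample data and the function names are invented for illustration; real PDFs will likely need per-document tweaking.

```python
import csv
import io
import re


def split_columns(line):
    """Split one layout-preserved line on runs of two or more whitespace characters.

    Single spaces inside a cell (e.g. "Basset Hound") are left alone.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def layout_text_to_csv(text):
    """Turn layout-preserved text into CSV, one row per non-blank line."""
    rows = [split_columns(line) for line in text.splitlines() if line.strip()]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical extracted text; the first line is assumed to be the headings.
sample = (
    "Goats     Dogs            Ducks\n"
    "Nubian    Newfoundland    Teal\n"
)

print(layout_text_to_csv(sample))
```

Because the headings are simply the first row, a loader can tell them apart from the data by position rather than by font or color. Empty cells will shift later columns to the left, so a table with gaps needs a position-based split instead.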
