Does anyone know of a really good tool that takes a PDF as input and produces HTML or text that actually preserves the page layout of the original document? Adobe Reader’s convert-to-text or HTML output is no good: fields that appear to be columns in a table end up scattered, seemingly at random, throughout the text.
I need to use PDFs as input for software, but the organization and layout of the data in the original document carries information. If you are looking at a table with three columns (Goats, Dogs, and Ducks), it’s no good to get text that lists Newfoundland, Nubian, Basset Hound, Basenji, Teal, Pygmy, etc. in apparently random order.
Yes, I am a developer - these days, in Oracle. Among several projects, I am dealing with loading data from multiple sources, and unfortunately most of these sources are PDFs.
Thanks so much for the site - I’ll have a look first thing in the morning.
The other place to look is PDFZone, which I remembered while I was looking for an open source toolkit that I have used to dump PDF text. Of course, I can’t remember the toolkit’s name now either. But perhaps it will come to me later.
XPDF is a good tool, and I was able to make it compile under Borland C++ Builder in about 5 days or so of tinkering (along with other work). It did a great job of extracting to text. Maybe with a little more tinkering, you could make it do HTML. One thing that has changed since I did it (almost 5 years ago) is that I am pretty sure Adobe introduced flowable PDF (or some such) in Acrobat 6, so it might not extract as well. I haven’t done anything with PDF since about 2000.
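For anyone who would rather not compile it, here is a minimal sketch of driving the pdftotext command-line tool that ships with Xpdf from a script. The -layout option asks it to keep the physical layout of the page as closely as it can, which is what matters for tables. The helper names build_pdftotext_command and extract_text are made up for illustration, and the sketch assumes pdftotext is installed and on the PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Build the argument list for pdftotext (hypothetical helper).

    -layout keeps the original physical layout as far as possible;
    -f and -l restrict extraction to a page range.
    """
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, first_page=None, last_page=None):
    """Run pdftotext; assumes the executable is available on the PATH."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    cmd = build_pdftotext_command(pdf_path, txt_path, first_page, last_page)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
```

Calling extract_text("report.pdf", "report.txt") would write the layout-preserved text next to the PDF (the file names are placeholders). How well the columns line up will still depend on how the PDF was produced.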
Thanks so much. It’s looking as if what I’m going to get out of this is not a file I can load directly, but a text or Excel file that can be manipulated via cut and paste into a file I can load directly. That’s primarily due to the layout of these files, and the fact that the HTML conversion I’ve seen thus far doesn’t include text formatting such as font and color, so my web mining tool can’t tell the difference between column headings and column data.
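If the extracted text keeps the columns lined up with runs of spaces, part of the cut-and-paste step could probably be scripted. The following is only a rough sketch under some loud assumptions: cells are separated by two or more spaces, no cell is empty, and the first non-blank line is the heading row (since font and color are not available to tell headings apart). The function names and the sample table are invented for illustration.

```python
import csv
import io
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption: columns are separated by runs of two or more whitespace
    characters, so a single space inside a cell ("Basset Hound") is kept.
    Empty cells would collapse and shift columns; this sketch ignores that.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a loader could read directly."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample resembling the three-column table discussed above.
sample = (
    "Goats     Dogs            Ducks\n"
    "Nubian    Newfoundland    Teal\n"
    "Pygmy     Basset Hound    Mallard\n"
)

rows = layout_text_to_rows(sample)
header, data = rows[0], rows[1:]  # assumption: first row holds the headings
csv_text = rows_to_csv(rows)
```

Real files will likely need per-document tweaks, for example fixed column offsets instead of whitespace splitting when some cells are blank.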
Still, it beats the heck out of what Adobe Acrobat had produced, or keying in data from a printout!
Thanks again for your help. Further suggestions are welcome.