Is there an unmet but hidden need for parsing text out of PDF files out there?

Well, not according to the freelancer sites that I frequent, but maybe the people with all the PDFs are just too shy.

In the near future I will be building a tool for extracting table-like text data out of PDF files, including those where simple approaches like copy-paste or export to spreadsheet produce unusably ugly output. So I am wondering whether it would make sense to build not just a quick-and-dirty solution for the current problem, but to invest in a more generic tool that would be easy to customize for any file format.

But the question here is: is there an actual unmet need for such a service that I could tap into once I have it up and running? Are there lots of people out there sitting on a big pile of PDF docs who just cannot wait to get them accurately and fairly cheaply parsed and exported to a spreadsheet/database? Or is even the best-case demand for this about as low as can be inferred from the lack of posted freelance projects?

I’m not sure if this answers your question, but I occasionally need to convert a PDF file into Excel or Word. I use PDF Professional, which lets me do that.

It works on documents where (as a PDF) you can select text and numbers, vs. for example a scanned document that was converted into a PDF.

There are programs like pdftotext that extract text from pdf files, but perhaps that wouldn’t suit your needs.


I am aware of pdftotext and have written parsers for its output. The problem is that it does not do a good job of handling tables that have some cells missing. When you start parsing the text produced by pdftotext in this case, it may be pretty hard (or impossible) to figure out which cell the particular tidbit of text belongs to. Another problem is cells containing several words separated by whitespace - sometimes it becomes unclear if that whitespace is due to division between cells or internal whitespace of the cell contents.

In short, pdftotext is better than nothing but not, AFAIK, a basis for a generic fool-proof solution to the problem.
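
To make the missing-cell problem concrete, here is a minimal sketch in Python. The two sample rows, the column boundaries and the function names are all made up for illustration, and it assumes the column positions are already known (for example, worked out from word coordinates such as those pdftotext can report with its -bbox option). It is not meant as a general solution, just a picture of why splitting on whitespace loses information.

```python
import re

# Two rows roughly as layout-preserving text extraction might produce them.
# The second row has an empty middle cell.
ROW_FULL = "Widget A" + " " * 4 + "12" + " " * 4 + "3.50"
ROW_GAP = "Widget B" + " " * 10 + "4.25"

# Hypothetical column boundaries (character offsets), assumed known in advance.
COLUMNS = [(0, 10), (10, 16), (16, 24)]


def split_on_whitespace(line):
    """Naive approach: treat any run of two or more spaces as a cell boundary."""
    return re.split(r"\s{2,}", line.strip())


def split_by_position(line, columns=COLUMNS):
    """Position-based approach: slice the line at fixed column boundaries."""
    return [line[start:end].strip() for start, end in columns]


print(split_on_whitespace(ROW_GAP))  # the empty cell silently disappears
print(split_by_position(ROW_GAP))    # the empty cell is kept as ""
```

With the naive split, the row with the gap comes back as two cells and there is no way to tell which column the second value belongs to. The position-based split keeps three cells, but only because the boundaries were supplied by hand, which is exactly the part that is hard to do generically.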

Part of the problem with PDFs is it depends on how they are created.

For instance, if I take an Excel document and I have a full pro version of Adobe Acrobat, I can convert the spreadsheet into a PDF and it will be accurate.

If I print this document out and scan it and make a PDF out of the scan, I have to rely on OCR (optical character recognition) programs to read everything in the scan correctly.

Those can be good or bad depending on the quality.

So your first step is determining how your PDFs were originally created.
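
One rough way to check this automatically is to see whether a page yields any extractable text at all. The sketch below assumes pdftotext is installed and on the PATH, and that its output separates pages with form feed characters; the function names and the 20-character threshold are arbitrary choices for illustration, not a tested rule.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on a file; "-" as the output name sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def classify_pages(text, min_chars=20):
    """Label each page 'text' or 'scanned?' based on how much text came out.

    min_chars is an arbitrary, assumed threshold: pages with fewer
    alphanumeric characters are guessed to be images that would need OCR.
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    labels = []
    for page in pages:
        visible = sum(1 for ch in page if ch.isalnum())
        labels.append("text" if visible >= min_chars else "scanned?")
    return labels
```

Usage would be something like classify_pages(extract_text("report.pdf")), where the file name is just a placeholder. A page labelled 'scanned?' could also simply be blank, so this is only a first filter.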

I do this a lot with PDF Converter Enterprise 6.0 from Nuance.

Open PDF, “save as” .doc (Word) format and the PDF tables are converted to Word tables pretty well. The conversion often breaks the tables at a page boundary, but that break might also be in the original document and not a function of the conversion.

PDF Converter Pro would do the exact same thing; I have the “Enterprise” version for other features/reasons.

We built a tool that would extract text from and index text in PDFs about 10 years ago. It allowed you to organize and quickly search your PDFs. It did not sell. But many things can impact that - poor marketing or bad timing.

I think there is definitely a need for a polished tool that does handy things with PDF documents like you describe. There are tools available that do things like this, at least when parsing scanned documents, which is not too far away from manipulating PDF docs.

I use PDF Pen for some stuff and that tool does support tables to some extent.

A couple of years ago I used Abbyy Finereader’s engine on Linux in order to process tens of thousands of scientific documents that were in PDF, but the text was not searchable. We simply had the engine read in each doc and run OCR on the generated layout.

They had options for exporting to Excel, but I never used them.

You will find all kinds of interesting issues with PDF as you go along your way. The aforementioned scientific documents are an example of one: The PDF spec does not require a generation tool to provide a means to map printed glyphs back to text characters.
PDF is, first and foremost, a format for defining the layout of printed documents unambiguously. It is quite possible for an app to have an array of glyphs (that bear a striking resemblance to letters) that are referenced in the remainder of the doc, as in “Glyph #12 goes here; then Glyph #114; then Glyph #12 again.”
This glyph array might represent three different fonts, and only the characters that were used. In other words, it is quite possible that there is no “Z” nor “Q” present for some of those fonts.
On top of that, as I said earlier, there is no requirement to provide a “reverse map” that converts a glyph to some text element.
The result? A document that is perfectly legible, but when you open it in Acrobat and copy/paste to somewhere else you get weird gibberish.

The only solution for these annoying docs was to use the Finereader engine to render them as PDF in memory, then OCR the rendered pages (and Finereader does this in one step from the command line).
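
For anyone who wants to picture the glyph problem, here is a toy model in Python. It is a sketch under stated assumptions, not how any real PDF library works: the glyph table, the page contents and the fallback of treating a glyph number as a character code are all invented for illustration. In real PDFs the optional reverse map is typically a ToUnicode table attached to the font.

```python
# Hypothetical embedded font subset: glyph number -> the shape a reader sees.
# Only the glyphs actually used in the document are present.
GLYPH_SHAPES = {12: "A", 114: "B"}

# Page content: "Glyph #12 goes here; then Glyph #114; then Glyph #12 again."
PAGE_GLYPHS = [12, 114, 12]


def render(glyph_ids):
    """What ends up on screen or paper: the shapes, in order."""
    return "".join(GLYPH_SHAPES[g] for g in glyph_ids)


def copy_text(glyph_ids, to_unicode=None):
    """What copy/paste returns.

    With a reverse map, each glyph number is translated back to a character.
    Without one, this toy extractor falls back to treating the glyph number
    as a character code (an assumed fallback), which produces gibberish.
    """
    if to_unicode is None:
        return "".join(chr(g) for g in glyph_ids)
    return "".join(to_unicode[g] for g in glyph_ids)


print(render(PAGE_GLYPHS))                            # legible on the page
print(repr(copy_text(PAGE_GLYPHS)))                   # gibberish when copied
print(copy_text(PAGE_GLYPHS, {12: "A", 114: "B"}))    # fine with a reverse map
```

The page looks the same in all three cases; only the presence of the reverse map decides whether the text can be recovered without OCR.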

Has anyone tried Google Docs’ OCR functionality? They made a point last summer about being able to convert PDFs to text with their own OCR option, but I’ve not tried it.

Hmm, to clarify, the stuff I am working on (and am considering applying more broadly to make more money) has nothing to do with OCR. It has to do with PDF documents that already contain ASCII text (accessible by export to Word or by copy-paste to Notepad) but which have tables that cannot be exported correctly for whatever reason. So my tool would extract these tables (let’s say into an Excel spreadsheet) with greater accuracy.

Fair enough. That would certainly be a step up from plain OCR.

Nuance PDF Professional does as you describe… I use it to take PDFs of tabular data and convert them to Excel. If the PDF is 10 pages (as an example), each page gets its own tab or worksheet.

Here’s one way to find out:

Build a single-purpose web site (with a catchy title, of course). The user uploads a file and it returns the extracted data. No account required, no ads, and it’s free. Maybe build a quick REST API too. Tell people about it, document the API. Let it run and collect analytics while you work on something else.

After a while you’ll have your answer. Maybe no one uses it. Maybe it gets a million hits a month. Maybe there are half a dozen users who discover it’s the perfect solution to a very specific problem.

The trick is to make it as simple as possible. Perfect is the enemy of good. Build it quickly, launch it quickly and avoid arbitrary limits. You can change it later, when you know more and don’t have to make assumptions.
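
As a rough idea of how small such a service could start out, here is a sketch using only Python’s standard library. Everything specific in it is an assumption made for illustration: the /extract path, the JSON response shape, port 8000, and especially extract_tables, which is an empty placeholder standing in for the real extractor.

```python
import json
from wsgiref.simple_server import make_server


def extract_tables(pdf_bytes):
    """Hypothetical placeholder for the real extractor; returns a list of tables."""
    return []


def _json_response(start_response, status, payload):
    """Send a JSON body with the given HTTP status line."""
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def app(environ, start_response):
    """WSGI app: POST the raw PDF bytes to /extract, get JSON back."""
    if (environ.get("REQUEST_METHOD") != "POST"
            or environ.get("PATH_INFO") != "/extract"):
        return _json_response(start_response, "404 Not Found",
                              {"error": "POST a PDF to /extract"})
    length = int(environ.get("CONTENT_LENGTH") or 0)
    pdf_bytes = environ["wsgi.input"].read(length)
    # PDF files start with the marker "%PDF-"; reject anything else early.
    if not pdf_bytes.startswith(b"%PDF-"):
        return _json_response(start_response, "400 Bad Request",
                              {"error": "not a PDF"})
    return _json_response(start_response, "200 OK",
                          {"tables": extract_tables(pdf_bytes)})


def serve(port=8000):
    """Run the app with the built-in development server (not for production)."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling serve() starts it locally. No accounts, no limits, one endpoint; counting requests in the server logs would already tell you whether anyone shows up.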

I don’t think there is an unmet need for this as a service. There are many off the shelf solutions for extracting text from PDFs, including some online tools. Anyone who does this sort of work in bulk is surely aware of them, or has created their own tool.

Definitely some market (since we use Acrobat itself to do this every day in our main product, and live with the problems). But it is not an easy problem, as you’ll soon discover. If you want to sell it, you should probably package it up along with an implementation of an open source PDF-to-text tool, so your whole package is “PDF to Text, with really good table rendering” rather than “PDF Tables rendered better”, as I would expect most people who want the tables out to want the entire document out.

Here’s my opinion as a technical writer who works with PDFs on a daily basis: If I needed the data out of a lot of tables published as PDFs, I’d be asking the person/organization supplying the data to provide me with another format (such as Excel). Any conversion solution would introduce flaws that could take hours of labor to correct. Then there’s the non-negligible cost of the solution itself. PDF is an end format, therefore the source of the tables would either have to be A) a database that can presumably export to something more flexible or B) typed or hand-drawn paper tables that only OCR could extract anyway. I suspect A is much more likely than B these days, so I would not hesitate to ask for another format.

I don’t know how many other people have this problem, but if it accurately keeps tables? I’d love it. I’ve had many clients send a PDF and want the content inside posted as HTML to a website. I usually end up formatting everything by hand because text extractors are horrible about formatting.