Recognizing & copying tables from PDFs?

Is there software that can copy tabular data from a PDF (like this one) into a spreadsheet?

I don’t mean OCR – the text is already copyable – but preserving the individual cells.

A preliminary answer to my own question: http://www.tamirhassan.com/competition/dataset-tools.html

Trying it out now and will report back if it works. But if you know of any others, please do mention them.

What you are trying to do requires a mixture of PDF text extraction (to get the cell values) and OCR-style layout analysis (to identify tabular patterns). It is a nontrivial problem that is still looking for a good solution, hence the competition challenging people to solve it.

I’ve done something similar by copying (Ctrl-A, Ctrl-C) from the PDF and pasting into Excel, then using Data > Text to Columns to separate the data. With the data in your example, I used both the tab and the vertical bar, “|” (Shift-\), as delimiters in the Text to Columns window.
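If you would rather script that step than click through Text to Columns, here is a minimal Python sketch of the same idea: split each pasted line on tabs and vertical bars, then write the cells out as CSV that a spreadsheet can open. The sample text and the function names (split_rows, to_csv) are made up for illustration; your PDF's pasted text may need different delimiters.

```python
import csv
import io
import re


def split_rows(pasted_text):
    """Split pasted text into rows of cells, treating tab and "|" as delimiters."""
    rows = []
    for line in pasted_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on either delimiter, then trim stray spaces around each cell.
        cells = [cell.strip() for cell in re.split(r"[\t|]", line)]
        rows.append(cells)
    return rows


def to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for text pasted from a PDF.
sample = "Name | Qty\tPrice\nWidget | 3\t4.50\n"
print(to_csv(split_rows(sample)))
```

Save the output with a .csv extension and Excel should open it with one value per cell.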

For one or two pages, I’ll just cut and paste the individual cells where I want them. For many pages (hundreds, in my case), I wrote a little VBA macro to do that for me. It took a bit of time to 1) learn how to write a macro in VBA and 2) actually write the macro, but I like to think it took less time than doing it all by hand, with the added benefit that I learned a lot about working with VBA macros.

Monarch Pro is an excellent tool for this, but pricey. It will convert authored PDFs to text and generally do a good job of maintaining the table format. From there, you can define a “data template” to give the software hints as to what makes up the data (vs headers, formatting, etc). The hints consist of indicating certain characters or character types in certain column positions, etc. The templates can be used to automate the data conversion if you frequently need to grab the same data from a document with identical formatting.
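To give a feel for what such hints do, here is a rough Python sketch of the general idea. It is not Monarch's template format or API: the function name matches_template, the hint names ("alpha", "digit"), and the sample lines are all invented for illustration. A line counts as data only if the expected character type appears at each hinted column position.

```python
def matches_template(line, hints):
    """Return True if every hinted column position holds the expected character type."""
    for position, kind in hints.items():
        if position >= len(line):
            return False  # line is too short to hold this column
        char = line[position]
        if kind == "digit" and not char.isdigit():
            return False
        if kind == "alpha" and not char.isalpha():
            return False
    return True


# Hypothetical hints: a letter in column 0 and a digit in column 10 mark a data row.
HINTS = {0: "alpha", 10: "digit"}

lines = [
    "Item      Qty   Price",
    "Widget    3     4.50",
    "----------------------",
]

# Keep only the lines that look like data, dropping the header and the rule line.
data_rows = [line for line in lines if matches_template(line, HINTS)]
print(data_rows)
```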

A cheaper alternative with fewer options that may work for you is Able2Extract.

If you have the option to select the font within the PDF, be sure to use a monospaced font, as it will result in more consistent formatting of the tabular data regardless of whether you use a tool such as above or copy and paste as a prior poster noted.
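The reason a monospaced font helps is that each column then starts at the same character offset on every line, so the text can be cut at fixed positions. A minimal Python sketch, assuming column boundaries that you would have to measure from your own document (the offsets and sample lines below are hypothetical):

```python
def slice_fixed_width(line, boundaries):
    """Cut one line into cells at fixed (start, end) character offsets."""
    return [line[start:end].strip() for start, end in boundaries]


# Hypothetical column boundaries, measured by eye from the sample below.
BOUNDARIES = [(0, 10), (10, 16), (16, 24)]

lines = [
    "Item      Qty   Price   ",
    "Widget    3     4.50    ",
]

table = [slice_fixed_width(line, BOUNDARIES) for line in lines]
print(table)
```

With a proportional font the offsets drift from line to line, so this kind of slicing stops lining up.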

Nuance PDF Professional will convert PDFs with tables to Word documents with tables. It tends to break a table in the PDF that extends across a page boundary, but otherwise works well.