Extracting a PDF table

Folacin · February 22, 2019, 2:50pm

I have a PDF file that has a table in it something like this:



name|value1|value2|value3|value4
a        1            1     1
b        1      1     1      1
c        1            1
d        1      1     1

I want to get it into an Excel file so I can do some analysis on it (the original data is not accessible to me). COpy/pasting to Excel ended up with every field on a separate line, and missing cells not generating a line

The Google told me to select and copy the table from the PDF and paste it as a table into Word, and then move that table to Excel. This also ended up with the empty cells in the original PDF table not generating cells in either Word or Excel. In other words, the copied table looks like this:



name|value1|value2|value3|value4
a        1      1      1
b        1      1     1      1
c        1      1
d        1      1     1

Since there is no way from the data to assume where the empty cells are, I can’t figure a way to create them with a script.

Any ideas? 380 rows or so, so a lot to do manually. Ideally, there would be some way to copy the null cells (or insert something into the table but the PDF is not editable), but I can’t find anything.

GaryM · February 22, 2019, 3:23pm

Not in front of my PC right now, but have you tried Import as an Excell command?

TroutMan · February 22, 2019, 3:30pm

Going directly to Excel in general won’t work. Instead of copy/paste into Word, try opening the PDF from Word. I’ve had decent luck getting it to convert into a table that way, then you can copy/paste from Word into Excel.

Folacin · February 22, 2019, 3:35pm

I hadn’t seen that option - but it looks like that only works for txt/csv files - at any rate it didn’t recognize the PDF.

Using the word ‘import’, Google did find an Adobe tool to do the work that has a free trial: https://acrobat.adobe.com/us/en/acrobat/how-to/pdf-to-excel-xlsx-converter.html

I may try that when I get home tonight.

Reply to Gary, guess I should have quoted!

Folacin · February 22, 2019, 3:38pm

Well, don’t I feel silly - that worked and is somewhat obvious now that it has been pointed out. The Word table has all the columns, I can go from there.

Thanks!

bordelond · February 22, 2019, 3:43pm

In Adobe Acrobat (at least as early as 2010’s Version 9), you can text-select all the items in the PDF table and then right-click within the selection and choose Open Table as Spreadsheet. There are also options to Copy as Table and to Save as Table, the latter of which converts the items you selected into a Comma-Separated Values (CSV) file which can be read natively in Excel and other spreadsheet programs.

All of that presumes that the PDF was created by exporting or saving from the original program into a PDF. Scanned-file PDFs will render the table “un-copyable”.

GaryM · February 22, 2019, 3:46pm

OK, I tried using Excell’s Import feature and all I got was the PDF formatting data with lot’s of special characters. So that’s didn’t work, at least with my file that was mostly text anyway.

But as I own the Adobe Acrobat program I tried saving the file for inside Acrobat as an Excell file at that worked perfectly. However it did not carry over any formulas, but they didn’t exist in the PDF so I guess I’m not surprised.

I’m willing to do the conversion for you if you’re comfortable with that.

Folacin · February 22, 2019, 4:12pm

Thanks for the offer. As noted above, Troutman’s solution is working well enough, so I’m good.

Tim_T-Bonham.net · February 23, 2019, 1:30am

A last resort (but one that I had to do once):

Print the pdf.
Use a pencil to make a mark (period or zero) in each of the empty cells.
Scan it back in again, either to Word or as a PDF, but now things should be kept in the proper columns.

Can be a lot of work, so only an absolute last resort. But if the table contains a lot of complicated numbers (where re-keying would create a lot of errors), you might need to do this.

septimus · February 23, 2019, 3:52am

Among last-resort solutions, let me offer mine:
Run the data copied from the pdf through a filter program like


sed "ss       s . sg"

where the number of spaces is set appropriately.

Sed is a rather simple program which was available free-of-charge long ago, before Microsoft Corp. and Adobe Inc. even existed. Sed and its siblings are very powerful: too powerful, I’m afraid, for the Windo$e environment; they’d obviate the need for various $19.95 shrink-wrapped solutions.

Dancer_Flight · February 23, 2019, 6:25pm

I’ve had good luck extracting tabular data from PDFs using tabula. I don’t recall if it handles empty cells particularly well. The interface (runs in a web browser) is a little clunky and it isn’t the fastest thing in the world, but it did well extracting a unified table from a multipage PDF for me in the past. Bonus: free to use and runs on linux, windows, and mac. Output is in CSV format.

yabob · February 23, 2019, 7:05pm

Seconded. I’ve been using this, and it works OK. The documents I’ve been extracting with it had empty cells and it managed.

Topic		Replies	Views
Recognizing & copying tables from PDFs? Factual Questions	5	741	June 13, 2013
is there an unmet but hidden need for parsing text out of pdf files out there? Factual Questions	16	2073	March 25, 2011
Help importing text from PDF to spreadsheet In My Humble Opinion	7	219	July 23, 2022
.pdf to .jpg question Factual Questions	15	2048	January 20, 2015
Adobe and Excel - How to get from one to the other? Factual Questions	13	881	April 3, 2003

Extracting a PDF table

Related topics