I have a PDF file that has a table in it something like this:
name|value1|value2|value3|value4
a 1 1 1
b 1 1 1 1
c 1 1
d 1 1 1
I want to get it into an Excel file so I can do some analysis on it (the original data is not accessible to me). COpy/pasting to Excel ended up with every field on a separate line, and missing cells not generating a line
The Google told me to select and copy the table from the PDF and paste it as a table into Word, and then move that table to Excel. This also ended up with the empty cells in the original PDF table not generating cells in either Word or Excel. In other words, the copied table looks like this:
name|value1|value2|value3|value4
a 1 1 1
b 1 1 1 1
c 1 1
d 1 1 1
Since there is no way from the data to assume where the empty cells are, I can’t figure a way to create them with a script.
Any ideas? 380 rows or so, so a lot to do manually. Ideally, there would be some way to copy the null cells (or insert something into the table but the PDF is not editable), but I can’t find anything.
Going directly to Excel in general won’t work. Instead of copy/paste into Word, try opening the PDF from Word. I’ve had decent luck getting it to convert into a table that way, then you can copy/paste from Word into Excel.
Well, don’t I feel silly - that worked and is somewhat obvious now that it has been pointed out. The Word table has all the columns, I can go from there.
In Adobe Acrobat (at least as early as 2010’s Version 9), you can text-select all the items in the PDF table and then right-click within the selection and choose Open Table as Spreadsheet. There are also options to Copy as Table and to Save as Table, the latter of which converts the items you selected into a Comma-Separated Values (CSV) file which can be read natively in Excel and other spreadsheet programs.
All of that presumes that the PDF was created by exporting or saving from the original program into a PDF. Scanned-file PDFs will render the table “un-copyable”.
OK, I tried using Excell’s Import feature and all I got was the PDF formatting data with lot’s of special characters. So that’s didn’t work, at least with my file that was mostly text anyway.
But as I own the Adobe Acrobat program I tried saving the file for inside Acrobat as an Excell file at that worked perfectly. However it did not carry over any formulas, but they didn’t exist in the PDF so I guess I’m not surprised.
I’m willing to do the conversion for you if you’re comfortable with that.
Use a pencil to make a mark (period or zero) in each of the empty cells.
Scan it back in again, either to Word or as a PDF, but now things should be kept in the proper columns.
Can be a lot of work, so only an absolute last resort. But if the table contains a lot of complicated numbers (where re-keying would create a lot of errors), you might need to do this.
Among last-resort solutions, let me offer mine:
Run the data copied from the pdf through a filter program like
sed "ss s . sg"
where the number of spaces is set appropriately.
Sed is a rather simple program which was available free-of-charge long ago, before Microsoft Corp. and Adobe Inc. even existed. Sed and its siblings are very powerful: too powerful, I’m afraid, for the Windo$e environment; they’d obviate the need for various $19.95 shrink-wrapped solutions.
I’ve had good luck extracting tabular data from PDFs using tabula. I don’t recall if it handles empty cells particularly well. The interface (runs in a web browser) is a little clunky and it isn’t the fastest thing in the world, but it did well extracting a unified table from a multipage PDF for me in the past. Bonus: free to use and runs on linux, windows, and mac. Output is in CSV format.