Help importing text from PDF to spreadsheet

No Discourse, this is not similar to “Battle of Stalingrad far more important than D-Day in defeating Germans?”.

I have a PDF that I’d like to import into a spreadsheet (Excel online, Excel 2016 Mac, Google Sheets, Apple Numbers are all available). The problem is that the text is arranged in columns in the PDF with a lot of empty columns in each row. The empty columns per row aren’t recognised by the import / paste techniques I’ve been using so far and so the column information is lost.

An example of the PDF formatting is shown below:

So in this example, I don’t have anything that will import a PDF directly (I don’t think I do) so I’m copying the text and importing the text. I’m unable to copy the top header rows because they don’t get selected properly, so it’s just the data rows that I’m selecting. Because empty columns are ignored, the Night Cop and Day Cop column just get merged into one column and the Instrument time is getting concatenated with the adjacent crew names.

I thought I’d solved the problem by using Adobe’s PDF to Excel converter but I eventually realised it had done dumb things like considering two columns to actually be one column but with some of it aligned left and some of it aligned right.

I don’t have a huge amount of data, just 15 pages, and this is a one-off process so I don’t need a long term solution, just a way of getting this data into a spreadsheet without doing a heap of error-prone typing.

So far, so good. Can you tell me in which software you are viewing the PDF? Is it Adobe Acrobat, by any chance? Adobe Reader is a different, less capable program.

I’m unable to copy the top header rows because they don’t get selected properly, so it’s just the data rows that I’m selecting. Because empty columns are ignored, the Night Cop and Day Cop column just get merged into one column and the Instrument time is getting concatenated with the adjacent crew names.

How comfortable are you doing editing and formatting in Excel? You’re suspicion is right – you will have to remake your headers in the target program, but that’s not so bad. Dealing with the vagaries of how the columns end up after pasting/importing is trickier and likely not something that can be automated.

I thought I’d solved the problem by using Adobe’s PDF to Excel converter but I eventually realised it had done dumb things like considering two columns to actually be one column but with some of it aligned left and some of it aligned right.

Looking at your sample data and how it’s formatted … you might do better copying all that data into Notepad or an equivalent stripped-down text editor (it’s doable in something like Word, but a PITA visually unless you know to use a monospace font).

Anyway, to do the column-setting in Notepad (for later pasting into Excel): Note that you have 14 column headers. For Excel to translate plain text into columns, it has to have a delimiting character that Excel can interpret as “start new column here”. Going from Notepad into Excel, the TAB character serves that role natively. Accordingly, since you have 14 columns, each row of your data in Notepad will have to have 13 TAB characters in it (the first column does not need to be delimited).

So. Let’s look at that very first line of data. Not having your PDF in front of me, I’m not sure what it’s rendering when you copy the text out of it, but I am thinking and hoping it would copy out that first row into Notepad like so (ignore the coloring):

02DEC2018 OXL 0437 AKL WLG 01:02 A.YOUNG/SL.FERGUSON

Note that once pasted from the PDF into Notepad, there will be no TAB characters at all between the columns – just spaces. Additionally, there will not be spaces for every column – so wherever there are multiple consecutive empty columns on a row, it will paste into Notepad with only a single space. That’s why there’s only one space ahead of the ‘Other Crew in Flight’ values and not eight spaces.

What you’ll want to do to fix that first row is to get the aforementioned 13 TAB characters inserted into the correct positions. Once that’s done, you can Select All, Copy & Paste the whole shebang into Excel and the columns will set themselves automatically.

So you’ll need to edit that first row in Notepad like so (assuming the string 0437 AKL WLG is how ‘Flight Details’ is rendered and those spaces are supposed to be there). Remember that there has to be a TAB character for the empty columns, too:

02DEC2018[tab]OXL[tab]0437 AKL WLG[tab][tab]01:02[tab][tab][tab][tab][tab][tab][tab][tab][tab]A.YOUNG/SL.FERGUSON

I don’t know to render in Discourse what the end result will look like in Notepad. However, you can practice this “filling in the tabs” technique in Notepad and then paste into Excel as a check that the columns are falling as they should.

I hope this explanation helps a little. This is the kind of thing that’s much easier to show someone than it is to offer written instructions.

That’s helpful thanks, I will try it. I’m on a Mac computer so don’t have notepad but I’m hoping Textedit will work the same way. Just to be clear, you are suggesting that in Notepad only the top line has to be tabbed and all the others will be tabbed appropriately, or will I have to do it for each line?

I don’t have Adobe Acrobat, I have MacOS Preview which is a handy PDF program but not as powerful as Acrobat, and I have the Reader plugin for Chrome.

I’m cool with working with Excel and will have to do a bit of work with the raw data once it’s in there anyway.

I have had a lot of success using Tabula to extract spreadsheet-type data from PDFs.

I think in your case you would need to use “stream” mode and not “lattice” mode in Tabula.

Thanks, I’m trying that now.

By Jove you’ve done it! It looked like it wasn’t going to work initially but it turned out the PDF had become corrupted somehow and a new version of it worked a treat. You were right the lattice wasn’t going to work because the tables don’t have any lines on them.

@bordelond, thanks for your efforts as well, I appreciate the time you took to help me out.

Now I just need to do a bit of Excel work to format the data how I need it. This is for my girlfriend’s pilot logbook, I’m helping her transition to an electronic logbook and this is what we have to work with from the employer’s records.

It didn’t pick up all of the columns, but that doesn’t matter because those columns aren’t used on any of the pages.

Glad that it worked for you!

Now I just need to convert the text duration values into duration values recognised by Apple Numbers so I can add them up.