Using Excel, can I extract specific items from this terrible data file?

Disclaimer: nobody needs to do my homework for me. I just want to know if what I want to do is even possible to automate in Excel, or if I’m going to have to do it by hand.

I have a terrible, terrible report designed by a sadist, that exports into XLSX files with single cells containing variable data like this. The blank lines are line breaks (soft returns):

CELL 1:
Country: United States

State: California

City: Fontana

Governance: Council-Manager

CELL 2:
Country: United States

State: California

County: San Bernardino

County Seat: San Bernardino

CELL 3:
Planet: Earth

Country: United States

State: California

State Flower: California Poppy

State Capital: Sacramento

My job is to extract the bolded items, which might be cities, counties, or states. There is other info that might or might not appear, such as state flower. Cities and counties will never appear together in the same record, so if I see either of those, I want the characters that follow either "City: " or "County: ", starting with the first letter after the space and stopping when a line break is reached (or the end of the record in that cell, example not shown, but it’s possible.)

State will always appear, but if neither City nor County is present, then I want the state. So out of the examples above, I want Fontana, San Bernardino, and California.

Any of these items might appear anywhere, as in the CELL 3 example where I want the state. The appearance of “Planet: Earth” puts the State on a different line than in the other examples, so I can’t assume I can grab a particular line in any case.

Is there any set of formulas that could achieve this? I’m no expert but I’m willing to study and learn.

You can split the content of 1 cell into multiple cells. Data → Text to Columns →

Without seeing more data, I can’t be sure, but if it all looks like your example, it’s a hierarchy. Cell 1 is a child of Cell 2, which is a child of Cell 3. If that holds, the job becomes much easier to tackle.

Oh no, that’s unfortunately not the case. These could have been shown in any order.

Power Query can help you here!

Well, not much to go on so far. You’ve shown three layouts and given us some constraints that make it appear hierarchical in nature. If you were to count combinations of fields, such as Type 1 has Country, State, City, Governance. Type 2 has Country, State, County, County Seat, etc., how many different types are there?

For all practical purposes, there are an infinity of types. State always appears. City or County may appear (but never together). A million other random things might appear. Generally there are 3 to 6 items in nearly all cells, but other than State, I have no way of predicting what the items might be, or how many of them.

In what format is the underlying report ? It sounds like it’s being ported over into Excel. What’s the source file look like ?

How much data are we talking about here ?

One hard lesson I learned years ago: sometimes the inelegant solution is the best solution. If there isn’t that much data, and there’s no other source for the data that is much easier to work with, you may be best off just getting busy and cleaning up this shit-show manually.

“May be …”

Out of your example above, do you want Fontana, San Bernardino, and California to be related (since they are)? Or could it be Atlanta, Los Angeles County, Michigan?

I do not know the format of the underlying report. It’s from a fairly uncommon ERP system that I don’t want to name (potentially personally identifying). I run reports in this ERP, then export To Excel (the only other export option is PDF).

We are talking about five or six thousand rows in my Excel export, but I can pre-filter and get the relevant rows down to a few hundred. Thus for one-off work, it’s a task that I could do by hand in several hours, but ideally I’d like to be able to perform this task frequently in the future.

No, it could be Atlanta, LA County, and Michigan. I should have made up my examples better.

Do you have any experience with parsing text files? Save the file as a CSV and then you should have a text file with all that info. Then you can use your text parsing skills to extract what you need.

Depending on what kind of wizardry others come up with, it’s very possible that the ERP derives all this data from tables, and that finding another way to export it is neat and clean, if not easy.

Do you have IT resources – either in-house or from the ERP provider – that can tell you if .XLSX and .PDF are really the only two options ?

I’m not sure I’d do this (^) for a one-off, but if you want to do this repeatedly, and with large amounts of data, it may be worth a couple of phone calls, emails, or online searches (trying to figure out if there are non-obvious alternative formats in which you can generate reports from the ERP).

No, but I’ll look into it, thanks! I’m sure those would be great skills to have.

=TRIM(MID(A1,SEARCH($G$5,A1)+LEN($G$5),16))

A1 is the first box of data. The ($G$5) was a random box where I wrote State:. The 16 can be changed, it is the number of characters you want after State:. It’s messy as hell. You could set it for the longest number of characters possible and then clean it up. You could set it for the first three letters (if that works) and get a three letter “state code”. It wouldn’t replace manually doing it but it might help.

The short answer is that I am almost positive this can be done. It will require multiple nested functions including IF, SEARCH, ISNUMBER. a basic example would be to say (and I am writing this in plain language to make it easier to understand)

IF a SEARCH of CELL A1 for the string “County” returns value of TRUE from the ISNUMERIC function, then use SUBSTRING with the value from the search expression as the first character, and grab from that point to the next soft return character. ELSE search for “City” similar to above, ELSE search for state, else print "OH BUGGER "

I purposely didn’t give it all away there, you will need to tweak the idea a bit, but if you can parse what I have said above, you will make it through the rest. Will probably need to search for the ascii value for the soft return.

If you exported it to a CSV you could likely write a powershell script to recognize & extract the data.

If you’re not familiar with powershell though it may take you longer to write the script than it’d take to do it by hand.

Will a reasonable CSV be created absent – AFAICT – clean delimiters ?

These are the lines I was thinking along, in a vague kind of way. Thank you for naming those functions.

I was able to accomplish what you want (I think) by having Excel do the following:

  1. First look for whether “City:”, “County:”, or “State:” was in the text.

  2. Use that information and nested IF functions to determine what the “bold” text is that should be extracted, prioritizing City or County over State.

  3. Here, I removed all the line breaks from the data using a simple Find and Replace and replaced every line break with a series of XXXXXXXX, a string that I assume will not show up in the actual text. So now, no more line breaks. This also lets me determine where line breaks were in the original text.

  4. Determine where to start extracting text using the FIND command, searching for “City:” “County:” or “State:” based on Step 2.

  5. Use the FIND command again to find the first set of XXXs AFTER the appropriate City/County/State text - this allows me to determine the length of the text to be extracted.

  6. Then I used the MID command to extract text starting immediately after City/County/State and extracting the number of characters based on the length determined in Step 5.

My file is not pretty, but I’m happy to share it: