Excel and/or XML gurus, I need your help!

For those of you only interested in the technical questions, fell free to skip past the next two paragraphs.

The situation is that I’m using reports generated by a database which I have heavily restricted access to, being an end user. The problem being that the reports have to be gone through manually and double-checked against a third system, but the reports have an atrocious layout and provide way more information than I actually need. (The reports provide 40-50 cells in a row per instance, with between 1 and 222 instances per report and it’s all jumbled together with incomprehensible abbreviations and obfuscating context-sensitive placements - I only need 13 or so of the fields, all of which are either text or 4 to 8-digit numerical.) Bringing in direct outside help is also off the table, since the data is financially sensitive for my employer. (If someone needs it, I can make a simulacra, though.)

I’ve requested a simplification of the report system and it’s being considered, but the system is currently being rolled out piecemeal accross the country and that will take at least another year, so it’s way down there on the list. There’s nobody who’s not currently otherwise engaged to ask for help, so I’m bringing this to the Dope.

And now for the techy bit:
I fortunately stumbled on to a way of exporting the report(s) I need to a local XML file which I can manipulate. Vaguely remembering some similiar ad-hoc tech work from the past, I remember that Excel can read and treat XML files, so I imported it into Excel. This gave me a readable table that I could manipulate by (manually) removing columns I don’t need, changing their order and removing duplicates while still retaining the sort order.

So question number 1:
Is there any way I can retain this Excel table as a sort of template and then ask it to update only the columns I have retained from the XML file? (Which I can just overwrite each time I export the reports.)

Doing this by hand led to consistent occasions where if the comments field had gone over the character limit for the database, it would dump a second (or more) instance identical to the first one, only to fit in the rest of the text in the comment field. This is very undesirable, leading to…

Question number 2: Is there any way to ask Excel to check if the row-cell with the unique instance identifier is the exact same as another row-cell higher or lower in the column and if so, remove the duplicate fields and merge the two (or more) rows?

(I was thinking something like “IF B1=A1 THEN DELETE B1-B12, SELECT A1-13+B1-13 AND MERGE” but then that would screw up the third or more duplicate instances.)

Let it be said that I have only casual and cursory experience with both Excel and XML - or any other programming outside HTML and some miniscule amount of terminal/command line work from the bad old days. As in, I have a snowball’s change in hell of programming this myself in VB/XSD/etc, unless given explicit instructions. Of course, if the only solution is to teach myself XDS - which seems likely - then I’ll try, but I was kind of hoping for a clever ad-hoc solution; doesn’t have to be pretty, only has to work! :smiley:

Thanks!

It sounds like you are asking how to solve a problem that requires programming without programming. I’m also a little confused by what you mean by “instance.”

You could certainly automate a lot of your Excel manipulations with VBA. It would be fairly easy to learn but VBA solutions tend not to be robust and break when the data varies too much.

I think the ideal solution would be to import the reports into your own database and then write SQL queries to produce your own extracts. Unfortunately, this requires some database expertise.

Sorry, I realize this isn’t what you are looking for.

That’s alright, I’m resigned to probably having to buckle down and teach myself the rudimentaries of VBA/XSD/Excel programming in order to accomplish what I want. I was just hoping someone might point me to, for instance, a set of Excel functions like “Advanced remove duplicates” or the like, where I could meddle with the parameters instead of building the code myself.

And by instances I essentially mean rows. To specify, I work in a hotel and what I’m talking about is for instance an arrival report (specifying the name, booking number, rate, comments, requests, persons, etc, etc.) The report should make one row per customer (what I think of as instance but probably has its’ own completely descriptive word that I can’t think of) but doesn’t. It makes two or three, if one of the cells go over the character limit, with all the data being identical except for the spillover text from that cell.

Here’s a fake example to illustrate:



Row 1: Name McNameson - 1 person - Rate $75/night - 1 night - Comments: Guests requests non-smoking ro
Row 1: Name McNameson - 1 person - Rate $75/night - 1 night - Comments: om, preferably with street view.


Actually your specific example, if it really is that simple, could probably be corrected purely in Excel without programming. If the data exists in column A as a long string, split it into multiple columns using “Data -> Text to columns” specifying the delimiter “-”. Then to extract your instances, learn how to use Pivot Tables in Excel. They are powerful.

Note Excel also has a “Remove Duplicates” function.

Ah, sorry, I meant for the - to mark the end of a cell and start of the next one. But I’ll look into the pivot tables. Thanks! (And my report is roughly speaking that simple, though obviously with a lot more values.)

I’ve been thinking about it, but I’m not a great hand at this kind of logic. Here’s what I’ve come up with, something along the lines of:

if(Xn=Xn+1,n+2,n+3)
then concatenate(Yn+Yn+1,n+2,n+3) into Yn
then delete(Xn+1,Xn+2,Xn+3,Zn,Zn+1,…)
then remove row(Xn+1,…Zn+1,…)

Legend:
X=Column with unique identifier
Y=Column with comments (affected by cell overflow, causing the trouble in the first place)
Z=Standin for the columns between X & Y
Concatenate: In the sense of merging two text strings in the specified order.
Delete: Removing information from the cells in the now redundant row.
Remove row: Removing the cells in the row, effectively “lifting” the rest of the table to fill the gap.

Which means this example should go like this:

  1. Starting example:


Row 1: Name McNameson | 1 person | Rate $75/night | 1 night | Comments: Guests requests non-smoking ro
Row 2: Name McNameson | 1 person | Rate $75/night | 1 night | Comments: om, preferably with street view.


  1. Run the first part of the code:
    if(Xn=Xn+1,n+2,n+3)
    then concatenate(Yn+Yn+1,n+2,n+3) into Yn


Row 1: Name McNameson | 1 person | Rate $75/night | 1 night | Comments: Guests requests non-smoking room, preferably with street view.
Row 2: Name McNameson | 1 person | Rate $75/night | 1 night | Comments: om, preferably with street view.


  1. Run the third part of the code

then delete(Xn+1,Xn+2,Xn+3,Zn,Zn+1,…)



Row 1: Name McNameson | 1 person | Rate $75/night | 1 night | Comments: Guests requests non-smoking room, preferably with street view.
Row 2:                            |               |                         |           | 


  1. Run the last part of the code:
    then remove row(Xn+1,…Zn+1,…)


Row 1: Name McNameson | 1 person | Rate $75/night | 1 night | Comments: Guests requests non-smoking, preferably with street view.


And that’s kind of how I imagine it, but I have no basis to believe that’s how it’ll actually work. (And I don’t know how to express the above in Excel, though I’m reading up on it.)

Just saw your last comment - yes, Excel has a remove duplicate function, but it only deletes rows with exact duplicates. Remember that the last column is different, though the rest of the columns are exact duplicates, so it didn’t work. (Though if there’s some way to edit the parametres, that’d be brilliant!)

And thanks for the replies!

So it looks like you need to combine the comments that span multiple rows in addition to preserving your instances. You can do this fairly easily with simple Excel formulas.

To simplify, imagine your instance is entirely held in column A and your comment is in column B.

Create a formula in column C to combine comments as long as they are part of the same instance. Put this formula in C2 and then copy and paste it all the way down column C:

=IF(A2=A1, B1&B2, B2)

Next you need a formula that displays the comment from column C only when the instance changes. Put this formula in D2 and paste it down the column:

=IF(A2<>A3, C2,"")

Then you need a formula that display the instance only when it changes. Put this in E2 and paste it down the column:

=IF(A2<>A3, A2,"")

Now columns D and E have the unique values you need with some blank values interspersed. I think you will be able to take it from there.