Sorry about the confusion; I’ll try to clarify. I have to be careful to anonymize as much as possible because I’m in a niche industry, and a proper description would identify my company. I consider it a competitive weakness that we have this problem, so I don’t want to highlight it to any industry insiders who might be reading.
The data is flat files. We are slowly, ever so slowly, moving towards XML, which should make this all so much easier. But we’re not there yet.
We provide the data for the entire industry to use. As you can imagine, it’s very competitive, and one of the reasons we publish documentation that outlines how to process the data is so that all players are operating on a level playing field. They can set their own prices and policies to be competitive, but we discourage them, as much as possible, from manipulating the data they buy from us to give themselves market advantages. They occasionally do it anyway, and sometimes they do it because our documentation is as clear as mud. Which brings me back to the question: how can we do this better?
I figure that most data warehouse stuff is fairly standard and easy to process. You don’t need an instruction set to know how to process lists of names, addresses, and most other purchasable data. But our stuff is extremely complex, as I mentioned. To give you an example, let me say it’s tax data. It’s not, but it’s that complex. So we hypothetically have files that contain tax data for every state, county, province, and city in the US and Canada. Our customers buy the data, and because of its complexity and the competition, they need to know how to process it in order to collect the right amount for every sale and remit the right amount to every government agency.
So a hypothetical file will contain these data elements: agency, origin location, destination location, drop-ship warehouse location, point of sale location, exception codes 1-5, and tax amount. In order to “match” the record, which means we want to apply that tax to the sale, we need to be able to match all of those fields to the actual transaction. Does the buyer’s address match the destination location? Is the sale taking place in a location that matches the point of sale location? Does the retailer’s location match the origin location? If it’s a drop-ship situation, does the warehouse match the warehouse location? Are there any exceptions that match exception codes 1-5? If any of those answers is no, then you fail to match that record and go on to the next one.
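To make that concrete, here’s a minimal sketch of the matching test in Python, using the hypothetical file above. Every field name, every type, and the exact semantics of the exception-code comparison are invented for illustration; none of this is our real format.

```python
# Sketch of the record-matching logic described above, using the
# hypothetical tax-data example. All names and shapes are assumptions.
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TaxRecord:
    agency: str
    origin: str
    destination: str
    warehouse: Optional[str]       # drop-ship warehouse location, if any
    point_of_sale: str
    exception_codes: frozenset     # hypothetical exception codes 1-5
    tax_amount: float


@dataclass
class Transaction:
    buyer_location: str
    retailer_location: str
    sale_location: str
    warehouse: Optional[str]       # only set for drop-ship sales
    exceptions: frozenset


def record_matches(rec: TaxRecord, txn: Transaction) -> bool:
    """Return True only if every field in the record matches the transaction."""
    if txn.buyer_location != rec.destination:      # buyer's address vs. destination
        return False
    if txn.sale_location != rec.point_of_sale:     # where the sale takes place
        return False
    if txn.retailer_location != rec.origin:        # retailer's location vs. origin
        return False
    # The drop-ship check only applies when the transaction has a warehouse.
    if txn.warehouse is not None and txn.warehouse != rec.warehouse:
        return False
    # Assumed semantics: every exception on the sale must be one the record allows.
    if not txn.exceptions <= rec.exception_codes:
        return False
    return True


def first_matching_tax(records: Iterable[TaxRecord], txn: Transaction) -> Optional[float]:
    """Walk the file in order; the first record that matches supplies the tax."""
    for rec in records:
        if record_matches(rec, txn):
            return rec.tax_amount
    return None  # no record matched; move on per the documented policy
```

The point is that the prose questions above boil down to a handful of field-by-field comparisons, and stating those comparisons unambiguously is exactly where our documentation falls down.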
Now do this for the entire set of 100-150 different data files that cover every country, state, province, county, and city in the entire world, because this is a global industry. Some of the data is sales tax, some is VAT, some is property tax, some is income tax. You start to see the complexity. My hypothetical starts to break down here, too. But does this help?
I’m thinking it would take a combination of English prose and symbolic logic or flowcharts to clearly explain the processing. It would also be a 30-volume set if printed.
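For instance, the match rule from my hypothetical file could be condensed into a single predicate sitting next to the prose (again, all the field names here are invented):

```latex
\text{match}(r, t) \iff
      t.\text{buyer} = r.\text{destination}
\land t.\text{saleLocation} = r.\text{pointOfSale}
\land t.\text{retailer} = r.\text{origin}
\land (\text{dropShip}(t) \Rightarrow t.\text{warehouse} = r.\text{warehouse})
\land t.\text{exceptions} \subseteq r.\text{exceptionCodes}
```

One line like that per record type, plus prose explaining each term, is the kind of pairing I have in mind.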