Question for data warehouse people

I think this is a long shot, but lemme try anyway. It’s the Dope, I may be surprised. :smiley:

Anybody here work in a company that collects and sells data? If so, do you also provide any kind of documentation to explain how to process that data?

My company does that, and we struggle constantly with one part of it: the documentation. Our data is extremely complex. Our data processing documentation is very old, written in English instead of symbolic logic or pseudocode, and essentially written by committee. We have endless debates about the wording in the documentation. When we write updates, we spend weeks getting that done, then we spend weeks in reviews with a special “review board,” and then more weeks trying to get final approval from the C-suite before we can publish said updates.

And then we get constant complaints and questions from customers and requests to make clarifications, so despite all of that pain, our end product documentation is pure shite.

My manager and I are both new to the company and natural problem solvers, so we’ve been trying to think of ways to solve this problem. I know describing the data processing using pseudocode or symbolic logic may make the documentation longer, but I think it will be clearer to the developers who have to write the systems that consume the data. Whether I’m right or wrong, I’m wondering if some other company has struggled with this or implemented something that works better.

I work for County Gov, and we sell GIS and Assessor data.

You probably already know this, but what I think you want to investigate is Metadata.

Data about data.

I have a bit more time to elaborate.

The GIS data we share is an ESRI/ArcMap product that comes with its own tools to add metadata directly to the GIS layers. The metadata travels with the data.

The Assessor data is a compilation of all (well, most) of the Assessor data in one table. It’s pretty straightforward except for some coded values that we have a data dictionary for. Much easier than what the OP is talking about.

Anyway, for internal use by the IS department we use software called Confluence.

(I am in no way associated with the company other than that we use its software.)

We don’t use this for individual service/help desk calls; it’s more of a knowledge base, a how-to/what-is. It’s sort of like our own wiki. I think you are looking for knowledge base software.

It works great for us, and something like it is what you may be looking for. Since this is an online source for information, your users will always have the latest updates.

Now all you have to do is get rid of the committees that would update it :smiley:

This part confuses me a bit (I manage a fairly decent sized data warehouse department). I initially thought you were asking about a data dictionary, which could be solved via the method enipla outlined. But you speak specifically to the processing of the data, which is rather vague.

Are you using compound fields and want to document instructions for breaking them apart or some other form of ETL instructions, or is it something completely different?

Can you give examples of what you are trying to do?

I third the confusion here. I own a consulting firm that specializes in data warehousing and BI and have been in the field for more than 2 decades.

Are we talking about flat files, XML, or some other format?

Sorry about the confusion, I’ll try to clarify. I have to be careful to anonymize as much as possible because I’m in a niche industry so a proper description will identify my company. I consider it a competitive weakness that we have this problem, so I don’t want to highlight it to any possible industry insiders.

The data is flat files. We are slowly, ever so slowly, moving towards XML which should make this all so much easier. But we’re not there yet.

We provide the data for the entire industry to use. As you can imagine, it’s very competitive, and one of the reasons we publish documentation that outlines how to process the data is so that all players are operating on a level playing field. They can set their own prices and policies to be competitive, but we discourage them, as much as possible, from manipulating the data they buy from us to give themselves market advantages. They occasionally do it anyway, and sometimes they do it because our documentation is clear as mud. Which brings me back to the question: how can we do this better?

I figure that most data warehouse stuff is fairly standard and easy to process. You don’t need an instruction set to know how to process lists of names, addresses, and most other purchasable data. But our stuff is extremely complex, as I mentioned. To give you an example, let me say it’s tax data. It’s not, but it’s that complex. So we hypothetically have files that contain tax data for every state, county, province, and city in the US and Canada. Our customers buy the data, and because of its complexity and the competition, they need to know how to process the data in order to collect the right amount for every sale and remit the right amount to every government agency.

So a hypothetical file will contain these data elements: agency, origin location, destination location, drop-ship warehouse location, point of sale location, exception codes 1-5, and tax amount. In order to “match” the record, which means we want to apply that tax to the sale, we need to be able to match all of those fields to the actual transaction. Does the buyer’s address match the destination location? Is the sale taking place in a location that matches the point of sale location? Does the retailer’s location match the origin location? If it’s a drop-ship situation does the warehouse match the warehouse location? Are there any exceptions that match exception codes 1 - 5? If any of those answers are no, then you fail to match that record and go on to the next one.
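
If it helps, here’s a rough sketch (in Python, purely as illustration, and don’t hold me to the syntax) of what I imagine that matching step looks like on the customer’s side. Every field name below comes from my tax hypothetical, not from our real data:

[code]
# Rough sketch only. All field names come from my tax-data analogy,
# not from our real files.

def record_matches(record, transaction):
    """A record applies only if every element matches the real transaction."""
    if record["destination"] != transaction["buyer_location"]:
        return False
    if record["point_of_sale"] != transaction["sale_location"]:
        return False
    if record["origin"] != transaction["retailer_location"]:
        return False
    # The warehouse only has to match when it's a drop-ship sale.
    if transaction["is_drop_ship"] and record["warehouse"] != transaction["warehouse_location"]:
        return False
    # Every exception code on the record has to apply to the transaction.
    if not all(code in transaction["exceptions"] for code in record["exception_codes"]):
        return False
    return True


def tax_for(records, transaction):
    """Walk the file in order; the first matching record supplies the tax."""
    for record in records:
        if record_matches(record, transaction):
            return record["tax_amount"]
    return None  # no record in this file applies
[/code]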

Now do this for the entire set of 100 - 150 different data files that cover every country, state, province, county, and city in the entire world, because this is a global industry. Some of the data is sales tax, some is VAT, some is property tax, some is income tax. You start to see the complexity. My hypothetical starts to break down here, too. But does this help?

I’m thinking it would take a combination of English prose and symbolic logic or flowcharts to clearly explain the processing. It would also be a 30-volume set if printed.

I’d suggest moving slowly towards JSON instead of XML. Semantically they are very similar formats, but the rise of web development means there are a lot of modern, well-maintained tools for manipulating JSON, whereas the equivalent tools for XML are clunkier.
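
To make that concrete with the OP’s hypothetical fields (the record below is entirely made up), reading JSON is a single library call in pretty much any modern language, Python included:

[code]
import json

# An entirely made-up record using the OP's hypothetical tax fields,
# expressed as JSON instead of a fixed-layout flat-file row.
raw = """{
  "agency": "EXAMPLE-AGENCY",
  "origin": "US-WA-KING",
  "destination": "US-OR-MULTNOMAH",
  "warehouse": "US-NV-CLARK",
  "point_of_sale": "US-WA-SEATTLE",
  "exception_codes": ["EX1", "EX4"],
  "tax_amount": 12.34
}"""

record = json.loads(raw)           # parsing is one call
print(record["destination"])       # US-OR-MULTNOMAH
print(record["exception_codes"])   # ['EX1', 'EX4']
[/code]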

Other than that, I don’t have much additional advice. If the data weren’t complex and clunky, your clients would probably manage it themselves.

Hmmm. I understand that you don’t want to give too many details.

It sounds like you need some commonality between different datasets. A Primary Key.

Your example about a tax DB included location: “state, county, province, and city in the US and Canada.” Now, addresses generally suck as unique identifiers, but it can be done with GIS and geocoding. State, county, or province would be a snap.
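
Just to illustrate the idea (made-up names, and real geocoding is a lot more involved than this), even a dumb normalized composite key gives the datasets something to join on:

[code]
# Toy sketch: build a normalized composite location key so different
# datasets can be joined on it. Real geocoding is far more involved.

def location_key(country, state_or_province, county="", city=""):
    """Uppercase, trim, and join the location pieces into one key string."""
    parts = (country, state_or_province, county, city)
    return "|".join(p.strip().upper() for p in parts)

# Two slightly different spellings of the same place produce the same key.
a = location_key("us", " Washington ", "King", "Seattle")
b = location_key("US", "WASHINGTON", "king", "SEATTLE")
print(a)        # US|WASHINGTON|KING|SEATTLE
print(a == b)   # True
[/code]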

Get rid of the committee, and hire a Technical Writer for your instructions (first things first - to address your 1st post about bad documentation). Put the instructions online in a knowledge base (I know, simple, right?)

Perhaps a GIS specialist too. Everything has a location. It sounds like you need a link or primary key for the data.

Still a little bit lost about what you need to do.

:side note: I feel your pain.

Ultimately you need to document the complete syntax and, more importantly, semantics of all the data you collect and sell. Which ideally includes not only how to use it and its limitations, but how you collected it and the limitations of your collection process.

If indeed it’s something even remotely like tax info, where the data is very real-world messy and has no requirement to be logical except at its utmost granular level, well, that’s just how big and messy the problem is. Here’s a rare good place to use that overused business cliché: It is what it is.

Based on something unrelated I was reading earlier today, this puts me in mind of things like documenting genomic sequences or the bio-reactivity of drug-like chemicals. IOW: nasty, goopy, intricate, and partly-understood biology & chemistry.

Ultimately your business needs to make a business decision. Which may be conditioned by some combo of ethics and of legal liability.

It does no good to produce documentation so detailed that only your own PhDs can use it. Ultimately your business mission is to take some goopy reality and comb it into some semblance of order. So more simple-minded organizations can build their simpler processes on your detailed work.

Which means your basic documentation needs to be pitched at the level of your intended consumers. With asterisked caveats and links to more in-depth documentation only to the degree it’ll pay your business to create that additional detail. Again with “pay” meaning both in terms of sufficient fee for service income and also avoidance of jail & lost lawsuits when somebody relies on a simplified understanding of your product to their detriment.

IMO, questions of technological format, e.g. XML vs. JSON, are about 5 layers of meta too close to the metal. At least for what I understand the OP to be asking. Get your big picture right and the rest becomes doable. Ideal tech applied in the service of wrong and/or muddled goals is still GIGO.

LSL, you going Zen on us? :slight_smile:

This reminded me a lot of some projects from hell during my days as a CTO for a software dev company. They’re not happy memories.

The business didn’t even remotely know what it wanted or what the time and money budget was. IOW it had no metric to tell shit from shinola from gold. We built some beautiful tech they hated for no reason and built some utter shite they loved for equally no reason. We couldn’t even see a pattern looking *post facto* at what sold and what didn’t.

That way lies insanity. And near 100% personnel turnover. And scotch. Lots of scotch.

Heh.

We (the GIS department) are trying to recreate (with GIS) an analysis for a wildfire protection plan from 10 years ago, to see what has changed.

All the data from ten years ago was collected by people who no longer work here. A lot of it was created by committee or by a ‘windshield survey’ (drive around and tell us what you think). There is no documentation, nothing. 11 gigs of data to try to reverse engineer.

I do love people who think with their heart, but it’s very difficult to code.

GIS is now trying to compare and model opinions from people that left years ago to today. We have plenty of hard current data, but the previous data is completely undefined.

It’s frankly pointless, and I think we should start from ground zero.

Despite the fact that the previous post was about XML vs. JSON, I heartily endorse this message. If you don’t know what queries your customers are going to want to make of this data, then you don’t know enough to proceed.

Frankly, I’d expect a data provider to also provide a reference implementation of how to perform the most important queries. No documentation is as complete as source code that runs (although there should be clearer spec-type documentation as well).

I second, third and fourth pretty much everything that LSLGuy posted.

It also appears that your format is the actual or at least de facto industry standard. If so, that’s a whole different kettle of fish. Most people in that situation work with the various vendors in their industry so that everyone gets on the same page. That makes bi-directional data processing much more seamless, as the larger industry specific app developers have a compelling reason to get on board. If that is already in place, but you are just looking to see how others put together the documentation, one of the industries whose data we have to consume is MITS, and here is the documentation for their XML files. Another example would be the Geography Markup Language, the documentation for which can be found here.

While I too consider it a different issue, regarding the JSON vs. XML debate it should be noted that while JSON is widely supported in the open source world, it is less so in the enterprise RDBMS world. While there were painful and slow ways of working with it in earlier versions, it was first natively supported in Oracle about a year ago, and many Oracle customers don’t upgrade frequently, thus making XML a better choice there (for now). SQL Server is in a similar boat, as native JSON support first arrived in the 2016 version. Both have supported XML natively for quite a while.

So, it sounds as if you have a collection of data that you provide, but the organization and structure of the data is unclear, inconsistent, or just inherently non-homogeneous, and you are looking for a way to structure the data so that a customer who buys access can take it and perform analytics on it. In this case, it is clear that you need to develop associated metadata that is defined and structured in such a way as to make it useful to your customer base.

It sounds as if you need someone with expertise in data science, and specifically someone with experience in the industries you sell data access to, in order to define what this metadata looks like and how it is formatted. If your data were a cookbook, listing all recipes that have measures of “tbs” might be essentially worthless, but a customer might want to search on all recipes that use cardamom, sort them by cuisine or accompanying constituents, and then run statistics to see how frequently those recipes are used relative to others, and by whom. A data scientist who is knowledgeable about Indian and Persian cuisine would understand how to organize the data and what format to put it into in order to make it useful for your customers, and may also note the surprising fact that while you may think you should market your data to Indians and Pakistanis, there is also a substantial market in South America and the Nordic countries.
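
To put a trivial, entirely hypothetical face on that (sticking with the cookbook analogy), the metadata itself becomes something structured that a customer can read and query:

[code]
# Entirely hypothetical data-dictionary entry for one field in the
# cookbook analogy. The point is that every field carries a structured,
# machine-readable description, including how it was collected and its
# known limitations.
field_metadata = {
    "name": "primary_spice",
    "type": "string",
    "allowed_values": ["cardamom", "cumin", "saffron", "dill"],
    "description": "Dominant spice in the recipe, as coded by an editor.",
    "collection_method": "manual coding from the published recipe text",
    "known_limitations": "entries before 2010 were coded by a different team",
}

def is_valid(value, meta=field_metadata):
    """Validate a value directly against the data dictionary."""
    return meta["type"] == "string" and value in meta["allowed_values"]

print(is_valid("cardamom"))  # True
print(is_valid("paprika"))   # False
[/code]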

It also sounds as if you need to take the responsibility for organizing and making decisions about how to document this data out of the hands of your überlords and have someone come along and do it correctly and then submit the results. Again, this is what data scientists do; they make kernels out of mush so that the end user can take the information and use it to make decisions.

Stranger

Stranger, you’re very bright. But sometimes it’s a bit much. You can’t give an anonymous person a business plan based on a few posts.

Been thinking about this. It was inappropriate and I would like to apologize.

I will lend my support to the idea that, if appropriate, a reference implementation of a basic system would be a really good start. In many ways.

To be a reference implementation, it actually has to work. There is no room for ambiguity: if it doesn’t actually meet the specification and execute correctly, it isn’t right. This is a huge win. Pseudocode is a huge step back, as it is rare to find pseudocode references that are bug-free or even make sense in some cases. The reference implementation must include a significant test suite, something that both keeps it honest and allows further understanding and disambiguation.

The implementation need not be in any way efficient or slick. But it needs to be understandable, copiously documented, and maintained. Choice of implementation language should attempt to produce the most easily understandable, clean code. (Which lets things like C++ right out.) But the domain should suggest the language. (I’m sure many people will think Python might be the most obvious, and as a default with no other information I would agree. However, you may need a language with a much more prescriptive type system.)

Coding a working system focuses the mind like nothing else. It must work, and in making it so you will inevitably turn up misunderstood issues, conflicts, downright contradictions, and edge cases ad lib. Which is also why a comprehensive test set is important. It should include every edge case, and also test for all the (now fixed, we hope) contradictions and conflicts.
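
As a sketch of what that looks like in practice (Python, with a tiny stand-in matcher and the OP’s hypothetical tax fields, none of it real): each disambiguated rule becomes an executable test, so the documentation and the behaviour can never silently drift apart.

[code]
# Sketch only. match_record here is a tiny stand-in for a real reference
# implementation; field names come from the OP's hypothetical tax example.
import unittest


def match_record(record, transaction):
    """Stand-in matcher: the warehouse only has to match on drop-ship sales."""
    if record["origin"] != transaction["retailer_location"]:
        return False
    if record["destination"] != transaction["buyer_location"]:
        return False
    if transaction["is_drop_ship"] and \
            record["warehouse"] != transaction["warehouse_location"]:
        return False
    return True


class DropShipRules(unittest.TestCase):
    RECORD = {"origin": "US-WA", "destination": "US-OR", "warehouse": "US-NV"}

    def test_warehouse_must_match_on_drop_ship(self):
        txn = {"retailer_location": "US-WA", "buyer_location": "US-OR",
               "is_drop_ship": True, "warehouse_location": "US-CA"}
        self.assertFalse(match_record(self.RECORD, txn))  # wrong warehouse

    def test_warehouse_ignored_when_not_drop_ship(self):
        txn = {"retailer_location": "US-WA", "buyer_location": "US-OR",
               "is_drop_ship": False, "warehouse_location": "US-CA"}
        self.assertTrue(match_record(self.RECORD, txn))


if __name__ == "__main__":
    unittest.main()
[/code]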

Documentation can in many cases be dictated by the reference implementation. The very process of designing the reference will, in a brutal manner, drive thinking about how the system works and how it can be described.

Designing a reference will almost certainly require the services of a data science specialist. It is clear you almost certainly don’t have the expertise in house right now. The trick is that a goal of a reference implementation system can drive the process towards an end that is much less mushy than alternatives. Done right it should significantly help all your customers, and, if possible, improve your market.

I didn’t take offense – I assumed it was meant in jest – but just to be clear, I wasn’t trying to promote myself or some service I might offer (I’m a mechanical engineer, not a data scientist), but rather to point out that the problems the o.p. was facing are also a subset of what a data scientist does, and it may be best to seek out someone with training in information handling methodology and specific knowledge of the industry the o.p.'s company provides data to. Of course, a data scientist does more than just organize and label information; they also perform visualization and trend analysis to identify significant trends, and then perform detailed statistical analysis to develop quantitative estimates, generally to predict future behavior of a market or natural phenomenon. But someone who can represent the applications that the end user will put the data to will also be able to provide insight into how to organize data sets and develop a metadata representation of them.

Stranger

I think LSLGuy was closest to what I was asking. Some of you sound like you’re suggesting ways to implement the data processing solution. I and my team are business analysts, and the only thing we have control over is documenting the use of this data, not any actual programming solutions.

But all of the answers so far have given me a few thoughts:

[ol]
[li]We are probably unclear on who the audience for the documentation is. Developers certainly. Possibly non-technical analysts as well, although I’m not entirely sure about that. More on that in a bit…[/li][li]The data is non-homogeneous, unkeyed and unindexed. That’s likely causing the larger portion of the difficulty. While stored in our database, it is indexed. And when customers buy the data, they upload it into their database and index it in whatever way makes sense to them. After that, though, to USE the data, there are no keys to work from. It’s entirely a matter of matching each of the elements (or fields, cells or columns, depending on your frame of reference) to a real-world data element in order to find the precise data you need for that exact transaction (rough sketch just below this list).[/li][/ol]
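
A rough illustration of point 2, again using made-up names from my earlier tax hypothetical (and again, only as illustration; I’m an analyst, not a developer): a customer can build whatever index they like on one element, but after that it’s still element-by-element matching against the real transaction.

[code]
# Rough illustration using made-up field names from the tax hypothetical.
# There is no key in the data itself; the customer builds their own index
# on one element and still has to match the remaining elements by hand.
from collections import defaultdict

records = [
    {"agency": "A1", "destination": "US-OR", "point_of_sale": "US-WA",
     "exception_codes": [], "tax_amount": 5.00},
    {"agency": "A2", "destination": "US-OR", "point_of_sale": "US-ID",
     "exception_codes": ["EX2"], "tax_amount": 7.50},
]

# The customer's choice of index: group records by destination.
by_destination = defaultdict(list)
for rec in records:
    by_destination[rec["destination"]].append(rec)

txn = {"buyer_location": "US-OR", "sale_location": "US-WA", "exceptions": []}

# Even with the index, the remaining elements are matched field by field.
for rec in by_destination[txn["buyer_location"]]:
    if (rec["point_of_sale"] == txn["sale_location"]
            and all(c in txn["exceptions"] for c in rec["exception_codes"])):
        print("matched, tax =", rec["tax_amount"])  # matched, tax = 5.0
        break
[/code]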

More on the audience question: We know that our customers have their own development teams who work on uploading the data into their databases and create the programs that process it according to our processing documentation. Strangely, whenever there are questions or problems, we never hear from their developers directly. Their analysts and/or managers are the ones who pass along issues and questions to us. This is a very odd situation, I think. I’ve worked in other industries, in client-facing and vendor-facing roles, and have never seen this particular interface. It forces the analysts to think technically, and because of the complexity of the entire dataset, it might be one source of confusion (a telephone-game scenario). I don’t have control over that, though, and wouldn’t have the first idea of how to change it - it’s the ENTIRE industry that does this.

So… coming back to the original question: if you had to document how to process a very large, messy, complex database without using a computer programming language (the audience is definitely human), how would you do that?