Easy ETL for old-fashioned multi-record sets??

I’ve now been out of IT long enough that all my tools and skills have kinda rotted or become mostly obsolete. So I’m looking for advice on a modern, code-lite way to load a simple database from text files. Note I said code-lite, not necessarily code-free. I’m in a Windows environment.

I receive a plain text file that is essentially old-style punch-card images, but wider. Each record set consists of a single header record in one fixed-field layout followed by n detail records, where n is on the order of 2 to 15. All detail records share the same field format, but other than a switch character indicating header or detail, the detail format has no fields in common with the header format.

Each data file consists of just a few thousand record sets, so maybe 50K source records total. Tiny by modern standards.

The goal, of course, is to set up a relational database with one table for header records and one for detail records, with appropriate field types. Then feed the data file into an ETL tool to load the database. There will be some values that need to be propagated from header to detail, such as the record set ID to provide a foreign key. There are also some values that will need to be generated, such as a sequence number for each detail record within a set, because sequence is significant and in the original data it’s indicated only by the physical sort order.
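For scale, the split/propagate/sequence part really is only a screenful of Python plus SQLite, even without an ETL tool. This is just a sketch under made-up assumptions: the switch character is in column 0 (`H`/`D`), the set ID sits in a hypothetical position (columns 1–8 of the header), and the raw record text is kept whole rather than cut into real fields:

```python
import sqlite3

# Hypothetical layout: column 0 is the switch char ('H' or 'D');
# the header's set ID is assumed to be in columns 1-8. Adjust to taste.
def load(path, db):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS header (set_id TEXT PRIMARY KEY, raw TEXT)")
    con.execute("""CREATE TABLE IF NOT EXISTS detail (
                     set_id TEXT, seq INTEGER, raw TEXT,
                     PRIMARY KEY (set_id, seq))""")
    set_id, seq = None, 0
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if line[0] == "H":                 # header: start a new record set
                set_id, seq = line[1:9].strip(), 0
                con.execute("INSERT INTO header VALUES (?, ?)", (set_id, line))
            else:                              # detail: physical order -> sequence number
                seq += 1
                con.execute("INSERT INTO detail VALUES (?, ?, ?)", (set_id, seq, line))
    con.commit()
    return con
```

The foreign key and the generated sequence number both fall out of the single pass over the file, which is the part most point-and-click tools struggle to express for multi-record-type files.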

Several fields are sorta-encoded where I need to do things like convert integers to dates in non-standard ways, and do other semi-simple twiddly transformations. I’d like to do as much of that through the ETL tool as possible although I could handle many of those in a second pass on the SQL side using some UPDATE queries before starting analysis.
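Since the actual encodings aren’t specified here, this is a purely invented example of what one such twiddly transformation tends to look like — a YYDDD-style integer date (e.g. 24123 meaning day 123 of 2024). The real encodings will differ, but each one is typically only a few lines whichever side of the ETL boundary it lands on:

```python
from datetime import date, timedelta

# Purely hypothetical encoding: YYDDD, i.e. 24123 = day 123 of year 2024.
def yyddd_to_date(n: int) -> date:
    yy, ddd = divmod(n, 1000)
    return date(2000 + yy, 1, 1) + timedelta(days=ddd - 1)
```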

The other fun tidbit is that each header record has a bitmask which indicates how many instances of itself it represents in the real world. It’s actually a quasi calendar where the activity specified by one header and its details will actually be enumerated out into x distinct instances in the real world. Think like a school class schedule, where the header record for class ID 1234 specifies via bitmask that it meets on these specific 27 dates across the 90-day semester. That’d be 27 1-bits in some arbitrary pattern within a 90-bit mask.
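The bitmask-to-dates expansion is the kind of thing that’s painful in pure SQL but trivial in a script. A sketch, assuming (purely for illustration) the mask arrives as a string of '0'/'1' characters where bit i means day i of the term, and the term start date is known from elsewhere:

```python
from datetime import date, timedelta

# Assumptions: mask is a '0'/'1' string, bit i = day i of the term,
# and term_start is known. Adapt to however the mask is really packed.
def mask_to_dates(mask: str, term_start: date) -> list[date]:
    return [term_start + timedelta(days=i)
            for i, bit in enumerate(mask) if bit == "1"]
```

Each resulting date becomes one row in a header-instance table keyed on (set ID, date).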

At the end state we need to be able to analyze the data at the holistic header or enumerated header-instance level. We also need to analyze the detail data the same way: per detail or per detail-instance.

My current thoughts are to have a separate header-instance table but I’ll need a way to load it given the appropriate conversion function from the bitmask to the instance dates. That too could be done in a second phase wholly within SQL if necessary, but that’s the high-effort way.

So that’s the specs.

I could certainly bodge this all together in python or any other scripting language, dicing and slicing strings pulled from txt files one row at a time. But I’d rather not recapitulate the last 40 years of my intermittent IT career … again … if there’s now a tool to do most of this translation definition with point-and-click ease. Or even a good ETL library with actual accurate documentation and fewer bugs than the stuff I’d whip up from scratch.

None of this needs to be accessible from the outside world. The whole thing could be cloud based or local PC based; whichever has better tool support works for me.

Unlike the old days I don’t have free access to anything commercial. So I’d be looking for free or lowish cost tools. e.g. I don’t mind paying a few bucks for an Azure instance now and again but I’m not buying a SQL Server site license. If Google or whoever has something for free I’ve never heard of, that’d be ideal.

At a high level this will be a fun project. Down in the weeds it just looks like drudgery and reminds me of my early days slinging COBOL on punchcards. I’m gettin’ too old for those headaches.

Ideas / Suggestions / Thoughts / Laughter?

Must it be relational? My first thought is that you could do a great deal of what you’re looking for with NoSQL, treating each file as a document. The Cosmos DB that comes with Azure is an obvious choice (since you mentioned Azure).

That said, you don’t really say what the end game is; if you are looking to use a report writer that requires a relational DB then NoSQL wouldn’t do it for you.

You might look at SQL Server Integration Services (SSIS). To be honest I don’t know what is available for SSIS in the cheaper/free versions of SQL Server and Visual Studio, so it may be beyond your budget. But it is perfect for exactly the kind of file import you are trying to do.

Thank you both. To your points …

The end goal is various reports & graphs about the “shape” of the details and how they fit together on the headers. The idea is to begin to see the forest instead of 50,000 individual trees times ~10 instances per tree. Find the outliers, find the centroid, etc. in a bunch of different attribute spaces. Many attributes are properties of the detail records, or sums of detail records within a header, while other properties are strictly properties of the header and not derivable from the details. Last of all, there are some properties that effectively vary by instance due to scheduling issues. Think of the difference between the 4pm class on Monday and the 4pm class on Friday. Same class and curriculum, but which would you rather attend?

Once we’ve explored and massaged the data enough to understand that “shape”, the next step is to try to empirically develop various parameterizations that give us a measure of quality or fitness of the header-instance level items in some more holistic sense. Which eventually leads to a partial ordering or sorting of the header-instance items from “best” to “worst”.

There are pretty clear ideas of what the various measures of fitness for individual detail attributes are, and how they might quasi-sum on a per-header basis. There’s no understanding yet of how much chasing attribute X means getting sucky results on attribute Y, so the overall metric of the fitness space isn’t well understood. We’re hoping to develop that understanding.

So that’s the goals. Now to the tools:

No, I don’t think relational is the only possible arrangement of the data. We could certainly produce a flatfile sort of thing that was the outer join of the header & detail records, then replicated per instance with sufficient extra fields to uniquely identify each instance of header and instance of detail. Depending on the report-writing approach, whether we store the data this way or store it relationally and report it from joined and enumerated querysets is more a matter of style or convenience than one of substance. I have zero experience with NoSQL. It might well be the right tool for the job; I’ll sure take a look.
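For what it’s worth, the “store relationally, report flat” option is one view away. A sketch against a hypothetical minimal schema (table and column names invented here), using SQLite just because it’s free and local — the view fans each detail row out once per header instance:

```python
import sqlite3

# Hypothetical minimal schema; the view is the flat, enumerated shape.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE header (set_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE detail (set_id TEXT, seq INTEGER, item TEXT);
CREATE TABLE header_instance (set_id TEXT, inst_date TEXT);

-- One row per (instance, detail): the flatfile view, computed on demand.
CREATE VIEW detail_instance AS
  SELECT h.set_id, hi.inst_date, d.seq, d.item, h.title
  FROM header h
  JOIN header_instance hi ON hi.set_id = h.set_id
  JOIN detail d ON d.set_id = h.set_id;
""")
```

One header with 2 details and 3 instances yields 6 rows in the view, so the denormalized copy never has to be materialized unless a report writer insists on it.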
In my IT life we were a pure Microsoft shop, so I was pretty good with SQL Server, VS, .Net, and the rest of their tool stack. I still miss my MSDN Universal. I had a nodding acquaintance with SSIS; it just wasn’t at the center of our effort space then. IIRC in my era it was always a member of the more expensive SKUs, so I’d not expect to find it in the (near) freebies. But I’ll sure look.

I have more than zero experience with NoSQL, but I’m far from an expert. From what you’ve described, it seems like the data will be easy enough to get into a document-based NoSQL solution; getting out the info relevant to your needs is a little more hazy to me. You can definitely query NoSQL data, but how hard the queries will be to write in order to get your data out is less clear to me. Perhaps someone with more experience will come along; alternatively there’s always StackOverflow.