I’ve now been out of IT long enough that all my tools and skills have kinda rotted or become mostly obsolete. So I’m looking for advice on a modern, code-lite way to load a simple database from text files. Note I said code-lite, not necessarily code-free. I’m in a Windows environment.
I receive a plain text file that is figuratively old-style punch-card images but wider. Each record set consists of a single header record in one fixed field layout followed by n detail records. The n is on the order of 2 to 15. All detail records are in the same field format, but other than a switch character to indicate header or detail, the detail format has no fields in common with the header format.
Each data file consists of a just few thousand record sets, so maybe 50K source records total. Tiny by modern standards.
The goal, of course, is to set up a relational database with one table for header records and one table for detail records with appropriate field types. Then feed the data file into a ETL tool to load the database. There will be some values that need to be propagated from header to detail, such as the record set ID to provide a foreign key. There are also some values that will need to be generated, such as a sequence number for each detail record within a set because sequence is significant and in the original data it’s indicated by the physical sort order.
Several fields are sorta-encoded where I need to do things like convert integers to dates in non-standard ways, and do other semi-simple twiddly transformations. I’d like to do as much of that through the ETL tool as possible although I could handle many of those in a second pass on the SQL side using some UPDATE queries before starting analysis.
The other fun tidbit is that each header record has a bitmask which indicates how many instances of itself it represents in the real world. It’s actually a quasi calendar where the activity specified by one header and its details will actually be enumerated out into x distinct instances in the real world. Think like a school class schedule, where the header record for class ID 1234 specifies via bitmask that it meets on these specific 27 dates across the 90-day semester. That’d be 27 1-bits in some arbitrary pattern within a 90-bit mask.
At the end state we need to be able to analyze the data at the holistic header or enumerated header-instance level. We also need to analyze the detail data the same way: per detail or per detail-instance.
My current thoughts are to have a separate header-instance table but I’ll need a way to load it given the appropriate conversion function from the bitmask to the instance dates. That too could be done in a second phase wholly within SQL if necessary, but that’s the high-effort way.
So that’s the specs.
I could certainly bodge this all together in python or any other scripting language, dicing and slicing strings pulled from txt files one row at a time. But I’d rather not recapitulate the last 40 years of my intermittent IT career … again … if there’s now a tool to do most of this translation definition with point-and-click ease. Or even a good ETL library with actual accurate documentation and fewer bugs than the stuff I’d whip up from scratch.
None of this needs to be accessible from the outside world. The whole thing could be cloud based or local PC based; whichever has better tool support works for me.
Unlike the old days I don’t have free access to anything commercial. So I’d be looking for free or lowish cost tools. e.g. I don’t mind paying a few bucks for an Azure instance now and again but I’m not buying a SQL Server site license. If Google or whoever has something for free I’ve never heard of, that’d be ideal.
At a high level this will be a fun project. Down in the weeds it just looks like drudgery and reminds me of my early days slinging COBOL on punchcards. I’m gettin’ too old for those headaches.
Ideas / Suggestions / Thoughts / Laughter?