The example shows a particularly messy record. The Easytrieves for that one are different from the ones I posted yesterday. Files are manually cleaned so they’re nice and pretty. After that, the next file is compared to that file, as in the second program above. Then the next file is compared to that file, and so on. It’s not ideal (companies move), but it’s better than manually cleaning hundreds of thousands of records every month. (Besides, I can always submit a report correction to BMCRA. In 18 years, I’ve never had to do that for the data I send.)
Yes. That’s how you get commercial credit reports.
I tried using brackets, but they didn’t work, so I used < and />. That worked, except it didn’t keep the indentation (I indent the code between IF and END-IF, for example, to make it easier to read). Is a backtick the thing below the tilde?
Good to know.
The IT consultant did mention it was from like 1970, which surprised me.
My point was (as I stated) that I would have to take a class in order to be taught. And for me, it would have to be in-person, as otherwise it would literally be like learning a foreign language online. A lot of people do it, but I need in-person instruction.
I’m on the clock now, so I’m out of time for the moment. I’ll go back and see if I can figure out the logic using your definitions when I have more time. I do appreciate the help, but for me it’s like trying to learn Russian without knowing the Cyrillic alphabet.
After reading through all that code, one thing that occurred to me - because I read so much of it as exception coding - would be to drop the existing code into ChatGPT or whatever, ask it to clean it up and re-comment, and test the cleaned code. I’ve done that with some of my older R programs (I tend to code very sequentially), and it mostly works. That, at least, might be a helpful start to whatever the transition is.
I see all these “Tableau is moribund” comments, and I do think Power BI is slowly and methodically eroding their market share. No, I don’t have data for that, just my conversations with people currently doing dashboarding.
Some clown once told me there were no stupid questions, so I’m going to ask:
Can you go upstream and sort out the formatting of the data at source? If it’s coming to you in a garbled mess, what would it take to get non-garbled inputs and avoid or at least minimise the need for any cleaning through whatever tool?
OK. So if I understand you correctly, most of the data is in neatly formed fields, and it’s mostly a matter of translating those fields to different specific text file formats?
That’s very, very doable, and I concur with everything Stranger’s said so far, then. You can altogether ignore the address normalization part and just treat them as outliers for manual processing.
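For a sense of scale, the field-translation part in a general-purpose language is about as small as programs get. Here’s a toy Python sketch; all the column positions, file names, and the output layout are invented for illustration, not your actual spec:

```python
# Toy sketch: read fixed-width records and re-emit the fields in a
# different text layout. All column positions and file names are
# invented for illustration.
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        name    = line[0:30].rstrip()    # assumed field positions
        address = line[30:60].rstrip()
        city    = line[60:80].rstrip()
        dst.write(f"{name}|{address}|{city}\n")   # re-emit pipe-delimited
```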
Is there really no in-person programming course near you at all? Would videos + online interaction with an instructor be a close enough proxy? It doesn’t have to be Python, by the way. Any modern language can do this with ease, be it Python, Javascript, Java, PHP, C#, or similar. If they still teach Perl anywhere, it’s another language that works very well for text processing, but it’s quite antiquated and makes Easytrieve syntax look neat and readable in comparison. Of the languages you’re likely to find in-person courses for, I’d only avoid C or C++ (for this particular use case).
If it’d help, I can also write you an example program that can probably do the job (or much of it, anyway), if you can provide some sample input & output files and/or a more detailed specification. Your code is mostly self-explanatory, but I’m not 100% sure I understand the overall workflow well enough.
All else being equal, I’d write it in Javascript just because that’s what I’m more familiar with, but if you prefer, I could (with AI help) do Python instead. Either way, you’d be able to see the source code, learn from it later, and/or pass it on to other consultants for further review and changes.
Or can this IT consultant you already have a relationship with write that script for you, or subcontract it out to a coder they already trust? In the time we’ve spent on this discussion, a competent Python coder could’ve probably already written the program.
Do you mean that each company might send you several files, and a human would manually check the first one in the sequence (arranged by what, earliest date)? After that, that one vetted file would become the known-good source of truth for all subsequent files from that same company, matched by (if I remember right from your code) a company code?
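If I’ve understood that right, the mechanical part is small. Here’s a toy Python sketch of the comparison step (the file names, and the assumption that the company code sits in the first six columns, are guesses on my part):

```python
# Toy sketch: flag which records in this month's file differ from the
# vetted baseline, matching on a company code. The file names and the
# assumption that the code sits in columns 1-6 are both guesses.
def load_records(path):
    records = {}
    with open(path) as f:
        for line in f:
            records[line[0:6]] = line.rstrip("\n")
    return records

baseline = load_records("vetted_last_month.txt")
incoming = load_records("raw_this_month.txt")

changed = {code: rec for code, rec in incoming.items()
           if baseline.get(code) != rec}

# Only the changed/new records need manual review; the rest carry
# over from the vetted baseline.
print(f"{len(changed)} of {len(incoming)} records need review")
```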
Sorry, I tried to look up that acronym but couldn’t find it. What exactly are you doing here, like what is the bigger objective beyond “data in, data out”? I hope this isn’t an intrusive question; not trying to pry, just trying to understand what the overall project is and how these particular transformations fit into your overall pipeline/workflow.
I take it this has something to do with credit reporting, but you’re some sort of non-profit in between other companies and the credit bureaus, massaging files prior to submitting them… or…?
If it really is something like that, are there any groups like yours with similar issues? Presumably, these are the same formats and systems used by many such orgs? If so, this might be a good case for an open-source conversion script (or maybe one exists already).
As for popularity, see our whole side thread (especially @DMC’s comments) for the opposite perspective. However, we both think that using Tableau Prep for this wouldn’t be ideal.
I don’t think you should lock yourself into yet another proprietary solution for a simple text file transform.
You don’t need to become an Eighth Dan Blackbelt of Python to do this task. It doesn’t require mastery of elaborate list comprehensions, or the arcane art of metaclassing and factory methods, or how to use NumPy and Numba to do heavy data processing, or making animated 3D projections in matplotlib. All that is required is some very basic I/O and file manipulation, handling strings and doing pattern matching, and some very easy conditional logic and looping. And if this is more than you want to learn to do (I don’t buy that you “can’t”, but you seem to have convinced yourself that it is an unclimbable mountain), then you can lay out a straightforward workflow (preferably in flowchart form) and get someone with basic knowledge of Python to do this for you and produce a minimal set of documentation that will allow someone else to pick it up and modify it as needed, all for a fraction of what you are apparently willing to spend on a tool that is poorly suited and won’t automagically fix your problem.
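To show just how small that toolbox is, here is a toy sketch (the suffix table, file names, and rules are all invented; the real rules would come from your existing EASYTRIEVE logic):

```python
import re

# Toy sketch: the whole skill set in one loop. Read a file, pattern-match
# some strings, write the cleaned result. The suffix table and file names
# are invented; real rules would come from the existing EASYTRIEVE logic.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD"}

with open("raw.txt") as src, open("clean.txt", "w") as dst:
    for line in src:
        line = line.rstrip("\n")
        for long_form, short_form in SUFFIXES.items():
            line = re.sub(rf"\b{long_form}\b", short_form, line)
        dst.write(line + "\n")
```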
Looking back through a multitude of threads you have made on this going back fifteen years(!), you complain about how difficult it is to keep using EASYTRIEVE and how the whole system is perpetually on the verge of failing, and yet you’ve gotten many suggestions on what programming language to use, how to approach the problem, where to look for resources, or how to find someone who can just do this for you so that you don’t have to worry about it and spend hours ‘cleaning’ data. You never seem to try doing any of this and just come back a year or two later with the exact same issue completely unresolved, which just seems like some kind of learned helplessness, making a Sisyphean task out of picking up a small rock and putting it on a table. So, either figure out how or find someone to solve this for you, or go on eBay and find an old 586 machine with Windows 3.11, install EASYTRIEVE on it, sneakernet your data over to it on CDs or 3.5” floppy disks, and resign yourself and your company to the task of maintaining this abstruse and needlessly articulated workflow because you won’t make the effort to solve a basic problem that businesses, and anyone else getting poorly formatted data, have to cope with every day. It really isn’t the difficult, insolvable problem you make it out to be.
I can’t test it for correctness without sample inputs & outputs, though. There are too many edge cases and exceptions to easily reason through (at least for this human).
It’s probably like @DMC said: the “included with your Microsoft subscription” option is “good enough” for many companies, the same way Teams ate away at Slack despite Teams being the Windows Vista of chat… anyway, enough of that hijack.
It’s more a proprietary vs open thing. No reason to use either PowerBI or Tableau Prep or another proprietary system for this unless they already use or need Tableau for other things.
Even among visual no-code/low-code ETL tools, there are open alternatives like https://openrefine.org/ or https://www.knime.com/ that don’t have the same vendor lock-in. Both are free and open source; Knime also has its own (optional) paid plans. OpenRefine has third-party consultants and agencies that can help, like this one, which offers a free online course as well as in-person training: Training & Mentoring | RefinePro
IMHO it’s just safer to go with something more open than Tableau for something like this, especially if the OP wants to be able to pass it on to someone new after a few years.
I want to second most of this comment, but maybe not so harshly.
The thing is, you already got largely the same recommendation back in 2017 (to just rewrite the darned thing in a modern language). Python was explicitly recommended back then too, and it’s even better and even more popular now. And writing, learning, documenting, and testing Python is now easier and cheaper than ever, because the AI chatbots are already so darned good at it and will only get better. Even if you wanted to stay 100% meatbag organic, this is not a hard job for any competent developer or agency. Your code is essentially a giant list of conditionals.
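And a giant list of conditionals ports almost line for line. Something like this, with purely hypothetical record types and rules:

```python
# Purely hypothetical field names and codes, but an EASYTRIEVE
# IF ... END-IF chain maps to plain if/elif almost mechanically.
def clean_address(record_type, address):
    if record_type == "01":
        address = address.upper()
    elif record_type == "02":
        address = address.replace("P.O. BOX", "PO BOX")
    elif record_type == "03":
        address = address.strip()
    return address
```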
If your goal is to find something you personally can maintain without having to learn how to do any coding, then yes, you can pay a consultant to do it in a proprietary tool (though even then I’d suggest looking into open alternatives like OpenRefine or Knime rather than Tableau Prep, unless your company and/or your consultant has an overriding preference for Tableau and Salesforce).
But if your goal is to create something maintainable, and stop having to re-invent this same wheel to solve this same problem every decade, then you should just bite the bullet and invest the (rather minimal) time and money to port the code to a modern language. Otherwise, you or your successor will be back in 2035 asking “I need a newer way to clean up and reformat files than Tableau Prep…”
First of all, let me thank everyone who replied. Your advice is appreciated.
But I started this thread to ask about a specific application, and it’s gone a bit afield. I don’t want to be like the deceased Doper who would ask for advice and then reject every reply. I was hoping to get by on what we have for the rest of my career. With a house, a wife, cats, medical issues, etc., I am just incredibly short of time. That, and the current political situation in the country, have made me very frustrated.
In any case, re: Tableau Data Prep, the message I’m getting is that it’s too old and would lock us into another proprietary system. I’ll pass that information along.
I will address a few posts, and then ask for the thread to be closed.
This would be ideal. I’ve mentioned it to TPTB in the past. 30 years ago I learned how to run Easytrieves, before I learned how to write them. (Back then it was on a mainframe using JCL. Now, I open the Easytrieve app on the computer, create a new program or open an old one, compile (if necessary), and then click Execute.)
Anyway, I like the nerdy kid solution.
Most companies do not have the resources, or they don’t want to spend the money.
Big Multinational Credit Reporting Agency. (I mostly use it in MMP.)
I might phrase it more like, “Tableau is one proprietary option, but it might cause lock-in and price increases further down the road. There are other, more open ETL tools we can consider, like OpenRefine and Knime. Or we can hire A Nerdy Kid™ to convert our code to a more modern language. In any of these approaches, it won’t be an automatic conversion. Someone will have to spend a bit of time re-encoding these rules into whatever platform we choose, but it shouldn’t be a big job.”
It maintains your overall gist of “I just need to solve this immediate problem quickly and affordably” but opens up the possibilities of the “how” a little more.
Just as a point of note, although using the `if __name__ == "__main__":` idiom and putting the driving code in a `main()` function is recommended practice (because it allows you to control program execution for a program that might be called externally or from the command line with flag arguments), it actually isn’t necessary. You can just write code in a completely sequential fashion the same way you do with EASYTRIEVE and either call it from the command line or execute it within an integrated development environment like IDLE or IPython.
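To make that concrete, the trivially small version looks like this:

```python
# The idiom in question: anything under the guard runs only when the
# file is executed directly (e.g. "python myscript.py"), not when
# another script imports it.
def main():
    print("doing the actual work here")

if __name__ == "__main__":
    main()

# Equally valid for an internal tool: skip the guard entirely and write
# the statements at the top level, top to bottom, the way an EASYTRIEVE
# program reads.
```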
In fact, I rarely use this idiom because I’m frequently working in Jupyter Notebook/Lab, which is an IDE similar to Matlab or Mathematica; I just write simple control code in cells, and the functions I regularly use are collected as function or class definitions in a module that I import as needed. The only time I actually write standalone Python scripts called externally is when I need to provide a tool to someone who doesn’t have Jupyter installed, or for a script that is called by another program. So, there isn’t some doctrinal format you have to follow to make a Python program work, and as long as you are just building it for your own internal process nobody is going to be looking over your shoulder and whinging about how not ‘Pythonic’ your code is.
That’s the fundamental issue; there is no turnkey software package you can buy for any amount of money that is going to magically understand what to do with this data without some kind of rules for formatting it. And this is such a straightforward text processing task (even with the variability in how the addresses are formatted), faced by anyone having to deal with data coming from varied sources, that there is probably a piece of code in Internetland written specifically to do this that you can crib from Stack Overflow and that will work for your application with a small amount of modification and trial & error.
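As a taste of the kind of pattern matching involved, here is a toy example (a deliberate simplification; real address data has more cases than any one regex):

```python
import re

# Toy example: pull a US ZIP or ZIP+4 off the end of a messy address
# line. Real address parsing needs more cases than this, but the
# tools are the same.
line = "1600 PENNSYLVANIA AVE NW, WASHINGTON DC 20500-0003"
match = re.search(r"(\d{5})(?:-(\d{4}))?\s*$", line)
if match:
    print(match.group(1), match.group(2))   # -> 20500 0003
```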
Let me start by saying that the name of the product in question is Tableau Prep. It is not called “Tableau Data Prep”. I was confused by the consistent incorrect use of the name of the product throughout this thread. I had to check to see whether there was in fact some product called Tableau Data Prep that I was unaware of.
I use Tableau regularly as part of my job as a data analyst. I can go months where I don’t do anything but Tableau dashboard development and maintenance. I have used Tableau Prep in the past, as my office has licenses for it. I personally do not like Tableau Prep, but that has a lot to do with my preference to write my own code, usually SQL (when I am pulling data from databases) or R. Tableau Prep is primarily used via a drag and drop/GUI interface, as opposed to written code. That would be a plus for individuals without coding experience, but my personal opinion of Tableau Prep is that it is clunky and slow. It is also primarily intended to integrate with Tableau dashboards. I cannot imagine paying for a Tableau Prep license if one is not intending to use it with Tableau dashboards.
As I mentioned, I do a fair amount of my data cleaning, wrangling, and preparation using R. Others have mentioned Python. I have used Python in the past; I’m just more experienced with R. And I’m not claiming that one is better than the other (some people feel very strongly about their preferred programming language). But both Python and R could do what you need to do easily and for free.
Sorry, Johnny, I think the only way to resolve this thread is for you to learn Python and R and Tableau Prep and Javascript, rewrite the program in all of them, and then report back and tell us which one is the best. Which of course would immediately convince everyone, and there would be no further need for discussion.
I can’t speak to the specific tools (I’ve used Tableau as a BI dashboard / reporting tool but tended to go the “nerdy kid route” for ETL / data cleaning).
I’ve also heard the “shouldn’t be a big job” before when it came to refactoring or converting a bunch of legacy code.
Speaking as a consultant who tends to specialize in data issues, it sounds like the OP’s company has a much broader issue in terms of data governance. As I understand it, the OP’s firm is receiving a bunch of Excel files in proprietary formats that need to have customer information standardized and consolidated so it can then be reported on or otherwise consumed. And it’s all sort of cobbled together with whatever ETL / data-feed tool du jour is convenient.
The point is, I wouldn’t underestimate how big a job this could become, given that this seems like a somewhat critical ongoing operation. Is there any sort of central system of record for customer info?
Who will maintain and support whatever solution is created?
How critical is accuracy in matching stuff like address info? (We ran into this problem at an insurance client, where you can’t get it wrong.)
How are you handling concerns about data privacy and security?
Who is going to test and sign off on all this?
I’m not saying the OP needs to spend millions to have Accenture come in and (badly) install some sort of Databricks data lake. But companies can get into trouble quickly by cobbling together a bunch of bandaid fixes.
Yeah, those are great points. It sounds like they don’t have the best system as it is, and introducing more novelty could potentially add more fragility.
Sounds like Johnny’s been dealing with the situation for many years though, and maybe just trying to band-aid it enough to hand over to his successor at some point…?