Tableau Data Prep

I’ve just been in a meeting with our company president and our IT consultant. You all know that I’ve been wanting a newer way to clean up and reformat files than Easytrieve. Python has been suggested, and the IT consultant looked at an Easytrieve and said he could see how Python could work. He suggested Tableau Data Prep. He said he’d talked to people who used it, who said, ‘Where has this been all my life?’

I’m on the clock for five more minutes, so I haven’t looked at it yet. Has anyone here used Tableau?

For reference, my Easytrieves sort current and previous files, compare the sorted files, write the previously-cleaned name and address information along with the current balance and aging amounts to an output file, and write non-matched current records (and sometimes non-matched previous records) to another output file. The clean output file is then run through an Easytrieve that writes output files in as many as three formats (for going to different places).
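
In outline, the compare step is something like this (shown as a rough Python sketch; the file names, field names, and the CSV assumption are placeholders, not my real layouts):

```python
import csv

# Rough sketch of the compare step: match current records to previously-cleaned
# records by account number, carry the cleaned name/address forward with the
# current balance/aging amounts, and set aside anything that doesn't match.
# File names and column layout are placeholders, not the real formats.

def load_by_account(path, key_field="account"):
    """Read a CSV file and index its rows by account number."""
    with open(path, newline="") as f:
        return {row[key_field]: row for row in csv.DictReader(f)}

previous = load_by_account("previous_cleaned.csv")
current = load_by_account("current.csv")

matched, unmatched_current = [], []
for acct, cur in current.items():
    prev = previous.get(acct)
    if prev:
        # Keep the cleaned name/address, take the money fields from the new file.
        matched.append({
            "account": acct,
            "name": prev["name"],
            "address": prev["address"],
            "balance": cur["balance"],
            "aging": cur["aging"],
        })
    else:
        unmatched_current.append(cur)

# Previous records with no current counterpart, when those are wanted too.
unmatched_previous = [row for acct, row in previous.items() if acct not in current]
```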

Starting at $900/yr for “Tableau Creator”, it seems like gross overkill for your relatively straightforward data sorting and formatting needs unless you also want advanced data analytics.

Stranger

I think we’d get the non-profit rate, so like $500/year.

Still a ripoff for a task that could probably be done in less than ten lines of running code.

Stranger

Given that you’re talking about a commercial entity, not a personal hobby project, I’d wag that $500 or $900 / year is pocket change.

I can’t speak to the product, but the idea that cost is a meaningful obstacle seems silly to me. How many hours of your time does the existing kludgy process waste per year? How does that dollar cost compare to spending $900 to simplify all this crap?

Here is a thread where @Johnny_L.A explains what he is trying to do. It is literally just taking a text file with either fixed-length or comma-separated fields, doing some text manipulation to standardize the formatting and spelling of commonly used terms, sorting and comparing for unique records, and then dumping it out into a predefined text format. Unless I’m missing something there are no data analytics (which is what Tableau does), no complicated processing or external lookups, no complex database management, or anything else that would require more than straightforward text manipulation and basic file I/O.

A clever Unix hacker could probably do all of this with regex and pipes, but for a simpleton like me this is an obvious use case for Python (or Perl, although at this point you’d probably have difficulty even finding someone who would sheepishly admit to having used Perl back in the far off days of the ‘Nineties). There is no need for a complex package with a continuing subscription fee to do a basic task that you could pay a community college student beer money to code up for you, or possibly even just have some AI tool gin up a serviceable script (although prompting it to do exactly what you need of it is probably more effort than just writing the code yourself) and then run it on an open source Python interpreter you can download from Python.org.
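
As a rough illustration of how basic the task is, here is a sketch of reading a fixed-width record and normalizing a few common terms; the column positions and abbreviation list are invented, to be replaced with the actual layouts:

```python
# Invented fixed-width layout: (field name, start, end) -- substitute the real positions.
LAYOUT = [("account", 0, 10), ("name", 10, 40), ("address", 40, 80), ("balance", 80, 90)]

# A few example spelling/abbreviation normalizations; the real list would be longer.
REPLACEMENTS = {"STREET": "ST", "AVENUE": "AVE", "POST OFFICE BOX": "PO BOX"}

def parse_fixed(line):
    """Slice one fixed-width record into named fields."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}

def standardize(text):
    """Apply the common-term replacements to one field."""
    out = text.upper()
    for old, new in REPLACEMENTS.items():
        out = out.replace(old, new)
    return out

with open("incoming.txt") as f:
    records = [parse_fixed(line) for line in f if line.strip()]

for rec in records:
    rec["address"] = standardize(rec["address"])
```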

Stranger

I have co-workers who use AI systems like ChatGPT for data manipulation tasks like this. You might see if it can produce macros that will help. That might involve spending no money.

As much as I am inclined to dismiss most purported business uses for a chatbot, this is actually the perfect kind of application for an LLM; processing through complicated data and coming up with a general algorithm for formatting it is essentially what an LLM does. Getting it to spit out a working Python (or whatever language you want) script is an additional layer of complexity, but people keep telling me how much better these tools are getting, even though every time I get a tool or text generated via chatbot it usually has significant issues.

Stranger

The only thing is that the OP may not want to use real data when presenting the question. Doing so might violate privacy. So perhaps change stuff to obscure personal information.

Many of the Easytrieves check for crossfoot errors.
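
(Crossfooting here meaning the detail amounts have to add up to their totals, e.g. the aging buckets summing to the balance.) As a rough sketch of that kind of check in Python, with made-up field names:

```python
# Rough sketch of a crossfoot check: the aging buckets on each record should
# add up to the record's balance. Field names are made up; amounts are assumed
# to already be parsed as cents (integers) to avoid floating-point surprises.
AGING_FIELDS = ["current_due", "over_30", "over_60", "over_90"]

def crossfoot_errors(records):
    """Return the records whose aging buckets don't sum to the balance."""
    errors = []
    for rec in records:
        if sum(rec[f] for f in AGING_FIELDS) != rec["balance"]:
            errors.append(rec)
    return errors
```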

I’m not a hacker of any sort. I think you’re the one who suggested Python. Unfortunately, the local colleges don’t offer Python classes. (Or if they do, I can’t find them.) Also, I have little time to go to school.

The consultant suggested Tableau. I think I’d be fine with Python, only I’d need someone to convert all of my programs and then teach me how to use it. Anyway, this guy has heard good things about Tableau, and I said I’d see what I could find out. I’ll probably watch some videos tomorrow.

Yeah, I’d have to come up with several dummy files.

Actually, my employer uses Google Workspace, and the only authorized AI tool is a custom version of Google Gemini. I think that we can use it even with private data. So you might check if your employer has an authorized, secure AI tool. It might save you the hassle of building dummy files.

AFAIK, they don’t.

It’s a major hassle. Especially since the data we receive is not in (or rarely is in) a standard format.

I’m with @Stranger_On_A_Train in that it feels like overkill, but Tableau Data Prep and Tableau Creator both have a free trial period. You will get contacted by a salesperson after downloading, but otherwise it’s a fairly straightforward trial period, and at least it used to be easily extended if necessary.

The only people that I know who benefit from Data Prep are those Tableau-using departments where everything is in crappy Excel sheets, or some department wants to pull some corporate SQL data, but has no knowledge of SQL.

Tableau is a perfectly cromulent tool for data analysts, if you happen to have such a need as well.

Yep. Or text files.

Is there a finite set of formats in which you receive the data? So that once you determine how to process the various formats, you have no more need for the Tableau software? Or will you need the Tableau software regularly?

Almost every company has its own Excel format. Some, such as a particular airline, use a fixed-position text format as spelled out in the Data Extract Guide put out by Big Multinational Credit Reporting Agency decades ago. (This particular airline still uses BMCRA’s former name in the header record.) A couple use BMCRA’s newer, 1,500+ byte fixed-position text format. Everyone else? Whatever they came up with to bill their customers. As long as all of the required fields are there, and the account numbers are unique (and not re-used), it’s good.

Most are one record per account, with the name, address, amounts, and other fields all on one line; and some send separate aging and master files. Two different databases can use separate files. Another credit reporting agency can only accept single, ‘flat’ files. Speaking of ‘flat’ files, some companies send us open-invoice files that have to be ‘flattened’.

It’s easy for me to write Easytrieve programs, because I’ve been writing them for 30 years. Object-Oriented? It confuses me.
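
By ‘flattened’ I mean going from one line per open invoice to one line per account, with the amounts rolled up. As a rough Python sketch of that step (the field names are made up, just to show the shape of it):

```python
from collections import defaultdict

# Rough sketch of "flattening" an open-invoice file: many invoice lines per
# account collapse into one record per account with the amounts summed.
# Field names are placeholders, and amounts are assumed already parsed as numbers.

def flatten(invoice_rows):
    """Collapse per-invoice rows into one summary row per account."""
    accounts = {}
    totals = defaultdict(int)
    for row in invoice_rows:
        acct = row["account"]
        # Keep the name/address from the first line seen for the account.
        accounts.setdefault(acct, {"account": acct,
                                   "name": row["name"],
                                   "address": row["address"]})
        totals[acct] += row["amount"]
    for acct, rec in accounts.items():
        rec["balance"] = totals[acct]
    return list(accounts.values())
```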

Then why are you wanting to move away from it?

The program has been sold a couple of times. We have a copy on an old CD. It won’t run on our new Lenovo laptops; it will only run on my older Dell desktop. It can’t last forever. For that matter, neither can I. We need something newer, that younger people can use when I’m gone.

Has Tableau actually proven that its software can do this? Did anyone build a prototype yet?

I don’t think this is a simple problem to solve… you are trying to identify address-looking strings inside unstructured text from different sources and parse them into structured fields? There’s going to be a crapton of edge cases. Have you done any sort of tests at all, like sampling even 1% of the incoming data and seeing how well different parsers do?

I’m not sure that this is something either a script or an expensive commercial package could do perfectly. Maybe to some acceptable error rate. But it seems far from trivial.

How does it currently work in Easytrieve? Are you manually making templates for every incoming file type? Or does it use some sort of heuristic engine to guess at the incoming structure?

It seems to me it would actually take some sort of natural language processing, or an LLM type thing, to reliably detect names and addresses and then compare them to some sort of geo database for normalization, especially if you want to go for USPS validation and not just “looks good according to our own company rules”. Unless there is some technique I’m not aware of (which is totally possible), it doesn’t seem like an easy regex or CSV parsing job.
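
That said, there are trained statistical parsers for exactly this kind of thing in Python, e.g. the usaddress package, which tags the parts of a free-form US address string. Whether it holds up on this particular data is an open question; this is just the general shape of using it:

```python
# Sketch of using the usaddress package (a statistical US address parser) to
# split a free-form address string into labeled parts. How well it copes with
# this particular data is an open question; this only shows the general shape.
import usaddress

raw = "123 N Main Street Apt 4 Springfield IL 62704"

try:
    tagged, address_type = usaddress.tag(raw)
    # tagged is an ordered mapping like {'AddressNumber': '123', 'StreetName': 'Main', ...}
    print(address_type, dict(tagged))
except usaddress.RepeatedLabelError:
    # The parser couldn't settle on a single reading; flag it for manual review.
    print("Ambiguous address, needs a human:", raw)
```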