Recommend data-cleaning/reformatting programs/languages

Johnny_L.A · September 14, 2017, 4:42pm

There’s an old saying: ‘Garbage in, garbage out.’ My job is to make not-garbage. When we receive data, someone goes through the file and ‘cleans’ it. That is, address formats are standardized, spelling is corrected, missing fields are provided, etc. I’ve written programs to compare a new file with the previous (cleaned) one, and the program used the cleaned information in the new file. It’s a simple matter to clean up the non-matched records in Excel. Other programs do calculations and write the output to two (sometimes three) different formats. I’m using an ancient language called Easytrieve.

Things are changing. Even though my programs have sped up processing such that instead of four people working on it, there’s just me, the company president wants less ‘touching’ of the data. We’re expecting a large cut in revenue from a credit reporting company that buys our data, and the president is looking for ways to either increase revenue, or cut expenses. I’m very worried about my situation.

So I need your help.

Are there any off-the-shelf programs available that A) Can take different input formats (fixed-position text, delimited text, .csv, .xls – Easytrieve only takes fixed-position text); B) Can look at something like ‘ABC MFG’ ‘CO INC 123’ ‘MAIN ST’ ‘#100’ and change it to ‘ABC MFG CO INC’ ‘123 MAIN ST #100’; C) Standardise such things as 'STREET to ‘ST’, ‘DRIVE’ to ‘DR’, etc. (I could parse the fields with Easytrieve, but it’s faster and easier to use Replace All in Excel); and D) Write the output to different formats?

Or: What programming language would be the best one to learn to accomplish those things?

In a nutshell, I’m looking for a program, or a programming language, where I can take ‘any’ format, define input and output parameters, and simply run the data.

treis · September 14, 2017, 6:45pm

You want to use a commercial service like https://smartystreets.com/. This will take an address and match it to the official post office address. They have a SDK for a bunch of different languages here SDK Reference - Smarty. Your best bet is going to be python.

Johnny_L.A · September 14, 2017, 7:00pm

I’m pretty sure the president doesn’t want a commercial service. I think he wants a software package we can buy and just use. Let me try to show you the kind of thing we’re trying to do. This is what we get:

[JOHNNY’S BUILDERS AND ] [CONTRACTORS INC 8910 HARVARD ] [ST APT 700 ] [SAN FRANCISCO] [CA]

This is what we want:

[JOHNNY’S BUILDERS & CONTRACTORS INC ] [] [8910 HARVARD ST #700 ] [] [SAN FRANCISCO] [CA]

Where the brackets indicate fields.
[ul][li]AND in the Name field has been changed to &;[/li][li]The business name portion of the Address field has been concatenated onto the name;[/li][li]An additional business name field has been added;[/li][li]The street address portion of the Address field, and the rest of the address from the secondary address field have been concatenated;[/li][li]APT[space] has been changed to #.[/ul][/li]
I’ll see what I can find out about python.

treis · September 14, 2017, 7:29pm

Yeah, I understand what you are trying to do. The issue is that it is an extremely hard problem to solve with a massive number of edge cases. AFAIK all of these services use the USPS’ address database and comparison mechanisms. There’s just too many rules and edge cases.

treis · September 14, 2017, 7:30pm

Here’s a python library that tries to do what you want:

Johnny_L.A · September 14, 2017, 8:13pm

That’s pretty cool. (Unfortunately, the code tags no longer work here.)

JOHNNY’S BUILDERS AND CONTRACTORS INC Recipient
8910 AddressNumber
HARVARD StreetName
ST StreetNamePostType
APT OccupancyType
700 OccupancyIdentifier
SAN FRANCISCO PlaceName

It does parse the name and address. But they need to be output to defined fields in a single record, and the Occupancy Type and Occupancy Identifier should be combined into ‘#700’. I assume that the single-record output is easily coded?

So let’s say I have a file that contains the following fields: Name, Secondary name, Address, Secondary address, City, State, ZIP code, Phone number, Terms Account balance, Current amount, 1-30 DPT, 31-60 DBT, 61-90 DBT, Over 90 DPT, Comment, Date of last sale. This is all in a single Excel spreadsheet row (or fixed-position text record). What would the code look like to make the output record look like the example I gave in my previous post?

treis · September 14, 2017, 9:02pm

Yes, it’s trivial. You want to split it like this first before running transformation rules. For example, if one of the businesses entered was Apt 9, the clothing brand, you don’t want that to be changed into #9. By breaking it into components you can more narrowly target parts of the address. Only attempting to change the OccupanyType from Apt to # as an example.

You will read the CSV, create the string to feed into the usaddress library, run your transformation rules, and then write out the values to a (new) csv file.

Johnny_L.A · September 14, 2017, 9:42pm

That sounds straightforward enough. It occurred to me that we might not have to output to the specified formats, but may be able to send .csv files after giving them field definitions.

I’ve looked at a couple of products, and contacted one vendor. Several of the software packages I looked at seem to fix addresses by looking at a database. First, we don’t want to get data from a database when a member is sending us an address (no matter how badly formatted). It’s a little unethical to ‘make stuff up’, if you know what I mean. And second, I’m already ‘checking against a database’ – the previous month’s files – with my Easytrieve programs. And I still have to manually fix the non-matched records. (I just had to clean up NINE addresses! :eek: ) Granted, I do have to do a ‘fresh cleanup’ periodically in case a company moved or changed its name, but I think that’s better than spending two or three thousand dollars on a product that does what I’m already doing.

Johnny_L.A · September 14, 2017, 9:57pm

Heh. Just had to try it with a whole record. Address Python doesn’t like whole records.

123456 8/30/2017 7654321 CORP JOHNNY’S CONSTRUCTION 504 FAKE STREET DENTON TX 76205 0 0 0 0 0 0 0

Returned: ‘Error parsing that address.’

treis · September 14, 2017, 10:18pm

Johnny_L.A:

That sounds straightforward enough. It occurred to me that we might not have to output to the specified formats, but may be able to send .csv files after giving them field definitions.

I’ve looked at a couple of products, and contacted one vendor. Several of the software packages I looked at seem to fix addresses by looking at a database. First, we don’t want to get data from a database when a member is sending us an address (no matter how badly formatted). It’s a little unethical to ‘make stuff up’, if you know what I mean. And second, I’m already ‘checking against a database’ – the previous month’s files – with my Easytrieve programs. And I still have to manually fix the non-matched records. (I just had to clean up NINE addresses! :eek: ) Granted, I do have to do a ‘fresh cleanup’ periodically in case a company moved or changed its name, but I think that’s better than spending two or three thousand dollars on a product that does what I’m already doing.

You are comparing it to a previous DB but you’ve already established that it’s “garbage”

What are your customer using this DB for? If it’s anything to do with the address at some point they are most likely standardizing it by using a service that references the official USPS list of addresses. It’s probably worth it for you to do it once.

Johnny_L.A · September 14, 2017, 10:33pm

We send them data, which they put into their database. (We also put it in our own database in Florida, and also another company.) It’s used to make business credit reports. When I worked for that company, they used Pitney-Bowes software to clean up the addresses, but that was downstream from where I worked. Where I am now, I’m sending the data to The-Person-Who-Is-Now-Me at the other company. We want to send the highest-quality data to ensure none is lost.

Johnny_L.A · September 15, 2017, 11:02pm

When I started with the company, I wrote my programs to conform to our customer’s shorter format. Since the fields in that format are of limited size, I took, and take, pains to ensure all of the data fit into their fields. One thing I can do to ‘reduce the touching’ of data is to simply redefine the widths of the fields such that I don’t have to ‘pre-clean’ the data. I do this for some files, since they’re a PITA. So short term, there’s something.

But in the longer term, it looks like our customer no longer cares about format/layout. It would be much easier to send them a .csv file instead of a formatted text file as described earlier in the thread. Right now I’m not thrilled with some vendor’s software. Maybe it’ll work, and maybe it won’t. But I don’t want to recommend software that will cost $2,500 per year. I think the better option is treis’s suggestion of Python. If I learned it, then I can set up the jobs, receive the data, run the data through the program, and send it to our customer(s). Basically, re-doing my Easytrieves.

So… How do I go about learning Python? I didn’t turn up anything at the local community colleges. Code Academy looks like it has a tutorial. Lynda.com has courses for a fee. General Assembly has courses in Seattle, but I don’t have four grand to spend. For thirty bucks I can get a book called *Learn Python The Hard Way, which says ‘[T]he original and still the most popular way for total beginners to finally learn how to code. Learn Python The Hard Way takes you from absolute zero to able to read and write basic Python to then understand other books on Python.’ Then there’s Python for Beginners, which treis linked to earlier.

Where’s the best place to start?

treis · September 15, 2017, 11:53pm

I don’t think you should go through any tutorials. Doing is the best way of learning. Start at the beginning and google when you get stuck.

rbroome · September 16, 2017, 12:13am

I suggest trying a Python tutorial until you get bored. That won’t take long and will give you an idea of the first steps. After that, just jump in with some real problems and google is your friend.

While almost the definition of user unfriendly UNIX AWK is the original text processing language. Python is a better choice, but if you have a beard and suspenders, AWK is the way to go.

DPRK · September 16, 2017, 12:44am

Python is basically a general-purpose/multi-paradigm language (versus Perl, Awk, and SNOBOL which are designed for text wrangling). The advantage Python may have is that it is in the top 5 most popular programming languages today (2017), so there are a lot of educational resources, examples, and code libraries out there for it.

Johnny_L.A · September 18, 2017, 5:29pm

I haven’t had a chance to start the tutorials yet, but I have a question. I have a file right now that has a Name field and four address fields. The secondary name will go into the first address column, and the primary address will go into the second address column. The City, State, and ZIP code are combined into one field and must be separated. (Two additional columns will need to be added.) Some of the CSZs are in the fourth address column, some are in the third, and some are in the second.

I know from this thread that Python can separate elements. I assume it can insert additional columns. I imagine logic that says if the ZIP code is blank, then look in the previous filed. If the ZIP code is blank there, look in the previous field to that. And so on. If the ZIP isn’t blank in a previous field, move everything over until the ZIP code field is populated. That’s easy in Easytrieve. I assume Python can do the same. OK, there re primary addresses in what will be the secondary name field. There are also secondary names and 'ATTN:'s (which we don’t want). Many records have an Attn: in what will be the primary address field. Many have the secondary address where the primary address should go, and the primary address where the secondary address should go. It’s a small file, so I can just fix it in Excel. But how would Python figure it out?

Topic		Replies	Views
Tableau Data Prep In My Humble Opinion	55	209	June 14, 2025
Programming alternative Factual Questions	41	12017	May 2, 2010
Programming alternatives, Part 2 Factual Questions	113	6886	October 8, 2018
Looking up hundreds of ZIP codes Factual Questions	22	3293	June 27, 2019
PDF / .DOC => CSV parsing conversion - cheaply, efficiently, accurately Marketplace	1	5613	June 3, 2011

Recommend data-cleaning/reformatting programs/languages

Related topics