Are there "data scrubbing" tools to do what I need?

control-z · November 5, 2009, 4:02pm

I have a bunch of OCRed archival data with various errors throughout. Even though the OCR is 99% accurate, that can still mean 1 out of every 100 characters is wrong. The data is basically name, address, city, state. Going through all this information manually is error-prone and not something we’ll be able to spend the time or money on; there are hundreds of thousands of archival records.

As an example of problems I’m seeing, the original data was liberal in state abbreviations, so NY could be listed as New York, N.Y., N Y, etc. Tennessee could be Tennessee, TENN, TENNI (due to OCR error), TENH, etc.
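For the state variants specifically, here is a rough sketch of one possible approach: strip punctuation, look the value up in a table of known names and abbreviations, and fall back to fuzzy matching with Python's standard `difflib` for OCR slips like TENNI. The lookup tables, the `normalize_state` name, and the 0.7 cutoff are all assumptions for illustration and are not tuned against real data.

```python
import difflib
import re

# Illustrative subset only; a real table would list every state.
STATE_NAMES = {
    "NEW YORK": "NY",
    "TENNESSEE": "TN",
    "TEXAS": "TX",
}

# Abbreviations assumed to appear in the old records (hypothetical list).
ALIASES = {
    "NY": "NY",
    "TN": "TN",
    "TENN": "TN",
}


def normalize_state(raw, cutoff=0.7):
    """Return a two-letter code for raw, or None if nothing is close enough."""
    # Drop punctuation and digits, uppercase, collapse runs of spaces.
    key = " ".join(re.sub(r"[^A-Za-z ]", "", raw).upper().split())
    compact = key.replace(" ", "")  # "N Y" -> "NY"

    # Exact matches first: full names, then abbreviations.
    if key in STATE_NAMES:
        return STATE_NAMES[key]
    if compact in ALIASES:
        return ALIASES[compact]

    # Fuzzy fallback for OCR errors; the cutoff is a guess.
    candidates = list(STATE_NAMES) + list(ALIASES)
    match = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
    if not match:
        return None
    best = match[0]
    return STATE_NAMES.get(best) or ALIASES[best]


for sample in ["New York", "N.Y.", "N Y", "TENN", "TENNI", "TENH", "Ohio"]:
    print(sample, "->", normalize_state(sample))
```

Anything that comes back as `None` would still need a human to look at it, and short abbreviations are easy to mis-match, so the fuzzy results are best treated as suggestions to confirm rather than automatic fixes.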

Are there tools to go through the data and standardize the information? What I’m picturing is some sort of editor that initially colors everything red but as you confirm something is correct or correct the first occurrence it also corrects every other occurrence and turns them green.
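The "fix it once, fix it everywhere" idea can be sketched without a full editor: review each distinct value a single time, record the decision in a mapping, then apply that mapping to every record. The sketch below assumes records are plain dicts; the function names, the `state` field, and the red/green status labels are hypothetical.

```python
from collections import Counter


def distinct_values(records, field):
    """Return (value, count) pairs for a field, most frequent first."""
    return Counter(r[field] for r in records).most_common()


def apply_corrections(records, field, corrections):
    """Apply reviewed corrections to every record.

    corrections maps an original value to its confirmed replacement
    (map a value to itself to confirm it as already correct).
    Returns (new_records, statuses): "green" if the value was reviewed,
    "red" if it still needs attention.
    """
    new_records, statuses = [], []
    for record in records:
        value = record[field]
        if value in corrections:
            new_records.append({**record, field: corrections[value]})
            statuses.append("green")
        else:
            new_records.append(dict(record))
            statuses.append("red")
    return new_records, statuses


records = [
    {"name": "A", "state": "TENN"},
    {"name": "B", "state": "TENNI"},
    {"name": "C", "state": "TENN"},
    {"name": "D", "state": "N.Y."},
]

# One decision per distinct value, however many records contain it.
corrections = {"TENN": "TN", "TENNI": "TN"}

fixed, statuses = apply_corrections(records, "state", corrections)
for record, status in zip(fixed, statuses):
    print(status, record)
```

Working through `distinct_values` from the most frequent value down means the first few decisions cover most of the records, which is roughly what the red-to-green editor would do interactively.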

If necessary I’ll write an interactive program to correct data problems, but of course that’s likely to turn into another big project.

Athena · November 5, 2009, 4:23pm

There are companies that do address scrubbing, but they’re not cheap. I used to work at one. Believe me, taking a standard address, even one that hasn’t been OCR’d, and making it into a standard readable address is NOT an easy process.

steadierfooting · November 5, 2009, 5:29pm

This website lists vendors who make address standardization software:

http://ribbs.usps.gov/files/vendors/index_ribbsnew.cfm

If the data isn’t too munged, you could probably do a one-time deal and have one of them standardize the addresses. If it’s too screwed up and needs preprocessing first, then outsource it.

As the previous poster mentioned, it’s time consuming and not cheap.

control-z · November 5, 2009, 6:59pm

Thanks for the input. I’m not sure how well address standardization software would work, since these are ~40-year-old addresses with no ZIP codes and a wide variety of state abbreviations. What I was hoping for was a more generalized data correction utility, not just one for addresses.
