Are there "data scrubbing" tools to do what I need?

I have a large batch of OCRed archival data with errors scattered throughout. Even though the OCR is 99% accurate, that still means roughly 1 out of every 100 characters is wrong. The data is basically name, address, city, and state. Going through all of it manually would be error-prone, and with hundreds of thousands of archival records it's not something we can spend the time or money on.

As an example of the problems I'm seeing, the original data was inconsistent with state abbreviations: NY could be listed as New York, N.Y., N Y, etc., and Tennessee could appear as Tennessee, TENN, TENNI (due to OCR error), TENH, and so on.

Are there tools to go through the data and standardize the information? What I'm picturing is an editor that initially colors everything red; as you confirm a value is correct (or fix its first occurrence), it applies the same correction to every other occurrence and turns them all green.
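For the state-variant problem specifically, you can get surprisingly far with just Python's standard library before reaching for a commercial tool. Here's a minimal sketch that maps messy spellings to a canonical state list by checking initials, prefixes, and finally fuzzy similarity. The canonical list, the 0.5 cutoff, and the function name are my own assumptions for illustration, not a finished tool; you'd extend the list to all states and tune the cutoff against your data.

```python
import difflib

CANONICAL = ["NEW YORK", "TENNESSEE", "TEXAS", "OHIO"]  # extend to all states

def normalize_state(raw):
    """Map a messy state string to a canonical name, or None to flag
    it for manual review."""
    # Strip punctuation and whitespace noise first: "N.Y." -> "NY"
    letters = "".join(ch for ch in raw.upper() if ch.isalpha())
    for name in CANONICAL:
        initials = "".join(word[0] for word in name.split())
        if letters == initials:                       # "NY" -> NEW YORK
            return name
        # Prefix abbreviations like "TENN"; require 3+ letters so short
        # ambiguous prefixes (e.g. "TE") don't match the wrong state
        if len(letters) >= 3 and name.replace(" ", "").startswith(letters):
            return name
    # Fall back to fuzzy matching for OCR damage like "TENNI"
    names = {n.replace(" ", ""): n for n in CANONICAL}
    hit = difflib.get_close_matches(letters, list(names), n=1, cutoff=0.5)
    return names[hit[0]] if hit else None
```

Note that with this cutoff a variant like TENH still scores too low to match and comes back as None, which is arguably what you want: anything the script isn't confident about gets queued for a human instead of silently guessed.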

If necessary, I'll write an interactive program to correct the data problems, but of course that's likely to turn into another big project.
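The interactive part may be smaller than you fear, since the key trick is reviewing each *distinct* value once rather than each record. Here's a rough sketch of that confirm-once, propagate-everywhere loop; the function name, record shape (list of dicts), and the injectable `ask` callback are assumptions for illustration:

```python
from collections import Counter

def review_field(records, field, ask=None):
    """Review each distinct value of `field` once, then propagate the
    chosen correction to every record sharing that value.

    `ask(value, count)` returns the corrected spelling, or "" to keep
    the value as-is. Defaults to a console prompt.
    """
    if ask is None:
        ask = lambda value, count: input(f"{value!r} x{count} -> fix (blank keeps): ")
    counts = Counter(rec[field] for rec in records)
    corrections = {}
    # Review the most common spellings first; one answer fixes many rows
    for value, count in counts.most_common():
        corrections[value] = ask(value, count) or value
    for rec in records:
        rec[field] = corrections[rec[field]]
    return corrections
```

Because the reviewer only sees each unique spelling once, a few hundred prompts can cover hundreds of thousands of records, and the returned correction map can be saved and reapplied to later batches.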

There are companies that do address scrubbing, but they're not cheap. I used to work at one. Believe me, taking a typical address, even one that hasn't been OCR'd, and turning it into a standardized, readable address is NOT an easy process.

This website lists vendors who make address standardization software:

http://ribbs.usps.gov/files/vendors/index_ribbsnew.cfm

If the data isn't too munged, you could probably do a one-time deal and have them standardize the addresses. If it's too messed up and needs preprocessing first, then outsource it.

As the previous poster mentioned, it’s time consuming and not cheap.

Thanks for the input. I'm not sure how well address standardization software would work here, since these are ~40-year-old addresses with no ZIP codes and a variety of state abbreviation variants. What I was hoping for was a more generalized data correction utility, not something limited to addresses.