I have a large set of OCRed archival data with errors throughout. Even though the OCR is 99% accurate, that can still mean that 1 out of every 100 characters is wrong. Each record is basically name, address, city, state. Going through all of this manually is error-prone, and with hundreds of thousands of archival records it’s not something we’ll be able to spend the time or money on.
As an example of the problems I’m seeing, the original data was inconsistent in how it wrote states, so New York could be listed as NY, New York, N.Y., N Y, etc. Tennessee could be Tennessee, TENN, TENNI (due to an OCR error), TENH, etc.
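To make the state problem concrete, here is a rough Python sketch of the kind of normalization I have in mind. The lookup table, the `clean` and `normalize_state` names, and the 0.7 similarity cutoff are all assumptions of mine for illustration. The fuzzy step uses the standard library's `difflib`, which is a general string matcher and not something tuned for OCR errors.

```python
import difflib
import re

# Hypothetical lookup table: known spellings -> canonical code.
# A real table would cover every state; only two are shown here.
CANONICAL = {
    "NEW YORK": "NY",
    "NY": "NY",
    "TENNESSEE": "TN",
    "TENN": "TN",
    "TN": "TN",
}

def clean(raw):
    """Upper-case and keep letters only, so 'N. Y.' and 'N Y' both become 'NY'."""
    return re.sub(r"[^A-Z]", "", raw.upper())

# Index the table by its cleaned spellings once, up front.
KEYS = {clean(spelling): code for spelling, code in CANONICAL.items()}

def normalize_state(raw, cutoff=0.7):
    """Return the canonical code, or None if nothing is close enough."""
    key = clean(raw)
    if key in KEYS:
        return KEYS[key]  # exact hit after cleaning
    # Otherwise fall back to the closest known spelling above the cutoff.
    match = difflib.get_close_matches(key, list(KEYS), n=1, cutoff=cutoff)
    return KEYS[match[0]] if match else None

for sample in ["New York", "N.Y.", "N Y", "TENNI", "TENH", "XYZ"]:
    print(sample, "->", normalize_state(sample))
```

Anything that comes back as `None` would still need a human decision, and with two-letter codes a fuzzy match can easily be wrong, so I would not trust this without review.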
Are there tools that can go through the data and standardize it? What I’m picturing is some sort of editor that initially colors everything red. As you confirm that a value is correct, or fix its first occurrence, the editor applies the same decision to every other occurrence of that value and turns them all green.
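The "decide once, apply everywhere" behavior could be sketched roughly like this in Python. The function names, the `decisions` dictionary, and the sample rows are all made up for illustration. This only covers the bookkeeping, not the editor or the coloring itself.

```python
from collections import Counter

def distinct_values(records, field):
    """Distinct raw values of one field, most frequent first."""
    counts = Counter(rec[field] for rec in records)
    return counts.most_common()

def apply_decisions(records, field, decisions):
    """Apply each reviewed value to every record that shares it.

    decisions maps a raw value to its confirmed or corrected form.
    Returns the updated records plus a parallel list of statuses:
    'green' for reviewed values, 'red' for values nobody has looked at.
    """
    fixed, statuses = [], []
    for rec in records:
        raw = rec[field]
        if raw in decisions:
            fixed.append({**rec, field: decisions[raw]})
            statuses.append("green")
        else:
            fixed.append(dict(rec))
            statuses.append("red")
    return fixed, statuses

# Made-up sample rows, not real archival data.
records = [
    {"name": "A. Smith", "state": "TENNI"},
    {"name": "B. Jones", "state": "TENNI"},
    {"name": "C. Brown", "state": "N Y"},
]

decisions = {"TENNI": "TN"}  # a single correction made by the reviewer
fixed, statuses = apply_decisions(records, "state", decisions)
print(distinct_values(records, "state"))
print(statuses)
```

Working through distinct values by frequency means one decision on a common misspelling clears many records at once, which is the main appeal over going record by record.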
If necessary, I’ll write an interactive program to correct the data problems myself, but that’s likely to turn into another big project.