Let’s say I have strings in my VBScript code that look like this:
MR. JOSEPH SMITH
123 FAKE STREET
SPRINGFIELD OR 01234
I want it to look like this:
Mr. Joseph Smith
123 Fake St.
Springfield OR 01234
I could probably use the UCase and LCase functions to get it to look like that. Since each bit of demographic information is held in a discrete field, I can just LCase everything but the first character of each word.
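For what it's worth, here is a minimal sketch of that naive approach. It is written in Python for brevity; in VBScript the same idea would be roughly UCase(Left(w, 1)) & LCase(Mid(w, 2)) applied to each word. The function name simple_proper is made up for illustration.

```python
def simple_proper(field: str) -> str:
    """Upper-case the first character of each space-separated word
    and lower-case everything after it."""
    words = field.split(" ")
    return " ".join(w[:1].upper() + w[1:].lower() for w in words)


print(simple_proper("MR. JOSEPH SMITH"))  # Mr. Joseph Smith
print(simple_proper("123 FAKE STREET"))   # 123 Fake Street
```

Note that this does not abbreviate "Street" to "St.", and it would turn a state code like "OR" into "Or", so the state field should be skipped.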
Now suppose I have strings like these:
MRS. GEORGIA O’KEEFE
MR. RONALD MCDONALD
MS. T’PRING K. VULCAN
MR. DAVID ST. HUBBINS
I would google my brains out trying to find someone else who has posted a function to do Proper Case that includes clauses for apostrophes and names starting with Mac, Mc and De.
Assuming you don’t get any results from ZipperJJ’s excellent suggestion, I’d handle stuff like O’<name> and Mc<name> directly, and set up a hash for other possible exceptions. There is no way to distinguish the cases of MacDonald and Mackin, and De<name> is even worse. If you can find something like a downloadable telephone directory, you might be able to collect a lot of the names.
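A rough sketch of that idea, in Python rather than VBScript (a Scripting.Dictionary would play the role of the hash there). The entries in EXCEPTIONS and the name proper_surname are hypothetical; the real table would be filled from whatever names you can collect.

```python
# Hypothetical exception table: upper-cased surname -> preferred spelling.
# Names that no rule can get right (MacDonald vs. Mackin, De<name>) go here.
EXCEPTIONS = {
    "MACDONALD": "MacDonald",
    "DELUCA": "DeLuca",
}


def proper_surname(name: str) -> str:
    upper = name.upper()
    # 1. Known exceptions win over every rule.
    if upper in EXCEPTIONS:
        return EXCEPTIONS[upper]
    # 2. O'<name>, with either a straight or a curly apostrophe.
    if len(upper) > 2 and upper[0] == "O" and upper[1] in "'’":
        return "O" + upper[1] + upper[2] + upper[3:].lower()
    # 3. Mc<name>: capitalise the letter after the prefix.
    if len(upper) > 2 and upper.startswith("MC"):
        return "Mc" + upper[2] + upper[3:].lower()
    # 4. Default: first letter upper, the rest lower.
    return upper[:1] + upper[1:].lower()


print(proper_surname("O’KEEFE"))    # O’Keefe
print(proper_surname("MCDONALD"))   # McDonald
print(proper_surname("MACDONALD"))  # MacDonald (from the table)
print(proper_surname("MACKIN"))     # Mackin (no Mac rule, so left alone)
```

Checking the table first means a single bad rule result can always be overridden by adding one entry.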
First of all, 100% accuracy is impossible. There are certainly names out there that differ only by capitalization. Plus some people have weird names like “SanDee.” My question is then, “How much time are you willing to spend on accuracy?”
Guessing at figures, I think you’ll get better than 80% by using lcase on everything but the first letter. If you want to keep going, capitalize everything after punctuation (don’t forget hyphenated names) and then pick up Mac and Mc. I expect that will get you about 97% accuracy.
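Something along these lines might cover the "capitalize after punctuation, then pick up Mc" steps. It is a sketch in Python, not VBScript (the VBScript RegExp object could do similar work), and proper_case is a made-up name. A blanket Mac rule is deliberately left out, because it would turn Mackin into MacKin.

```python
import re


def proper_case(text: str) -> str:
    # Treat every run of letters as a word, so a new word starts after a
    # space, period, hyphen or apostrophe. capitalize() upper-cases the
    # first letter of the run and lower-cases the rest.
    result = re.sub(r"[^\W\d_]+", lambda m: m.group(0).capitalize(), text)
    # Mc<name>: re-capitalise the letter that follows the prefix.
    result = re.sub(r"\bMc([a-z])", lambda m: "Mc" + m.group(1).upper(), result)
    return result


print(proper_case("MRS. GEORGIA O’KEEFE"))   # Mrs. Georgia O’Keefe
print(proper_case("MR. RONALD MCDONALD"))    # Mr. Ronald McDonald
print(proper_case("MS. T’PRING K. VULCAN"))  # Ms. T’Pring K. Vulcan
print(proper_case("MR. DAVID ST. HUBBINS"))  # Mr. David St. Hubbins
```

Known rough edges: a possessive such as JOE'S comes out as Joe'S, and a street ordinal such as 123RD comes out as 123Rd, so this is better suited to name fields than to free text.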
How would I solve it? I’d leave everything in upper case and go on to the next part of the project.
I used to be on a mailing list that worked with such matters on a global scale, leading to some ISO standards IIRC regarding name and address formats.
The general case is not trivial - I hope you don’t have to turn in a homework assignment to solve it.
I seem to recall that the name of the list had the acronym grscdi, so maybe Google can help (no guarantees on that acronym, though it is probably close).
As for casing the names, I refuse to do it because it’s impossible to do with any sort of accuracy.
I can’t tell if you’re trying to standardize addresses or not. If you are, that is possible - I used to work at a company that did mailing list standardization & cleansing. You can write code that will go through a list of names & addresses and format them to the correct USPS standardized address format. However, as not_alice says, this is not a trivial problem. This is the kind of problem people start businesses (BIG businesses) to fix. If you need address standardization, I suggest you contract with one of those businesses to do it for you; you can’t write the code yourself, at least not unless you have several programmers and several months.
As for the scope of the project, I am the only programmer and I don’t have months to do this. But I’ll settle for 97% accuracy. I can’t predict what new names and addresses will come down the pike, but I do have a list of past names to work from already. If I can create cases for all of them, then there shouldn’t be a lot of surprises in the future.
The good news is that every letter created can be hand edited before it’s printed up and mailed out.
What concerns me most is names like “van der Linden M.D.” and “USMC.”
Yes, you are right to be concerned! I like my small “v” in my “van Surname” style surname, and am annoyed at all the variations of my name that computers come up with (“Van Surname”, “Vansurname”, “Surname, Firstname Van”, etc). Why can’t you either leave it the way it was typed in or convert everything to a nice safe uppercase?
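If the conversion has to happen anyway, one possible way to handle the “van der Linden M.D.” and “USMC” cases is a pair of small lookup lists checked before the default rule. This is only a sketch in Python (the VBScript version would again lean on a Scripting.Dictionary); the contents of LOWER_PARTICLES and KEEP_UPPER are guesses that would need to be built from the existing list of past names, and lower-casing "van" is itself an assumption, since some families write "Van".

```python
# Hypothetical lists; both would be filled in from the real data.
LOWER_PARTICLES = {"VAN", "DER", "DE", "VON"}  # assumed to stay lower-case
KEEP_UPPER = {"USMC", "M.D.", "MD"}            # assumed to stay upper-case


def proper_token(token: str) -> str:
    upper = token.upper()
    if upper in KEEP_UPPER:
        return upper
    if upper in LOWER_PARTICLES:
        return upper.lower()
    # Default rule: first letter upper, the rest lower.
    return upper[:1] + upper[1:].lower()


def proper_name(name: str) -> str:
    # Split on whitespace and decide the casing token by token.
    return " ".join(proper_token(t) for t in name.split())


print(proper_name("VAN DER LINDEN M.D."))  # van der Linden M.D.
print(proper_name("JOHN SMITH USMC"))      # John Smith USMC
```

Since every letter can be hand edited before it goes out, the lists only need to catch the common cases.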