Aside from the obvious, and the jokes and criticisms, what interesting information lies hidden in spell-check programs?
I just typed “biometric”, and it was flagged, and demands a hyphen. But if I cut and paste the same word into a Google search field, it does not get flagged. If I enter “bio-metric” in Google, the drop-down suggestions remove the hyphen. I’m using Firefox, and as far as I know, I have no other spell-checks running. So why the difference? Unless some text entry fields carry their own spell-check, independent of the one in my browser. If I click ‘add to dictionary’, will that apply to all renderings?
How often do new words get entered into spell-check lists, in order to update them? Do I get a new updated list every time I upgrade to a new version of Firefox?
Who decides what words are included? For example, who decides which names of cities are listed, and at what level of unimportance is a city relegated to a wavy red underscore? (Unimportance is flagged.) When will the names of Trump’s new cabinet get spell-check validated?
Spell checkers aren’t part of some central clearinghouse for spell-checked words. They’re totally up to the whim of whoever is including them in an application. That said, there’s likely some kind of commercial spell-check API and probably a service to keep the word lists updated.
As to how they actually work, my guess is that they basically check each word you type against a dictionary. If it doesn’t match, the word is highlighted, and then they probably have something akin to a hash for each word in the dictionary, and generate a hash through the same method, and compare the hash vs. the dictionary hashes and suggest the top 3-4 scoring dictionary words as possible replacements. There are also probably context-dependent suggestions, etc… and much more sophisticated ways than what I describe.
No, if I log into SDMB on Chrome, it still flags biometric and unimportance in the text entry field, but not bio-metric. The same as Firefox. It appears they are using the same dictionary.
Furthermore, in the Google Search text entry field, using Firefox, the word ‘unimportance’ by itself does not get flagged, but in the combination ‘unimportance biometric’, ‘unimportance’ gets flagged and ‘biometric’ doesn’t.
FWIW the only word my Opera spell check flags in the OP is “bio-metric”.
There’s all sorts of weirdness I’ve seen over the years with spell checkers. E.g., “labeled”/“labelled”. Some love one and hate the other and vice versa. Tons of things like that.
Speaking of Opera’s spell checker. It includes many acronyms but surprisingly not a lot of computer ones. E.g., no “USB”.
In my case I can easily add words with a right click. But doing it over and over isn’t really all that easy.
While simple lists of words can’t be copyrighted, people go to great lengths to protect their word lists. The GNU-licensed Aspell dictionaries are probably the best free ones out there. But then you have to deal with GNU-world goofballs if you want to lobby for a word to be added, deleted or fixed.
(Checking shows that Opera uses Aspell. Weirder and weirder.)
Indeed, you can add your own words to the spell checker (on your machine, not everybody’s). This is handy if you commonly use jargon for your profession, or want to add local place names, etc.
The go-to data structure for a basic spell checker dictionary is a trie, rather than a hash table. It makes it easier to find “nearby” words, in addition to telling whether a word is, or is not, in the dictionary.
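For anyone who hasn't met a trie before, here is a minimal sketch of the idea in Python. The class and method names (`Trie`, `contains`, `with_prefix`) are made up for illustration and are not taken from any real spell checker; the point is only that words sharing a prefix share a path, so "nearby" words sit close together in the structure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    """Illustrative trie; names are hypothetical, not from a real library."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        # Follow the prefix letter by letter; None if the path does not exist.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```

With the words "metric", "metro", "meter" and "bio" loaded, `contains("met")` is false (it is only a prefix), while `with_prefix("metr")` finds both "metric" and "metro" without scanning the whole list.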
As a general matter, each PC application, including each brand of browser, has its own spell-check app with its own built-in matching rules and its own separate dictionary.
Many spell checkers consider hyphens the same as spaces. If that applies to your spell-checker, “bio-metric” is NOT a word it recognizes. It’s not even a word it’s trying to check. It’s checking “bio” and finding a match in the dictionary. Then it’s checking “metric” and finding a match in the dictionary. So neither of those two words is highlighted as wrong.
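A rough sketch of that behaviour, assuming a checker that splits on anything that isn't a letter or apostrophe (the function name `flag_unknown` and the tiny dictionary are made up for the example; real checkers have their own tokenizing rules):

```python
import re


def flag_unknown(text, dictionary):
    """Return the tokens in text that are not in dictionary.

    Assumption: hyphens are treated like spaces, so "bio-metric" is
    checked as the two separate tokens "bio" and "metric".
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in dictionary]


known = {"bio", "metric"}
print(flag_unknown("bio-metric", known))  # nothing flagged
print(flag_unknown("biometric", known))   # the unhyphenated form is flagged
```

That would explain why the hyphenated form sails through while the single word gets the red underline.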
My android phone’s spell-checker is “smart” enough to include every name and email address in all my contacts as known words. Which means lots of weird useless suggestions.
It’s also real easy (especially on a phone) to inadvertently accept a misspelled word as legit, thereby adding the misspelling to the dictionary.
At least for IE there is a text file which contains your personal additions to the built-in dictionary. You can open and edit it with Notepad to add new words or remove accidentally accepted misspellings.
Yeah, tries are a good core data structure for something like this.
But to suggest replacements is something else entirely. (And certainly hash tables won’t come anywhere close. Two similar words should end up with very different hashes. The opposite of what you want.)
There’s been extensive research in how to quickly determine how “close” two strings are. The key is to define what kind of errors you want to deal with: missing letter, extra letter, swapped letters, etc. (In the case of the OP, “-” counts as a letter. You can also consider " " a letter for missing/extra blank errors.*)
There are very nice dynamic programming solutions to compare strings like this that are basically optimal. The real trick is to work this in with tries. This gets deep fast.
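To make that concrete, here is a sketch of one such dynamic programming solution: the "optimal string alignment" variant of edit distance, which counts a missing letter, an extra letter, a wrong letter, or two swapped neighbours as one error each. This is one common choice, not necessarily what any particular spell checker uses, and the function name is mine.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between a and b.

    d[i][j] is the cheapest way to turn a[:i] into b[:j] using
    insertions, deletions, substitutions and adjacent swaps.
    """
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete every letter of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert every letter of b[:j]

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # extra letter in a
                d[i][j - 1] + 1,         # missing letter in a
                d[i - 1][j - 1] + cost,  # same letter, or a wrong one
            )
            # Two neighbouring letters typed in the wrong order.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

Treating "-" as just another letter, `edit_distance("bio-metric", "biometric")` comes out as 1, and so does `edit_distance("teh", "the")`.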
Pretend you went thru all possibilities of simple errors comparing the typo to strings in a dictionary. But don’t actually do that because you’ll be doing a lot of redundant work. Use a table to keep track of what you’ve done so far. Then figure out how to get rid of almost all of the table since most are hopeless options. Etc.
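Here is a small sketch of how that can look, under some simplifying assumptions: plain Levenshtein distance (no swapped-letter rule), a trie built from nested dicts, and a made-up `"$"` key marking the end of a word. Each step down the trie computes one new row of the table from its parent's row, so words with a shared prefix share the work, and a whole branch is dropped as soon as every entry in its row exceeds the allowed number of errors.

```python
END = "$"  # hypothetical end-of-word marker; assumed never to appear in words


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def search(trie, typo, max_dist):
    """Return sorted (distance, word) pairs within max_dist edits of typo."""
    results = []
    first_row = list(range(len(typo) + 1))  # distance from "" to typo[:j]

    def walk(node, prefix, prev_row):
        for ch, child in node.items():
            if ch == END:
                continue
            # row[j] = distance between (prefix + ch) and typo[:j]
            row = [prev_row[0] + 1]
            for j in range(1, len(typo) + 1):
                cost = 0 if typo[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + cost))
            word = prefix + ch
            if END in child and row[-1] <= max_dist:
                results.append((row[-1], word))
            # Prune: if every entry is already too big, no longer word
            # under this node can get back within max_dist.
            if min(row) <= max_dist:
                walk(child, word, row)

    walk(trie, "", first_row)
    return sorted(results)


trie = build_trie(["keyboard", "keyboards", "kettle", "board", "biometric"])
print(search(trie, "ketboard", 1))
```

With one allowed error only "keyboard" comes back; raising `max_dist` to 2 also lets "keyboards" in.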
Basically if you’re trying to do something involving comparing/matching strings, dynamic programming is a great way to start off. E.g., KMP string matching is a simplification of a dynamic programming algorithm.
Figuring out if a typo might be two words with a missing blank is the “funnest” type of simple error to deal with.
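A bare-bones sketch of the missing-blank case, assuming the pieces themselves are spelled correctly (handling a typo inside a run-on is where it gets hard). The function name `segment` and the `max_words` limit are invented for the example. It tries the longest known word first and remembers dead ends so it never re-solves the same suffix.

```python
from functools import lru_cache


def segment(token, dictionary, max_words=3):
    """Split token into at most max_words dictionary words, or return None."""

    @lru_cache(maxsize=None)
    def best(start, words_left):
        if start == len(token):
            return ()        # consumed everything: success
        if words_left == 0:
            return None      # ran out of allowed words
        for end in range(len(token), start, -1):  # longest piece first
            piece = token[start:end]
            if piece in dictionary:
                rest = best(end, words_left - 1)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = best(0, max_words)
    return list(result) if result is not None else None


words = {"this", "is", "a", "test", "spell", "check"}
print(segment("spellcheck", words))
print(segment("thisisatest", words, max_words=4))
```

The first call splits into "spell" and "check"; the second needs all four allowed words and returns `None` if `max_words` is left at 3.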
This is especially interesting. That’s a very common error for me on my phone. And my newest phone (Samsung Galaxy S7) is friggin’ awesome at breaking 2- or even 3-word run-ons correctly. Even with a small typo in the middle.
Before this I’d never had a phone or even PC that even tried to deal with run-ons. Much less succeed.
Now (ref the bad UI thread) if only the phone’s spell-check UI made it harder to accept misspelled words and silently add them to the dictionary to trip me next time.
I do not think I’ll ever sie
A hash as lovely as a trie.
And one other common sort of error that spellcheck suggestions look for is keys that are near each other on the ketboard. For instance, ‘t’ is close to ‘y’, so “ketboard” there was probably supposed to be “keyboard”.
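A crude sketch of that idea, assuming a QWERTY layout modelled as a simple grid (it ignores the way real rows are staggered) and considering only a single wrong key. The names `NEIGHBORS` and `adjacent_key_fixes` are made up for illustration.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]  # assumed QWERTY letter rows


def build_neighbors(rows=ROWS):
    """Map each key to the keys directly next to it on the grid."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            near = set()
            for rr in (r - 1, r, r + 1):
                if 0 <= rr < len(rows):
                    for cc in (c - 1, c, c + 1):
                        if 0 <= cc < len(rows[rr]):
                            near.add(rows[rr][cc])
            near.discard(ch)
            neighbors[ch] = near
    return neighbors


NEIGHBORS = build_neighbors()


def adjacent_key_fixes(typo, dictionary):
    """Dictionary words reachable by replacing one letter with a neighbouring key."""
    fixes = set()
    for i, ch in enumerate(typo):
        for other in NEIGHBORS.get(ch, ()):
            candidate = typo[:i] + other + typo[i + 1:]
            if candidate in dictionary:
                fixes.add(candidate)
    return sorted(fixes)


print(adjacent_key_fixes("ketboard", {"keyboard", "kettle"}))
```

A real checker would more likely fold this into the scoring, making a neighbouring-key substitution cheaper than a random one, rather than run it as a separate pass.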
I can add words to my spellcheck. I had to add my last dog’s name b/c spellcheck kept trying to ‘correct’ it. My dog came with his name and I had to keep it.
:smack: spellcheck just changed ‘words’ to ‘work’.
Today’s spell check oddity: I mistyped “I” as “Ii”. Spell check was happy with it. Maybe thinking it was Roman numerals. But mixed case? The spell checker frequently flags incorrect cases on certain types of words like acronyms. It should have caught this and suggested “II” or “ii” instead.
Hmm, let’s check: IVI IIV IV VI VII VIIII. First two and last one are wrong. “VIIII” is wrong? Old fashioned but not wrong.
Most spellcheckers ignore ‘words’ of only 1 or 2 letters. These are mostly not really words, but someone’s initials or abbreviations, or markers used in outlines or lists of items.
Testing them generates a lot more false positives, which most users find more annoying than the occasional miss of a mistyping. Usually, there is a way to adjust this parameter in your spellchecker program.
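In code terms the setting amounts to something like the sketch below. The parameter name `min_length` and its default of 3 are assumptions for the example; actual programs name and default it differently.

```python
def flag_misspellings(tokens, dictionary, min_length=3):
    """Flag unknown tokens, skipping anything shorter than min_length."""
    flagged = []
    for t in tokens:
        if len(t) < min_length:
            continue  # initials, list markers, "Ii" and the like are never checked
        if t.lower() not in dictionary:
            flagged.append(t)
    return flagged


print(flag_misspellings(["Ii", "IV", "VIIII", "the"], {"the"}))
```

Here "Ii" and "IV" are skipped purely because of their length, which would account for the mistyped "Ii" passing unnoticed, while "VIIII" is long enough to be looked up and flagged.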