How does Spell-Check work?

jtur88 · January 31, 2017, 7:23pm

Aside from the obvious, and the jokes and criticisms, what interesting information lies hidden in spell-check programs?

I just typed “biometric”, and it was flagged, and demands a hyphen. But if I cut and paste the same word into a Google search field, it does not get flagged. If I enter “bio-metric” in Google, the drop-down suggestions remove the hyphen. I’m using Firefox, and as far as I know, I have no other spell-checks running. So why the difference? Unless some text entry fields carry their own spell-check, independent of the one in my browser. If I click ‘add to dictionary’, will that apply to all renderings?

How often do new words get entered into spell-check lists, in order to update them? Do I get a new updated list every time I upgrade to a new version of Firefox?

Who decides what words are included? For example, who decides which names of cities are listed, and at what level of unimportance is a city relegated to a wavy red underscore? (Unimportance is flagged.) When will the names of Trump’s new cabinet get spell-check validated?

Chronos · January 31, 2017, 9:16pm

Why would Firefox use the same spellchecker as Google? I think Chrome might use Google’s spellchecker, but that’s because Chrome is made by Google.

bump · January 31, 2017, 9:26pm

Spell checkers aren’t part of some central clearinghouse for spell-checked words. They’re totally up to the whim of whoever is including them in an application. That said, there’s likely some kind of commercial spell-check API and probably a service to keep the word lists updated.

As to how they actually work, my guess is that they basically check each word you type against a dictionary. If it doesn’t match, the word is highlighted, and then they probably have something akin to a hash for each word in the dictionary, and generate a hash through the same method, and compare the hash vs. the dictionary hashes and suggest the top 3-4 scoring dictionary words as possible replacements. There are also probably context-dependent suggestions, etc… and much more sophisticated ways than what I describe.
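A minimal sketch of just the lookup half of that guess, assuming the dictionary is nothing more than an in-memory set of lowercase words. The word list and the function name `misspelled` are made up for illustration and not taken from any real spell checker.

```python
import re

# Hypothetical tiny word list; a real checker would load a much larger one.
WORDS = {"the", "quick", "brown", "fox", "jumps"}

def misspelled(text, words=WORDS):
    """Return the tokens in text that are not found in the word set."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in words]

print(misspelled("The quikc brown fox"))
```

This only answers "is it in the list or not"; it says nothing about which replacements to suggest.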

jtur88 · January 31, 2017, 10:11pm

No, if I log into SDMB on Chrome, it still flags biometric and unimportance in the text entry field, but not bio-metric. The same as Firefox. It appears they are using the same dictionary.

jtur88 · January 31, 2017, 10:20pm

Furthermore, in the Google Search text entry field, using Firefox, the word ‘unimportance’ by itself does not get flagged, but in the combination ‘unimportance biometric’, ‘unimportance’ gets flagged and ‘biometric’ doesn’t.

ftg · January 31, 2017, 10:32pm

FWIW the only word my Opera spell check flags in the OP is “bio-metric”.

There are all sorts of weirdness I’ve seen over the years with spell checkers. E.g., “labeled”/“labelled”. Some love one and hate the other and vice versa. Tons of things like that.

Speaking of Opera’s spell checker. It includes many acronyms but surprisingly not a lot of computer ones. E.g., no “USB”.

In my case I can easily add words with a right click. But doing it over and over isn’t really all that easy.

While simple lists of words can’t be copyrighted, people go to great lengths to protect their word lists. The GNU-licensed Aspell dictionaries are probably the best free ones out there. But then you have to deal with GNU-world goofballs if you want to lobby for a word to be added, deleted or fixed.

(Checking shows that Opera uses Aspell. Weirder and weirder.)

Tim_T-Bonham.net · February 1, 2017, 12:45am

Indeed, you can add your own words to the spell checker (on your machine, not everybody’s). This is handy if you commonly use jargon for your profession, or want to add local place names, etc.

leahcim · February 1, 2017, 1:41am

The go-to data structure for a basic spell checker dictionary is a trie, rather than a hash table. It makes it easier to find “nearby” words, in addition to telling whether a word is, or is not, in the dictionary.
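For anyone who hasn't met one, here is a rough Python sketch of a trie. The class and method names (`Trie`, `with_prefix`) are my own for illustration, not any particular library's API. It shows the membership test plus one simple kind of "nearby" lookup: every word sharing a prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a letter to the next TrieNode
        self.is_word = False  # True if the path from the root spells a word

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        node = self._find(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, spelled = stack.pop()
            if current.is_word:
                found.append(spelled)
            for ch, child in current.children.items():
                stack.append((child, spelled + ch))
        return sorted(found)

t = Trie(["bio", "metric", "metro", "meter"])
print("metric" in t, "metr" in t, t.with_prefix("metr"))
```

Words that share a prefix share nodes, which is what makes walking to neighbors cheap compared with probing a hash table one guess at a time.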

LSLGuy · February 1, 2017, 1:43am

As a general matter, each PC application, including each brand of browser, has its own spell-check app with its own built-in matching rules and its own separate dictionary.

Many spell checkers consider hyphens the same as spaces. If that applies to your spell-checker, “bio-metric” is NOT a word it recognizes. It’s not even a word it’s trying to check. It’s checking “bio” and finding a match in the dictionary. Then it’s checking “metric” and finding a match in the dictionary. So neither of those two words is highlighted as wrong.
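A small sketch of that behavior, assuming a tokenizer that can either split on hyphens or keep hyphenated runs together. The word list and the `hyphen_splits` flag are hypothetical; the list deliberately lacks "biometric" to mimic what the OP saw.

```python
import re

# Hypothetical dictionary: has "bio" and "metric" but not "biometric".
WORDS = {"bio", "metric", "is", "a", "word"}

def flagged(text, hyphen_splits=True):
    """Return tokens not in WORDS.

    With hyphen_splits=True a hyphen separates tokens, like a space.
    With hyphen_splits=False a hyphenated run is checked as one token.
    """
    if hyphen_splits:
        pattern = r"[A-Za-z]+"
    else:
        pattern = r"[A-Za-z]+(?:-[A-Za-z]+)*"
    return [t for t in re.findall(pattern, text) if t.lower() not in WORDS]

print(flagged("biometric"))                        # checked as one token
print(flagged("bio-metric"))                       # checked as "bio" and "metric"
print(flagged("bio-metric", hyphen_splits=False))  # checked as one token
```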

My android phone’s spell-checker is “smart” enough to include every name and email address in all my contacts as known words. Which means lots of weird useless suggestions.

It’s also real easy (especially on a phone) to inadvertently accept a misspelled word as legit, thereby adding the misspelling to the dictionary.

At least for IE there is a text file which contains your personal additions to the built-in dictionary. You can open and edit it with Notepad to add new words or remove accidentally accepted misspellings.

bump · February 1, 2017, 3:50am

Makes sense based on what I saw- the structure itself makes the finding of suggestions and variants of the word easier.

Can’t say that we ever discussed tries in college; just B-trees and AVL trees.

ftg · February 1, 2017, 2:12pm

Yeah, tries are a good core data structure for something like this.

But to suggest replacements is something else entirely. (And certainly hash tables won’t come anywhere close. Two similar words should end up with very different hashes. The opposite of what you want.)
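To see the hash point concretely, here is a quick sketch using SHA-256 from Python's standard `hashlib`. That is a cryptographic hash, chosen only because its scrambling is easy to see; the hash inside a real hash table is a different function, and the helper names are made up.

```python
import hashlib

def hex_digest(word):
    """SHA-256 digest of the word as a 64-character hex string."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()

def differing_positions(a, b):
    """Count the hex positions where the two digests disagree."""
    return sum(x != y for x, y in zip(hex_digest(a), hex_digest(b)))

print(hex_digest("keyboard"))
print(hex_digest("ketboard"))
print(differing_positions("keyboard", "ketboard"), "of 64 positions differ")
```

One changed letter scrambles the whole digest, so comparing digests tells you nothing about how close the two spellings were.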

There’s been extensive research in how to quickly determine how “close” two strings are. The key is to define what kind of errors you want to deal with: missing letter, extra letter, swapped letters, etc. (In the case of the OP, “-” counts as a letter. You can also consider " " a letter for missing/extra blank errors.*)

There are very nice dynamic programming solutions to compare strings like this that are basically optimal. The real trick is to work this in with tries. This gets deep fast.
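For the curious, a sketch of the plain dynamic programming version, before any trie tricks. This is the textbook "optimal string alignment" distance, where a missing letter, an extra letter, a wrong letter, and a swap of two adjacent letters each cost 1. I'm assuming unit costs; real checkers may weight errors differently.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters
    for j in range(cols):
        d[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # extra letter in a
                d[i][j - 1] + 1,         # missing letter in a
                d[i - 1][j - 1] + cost,  # same or wrong letter
            )
            # two adjacent letters swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]

print(edit_distance("ketboard", "keyboard"))    # one wrong letter
print(edit_distance("teh", "the"))              # one swap
print(edit_distance("biometric", "bio-metric")) # "-" counted as a letter
```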

Pretend you went thru all possibilities of simple errors comparing the typo to strings in a dictionary. But don’t actually do that because you’ll be doing a lot of redundant work. Use a table to keep track of what you’ve done so far. Then figure out how to get rid of almost all of the table since most are hopeless options. Etc.
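Here is one way that idea can look in code, as a sketch rather than what any shipping checker does. It assumes plain Levenshtein errors (missing, extra, wrong letter; no swaps), a trie built from nested dicts, and `"$"` as a made-up end-of-word marker, so it would break on words containing `"$"`. Each trie node gets one row of the table, and a whole branch is dropped once every entry in its row exceeds the limit.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks a word end and stores the word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root

def search(trie, typo, max_dist):
    """Return sorted (distance, word) pairs within max_dist edits of typo."""
    results = []
    first_row = list(range(len(typo) + 1))

    def walk(node, ch, prev_row):
        # Extend the table by one row for the letter ch on this trie edge.
        row = [prev_row[0] + 1]
        for j in range(1, len(typo) + 1):
            row.append(min(
                row[j - 1] + 1,                         # letter missing from the typo's match
                prev_row[j] + 1,                        # letter extra in the dictionary word
                prev_row[j - 1] + (typo[j - 1] != ch),  # same or wrong letter
            ))
        if "$" in node and row[-1] <= max_dist:
            results.append((row[-1], node["$"]))
        # Prune: if no entry is within the limit, nothing below can be either.
        if min(row) <= max_dist:
            for c, child in node.items():
                if c != "$":
                    walk(child, c, row)

    for c, child in trie.items():
        if c != "$":
            walk(child, c, first_row)
    return sorted(results)

trie = build_trie(["keyboard", "keyword", "board", "bored", "key"])
print(search(trie, "ketboard", 1))
```

Words sharing a prefix share the rows computed for that prefix, which is where the redundant work goes away.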

Basically if you’re trying to do something involving comparing/matching strings, dynamic programming is a great way to start off. E.g., KMP string matching is a simplification of a dynamic programming algorithm.

Figuring out if a typo might be two words with a missing blank is the “funnest” type of simple error to deal with.
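A sketch of the no-typo version of that problem, again with dynamic programming. It assumes every piece must be an exact dictionary word and prefers the split with the fewest words. The function name and the word list are made up for the example.

```python
def split_run_on(text, words):
    """Split text into dictionary words, or return None if impossible.

    best[i] holds a fewest-words split of text[:i], or None.
    """
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []  # the empty prefix needs no words
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

words = {"spell", "check", "is", "fun"}
print(split_run_on("spellcheckisfun", words))
print(split_run_on("spellchekc", words))
```

Handling a run-on that also contains a typo means combining this with an edit-distance search, which is where it gets hairy.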

LSLGuy · February 1, 2017, 3:56pm

This is especially interesting. That’s a very common error for me on my phone. And my newest phone (Samsung Galaxy S7) is friggin’ awesome at breaking 2- or even 3-word run-ons correctly. Even with a small typo in the middle.

Before this I’d never had a phone or even PC that even tried to deal with run-ons. Much less succeed.

Now (ref the bad UI thread) if only the phone’s spell-check UI made it harder to accept misspelled words and silently add them to the dictionary to trip me next time.

leahcim · February 1, 2017, 3:56pm

Yeah, I never heard of them until I was out in the “real world” either. Probably because they are not that interesting CS-wise.

ftg · February 1, 2017, 10:38pm

If you’ve seen one trie you’ve seen 'em all.

Chronos · February 2, 2017, 12:05am

I do not think I’ll ever sie
A hash as lovely as a trie.

And one other common sort of error that spellcheck suggestions look for is keys that are near each other on the ketboard. For instance, ‘t’ is close to ‘y’, so “ketboard” there was probably supposed to be “keyboard”.
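A crude sketch of the neighboring-key idea, assuming a QWERTY layout and counting only the keys immediately left and right on the same row as "near". A better model would also count the rows above and below. All names here are invented for the example.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def adjacent(key):
    """Keys directly left and right of a single lowercase key on its row."""
    for row in ROWS:
        i = row.find(key)
        if i != -1:
            return set(row[max(i - 1, 0):i] + row[i + 1:i + 2])
    return set()  # not a letter key on this layout

def fat_finger_fixes(typo, words):
    """Dictionary words reachable by replacing one letter with a neighbor."""
    fixes = set()
    for i, ch in enumerate(typo):
        for alt in adjacent(ch):
            candidate = typo[:i] + alt + typo[i + 1:]
            if candidate in words:
                fixes.add(candidate)
    return sorted(fixes)

print(adjacent("t"))
print(fat_finger_fixes("ketboard", {"keyboard", "keyword"}))
```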

purplehearingaid · February 2, 2017, 1:23am

I can add words to my spellcheck. I had to add my last dog’s name b/c spellcheck kept trying to ‘correct’ it. My dog came with his name and I had to keep it.
:smack: Spellcheck just changed ‘words’ to ‘work’.

ftg · February 2, 2017, 1:45pm

Today’s spell check oddity: I mistyped “I” as “Ii”. Spell check was happy with it. Maybe thinking it was Roman numerals. But mixed case? The spell checker frequently flags incorrect cases on certain types of words like acronyms. It should have caught this and suggested “II” or “ii” instead.

Hmm, let’s check: IVI IIV IV VI VII VIIII. First two and last one are wrong. “VIIII” is wrong? Old fashioned but not wrong.

(And apparently “spellcheck” is invalid as well.)

Tim_T-Bonham.net · February 2, 2017, 5:51pm

Most spellcheckers ignore ‘words’ of only 1 or 2 letters. These are mostly not really words, but someone’s initials or abbreviations, or markers used in outlines or lists of items.

Testing them generates a lot more false positives, which most users find more annoying than the occasional miss of a mistyping. Usually, there is a way to adjust this parameter in your spellchecker program.
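As a sketch of what that parameter might look like, assuming a checker that simply skips any token shorter than a threshold. The name `min_length` and the default of 3 are my assumptions, not a documented setting of any real program.

```python
import re

def flag_words(text, words, min_length=3):
    """Return tokens not in words, skipping tokens shorter than min_length."""
    flagged = []
    for token in re.findall(r"[A-Za-z]+", text):
        if len(token) < min_length:
            continue  # too short to bother checking
        if token.lower() not in words:
            flagged.append(token)
    return flagged

words = {"was", "happy"}
print(flag_words("Ii was hapy", words))                # "Ii" is skipped
print(flag_words("Ii was hapy", words, min_length=1))  # "Ii" is checked
```

With the threshold at 3, the “Ii” from the earlier post sails through unchecked, which would explain why spell check was happy with it.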
