Language filters in Google searches

When you do an advanced search in Google or AltaVista or what have you, selecting language as a filter, how does the search engine determine what language a page is written in?

If the character set were Thai, that might be a clue. But usually the character sets I see are your run-of-the-mill iso-8859-1.

Occasionally, I’ll find
<html lang="fr">
or
<meta name="Language" content="Français">

But on many sites, such as this one, these are not to be found.

Nonetheless, it works. Search for “soleil” with the language set to English and you won’t find the site, but set it to French and it turns up, no problem. Can they be using some sort of word analysis, searching for “the” or other characteristic words of a language?

Because they’re smart, that’s how.

Actually, they have a dictionary of words in various languages, look at the words on a given page (probably via some sort of very efficient hashing mechanism), and figure out the probability that the page is in language X. The meta tags in the HTML are also looked at, if they’re there.
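
To make the dictionary idea concrete, here is a minimal sketch in Python. It is only an illustration of the approach described above, not what any search engine actually runs: the word lists are tiny and hand-picked, and the names COMMON_WORDS and guess_language are made up for this example. Python sets are hash-based, so each word lookup is the kind of cheap hashed check mentioned above.

```python
import re
from collections import Counter

# Hypothetical, hand-picked word lists for illustration only.
# A real system would need far larger dictionaries per language.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "french": {"le", "la", "les", "et", "de", "est", "un", "une", "que"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def guess_language(text):
    """Return (best_language, shares), or (None, {}) if no known word is found.

    shares maps each language to the fraction of dictionary hits it received,
    which is a crude stand-in for "the probability the page is in language X".
    """
    # Split into lowercase runs of letters (Unicode-aware, so accents survive).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)

    # Score each language by how many words on the page are in its dictionary.
    scores = {
        lang: sum(counts[word] for word in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    total = sum(scores.values())
    if total == 0:
        return None, {}

    shares = {lang: score / total for lang, score in scores.items()}
    best = max(shares, key=shares.get)
    return best, shares


if __name__ == "__main__":
    print(guess_language("Le soleil est dans le ciel et la lune est belle"))
    print(guess_language("The sun is in the sky and it is bright"))
```

A page with no dictionary hits at all comes back as None, which is where hints like the lang attribute or meta tags would have to take over.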

Google uses the same type of technology as The Babelfish (May Mr. Adams rest in peace) to identify languages and do translations.

This is unfortunate, because I was hoping there would be a way for me to determine whether the language of a page is, say, Malay or Bahasa, other than serially plugging each language into the search engine to see if it finds the site in question. Any ideas?

There are language-guessing sites. The one I use most often is TextCat. Just plug in the text and it will guess the language. It supports several dozen languages, including Malay and Indonesian. (Bahasa is the same as Indonesian, isn’t it?)
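
As far as I know, TextCat works from character n-gram frequency profiles (the Cavnar and Trenkle method) rather than a word dictionary. Here is a rough Python sketch of that idea. The training samples are two throwaway sentences and every name in it (SAMPLES, profile, distance, guess) is made up for illustration; telling close pairs like Malay and Indonesian apart would need real training text for each language.

```python
from collections import Counter

MAX_SIZE = 300  # assumed cap on profile length; also the penalty for an unseen n-gram


def profile(text, n=3, max_size=MAX_SIZE):
    """Map each of the most frequent character n-grams to its rank (0 = most frequent)."""
    padded = " " + " ".join(text.lower().split()) + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    ranked = [gram for gram, _ in grams.most_common(max_size)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def distance(doc_profile, lang_profile, max_size=MAX_SIZE):
    """Out-of-place distance: sum of rank differences, max penalty if the n-gram is unseen."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_size
        for gram, rank in doc_profile.items()
    )


# Toy training text, for illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and "
               "the sun is shining in the sky today",
    "french": "le soleil brille dans le ciel et les enfants jouent "
              "dans le jardin de la maison",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def guess(text):
    """Return the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(PROFILES, key=lambda lang: distance(doc, PROFILES[lang]))


if __name__ == "__main__":
    print(guess("le soleil brille dans le ciel"))
    print(guess("the sun is shining in the sky"))
```

Because it looks at letter sequences instead of whole words, this kind of guesser needs no word list at all, just sample text for each language.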

Try doing a Google search on “Press Briefings”. The first link is to the White House Press Briefings. It also conveniently offers to translate them from the original German. I believe it’s because the Press Secretary is named Ari Fleischer, and “Fleischer” is a German word. Cute, eh?

Which proves friedo’s right once again.