What is the best way to determine what language an online text is in?

I was searching for something today and ended up at a Wikipedia site in an unknown language. The language was very similar to one I knew, but I could tell that it was not the language that I knew. I had to do a bit of searching to finally figure it out.

For those of you that like puzzles, here is a sample page in that language:

and the language is, I believe,

So suppose I find a text in an unknown language. What is the easiest way to determine what language the text is in? Is there a website where I can type in (or copy and paste) some text, and be told the language for that text?

There are sites to identify languages. I’ve never used one.

I don’t know of a general way that is very reliable, and it’s going to be harder for relatively minor languages. That page is clearly in a Romance language, and I did some snooping to confirm your guess that it is, probably, in Tarantino – a language that I hadn’t heard of before.

One method, which probably won’t work for this text, is to look up a few words in the Wiktionary, since that doesn’t require you to know the language of a word before looking it up. But, for example, that page didn’t know about “ottommre”, used in the text to (probably) mean October.

This page has a list of the most common tools. None of them recognise Tarantino, though.

Agreed that the “language identifier” sites work best for national languages, and poorly if at all for dialects. Whatever process you went through to identify it as Tarantino is probably the best you can do. Sometimes I scan a site for common words, and then type them in alongside distinctive English words like “through” and hope I get a bilingual page to clarify things.

If you happen to find it on a Wikipedia page you can (as I did in this case) use the inter-wiki language links in the lower left of an article page.

First I clicked over to the English version of the page (http://en.wikipedia.org/wiki/Denis_Diderot).

I then did ‘View Source’ and searched for the URL fragment of the mystery Wikipedia language (“roa-tara”) and found the mirror link back to the mystery language:


<li class="interwiki-roa-tara"><a href="http://roa-tara.wikipedia.org/wiki/Denis_Diderot">Tarandíne</a></li>

Which is the name of the language in its native tongue. A quick Google search should turn up the name of the language in English.
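For anyone who would rather not eyeball the page source, the same lookup can be roughly sketched in Python. This is only a sketch: it assumes the page marks its language links as <li class="interwiki-CODE"> items, as in the snippet above (Wikipedia's markup may well differ now), and the names InterwikiParser and interwiki_links are made up for illustration.

```python
from html.parser import HTMLParser


class InterwikiParser(HTMLParser):
    """Collects (code, href, label) from <li class="interwiki-CODE"><a ...> items."""

    def __init__(self):
        super().__init__()
        self.links = []      # results: (code, href, label) tuples
        self._code = None    # interwiki code of the <li> we are inside, if any
        self._href = None    # href of the <a> we are inside, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Assumption: the class looks like "interwiki-roa-tara".
            for cls in (attrs.get("class") or "").split():
                if cls.startswith("interwiki-"):
                    self._code = cls[len("interwiki-"):]
        elif tag == "a" and self._code is not None:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip()
            self.links.append((self._code, self._href, label))
            self._href = None
        elif tag == "li":
            self._code = None


def interwiki_links(html):
    """Return every interwiki link found in an HTML string."""
    parser = InterwikiParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the saved source of the English article and looking for the code "roa-tara" in the result would give you the native name of the language, which you can then search for.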

That’s pretty good. Normally I would have just recommended looking up the ISO code that starts all wiki addresses, but Tarantino is such a small language that it does not have an official one. Technically, Wikipedia doesn’t allow this, but there was a big fight, and it couldn’t get shut down.

As for the OP: yeah, it’s really hard to identify a language that is spoken by so few people unless you speak it, or are a language expert. The URL is probably the best clue you had. Especially since it appears to have been translated by bot (note all the red links).

Isn’t the Tarantino language mostly obscenities and extreme violence? :smiley:

In The Story of Language by Mario Pei, he gives a number of pointers for various languages that can allow you to identify the language of a text using a quick visual scan. There are certain distinguishing features for each one. For example, double accents are unique to Hungarian. Of course, he only covered the more familiar languages of the world. For really abstruse ones like in the OP you need more specialized acquaintance to pinpoint them. But a quick glance told me instantly that this example was from some region of Italy, and I have enough familiarity with Italian regional idioms to recognize it as obviously Southern Italian. My first thought was Neapolitan, but I wasn’t very far off.

Japanese text is easy to tell apart from Chinese at a glance, even if you can’t understand a single character. Japanese has a very frequent occurrence of syllabic glyphs belonging to the hiragana and katakana character sets. Kana are much simpler in form than most Chinese characters. Japanese has a lot of them, while they’re absent in Chinese.
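The kana test is simple enough to automate. Here is a rough Python sketch that just looks at Unicode code points; the function names are mine, and the ranges used are the Hiragana block (U+3040 to U+309F), the Katakana block (U+30A0 to U+30FF), and the main CJK Unified Ideographs block (U+4E00 to U+9FFF). It is a heuristic only: a very short Japanese snippet written entirely in kanji would be misread as Chinese.

```python
def has_kana(text):
    """True if the text contains any hiragana or katakana character."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def has_han(text):
    """True if the text contains any character from the main CJK ideograph block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def guess_cjk(text):
    """Very rough guess: kana present means Japanese, ideographs alone mean Chinese."""
    if has_kana(text):
        return "probably Japanese"
    if has_han(text):
        return "probably Chinese (no kana found)"
    return "no CJK characters found"
```

The longer the sample, the better this works, since ordinary Japanese prose of any length will almost certainly contain some kana.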

For languages written in the Arabic alphabet, remember that Arabic, being the original user of this alphabet, has fewer letters. Other languages adapted it by creating additional letters to cover their sounds that don’t exist in Arabic. Persian has 4 additional letters. Pashto, Sindhi, and Urdu have a lot more, and a different character set for each language. Recognizing them does, however, require a more fine-grained acquaintance with their various features. If you already know some Arabic, the differences will jump out at you immediately, but the catch is you have to know some Arabic first.
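If you don't know any Arabic, a program can do the spotting for you. Below is a minimal Python sketch that looks for the four letters Persian added (pe, che, zhe, gaf). The function name is invented for the example. Keep in mind that these four letters are also used by Urdu and other languages, and now and then turn up in Arabic text for foreign words, so a hit only suggests "probably not plain Arabic" rather than "definitely Persian".

```python
# The four letters Persian added to the Arabic alphabet, by code point:
# U+067E pe, U+0686 che, U+0698 zhe, U+06AF gaf.
PERSIAN_EXTRA_LETTERS = {"\u067e", "\u0686", "\u0698", "\u06af"}


def persian_extras_found(text):
    """Return the set of Persian-added letters that occur in the text."""
    return {ch for ch in text if ch in PERSIAN_EXTRA_LETTERS}
```

An empty result on a long text is a hint toward Arabic; a non-empty one tells you to look further at Persian, Urdu, Pashto, and so on, each of which would need its own list of distinguishing letters.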

In general, I would start by first identifying the language family, and then picking out more specific clues that distinguish languages within the family, and gradually narrow it down. As much as possible, develop a wide knowledge base by learning to recognize the specific identifying features of each language.

My favorite method for identifying the language of a written text, in conjunction with the above principles, is to search sample words in Google, and you can almost always get useful clues that way. Sometimes a search hit will even name the language outright. It helps to pick out the words or phrasings that seem less frequent, and hence more peculiar to the specific language.

If I want to identify the ethnicity of someone’s name, I search for it as a keyword in the Library of Congress database, and often it comes up in connection with a specific language or culture. Context usually provides a wealth of information. It’s all about context.

The Library of Congress doesn’t recognise a language called Tarantino. I would just call it Italian.

ETA Neither does Ethnologue, which, in my opinion, is very prone to call various dialects languages instead.

I think you meant to say double acute accents, since using double (i.e., two different) accents on the same letter is not done in Hungarian, but is done in some other more widely spoken languages such as Vietnamese. (Actually, though, double acute accents are not limited to Hungarian; they’re used in the Chuvash language, and in some transcription systems.)

And Estonian.

If it’s on a website, check the website itself for clues. (X)HTML tags and HTTP headers may include language metadata (in the form of two- or three-letter ISO 639 codes). Many multilingual websites, such as Wikipedia, will alternatively use ISO 639 codes as part of the URL to distinguish copies of the same page in different languages.
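As a rough illustration of the metadata check, here is a Python sketch that pulls language hints out of a page's HTML using only the standard library. The class and function names are hypothetical. It looks at lang and xml:lang attributes on any tag and at a <meta http-equiv="Content-Language"> tag; the Content-Language HTTP response header is a separate place to look and is not covered here. Plenty of pages omit all of these or fill them in wrongly, so treat the result as a hint.

```python
from html.parser import HTMLParser


class LangHintParser(HTMLParser):
    """Collects (where, value) pairs for language metadata found in HTML."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # lang / xml:lang attributes, usually found on the <html> tag.
        for key in ("lang", "xml:lang"):
            if attrs.get(key):
                self.hints.append((f"<{tag} {key}>", attrs[key]))
        # <meta http-equiv="Content-Language" content="...">
        if tag == "meta":
            equiv = (attrs.get("http-equiv") or "").lower()
            if equiv == "content-language" and attrs.get("content"):
                self.hints.append(("meta content-language", attrs["content"]))


def language_hints(html):
    """Return all language hints found in an HTML string."""
    parser = LangHintParser()
    parser.feed(html)
    parser.close()
    return parser.hints
```

Whatever code comes back can then be looked up in an ISO 639 table, with the caveat already mentioned in this thread that some small languages use unofficial codes.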

You can also use the top-level domain to help guess at the language of the web page. They usually encode countries, and not languages, but knowing the country of origin can be a starting point at least. For example, if you see a bunch of Cyrillic text on a .bg site, it’s more likely to be Bulgarian than, say, Ukrainian.
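Both URL clues (the language code in front of the site name and the country code at the end) are easy to pull apart in code. Here is a small Python sketch; the function name and the tiny country table are just illustrative assumptions, and a real table would need every country-code domain.

```python
from urllib.parse import urlparse

# Tiny illustrative table only; not a complete list of country-code domains.
TLD_COUNTRIES = {"bg": "Bulgaria", "ua": "Ukraine", "it": "Italy", "hu": "Hungary"}


def url_hints(url):
    """Split a URL's host into its top-level domain and leading subdomain."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    tld = labels[-1]
    # Only treat the first label as a subdomain if there are at least three labels,
    # e.g. "roa-tara" in "roa-tara.wikipedia.org".
    subdomain = labels[0] if len(labels) > 2 else None
    return {"tld": tld, "country": TLD_COUNTRIES.get(tld), "subdomain": subdomain}
```

For the Wikipedia page in this thread the useful part is the subdomain ("roa-tara"); for a site like the .bg example it is the country. Neither proves anything on its own, since "www" is also a subdomain and a .bg site can host text in any language.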

Estonian is not more widely spoken than Hungarian. And as far as I know it doesn’t use multiple accents simultaneously on the same glyph.

If you go to http://translate.google.com/, select English as the target language, then select “detect language” as the source, it usually gets it right.

However, it doesn’t tell you what the feckin’ language was. :smack:

Sorry, it was my poor eye-sight that made a tilde look like a double acute accent.

Yes, I considered specifying “double acute accents” as I was writing that, but then when writing for an audience that includes many nonlinguists, there’s always a question of balancing precision in technical terminology vs. readability for the uninitiated. It was a judgment call. But I’ll grant you that “gotcha.”

Chuvash is written in Cyrillic, so it isn’t in danger of being confused with Hungarian, which kind of goes without saying.
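To make the point concrete, here is a quick Python sketch of that check: find letters with a double acute accent and keep only the Latin ones, so that Cyrillic letters such as the Chuvash one are ignored. It leans on the official Unicode character names via the standard unicodedata module; the function name is made up. A hit suggests Hungarian but does not prove it, given the transcription systems mentioned earlier.

```python
import unicodedata


def latin_double_acute_letters(text):
    """Return the set of Latin letters with a double acute accent found in the text."""
    found = set()
    # NFC turns "o" + combining double acute (U+030B) into the single letter first.
    for ch in unicodedata.normalize("NFC", text):
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN") and "DOUBLE ACUTE" in name:
            found.add(ch)
    return found
```

Run on a Hungarian name like "Erdős" it picks out the ő, while Cyrillic text with a double acute comes back empty because the character name starts with CYRILLIC instead of LATIN.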

I didn’t say to search language texts in LoC (I recommend Google for that), but personal names, because LoC metadata is highly developed for personal names in particular.

Especially since as this map shows, Taranto* and Naples are in the very same dialectal area. The map shows a lot of dialects, but it isn’t granular enough to include Taranto dialect, which is just one variant of Pugliese. Are we going to wind up with a separate dialectal Wikipedia for every damn city in Europe?
*Firefox spellchecker wants me to change the city’s name to “Tarantula” <snerk> Of course.

I look forward to the Glaswegian Wikipedia.

Will this do for the time being?

Not far off.