Does the English language have many more words than, say, Hindi/Urdu?

I frequently see claims to the effect that English has by far the largest vocabulary of all languages, given that it has over 170,000 words theoretically in current use (although something over 98% of all written English texts employ a vocabulary of less than 20,000 words).

However, the comparisons I’ve seen of the vocabulary size of English with other languages all invoke competitors where, IMHO, English is punching significantly below its weight. Yes, I’m not surprised that English may be approximately twice as big, vocabulary-wise, as Spanish or German or French, say. But Spanish and German and French are not what spring to my mind when I think of languages with large vocabularies.

The reasons commonly adduced to explain the unusually large size of English vocabulary generally include the following:

  • major influence from multiple language families, especially its fundamental mix of Germanic and classical tongues;

  • hundreds of years of exposure to and borrowing from other languages due to Anglophone political and cultural influence worldwide;

  • a large number of native speakers and second-language speakers;

  • absence of formal academic oversight of its linguistic development.

French and German certainly aren’t comparable to English in these respects, but it’s not clear to me that other languages aren’t. In particular, I would think that Hindi/Urdu would have many similar factors favoring large vocabulary size. E.g., it has two separate major linguistic influences from extremely prolific ancient languages: Persian (itself a hybrid of two distinct linguistic streams from different families, Arabic and Middle Persian) and Sanskrit (also hybridized, from proto-Indo-Aryan and Dravidian and other South Asian language families). Hindi/Urdu has also borrowed like crazy from more recent linguistic sources, and has a huge number of speakers, many of whom enrich its content with words from other languages.

However, I can’t seem to find an authoritative source on the size of Hindi/Urdu vocabulary (although I’ve seen it stated without cite or explanation as 120,000 words), nor can I find any explicit comparison between English and Hindi/Urdu vocabulary size. Anybody got the Straight Dope?

Please do not ever again conflate Hindi and Urdu; they are separate languages!

Re the OP: the answer is that Urdu has a much larger vocabulary than English. Of course, the reason for this is that Urdu began life as an amalgamation* of many tongues, and to this day every Farsi word is also an Urdu word.

*The official term for it was in fact Zaban-e-Urdu: “the language of the army”.

Lots of Persian expressions have been disqualified from modern Hindi, just as tons of tatsama (loans from Sanskrit) were ruled out of Urdu. Nowadays you won’t find the full extent of combined vocabulary on either side. The place to go for the consolidated vocabulary would be a monumental work dating from before the split: the Dictionary of Urdū, Classical Hindī, and English (1884) by J. T. Platts. Which is one huge-ass book. I don’t have a copy of my own, though.

My Standard Twentieth Century Dictionary: Urdu into English (Delhi, 1980) claims it has “Over 50,000 words, phrases, and proverbs used in spoken and literary Urdu.” Obviously if you don’t count the phrases and proverbs, the number of words will fall short of 50,000. I’ll say one thing for this dictionary, which I’ve had for over 20 years now, it’s rarely been stumped. However, when a word isn’t found in it, I can get it either from the literary Persian-English dictionary by Franz Steingass, or from the Hindi dictionary. One or the other.

I searched through the front matter of my Oxford Hindi-English Dictionary, but they aren’t letting on how many words they stuffed into it.

My money is on Platts, if you can access a copy of it somehow. Good luck.

From Wiki:
Standard Urdu is mutually intelligible with Standard Hindi. Both languages share the same Indic base and are so similar in phonology and grammar that they appear to be one language.[5]


Because of religious nationalism since the partition of British India and continued communal tensions, native speakers of both Hindi and Urdu frequently assert them to be completely distinct languages, despite the fact that they generally cannot tell the colloquial languages apart.

As to the size of vocabulary, the estimate I see online is 150,000 words, which is approximately equal to English.

This is ahistorical. The split between them is relatively recent. And IIRC the split was instigated by British imperial divide-and-rule policies. For centuries they were a shared language called Khaṛī Bolī or Hindūstānī. I think Khaṛī Bolī refers more to spoken language (bolī means ‘speaking’), with Hindūstānī referring more to the literary language. If Platts is any evidence, the combined language could be and was written equally well in both Devanāgarī and Perso-Arabic Nasta‘līq. For historical political reasons, the Perso-Arabic script was dominant throughout the time of the Mughal Empire and perhaps also the Delhi Sultanate.

Hindi and Urdu are not even two dialects of the same language; they are two divergent elaborations of one and the same dialect, Khaṛī Bolī, native to the Delhi-Agra area. I mean Hindi in the narrow sense, the official standard language of India. Hindi in the wider sense covers a vast area of dialects that are not mutually comprehensible. In the latter sense you could technically say that Hindi is not even the same language as Hindi. But aside from linguists, I think everybody pretty much understands the name Hindi in the narrow sense.

It might have been the case 60 years ago. I don’t know. However, today I, as a native speaker of Urdu, am stumped about half the time when watching Indian television. Formal Hindi is even more difficult. I also know that cross-border movies, songs, and drama serials, of which there are a lot, have to be carefully scripted lest they become incomprehensible.

More like 100 years ago is when they were, quite deliberately and with malice aforethought, split apart because of communal politics. As noted above, this was instigated by the British for their divide-and-rule strategy. They put people to work with scissors (metaphorically), cutting Persian words out of the Hindi dictionary and pasting in Sanskrit tatsamas to replace them. It was manmade tinkering, not natural language evolution. It has only partially taken. Lots of Persian expressions that the policymakers tried to eliminate have remained in popularly spoken Hindi. The simpler the level of language being used, the more similar and even identical Hindi and Urdu get. The more prestigious, learned, and specialized the register, the more Urdu veers away from native Indic speech and uses literary Persian expressions, and the more Hindi gets away from both native speech and Persian, using Sanskrit vocabulary. It used to be that Bollywood Hindi films were a broad linguistic common ground. Maybe in recent years the gulf has been widening there too; I don’t keep up with them like I used to.

When I first learned Urdu/Hindi and watched a lot of Bollywood, I noticed specifically that the songs used a vocabulary best described as Urdu. It’s no wonder, considering the weight of centuries of Urdu poetic tradition that filtered down into film songs, and the fact that Muslims are heavily represented in the entertainment business in India, especially in music. Who are the two most renowned film music composers in India? Naushad and A. R. Rahman. Who are the two most beloved playback singers of all time? Lata Mangeshkar and Muhammad Rafi. That makes 3 out of 4 Muslims at the topmost level of the film music biz.

Re the British, I don’t really think that was the case. If anything, the number of people speaking Urdu and Hindi has increased since 1947. Before that you would know your own regional language, then English, and then Urdu or Hindi. Look at Gandhi and Jinnah: both were Gujarati speakers and their professional language was English. Neither had a good command of Urdu or Hindi.

The actions of the language tinkerers 100 years ago (which is what I was talking about) have nothing to do with the total number of speakers today. The latter, post-1947, I’d attribute to the increase in elementary school attendance levels, national language policies intended to promote national integration using official languages, and most of all the huge growth of mass media.

The OP omitted this paragraph from the link:

Most people treat the OED as the authoritative source. Yet the Second Edition mentioned in the link is hopelessly obsolete; omitted all words that its editors found only a single mention of - except when certain favored authors used them; neglected to research the vast majority of writers because they weren’t highbrow enough; was almost comically ignorant of technical and scientific terms, along with various trade argots, slang, and dialects; took totally unrepresentative samplings of usage from English-speaking countries outside the UK when they bothered to notice them at all; disdained any use of pidgins, creoles, and English as a second or world language; and corrected only a bare minimum of the millions of errors in the First Edition. The Third Edition will correct many of these flaws, but the Internet overflows its banks daily and the OED can’t catch up, because the Internet is mostly English as she is spoke by several billion speakers and writers who the editors would have committed suicide before allowing to sully the pages of their precious volume.

And that’s not even getting at a problem that the early editors of the OED barely even realized was an issue. Everyday language and schooling treat a “word” as a low-level, easily understood entity. Lexicographers can’t do this. Semanticists can’t do this. Historical linguists can’t do this. When the most basic count varies by 400-500%, a “word” can’t be pinned down exactly any more than the simultaneous momentum and position of an electron can be.

A vast territory lies between 170,000 and three quarters of a million. The meaning of “word” itself breaks down into technicalities. The way different languages treat the slices that occur between spaces also varies so much that comparisons between and among languages for the technical definition of a word are matters for academic debates that last lifetimes. That way lies madness.

Get out now while you can still save yourself!

According to a linguistics professor who was asked the question on a radio program, there is absolutely no way whatsoever to say how many words a given language has.

As Johanna has explained, this position is a political one, not a linguistic one. Hindi and Urdu are two standardized registers of one language, known either as Hindustani or Hindi-Urdu.

The Hindi used in television news is a different standardized register of Hindi-Urdu. However, the movie and music industries use a form of Hindi-Urdu largely intelligible across borders, and it doesn’t require much effort to keep it that way.

Then every Sanskrit word, and every Urdu word, and every Farsi word is also a Hindustani word. It works both ways.

More precisely, “language of the camp.”

“Urdu” being cognate with “horde,” you might also say “language of the horde.”

Lata Mangeshkar is (at least nominally) a Hindu, as is her sister, Asha Bhosle, who should certainly be on your list.

I think that’s why Johanna said “3 out of 4,” the three being Naushad Ali, A. R. Rahman (a convert to Islam), and Mohammed Rafi.

I implicitly acknowledged Lata being non-Muslim by saying “3 out of 4” on my list were Muslims. But I could have phrased that more clearly, so it isn’t your bad if it was hard to parse. If I’d named the three most beloved singers, of course Asha would have been a close third.

Oh, I see. I didn’t know Rahman was a Muslim. Sorry, Johanna.

Of course, A. R. Rahman is a native Tamil speaker, so it’s meaningless to cite him on a question of Urdu. I ought to have thought of that. :smack:

Well, even I as a native speaker of American English can easily be stumped when watching UK television using unfamiliar dialects of British English. (And not necessarily rare or obscure dialects, either.) Different dialects or variants within a single language can evolve very rapidly in different directions while still being technically considered the same language.

Anyway, bahut shukriya and dhanyavad to all respondents. While I quite agree that it’s hopeless to attempt to pin a precise number on the size of the vocabulary of almost any language, this discussion has reinforced my suspicion that English is not as utterly exceptional in its huge vocabulary size as is often claimed. In particular, English vocabulary compared to that of other European languages looks a lot bigger than it does when compared to Hindi/Urdu.

Now I’m wondering: are there other languages whose vocabulary size is closer to that of English than that of, say, French or German?

The huge vocabulary of English is predicated on a size of 750,000 to 1,000,000 “words,” which is far greater than that claimed for any other language. Nothing in this thread should have led you to believe otherwise.

I’m a little surprised that German is being discounted. It’s famously accretive.