Is there a program that will measure the vocabulary of a piece of text?

Sutremaine · January 28, 2007, 12:54am

By vocabulary, I mean anything from every unique word (plurals and conjugations would be counted separately) to every unique root and excluding common words such as the, and, a, etc. This doesn’t need to be particularly rigorous, as I’m just curious about something I’m writing at the moment.

As a somewhat related question, how was it determined what Shakespeare’s vocabulary was?

Shagnasty · January 28, 2007, 1:07am

You could do it very roughly in MS Access if you have the program installed and just a touch of skill. It would involve saving your document as text, and then importing the whole file into MS Access using a space as the field delimiter. That would give you records with just one word in each. A query that counts the unique words (any difference at all counts as a unique one here) could be done in a few seconds. Special characters might need to be cleaned up before the file is imported into Access, however.

I admit it is crude but it can be done that way fairly easily.
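For anyone without Access, the same crude idea (split the text into one token per record, then count the distinct ones) can be sketched in a few lines of Python. This is only an illustration, not what Access does internally; the function name is made up for the example, and it splits on any whitespace rather than on single spaces only.

```python
from collections import Counter


def count_distinct_tokens(text):
    """Split on whitespace and count each distinct token exactly as written.

    No cleanup is done, so "dog" and "dog." are counted separately,
    just like the uncleaned Access import described above.
    """
    tokens = text.split()
    return Counter(tokens)


sample = "The cat saw the dog. The dog saw the cat!"
counts = count_distinct_tokens(sample)

# Number of distinct tokens, then the most frequent few with their counts.
print(len(counts))
for token, n in counts.most_common(3):
    print(token, n)
```

Because nothing is normalized, capitalization and trailing punctuation inflate the count, which is why cleaning up special characters first matters.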

Sutremaine · January 28, 2007, 1:36am

Could I do that with the database program that comes with Works? It’s been a while since my high school lessons, and even with the program’s help files I failed to create any database at all (I ended up creating the final file as a spreadsheet, so the incentive to learn how to do so went away).

Shagnasty · January 28, 2007, 1:44am

You might be able to get it to work or you can just e-mail it in text form to me as plain as possible and I could do it in a few minutes.

Back about 1995 or so, the Mac version of Word used to have a vocabulary and writing complexity measure but I haven’t seen anything like that in years. There must be one somewhere but I don’t know where.

Sutremaine · January 28, 2007, 2:08am

Thanks for the offer. I’ll see if I can get it working at my end first, as it’s an ongoing thing.

Little_Nemo · January 28, 2007, 3:29am

There are programs that do this - I’ve seen works discussing the results of analyses performed by them. But I don’t know if any are readily available online or commercially.

alterego · January 28, 2007, 4:36am

GNU Style and Diction.

AThingWithFeathers · January 28, 2007, 4:45am

In MS Word, if you go to Tools>Options>Spelling and Grammar, then check the box that says “show readability statistics”, then every time you do a spell check it will roughly assess the grade reading level.

Sutremaine · January 28, 2007, 2:01pm

I don’t have Word, just the Works Word Processor. It shows the word count, but not much else.

alterego, I’ve tried getting Style and Diction to work. I’ve now got a bunch of files in a specially-created folder, but I don’t know if I need to do anything else, or how to use the commands. Do I need to have something else installed to get the commands working?

alterego · January 29, 2007, 6:13pm

You need to be running Linux… If you are using Windows, you can use Cygwin.

CookingWithGas · January 29, 2007, 7:07pm

Search on “Porter Stemmer”, which takes a piece of text and lops the endings off words to leave their roots, so that you get a vocabulary of roots (many of which look weird, though). There are free versions of it available, but probably just source. I used a Java one but had to compile it to byte code myself.

The next step is to take the output from the stemmer and just find the unique occurrences of each. That one I don’t have a quick and easy off-the-shelf answer for.
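A rough sketch of both steps in Python, for illustration only. The suffix stripper below is a deliberately tiny stand-in and is not the Porter algorithm (which applies a much longer, ordered set of rules); the suffix list and function names are assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny suffix list. The real Porter stemmer
# uses many more rules and conditions than this.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def unique_stems(text):
    """Lowercase the text, pull out alphabetic runs, stem them, and dedupe."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(w) for w in words}


stems = unique_stems("Walking, walked, walks: the walker walks on.")
print(sorted(stems))
print(len(stems))
```

Putting the stems into a set is what does the "find the unique occurrences" step; a real stemmer's output could be deduplicated the same way.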

Shalmanese · January 29, 2007, 10:31pm

Access is using a hammer to hit a fly. This is what Unix text processing tools are designed to do. If you have a local Unix/Linux nerd, ask him to show you how to do basic text filtering (cat, grep, cut, sort, uniq, head, tail, sed). It’s quite a useful skill to have in general, and if you install Cygwin, you can do it from right inside Windows.

Alive_At_Both_Ends · January 29, 2007, 10:45pm

The only way you could even guess this would be to count all the unique words in his published works, but even this would only give you a lower bound. There must have been many words which he knew but never used in his writings. So the real answer is that the size of Shakespeare’s vocabulary hasn’t been determined at all.

Stan_Doubt · January 29, 2007, 11:56pm

I was doing a project where something similar was required and modified this macro to do what I needed. I doubt that Works supports macros, but you could probably find a machine with Word to run it.

I’m certainly not disputing the other posts here; there are much easier ways to do what you want. I’m offering this as a suggestion because you probably know more people with Word than you know Linux/UNIX nerds.

alterego · January 30, 2007, 1:12am

Regarding vocabulary size, just last week our first assignment in Natural Language Processing was to determine the size of our vocabulary. This might sound trivial, but it was designed to make a point. When someone says “how many words do you know?”, what exactly do they mean by the words “word” and “know”? You can beat this to death all day, but psycholinguists recognize a basic distinction between words that you might recognize, and words you might use. So for my assignment, I wrote a Python script that downloaded a random word from the OED and presented it to me. It tallied statistics and estimated my vocabulary from the fraction of sampled words I knew and the total size of the OED, which is 301,100 words. I knew 72% of the sampled words, which comes out to 216,792 words. But this result is absolutely inconclusive. The OED has a ton of antiquated words that are no longer in use. It also documents things that you might not consider a word, or words you have never seen before but can easily generalize to.

The upshot here is that until you rigorously operationalize what you mean by the size of someone’s vocabulary and then test that person on your standardized corpus, this question has no factual answer. Or rather, this is your factual answer.
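The arithmetic behind that kind of sampling estimate can be sketched as follows. This is not the original script: the function name is invented, the margin uses a plain normal approximation, and the sample size of 100 is a hypothetical stand-in, since the post gives only the 72% figure and the 301,100 total.

```python
import math


def estimate_vocabulary(known, sampled, dictionary_size):
    """Scale the known fraction of a random sample up to the whole dictionary.

    Returns (estimate, margin), where margin is a rough 95% interval
    half-width from the normal approximation to the binomial.
    """
    p = known / sampled
    std_err = math.sqrt(p * (1 - p) / sampled)
    estimate = p * dictionary_size
    margin = 1.96 * std_err * dictionary_size
    return round(estimate), round(margin)


# 72% known and 301,100 entries come from the post. The sample size of 100
# is hypothetical; the post does not say how many words were drawn.
estimate, margin = estimate_vocabulary(known=72, sampled=100, dictionary_size=301_100)
print(estimate, "+/-", margin)
```

The margin only covers sampling error. It says nothing about the harder problem raised above, namely what counts as a "word" and what counts as "knowing" it.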

alterego · January 30, 2007, 1:27am

Whoops. For those interested, here’s that random word link: http://dictionary.oed.com/cgi/entry/lfw.

And here’s the output of running GNU Style on Shakespeare’s complete works (Did I mention I am also in a Shakespeare class right now? :):

readability grades:
  Kincaid: 2.4
  ARI: 2.7
  Coleman-Liau: 8.3
  Flesch Index: 95.5
  Fog Index: 5.2
  Lix: 23.2 = below school year 5
  SMOG-Grading: 6.1
sentence info:
  3753895 characters
  919080 words, average length 4.08 characters = 1.20 syllables
  94490 sentences, average length 9.7 words
  55% (52027) short sentences (at most 5 words)
  12% (12253) long sentences (at least 20 words)
  1 paragraphs, average length 94490.0 sentences
  10% (10073) questions
  20% (19674) passive sentences
  longest sent 282 wds at sent 68853; shortest sent 1 wds at sent 47
word usage:
  verb types:
  to be (25640) auxiliary (18843)
  types as % of total:
  conjunctions 6% (53740) pronouns 16% (142545) prepositions 9% (81512)
  nominalizations 1% (5918)
sentence beginnings:
  pronoun (14321) interrogative pronoun (5205) article (3254)
  subordinating conjunction (1845) conjunction (3221) preposition (2647)

And my quick and dirty analysis of how many unique words he used is 83,391, but realize that the actual number is likely to be significantly smaller.

Shalmanese · January 30, 2007, 2:33am

A slight refinement to that might be to strip common suffixes such as “ing” or “ed” before counting, so that different forms of the same word are only counted once.

alterego · January 30, 2007, 3:10am

I’m not going to go that far, but here is an improvement:


cat shaks12.txt | tr " " "\n" | sed -r 's/.*(\b\w+\b).*/\1/' | sort | uniq | wc -l

Returning 29,156 words. Not bad, considering that this sophisticated program returns 27,505 words.

Here’s a Google Search that brings up other folks’ estimates. Some of the exact ones are 29415, 20933, 26146, 27780, 27505, 26344, 27730, 29066, with an average of 26,936 words.
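For anyone without Cygwin, here is a rough Python analogue of that kind of count. It is not an exact port of the pipeline: it folds case, keeps apostrophes inside words, and can optionally leave out common words, as the original question asked. The function name and the short stop list are hypothetical and only there for illustration.

```python
import re

# Hypothetical short stop list; extend it, or pass an empty set to keep everything.
COMMON_WORDS = {"the", "and", "a", "an", "of", "to", "in"}


def vocabulary(text, exclude=COMMON_WORDS):
    """Return the set of distinct lowercase words, minus the excluded ones."""
    # Runs of letters and apostrophes, so forms like "strain'd" stay whole.
    words = re.findall(r"[a-z']+", text.lower())
    # Drop quote marks left hanging at the edges of a word.
    words = (w.strip("'") for w in words)
    return {w for w in words if w and w not in exclude}


line = "The quality of mercy is not strain'd; it droppeth as the gentle rain"
print(len(vocabulary(line)))               # common words left out
print(len(vocabulary(line, exclude=set())))  # every distinct word
```

To run it on a whole book, read the file's contents into a string and pass that to `vocabulary`. The count will differ from the shell result because the tokenizing rules are not the same, which is also one reason the published estimates disagree with each other.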
