Is there a program that will measure the vocabulary of a piece of text?

By vocabulary, I mean anything from every unique word (with plurals and conjugations counted separately) to every unique root, excluding common words such as the, and, a, etc. This doesn’t need to be particularly rigorous, as I’m just curious about something I’m writing at the moment.

As a somewhat related question, how was it determined what Shakespeare’s vocabulary was?

You could do it very roughly in MS Access if you have the program installed and just a touch of skill. It would involve saving your document as text, and then importing the whole file into MS Access using a space as the field delimiter. That would give you records with just one word in each. A query that counts the unique words (any difference at all counts as a unique one here) could be done in a few seconds. Special characters might need to be cleaned up before the text is imported into Access, however.

I admit it is crude but it can be done that way fairly easily.
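
For anyone without Access, the same crude count (split on spaces, then count the distinct entries) can be sketched in a few lines of Python. This is only a minimal sketch; the function name and sample sentence are made up for illustration.

```python
def count_unique_words(text):
    """Count distinct space-separated tokens.

    Any difference at all (case, attached punctuation) makes a token
    count as a separate word, just like the crude query described above.
    """
    # split() with no argument splits on any run of whitespace, which is
    # slightly more forgiving than a single-space field delimiter.
    return len(set(text.split()))


sample = "The cat sat. The cat ran, and the dog sat."
print(count_unique_words(sample))  # prints 7
```

Note that "The" and "the" count separately here, as do "sat." and a bare "sat", which is why cleaning up special characters first matters.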

Could I do that with the database program that comes with Works? It’s been a while since my high school lessons, and even with the program’s help files I failed to create any database at all (I ended up creating the final file as a spreadsheet, so the incentive to learn how to do so went away).

You might be able to get it to work or you can just e-mail it in text form to me as plain as possible and I could do it in a few minutes.

Back about 1995 or so, the Mac version of Word used to have a vocabulary and writing complexity measure but I haven’t seen anything like that in years. There must be one somewhere but I don’t know where.

Thanks for the offer. I’ll see if I can get it working at my end first, as it’s an ongoing thing.

There are programs that do this - I’ve seen works discussing the results of analyses performed by them. But I don’t know if any are readily available online or commercially.

GNU Style and Diction.

In MS Word, if you go to Tools>Options>Spelling and Grammar, then check the box that says “show readability statistics”, then every time you do a spell check it will roughly assess the grade reading level.

I don’t have Word, just the Works Word Processor. It shows the word count, but not much else.

alterego, I’ve tried getting Style and Diction to work. I’ve now got a bunch of files in a specially-created folder, but I don’t know if I need to do anything else, or how to use the commands. Do I need to have something else installed to get the commands working?

You need to be running Linux… If you are using Windows, you can use Cygwin.

Search on “Porter Stemmer”, a program that takes a piece of text and lops the endings off words to leave roots, so that you get a vocabulary of roots (many of which look weird, though). There are free versions of it available, but probably just as source. I used a Java one but had to compile it to byte code myself.

The next step is to take the output from the stemmer and just find the unique occurrences of each. That one I don’t have a quick and easy off-the-shelf answer for.
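
To show the shape of the two steps (stem, then keep only the unique results), here is a rough Python sketch. It is NOT the Porter algorithm, which has many more rules; the four-suffix list and the function names are assumptions made purely for illustration.

```python
import re

# Assumption: a tiny suffix list for illustration, far cruder than Porter.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def unique_stems(text):
    """Lower-case the text, pull out runs of letters, stem each one,
    and collect the results in a set so duplicates collapse."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(w) for w in words}


sample = "Walk walks walked walking; the dogs barked."
print(sorted(unique_stems(sample)))  # prints ['bark', 'dog', 'the', 'walk']
```

Putting the stems in a set is what does the “unique occurrences” step; len(unique_stems(text)) then gives the size of the root vocabulary.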

Access is using a hammer to hit a fly. This is what Unix text processing tools are designed to do. If you have a local Unix/Linux nerd, ask him to show you how to do basic text filtering (cat, grep, cut, sort, uniq, head, tail, sed). It’s quite a useful skill to have in general, and if you install Cygwin, you can do it from right inside Windows.
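
If no Unix tools are handy, the usual word-frequency idiom (roughly what sort | uniq -c | sort -rn | head produces) can be approximated in Python. This is a sketch under assumptions: the stop-word list is deliberately tiny and the function name is invented for the example.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stop-word list, just for illustration.
STOP_WORDS = {"the", "and", "a", "of", "to"}


def top_words(text, n=3):
    """Return the n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


sample = "The king and the queen and the king of the castle"
print(top_words(sample))
```

Dropping the stop-word filter and taking len(counts) instead of most_common gives the plain unique-word count.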

The only way you could even guess this would be to count all the unique words in his published works, but even this would only give you a lower bound. There must have been many words which he knew but never used in his writings. So the real answer is that the size of Shakespeare’s vocabulary hasn’t been determined at all.

I was doing a project where something similar was required and modified this macro to do what I needed. I doubt that Works supports macros, but you could probably find a machine with Word to run it.

I’m certainly not disputing the other posts here, there are much easier ways to do what you want, I’m offering this as a suggestion because you probably know more people with Word than you know Linux/UNIX nerds.

Regarding vocabulary size, just last week our first assignment in Natural Language Processing was to determine the size of our vocabulary. This might sound trivial, but it was designed to make a point. When someone asks “how many words do you know?”, what exactly do they mean by “word” and “know”? You can beat this to death all day, but psycholinguists recognize a basic distinction between words that you might recognize and words that you might use. So for my assignment, I wrote a Python script that downloaded a random word from the OED and presented it to me. It tallied statistics and estimated my vocabulary from the fraction of sampled words I knew and the total size of the OED, which is 301,100 words. I knew 72% of the words, which comes out to 216,792 words. But this result is absolutely inconclusive. The OED has a ton of antiquated words that are no longer in use. It also documents things that you might not consider words, or words you have never seen before but can easily generalize to.

The resultant message here is that until you rigorously operationalize what you mean by the size of someone’s vocabulary and then test that person on your standardized corpus, this question has no factual answer. Or rather, this is your factual answer :)
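
The arithmetic behind that estimate (fraction of sampled words known, scaled up to the dictionary size) can be sketched as below. The sample size of 100 is an assumption, since the post gives only the 72% figure, and the margin of error uses the standard normal approximation to the binomial, which is my addition rather than part of the assignment.

```python
import math


def estimate_vocabulary(known, sampled, dictionary_size):
    """Scale the fraction of sampled words known up to the whole dictionary.

    Returns (estimate, margin), where margin is an approximate 95% margin
    of error from the normal approximation to the binomial.
    """
    p = known / sampled
    estimate = p * dictionary_size
    std_err = math.sqrt(p * (1 - p) / sampled)
    margin = 1.96 * std_err * dictionary_size
    return estimate, margin


# Assumption: 100 words sampled, 72 of them known (72%, as in the post).
estimate, margin = estimate_vocabulary(72, 100, 301_100)
print(f"about {estimate:,.0f} words, give or take {margin:,.0f}")
```

With a sample that small the sampling error alone runs to tens of thousands of words, before even getting into what counts as a “word” or as “knowing” one.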

Whoops. For those interested, here’s that random word link:

And here’s the output of running GNU Style on Shakespeare’s complete works (Did I mention I am also in a Shakespeare class right now? :):

And my quick and dirty analysis of how many unique words he used is 83,391, but realize that the actual number is likely to be significantly smaller.

A slight refinement to that might be to cut out all the words that end in “ing” or “ed” or other common roots.

I’m not going to go that far, but here is an improvement:

cat shaks12.txt | tr " " "\n" | sed -r 's/.*(\b\w+\b).*/\1/' | sort | uniq | wc -l

Returning 29,156 words. Not bad, considering that this sophisticated program returns 27,505 words :)
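
For a feel of why stripping punctuation shrinks the count so much, here is a rough Python analogue of the cleanup. It is not an exact port of the sed expression, and the function names and sample line are made up for illustration.

```python
import re


def raw_unique(text):
    """Unique whitespace-separated tokens, punctuation and case left alone."""
    return set(text.split())


def cleaned_unique(text):
    """Unique runs of word characters, so attached punctuation no longer
    creates spurious extra 'words'. Case is still left alone."""
    return set(re.findall(r"\w+", text))


sample = "To be, or not to be: that is the question."
print(len(raw_unique(sample)), len(cleaned_unique(sample)))  # prints 10 9
```

In the raw version "be," and "be:" are two different words; after cleaning they collapse into one. Folding case as well (text.lower()) would merge "To" and "to" and shrink the count further.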

Here’s a Google Search that brings up other folks’ estimates. Some of the exact ones are 29415, 20933, 26146, 27780, 27505, 26344, 27730, 29066; those eight average about 26,865 words.