I had this brilliant idea. Wouldn’t it be interesting if I could get the relative values of different words? For instance, is “nifty” better than “mediocre”? And could we get a list of hundreds of words that fall on the positive end of that kind of comparison and another hundred on the negative end? What about words like “elusive” or “bordered”?
So my brilliant idea is to make a “word wars” type web page, a la kittenwars. And I’ve already loaded a MySQL database with data from the big linguistic database, so it’s just a matter of making a couple more tables to keep stats.
I decided to start with just adjectives, since a lot of adjectives have emotional weight. So I pulled just the adjectives into one of my tables. My other table is a word-pairing table, so that I can keep stats on how a specific pair of words rate against each other.
Problem is, I have 70,000+ adjectives. Quite a lot of them are going to be obscure, like pseudoalveolar or nonsubjugable. That makes the word-pairing table have over 3 billion records. :eek:
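For a rough sense of how the pairing table scales, here is a back-of-the-envelope sketch. It assumes each pair is stored once (A vs. B is the same battle as B vs. A); the function name `pair_count` is made up for illustration. Under that assumption, exactly 70,000 words gives about 2.45 billion pairs, a list of roughly 77,500 words crosses 3 billion, and a 5,000-word list needs only about 12.5 million.

```python
from math import comb

def pair_count(n_words: int) -> int:
    """Number of unordered word pairs, assuming each pair is stored once."""
    return comb(n_words, 2)

# Compare the full adjective list with a trimmed-down one.
for n in (70_000, 80_000, 5_000):
    print(f"{n:>6} words -> {pair_count(n):,} pairs")
```

If the table stored both orderings of each pair, the counts would roughly double.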
I’d like to cut down the list to more common words, but I don’t know how I’d arrive at such a list, especially since I’d want to end with several thousand words, certainly more than 100.
So, I see my options are:
Go with what I have; who knows, I may get massive numbers of hits.
Write another screen to let people vote on whether particular words are common or not.
Go to a page that some kind doper points me to that has the top 5-10 thousand most common adjectives.
Any solution that requires one person (me) to sift through 70,000 words just ain’t gonna happen.
I can understand “nifty” above “mediocre,” with “stellar” and “subpar” higher and lower still, simply as general quality descriptors–but what would be the basis for judging the position of “elusive” or “bordered” (let alone “pseudoalveolar”) relative to any of those? Utility? Euphony? Whatever the voters like?
I assume you mean that every time a word wins it gets a point, which sounds good. For that matter, if it loses it could lose a point, allowing negative values.
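A minimal sketch of that scoring idea, assuming scores live in a simple in-memory mapping rather than the database tables; `record_battle` is a hypothetical name, not anything from the actual site.

```python
from collections import defaultdict

# Running score per word; words never seen in a battle default to 0.
scores = defaultdict(int)

def record_battle(winner: str, loser: str) -> None:
    """Winner gains a point, loser drops a point (scores may go negative)."""
    scores[winner] += 1
    scores[loser] -= 1

record_battle("nifty", "mediocre")
record_battle("stellar", "nifty")
record_battle("mediocre", "subpar")

# Highest-scoring words first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

After these three battles “stellar” sits at +1, “subpar” at -1, and the other two cancel out to 0.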
I also suggested to her that there are probably “concise” dictionaries out there she could get a shorter list of adjectives from. Gutenberg only seems to have the big Webster’s, but I didn’t search much further.
The adjective lists I found in my admittedly brief search tended to be very short–a few hundred words, at most.
Perhaps a change in methodology would be more effective? Instead of presenting a single “battle” when the page is loaded, present a set of them–maybe four or five. Inform visitors that they may skip battles involving unfamiliar words, or look up the definition (link each word to a definition, if possible). Track “unfamiliarity” in addition to “value” by adding one point to a word’s unfamiliarity score each time it is in a skipped battle and adding two points every time the definition link is clicked. You can weight the selection of words for battles by their unfamiliarity scores, so unfamiliar words should tend to sink to the bottom of your list, and appear progressively less often in battles.
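Something like the following could implement the unfamiliarity tracking and weighted selection described above. It is only a sketch: the post does not say how to turn scores into weights, so the `1 / (1 + score)` formula is an assumption, and all the function names are made up for illustration.

```python
import random
from collections import defaultdict

# Unfamiliarity score per word; higher means fewer visitors seem to know it.
unfamiliarity = defaultdict(int)

def record_skip(word_a: str, word_b: str) -> None:
    """A skipped battle adds one point to each word in it."""
    unfamiliarity[word_a] += 1
    unfamiliarity[word_b] += 1

def record_lookup(word: str) -> None:
    """A click on the definition link adds two points to that word."""
    unfamiliarity[word] += 2

def selection_weight(word: str) -> float:
    """Assumed weighting: words with higher scores are picked less often."""
    return 1.0 / (1 + unfamiliarity.get(word, 0))

def pick_battles(words, count=5, rng=random):
    """Pick `count` battles, favoring words with low unfamiliarity scores."""
    if len(set(words)) < 2:
        raise ValueError("need at least two distinct words")
    weights = [selection_weight(w) for w in words]
    battles = []
    while len(battles) < count:
        a, b = rng.choices(words, weights=weights, k=2)
        if a != b:  # a word cannot battle itself
            battles.append((a, b))
    return battles

words = ["nifty", "mediocre", "stellar", "subpar", "pseudoalveolar"]
record_skip("pseudoalveolar", "nifty")
record_lookup("pseudoalveolar")
battles = pick_battles(words, count=5)
print(battles)
```

With this weighting a word never drops out completely; it just shows up less and less as its score climbs. A hard cutoff above some score would be another option.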
You could try running your list through a common spell checker. Eliminate those that are not in the spell checker’s list and see what your count is then.
Should still be in the thousands and this will weed out the obscure ones.
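As a rough sketch of that filtering step, assuming the spell checker's word list is available as plain text with one word per line (many Unix systems ship one at `/usr/share/dict/words`, though whether yours does is an assumption). The tiny lists below are stand-ins for the real data, and `keep_common` is a made-up name.

```python
def keep_common(adjectives, known_words):
    """Keep only the adjectives that also appear in the spell checker's list."""
    known = {w.strip().lower() for w in known_words}
    return [adj for adj in adjectives if adj.lower() in known]

# Stand-in data; in practice known_words could be the lines of a word-list file.
spell_list = ["nifty", "mediocre", "elusive", "bordered", "stellar"]
adjectives = ["nifty", "pseudoalveolar", "elusive", "nonsubjugable", "bordered"]

common = keep_common(adjectives, spell_list)
print(len(common), common)
```

Since an open file yields its lines one by one, passing a file object as `known_words` should work without changes.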