Every once in a while when I use the board search, I get the message, “The following words are either very common, too long, or too short and were not included in your search : <search word>”. I was wondering what the whole list of too common unsearchable words was, and I think I have found it here. It is a list embedded in MySQL, which is used by vBulletin. I have tried a small sample of them, and all are unsearchable.
a's able about above according
accordingly across actually after afterwards
again against ain't all allow
allows almost alone along already
also although always am among
amongst an and another any
anybody anyhow anyone anything anyway
anyways anywhere apart appear appreciate
appropriate are aren't around as
aside ask asking associated at
available away awfully be became
because become becomes becoming been
before beforehand behind being believe
below beside besides best better
between beyond both brief but
by c'mon c's came can
can't cannot cant cause causes
certain certainly changes clearly co
com come comes concerning consequently
consider considering contain containing contains
corresponding could couldn't course currently
definitely described despite did didn't
different do does doesn't doing
don't done down downwards during
each edu eg eight either
else elsewhere enough entirely especially
et etc even ever every
everybody everyone everything everywhere ex
exactly example except far few
fifth first five followed following
follows for former formerly forth
four from further furthermore get
gets getting given gives go
goes going gone got gotten
greetings had hadn't happens hardly
has hasn't have haven't having
he he's hello help hence
her here here's hereafter hereby
herein hereupon hers herself hi
him himself his hither hopefully
how howbeit however i'd i'll
i'm i've ie if ignored
immediate in inasmuch inc indeed
indicate indicated indicates inner insofar
instead into inward is isn't
it it'd it'll it's its
itself just keep keeps kept
know known knows last lately
later latter latterly least less
lest let let's like liked
likely little look looking looks
ltd mainly many may maybe
me mean meanwhile merely might
more moreover most mostly much
must my myself name namely
nd near nearly necessary need
needs neither never nevertheless new
next nine no nobody non
none noone nor normally not
nothing novel now nowhere obviously
of off often oh ok
okay old on once one
ones only onto or other
others otherwise ought our ours
ourselves out outside over overall
own particular particularly per perhaps
placed please plus possible presumably
probably provides que quite qv
rather rd re really reasonably
regarding regardless regards relatively respectively
right said same saw say
saying says second secondly see
seeing seem seemed seeming seems
seen self selves sensible sent
serious seriously seven several shall
she should shouldn't since six
so some somebody somehow someone
something sometime sometimes somewhat somewhere
soon sorry specified specify specifying
still sub such sup sure
t's take taken tell tends
th than thank thanks thanx
that that's thats the their
theirs them themselves then thence
there there's thereafter thereby therefore
therein theres thereupon these they
they'd they'll they're they've think
third this thorough thoroughly those
though three through throughout thru
thus to together too took
toward towards tried tries truly
try trying twice two un
under unfortunately unless unlikely until
unto up upon us use
used useful uses using usually
value various very via viz
vs want wants was wasn't
way we we'd we'll we're
we've welcome well went were
weren't what what's whatever when
whence whenever where where's whereafter
whereas whereby wherein whereupon wherever
whether which while whither who
who's whoever whole whom whose
why will willing wish with
within without won't wonder would
wouldn't yes yet you you'd
you'll you're you've your yours
yourself yourselves zero
Just one of those things that bugged me, and now I know the answer. And so do you.
Yes, I am aware of how to search the board with Google.
If you look more closely, many of the words, and especially the ones folks are talking about, are simply connectives.
“Therefore” doesn’t help in identifying what a passage is about. Ultimately, what we all wish for is *semantic* search: “Tell me all about whales’ intestines”. AI doesn’t really do that yet, and MySQL is very far from AI. So instead we have *syntactic* search: “show me articles that contain the word ‘whale’ and the word ‘intestine’”.
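To make the "syntactic search" idea concrete, here is a minimal sketch in Python. It is not how MySQL or vBulletin actually implement it; the sample posts and the helper names `build_index` and `search_all` are made up for illustration. It only shows the basic shape: map each word to the posts that contain it, then intersect those sets.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip simple leading/trailing punctuation from each token.
            index[word.strip(".,!?\"'")].add(doc_id)
    return index

def search_all(index, *words):
    """Return ids of documents that contain every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# Hypothetical posts, invented for this example.
docs = {
    1: "The whale has a very long intestine.",
    2: "A whale of a sale on stereo gear!",
    3: "Notes on the human intestine.",
}
index = build_index(docs)
matches = search_all(index, "whale", "intestine")
print(matches)
```

Note that post 2 matches "whale" on its own even though it has nothing to do with whales, which is exactly the weakness of matching words rather than meaning.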
The hope is that’s close enough and the humans will be able to separate the wheat from the mostly-winnowed chaff. Anyone who’s tried Googling for some facts about music will tell you that theory falls apart when the topic in question is also a heavily sold retail product.
The other point is that MySQL is ultimately an underpowered hobbyist project. To be sure it’s been improved a great deal since v1.0, and given today’s mongo hardware it does pretty darn good. But it retains some design features from the old days.
Leaving out all the “noise words” means the size & computational challenge of creating the search index is reduced by 5 or 10x. That’s a speed-up worth having.
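A rough way to see where the savings come from, sketched in Python. The sample sentence is invented, `STOPWORDS` is only a small excerpt of the list quoted at the top of the thread, and the function names are hypothetical. The counts on one toy sentence are not a measurement of what a real index saves.

```python
import re

# Small excerpt of the stop list quoted earlier in the thread (not the full list).
STOPWORDS = {"the", "was", "about", "of", "and", "it", "is",
             "one", "that", "to", "in", "this"}

def tokenize(text):
    """Split text into lowercase words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def index_terms(text, stopwords=frozenset()):
    """Return the words that would get an index entry."""
    return [w for w in tokenize(text) if w not in stopwords]

# Hypothetical post text, made up for this example.
post = ("The thread was about the music of the whales, "
        "and it is the one that I linked to in this post.")

all_terms = index_terms(post)
kept_terms = index_terms(post, STOPWORDS)
print(len(all_terms), "index entries with no stop list")
print(len(kept_terms), "index entries with the stop list:", kept_terms)
```

Every word dropped is one less entry to write when the index is built and one less to scan at query time.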
Whereafter the expense of being able to look up that one thread you remember where some pretentious twit used “whereafter”.
I frequently search thread titles to find old threads I remember reading. The thread might be about war or sex, but you can’t search on either of those words because they are too short. But if I remember the title contained the word “nobody” or “whither”, I could still find it, except those words are on the unsearchable list. So it does diminish the usability of search in some cases.
Which also raises an interesting issue: searching the SDMB is not like searching the internet, because the users’ goals and starting positions are different. Or at least mine are.
I’m *never* searching for new threads on specific topics. I’m *always* searching to find a thread I remember reading or posting to so I can link to it in a new post.
In that context, and in that context alone, being able to search for rare words even if they’re noise words would be useful. For example, I recall that twit used “whereafter” in a post about music, and searching for “whereafter” will probably return just a couple of threads, whereas searching for “music” will probably return thousands.
Contrast that with searching the 'net at large via Google or whoever. I have zero idea of the details of anything I’m going to find. All I have are some plausible topic words & a hope.
Ultimately MySQL text search was written more for the general case than this specific case.
What bugs me is that the search has a blacklist and a minimum word length. A minimum word length is a quick-and-dirty way to eliminate most of the “noise words”, like “the”, “and”, and “of”. But when you have an explicit blacklist, words like that are already on it. So all the minimum word length actually ends up winnowing out are uncommon short words, like “OSX” and “Wii”, which would be genuinely useful for searches.
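Here is a small Python sketch of that overlap. It assumes a minimum word length of 4, which fits the three-letter words mentioned in this thread being rejected, but the board's actual setting is not confirmed here. `STOPWORDS` is a short excerpt of the list above, and `rejection_reason` is a made-up helper name.

```python
MIN_WORD_LEN = 4  # assumption: the board's real setting is not confirmed here

# Short excerpt of the stop list quoted earlier in the thread.
STOPWORDS = {"the", "and", "of", "nobody", "whither"}

def rejection_reason(word):
    """Return why a search word would be dropped, or None if it is searchable."""
    w = word.lower()
    if w in STOPWORDS:
        return "stop word"
    if len(w) < MIN_WORD_LEN:
        return "too short"
    return None

queries = ["the", "and", "of", "OSX", "Wii", "war", "nobody", "whale"]

# Words the length rule rejects that the blacklist would not have caught anyway.
only_length = [w for w in queries
               if w.lower() not in STOPWORDS and len(w) < MIN_WORD_LEN]
print(only_length)
```

In this sample the short noise words are all caught by the blacklist first, so the length rule only ends up rejecting "OSX", "Wii", and "war".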
Another odd thing about this list is that while all the cardinal numbers from one through nine are included, only some of the ordinal numbers are: first, second, third (not fourth), fifth (not sixth, seventh, eighth, or ninth). Weird.
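For anyone who wants to check this against the list rather than eyeball it, a quick Python sketch. `NUMBER_STOPWORDS` holds only the number words copied from the list at the top of the thread; the variable names are just for this example.

```python
# Number words that appear in the stop list quoted at the top of the thread.
NUMBER_STOPWORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "first", "second", "third", "fifth",
}

CARDINALS = ["one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine"]
ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth"]

# Which number words are NOT on the stop list, and so remain searchable?
missing_cardinals = [w for w in CARDINALS if w not in NUMBER_STOPWORDS]
missing_ordinals = [w for w in ORDINALS if w not in NUMBER_STOPWORDS]
print("cardinals not on the list:", missing_cardinals)
print("ordinals not on the list:", missing_ordinals)
```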