Merriam-Webster computer(?) speech

How does the Merriam-Webster site generate its speech? If you look up a word there, you have the option of hearing it pronounced. But unlike all other computer-generated speech that I’ve heard, this sounds quite natural. Surely, they didn’t record someone pronouncing 10,000+ words, one at a time. Or did they?

If not, and this truly is computer-generated speech, I guess my question doesn’t have a factual answer. Still, I’d like to ask how it manages to be so much better than even the “computer speech research” sites, like this one from AT&T.

They recorded someone pronouncing 10,000+ words, one at a time.
Peace,
mangeorge

As a professional voice-over actor, my not-so-humble opinion is that the voice is indeed computer generated. Not only because, to my ears, it sounds as artificial as the Bell Labs generators do, but also because, as someone in the biz, I can’t imagine either a company shelling out the kind of cash it would take to do something like that just for a website, or an actor both masochistic enough to take on so herculean a task and skilled enough to accurately pronounce every single word in the dictionary without multiple retakes and/or going hoarse.

Now I’ve actually typed in words of various lengths and listened and I’ve found a few things:

  1. There are both a ‘male’ and a ‘female’ reader.

  2. Their voices are almost identical aside from pitch.

  3. Their voices are always a perfect monotone.

Now, I’m not saying that this couldn’t be done by regular people but I just don’t see why they would go to such a bother to spend that sort of money, take that much time and perfect it so well when it could be done so easily with voice synthesis.

Then again, they actually found people willing to do voices for The Land Before Time VII so I’m willing to believe a lot.

I can’t comment either way about whether or not the words are actual voices or computer generated, but I just wanted to put something in perspective. At ~10,000 words (which seems a little low, BTW), that translates to roughly 20 pages on 8.5" x 11" sheets with 1" margins all around.

So, if a man and a woman split the task of reading the ~10,000 words, it is certainly something that could be done in a day. Now it doesn’t seem so monumental.
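To put rough numbers on that estimate, here is a minimal back-of-envelope sketch in Python. The 500 words per page is simply what 10,000 words over 20 pages implies; the 3 seconds per word (say it, pause, allow for the odd retake) is purely an assumption, so adjust it to taste.

```python
# Back-of-envelope estimate. Every constant here is an assumption,
# not a figure published by Merriam-Webster.
WORDS = 10_000          # word count quoted in the thread
WORDS_PER_PAGE = 500    # implied by "10,000 words is roughly 20 pages"
SECONDS_PER_WORD = 3    # assumed: say the word, pause, occasional retake
READERS = 2             # one man and one woman splitting the list


def pages(words, words_per_page=WORDS_PER_PAGE):
    """Number of pages needed to print the word list."""
    return words / words_per_page


def hours_per_reader(words, seconds_per_word=SECONDS_PER_WORD, readers=READERS):
    """Recording time for each reader, in hours, if the list is split evenly."""
    return words * seconds_per_word / readers / 3600


print(pages(WORDS))                       # 20.0
print(round(hours_per_reader(WORDS), 2))  # 4.17
```

Under those assumptions each reader is in the booth for a bit over four hours, which fits the "done in a day" claim. Double the time per word and it is still a two-day job at worst.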

Well then. So I went to the site linked by KarlGauss, and I must admit, the technology has advanced a lot more than I would’ve thought.
The feature (on m-w) is pretty cool.

Though I’ve heard the speech on this site, it’s been a while and I’ve got crap speakers right now, so I won’t offer my ear’s judgment at present. However, I will throw in my 2 cents on the topic.

When synthesising a single word, achieving naturalness isn’t all that tough. Speech synthesis has progressed to the point where you can even get a sentence to sound pretty damn good. However, listening to a paragraph can be a painful experience. According to some folks (like me), this is because synthesisers lack the ability to add emotional inflection to an utterance. This could be a partial explanation for mangeorge’s surprise regarding the ability of the synthesiser. That’s to say, it’s not the speech that’s wonderful, but rather your perception of it. You’ll be less demanding of a word in isolation than you would be of an entire passage. But before I digress too far, I’ll just say that, depending on your method of synthesis, a single word can be damn easy. Additionally, Arken’s observation that there’s no variance other than pitch is a classic sign of synthesis.

I seem to be leading to the conclusion that the speech on this site is in fact synthesised. But if you, gentle reader, will permit me, I’d like to take the time to address one other point in Arken’s post: the feasibility of having this done by voice-over actors. As mentioned by Vandal, with two people 10K words is a nominal task. If you crack the whip, it’d be done in two days…at the most. Furthermore, if you were the voice for this task, I’ve the feeling that after, ohhh, about one hundred words or so, you’d be damn bored and speaking in a monotone voice. And as a bit of icing on the cake, you think a trained linguist comes cheap? Or can produce results quickly?

But at the end of the day, the phoneticians and the lexicographers are in the same field and often work together. This is also the sort of project that would be a good way to test out new theories, and it would appeal to someone working on synthesis. Thus, I wouldn’t be too surprised to learn that the pronunciations on this site are synthesised.

One advantage m-w.com has over a normal text-to-speech system is that they already have all the pronunciations for the words. They don’t have to figure out algorithmically whether that i or e is a schwa sound, since someone already figured it out manually and wrote it down. I would hope their synthesizer works off that rather than the spelling.
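A minimal sketch of that idea in Python, just to make the difference concrete. Everything here is hypothetical: the tiny lexicon, the ARPAbet-style symbols (with `AH0` standing in for schwa), and the crude letter-to-sound table are illustrative assumptions, not Merriam-Webster's data or how their site actually works.

```python
# Hypothetical mini-lexicon: word -> hand-written phoneme list.
# Symbols are ARPAbet-style; "AH0" is used here for schwa.
LEXICON = {
    "pencil": ["P", "EH1", "N", "S", "AH0", "L"],
    "taken": ["T", "EY1", "K", "AH0", "N"],
}

# Deliberately crude letter-to-sound table (an assumption for illustration).
# Letters not listed are simply upper-cased.
LETTER_SOUNDS = {"a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH", "c": "K"}


def naive_from_spelling(word):
    """Guess phonemes one letter at a time; knows nothing about schwa or stress."""
    return [LETTER_SOUNDS.get(ch, ch.upper()) for ch in word.lower() if ch.isalpha()]


def phonemes_for(word):
    """Prefer the dictionary's own transcription; fall back to spelling rules."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    return naive_from_spelling(key), "spelling"


print(phonemes_for("pencil"))         # (['P', 'EH1', 'N', 'S', 'AH0', 'L'], 'lexicon')
print(naive_from_spelling("pencil"))  # ['P', 'EH', 'N', 'K', 'IH', 'L']
```

The spelling-only guess gets the "c" wrong and turns the unstressed "i" into a full vowel, which is exactly the kind of decision a dictionary site never has to make, since the transcription is already written down.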