text to speech and a weird invention

When I was a kid, I saw a film called Gizmo which featured, among other things, a device that looked something like a piano and produced the sound of vocal cords vibrating, plus something else I can’t recall. The operator could then produce speech with this device, complete with inflection. It sounded a lot better than various text-to-speech implementations I hear nowadays, and I was wondering if anyone knew what this machine was called, how it worked, or whether the technology is being used at all now.

Thanks for your help,
Rob

That sounds similar to Homer Dudley’s Voder, which was featured at the 1939 World’s Fair.

It had ten spectrum-modulating keys (one for each finger), a pedal for pitch control, a wrist bar that switched between a “buzzing” and a “hissing” unmodulated source, and a couple of stops. In combination, it could produce twenty distinct sounds.

Here is an entertaining, long, and detailed demonstration from the thirties. WARNING: this is a .wav file, not merely a website link.

This was not an intuitive machine to use, by any stretch of the imagination.

It should be noted, too, that the Voder was there at the beginning of Bell Labs’ research into mechanical speech synthesis – and of course they went on to pioneer text-to-speech applications.

They even let you try it on their website.

Interesting, but you might have warned people that this is a .wav file, not a web page.

That was neglectful of me. (Perhaps a mod might lend a hand, for the benefit of folks who are at work or coming across it in a quiet house in the wee hours?)

That was the machine. Does modern speech synthesis use this technique, or is it considered outmoded?

Thanks,
Rob

No one else going to take this one? Really?

Alright then.

No, modern speech synthesis does not substantially resemble the Voder, which is indeed as outmoded as all get out.

The Voder was a sort of analog synthesizer. Analog synthesis is no longer considered “cutting edge,” although it continues to be emulated digitally by electronic music enthusiasts.

The Voder starts with a basic oscillation, and the various switches modulate it to produce a variety of sounds with varying pitch, timbre, and loudness. This will always sound very unnatural and robotic. Even so, the approach is still used occasionally – it’s called “formant synthesis.”
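If you want a feel for what formant synthesis actually does, here’s a minimal sketch in Python (my own illustration – the numpy/scipy usage and the formant frequencies for an “ah”-ish vowel are textbook approximations, not anything taken from the Voder): a buzzing pulse source is pushed through a couple of resonant filters, much as the Voder’s keys shaped its buzz and hiss.

```python
# Minimal formant-synthesis sketch (illustrative values, not the Voder's):
# a glottal-style impulse train is filtered through two resonators tuned
# to rough first and second formants of an "ah"-like vowel.
import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

fs = 16000                     # sample rate in Hz
f0 = 110                       # fundamental (pitch) of the buzz source
dur = 1.0                      # seconds

# Source: an impulse train approximating glottal pulses.
n = int(fs * dur)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

def resonator(signal, freq, bandwidth, fs):
    """Second-order resonator (one formant) as a recursive filter."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1.0 - r]              # rough gain normalisation
    return lfilter(b, a, signal)

# Filter: cascade two formant resonators to shape the spectrum.
out = resonator(source, 700, 130, fs)   # first formant (~700 Hz, assumed)
out = resonator(out, 1200, 70, fs)      # second formant (~1200 Hz, assumed)
out /= np.abs(out).max()

wavfile.write("vowel_ah.wav", fs, (out * 32767).astype(np.int16))
```

Swap the formant frequencies (or the pitch) and you get different vowels and inflection – which is essentially what the Voder operator was doing by hand.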

Most modern speech synthesis is “concatenative.” This means that there’s a table of speech segments to choose from, which are then strung together to form coherent speech. This can be made to sound very natural – especially since the software can be made to select from many different versions of the same phoneme, depending on context. After these parts are lined up, additional processing can be done to reflect tone or inflection.
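As a toy illustration of the stringing-together step (everything here is made up – the “units” are synthetic bleeps standing in for recorded phoneme snippets):

```python
# Toy concatenative sketch: a table maps phoneme labels to waveform
# snippets (placeholders here), which are strung together with a short
# crossfade at each joint. Real systems choose among many context-
# dependent variants of each unit; this shows only the concatenation.
import numpy as np

fs = 16000

def fake_unit(freq, dur=0.15):
    """Placeholder 'recording' of a phoneme: a decaying sine burst."""
    t = np.arange(int(fs * dur)) / fs
    return np.sin(2 * np.pi * freq * t) * np.exp(-3 * t)

# Hypothetical unit table; a real one holds recorded speech segments.
units = {"HH": fake_unit(200), "EH": fake_unit(500),
         "L": fake_unit(300), "OW": fake_unit(400)}

def concatenate(labels, xfade=0.01):
    """String units together, overlapping each joint with a crossfade."""
    n_x = int(fs * xfade)
    out = units[labels[0]].copy()
    for lab in labels[1:]:
        nxt = units[lab].copy()
        ramp = np.linspace(0, 1, n_x)
        out[-n_x:] = out[-n_x:] * (1 - ramp) + nxt[:n_x] * ramp
        out = np.concatenate([out, nxt[n_x:]])
    return out

audio = concatenate(["HH", "EH", "L", "OW"])   # "hello", roughly
```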

There are many different approaches to concatenative speech synthesis, depending on your requirements.

For quality, “unit selection synthesis” is the best. This means discrete phonemes are the basic unit. The bigger your database, the better your speech is going to sound (assuming that it’s competently programmed, of course). English has something like forty discrete phonemes, but the table may include variants for each phoneme.
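Here’s a rough sketch of how the “competently programmed” part might pick among those variants – a tiny, made-up database where each candidate is scored on how well it matches a pitch/duration target and how smoothly it joins its neighbour (real systems use many more features and a proper search):

```python
# Toy unit-selection sketch: each phoneme has several candidate units,
# described here by made-up (pitch Hz, duration s) features. We pick the
# combination with the lowest total target cost plus join cost, by brute
# force; a real system would use a Viterbi-style dynamic program.
import itertools

db = {
    "EH": [(190, 0.09), (210, 0.12), (180, 0.10)],
    "L":  [(185, 0.07), (205, 0.08)],
    "OW": [(175, 0.14), (200, 0.11)],
}

def target_cost(unit, target):
    """How far a candidate is from the requested pitch and duration."""
    return abs(unit[0] - target[0]) / 50 + abs(unit[1] - target[1]) / 0.05

def join_cost(prev, unit):
    """Penalty for pitch jumps at the seam between consecutive units."""
    return abs(unit[0] - prev[0]) / 50

def select(phonemes, targets):
    costs = {}
    for combo in itertools.product(*(db[p] for p in phonemes)):
        c = sum(target_cost(u, t) for u, t in zip(combo, targets))
        c += sum(join_cost(a, b) for a, b in zip(combo, combo[1:]))
        costs[combo] = c
    return min(costs, key=costs.get)

print(select(["EH", "L", "OW"], [(200, 0.10), (200, 0.08), (190, 0.12)]))
```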

For efficiency, you may opt for “diphone selection synthesis.” This requires a smaller table of discrete sounds. It uses the transitions between phonemes as an “alphabet,” or summat. It also sounds pretty robotic.
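To show what that “alphabet” of transitions looks like, here’s a trivial sketch (the labels are ARPAbet-ish spellings I’m using for illustration, not any real inventory) that cuts a word into diphones:

```python
# Sketch of the diphone idea: the inventory stores transitions between
# neighbouring phonemes (cut at their stable middles) rather than the
# phonemes themselves. Labels are illustrative only.
def to_diphones(phonemes):
    padded = ["sil"] + phonemes + ["sil"]    # pad with silence at the ends
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["HH", "EH", "L", "OW"]))   # "hello", roughly
# ['sil-HH', 'HH-EH', 'EH-L', 'L-OW', 'OW-sil']
```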

There are more sophisticated approaches out there, some of which use extremely lofty mathematical concepts. There’s at least one type I read about that’s “trainable” – that is, instead of building the table manually, you provide it with recorded speech to analyse and compare with an electronic transcript, using, I think, Markov Chains (which I won’t pretend to understand; I just remember because of that misanthropic midget in a surrealist novel). After it has gathered enough input, the synthesized voice substantially resembles the reader’s. F*cking marvelous.

Here we go: Hidden Markov Models.
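For a very rough idea of the trainable part, here’s a sketch using the third-party hmmlearn package (my assumption purely for illustration): fit a small Gaussian HMM to acoustic feature frames from a speaker’s recordings, then generate new feature trajectories from it. A real HMM synthesizer trains one model per context-dependent phone, aligned against the transcript, and feeds the generated features to a vocoder – all of which this skips.

```python
# Rough HMM-synthesis sketch: fit a Gaussian HMM to (pretend) acoustic
# feature frames, then sample a new feature trajectory from the trained
# model. Real systems train per-phone models from speech + transcripts
# and turn the generated features back into audio with a vocoder.
import numpy as np
from hmmlearn import hmm   # third-party package, used here as an assumption

# Stand-in for MFCC-like feature frames extracted from the speaker.
rng = np.random.default_rng(0)
training_frames = rng.normal(size=(500, 13))

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(training_frames)

# "Speak": generate a fresh trajectory of feature frames from the model.
generated_frames, state_path = model.sample(100)
print(generated_frames.shape)   # (100, 13)
```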

For a fun demonstration of how natural speech synthesizers that use this technique for building their phoneme tables sound, check this out. (Sound plays on loading.)

The link uses Hidden Markov Model-trained speech synthesis to give us an idea of what it would sound like if Tom Baker (best Dr. Who ever) went all William Shatner on us and recorded covers of Pulp, Morrissey, and Pink Floyd.

(Tom Baker sat down to train a voice synthesizer as part of a licensing deal with a GPS company – the original intent was that you could have your car’s GPS computer give you directions from “A” to “B” in his voice.)

Like I said, f*cking marvelous. :smiley: