No one else going to take this one? Really?
Alright then.
No, modern speech synthesis does not substantially resemble the Voder, which is indeed as outmoded as all get out.
The Voder was a sort of analog synthesizer. Analog synthesis is no longer considered “cutting edge,” although it continues to be emulated digitally by electronic music enthusiasts.
The Voder started with a basic sound source (a buzzing oscillator for voiced sounds, a hiss for unvoiced ones), and the operator's keys and pedal shaped it through a bank of filters to produce a variety of sounds with varying pitch, timbre, and loudness. This tends to sound unnatural and robotic. Even so, the same basic idea is still used occasionally; it’s called “formant synthesis.”
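For the curious, here is a rough sketch of the source-and-filter idea behind formant synthesis. It is not the Voder's actual circuit (that used parallel band-pass filters worked by hand); it just runs a buzz through a chain of resonators. The function names, the bandwidth values, and the sample rate are my own assumptions, and the formant frequencies are only approximate textbook figures for an "ah" vowel.

```python
import math

def buzz(pitch_hz, duration_s, sample_rate):
    """Impulse train: a crude stand-in for the buzzing voiced source."""
    period = int(sample_rate / pitch_hz)
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter that emphasizes energy near freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1 so it is stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    gain = 1.0 - r  # rough level control, not an exact normalization
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vowel(formants, pitch_hz=110.0, duration_s=0.3, sample_rate=16000):
    """Pass the buzz through one resonator per formant, in series."""
    signal = buzz(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# (frequency, bandwidth) pairs in Hz; assumed, approximate values for "ah"
AH = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
samples = vowel(AH)
```

Change the pitch and you change the note; change the formant list and you change the vowel. That is more or less what the Voder operator was doing by hand.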
Most modern speech synthesis is “concatenative.” This means that there’s a table of recorded speech segments to choose from, which are then strung together to form coherent speech. This can be made to sound very natural, especially since software can be made to select from many different versions of the same phoneme, depending on context. After these parts are lined up, additional processing can be done to reflect tone or inflection.
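In the simplest possible terms, "strung together" looks something like the sketch below: look up each segment in a table and glue the segments end to end, blending a few samples at each join so the seams do not click. The table contents here are made-up numbers standing in for recorded audio, and `crossfade_concat` is a hypothetical helper name, not anything from a real synthesizer.

```python
def crossfade_concat(segments, overlap):
    """Join waveform segments, linearly blending `overlap` samples at each seam."""
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps from the old segment to the new one
            out[start + i] = (1.0 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out

# Hypothetical table: phoneme label -> recorded samples (placeholder numbers)
SEGMENT_TABLE = {
    "HH": [1.0, 1.0, 1.0, 1.0],
    "AY": [0.0, 0.0, 0.0, 0.0],
}

def synthesize(phonemes, overlap=2):
    return crossfade_concat([SEGMENT_TABLE[p] for p in phonemes], overlap)

waveform = synthesize(["HH", "AY"])
```

Real systems do much fancier smoothing at the joins, but the lookup-then-concatenate skeleton is the same.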
There are many different approaches to concatenative speech synthesis, depending on your requirements.
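For quality, “unit selection synthesis” is the best. This means the software picks its units (phonemes, pieces of phonemes, or longer stretches such as syllables and words) out of a large database of recorded speech, choosing whichever candidates best fit the context. The bigger your database, the better your speech is going to sound (assuming that it’s competently programmed, of course). English has something like forty discrete phonemes, but the database will include many variants of each.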
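The "choosing among variants" part is usually framed as a cost-minimizing search: each candidate is scored on how well it matches what you want (a target cost) and how smoothly it joins its neighbor (a join cost), and you pick the cheapest chain. Below is a minimal sketch of that search. Using pitch as the only feature, the tiny database, and every name in it are assumptions for illustration; real systems score many more features.

```python
def select_units(targets, database, target_cost, join_cost):
    """Return (total_cost, units) for the cheapest chain, one unit per target."""
    layers = [database[t["phoneme"]] for t in targets]
    # best holds (cost so far, path so far) for each candidate in the current layer
    best = [(target_cost(targets[0], u), [u]) for u in layers[0]]
    for target, layer in zip(targets[1:], layers[1:]):
        new_best = []
        for unit in layer:
            # cheapest way to arrive at this unit from any unit in the previous layer
            cost, path = min(
                ((c + join_cost(p[-1], unit), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(target, unit), path + [unit]))
        best = new_best
    return min(best, key=lambda cp: cp[0])

# Hypothetical database: two recorded variants per phoneme, tagged with pitch in Hz
DATABASE = {
    "HH": [{"id": "HH_a", "pitch": 100}, {"id": "HH_b", "pitch": 140}],
    "AY": [{"id": "AY_a", "pitch": 105}, {"id": "AY_b", "pitch": 150}],
}

def pitch_target_cost(target, unit):
    return abs(target["pitch"] - unit["pitch"])

def pitch_join_cost(left, right):
    return abs(left["pitch"] - right["pitch"])

targets = [{"phoneme": "HH", "pitch": 110}, {"phoneme": "AY", "pitch": 110}]
total_cost, chosen = select_units(targets, DATABASE, pitch_target_cost, pitch_join_cost)
chosen_ids = [u["id"] for u in chosen]  # ["HH_a", "AY_a"] with total_cost 20
```

This is why a bigger database helps: more candidates per slot means a better chance that some chain is both close to the target and smooth at the joins.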
For efficiency, you may opt for “diphone synthesis.” This requires a much smaller table of discrete sounds: one recording of each transition from one phoneme into the next. In other words, it uses the transitions between phonemes as an “alphabet,” or summat. It also sounds pretty robotic.
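To make the "transitions as an alphabet" idea concrete, here is a small sketch that turns a phoneme sequence into the list of diphones you would look up. The `"_"` silence marker and the `A-B` naming are my own conventions for illustration, not a standard.

```python
def to_diphones(phonemes, silence="_"):
    """List the phoneme-to-phoneme transitions, padded with silence at both ends."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

diphones = to_diphones(["HH", "AY"])  # ["_-HH", "HH-AY", "AY-_"]
```

With roughly forty phonemes there are at most about 40 x 40 = 1,600 possible pairs, and not all of them occur in English, so the table stays small compared with a big unit selection database.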
There are more sophisticated approaches out there, some of which use extremely lofty mathematical concepts. There’s at least one type I read about that’s “trainable.” That is, instead of building the table manually, you provide it with recorded speech to analyse and compare with an electronic transcript, using, I think, hidden Markov models (a cousin of Markov chains, which I won’t pretend to understand; I just remember the name because of that misanthropic midget in a surrealist novel). After it has gathered enough input, the synthesized voice substantially resembles the reader’s. F*cking marvelous.
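For anyone wondering what a Markov chain even is, the bare-bones version is just "the odds of what comes next depend only on what you have now," and you can estimate those odds by counting. The sketch below does that for a toy sequence of labels. To be clear, this is a plain Markov chain for illustration only; the trainable synthesizers use a considerably more elaborate statistical model fitted to audio, which this does not attempt.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate P(next | current) by counting adjacent pairs in the sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

# Toy training data: made-up labels, not real phonemes
probs = transition_probabilities(["a", "b", "a", "c", "a", "b"])
# "a" is followed by "b" two times out of three and by "c" one time out of three
```

The "training" in a trainable synthesizer is the same spirit on a grander scale: feed it enough examples and let it work out the statistics itself.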