How close are we to voice transformation?

Out of curiosity I downloaded a few freebie voice transformation programs for use with headset voice chat while gaming (Skype, Ventrilo, TeamSpeak, etc.). They had a few sliders for adjusting the waveform of the voice, changing pitch and timbre to make the voice sound male, female, very young, robotic, and so on.

However, they’re very … digital. Choppy. Of course, I don’t have a sound card in that machine.

Is this technology going to get any better? Or have we hit a wall with how much we can alter a voice without detecting the artifacts?

Laurie Anderson was playing around with this stuff in the early ’80s… She managed to transform her voice into pretty accurate gender-neutral and masculine voices, and THEN reworked them to sound a “Little Digital” (which could have been a great album title for her work).

check out:

Various telephone devices have been invented since then to transform feminine voices into masculine ones.

It’s not new technology…


Folks have done all right with subtle changes in pitch, but I’ve found that any digital device (usually a program; in the biz we call them VST plugins, for Virtual Studio Technology) advertising that it can change the gender of a voice realistically, or change pitch by more than a major third, is crap and sounds like a vocoder. Devices like the one in the video above shift an octave down (sometimes two) and make you sound like James Earl Jones, if James Earl Jones were a eunuch with a pitch shifter.
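For a sense of scale, here is a rough sketch of how those intervals translate into frequency ratios. It assumes equal temperament (each semitone multiplies frequency by the twelfth root of two); the function name is made up for illustration and is not from any plugin.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


# A major third is 4 semitones: roughly a 26% change in frequency.
major_third_up = semitones_to_ratio(4)    # about 1.26
# One octave down halves every frequency; two octaves down quarters it.
octave_down = semitones_to_ratio(-12)     # 0.5
two_octaves_down = semitones_to_ratio(-24)  # 0.25

print(round(major_third_up, 3), octave_down, two_octaves_down)
```

So the "major third" limit is a shift of about a quarter in frequency, while the octave-down boxes are asking for a factor of two or more.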

There’s more you can do to change the sound of a voice than pitch, of course, usually equalization. There’s certainly better stuff than the program you used, and this is big business: research into new tech is constantly underway. But as of now, you’re not going to fool a whole lot of people with it.

Last I saw on the Internet, there was some kind of open request: “the Air Force is seeking technology that does X, Y and Z for the purpose of transforming the voice of the user for blah blah.” Seems like there’s money behind it.

What you’re looking for is “formant pitch shifting.” This is pitch shifting that maintains duration, generally without the choppiness of a simplistic pitch shift. Formant pitch shifting analyzes the waveform and stretches or shrinks the vocals to maintain duration, but does so in a way that listens to the different parts of speech and deals with each one differently. Glottal stops, labials, dentals, and other unvoiced parts are analyzed very closely and stretched or shrunk so that they don’t stutter and still sound natural. Normal voiced, sibilant, fricative, and other such sounds are analyzed over a wider window, then stretched or shrunk and smoothed so that they also sound natural.
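To see why duration matters, here is a minimal sketch of the simplistic approach, not of a formant shifter. It assumes the naive method is plain resampling with linear interpolation; `naive_pitch_shift` and the test tone are made up for illustration. Playing samples back faster raises the pitch, but the clip gets shorter by the same factor, which is what the fancier methods have to stretch back out.

```python
import math


def naive_pitch_shift(samples, ratio):
    """Resample by linear interpolation.

    Pitch is multiplied by `ratio`, and duration is divided by it.
    """
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio                 # fractional read position in the input
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[j] + frac * nxt)
    return out


# One second of a 220 Hz tone at an assumed 8000 Hz sample rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]

up_octave = naive_pitch_shift(tone, 2.0)    # half as many samples: half a second
down_octave = naive_pitch_shift(tone, 0.5)  # twice as many samples: two seconds
print(len(tone), len(up_octave), len(down_octave))
```

A duration-preserving shifter has to follow this with a time stretch in the opposite direction, and that stretch is where the analysis of the different speech sounds comes in.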

The main problem with formant pitch shifting is that it is not well suited to realtime interaction, as it needs to analyze whole portions of audio before it can properly apply the pitch shift. At best, the output would have to be delayed by half a second or more. Formant pitch shifting also loses its effectiveness the higher or lower you shift; shifting, say, two octaves up or down may begin to introduce those dreaded stutters. Furthermore, the average human speaks with a particular envelope to their speech patterns. Particularly with downshifting, this tends to make your speech less intelligible and more “muddy” sounding the lower you go. The only way to overcome this to any degree is to widen your envelope; in other words, over-enunciate. The lower you go, the more exaggerated your enunciation needs to be to remain coherent.