How are voicemail messages transcribed into text?

How are messages transcribed into text? Is there a software program that does this? Because I have to say it’s pretty sophisticated. I have Vonage, and as a part of my package, it will not only email me when someone leaves a message on my home voicemail, but also let me listen to the message and read a transcription on my cell phone. Here’s part of a transcript from a call I just received.

It’s not 100% perfect, but nearly so. I was struck by how the transcript reads as perfectly idiomatic English, with “gonna” in place of “going to” and “wanna” for “want to,” something that you wouldn’t necessarily expect a software program to catch in real time. And I received the text literally within a minute from the time of the call.

I’d like to know this too.

I have the same feature with my MetroPCS cell service, and it handles idiomatic speech well (but it will **** out swear words; we’ve experimented with that, LOL). Mostly I get amusingly garbled VM texts, but I can almost always get the gist of the message.

I think the trick is that they depend on the idioms to give an accurate transcription. Some phrases make sense while others don’t, and if you build the system using a database of phrases that people actually use, you can make better guesses as to what was said.

My phone’s GPS software does voice searching. A while back I searched for “Thai restaurant”, and it came back with the correct results. How did it know I didn’t mean “tie restaurant”? Because no one would say that, and I’m sure that when data mining popular searches, “Thai restaurant” came up far more often than its homophone.
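To make that concrete, here’s a rough sketch of the idea in Python. It is only an illustration, not how any particular phone or carrier does it: the query log, its counts, and the `pick_query` helper are all made up. The recognizer is assumed to hand back several spellings that sound the same, and we just keep the one people actually search for.

```python
from collections import Counter

# Hypothetical query log; the counts are invented for illustration.
QUERY_COUNTS = Counter({
    "thai restaurant": 950,
    "tie restaurant": 2,
    "tie shop": 40,
})

def pick_query(candidates):
    """Return the candidate spelling seen most often in the query log."""
    # Counter returns 0 for phrases it has never seen, so unknown
    # candidates simply lose to anything that has been searched before.
    return max(candidates, key=lambda q: QUERY_COUNTS[q.lower()])

print(pick_query(["tie restaurant", "Thai restaurant"]))
```

With these made-up counts, “Thai restaurant” wins easily, while “tie shop” would still beat a hypothetical “Thai shop” because it has been searched and the other hasn’t.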

The speed is just a function of computing power. I’m sure the transcription was done in a fraction of a second; the extra time was just network overhead.

I don’t know about Vonage, but I know that at least with Google Voice, it’s definitely software doing the transcription.

The now-defunct 1-800-GOOG-411 service was a free, voice-recognition-based directory assistance service. The entire reason Google offered this service was to build a phoneme database for their voice recognition software. AFAIK, this is now used for search indexing, and for the voice command and transcription features in Android (when you use these features, the phone doesn’t do the speech recognition; the audio is uploaded to Google’s servers for processing).

It usually works pretty darn well, too. However, the GOOG-411 service was offered in the USA and Canada, so the lion’s share of its phoneme database will be filled with accents common in those two countries. You should see the voicemail transcriptions when my British boss leaves me a message. When he says “Hi, goldmund” it usually transcribes as “Hi, love”, which is quite disturbing!

Just a WAG, but I’d imagine Vonage is using similar software. Apparently theirs will transcribe Spanish as well.


I’m curious as to what would happen if you search for “black tie restaurant”.

I just did a little test on my phone, speaking “I hear you. I’m still here at work.” And it got it exactly right, using hear vs. here correctly.

If the system knew English grammar, this would be a trivial task. Computers aren’t that good yet, but what they can do is mine a large body of text: documents, emails, etc. And from this they can deduce that “I hear you” comes up far more frequently than “I here you”, and “here at work” is more frequent than “hear at work.” So those are the guesses it makes.
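Here’s a toy version of that in Python, just to show the mechanics. Everything in it is an assumption for the sake of the example: the six-line corpus, the `HOMOPHONES` table, and the function names are invented, and a real system would work from vastly more text and combine this with the acoustic side. It counts adjacent word pairs (bigrams) in the corpus, tries each homophone in each slot, and keeps the sentence whose word pairs show up most often.

```python
from collections import Counter
from itertools import product

# Tiny made-up corpus standing in for "a large body of text".
CORPUS = [
    "i hear you loud and clear",
    "i hear you",
    "can you hear me",
    "i am still here at work",
    "i am here at work today",
    "come over here",
]

# Hypothetical table of words that sound alike.
HOMOPHONES = {"hear": ["hear", "here"], "here": ["hear", "here"]}

def bigrams(words):
    """Return the adjacent word pairs of a word list."""
    return list(zip(words, words[1:]))

BIGRAM_COUNTS = Counter(bg for line in CORPUS for bg in bigrams(line.split()))

def score(words):
    """Multiply the bigram counts; higher means more familiar wording."""
    total = 1
    for bg in bigrams(words):
        # Add one to each count so a single unseen pair doesn't
        # zero out the whole product.
        total *= BIGRAM_COUNTS[bg] + 1
    return total

def best_transcription(heard):
    """Try every homophone combination and keep the best-scoring one."""
    words = heard.lower().split()
    options = [HOMOPHONES.get(w, [w]) for w in words]
    return " ".join(max(product(*options), key=score))

print(best_transcription("i here you"))
print(best_transcription("still hear at work"))
```

No grammar rules anywhere, just counting: with this corpus, “i hear” and “hear you” have been seen and “i here” hasn’t, so the first spelling wins, and the same goes for “here at” versus “hear at”.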