How are voicemail messages transcribed into text?


cochrane
08-17-2011, 05:28 PM
How are messages transcribed into text? Is there a software program that does this? Because I have to say it's pretty sophisticated. I have Vonage, and as a part of my package, it will not only email me when someone leaves a message on my home voicemail, I can listen to the message and read a transcription on my cell phone. Here's part of a transcript from a call I just received.

I have a pretty bad situation here at work. I am gonna be pulling late evenings between today and Friday. I just wanna see if we can reschedule or this coming Saturday morning. As soon as I'm done doing the closing ground I can head over if you're okay with that.

It's not 100% perfect, but nearly so. I was struck by how the transcript reads as perfectly idiomatic English, with "gonna" in place of "going to" and "wanna" for "want to," something that you wouldn't necessarily expect a software program to catch in real time. And I received the text literally within a minute from the time of the call.

chiroptera
08-17-2011, 05:49 PM
I'd like to know this too.

I have the same feature with my MetroPCS cell service, and it handles idiomatic speech well (but will **** out swear words, we've experimented with that LOL). Mostly I get amusingly garbled VM texts, but I can almost always get the gist of the message.

Dr. Strangelove
08-17-2011, 05:52 PM
I think the trick is that they *depend* on the idioms to give an accurate transcription. Some phrases make sense while others don't, and if you build the system using a database of phrases that people actually use, you can make better guesses as to what was said.

My phone's GPS software does voice searching. A while back I searched for "Thai restaurant", and it came back with the correct results. How did it know I didn't mean "tie restaurant"? Because no one would say that, and I'm sure that when data mining popular searches, "Thai restaurant" came up infinitely more often than its homonym.
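Here's a rough sketch of that idea, not how any real voice search is built. The query counts and the homophone list are invented for illustration, and `best_guess` is a hypothetical helper: it tries each sound-alike spelling of what was heard and keeps whichever one shows up most in the (made-up) search log.

```python
# Hypothetical query counts; the numbers are invented for illustration.
QUERY_COUNTS = {
    "thai restaurant": 90000,
    "tie restaurant": 12,
}

# Words the recognizer cannot tell apart by sound alone (assumed list).
HOMOPHONES = {"thai": ["thai", "tie"], "tie": ["thai", "tie"]}


def candidates(words):
    """Yield every spelling of the phrase obtainable by swapping homophones."""
    if not words:
        yield []
        return
    head, rest = words[0], words[1:]
    for tail in candidates(rest):
        for option in HOMOPHONES.get(head, [head]):
            yield [option] + tail


def best_guess(heard):
    """Return the candidate spelling with the highest query count."""
    options = [" ".join(c) for c in candidates(heard.lower().split())]
    return max(options, key=lambda q: QUERY_COUNTS.get(q, 0))


print(best_guess("tie restaurant"))
```

Whichever spelling the sound is first mapped to, the popular phrase wins because its count is so much larger.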

The speed is just a function of computing power. I'm sure the transcription was done in a fraction of a second; the extra time was just network overhead.

goldmund
08-17-2011, 05:59 PM
I don't know about Vonage, but I know that at least with Google Voice, it's definitely software doing the transcription.

The now-defunct 1-800-GOOG-411 (http://en.wikipedia.org/wiki/GOOG-411) service was a free, voice-recognition-based directory assistance service. The entire reason Google offered this service was to build a phoneme database for their voice recognition software. AFAIK, this is now used for search indexing, and for the voice command and transcription features in Android (when you use these features, the phone doesn't do the speech recognition; the audio is uploaded to Google's servers for processing).

It usually works pretty darn well, too. However, the GOOG-411 service was offered in the USA and Canada, so the lion's share of its phoneme database will be filled with accents common in those two countries. You should see the voicemail transcriptions when my British boss leaves me a message. When he says "Hi, goldmund" it usually transcribes as "Hi, love", which is quite disturbing!

Just a WAG, but I'd imagine Vonage is using similar software as well. Apparently theirs will transcribe Spanish (https://support.vonage.com/app/answers/detail/a_id/781/~/vonage-visual-voicemail) as well.

ETA:

Originally posted by Dr. Strangelove:
A while back I searched for "Thai restaurant", and it came back with the correct results. How did it know I didn't mean "tie restaurant"? Because no one would say that, and I'm sure that when data mining popular searches, "Thai restaurant" came up infinitely more often than its homonym.

I'm curious as to what would happen if you search for "black tie restaurant".

Dr. Strangelove
08-17-2011, 06:01 PM
I just did a little test on my phone, speaking "I hear you. I'm still here at work." And it got it exactly right, using hear vs. here correctly.

If the system knew English grammar, this would be a trivial task. Computers aren't that good yet, but what they can do is mine a large body of text--documents, emails, etc. And from this they can deduce that "I hear you" comes up far more frequently than "I here you", and "here at work" is more frequent than "hear at work." So those are the guesses it makes.
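To make that concrete, here's a toy version of the guessing step, under heavy assumptions: the six-sentence corpus stands in for the "large body of text", the homophone list is hand-written, and the score is just a crude product of adjacent word-pair counts (plus one, so unseen pairs don't zero everything out). A real recognizer would be far more elaborate; this only shows why "hear" and "here" land in the right places.

```python
from collections import Counter
from itertools import product

# Tiny stand-in corpus; a real system would mine vastly more text.
CORPUS = [
    "i hear you loud and clear",
    "i hear you",
    "can you hear me",
    "i am still here at work",
    "she is here at work today",
    "come over here",
]

# Sound-alike spellings the recognizer must choose between (assumed list).
HOMOPHONES = [{"hear", "here"}]


def count_bigrams(sentences):
    """Count adjacent word pairs, with start and end markers."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


BIGRAMS = count_bigrams(CORPUS)


def score(words):
    """Multiply (count + 1) over adjacent pairs; higher means more familiar."""
    padded = ["<s>"] + list(words) + ["</s>"]
    total = 1
    for pair in zip(padded, padded[1:]):
        total *= BIGRAMS[pair] + 1
    return total


def alternatives(word):
    """Return all sound-alike spellings of a word, or just the word itself."""
    for group in HOMOPHONES:
        if word in group:
            return sorted(group)
    return [word]


def transcribe(heard):
    """Pick the spelling of the whole phrase with the best bigram score."""
    choices = [alternatives(w) for w in heard.lower().split()]
    return " ".join(max(product(*choices), key=score))


print(transcribe("i here you"))
print(transcribe("still hear at work"))
```

Note that nothing in here knows any grammar; it only knows which word pairs it has seen before.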