With all the data & CPU power available, why is speech-to-text accuracy still so crappy in 2016?

I talked to a telephone VRU system the other day that was friggin’ amazing compared to any I’d ever used before.

I made my request and it paraphrased it back to me in its own voice. With good cadence and everything. Where it was unsure of my meaning, it presented two alternative interpretations as paraphrases and asked me to choose. Truly excellent design & implementation.

If I can remember what it was, I’ll post it. Compared to other modern VRUs I’ve used, this was like a 2016 Lexus versus a 1904 steam bulldozer.

I just remembered. It was Comcast cable customer service at 800-xfinity. Magic.

Then I got to talk to “Geraldine” in Punjab. She was cheerful, but less than magic. She did accomplish what I needed though. In fairness, her English is far, far better than my Punjabi and wasn’t an obstacle to the call.

Funny enough, it seems to me they could have automated the request & response I wanted pretty easily. I’m not sure how “Geraldine” earned her pay on this one.

Indeed - and you had to spend hours training the thing before it could understand you. Now you just speak and it works, mostly.

Mine is the same way. It’s using fuzzy logic to guess the word in addition to interpreting what is spoken.

Absolutely false and ridiculous.

Exactly. The problem with speech recognition, similar to language translation, is that it’s not a purely symbolic problem but one which has numerous contextual and experiential dependencies – language conventions and knowledge of the real world, for instance – although the two problem domains are very different. The former has to deal with highly ambiguous input and the latter with a heavy dependency on context and real-world knowledge, but both are eminently manageable. There’s no reason at all that computers can’t be imbued with all the necessary knowledge and indeed eventually do a better job of speech transcription and language translation than any humans could.

A great example of how text-to-speech and AI software can still be tripped up by a lack of human ‘common sense’ occurred during one of these matches. There was a Jeopardy category that contained a ‘fill in the blank’ in its title. IOW it was something like “You ___ Your Life”, and when Watson picked it, it asked:

"I’ll take ‘You underscore underscore underscore underscore underscore underscore Your Life’ for $400 Alex…".

Pretty funny hearing it say that in its creepy, soothing, HAL-from-2001 emotionless voice! When the audience laughed at it, it made me think that Watson might cut the power and flood the studio with poison gas. :smiley:

I do agree with you that Corporal Clegg is being a bit ridiculous in concluding that it will never be reliable.

However, let’s look at the *human* error rate in spoken communication. Anyone have kids? Spouses? Employees? What percent of your communications do they interpret correctly?

That lesson alone should teach us that verbal communication cannot be 100% reliable - too much of the content depends on interpretation and context. I would wager that computers will one day be more reliable than other humans, and yet we’ll still be faced with impossible situations like:
You: “Are we going left at the next corner?”
Your Spouse: “Right.”
:smack:

Many, many reasons.

First off, there are words like to, too & two, or their, there & they’re. I think they’re called homophones, and they’re very hard for a machine to tell apart from sound alone (there’s a little sketch of the problem after the list and link below).

ax, acts
ad, add
ade, aid, aide
aisle, I’ll, isle

and many, many more.

http://www.cooper.com/alan/homonym_list.html
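
Just to make that concrete, here’s a tiny Python sketch of why the sound alone can’t settle it: one pronunciation maps to several spellings, so something other than the acoustics has to break the tie. The phoneme spellings are rough ARPAbet-style guesses I wrote by hand, not output from any real recognizer.

```python
# Toy illustration of the homophone problem: a single pronunciation maps
# to several written words, so a recognizer that only listens to the sound
# has nothing to choose between. Phoneme spellings are rough, hand-written
# ARPAbet-style approximations, not output from a real system.

PRONUNCIATIONS = {
    ("AY", "L"):       ["aisle", "isle", "I'll"],
    ("EY", "D"):       ["aid", "aide", "ade"],
    ("AE", "D"):       ["ad", "add"],
    ("DH", "EH", "R"): ["their", "there", "they're"],
}

def words_for(phonemes):
    """Return every written form consistent with a recognized phoneme sequence."""
    return PRONUNCIATIONS.get(tuple(phonemes), ["<unknown>"])

print(words_for(["DH", "EH", "R"]))  # ['their', 'there', "they're"] (a three-way tie)
```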

Then there is the problem of different people having different accents. It takes a long time for any software to become adept at recognizing an individual’s accent. Then, when someone else tries to use the software, great care must be taken to let the s/w know which user is speaking.

There is also the problem of greed. Different manufacturers just outright lie about how fast their products will enable people to convert speech to text.

But every time the s/w makes a mistake, it takes a huge amount of time to correct that error and then find the right text. It can easily take as long to correct one mistake as it does to speak maybe ten words. So, unless you can use speech-to-text s/w very accurately, you will be fighting an uphill battle. That may be the single most important reason, and because the manufacturers are so greedy for sales, they always ignore that fact.

It makes it almost impossible to have a workable product.

The future where people can speak and the machine can produce error-free text is still a long, long way away. It’s mostly just a dream promulgated by manufacturers who want your dollars.

That’s somewhat true, but let’s understand the real reason: human communication is inherently ambiguous, and human language (English is a prime example) is loaded with inconsistencies. So in comprehending human communication, whether it’s transcribing the spoken word or translating languages or maybe both at the same time, “reliable” can never mean “perfect”, but only “as good as or better than the best human”. It’s like par on a golf course: the only reasonable benchmark is the best reasonable human achievement, not a hole-in-one on every hole. And as I said, in these areas computers can easily be better than most or perhaps any humans before too long.

We’ve already gone a long way toward achieving the necessary contextual and semantic understanding. One of the famous examples of early translation blunders was the expression “The spirit is willing but the flesh is weak” translated into Russian as “The vodka is still good but the meat has gone bad.” We’re far beyond such simplistic blunders in production-quality translation systems, though competent translations still need an element of human oversight.

I don’t remember that, but I might well have forgotten. Though it’s so absolutely trivial that I wonder if this wasn’t something that happened during one of the early trial runs – of which there were many. Watson went through large numbers of trials, refinements, and training exercises.

Voice recognition is already very good at using context to choose the correct word. I just tried dictating “I’ll take an aisle seat to fly to the Isle of Wight” and “Please add this figure to the ad” on my Android phone, and it got it all right, including the capitalization. It didn’t get “My aide came to my aid” though. (Ax/acts aren’t homophones, and ade is not a common word, so I didn’t try those.)
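
Roughly how that context trick works, sketched in Python with a made-up bigram table standing in for the huge statistical language models real recognizers use. Every score below is invented purely for illustration:

```python
# Toy sketch of picking between homophones with context. A hand-made
# bigram table stands in for a real language model; all scores are invented.

HOMOPHONES = {
    "aisle": ["aisle", "isle", "i'll"],
    "aid":   ["aid", "aide", "ade"],
    "ad":    ["ad", "add"],
}

# Hypothetical plausibility scores for (previous word, word) pairs.
BIGRAM_SCORE = {
    ("an", "aisle"): 3.0, ("an", "isle"): 0.5, ("an", "i'll"): 0.0,
    ("aisle", "seat"): 4.0, ("isle", "seat"): 0.1, ("i'll", "seat"): 0.1,
    ("my", "aide"): 3.0, ("my", "aid"): 2.0, ("my", "ade"): 0.0,
}

def score(words):
    """Sum bigram scores over the sentence; unseen pairs score 0."""
    return sum(BIGRAM_SCORE.get(pair, 0.0) for pair in zip(words, words[1:]))

def best_reading(words, i):
    """Try each homophone at position i and keep the best-scoring sentence."""
    options = HOMOPHONES.get(words[i], [words[i]])
    return max((score(words[:i] + [w] + words[i + 1:]), w) for w in options)

print(best_reading("i'll take an aisle seat".split(), 3))
# (7.0, 'aisle'): the context around "aisle seat" beats "isle" and "i'll"
```

A table this small obviously can’t separate every pair in every context; real systems throw vastly more context and data at the problem and still miss some, as the aide/aid sentence shows.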

I don’t think that’s a very important goal, except for special applications like generating transcripts from video. For data entry, typing is much faster than speaking.

At least on my phone, voice recognition is getting pretty good as long as there is no background noise. My Nexus 5X gets almost everything except for some proper names.

From the quote provided by Wesley Clark:

"If you chart it out, Huang says, that means that on average, speech recognition has gotten 20% better every single year for the last twenty years. Which means that the end is in sight.

“In the next four to five years, computers will be as good as humans” at understanding the words that come out of your mouth, Huang says."

Wow, that’s really terrible reasoning. In anything, it’s always the final few percent that are the hardest to tackle.
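
To put numbers on that: if we read “20% better every year” as a 20% relative cut in word error rate each year (that reading is my assumption, and the starting rate below is just a placeholder), the curve looks like this:

```python
# Back-of-the-envelope: treat "20% better every single year" as a 20%
# *relative* cut in word error rate (WER) per year. Both that reading and
# the 40% starting WER are assumptions made purely for illustration.

wer = 0.40
for year in range(1, 21):
    wer *= 0.8                      # 20% relative improvement each year
    if year % 5 == 0:
        print(f"after year {year}: WER = {wer:.2%}")

# after year 5: WER = 13.11%
# after year 10: WER = 4.29%
# after year 15: WER = 1.41%
# after year 20: WER = 0.46%
#
# The relative gain is constant, but the absolute gain shrinks every year,
# and what remains is the hard, context-dependent residue. That is why
# "20% per year, therefore done in five years" doesn't follow.
```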

Robert Fortner wrote a really good post on this. Though his blog doesn’t exist any more, thankfully the post still does:

If the 92% accurate figure is, well, accurate, that’s quite impressive, but it still means roughly one wrong word in every twelve. Getting those final few percent right will probably require something close to Strong AI. If Google can accomplish that in the next four or five years, that will be awesome!

I guess his blog is no longer down. He has new posts up. Very intelligent writer/thinker.

It’s the same issue that spellcheckers face, but with the addition of accents, background noise, dialectal variations…

In the last week I’ve suddenly acquired a new variation of my last name (as if I didn’t already have enough), from two different sources and both times in writing. Since the variation consists of changing a Basque word to a similar Spanish one, that’s clearly a spellchecker at work. Does that mean spellcheckers don’t work well? No, they work just fine, but their dictionaries don’t happen to cover every single word, declension, conjugation, dialectal variation, first name, last name, street name… known to man.

To answer your “why was it so bad in 2003?” question (others have already addressed the current state of the art), cars with limited command sets were looking for certain trigger phonemes in sequence, not trying to determine exactly what you were saying. For example, my 2003 BMW recognized “parking at location” as a command to find the nearest parking lots and show them on the navigation screen. However, it was perfectly happy to hear the words “penguin rotation” to perform the same command.
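
Here’s a toy sketch of that style of loose matching: collapse everything to coarse Soundex-like consonant classes and accept the best template over a low threshold. It’s certainly not BMW’s actual algorithm, and the two-command vocabulary is hypothetical; it’s just meant to show how “penguin rotation” can land on the same command as “parking at location”.

```python
# Toy sketch of loose, phoneme-ish command matching (not any real
# automotive algorithm). Each phrase is collapsed to coarse Soundex-style
# consonant classes, and a command fires when the longest common
# subsequence with a template key clears a low threshold.

SOUNDEX = {**{c: "1" for c in "bfpv"},
           **{c: "2" for c in "cgjkqsxz"},
           **{c: "3" for c in "dt"},
           **{c: "4" for c in "l"},
           **{c: "5" for c in "mn"},
           **{c: "6" for c in "r"}}

def key(phrase):
    """Collapse a phrase to its sequence of consonant classes."""
    return [SOUNDEX[c] for c in phrase.lower() if c in SOUNDEX]

def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# Hypothetical two-command vocabulary, purely for illustration.
COMMANDS = ["parking at location", "destination entry"]

def match(spoken, threshold=0.55):
    """Return the best-matching command, or None if nothing is close enough."""
    best_score, best_cmd = max(
        (lcs_len(key(spoken), key(cmd)) / len(key(cmd)), cmd) for cmd in COMMANDS)
    return best_cmd if best_score >= threshold else None

print(match("parking at location"))  # parking at location (exact, score 1.0)
print(match("penguin rotation"))     # parking at location (score 0.6, close enough!)
print(match("open the sunroof"))     # None (best score only 0.5, below the threshold)
```

The point isn’t this particular algorithm; it’s that any matcher this loose will happily collapse similar-sounding nonsense onto a valid command, which is exactly the penguin behaviour.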

I subsequently upgraded that car with a much newer voice response system (still from BMW, but not offered for my model) and it is now a lot pickier about the actual words (rotating the penguin no longer works). That is most likely due to the vastly enlarged command sets in the newer car models, plus a need to integrate with “Brand X” cell phone / MP3 player / infotainment devices in the car.