With all the data & CPU power available, why is speech-to-text accuracy still so crappy in 2016?

I have a 2003 Denali and it had voice recognition for simple commands, and half the time it had no idea what I was saying even when I enunciated correctly. I have a relatively deep voice but no regional accent or speech impediment. Now in 2016 I have a Samsung Galaxy S5, and at best it gets about 70% accuracy. As I speak I see the phone making all these weird choices that sound nothing like the word I am enunciating.

You would think that getting voice recognition right would be HUGE on so many levels. We have computers powerful enough to beat chess grandmasters and win at Jeopardy, we can store terabytes of personal data in the cloud, and we still can't devise a decent voice recognition system. Why?

Because it's a very difficult problem, and having a better processor doesn't solve it; it just gets you the wrong answer faster.

Because speech recognition is MUCH harder than you think it is.

Chess was a game researchers spent decades on, one that's very well suited to computer play, and it still took 30 years to beat a grandmaster. Watson was a supercomputer designed for IBM to show off, and it still made boneheaded mistakes ("Toronto???").

Speech recognition requires a ton of processing to determine the phonemes, a lot of guesswork to figure out what words those sounds make (have fun with homophones!), and then more work to figure out what someone meant through all the "uh, um, er" sounds and without a lot of joining words. Now make it work across all accents and dialects.
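To make the homophone step concrete, here's a toy sketch in Python. The pronunciation dictionary, the bigram scores, and the whole two-stage setup are invented for illustration; a real engine uses acoustic and language models trained on enormous corpora, not hard-coded tables.

```python
# Toy illustration of the homophone problem: one phoneme sequence,
# several plausible spellings. All entries below are made up.

# ARPAbet-style phoneme string -> candidate words
PRONUNCIATIONS = {
    "T UW": ["two", "to", "too"],
    "DH EH R": ["there", "their", "they're"],
    "R AY T": ["right", "write", "rite"],
}

# Crude stand-in for a language model: how plausible each word is
# after the previous one. Real systems learn this from huge text corpora.
BIGRAM_SCORE = {
    ("turn", "right"): 0.9,
    ("turn", "write"): 0.01,
    ("turn", "rite"): 0.005,
}

def best_word(prev_word, phonemes):
    candidates = PRONUNCIATIONS[phonemes]
    return max(candidates, key=lambda w: BIGRAM_SCORE.get((prev_word, w), 0.001))

print(best_word("turn", "R AY T"))   # -> "right", only because of the context word
```

The sounds alone give you three equally good answers; only the surrounding words break the tie.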

Then there’s the problem that speech recognition isn’t actually as useful as you think it is. You can almost certainly type whatever you need just as fast, so the times it’s useful are when you have to be hands free, which frankly isn’t as often as you think.

It’s probably your deep voice throwing it off, btw.

Actually, for speech recognition today the compute power is in the cloud. At least that is the case with Microsoft's Cortana. Cortana is a simple agent on your phone, tablet, PC, or Xbox console, and the serious computational work is done in the cloud.

Some systems are better than others, and I don't expect my smartphone to be one of the better ones. A good system will have ways to familiarize itself with the speaker's idiolect over time, and thereby improve with use.

Human speech is extremely complex–much more complex than most people realize. It’s not just phonemes: It’s phonology and intonation and other paralinguistic channels. Additionally, when we’re listening to a person, we bring in larger contextual cues to decode these purely auditory signals, but that isn’t something a handheld computer is really able to do, or even a desktop, really.

This is why a transcript of an interview, even when "accurately" transcribed, can sometimes be hard to understand. Someone here recently was complaining about something Jamie Oliver said on a talk radio show, and just reading the transcript, it did indeed seem like unclear babbling, but upon listening to him actually say it, it was much more comprehensible.

I call BS on your data. You are saying that every 4th word is wrong. Although my experience is on an iPhone, I can state with confidence that it gets at least 19 out of 20 words correct. I could stack the deck and make the accuracy worse by purposely picking confusing words, but just common speech is recorded with pretty decent accuracy.

Having been a geek my whole life I can attest that voice recognition has improved by leaps & bounds the past 5-10 years. I first started noticing it in phone menu systems, when they switched from "Press 1 for…" to just saying the word, and it worked! Every time!

To expand, it’s not that being connected to the internet gives VR software access to supercomputers or anything. It gives it access to the enormous database of variations of speech that is constantly being accumulated and cataloged.

It started with OCR, optical character recognition. Originally programmers tried and tried to write more complex algorithms that could better deconstruct, analyze and ‘intelligently’ decipher the structure and shape of each letter, regardless of the font, the way humans do. Eventually they discovered that it was much, much easier to just create a huge database of every possible variation of a letter and simply match it that way.
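Here's a minimal sketch of the "giant database plus matching" idea described above, using tiny made-up 5x5 bitmaps and a nearest-neighbour match. A real OCR database would hold thousands of stored variants per glyph, but the principle is the same: no clever shape analysis, just compare against everything you've stored.

```python
import numpy as np

# Tiny made-up "database" of letter templates (5x5 bitmaps).
TEMPLATES = {
    "I": np.array([[0, 0, 1, 0, 0]] * 5),
    "L": np.array([[1, 0, 0, 0, 0]] * 4 + [[1, 1, 1, 1, 1]]),
    "T": np.array([[1, 1, 1, 1, 1]] + [[0, 0, 1, 0, 0]] * 4),
}

def classify(glyph):
    # Nearest-neighbour match: pick the stored template with the fewest
    # mismatching pixels. No "intelligent" analysis of strokes or shape.
    return min(TEMPLATES, key=lambda letter: np.sum(TEMPLATES[letter] != glyph))

# A slightly noisy "L" (one pixel flipped) still matches correctly.
noisy_L = TEMPLATES["L"].copy()
noisy_L[0, 1] = 1
print(classify(noisy_L))   # -> "L"
```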

IOW it isn’t so much increased processing power as it is increased storage capacity. And broadband internet was a huge tipping point in terms of having access to the ultimate database.

AFAIK, most OCR nowadays is done with convolutional neural nets, not image databases. Deep nets tend to get the highest accuracy scores (usually upwards of 99%) on test databases.
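For the curious, here's roughly what such a net looks like, as an untrained sketch in PyTorch (my choice of framework; the thread doesn't name one). The layer sizes are arbitrary, and the 99%-type figures only come after training on a labelled dataset like MNIST.

```python
import torch
import torch.nn as nn

# Minimal convolutional net for 28x28 grayscale character images.
# Untrained here; it needs labelled training data before it classifies anything.
class TinyOCRNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learn local stroke patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

net = TinyOCRNet()
fake_batch = torch.randn(8, 1, 28, 28)   # 8 random "images"
print(net(fake_batch).shape)             # torch.Size([8, 10]), one score per class
```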

There is no such thing as “no regional accent”.

I’m actually impressed with the speech recognition on my various devices. If anything set back the progress though, it’s got to be the swindling of Dragon Systems.

Human speech is an incredibly "fuzzy" medium; the listener has to fill in huge blanks, and even with highly advanced, dedicated hardware (our brains and ears) we accept quite a lot of errors. Computers make about the same number of guesses and errors, just completely different ones than a person would make, which makes the result seem weird.

Right. The key is that better accuracy is not achieved through meaning so much as through matching. Humans recognize speech much more through meaning than through auditory processing. That's why we can understand dialects quite different from our own.

When a dialect of your own language is so different that you can’t understand it–when all meaning is lost–you realize that a certain threshold hasn’t been met, and just say “What???” VR systems, on the other hand, will keep trying beyond that point (and generate nonsense), because they don’t “realize” that the threshold hasn’t been met.
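One way to picture that difference is a rejection threshold that the human has and the naive recognizer lacks. Everything in this sketch (the hypotheses, the scores, the cutoff) is invented for illustration.

```python
# A human "gives up" below some confidence level; a naive recognizer
# just returns its best guess anyway, sensible or not.

def transcribe(hypotheses, reject_below=None):
    best, score = max(hypotheses, key=lambda h: h[1])
    if reject_below is not None and score < reject_below:
        return "What???"   # admit the threshold hasn't been met
    return best

hypotheses = [("wreck a nice beach", 0.31), ("recognize speech", 0.29)]

print(transcribe(hypotheses))                    # "wreck a nice beach" - nonsense wins
print(transcribe(hypotheses, reject_below=0.6))  # "What???"
```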

FWIW I’m actually impressed by how well the speech recognition google search option works on my android phone. Even when I screw up halfway into a word it usually still picks the correct word.

I also have an app where I can either type or use speech to take short notes and reminders. That is good (95% correct rate at least), but not as good as the android google search.

I remember in the '90s, when Star Trek TNG was on, a computer you controlled by voice was considered sci-fi; now we're pretty close to that, and in a way we already have it.

One problem is that in order to interpret speech correctly, you often have to know quite a bit about the topic you are discussing. It's not just "this noise made by the human most likely translates to this word". It's "this noise made by the human in the context of this sentence most likely translates to one of these words, and he was talking about boats earlier (or later) in the sentence, so likely it is the nautical term instead of the similar-sounding pastry, but maybe it's actually the proper name of someone he's just bringing up now, so let me guess how that is likely spelled". The sounds themselves don't contain all the information, unless the person is speaking unnaturally clearly, and probably not even then.
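Here's a toy version of that "he was talking about boats" reasoning. The candidate words, topic lists, and scoring are all invented (I've used the genuine homophone pair berth/birth rather than a nautical/pastry pair), but the idea is the same: the acoustics alone can't decide, so earlier words tip the scale.

```python
# Toy sketch: earlier words in the sentence decide between two
# words that sound identical. Vocabulary and topics are made up.

TOPIC_WORDS = {
    "nautical": {"boat", "deck", "harbor", "sail", "berth"},
    "family":   {"baby", "hospital", "mother", "birth"},
}

def pick(candidates, earlier_words):
    def score(word):
        # One point for every earlier word that shares a topic with the candidate.
        topics = {t for t, vocab in TOPIC_WORDS.items() if word in vocab}
        return sum(1 for w in earlier_words for t in topics if w in TOPIC_WORDS[t])
    return max(candidates, key=score)

print(pick(["berth", "birth"], "the boat needs a new".split()))       # -> "berth"
print(pick(["berth", "birth"], "the baby arrived after the".split())) # -> "birth"
```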

It’s kind of like a related problem in machine vision. If you see a trapezoid in your field of view, you are likely to interpret it as a rectangular horizontal surface like a table, rather than a trapezoid hanging in midair. This is because you are a human and the human world is full of horizontal surfaces and has a noted paucity of mid-air trapezoids. But the computer doesn’t have that experience, and has to be told that tables are common in order to get the most likely interpretation of the scene in front of it correct.

In both cases, you have to already have a significant conception of the world to correctly interpret what you are presented with.
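A back-of-the-envelope Bayes calculation shows how that prior knowledge does the work in the trapezoid example. The numbers are pure invention; the point is only that when both hypotheses explain the image about equally well, the prior decides.

```python
# Both hypotheses explain a trapezoid in the image about equally well,
# so the prior (how common each thing is in the real world) decides.
priors = {
    "horizontal table seen in perspective": 0.10,
    "trapezoid hanging in mid-air":         0.0001,
}
likelihood = {
    "horizontal table seen in perspective": 0.8,
    "trapezoid hanging in mid-air":         0.9,
}

unnormalized = {h: priors[h] * likelihood[h] for h in priors}
total = sum(unnormalized.values())
for hypothesis, p in unnormalized.items():
    print(f"{hypothesis}: {p / total:.4f}")
# The table hypothesis ends up with ~99.9% of the probability mass,
# purely because mid-air trapezoids are assigned a tiny prior.
```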

Even better, when Captain Picard asks his Android phone for “tea”, it already knows that he prefers “Earl Grey, hot” so he doesn’t have to say it every time. One gets the impression that the Enterprise’s computer would serve him iced Darjeeling every time he wasn’t completely specific. Most unrealistic part of ST:TNG by far.

This. I remember the voice recognition software from 10 years ago. To get any kind of usable accuracy you needed to pause. between. each. word. Now people can speak normally, without even knowing they are speaking to a computer, and it's recognized pretty well. I'm still amazed that a computer can listen to a voice mail message and transcribe it well enough for most practical purposes.

It will never be reliable. Turning sound into text is the same kind of problem as turning a raster image into a vector (in other words, a JPEG photo into a geometrically accurate CAD model). The desired outcome is too precise to expect, because of the wide range of idiosyncrasies in the sound wave.

Yes, there are mechanisms in place right now that offer choices where the answer will trigger an action, but how many times do you get "I didn't understand, do you mean...?" or some such message from a recorded database? If you have a speech impediment, you're screwed.

I work with software every day that tries to interpret music; it can't tell what the important chord progression is, and it interprets bass lines beneath the chords as chord changes. The result is a mess. I have to spend a while editing the results into something usable for guitarists (like myself).

Well, that’s what was meant when I said above, “Additionally, when we’re listening to a person, we bring in larger contextual cues to decode these purely auditory signals . . .” but it bears repeating.

I believe that this is how Watson, the computer that played Jeopardy, works. It has a vast network of contextual cues at its disposal, and is able to process them according to algorithms that arrive at the most likely topic.

And yet, we already have machines that interpret speech with human levels of reliability. They’re called humans :slight_smile: .

Unless humans are “magical”, it is possible to make a machine that just does what a human does. It may not be practical, and such a machine may be too expensive for a particular use, but that’s a different story.

I suspect that the problem with such software stems more from the fact that guitar transcription is a problem that is relevant to far fewer people than general speech to text is.

One of my previous phones had a feature where, when it didn't understand, it would say "I didn't understand …" and then play back what the microphone had recorded. I suppose the developer's intent was to train the user to speak more carefully.

The fidelity of the recording was pitiful. Not as good as a 1960s long distance telephone. In general I couldn’t understand what was said using my own built-in voice recognizer (i.e. brain). And this was my own voice having said those exact words not 15 seconds ago. So I knew what had been said and the full context!

No wonder the dumb CPU had a hard time with it.

I'd really be interested to know how good the fidelity and digitization are now. Surely it's better, but surely there's also an engineering tradeoff here; you want enough fidelity to capture the necessary differences, but no more than that. Especially if you're sending the audio off-device for analysis.
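Some rough, uncompressed data-rate arithmetic shows why that tradeoff matters when audio leaves the device. The format choices below are just common reference points; real apps also run the audio through a compressing codec, which cuts these numbers much further.

```python
# Uncompressed data rate = sample rate x bits per sample x channels.

def kbit_per_sec(sample_rate_hz, bits_per_sample, channels=1):
    return sample_rate_hz * bits_per_sample * channels / 1000

for name, rate, bits in [("telephone quality", 8_000, 8),
                         ("typical ASR input", 16_000, 16),
                         ("CD quality (mono)", 44_100, 16)]:
    print(f"{name:18s}: {kbit_per_sec(rate, bits):6.1f} kbit/s")
```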

I am actually amazed at how good it’s become in the last few years.

In Computer Science, I really did hear “We’re going to solve _____* in the next 5 years!” over and over again for decades. Nice to see some real progress for once.

  • Fill in the blank for various AI problems.