How Accurate is Voice-Recognition Software?

I’ve just received a fancy digital audio tape recorder that allows me to upload taped messages to my PC.

The manufacturer adds that an additional voice recognition software (VRS) application can then take these PC audio files and somehow convert the recorded messages into a typed document.

Problem is, I’ve heard VRS is unreliable, no matter how new the product.

Let’s say I tape someone for an hour and s/he speaks rather clearly. What kind of results am I liable to get using the best software? Does, for instance, a 90 percent accuracy level mean that 10 percent of the characters are transferred incorrectly, or that 10 percent of the words are deleted, or what?

Accuracy figures are normally quoted per word rather than per character, so 90% accuracy means that in a sentence of ten words, on average one will be misinterpreted, dropped, or replaced by something you never said. Many of these problems will come from homophones, words that sound identical but are written differently (son/sun, etc.). A decent VRS has grammar tools to filter these mistakes out, but they’re far from perfect.
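To make the "one word in ten" arithmetic concrete, here is a minimal Python sketch of a word-level accuracy score. It assumes accuracy is counted as one minus the number of word substitutions, deletions, and insertions divided by the number of words actually spoken; vendors may count differently. The function names and the sample sentence are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Fewest substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word dropped by the software
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(spoken, typed):
    """Return (word_errors, accuracy) for a transcript against what was said."""
    ref_words = spoken.lower().split()
    hyp_words = typed.lower().split()
    if not ref_words:
        raise ValueError("the spoken text must contain at least one word")
    errors = edit_distance(ref_words, hyp_words)
    return errors, 1.0 - errors / len(ref_words)


if __name__ == "__main__":
    spoken = "my son left his coat out in the sun today"
    typed = "my sun left his coat out in the sun today"   # one homophone slip
    errors, accuracy = word_accuracy(spoken, typed)
    print(f"{errors} word error(s), accuracy {accuracy:.0%}")
```

Note that a single wrong letter (son/sun) costs a whole word under this measure, which is why a per-word score looks worse than a per-character score would for the same transcript.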

Also, it depends greatly on whether the computer is “trained” for your particular voice. Dialects and accents are extremely hard for a computer to recognize and compensate for; if you train it using your own voice prior to using the VRS it will greatly improve the accuracy of the process.

AFAIK, anyway.

I have fought and fought and fought with the newest IBM ViaVoice. Sometimes it works really well, and sometimes it makes me twitch, wanting to just reach forward and type instead of correcting, and re-correcting, and re-correcting…I even did all of the training exercises, until my voice was hoarse.

Of course, I have a low hoarse voice to begin with (due largely to injury - it used to be quite high). Think Lauren Bacall with a cold…anyhow, that might have been the problem. Whatever the case, even with all that training, I was lucky to get a paragraph in without 2-4 maddening corrections.

In my case, I would say that it still has a long way to go.

Fast fingers and a ‘dic-ta-fone’ are still hard to beat.

Many years ago, in my first life, I had a wife who could do that as fast as a normal person talked, and she did not even know what was going on. In the ears and out the fingers; it was something to see.

Like Anthracite said, the systems have a long way to go.

I have a large simulator at work that relies on voice commands to control the various aspects of the simulation. Since there are a large number of people that use it, we don’t have the option of training it for a particular person’s voice. It was designed with that fact in mind.

The whole system cost about $200,000, and the voice recognition is next to worthless. We often just fall back to keyboard shortcuts.

When it comes to voice recognition, I just don’t think we are there yet. I would not count on it for any level of accuracy. One that can be trained may perform better.

I encountered a voice-recognition thing when I called UPS to track a package. My voice is kind of low as well, and I guess for that reason I was transferred automatically to a live operator.

Speech recognition software can be divided up in a couple of ways:

  1. Speaker Dependent vs. Speaker Independent

Speaker dependent software works with a particular speaker, and includes stuff like ViaVoice and Dragon. They usually require some sort of training (although the amount of training needed is significantly reduced in the newer versions), but provide higher accuracy. Speaker independent systems are used for IVRs (i.e. telephone answering services) and other applications where the user(s) cannot be pre-determined. They don’t provide as much accuracy, but serve a much wider audience.

  2. Dictation vs. Command and Control

Dictation systems allow the user to say absolutely anything, while command-and-control systems provide a more limited (and therefore less error-prone) set of recognizable sentences. With dictation software, you can say whatever you like, for example, “Colorless green ideas sleep furiously!”, though you risk being mis-heard as, “Carl’s great ideas leap furiously!” With command-and-control, on the other hand, you are more likely to say something like “robot arm up” or, in the case of a phone system, “eight-six-two-three”, but you are more likely to be heard correctly.

  3. Word-Based vs. Phonetic

Word-based systems are hard-wired to recognize just a few words. They tend to be very accurate, even in noisy environments like cars or factories. Problem is, you can’t add new words very easily (which in some applications isn’t a problem at all). Phonetic systems, on the other hand, are less accurate, but recognize discrete phonemes (sound segments), which can in turn be used to recognize any possible word in a particular language. Your ViaVoice, Dragon, and so forth are all phonetic.

So there are a lot of trade-offs, and the best choice is very application-dependent.
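A rough way to see why a small command set is less error-prone than free dictation: with only a handful of legal phrases, a slightly garbled guess can be snapped to the nearest valid command, and anything too far off can be rejected outright. The Python sketch below does this on text, using the standard library’s `difflib`. Real engines constrain the search inside the recognizer rather than patching up text afterwards, so treat this purely as an illustration; the command list, the `snap_to_command` name, and the 0.6 cutoff are all assumptions.

```python
import difflib

# Hypothetical command vocabulary for a command-and-control application.
COMMANDS = [
    "robot arm up",
    "robot arm down",
    "robot arm left",
    "robot arm right",
    "stop",
]


def snap_to_command(heard, commands=COMMANDS, cutoff=0.6):
    """Return the closest allowed command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(heard.lower(), commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for heard in ["robot arm cup", "colorless green ideas sleep furiously"]:
        print(repr(heard), "->", snap_to_command(heard))
```

A dictation system has no such safety net: every word in the language is a legal answer, so "Carl’s great ideas" is just as acceptable to it as "Colorless green ideas".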

Accuracy with dictation software (the OP’s obvious choice) can be improved in a couple of ways:

  1. By training and retraining the engine a couple of times. However, too much training can make accuracy go DOWN, so be careful.

  2. Learning to talk to the software. The more you dictate, the better you’ll become at adjusting your voice to suit the software’s needs. This is probably best done in front of a screen, so you are getting direct feedback, rather than into a tape recorder.

  3. Using a good microphone. Noise-cancelling mikes can be good in a somewhat noisy environment, like an office.

– CH

I bought VoiceXpress some three or four years ago, when supposedly it was competing favorably with Dragon NaturallySpeaking and other voice recognition packages of standing.

The method is:

  1. Spend about an hour training the system to recognize your voice.

  2. Record what you want transcribed either into a mike attached to the PC, or to an independent digital recorder supplied with the product.

  3. Replay your voice into a software analyzer.

  4. Edit the results.

The results were not satisfactory: so many words had to be corrected that it would have been easier to type directly. Sometimes the mistakes were so bad that it took considerable effort to figure out what was supposed to have been said.

However, the instructions and monitoring process indicated that I was not very successful in the training session: my voice varies too much. Someone with more regular, or more monotonous, speech would probably have better (even good) luck.