How far away is reliable Star Trek-style verbal communication with computers?

Currently it goes like this (if you have the right software): “Start… Programs… Word”

(Computer loads Microsoft Word. If you’re lucky)

"My… Name… Is… Is… Is… IS!… Lobsang… Delete… Lobsang… Delete… "
How far away is the day when it will be like this…

User: Computer, I’d like to dictate a story.

Computer: Would you like me to use Microsoft Word for this task?

User: No, it’s just a short story. Use Notepad.

Computer loads Notepad.

User dictates story at casual conversational pace.

Computer transcribes it perfectly.

User: Computer, what’s the largest file in storage?

Computer: The largest file currently stored in non-volatile memory is ‘game_data.dat’.

User: Find all my pictures of boats.

Computer finds images containing boats.

If my experience with voice recognition software used by insurance companies is anything to go by, waaaaayyyyy off.

I work primarily with Blue Cross. Sometimes I have to call their Blue Card line to get eligibility information. You call the number and have to speak the prefix of the patient’s policy number, and it’s supposed to transfer you to the state where the plan originated. Here’s an example “conversation”:

Computer: “Please speak only the three letter prefix.”
MBS: “Z C Y”
Computer: “Let me say it this way, I think you said “B” as in Bravo, “P” as in potato, and “I” as in Indian. Is that correct?”
MBS: “No.”
Computer: “Please try again. Say only the first three letters of the policy number.”
MBS: “Z C Y” (I’m now enunciating so hard, I’m nearly rupturing my larynx)
Computer: “Let me say it this way, I think you said, “B” as in Bravo, “M” as in Mary, and “R” as in Radio. Is this correct?”
MBS: “NO”
Computer: “Please hold while I transfer you to customer service.”

I have to do this 4-5 times per week. The computer gets it right about 1 in 10.

Somewhere on the order of 30-50 years for generic, almost 100% reliable speech recognition. No idea about the “find all images of boats” thing.

It depends on what exactly you expect. Systems with a very limited scope exist today. For example, at my institute at my university we have a speech-controlled elevator. It allows relatively free dialog, but of course only on an extremely narrow topic. A typical dialog looks like this (in continuous speech):

Elevator?
Hello, where do you want to go?
Err, I am looking for Professor Barry.
Ok, I will take you to the 4th floor.

OK, it’s not that impressive :slight_smile: But it shows the basic principle behind working systems today: you restrict the domain and choose tasks that can be solved without too much in-depth analysis.
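To make that concrete, here is a minimal sketch of the restricted-domain idea (names and floors are made up, and a real elevator takes audio through an acoustic recognizer rather than typed text): with only a handful of destinations, crude keyword spotting is all the “understanding” you need.

```python
# Toy sketch of a restricted-domain dialog system (hypothetical names).
# A real system would get its words from a speech recognizer; we take
# text here, which is the part that matters for the dialog logic.
DESTINATIONS = {
    "barry": "the 4th floor",      # Professor Barry's office (made up)
    "library": "the 2nd floor",
    "lobby": "the ground floor",
}

def elevator_reply(utterance: str) -> str:
    """Keyword spotting: scan the utterance for any known destination."""
    text = utterance.lower()
    for keyword, floor in DESTINATIONS.items():
        if keyword in text:
            return f"Ok, I will take you to {floor}."
    return "Sorry, where do you want to go?"

print(elevator_reply("Err, I am looking for Professor Barry."))
# -> Ok, I will take you to the 4th floor.
```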

A big problem is that many applications depend on very “deep” analysis. Simply recognizing the sounds is not that hard any more, but analysing the “meaning” is extremely hard: there are words that sound the same, most sentences don’t have an unambiguous structure, many words have multiple meanings, and often “world knowledge” is required…
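The structural-ambiguity problem is easy to demonstrate. Here is a toy example using NLTK and the textbook “telescope” sentence (nothing to do with the elevator project, just an illustration): one ordinary word sequence, two valid parse trees, and nothing in the words themselves says which reading was meant.

```python
import nltk

# A tiny grammar for the classic ambiguous sentence. Did I use the
# telescope to see the man, or did the man I saw have the telescope?
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> 'I' | Det N | NP PP
    VP  -> V NP | VP PP
    PP  -> P NP
    Det -> 'the'
    N   -> 'man' | 'telescope'
    V   -> 'saw'
    P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()
for tree in parser.parse(sentence):
    print(tree)   # prints two structurally different parses
```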

So a truly universal dialog system is still relatively far in the future, but I estimate that we will have something that is useful for many everyday tasks way earlier than 30-50 years from now.

A project from our institute (but unfortunately I have nothing to do with it) that does research on this topic:
http://www.talk-project.org/

Btw, that’s the project where I work as a student assistant. This is one of the many basics required before your computer can really understand you.

A few years ago, while working for a law firm, I was looking into voice recognition for them. Some of the partners thought it was going to be a wonderful cost-cutting technology - after all, they could just dictate straight into their computers now, so they could fire half of the secretaries.

At one demonstration of the system I was overcome by having seen 2001 on TV the night before and decided to attempt to dictate “Daisy, Daisy” into the machine.

I enunciated carefully and loudly, just like it said in the documentation. I can remember quite clearly that what came up on the screen was:

Daisy Daisy
Give me your aunts to chew
Buying hearth crazy
Over the love of you
It won’t be a stylish manage
I can’t afford a gravel
But you’ll look sweet on the sweet
Of a bicycle made for two.

For some reason that package always had an affection for gravel. It used to insert the word seemingly at random.

I wish I could say the technology has got better over the years, but it still seems just as bad whenever I run across it.

I don’t want to excuse the crappy performance of existing speech recognition systems, but be very careful with that. A very common source of error in modern systems is people who try to “help” the system in some way. It is true that it can only hear what you actually said, and you shouldn’t mumble, obscure your mouth, or anything like that. But never try to sound like a stage actor (or like those people who talk to foreigners and the hard-of-hearing in the same way). This problem is so great that there are separate models for “hyperarticulate” speech in the works, to be used as a fallback if the input wasn’t understood by the “natural” model.

On a side note, how long before it goes like this…
User: Computer, Find all my Work documents.

Computer: No.

Just for kicks, I loaded up my Microsoft Speech SDK thingy and tried “Daisy, Daisy” a couple times as well.

Here’s the overenunciation result:

daisy daisy give me your answer to
I’d go crazy over you
it won’t be a violation marriage (and here the rest suffers a bit because I was laughing)
I can’t afford a tear in each
book will look so we can
on a bicycle built for two

Here’s the regular speech result:

Daisy daisy give me your answer do with
I’d go crazy over you
it won’t be a stylish managed
I can’t afford it very little boy
but will look sleek,
bicycle built for too long

Here’s what happened when I tried actually singing into the microphone:

The C. B. C. P. B. you’re and cert do
I go crazy the whole for the new
ITU won’t be as stymie-marriage
I can’t afford that.
But both will look this week to
own up by its goal built for two

Using the “TalkBack” feature, I got:
I heard: the V. A. T. U. V. your answer you will
I heard: they go crazy over you will
I heard: it won’t be a stylish marriage
I heard: I can’t afford it tended
I heard: but will looks weak bicycle built for two

Clearly, there’s a ways to go. And my singing sucks.

It’s quite likely to be developed in my lifetime (and I’m not that young). And it will be a disaster.

If people use computers to dictate documents, they will be unreadable. You have a tendency to get wordy when dictating, and there are all sorts of rambles and tangents you get onto. Try reading late Henry James (from the period when he dictated his work to a secretary). :o

Panther (Mac OS X 10.3) has a decent voice-recognition feature. Though it only recognizes a limited number of commands, it works fairly well - you can program it to recognize your favorite webpages, etc.

I thought I remembered Christopher Reeve being in a movie where all the voice-command stuff he used was, he said, actually available.

Looks like it was Rear Window.

I’d say that it depends on what you want to do specifically. I think dictation is a ways off, but setting up keywords to perform certain functions is well within the range of most software now.

I plowed through a voice recognition menu only to get deep into the wrong place.
CP: “Aw, F–K!”
Computer: “If you would like to end this session, please hang up.”

There are 3 main layers for conversational computing:

  1. Voice recognition. Figuring out what words were said.
  2. Natural Language understanding. Figuring out what the words mean.
  3. Responding. Figuring out what to say/do. This is a Holy Grail of AI.
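
Sketched as code, purely as illustrative stubs (every one of these functions hides an entire research field):

```python
def recognize(audio):
    """Step 1, voice recognition: audio in, words out (stubbed here)."""
    return ["what", "is", "the", "largest", "file"]

def understand(words):
    """Step 2, natural language understanding: words in, intent out."""
    return {"intent": "find_largest_file"}

def respond(intent):
    """Step 3, responding: decide what to say or do. The Holy Grail."""
    return "The largest file is 'game_data.dat'"

def converse(audio):
    # The three layers chained together.
    return respond(understand(recognize(audio)))

print(converse(audio=None))
```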

There’s been great progress on step 1. Maybe we’re 60% of the way there (with the remaining 40% being the hardest part, ergo it will take more than twice as long). Note that there is considerable overlap between steps 1 and 2: when the computer hears “to”, did you mean “to”, “two”, or “too”?
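
For what it’s worth, that overlap is usually handled with a language model: the recognizer scores each homophone by how plausible it is next to its neighbours. A toy illustration with made-up counts (not from any real corpus):

```python
# Crude bigram "language model" with invented counts, purely to show
# how context can pick between homophones the acoustics can't separate.
BIGRAMS = {
    ("want", "to"): 900, ("want", "two"): 4, ("want", "too"): 2,
    ("to", "go"): 850, ("two", "go"): 1, ("too", "go"): 1,
}

def pick_homophone(prev_word, next_word, candidates):
    # Score each candidate by how well it fits between its neighbours.
    def score(w):
        return BIGRAMS.get((prev_word, w), 0) + BIGRAMS.get((w, next_word), 0)
    return max(candidates, key=score)

# "I want ___ go home" -- acoustically the three are near-identical.
print(pick_homophone("want", "go", ["to", "two", "too"]))   # -> 'to'
```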

Note that you can avoid step 1 entirely by having people type in their side. But that’s merely bypassing the easiest step.

Steps 2 and 3 are an extremely long way off except in limited circumstances. In the realm of running a starship bridge, there are enough limits that a not-too-far-off computer could handle most responses: “Ahead warp 5”, “Shields up”, “Activate nebulizer”. But if you try “What do you want to watch tonight?”, it’s far from easy.

AI people were saying “We’re 5 years from Natural Language understanding” from the mid-1950s through at least the 1980s. I.e., we’re centuries away. So that’s hard to predict. Maybe 100 years, maybe never.

Part of the problem is that, in Star Trek, the sorts of tasks they give their computers are different from the sorts of things we typically do on a desktop PC, and even in those cases where they are similar, the ST crew get away with “look for correlations in the personal log files and the cargo manifest”. Imagine translating this into SQL: what fields do you want to join, and in which direction? How will the query match ‘A big scarlet box’ (in the personal logs) with ‘packing crate, 1.2 cu m, red’ (in the cargo manifest)? Etc., etc.
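
To make the SQL point vivid, here is a toy sketch (hypothetical table names, deliberately naive join): the literal query finds nothing, because the match a human makes between a ‘big scarlet box’ and a ‘red packing crate’ lives in world knowledge, not in any shared column.

```python
import sqlite3

# Hypothetical starship tables: the same object described two ways.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE personal_log (entry TEXT);
    CREATE TABLE cargo_manifest (item TEXT, volume_m3 REAL, colour TEXT);
    INSERT INTO personal_log VALUES ('Stowed my gear in a big scarlet box');
    INSERT INTO cargo_manifest VALUES ('packing crate', 1.2, 'red');
""")

# The obvious literal join: look for the manifest's colour word inside
# the log entry. It returns nothing -- the log says 'scarlet', the
# manifest says 'red', and no column relates the two descriptions.
rows = con.execute("""
    SELECT p.entry, c.item
    FROM personal_log AS p
    JOIN cargo_manifest AS c
      ON p.entry LIKE '%' || c.colour || '%'
""").fetchall()
print(rows)   # [] -- knowing scarlet is a shade of red is world knowledge
```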

Yeah, from what I remember of Star Trek, a lot of the commands given to the computer were so woolly that even a fully sentient human would have to ask for a lot more clarification before they could actually accomplish the task requested.

This phrase has a very Douglas Adams sense about it. And, as 2001 is one of my favorite movies, I love your post. Thanks Armilla!

Yes, but in recent years you’ll have a hard time finding anyone who seriously promises natural language understanding. First of all, you won’t even find consensus on what it means; how humans understand language is itself still far from understood. However, we will probably see a time when speech input is more convenient than the present keyboard/mouse combination. Of course, other input methods will also evolve, so we will see various solutions for different scenarios. Even if we probably won’t live to see HAL-9000, speech technology will play its role.

Strong AI (i.e., computers that can think like a human and/or are conscious) is so far off right now that we might as well say we haven’t even begun.

People talk about Moore’s Law and about how processors are getting better. The problem really isn’t power (although that issue isn’t insignificant). The problem is that we can’t even write the “thought” program.

In fact, nuts like Kurzweil (www.kurzweilai.net) now propose that we will “download the human brain” in order to get our strong AI program.

Tell you what: ain’t gonna happen.

Maybe you should consider singing the blues. The software obviously thinks you have a gravelly voice.

