Because when a language has many “fixed” sounds, the amount of training needed for a given speaker is much lower than for languages with messier phoneme-grapheme matches. A language with five vowel sounds that are always represented the same way, or one with eight vowel sounds that are always represented the same way, does not require training on vowel recognition; one with a dozen vowel sounds, any of which may be represented in several ways, some of which may in turn represent several of the vowel sounds, needs vowel-recognition training.
But aren’t you talking about either how many different phonemes there are in the language (which isn’t directly the same as how the sounds are represented), or how many different accents of that language exist in the real world?
As an example of the first, the ‘f’ sound in English is pronounced maybe two ways but written something like five or six ways. And that shouldn’t be a problem for speech recognition (or even taking dictation); the program recognizes an ‘f’ sound, decides what word the speaker is trying to say, then looks up the spelling of the word to write ‘cough’. In English, without some knowledge of meaning, it might have trouble deciding whether the speaker means ‘ruff’ or ‘rough’, but that’s not connected with how the words are written.
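To make that lookup concrete, here is a toy sketch in Python (the mini-lexicon and phoneme labels are invented): the recognizer maps a phoneme sequence to a lexicon entry, then emits the stored spelling, so English sound-to-letter rules never enter into it.

```python
# Toy sketch (mini-lexicon and phoneme labels are invented): the recognizer
# maps a phoneme sequence to a lexicon entry and emits the stored spelling.
LEXICON = {
    ("k", "aw", "f"): ["cough"],          # 'f' sound written as "gh"
    ("f", "ow", "n"): ["phone"],          # 'f' sound written as "ph"
    ("r", "uh", "f"): ["ruff", "rough"],  # homophones: spelling is ambiguous
}

def transcribe(phonemes):
    spellings = LEXICON.get(tuple(phonemes), ["<unknown>"])
    # without some knowledge of meaning, we can only guess among homophones
    return spellings[0]

print(transcribe(["k", "aw", "f"]))  # -> cough
print(transcribe(["r", "uh", "f"]))  # -> ruff (could just as well be rough)
```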
And for the second, for instance, we all agree that Spanish spelling is very regular (compared to English), in the sense that a given letter is usually pronounced the same way by the same speaker and vice versa. But a program might still need training to decide whether a written ‘s’ is pronounced like an English ‘s’ or an English ‘th’ (i.e. Latin American or Castilian accent).
But those are the few letters that are different: a recognition program needs vowel training in English but not in Spanish - a lot less total training overall. The rules to match sounds to graphemes when a word is not in its list will vary slightly with dialect (that’s what the training is for), but there is actually very little need for a list at all. In English, you need a much longer list and much longer training to teach a machine how to spell something from how a given person pronounces it. You know, same as for human beings…
Graphemes which do not change for a given speaker, in Spanish:
A B CH D E F G H I J K L LL M N Ñ O P Q RR T U V
Y has only one instance where it sounds the way I does; R changes, but that’s due to a spelling rule (at the beginning of a word it sounds like RR; elsewhere they sound different). Now come up with a list like that for English… you get a ton more graphemes, for starters!
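A rough sketch of that claim in code, with informal (invented) phoneme labels: most Spanish graphemes map to exactly one sound for a given speaker, and R needs only the one positional spelling rule.

```python
# Sketch of the claim above (phoneme labels are informal assumptions):
# most Spanish graphemes map to exactly one sound for a given speaker.
FIXED = {"a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
         "f": "f", "l": "l", "m": "m", "n": "n", "t": "t"}  # etc.

def grapheme_to_sound(letter, position):
    if letter == "r":
        # spelling rule: at the beginning of a word, R sounds like RR
        return "rr" if position == 0 else "r"
    return FIXED.get(letter, "?")

word = "rana"  # initial R is trilled; the R in "pero" would not be
print([grapheme_to_sound(ch, i) for i, ch in enumerate(word)])
# -> ['rr', 'a', 'n', 'a']
```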
Maybe we’re disagreeing on what ‘speech recognition’ means? I understand it to mean ‘able to extract information from speech’, in which case knowing how to write the words is pretty much meaningless (except perhaps at the margin, if you have an unfamiliar word and a dictionary arranged by conventional spelling). After all, humans can carry on a meaningful conversation without any clue how to write the words (and you did exactly that at one early point in your life).
If your definition of ‘speech recognition’ is just taking dictation (which isn’t going to be very accurate, because meaning isn’t used to correct errors), then maybe your points make sense.
I’ve been thinking this too.
So how about my previous examples of the Finnish words ‘Tuli’, ‘Tuuli’ and ‘Tulli’? You could get a Finn to say these words at different speeds, with a slack jaw, twisting his face or whatever. Other Finns still wouldn’t be confused about which word he’s trying to say.
One could get some wrong answers by changing speed in the middle of a word: say ‘Tu-’ slowly and ‘-li’ fast, and some might think he’s saying ‘Tuuli’ instead of ‘Tuli’.
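A toy illustration of that in code (the durations and threshold are invented): the three words differ only in whether the ‘u’ or the following ‘l’ is held long.

```python
# Toy sketch (durations and threshold are invented): distinguish
# Tuli / Tuuli / Tulli by how long the 'u' and the 'l' are held.
def classify(u_ms, l_ms, short_ms=80):
    long_u = u_ms > 1.5 * short_ms  # a long segment is roughly double a short one
    long_l = l_ms > 1.5 * short_ms
    if long_u:
        return "Tuuli"   # long vowel
    if long_l:
        return "Tulli"   # long (geminate) consonant
    return "Tuli"        # both short

print(classify(u_ms=70, l_ms=75))    # -> Tuli
print(classify(u_ms=160, l_ms=75))   # -> Tuuli
print(classify(u_ms=70, l_ms=170))   # -> Tulli
```

Note that the failure mode is the same one described above: saying ‘Tu-’ slowly inflates u_ms, and the function would wrongly answer ‘Tuuli’.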
I think a language like that would be quite easy for speech recognition.
Or maybe I’m way off - see next:
You have no idea. Theoretically, every verb can be inflected in ways such that the written word could literally end up infinitely long…
Even considering dictation, we are obviously approaching this differently.
In a simplistic way, I imagine speech-to-text dictation happening like this (a toy code sketch follows the list):
- Joe says one word into microphone.
- Computer records sonic waveform of word.
- Using sophisticated statistical analysis, the computer decides that the waveform most resembles entry #65852 in its waveform repository.
- Computer looks up how to spell waveform #65852 - it is spelled “dog.”
- Computer outputs “dog” on the screen.
- Joe is happy.
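Something like this toy version, say (the repository, the numbers, and the nearest-match-by-distance step are all stand-ins; real systems use statistical models rather than a lookup of stored waveforms):

```python
import math

# Stand-in "waveform repository": entry id -> (stored waveform, spelling).
REPOSITORY = {
    65852: ([0.1, 0.9, 0.4, 0.2], "dog"),
    65853: ([0.8, 0.3, 0.7, 0.1], "right"),
}

def nearest_entry(waveform):
    # step 3: pick the entry whose stored waveform is closest to the recording
    return min(REPOSITORY.items(),
               key=lambda item: math.dist(item[1][0], waveform))

recorded = [0.12, 0.88, 0.41, 0.19]   # steps 1-2: Joe's word, recorded
entry_id, (stored, spelling) = nearest_entry(recorded)
print(f"waveform #{entry_id} is spelled {spelling!r}")  # steps 4-5 -> 'dog'
```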
Notice that the computer doesn’t care one bit about English spelling. If its lookup table said that waveform #65852 was spelled “ibkofhik”, then that is what the computer would output. No sound-to-letter equivalences are relevant.
I agree that spelling regularity is irrelevant when it comes to the recognition of words, but it could be useful in dictation. “Dog” is easy because it doesn’t have any homophones that I can think of. But what if the word is “right”? The computer looks up waveform #65853 and finds four spellings associated with it: “right”, “write”, “rite”, and “wright”. Which one to use?
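One common answer is to score the candidates against the surrounding words. A toy sketch (the context table and its numbers are invented):

```python
# Toy sketch of context-based disambiguation (the context table is invented):
# score each candidate spelling by how often it follows the previous word.
CONTEXT = {
    ("turn", "right"): 0.9,
    ("turn", "write"): 0.01,
    ("turn", "rite"): 0.01,
    ("turn", "wright"): 0.001,
    ("to", "write"): 0.8,
    ("to", "right"): 0.1,
}

def pick_spelling(previous_word, candidates):
    return max(candidates,
               key=lambda word: CONTEXT.get((previous_word, word), 0.0))

candidates = ["right", "write", "rite", "wright"]
print(pick_spelling("turn", candidates))  # -> right
print(pick_spelling("to", candidates))    # -> write
```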
In a regularised written English, they would all be spelled “rite” or something. It wouldn’t help comprehension at all, of course.
Obligatory Steve Martin reference.
“Casa. Pepe. Casa de Pepe.”
This simplistic picture glosses over some of the major problems in speech recognition.
“Joe says one word…” – but people don’t talk one word at a time; they talk in phrases and run their words together. A significant problem in speech recognition is identifying where one word stops and another starts, and it is still an issue in such programs. Early speech recognition programs handled this by requiring users to speak. one. word. at. a. time., but people generally refused to use such programs.
Another problem is in “the waveform most resembles entry…”. There are very significant differences between the waveforms from different speakers, sometimes greater than the differences between words. Compare the clipped northern “dog” to the drawled southern “dawg”, for example. Most speech recognition programs require a period of ‘training’ to accurately recognize a specific voice, and re-training for another voice.
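One classic way to cope with speed differences between speakers is dynamic time warping, which aligns recordings of different lengths before comparing them. A toy sketch, with invented one-number-per-step “features” (not how any particular product does it):

```python
# Toy sketch of why a drawled "dawg" can still match "dog": dynamic time
# warping (DTW) aligns sequences of different lengths before comparing them.
def dtw(a, b):
    INF = float("inf")
    cost = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch a
                                    cost[i][j - 1],      # stretch b
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

template = [1.0, 3.0, 2.0]                 # clipped northern "dog"
drawled = [1.0, 1.1, 3.0, 3.0, 2.9, 2.0]   # slow southern "dawg"
print(dtw(template, drawled))  # small distance despite the different lengths
```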
I completely agree with those.
I was just trying to convey that the program is not struggling to “spell out” words letter by letter by listening to the sounds. It does its best to match sounds to words, despite the fact that people DO slur words together and talk differently.
However, Ximenean is right that homophones are a distinct difficulty that such programs have to resolve (or guess) from context.