Because when a language has many “fixed” sounds, the amount of training needed for a given speaker is much lower than for languages with messier phoneme-grapheme matches. A language with five vowel sounds that are always represented the same way, or one with eight vowel sounds that are always represented the same way, does not require training on vowel recognition; one with a dozen vowel sounds, any of which may be represented in several ways, some of which may in turn represent several of the vowel sounds, needs vowel-recognition training.
But aren’t you talking about either how many different phonemes there are in the language (which isn’t directly the same as how the sounds are represented), or how many different accents of that language exist in the real world?
As an example of the first, the ‘f’ sound in English is pronounced maybe two ways but written something like five or six ways. And that shouldn’t be a problem for speech recognition (or even taking dictation); the program recognizes an ‘f’ sound, decides what word the speaker is trying to say, then looks up the spelling of the word to write ‘cough’. In English, without some knowledge of meaning, it might have trouble deciding whether the speaker means ‘ruff’ or ‘rough’, but that’s not connected with how the words are written.
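To make that lookup concrete, here is a toy sketch in Python (the mini-lexicon and phoneme labels are invented): the recognizer maps a phoneme sequence to a lexicon entry, then emits the stored spelling, so English sound-to-letter rules never enter into it.

```python
# Toy sketch (mini-lexicon and phoneme labels are invented): the recognizer
# maps a phoneme sequence to a lexicon entry and emits the stored spelling.
LEXICON = {
    ("k", "aw", "f"): ["cough"],          # 'f' sound written as "gh"
    ("f", "ow", "n"): ["phone"],          # 'f' sound written as "ph"
    ("r", "uh", "f"): ["ruff", "rough"],  # homophones: spelling is ambiguous
}

def transcribe(phonemes):
    spellings = LEXICON.get(tuple(phonemes), ["<unknown>"])
    # without some knowledge of meaning, we can only guess among homophones
    return spellings[0]

print(transcribe(["k", "aw", "f"]))  # -> cough
print(transcribe(["r", "uh", "f"]))  # -> ruff (could just as well be rough)
```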
And for the second, for instance, we all agree that Spanish spelling is very regular (compared to English), in the sense that a given letter is usually pronounced the same way by the same speaker and vice versa. But a program might still need training to decide whether a written ‘s’ is pronounced like an English ‘s’ or an English ‘th’ (i.e. Latin American or Castilian accent).
But those are the few letters that are different: a recognition program needs vowel training in English but not in Spanish - a lot less total training overall. The rules to match sounds to graphemes when a word is not in its list will vary slightly with dialect (that’s what the training is for), but there is actually very little need for a list at all. In English, you need a much longer list and much longer training to teach a machine how to spell something from how a given person pronounces it. You know, same as for human beings…
Graphemes which do not change for a given speaker, in Spanish:
A B CH D E F G H I J K L LL M N Ñ O P Q RR T U V
Y has only one instance where it sounds the way I does; R changes, but that’s due to a spelling rule (at the beginning of a word it sounds like RR; elsewhere they sound different). Now come up with a list like that for English… you get a ton more graphemes, for starters!
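A rough sketch of that claim in code, with informal (invented) phoneme labels: most Spanish graphemes map to exactly one sound for a given speaker, and R needs only the one positional spelling rule.

```python
# Sketch of the claim above (phoneme labels are informal assumptions):
# most Spanish graphemes map to exactly one sound for a given speaker.
FIXED = {"a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
         "f": "f", "l": "l", "m": "m", "n": "n", "t": "t"}  # etc.

def grapheme_to_sound(letter, position):
    if letter == "r":
        # spelling rule: at the beginning of a word, R sounds like RR
        return "rr" if position == 0 else "r"
    return FIXED.get(letter, "?")

word = "rana"  # initial R is trilled; the R in "pero" would not be
print([grapheme_to_sound(ch, i) for i, ch in enumerate(word)])
# -> ['rr', 'a', 'n', 'a']
```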
Maybe we’re disagreeing on what ‘speech recognition’ means? I understand it to mean ‘able to extract information from speech’, in which case knowing how to write the words is pretty much meaningless (except perhaps at the margin, if you have an unfamiliar word and a dictionary arranged by conventional spelling). After all, humans can carry on a meaningful conversation without any clue how to write the words (and you did exactly that at one early point in your life).
If your definition of ‘speech recognition’ is just taking dictation (which isn’t going to be very accurate, because meaning isn’t used to correct errors), then maybe your points make sense.
I’ve been thinking this too.
So how about my previous examples of the Finnish words ‘Tuli’, ‘Tuuli’ and ‘Tulli’? You could get a Finn to say these words at different speeds, with a slack jaw, twisting his face or whatever. Other Finns still wouldn’t be confused about which word he’s trying to say.
One could get some wrong answers by changing speed in the middle of a word: say ‘Tu-’ slowly and ‘-li’ fast, and some might think he’s saying ‘Tuuli’ instead of ‘Tuli’.
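A toy illustration of that in code (the durations and threshold are invented): the three words differ only in whether the ‘u’ or the following ‘l’ is held long.

```python
# Toy sketch (durations and threshold are invented): distinguish
# Tuli / Tuuli / Tulli by how long the 'u' and the 'l' are held.
def classify(u_ms, l_ms, short_ms=80):
    long_u = u_ms > 1.5 * short_ms  # a long segment is roughly double a short one
    long_l = l_ms > 1.5 * short_ms
    if long_u:
        return "Tuuli"   # long vowel
    if long_l:
        return "Tulli"   # long (geminate) consonant
    return "Tuli"        # both short

print(classify(u_ms=70, l_ms=75))    # -> Tuli
print(classify(u_ms=160, l_ms=75))   # -> Tuuli
print(classify(u_ms=70, l_ms=170))   # -> Tulli
```

Note that the failure mode is the same one described above: saying ‘Tu-’ slowly inflates u_ms, and the function would wrongly answer ‘Tuuli’.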
I think a language like that would be quite easy for speech recognition.
Or maybe I’m way off - see next:
You have no idea. Theoretically, every verb can be inflected in ways such that the written word could literally end up infinitely long…
Even considering dictation, we are obviously approaching this differently.
In a simplistic way, I imagine speech-to-text dictation happening like this (a toy code sketch follows the list):
- Joe says one word into microphone.
- Computer records sonic waveform of word.
- Using sophisticated statistical analysis, the computer decides that the waveform most resembles entry #65852 in its waveform repository.
- Computer looks up how to spell waveform #65852 - it is spelled “dog.”
- Computer outputs “dog” on the screen.
- Joe is happy.
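Something like this toy version, say (the repository, the numbers, and the nearest-match-by-distance step are all stand-ins; real systems use statistical models rather than a lookup of stored waveforms):

```python
import math

# Stand-in "waveform repository": entry id -> (stored waveform, spelling).
REPOSITORY = {
    65852: ([0.1, 0.9, 0.4, 0.2], "dog"),
    65853: ([0.8, 0.3, 0.7, 0.1], "right"),
}

def nearest_entry(waveform):
    # step 3: pick the entry whose stored waveform is closest to the recording
    return min(REPOSITORY.items(),
               key=lambda item: math.dist(item[1][0], waveform))

recorded = [0.12, 0.88, 0.41, 0.19]   # steps 1-2: Joe's word, recorded
entry_id, (stored, spelling) = nearest_entry(recorded)
print(f"waveform #{entry_id} is spelled {spelling!r}")  # steps 4-5 -> 'dog'
```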
Notice that the computer doesn’t care one bit about English spelling. If its lookup table said that waveform #65852 was spelled “ibkofhik”, then that is what the computer would output. No sound-to-letter equivalences are relevant.
I agree that spelling regularity is irrelevant when it comes to the recognition of words, but it could be useful in dictation. “Dog” is easy because it doesn’t have any homophones that I can think of. But what if the word is “right”? The computer looks up waveform #65853 and finds four spellings associated with it: “right”, “write”, “rite”, and “wright”. Which one to use?
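One common answer is to score the candidates against the surrounding words. A toy sketch (the context table and its numbers are invented):

```python
# Toy sketch of context-based disambiguation (the context table is invented):
# score each candidate spelling by how often it follows the previous word.
CONTEXT = {
    ("turn", "right"): 0.9,
    ("turn", "write"): 0.01,
    ("turn", "rite"): 0.01,
    ("turn", "wright"): 0.001,
    ("to", "write"): 0.8,
    ("to", "right"): 0.1,
}

def pick_spelling(previous_word, candidates):
    return max(candidates,
               key=lambda word: CONTEXT.get((previous_word, word), 0.0))

candidates = ["right", "write", "rite", "wright"]
print(pick_spelling("turn", candidates))  # -> right
print(pick_spelling("to", candidates))    # -> write
```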
In a regularised written English, they would all be spelled “rite” or something. It wouldn’t help comprehension at all, of course.
Obligatory Steve Martin reference.
“Casa. Pepe. Casa de Pepe.”
This simplistic picture glosses over some of the major problems in speech recognition.
“Joe says one word…” – but people don’t talk one word at a time; they talk in phrases and run their words together. A significant problem in speech recognition is identifying where one word stops and another starts, and it is still an issue in such programs. Early speech recognition programs handled this by requiring users to speak. one. word. at. a. time., but people generally refused to use such programs.
Another problem is in “the waveform most resembles entry…”. There are very significant differences between the waveforms from different speakers, sometimes greater than the differences between words. Compare the clipped northern “dog” to the drawled southern “dawg”, for example. Most speech recognition programs require a period of ‘training’ to accurately recognize a specific voice, and re-training for another voice.
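One classic way to cope with speed differences between speakers is dynamic time warping, which aligns recordings of different lengths before comparing them. A toy sketch, with invented one-number-per-step “features” (not how any particular product does it):

```python
# Toy sketch of why a drawled "dawg" can still match "dog": dynamic time
# warping (DTW) aligns sequences of different lengths before comparing them.
def dtw(a, b):
    INF = float("inf")
    cost = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch a
                                    cost[i][j - 1],      # stretch b
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

template = [1.0, 3.0, 2.0]                 # clipped northern "dog"
drawled = [1.0, 1.1, 3.0, 3.0, 2.9, 2.0]   # slow southern "dawg"
print(dtw(template, drawled))  # small distance despite the different lengths
```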
I completely agree with those.
I was just trying to convey that the program is not struggling to “spell out” words letter by letter by listening to the sounds. It does its best to match sounds to words, despite the fact that people DO slur words together and talk differently.
However, Ximenean is right that homophones are a distinct difficulty that such programs have to resolve (or guess) from context.