One problem is that in order to interpret speech correctly, you often have to know quite a bit about the topic being discussed. It’s not just “this noise made by the human most likely translates to this word”. It’s “this noise made by the human in the context of this sentence most likely translates to one of these words, and he was talking about boats earlier (or later) in the sentence, so it’s likely the nautical term instead of the similar-sounding pastry, but maybe it’s actually the proper name of someone he’s just bringing up now, so let me guess how that is likely spelled”. The sounds themselves, unless the person is speaking unnaturally clearly, and probably not even then, don’t contain all the information.
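To make this concrete, here is a toy sketch of context-based disambiguation. Everything in it is invented for illustration (the pseudo-phonetic key, the word pair “current”/“currant”, and the tiny co-occurrence table standing in for world knowledge); a real recognizer uses statistical language models, not hand-built dictionaries, but the principle is the same: the sound alone gives a tie, and the surrounding words break it.

```python
# Hypothetical acoustic output: one sound, two equally plausible spellings.
CANDIDATES = {"KER-uhnt": ["current", "currant"]}

# Tiny hand-built stand-in for topical knowledge: words that tend to
# co-occur with each candidate (nautical vs. baking).
CONTEXT_AFFINITY = {
    "current": {"boat", "tide", "sail", "harbor"},
    "currant": {"bun", "bake", "scone", "oven"},
}

def disambiguate(sound, context_words):
    """Pick the candidate that shares the most words with the sentence."""
    def score(word):
        return len(CONTEXT_AFFINITY[word] & set(context_words))
    return max(CANDIDATES[sound], key=score)

print(disambiguate("KER-uhnt", ["the", "boat", "fought", "the", "tide"]))
# → current
print(disambiguate("KER-uhnt", ["she", "put", "a", "bun", "in", "the", "oven"]))
# → currant
```

The point of the toy is that `disambiguate` cannot work from the sound alone; the function is useless without the affinity table, which is exactly the “knowing quite a bit about the topic” the paragraph describes.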
It’s kind of like a related problem in machine vision. If you see a trapezoid in your field of view, you are likely to interpret it as a rectangular horizontal surface like a table, rather than a trapezoid hanging in midair. This is because you are a human, and the human world is full of horizontal surfaces and has a noted paucity of midair trapezoids. But the computer doesn’t have that experience, and has to be told that tables are common in order to get the most likely interpretation of the scene in front of it right.
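One common way to frame this is Bayesian: the image evidence (the likelihood) is identical for both explanations of the trapezoid, so prior experience of the world does all the deciding. The numbers below are made up purely to illustrate that framing, not taken from any vision system.

```python
# Made-up numbers: both hypotheses explain the retinal trapezoid equally
# well (same likelihood), so the prior — "tables are common, floating
# trapezoids are not" — determines the winning interpretation.
hypotheses = {
    "horizontal rectangular table": {"prior": 0.99, "likelihood": 0.5},
    "trapezoid hanging in midair":  {"prior": 0.01, "likelihood": 0.5},
}

def best_interpretation(hyps):
    # Posterior ∝ prior × likelihood; normalization doesn't change the argmax.
    return max(hyps, key=lambda h: hyps[h]["prior"] * hyps[h]["likelihood"])

print(best_interpretation(hypotheses))  # → horizontal rectangular table
```

A human comes pre-loaded with that prior; the computer has to be handed it explicitly, which is the “has to be told that tables are common” step.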
In both cases, you have to already have a significant conception of the world to correctly interpret what you are presented with.
Even better, when Captain Picard asks his Android phone for “tea”, it already knows that he prefers “Earl Grey, hot” so he doesn’t have to say it every time. One gets the impression that the Enterprise’s computer would serve him iced Darjeeling every time he wasn’t completely specific. Most unrealistic part of ST:TNG by far.