The recent mention of Language Grammars caught my attention, and triggered a memory from a class I took last semester in Automata Theory. Don’t worry, it’s relevant, and maybe helps to explain why it’s very hard to make a straight English compiler.
So any sort of language needs a grammar, right? (OK, that’s not true, but any language which can be readily understood needs one.) Well, the linguist Noam Chomsky formalized an idea called Context Free Grammars (or CFGs) back in the 1950s. He did a lot of research in the field, and a standard restricted form for writing these grammars is named after him: Chomsky Normal Form, or CNF.
The general idea of CFGs is that you have two different kinds of symbols, terminals and non-terminals, and a bunch of rules. An example CFG would look like this:
S -> AB
A -> aA | a
B -> bb
To use these grammars, you start with S and begin following the rules, replacing any non-terminal symbol (the capital letters) with one of the alternatives on the right side of its rule. You keep going until only terminals (the lowercase letters) are left. So I can replace S with AB, then I can replace the B with “bb”, and the A with “a”, creating the string “abb”.
The set of all the strings that can be produced by these rules is called the language (L) of this grammar. For the grammar above, that is every string made of one or more a’s followed by “bb”.
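To make that concrete, here is a rough Python sketch that enumerates the language of the example grammar by brute force, always expanding the leftmost non-terminal. The names RULES and generate are just made up for this illustration, and the length cutoff only works because no rule in this grammar ever makes a string shorter.

```python
from collections import deque

# The example grammar from above. Uppercase letters are non-terminals,
# lowercase letters are terminals. (RULES is a name chosen for this sketch.)
RULES = {
    "S": ["AB"],
    "A": ["aA", "a"],
    "B": ["bb"],
}

def generate(max_len):
    """Return every terminal string derivable from S with length <= max_len."""
    results = set()
    queue = deque(["S"])
    seen = {"S"}
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal, if there is one.
        idx = next((i for i, ch in enumerate(form) if ch in RULES), None)
        if idx is None:
            # Only terminals left: this is a string in the language.
            results.add(form)
            continue
        for rhs in RULES[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            # No rule shrinks the string, so anything too long can be dropped.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=len)

print(generate(5))  # ['abb', 'aabb', 'aaabb']
```

Raising max_len just adds longer strings of the same shape, which is a decent way to convince yourself what the language looks like.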
“So what?” you ask. Well, here’s the point. If you set up a grammar like this, it is fairly easy to construct an automaton (for context free grammars, a pushdown automaton) that can take a string and determine whether or not that string falls in the language. What this amounts to, then, is that if you can create a language, say a computer language, which can be built using these Context Free Grammar rules, it becomes relatively easy to write a translator for it, because all possible strings follow a certain set of rules.
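As a small illustration of how mechanical the checking can be, here is a hand-written recognizer for the example grammar, with one function per non-terminal. This is only a sketch: the function names are invented for this post, and the simple "grab as many a's as possible" approach works for this particular grammar but not for CFGs in general, where you would reach for a general algorithm such as CYK or Earley parsing.

```python
def parse_A(s, i):
    """A -> aA | a. Return the index just past the a's, or None on failure."""
    if i < len(s) and s[i] == "a":
        j = parse_A(s, i + 1)            # try A -> aA first
        return j if j is not None else i + 1  # otherwise fall back to A -> a
    return None

def parse_B(s, i):
    """B -> bb. Return the index just past "bb", or None on failure."""
    return i + 2 if s.startswith("bb", i) else None

def parse_S(s, i=0):
    """S -> AB. Return the index just past the match, or None on failure."""
    j = parse_A(s, i)
    if j is None:
        return None
    return parse_B(s, j)

def in_language(s):
    """True if the whole string s can be derived from S."""
    return parse_S(s) == len(s)

print(in_language("aaabb"))  # True
print(in_language("abbb"))   # False
```

Notice that each function is basically a direct transcription of one grammar rule, which is why languages designed around these rules are so much friendlier to compiler writers.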
Unfortunately, English does not have a context free grammar. English, in fact, is a very context sensitive language, which means it’s much harder to write a program to correctly interpret English statements. (After all, what do you think a computer should do with a sentence like “He hit him with his head”? Whose head?)
Well anyway, that’s the reason that my teacher gave for why it’s difficult to write natural language compilers. Hope it helps.