How does a computer understand a language?

To add a little theory to the discussion: the most important class of computer languages is the context-free languages, which are described by context-free grammars. Context-free languages can be processed by a pushdown automaton, also known as a stack machine, which is why these languages are sometimes called stack-based languages. Nearly all modern computer languages are context-free.
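To make “stack machine” concrete, here is a minimal sketch in Python of the classic stack trick on a context-free construct, balanced brackets, the kind of nesting a plain regular expression cannot check. (The bracket pairs and the function name are just my illustration.)

def balanced(s):
    """Recognize balanced brackets with an explicit stack,
    the way a pushdown automaton would."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)                       # push an opener
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                       # closer with no matching opener
    return not stack                               # everything opened was closed

print(balanced('([]{})'))   # True
print(balanced('([)]'))     # False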

Computer hardware can process stack-based languages efficiently, and the underlying computational theory (languages, grammars, regular expressions, etc.) is well understood.

It is possible to use other kinds of languages to control computers (the original FORTRAN, for example, was not a stack-based language), but the compilers are much more complex and there is no real benefit to doing so.

It’s too complicated to go into the details here, but in case anyone is interested, most books on the theory of computation or on compiler theory cover this topic very well; it is the mathematical basis on which computer science rests.

Chronos: While Adm. Hopper was instrumental in getting COBOL going, that was not the first language. I believe that FORTRAN was the first high-level programming language.
(Sorry, I could not find a handy reference.)

If you have K&R’s “The C Programming Language” and turn to the back, you will find what is called a “grammar” for the C language. This grammar shows you all the legal ways you can string keywords together to make statements. If your source code does not follow these rules, then the compiler (or translator) knows it can reject your code and not try any further. But what about English? You can form sentences in all kinds of different ways and have them make sense.
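Going back to the rejection point for a moment: you can watch it happen in any language with a published grammar. A quick sketch in Python (the same kind of check a C compiler performs against the grammar in the back of K&R):

import ast

ast.parse("x = 1 + 2")    # follows the grammar, parses fine
try:
    ast.parse("x = )")    # violates the grammar
except SyntaxError as err:
    print("rejected:", err.msg)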

In the reference I was trying to use to back up my assertion above, they have an interesting chart comparing Level and Effort for several languages. Lambda is the level and E is the effort to mentally write a program in that language:

Language    Lambda     E
English      2.16    1.00
PL/1         1.53    2.00
Algol-58     1.21    3.19
FORTRAN      1.14    3.59
Assembler    0.88    6.02

(I would like to add that the lower the Lambda, the easier it is for the translator to handle the language.)

For the Geeks, here are the formulas:
P = number of distinct operators appearing in the program
p = total number of operators appearing in the program
Q = number of distinct operands appearing in the program
q = total number of operands appearing in the program
N = total number of operators and operands appearing in the program (p + q)
V = Volume = N * log2(P + Q)
L = Level of abstraction used to write the program = (2/P)(Q/q)
Lambda = Level of the HLL used to write the program = L * (L * V) = L^2 * V
E = Mental effort to create the program = V/L
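As a rough illustration, here is how those formulas play out in Python on a toy two-statement program. (The tokenization and the operator/operand split here are my own assumptions, not the original study’s counting rules.)

from math import log2

# Toy counts for the two statements:  a = b + c ; d = b * c
operators = ['=', '+', ';', '=', '*']
operands = ['a', 'b', 'c', 'd', 'b', 'c']

P, p = len(set(operators)), len(operators)   # distinct / total operators
Q, q = len(set(operands)), len(operands)     # distinct / total operands
N = p + q                                    # program length
V = N * log2(P + Q)                          # Volume
L = (2 / P) * (Q / q)                        # Level of abstraction
lam = L * (L * V)                            # Lambda = L^2 * V
E = V / L                                    # mental Effort

print(f"V={V:.2f}  L={L:.3f}  Lambda={lam:.3f}  E={E:.1f}")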

Sorry, this is an old list; I do not know where C, C++, or Java would end up.

Thanks so much. That answer put things in a perspective that I didn’t even know existed. Very interesting. =)

FWIW, I believe that Moby Dick was used to generate the numbers for English in the chart. (Though I do not know how they defined “operator” and “operand”!)

The recent mention of language grammars caught my attention and triggered a memory from a class I took last semester in automata theory. Don’t worry, it’s relevant, and it may help explain why it’s very hard to make a straight English compiler.

So any sort of language needs a grammar, right? (OK, that’s not strictly true, but any language that can be readily understood needs one.) Well, the linguist Noam Chomsky developed an idea called context-free grammars (CFGs). Actually, I’m not sure if he originated them, but he did a lot of research in the field and brought about a standard form for such grammars, called Chomsky Normal Form, or CNF.

The general idea of a CFG is that you have two different kinds of symbols, terminals and non-terminals, plus a bunch of rules. An example CFG looks like this:

S -> AB
A -> aA | a
B -> bb

To use the grammar, you start with S and follow the rules, replacing any non-terminal symbol (the capital letters) with one of the right-hand sides of its rules. So I can replace S with AB, then replace the B with “bb” and the A with “a”, producing the string “abb”.

The set of all the strings that can be produced by these rules is called the language (L) of the grammar.
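If you want to see that language concretely, here is a small Python sketch that enumerates every string the grammar above can derive, up to a given length. (For this grammar the language works out to one or more a’s followed by bb.)

from collections import deque

# The example grammar above: uppercase letters are non-terminals,
# lowercase letters are terminals.
RULES = {'S': ['AB'], 'A': ['aA', 'a'], 'B': ['bb']}

def language(max_len):
    """All strings in the grammar's language up to max_len characters."""
    found, queue = set(), deque(['S'])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:      # no rule shrinks a form, so prune here
            continue
        nts = [i for i, ch in enumerate(form) if ch.isupper()]
        if not nts:                  # all terminals: a member of the language
            found.add(form)
            continue
        i = nts[0]                   # rewrite the leftmost non-terminal
        for rhs in RULES[form[i]]:
            queue.append(form[:i] + rhs + form[i + 1:])
    return sorted(found, key=len)

print(language(6))   # ['abb', 'aabb', 'aaabb', 'aaaabb']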

“So what?” you ask. Well, here’s the rub. If you set up a grammar like this, it is very easy to construct an automaton that can take a string and determine whether or not that string falls in the language. What this amounts to is that if you create a language, say a computer language, that can be built from context-free grammar rules, it becomes relatively easy to write a translator for it, because all possible strings follow a fixed set of rules.
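For the toy grammar above the membership test is almost trivial; here is a sketch. (This particular language happens to be so simple that no stack is even needed; a real parser generator builds the same kind of recognizer, stack included, straight from the grammar.)

def in_language(s):
    """Accept exactly the strings the example grammar generates:
    one or more a's followed by bb."""
    i = 0
    while i < len(s) and s[i] == 'a':   # consumes the A part: aA | a
        i += 1
    return i >= 1 and s[i:] == 'bb'     # then B must be exactly bb

for w in ['abb', 'aaabb', 'bb', 'abab']:
    print(w, in_language(w))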

Unfortunately, English does not have a context-free grammar. English is, in fact, a very context-sensitive language, which means it’s much harder to write a program to correctly interpret English statements. (After all, what do you think a computer should do with a sentence like “He hit him with his head”?)

Well, anyway, that’s the reason my teacher gave for why it’s difficult to write natural-language compilers. Hope it helps.