I worked on Merced (the first Itanium) for five wonderful quarters :dubious: so I figure I have a right to call it Itanic. I left when I saw how it was going and before I got too optioned in. Last year Intel and its few customers (HP and noise, basically) said they were going to throw in a few billion bucks to try to market the thing. What is going to happen during the current cleanup is anyone’s guess. Anyone wanting a nice perspective on this should search on The Register for Itanic.
I did a lot of my pre-dissertation work in BCPL, which is related, and which ran on Multics, but that was after I had already dissed C.
First, let’s clear up some terms: a language is not the same as a compiler or interpreter. You can design a programming language on paper without writing a single line of actual code.
An interpreter is a program that reads some code in a certain language and follows the instructions in it; a compiler is a program that also reads that same code, but instead of executing the instructions immediately, it translates them to another, lower level format to be executed later. Most compilers produce machine code for the same CPU they’re running on, but sometimes this involves translating to assembly code first and then running a separate assembler. Cross compilers produce machine code for a different CPU. Finally, compilers for some systems like Java and .NET produce instructions for a virtual CPU that doesn’t physically exist, but is emulated by a program.
Now, once you’ve designed the language, you can write a compiler for it. Of course, you run into that chicken and egg problem. But you can solve it by writing two versions of the compiler, one in some other language and one in your new language, and using the first one just long enough to get the second one to compile. (For those first compilers back in the dark ages, the “other language” had to be assembly or even machine code, but these days you have a lot more choices.)
As TimeWinder points out, the first version doesn’t even have to implement the entire language, only the subset that the second version actually uses. In fact, the first version doesn’t have to be a compiler at all: you can write an *interpreter* for your new language and have it interpret the code for your compiler, which can take its own source code as input to produce a compiled version. So in that sense, it’s possible that the first compiler for a language could be written in the language itself.
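Spelled out with made-up names (nothing here refers to a real toolchain), the usual bootstrap sequence looks something like this:

cc -o newlang0 newlang_boot.c         # throwaway compiler for the new language, written in C
./newlang0 newlang.nl -o newlang1     # use it to compile the real compiler, written in NewLang itself
./newlang1 newlang.nl -o newlang2     # the language now compiles itself; the C version can be retired

The second step only works if newlang.nl sticks to the subset newlang0 understands, which is exactly TimeWinder’s point; and if you wrote an interpreter instead of newlang0, that step becomes “interpret newlang.nl and hand it its own source as input.”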
Aha, you’re quite correct. There was some work on fiber-optic computing as well - not just for the interface, but for the actual computation. I saw a few presentations about it, but it didn’t go anywhere, and might have been mostly silicon anyway.
You should read some of the original papers from Backus et al. about Fortran. They were fanatically concerned about efficiency, not surprising when computing time was really expensive. I doubt they used current optimization techniques, but efficiency mattered a lot back then. Some of the structure of the original Fortran (like no free form input) made it easy to compile.
And just to add tools to the discussion, there are many compiler-compiler programs, the most famous back when I was doing it being lex and yacc. I believe there is one associated with Perl, and I used one when I took compilers back in grad school over 30 years ago.
lex and yacc are still used, but on the systems I like (Linux, mainly) they have been replaced with the functionally-equivalent flex and bison. I don’t know which Perl compiler-compiler you’re talking about, though.
To bring this back to a comprehensible level:
To be able to compile or interpret a language, the computer needs to know precisely what it looks like, or its syntax. It needs to know all of the rules that must be obeyed. Humans have developed a rather clever way to lay all of this out so specifically even a computer can follow it, without burying us poor cholesterol-brained monkeys in detail. That method is called context-free formal grammars, written down in a notation called Backus-Naur Form (BNF). There is a good description of BNF on Wikipedia. The grammar language yacc and bison accept is essentially BNF with some extensions bolted on, and nearly every modern language has its grammar written in an extended form of BNF as part of the creation of its compiler or interpreter.
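To make that concrete, here is a toy BNF grammar for simple arithmetic expressions (my own made-up example, not taken from any particular language’s spec):

<expression> ::= <term> | <expression> "+" <term> | <expression> "-" <term>
<term>       ::= <factor> | <term> "*" <factor> | <term> "/" <factor>
<factor>     ::= <number> | "(" <expression> ")"

Each rule says “the thing on the left can be built from the things on the right,” and the vertical bars separate alternatives. Those three lines already pin down operator precedence and parenthesization precisely enough for a program to check. (I have left <number> undefined; carving numbers and names out of the raw text is the lexer’s job, which comes up below.)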
(There is a whole mathematical field behind formal grammars which is way too complex for me to go into here. (Nobody wants a post the size of a 200-level CS text. My fingers would get worn to wee nubbins, anyway.) Suffice it to say, context-free grammars are only the tip of the iceberg, and they are hardly the most complex kind of formal grammar.)
Now we can discuss what yacc actually does: It takes a file written in that modified, extended version of BNF I mentioned and creates a program, called a parser, capable of parsing programs written according to the grammar defined by the input file.
So, how does lex fit into all this? (If you are feeling suicidal, check into Café Society. They have threads about people wearing their underpants on the outside of their clothing.) Well, every file is merely a bag of bytes. It would be possible for the parser to chew through that bag of bytes directly, but it wouldn’t be as efficient. It’s far more efficient to divide those bytes into words, or lexemes, and feed those to the parser. Recognizing and dividing lexemes is the job of a lexical analyzer, or a lexer. And lexers are the programs lex writes, again based on a high-level description of the lexemes (numbers like ‘12’ and ‘1.5e-2’, variable names like ‘pi’ and ‘ChunkyBacon’, and operators like ‘+’ and ‘%’) that compose a given language. (Of course, programmers can’t leave well enough alone. We like to call lexemes ‘tokens’ and lexers ‘tokenizers’.)
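For the curious, a lex (or flex) input file is mostly a list of regular expressions with a bit of C attached to each one. Here is a minimal made-up sketch that just prints each token it finds instead of handing it to a parser:

%{
#include <stdio.h>
%}
%%
[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?    { printf("NUMBER: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*                 { printf("NAME:   %s\n", yytext); }
[-+*/%=()]                             { printf("OP:     %s\n", yytext); }
[ \t\n]+                               { /* skip whitespace */ }
.                                      { printf("WHAT?   %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { return yylex(); }

Run that through flex and a C compiler and you have a tokenizer for exactly the sorts of lexemes I listed above; in a real compiler the actions would return token codes to the parser instead of printing.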
So, to recap: The file goes into the lexer, which divides it into lexemes, which get fed to the parser, which decides whether the program is bogus and, if not, what kind of statements it is composed of. The parsed statements go to an optimizer (in most modern compilers), which massages them into something that should be more efficient to run. Those optimized statements then get fed to a back end, which turns them into either machine code (C usually, sometimes Java), byte code (Java and Python usually, sometimes C), statements in another language (RATFOR, nothing else I can think of), or actions (Perl, Python, sometimes even C).
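Just to tie those stages together, here is roughly what one little made-up statement looks like at each step (the ‘byte code’ at the bottom is invented for illustration, not any real instruction set):

source:    area = width * height + 2
lexemes:   NAME(area)  '='  NAME(width)  '*'  NAME(height)  '+'  NUMBER(2)
parsed:    an assignment: area gets the value of (width * height) + 2
back end:  PUSH width; PUSH height; MUL; PUSH 2; ADD; STORE area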
(The difference between machine code and byte code is that machine code gets run by a machine, whereas byte code gets run by a program. Byte code can get converted to machine code by building the appropriate machine, at least in principle.)
I’ll look it up when I get back to work. I discovered it when I was considering adding a little mini-language to a project I was working on, then figured out it would be easier to write it myself, then figured out it would be even easier not to give it general programming ability at all.
BNF is also called Backus Normal Form at times, though I prefer it your way.
To make it even easier, in yacc anyhow, when the parser recognizes a set of lexemes, a chunk of code is executed to do the right thing. For instance, when a <number> + <number> is recognized (the actual grammar is more complicated, allowing an expression where I have a number), it can emit the machine language for an addition, or a call to an interpretive routine to do the addition, or even code for a subtraction if the language is really bizarre. The lexical analysis piece is not hard for these things, and the grammar is usually fairly simple, but the code to do the work is the hard part.
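A toy sketch of what I mean (entirely made up, and leaving out everything a real compiler would need): a yacc grammar that ‘compiles’ additions and subtractions for an imaginary stack machine, emitting the code right in the rule actions.

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
%}
%token NUMBER
%left '+' '-'
%%
input : expr '\n'        { printf("PRINT\n"); }
      ;
expr  : expr '+' expr    { printf("ADD\n"); }         /* emit code right in the action */
      | expr '-' expr    { printf("SUB\n"); }
      | NUMBER           { printf("PUSH %d\n", $1); }
      ;
%%
/* A tiny hand-rolled lexer, just enough to feed the parser. */
int yylex(void) {
    int c = getchar();
    while (c == ' ' || c == '\t') c = getchar();
    if (isdigit(c)) {
        yylval = 0;
        while (isdigit(c)) { yylval = yylval * 10 + (c - '0'); c = getchar(); }
        ungetc(c, stdin);
        return NUMBER;
    }
    return c == EOF ? 0 : c;
}
int main(void) { return yyparse(); }

Feed it 2 + 3 (and an end-of-file) and out comes PUSH 2, PUSH 3, ADD, PRINT - postfix order, which is exactly what a stack machine wants.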
Typically I don’t think you’d emit the machine code from a yacc rule’s action, unless you’re not planning to do any optimization and you’re compiling for a very simple processor, such as a stack machine (where you just have to emit the code for the first expression, then the code for the second expression, then an add operation).
Instead, you’d build a tree node which contains the two expressions and the addition operator, then pass it up the line to the expression or statement that contains this addition expression, building a parse tree out of the entire source code (or some subset, like the current function), and emit the machine code once the whole tree is built. That way you can optimize the tree, assign variables to registers or memory based on their usage throughout the entire function, or emit the same operation differently based on the context in which it appears.
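A fragment of what that might look like in the yacc rule itself (mknode and mkleaf are invented helper functions, and the declarations, lexer, and the tree-walking code generator that runs afterward are all omitted):

expr : expr '+' expr    { $$ = mknode('+', $1, $3); }   /* build a node, emit nothing yet */
     | NUMBER           { $$ = mkleaf($1); }
     ;

Once the whole function has been parsed you have the entire tree in hand, and the optimizer and code generator can look at as much context as they like before deciding which instructions to emit.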
Donald Knuth does as well. He’s the one who changed it, or one of the main ones, in part by writing a letter to the editor (This link goes to a webpage, but the letter itself is a 310k PDF download. Despite all appearances, it is a free download.) of CACM*.
*(Communications of the Association for Computing Machinery, a wonderful publication that has focused on both theory and practice, hardware and software, relative to Computer Science since the 1950s.)
You know, I’ve even used p2c. :smack: (Meh. If that’s my only slip, I’m very lucky.)
You know, I do believe we’ve completely scared off FlyingRamenMonster.
This thread would be better named, “How does one write a compiler or interpreter?”
When I read the title, I thought it would be about how one designs a new language; i.e., syntax, semantics, type system, etc.
Anyone care to address that? My knowledge of the theory of programming languages is rather low; I’ve mostly tried to pick up things in bits and pieces.
You pick bits and pieces from other languages you like, or you start with a theory (the lambda calculus for Lisp, Horn clauses for Prolog) and add enough extra features to make it practical to write programs in.
There are multiple families of languages, some with many and varied branches and some with just one main example. Fortran, for example, spawned a lot of simple line-oriented languages such as Basic (which has mutated in other directions recently) and FOCAL (which never got the chance to). Families spring up around successful ‘founder’ languages, such as Lisp and Fortran and Algol and C and Forth, when people take the fundamental ideas of those languages and modify them to grapple with new problems. For example, Forth is a stack-oriented language that’s especially good at providing a flexible development environment for very small systems. When that very small system is a printer and you want a language to create good-looking documents, you wind up with a language in the Forth family called Postscript. That basic pattern gets repeated over and over again for decades, with echoes of successful languages turning up in new places.
For the past few decades, the trend has been to put features first developed in Lisp (such as garbage collection, free-form input, first-class functions, and control flow more advanced than GOTO) into a Fortran-derived syntax* to come up with languages like Algol and C and Ruby. Every generation we get closer to actually programming in Lisp, a language invented in the 1950s.
*(Any language that differentiates between statements and expressions has a Fortran-derived syntax.)
Languages don’t evolve as quickly as the computers they control do, because languages are notation and therefore intimately related to human preferences and limitations.
I know very little about how computers work (and to be honest I’m not that interested), but I wanted to thank everyone in this thread. I’m currently on a kick of reading books about computers and the Internet. I just finished Hackers by Stephen Levy, The Soul of a New Machine by Tracy Kidder, Cuckoo’s Egg by Cliff Stoll (all of which I’d read and enjoyed before), and Where Wizards Stay Up Late (fascinating). I’m about a quarter of the way through Dealers of Lightning (about Xerox PARC; they’ve just built MAXC), and I have Fire In The Valley waiting. It’s fascinating to me, but I understand very little about the mechanics of it all. When Levy talks about the hackers staying up all night “coding” I have no idea what’s happening when they do that. All I know for sure is that I deeply, deeply respect these people. You people.
Thank you for my computer/browser/all the programs I use. Thank you for the Internet!
Equipoise: Programmers do for love what most people wouldn’t do for money.
Oh, and about Hackers: rms wasn’t the last hacker, as Levy claims. There are hackers around the world working on Linux, the Internet, the BSDs, Google, and so on, and so forth. The mass media’s appropriation of the word for its own ends hasn’t destroyed its usage among people who appreciate the original sense.
I’ve been obsessed about hobbies, so I could relate in a very tiny way to focusing in on one task while shutting everything else out. I love reading about this kind of devotion, especially knowing what all of this led to. Just the sheer amount of brainpower is dazzling to read about. I’m in a constant state of goosebump at the smarts on display in these books. Smart guys are sexy! (I know, I’m married to one) My only regret is that there weren’t a lot of women in their ranks. Ah well.
Part of what is so fascinating about it is that these folks were pioneers, but at the same time every discovery, every invention, every bit of coding built upon what came before. I need a book that starts with Babbage and works up to the people who built the computers of the 40’s, 50’s and early 60’s.
I know the book is very out-of-date. I wish he would write another one about the next 20+ years (didn’t the book end in the mid-80’s?). All the people you mention…their stories are interesting to me too.
Yep, I’ve always tried to correct people when they use the term in a negative way.
I said it was over-simplified. I’ve never used lex and yacc to emit machine code, but rather to do things like translations of inputs which could be conveniently expressed as a grammar. Things like that, and interpreters, often do something directly without going through the optimization phase you described.
In the '70s, when I was in grad school, nearly every issue of SIGPLAN (ACM Special Interest Group on Programming Languages) would describe a new language. I don’t get it anymore, but from reading Computer I get the impression that language design is no longer in. Does anyone who still gets it know?
A lot of the languages were designed for better structured programming, many were functional, and some were early object-oriented languages (after Simula and before C++). The one I designed was meant to support writing portable microcode. I designed it by taking Pascal as a base, throwing away features that didn’t make sense (like sets and pointers) and adding features I needed, like better bit manipulation and directives saying that a particular data structure or subroutine was bound to a resource in the target machine.
I understand why people designed languages - it’s kind of fun.