How do you write a programming language interpreter in its own language?

PyPy is a Python interpreter that is written entirely in Python. How would one go about writing an interpreter in its own language? Or is it no different from writing an interpreter in a different language (such as how the traditional CPython interpreter is written in C)?

A compiler or interpreter written in its own language is said to be “self-hosted.” Probably the most widely used example of a self-hosted compiler is gcc, which is written mostly in C.

The process for building a self-hosting compiler necessarily involves some bootstrapping. Either you use another compiler for a similar language as the first step, or you build a minimally-functioning version of your compiler in another programming language, then use that to write the self-hosting version.
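The staging involved can be illustrated with a deliberately trivial toy. Here the "compiler" is nothing but a wrapper around Python's built-in compile() (my own contrivance, not how any real compiler bootstraps), but the mechanics are the same: stage 0 is built by other means, it compiles its own source, and the second-generation compiler should behave identically.

```python
import types

# A toy bootstrap: the "compiler" just wraps Python's built-in compile().
# Stage 0 is built "by other means" (the host Python); it then compiles
# its own source to produce a second-generation compiler.
SOURCE = '''
def compile_program(src):
    """Compile source text to a runnable code object."""
    return compile(src, "<toy>", "exec")
'''

stage1 = {}
exec(compile(SOURCE, "<stage0>", "exec"), stage1)   # stage 0: built by the host
stage2 = {}
exec(stage1["compile_program"](SOURCE), stage2)     # stage 1: compiler compiles itself

# The second-generation compiler works just like the first.
assert isinstance(stage2["compile_program"]("print('ok')"), types.CodeType)
```

Real bootstraps add a third stage: compile the compiler a second time with the stage-1 output and check that stages 1 and 2 produce identical binaries, which is the usual sanity check that the self-hosted compiler is a fixed point.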

There’s a sentiment among computer programmers that no compiled language is really “real” until it’s been used to produce its own compiler. It’s probably not a bad benchmark, actually: If an empty suit comes up with some supposedly great new language, a compiler is probably a big enough task that he won’t be able to do it himself, and if real programmers like the language well enough, then they know all they need to be able to make the compiler.

Well, only if the language is intended for the domain of general programming. Many languages are designed for specific problem domains, and it would be silly to try to write a compiler in them.

COBOL, for example: designed for high-volume business data transactions. Or ForTran, designed for mathematical work. There’s just no point in trying to write a compiler in either of those languages. There are other tools better for the purpose.

Fortran was certainly intended as a general-purpose language, and at the time it came out, there definitely wasn’t any better tool for the job. Given the choice of writing a compiler in Fortran or writing it in Assembly, I’d choose Fortran every time.

Of course, nowadays, most Fortran compilers are modified C compilers, and so are presumably written in C.

Corrected typo in thread title.

Colibri
General Questions Moderator

(First, who in the world spells it ForTran? It’s FORTRAN (for the older variants, apparently up through FORTRAN 77) or Fortran (for Fortran 90 and later).)

There was at least one COBOL compiler written in COBOL for the ICL 1900 series mainframe computers. Compared to that, a FORTRAN compiler written in FORTRAN is a lot less surprising, especially if FORTRAN was the only non-assembly language supported on the system. I mean, if people are willing to write compilers in C… (They are, for the record: Most modern C compilers are written in C and have been for decades.)

Finally, I doubt FORTRAN was originally intended to be a domain-specific language even to the extent COBOL was: It was used, at least, as a general-purpose language to roughly the extent C is, even though you can also argue C is a domain-specific language for OS design and device driver development. They’re both terrible languages for compiler construction but that never seems to stop people.

Here’s an example of a design for a FORTRAN (dialect) compiler written in FORTRAN.

Oh, and a point we’ve all been glossing over in the OP: The OP is asking about interpreted languages, not compiled. While one could write an interpreter for a language in that language itself, it would almost never make sense to do so, unless the language has both compiled and interpreted versions. You’d just be adding layers unnecessarily.

I presume (though it’s not clear from the article) that PyPy is produced using a compiled form of Python. Otherwise there’d be no point, since you’d still need another interpreter anyway, to run PyPy on.

Having the language (at least mostly) implemented in itself can be useful in its own right, since it potentially allows language extensions to be written by the people who are most familiar with the language.

Probably the prototypical language that explicitly allowed for an interpreter written in itself was Lisp. You can write a completely trivial interpreter or compiler* in a few lines of code in most Lisp variants. It’s so easy it’s probably cheating, but the language also makes it pretty easy to expand on that trivial first step because the parser (reader) is built into the runtime, the output of the parser is trivially predictable, and in old-school Lisps, only a handful of constructs** are “built in” and everything else is already implemented as higher-level Lisp functions and macros.

* Many modern Lisps implement (eval) as a compilation / link / call action but older versions of Common Lisp typically have both a compiler and an interpreter available at run-time.

** Here’s a Lisp (Scheme) interpreter in Python. It’s about two screens of code and implements 9 “special forms” with which you can implement the rest of the language in itself as a library.
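To make the “few lines of code” claim concrete, here is a toy of my own (not the interpreter linked above, and with only four special forms rather than nine): nested Python lists stand in for S-expressions, strings for symbols, and a dict for the environment.

```python
import operator

# A minimal Lisp-style evaluator: nested Python lists are the code,
# strings are symbols, and a dict is the environment.
def evaluate(expr, env):
    if isinstance(expr, str):                 # symbol -> variable lookup
        return env[expr]
    if not isinstance(expr, list):            # number or other literal
        return expr
    op, *args = expr
    if op == "quote":                         # (quote x) -> x, unevaluated
        return args[0]
    if op == "if":                            # (if test then else)
        test, then, alt = args
        return evaluate(then if evaluate(test, env) else alt, env)
    if op == "define":                        # (define name value)
        name, value = args
        env[name] = evaluate(value, env)
        return env[name]
    if op == "lambda":                        # (lambda (params) body)
        params, body = args
        return lambda *vals: evaluate(body, {**env, **dict(zip(params, vals))})
    fn = evaluate(op, env)                    # otherwise: function application
    return fn(*(evaluate(a, env) for a in args))

global_env = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "<": operator.lt}

print(evaluate([["lambda", ["n"], ["*", "n", "n"]], 7], global_env))  # -> 49
```

The parsing step is skipped entirely, which is exactly the point made above: in a Lisp the reader hands you this tree structure for free, so the evaluator is nearly the whole interpreter.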

Heck, round the circle: you can write a C compiler in Python, then compile CPython using it. Why not?

Any Turing-complete language can complete any programming task. So, yes, it’s possible to write a Python interpreter in Python. Whether it’s practical is another question.
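As a toy demonstration of the "possible" half (this is nothing like PyPy's actual approach), Python's standard ast module makes a tiny Python-expression interpreter written in Python only a few lines long:

```python
import ast
import operator

# Walk a parsed Python expression tree and evaluate it ourselves,
# instead of handing it to the built-in eval().
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def tiny_eval(source):
    tree = ast.parse(source, mode="eval")   # reuse Python's own parser
    return walk(tree.body)

def walk(node):
    if isinstance(node, ast.Constant):      # literals: 2, 3.5, ...
        return node.value
    if isinstance(node, ast.BinOp):         # binary ops: a + b, a * b, ...
        return OPS[type(node.op)](walk(node.left), walk(node.right))
    raise NotImplementedError(type(node).__name__)

print(tiny_eval("2 + 3 * 4"))   # -> 14
```

The "practical" half is where it gets hard: extending this to the whole language, fast, is exactly the engineering problem PyPy exists to solve.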

I remember doing something like this in college. We started with 5 defined functions and rebuilt the entire LISP language using only those functions or functions derived from those 5. It was a fun class actually.
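That exercise is classically done with McCarthy's five elementary functions (atom, eq, car, cdr, cons). A sketch of the idea in Python, assuming cons cells are modeled as tuples and nil as None, with everything past the first five definitions derived only from those primitives:

```python
NIL = None   # the empty list

# The five classic LISP primitives, modeled on Python tuples.
def atom(x):    return not isinstance(x, tuple)   # anything that isn't a pair
def eq(a, b):   return atom(a) and atom(b) and a == b
def car(p):     return p[0]
def cdr(p):     return p[1]
def cons(a, d): return (a, d)

# Everything below is derived purely from the primitives above.
def null(x):
    return eq(x, NIL)

def append(xs, ys):
    if null(xs):
        return ys
    return cons(car(xs), append(cdr(xs), ys))

def lst(*items):   # convenience: build (1, (2, (3, None))) from 1, 2, 3
    out = NIL
    for x in reversed(items):
        out = cons(x, out)
    return out

print(append(lst(1, 2), lst(3)))   # -> (1, (2, (3, None)))
```

From this starting point you keep layering: assoc, an environment, and eventually eval/apply themselves, which is presumably roughly the arc of that class.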

Back in the golden days when I was a “C” programmer and APL was taking over the world, the APL guys and I would often get into pointless arguments about which was the better language. I’d use the above argument, point out that most/all “C” compilers were written in “C” and challenge the APL gurus to find out what language their favorite APL interpreter was written in. The usual finding was that it was written in “C”.

I was convinced that proved my language was superior. The APL guys weren’t.

(As I understood things at the time, whenever computer hardware types designed a new CPU one of the first things the software guys did was port a “C” compiler onto it. They’d take an existing “C” compiler that ran on another CPU and change its final machine-code-generating step so that it produced machine code for the new CPU. They’d then run the source code for that already-existing “C” compiler (which was, of course, written in “C”) through the modified version so that it produced a “C” compiler executable that would run on the new CPU. Ta-Da!! - the new CPU now had a working “C” compiler for it. Then the same folks would grab a copy of the UNIX source code, which was of course written in “C” and… Ta-Da!!)

((Nowadays they probably do the same sort of thing with Java or whatever the heck youngsters use these days. Bah, Humbug))

Why do you put C in quotes there? It isn’t common usage, you can’t argue it’s correct usage, yet every so often I see it and now I’m curious why.

I don’t know that Java is all that well-suited to that sort of thing, since it compiles for a virtual machine, not the underlying hardware. You could write a C compiler in Java that would compile C code to Java virtual machine code, and you could custom-make a compiler that compiled to some other code, but you couldn’t just plug the source code for a standard C compiler into a standard written-in-Java C compiler and get underlying machine code out of it.

And you’d need to get a real C-to-underlying-machine compiler at some point, since Unix is all written in some flavor of C or another, and nobody’s going to go through all that code and translate it all to Java.

Just make sure you don’t use any C compiler written by Ken Thompson, or one compiled on one written by him, or compiled on one compiled by one written by him, or…

The different memory models would make this nontrivial to say the least.

Good point. Beats me why I did it. I don’t do it around my techie friends, and I probably shouldn’t have done it here either.

Yeah, but even you young whippersnappers can’t avoid it forever. Sooner or later you’re going to have to get down to the underlying hardware. (Or as my generation would say it “sooner or later you’ve got to scrape the metal”.)

But the job was mathematical work – FORmula TRANslation. That’s why they gave the language that name. Early FORTRAN compilers were indeed written in assembly language. Just like everything else! All programmers were assembly programmers at that time. And many of them didn’t believe that higher-level languages would ever work, or ever be efficient enough to be workable.

Yes, but at the time, all programming was regarded as being fundamentally mathematical (it still is, of course, but it’s not generally regarded that way any more). And the first FORTRAN compilers had to be written in assembly, since FORTRAN was close to the first compiled language. But I’ll bet that the next generation of compilers were written in FORTRAN.

Quoth sevenwood:

Yes, but Java programmers scrape it in a different way than C programmers. Whenever a new architecture comes out, the C programmers create a new compiler for it, while the Java programmers keep their same compilers but implement a new virtual machine for it. I don’t know what tools are used for this, though-- I wouldn’t be surprised if they just have a Java virtual machine written in C, and wait for the C guys to make the new compiler, and then compile their virtual machine on it.

And, of course, not everyone needs to scrape the metal. The whole point in higher-level languages is to allow deeper specialization: The guy who can create a really efficient JVM doesn’t need to be the same guy as the guy who can write really great applications that run on top of it.

We in the CS land have a name for this: meta-circular evaluators, and they are fun. Here’s an explanation of why you might want to do such a thing, aimed at an audience technical enough to understand what an evaluator is in this context. Of course SICP has a section on this, implementing Scheme in Scheme. (Which is fitting: The first Scheme system was implemented in NIL, I believe, a Lisp dialect that is not itself Scheme. Picking NIL as a foundation saved the creators of Scheme a lot of work.)