Implementing Lisp Discussion

Moved from ImplementingLisp.


One of the most fundamental implementation issues is that of encoding a type in each cell, so that each member of a pair is self-describing. Typically a few bits (usually 1 to 4) are devoted to the type tag, so that e.g. 00 means small integer, 01 means pointer, and so on. Floating point numbers, strings, and other non-primitive value types are often "boxed": represented by a pointer into a special allocation area, so that the pointer itself encodes the type (e.g. a float pointer satisfies LOW_FLOAT_ADDRESS <= p <= HIGH_FLOAT_ADDRESS).
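
For illustration, here is a minimal C sketch of a low-tag scheme on a byte-addressed machine (not taken from any particular implementation; the tag assignments, names, and macros are invented for the example, and FIXNUM_VALUE assumes an arithmetic right shift):

  /* Pointers to 4-byte-aligned objects always have 00 in their low two
     bits, so those bits are free to hold a type tag. */
  typedef unsigned long lispval;
  struct pair { lispval car, cdr; };
  enum { TAG_FIXNUM = 0, TAG_PAIR = 1, TAG_SYMBOL = 2, TAG_BOXED = 3 };
  #define TAG_BITS        2
  #define TAG_MASK        3UL
  #define TAG_OF(v)       ((v) & TAG_MASK)
  #define MAKE_FIXNUM(n)  (((lispval)(n) << TAG_BITS) | TAG_FIXNUM)
  #define FIXNUM_VALUE(v) ((long)(v) >> TAG_BITS)   /* arithmetic shift */
  #define MAKE_PAIR(p)    ((lispval)(p) | TAG_PAIR)
  #define PAIR_PTR(v)     ((struct pair *)((v) & ~TAG_MASK))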

The exact encoding method used is typically architecture-dependent, since it must deal with little-endian versus big-endian layouts, whether the low bits of pointers are guaranteed to be zero (they aren't free for tagging on word-addressed machines), etc., and it requires some creativity.

BoxingConversions are elegant from one point of view but lead to efficiency concerns with floating-point intensive code.

Also see DavidGudeman?'s review paper "Representing Type Information in Dynamically Typed Languages" (ftp://ftp.cs.indiana.edu/pub/scheme-repository/doc/pubs/typeinfo.ps.gz).


It is interesting, and of non-trivial importance, that Lisp's original implementation, and its most popular 1970s implementation on DEC PDP-10s, ran on machines in which exactly two pointers fit in each memory word (the car and cdr fields sharing a single 36-bit word).

Lisp these days works quite well on machines where 1 pointer fits in each memory word, of course, but this is less of a perfect fit.

CeeLanguage programmers, by contrast, have always tended to assume that sizeof(pointer) == sizeof(int) == sizeof(word in RAM); that assumption was never forced by C, but many kinds of programming are more awkward on machines where it doesn't hold, and staying portable to them requires more careful C programming.


The implementation of lambda has varied wildly among Lisps over the decades. The simpler approaches represent it the same way any function is represented, the only difference being whether some name refers to it or not. It is possible to implement it as a sexpr that exactly reflects the way it was specified. Implementations stray from that due to issues concerning the binding of the named parameters to activation-record values, amongst other concerns.

CactusStacks tend to be the natural choice for implementing LexicalScoping, if nothing else.


There are many resources about how to implement Lisp. I recommend reading ChristianQueinnec's book 'LispInSmallPieces'. http://www-spi.lip6.fr/~queinnec/WWW/LiSP.html

Also: 'An Introduction to Scheme and its Implementation': ftp://ftp.cs.utexas.edu/pub/garbage/cs345/schintro-v14/schintro_toc.html

I highly recommend the out-of-print classic "AnatomyOfLisp?", 1979, JohnAllen?. I can't say exactly how far down in a list of must-read books it should be placed; there are many I haven't read. But everyone seems to still agree that it is unique. -- dm

'Anatomy of Lisp was once a classic. Now it is mostly outdated. It describes an ancient dialect of Lisp (with dynamic binding), and modern implementation techniques are, to say the least, a bit different.'

That's not accurate. It is certainly outdated on some topics, so it definitely is not a modern "how to implement lisp" book. However, it is not "mostly" outdated; Lisp's fundamental nature has not changed. For instance, studying dynamic scoping techniques is a positive thing, not a negative, despite the modern emphasis on lexical scoping.

'Actually, a lot of Lisp's fundamental nature has changed since the time Anatomy of Lisp was written. Many of the modern Lisp trends started with newer Common Lisp systems, and before that with developments in the Scheme world.'

'You would want to study continuation-based implementations, type inference, lexical scoping, partial evaluation, advanced garbage collectors, interfaces to the OS, and more. There are lots of smaller Lisp systems to study, plus a system like SBCL, which comes with full source and has an advanced compiler with multiple backends.'

Another thing to consider: why would anyone need any such book in the first place? There are lots of implementations of Lisp already that are better than anything a single individual is likely to be able to recreate. Answer: for pedagogical reasons, or just for fun, or for a special purpose little language -- and for all such minority purposes, the old approach to Lisp is in general much easier to implement than modern approaches (e.g. dynamic scoping vs static). And yet many of the old techniques are getting to be forgotten and hard to find.

Also, I am not aware of any up-to-date book that serves as an outright replacement for Anatomy. Please let me know if you do know of one, because of course I want to read it.

As far as I know, the list of must-read foundational Lisp books is all too short to start with.

And as Jonathan says below about JohnMcCarthy's original paper, Anatomy is good for historical context even in the areas where it is outdated. It is still a classic.

It irritates me no end that people lose interest in everything but the cutting edge; there is tremendous value in history. "Those who cannot remember the past are condemned to repeat it." (George Santayana)

-- DougMerritt

Yes, but we'll repeat it much more efficiently, because we have TailCallOptimization. -- GeorgePaci

Did you see the references above? The modern replacement is 'LispInSmallPieces'. Anatomy of Lisp is hard to get (a reference to an out-of-print book gives the potential reader little benefit -- I would really advise reading something that's either online or buyable, with added material like source code to study -- see LispInSmallPieces), hard to read (yes, I have a copy), and outdated. Sure, you can study old (very old) Lisp implementation techniques, but the newer books and online resources are preferable. As a first reference I would not give Anatomy of Lisp. Maybe to get a historical perspective. But that's about it.

Considering I said "I can't say exactly how far down in a list of must-read books it should be placed", I don't see why there should be any objections whatsoever. I also said "I am not aware of any up-to-date book that serves as an outright replacement", and I haven't heard that contradicted, since I think I was clear that the older techniques would ideally be covered as well as matters of newer importance.

The claim that lexical scoping is more difficult to implement, or that the difference matters, is a myth. Proof: tons of small Scheme implementations.

I really don't think it's a myth, having personally implemented both kinds. Dynamic scoping is trivial. Lexical scoping at one point was actually considered impossible/impractical in Lisp, did you know that? Given all the research and implementations since then, it is now no longer anywhere near as difficult...but it remains noticeably more complex. Hugely complex? No, nor did I say that, so "tons of small Scheme implementations" doesn't "prove" anything one way or the other.
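
For instance, the entire variable-binding machinery of a deep-binding, dynamically scoped interpreter fits in a dozen lines of C (a deliberately naive sketch; every name here is invented for the example):

  #include <stdlib.h>
  #include <string.h>
  /* Deep binding: lookup walks a global stack of name/value pairs.
     Entering a LET pushes a binding; leaving it pops.  No closures. */
  struct binding { const char *name; int value; struct binding *next; };
  static struct binding *dynenv = NULL;      /* top of the binding stack */
  void bind(struct binding *b) { b->next = dynenv; dynenv = b; }
  void unbind(void)            { dynenv = dynenv->next; }
  int lookup(const char *name) {
      struct binding *b;
      for (b = dynenv; b != NULL; b = b->next)
          if (strcmp(b->name, name) == 0) return b->value;
      abort();                               /* unbound variable */
  }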

And it is quite common to hear elisp fans say that its dynamic scoping just isn't the big issue in practice that other Lisp folks say that it is -- I prefer lexical scoping, don't get me wrong, but dynamic scoping isn't a dead issue. Like I said, it is interesting for pedagogical implementations, for especially tiny languages, etc.

I'm sorry that you're disappointed in it, and I apologize if my own previous recommendation gave a misleading impression, but just because it isn't what you personally were hoping for doesn't mean that it isn't a classic.

I do agree that parts of it are difficult to read.

P.S. as for out of print, I did attempt to do my part on that: I have talked to both JohnMcCarthy and to John Allen in recent years about getting a revised second edition back in print. Doesn't sound like it's going to happen. :-)

-- DougMerritt

It occurs to me that, perhaps Anatomy would be a much less controversial suggestion if it were in a separate list of historically significant classics that any really serious implementor should eventually read, along with JohnMcCarthy's 1960 paper and the Lambda the Ultimate series? But if even that doesn't satisfy, then sorry, I'll just have to disagree. -- dm

'No, that's correct. Anatomy of Lisp is a classic. The Lambda the Ultimate series is also worthwhile to read. Other books would be 'The Architecture of Symbolic Computers' (Kogge), 'Topics in Advanced Language Implementation' (Peter Lee), 'The Lambda Calculus' ;-), the various conference proceedings of Lisp and Functional Programming, 'Smalltalk-80: The Language and its Implementation', 'The Art of the Metaobject Protocol', various papers on macros, 'Efficient Method Dispatch in PCL', 'Rabbit: A Compiler for Scheme', 'Orbit: An Optimizing Compiler for Scheme' (PhD thesis), 'Internal Design of CMU Common Lisp' (technical report), the Lisp Machine manual, a multitude of papers on GC, and so on... Here are some more pointers: http://library.readscheme.org/page8.html'

Absolutely... although, as page 8 of a series of important references, this is something of an embarrassment of riches, and I may never catch up on reading all of them. I doubt if I've read even 20 of the papers listed on those pages.

You've given a nice list, but I wonder if it could be separated and prioritized, especially now that we're distinguishing between historical and modern, core/critical versus "it would be nice", etc.

Another classic is "The Semantics of Destructive Lisp", although this is very focused/specialized, and not of interest to everyone by any means.

There's also the 1996 book that surveys the subject of garbage collection, "Garbage Collection -- Algorithms for Automatic Dynamic Memory Management", by Richard Jones and Rafael Lins (see GarbageCollectionBook). Although research papers continue to be written about GC, I'm not aware of any other book of this sort, so I would consider it a definite must-read for a serious implementor.

Contrary to some claims, it really isn't the case that there is just one single GC algorithm that is always best. It depends on the system (e.g. expected average free memory) and the goals.

"Advanced Programming Language Design" by Raphael A. Finkel is, unfortunately, apparently a one-of-a-kind ('though not Lisp-specific), and usually well-thought of by people designing new languages.

-- DougMerritt


CMUCL, SBCL, and CLISP are all open-source implementations. CLISP is bytecode-interpreted. SBCL is supposed to be an easier-to-read-and-maintain version of CMUCL. The CormanLisp? compiler also comes with source that you can browse through, though it's not open source by the FreeSoftware definition.

Also see JohnMcCarthy's original paper (http://www-formal.stanford.edu/jmc/recursive.html). This shows how the original 1959 Lisp was implemented. It's probably not at all relevant to current techniques, but it's good for historical context.

I'd also like more information on this, as I'm close to beginning implementation of a Lisp-like language that has many of the same challenges as Lisp. Here are some things I've learned from various papers (Orbit, LispMachine), newsgroup postings (Allegro), and implementations (GwydionDylan, Chicken Scheme, Gambit Scheme, and CormanLisp?).

Parsing is as simple as you'd imagine: top-down recursive descent. Call READ recursively whenever you see an open paren, and read symbols otherwise. CommonLisp also allows reader macros, which let you execute a bit of code whenever you see a certain character; that's how ' and #' are implemented.
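
The core of such a reader is only a few lines of C (a sketch; cons, NIL, read_symbol, and skip_whitespace are assumed to come from the rest of the implementation, and lispval is as in the tagging sketch near the top of the page):

  #include <stdio.h>
  typedef unsigned long lispval;             /* as in the tagging sketch */
  /* Assumed from the rest of the implementation: */
  extern lispval NIL;
  lispval cons(lispval car, lispval cdr);
  lispval read_symbol(FILE *in);             /* reads one token */
  int skip_whitespace(FILE *in);             /* returns next non-blank char */
  lispval read_list(FILE *in);               /* forward declaration */
  /* '(' starts a list, ')' ends one, anything else is a symbol. */
  lispval read_expr(FILE *in) {
      int c = skip_whitespace(in);
      if (c == '(') return read_list(in);
      ungetc(c, in);
      return read_symbol(in);
  }
  lispval read_list(FILE *in) {
      int c = skip_whitespace(in);
      if (c == ')') return NIL;
      ungetc(c, in);
      lispval head = read_expr(in);          /* car: the next expression */
      return cons(head, read_list(in));      /* cdr: the rest of the list */
  }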

I assume you're asking about LexicalClosures when you say lambdas, because a simple lambda is just a function pointer to a compiler-named function. A closure is a data structure that contains both the function pointer and a pointer to an environment. The environment doesn't always have to be implemented on the heap, CactusStack-style. Highly-optimizing compilers (Orbit comes to mind) will try to statically analyze the closure to determine its extent, and if it doesn't escape, they'll just use a pointer to the base of the enclosing stack frame as the environment. This is the same mechanism (the static link) that Algol and Pascal use, and it appears in most compiler textbooks.
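
In C terms the data structure is nothing more exotic than this (a sketch, assuming the lispval type from above; the linked-frame layout is one common representation among several):

  /* A closure pairs code with the environment it was created in. */
  struct frame {
      struct frame *parent;    /* enclosing lexical frame (cactus stack) */
      lispval slots[1];        /* this frame's variables, allocated to fit */
  };
  struct closure {
      lispval (*code)(struct closure *self, lispval *args, int nargs);
      struct frame *env;       /* captured environment, heap or stack */
  };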

If the closure does escape, the compiler often doesn't need to store a full environment frame on the heap. Instead, it can analyze the free variables in the closure, and just store a data structure containing them. The enclosing function will also need to point into this structure (instead of allocating the variables on its own stack frame).
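
The resulting record is sometimes called a "flat closure": it carries just the variables the lambda actually uses, with no environment chain at all (again a sketch with invented names):

  /* Flat closure: copies of (or pointers to heap cells for) exactly the
     free variables this lambda references -- nothing else is retained. */
  struct flat_closure {
      lispval (*code)(struct flat_closure *self, lispval *args, int nargs);
      int nfree;               /* number of captured variables */
      lispval captured[1];     /* the free variables, stored inline */
  };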

There's a third technique, called "LambdaLifting", that's often used in purely functional languages. In this, the function takes all its free variables as extra parameters instead of taking a pointer to an environment. This won't work if the variables are assignable, because lambda invocations won't share closed-over variables like they do under normal lambda semantics. But it saves an indirection and is as fast as a normal function call, so you can see why it's useful in pure-functional languages. Read the papers on this for more details (I think SimonPeytonJones has done a bit of work on it); I'm probably butchering it.
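
As a concrete (and hypothetical) miniature: lifting (lambda (y) (+ x y)), whose only free variable is x, turns it into an ordinary top-level function with x as an extra parameter; lisp_add here is an invented stand-in for the (+) primitive:

  lispval lisp_add(lispval a, lispval b);    /* assumed (+) primitive */
  /* The lifted version of (lambda (y) (+ x y)): x is now a parameter,
     so no closure record or environment pointer is needed. */
  lispval lifted_lambda(lispval x, lispval y) {
      return lisp_add(x, y);
  }
  /* Each call site supplies the current x along with y:
     lifted_lambda(x, y) rather than invoke(the_closure, y). */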

Keyword args are another opportunity for optimization. According to a newsgroup posting, Allegro generates a table of possible keyword args when the function is defined, much the same way traditional compilers implement structs/records. A function invocation sets the appropriate fields in the table, and the address of it is passed to the function.
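
I don't know Allegro's actual layout, but the idea as I understand it would be something like this guessed sketch (all names invented):

  /* One struct type per keyword-taking function, laid out at compile
     time the way a record would be.  For (defun foo (&key start end)): */
  struct foo_keywords {
      unsigned supplied;       /* bit 0 = :start given, bit 1 = :end given */
      lispval start;
      lispval end;
  };
  /* The call (foo :end 10) compiles to roughly:
       struct foo_keywords kw = { 0x2, NIL, MAKE_FIXNUM(10) };
       foo(&kw);
     so the callee tests bits instead of parsing a keyword list. */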

Lists are not always lists! The LispMachine used a technique called "CDR-compression" (usually called cdr-coding), where a list that's allocated all at once (as opposed to being consed up step-by-step) would occupy a contiguous area of memory, like a vector. CDR then just moves to the next memory cell instead of following a pointer. There was a special tag bit on the list header that indicated a compressed list, and if you did a rplacd on the list, it'd be broken out into the normal format. I don't know of any implementation since that has done that; checking the tag may be prohibitively expensive without the special hardware.
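
In software the scheme might be rendered like this (a sketch only; a real LispMachine packed the cdr code into tag bits of the same word rather than a separate field, and MAKE_PAIR and NIL are borrowed from the tagging sketch above):

  /* Each cell carries a 2-bit "cdr code".  A compressed list is a run
     of CDR_NEXT cells ending in a CDR_NIL cell; an ordinary cons is a
     CDR_NORMAL cell whose cdr lives in the following word. */
  enum cdr_code { CDR_NEXT, CDR_NIL, CDR_NORMAL };
  struct cell { enum cdr_code code; lispval value; };
  lispval cdr(struct cell *c) {
      switch (c->code) {
      case CDR_NEXT: return MAKE_PAIR(c + 1);   /* cdr is the next cell */
      case CDR_NIL:  return NIL;                /* end of compressed list */
      default:       return (c + 1)->value;     /* explicitly stored cdr */
      }
  }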

See ContinuationImplementation for continuation implementation.

Many high-performance Lisps (CMUCL, GwydionDylan, and SBCL, at least) also do type inference as an optimization. With suitable type declarations, they can often pin down the precise type used in a function, which lets them get away without tagging or boxing values. This can make a big difference in numeric code, and is one reason why CMUCL can often be faster than C/C++ for numerics.

-- JonathanTang


It would make an interesting contest to see the shortest C implementation of a simplified Lisp dialect. Ignore speed for the initial contest. The winner would receive something like an autographed photo of JohnMcCarthy.

http://www0.us.ioccc.org/years-spoiler.html#1989_jar.2


If you're interested, I have an old non-standard lisp-in-java implementation that I can exhume (haven't seen it in three years) and post up on the wiki as a catalyst for explanation and discussion. I seem to recall it being quite modular and easy to explain. It would end up spanning a rather large number of pages, though, for sufficient explanation to justify putting it up at all, and perhaps the interest doesn't justify the pollution? -- AdamBerger


I've been kicking around an idea of using CeeAsAnIntermediateLanguage without having to resort to something like CheneyOnTheMta by using a few bits of assembly that work in the GNU assembler on all platforms, and thereby avoiding the overhead of C function calls entirely. The basic idea looks something like this:

  /* Plant a global assembler label named X at this point in the code. */
  #define LABEL(X) asm(".globl " #X ";" #X ":")
  /* Jump to such a label: declare the symbol as external data, take its
     address, and branch there with GCC's computed goto. */
  #define GOTO(X)  do { extern void *X; goto *(&X); } while(0)
Functions would look like this:
  void funcname() {
    LABEL(funcname_start);
    /* ... */
    GOTO(elsewhere_start);
  }
I've already discovered that any optimization in gcc breaks this (it sees no need to pop arguments to functions off the stack after calling them when it is about to return and restore the previous value of the stack pointer anyway). It would be necessary to avoid doing any gotos from blocks with automatic variables (which means no autos in functions), but that's easy in auto-generated code. I wonder if the costs of losing optimization would outweigh the benefits of avoiding unneeded function-call overhead, or could that be fixed by optimizing before compiling to C?

There is still a little overhead above a straight assembly version since on x86 the goto compiles to:

  movl $label,%eax
  jmp *%eax
instead of:
  jmp label
but I'm thinking that the cost of an extra non-memory-referencing instruction is tiny compared to the cost of a function call.

Has anyone else done this before? Is there any conventional wisdom on this sort of thing?

Not sure, but it sounds like it might ultimately be more trouble than CheneyOnTheMta.

You may not care, but I would think that "goto *(&X);" relies on GCC's computed-goto extension, and wouldn't work on other compilers.

You may also be interested in taking a look at CeeMinusMinus -- a C-like language intended as a portable high-level assembly language (the name being descriptive rather than derogatory).

Doesn't GCC already optimize tail calls nowadays?


Anybody want to volunteer a relational schema of parsed Lisp? Don't worry about field types or lengths at this point. -- top


Please STOP changing "--" to "-". This is simply wrong. The "--" represents a "dash", which is different from a hyphen. When I say "it is dead-wrong", I correctly use a hyphen. When I say, instead, "it is wrong -- you are pessimizing pages", I correctly use a dash.

It's not "wrong", just a matter of what is chosen to represent a dash; "dead wrong" is usually two words.


CategoryLisp

