Homoiconic Definition Take Five

Homoiconic definition made precise:

A language is homoiconic when the data types defining its AbstractSyntaxTree are part and parcel of the language (standard, part of the standard library, available implicitly at runtime/compile time for every client developer to play with, just like strings, numbers and other stock data types), and when the language supports AST literals.

Does "supports AST literals" also imply the presence of some sort of "eval" statement? I can easily imagine a language that provides its AST within the language as part of its standard, but that omits any form of free-variable-capturing "eval" for reasons of security and optimization. (Actually, I'm designing such a language.)

For example, in Lisp/Scheme the fundamental AST structure is given by EssExpression. And they both support literals for EssExpressions:

 (define myExpression '(if x (lambda (y) (+ y 1)) (lambda (z) (- z 1)) ) )

The variable myExpression is initialized with a constant denoted by the literal '(...).

Java is not homoiconic, even if you can manipulate code (using BCEL and other open-source libraries) to accomplish virtually everything, because there is no literal in Java to denote a for loop, or any other code construct.

TCL is not homoiconic, because while it uses strings, its strings are not ASTs. Once you put the code into a string, you lose the structure of the AST.

You've run into trouble already, because although this is a reasonable view, on the other hand it is directly contradicted by quotes from published material. TCL is considered homoiconic.
- Oh, but that's a feature not a bug. It was designed into the definition with the specific purpose to exclude TCL, bash and other less deserving (from this point of view) languages. Otherwise if one is charitable "AST literals" can be replaced with "code literals" and TCL & comp. would be in.
Strings in Tcl are the AST. When you write a control construct in Tcl such as "if", you're actually calling a command "if" and passing it an expression (as a string!) and the body to run if the expression evaluates as true (as a string!). The interpreter isn't allowed to assume that the string you pass to if is code, since it could equally well not be (the command "if" might even be redefined to mean something else which really wants a string as its third argument). Any definition of homoiconic which excludes Tcl is broken. Likewise, any definition which relies on the inner workings of a particular compiler or interpreter is broken.
"Once you put code into a string..." This shows a misunderstanding of Tcl. The code is the string. Putting code (or anything) in braces is a way of writing a string literal, not a block of code. The only way you can pass code to if, while, for, proc and so on is as a string. The only AST structure Tcl applies to your source code is to match braces, and that structure is directly exposed by Tcl when you treat a string as a list.
It also paraphrases the historical quotes on the subject severely enough that I don't think it captures the original flavor.
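To make the brace-matching point above concrete, here is a rough sketch in Python (not Tcl) of how a string can be read as a list of words when braces are the only structure. The function name split_words is hypothetical, and the rules are a heavy simplification: real Tcl also handles double quotes, backslashes, and substitutions, none of which are modeled here.

```python
def split_words(text):
    """Split a Tcl-like string into words; a {...} group becomes one raw string.

    Simplifying assumption: braces and whitespace are the only syntax.
    """
    words, current, depth, in_word = [], [], 0, False
    for ch in text:
        if ch == "{":
            if depth > 0:          # inner braces stay part of the word
                current.append(ch)
            depth += 1
            in_word = True
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced close brace")
            if depth > 0:
                current.append(ch)
        elif ch.isspace() and depth == 0:
            if in_word:            # whitespace outside braces ends a word
                words.append("".join(current))
                current, in_word = [], False
        else:
            current.append(ch)
            in_word = True
    if depth != 0:
        raise ValueError("unbalanced open brace")
    if in_word:
        words.append("".join(current))
    return words


outer = split_words("while {1} {if {$a} {b}}")
inner = split_words(outer[2])
```

Each word comes back as a plain string, and a nested body only reveals its own structure when it is split again. That is the sense in which the structure is "exposed when you treat a string as a list" in this toy model.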

This makes Scheme and Lisp homoiconic, and Java, C/C++, Smalltalk and Python non-homoiconic; stringified languages are non-homoiconic even if they have eval, etc.
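As a hedged illustration of why Python fails the second half of the test: its standard library does ship the AST data types (the ast module), but there is no literal syntax for them. A tree is obtained either by parsing a string or by calling node constructors by hand. The helper name evaluate below is made up for this sketch.

```python
import ast

# Route 1: parse source text. The tree only exists after parsing a string.
tree_from_text = ast.parse("y + 1", mode="eval")

# Route 2: build the same tree by calling node constructors by hand.
tree_by_hand = ast.Expression(
    body=ast.BinOp(
        left=ast.Name(id="y", ctx=ast.Load()),
        op=ast.Add(),
        right=ast.Constant(value=1),
    )
)
# Hand-built nodes lack line/column info, which compile() requires.
ast.fix_missing_locations(tree_by_hand)


def evaluate(tree, y):
    """Compile an ast.Expression and evaluate it with the given value of y."""
    return eval(compile(tree, "<ast>", "eval"), {"y": y})


value_from_text = evaluate(tree_from_text, 41)
value_by_hand = evaluate(tree_by_hand, 41)
```

Neither route is a literal in the sense of '(+ y 1): the first goes through a string, the second through ordinary constructor calls.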

In support of this revised definition.

It's a simple boolean test (no fuzziness in here) and it's quite unambiguous. Other definitions mess things up. For example "code has the same representation as data" -- complete ambiguity. What code (source code, byte code, machine code) and what data representation (constants/literals, data internal representation in memory, external representation, etc) ?

Most of the questions in HomoiconicFaq become obvious for anybody to answer, rather than subjects of endless dispute. HomoiconicClassification becomes clearer. There's a direct relation between the substance of the definition and both the advantages and the disadvantages of "homoiconicity".

It preserves Lisp and Scheme as the traditional homoiconic languages while it gets rid of TCL, FoxPro, and others (bash, anyone?) that can eval strings. Manipulating code as strings is no fun, and is not what homoiconic is all about; otherwise, every language can manipulate strings and use a well-packaged library to generate code. There's no big deal about manipulating strings; the big deal is about the AST. This preserves the intent of the original definition in the face of language evolutions that make it unclear (byte codes, byte code manipulation, string manipulation, just-in-time compiling, widespread distribution of eval, etc).

Which is why I said it was "reasonable" -- in those senses. However, this redefines the word to be incompatible with the definition provably in use by others in the field, so I don't see how this can be made to fly. Idiosyncratic definitions hinder rather than aid communication.

But the old definition is provably bad, as it makes every modern language homoiconic by the trivial addition of a library. A definition that does not differentiate is useless anyway. And in the end, who cares? The term is very much unimportant, the old definition is bad, and this definition has all the qualities except being compatible with some old pronouncements. And by the way, other than Raphael Finkel mentioning in passing that TCL is homoiconic, and the TCL community, who clearly have a conflict of bragging interests, picking up on it, there's nothing else standing in the way of progress. Clear definitions help rather than hinder communication, fuzzy definitions hinder rather than help communication, so the choice is clear. I rest my case, and this subject is closed as far as I am concerned. Take it with a grain of salt, and all that. TakeFive.

Don't be like that. Consider that the term was invented by the authors of TRAC, a very TCL-like language; it was not invented originally to refer to Lisp. Further consider that I have always refuted the notion that it "makes every modern language homoiconic by a trivial addition of a library", and have expressly and frequently said that this misunderstands the idea. So if you're done on the topic, it is a very hollow victory indeed. -- Doug

You may have refuted it for yourself, but all the other people were unconvinced, endless discussions ensued, and people took it that homoiconicity is something on a scale of 0 to 1, or from strong to weak. But under this definition it is crystal clear why neither Java nor Smalltalk nor other languages can be shoehorned into "homoiconic languages". As for TRAC, does TRAC have a notation for AST literals? I don't know, but looking briefly over the TRAC paper, their "strings" are implicitly more structured than just an array of characters, so TRAC might fit this definition just fine. Or it may not. It's an old forgotten language that never really flew. Now the icon for homoiconicity is LISP. The very specific difference between LISP/Scheme on one hand and Java/Smalltalk/Ruby on the other hand is that the first category has a notation for EssExpression literals.

In the end, a "debate" over a definition cannot be won by arguments, only by usage and acceptance. Since the old "definition" has already confused enough people and is, in itself, ambiguous, it follows that the old one has already lost the battle. This definition can lose the battle as well, which may be interpreted to mean that "homoiconicity" is not an important enough feature to be worth the battle for a good definition. So it was never my intention to claim that this is the definitive definition of homoiconicity, but if anybody wants to think of homoiconicity in no uncertain terms, this definition can guide him perfectly well.

If you can get your proposed definition accepted by the world at large, that would be fine by me. Until then, however, as you say, "usage and acceptance" hasn't happened yet.

Also, although yes, there has been lots of confusion, you overlooked a proposal I made to EricHodges just yesterday that might cut through all the confusion while still being backward compatible. [I think I missed it too. What was the proposal you made to EricHodges just yesterday? What page is it on? -- jtg]

Interesting.

#1: It bothers me that this definition excludes TRAC, the language where the term was coined. It doesn't seem possible to build meaningful AbstractSyntaxTrees for TRAC, as parsing is so intimately intertwined with execution. To build a useful AbstractSyntaxTree, the structure of the code has to be parsable before execution. But in TRAC, it's apparently possible to change structure during execution, based on the data.

So whatever mechanics TRAC is doing, it ends up at the very least with a minimal syntax tree: at least operand + operators, if they interpret the source code command by command. Or operand + operators + macros to expand; it's still a little tree. But who wants to learn TRAC now, and who cares? TRAC was an experiment, apparently not a very successful one. If its embodiment of homoiconicity was less than stellar (at least in comparison with Lisp), then Lisp should be the gold standard.

#2: I think redefining the homoiconic term to be based on AbstractSyntaxTrees requires that you argue that that's what AlanKay really meant when he said "[...] both are 'homoiconic' in that their internal and external representations are essentially the same." And that AlanKay was wrong when he acknowledged that TRAC was homoiconic. And that Mooers and Deutsch were mistaken when they coined the term.

Now if you take that definition to the letter, it would follow quite easily that LISP is not homoiconic. The internal representation for LISP code and data is a tree of pointers with some bytes on the leaves, whereas the external representation is a char array, or a source file, corresponding to the s-expression encoding (older LISP had the M-expression syntax as well). Clearly not the same. There is no need to keep an old and bad definition just for the sake of a dead language, but the spirit of the old definition as it applied to the technologies of that time, and the spirit of this definition as it applies to the current state of the art, are essentially the same. Concepts evolve with time; what should be considered homoiconic now need not be identical to the letter with what was then, but by and large this definition is a refinement of those ideas, and it has the quality of being extremely precise.
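The gap between the two representations can be sketched in a few lines. The toy reader and printer below are written in Python rather than Lisp and are an assumption-laden illustration, not how any real Lisp reader works: atoms stay plain strings, there is no quote syntax, and the names read_sexpr and write_sexpr are hypothetical. The external form is a character string; the internal form is a tree of nested lists.

```python
def read_sexpr(text):
    """Turn s-expression text (external form) into nested lists (internal form)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def parse(pos):
        token = tokens[pos]
        if token == "(":
            items = []
            pos += 1
            while tokens[pos] != ")":   # unbalanced input raises IndexError
                item, pos = parse(pos)
                items.append(item)
            return items, pos + 1       # skip the closing parenthesis
        return token, pos + 1           # an atom, kept as a plain string

    tree, _ = parse(0)
    return tree


def write_sexpr(tree):
    """Turn nested lists back into s-expression text."""
    if isinstance(tree, list):
        return "(" + " ".join(write_sexpr(item) for item in tree) + ")"
    return tree


text = "(if x (lambda (y) (+ y 1)) (lambda (z) (- z 1)))"
tree = read_sexpr(text)
round_trip = write_sexpr(tree)
```

The two forms are plainly different objects, yet each maps onto the other by a short, mechanical procedure, which is one way to read "essentially the same".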

I wonder if it would help to look for and list things that TRAC and LISP have (and maybe TCL, FoxPro, etc), that Java, C#, C, C++, FORTRAN, COBOL, etc. don't have. ...things seemingly related to "internal and external representations" being "essentially the same".

Like...

Runtime representation of the code (if any) being simply and obviously related to the (typically text) source code.
- Unfortunately Assembly and machine languages have this relationship. But maybe these are homoiconic.
- They aren't, but they don't have that relationship, either! I have written assemblers, and the parsing is conceptually trivial for the simpler assemblers, but a pain in practice. Similarly for machine language; writing a "code walker" of any sort (a disassembler, a simulator, whatever) again is only conceptually trivial, but in practice, a serious pain. They do not have a simple and obvious interrelationship, it only seems like it if you haven't had to actually do the mapping. Not that that's exactly the issue truly at hand, but just for the record.
- Intel 8080 assemblers are trivial; "Assembler 101" school book stuff. Zilog Z80 had a few tricks up its sleeve: Op codes alone didn't uniquely identify instruction codes or lengths. But op code + operands did, so a traditional two-pass assembler still works fine. Intel 8086 is nasty: With more information available in the second pass, the assembler sometimes changes its mind as to what op codes and therefore instruction lengths to use. Annoying, and a pain I admit, but not unreasonable.
- In practice, the difficulty with a disassembler is in separating the code from the data. In some programs, this can be ambiguous. And the more difficult this is, the stronger the argument that assembly/machine is homoiconic! ;->
TRAC and LISP have a standard way to represent code as data.
- But so does Java.
- But there isn't one primary representation for both. The standard way to represent code in Java is not the standard way to represent data in Java (there isn't really any single standard data representation, there are lots -- and actually, byte-code arrays aren't used for any kind of data except program data!).
- Complete nonsense. Byte arrays are used for tons of things. They're all over the place in my programs, and in anything that deals with networking, messaging, imaging, cryptography. Not all Java data is byte arrays, of course, but not all Lisp data is conses either; Lisp has at least arrays as an alternative complex structure.
  - I stand corrected. Well, actually, let's double-check that. I said "byte-code array", not "byte array", thinking that there was a specific Java class for "byte codes" (or arrays thereof) that was in fact different than the specific class for "bytes" (or arrays thereof). So I'm wrong about that?
    - Java makes no distinction between byte codes and bytes. BCEL has an "Instruction" class that models byte codes.
    - Indeed, there's no such data type or object that can be said to be a byte-code array. ClassLoader.defineClass(...) takes a plain old byte array, and transforms it into the internal structures (possibly including machine code) that represent the class inside the VM. If the byte array fed into ClassLoader.defineClass(...) does not conform to the JVM specs, an exception will be thrown, but that doesn't change the type of the parameter, which is still plain old byte[]. For all intents and purposes, when you talk with Java programmers and say byte-code array, what they'll hear is byte[].
- In Java, you have (1) things of some subtype of 'java.lang.Object', and you have (2) scalars. Byte array ('byte[]'), like all arrays in Java, is a child of 'java.lang.Object'.
- Yes, Java makes extensive use of byte arrays: Practically all I/O to the outside world is through byte arrays.
- In addition, if you take a LISP/Scheme implementation that takes all statically known code and churns it into machine code that has nothing to do with conses, it stands to reason that it wouldn't be any less homoiconic, in spite of the code representation in the runtime having nothing to do with data representation.
  - This has been discussed pretty thoroughly, beginning last year, and people seemed reasonably happy with the notion that the compilation has to be transparent for it to remain homoiconic, otherwise indeed it does become non-homoiconic.
  - What "people" ? Some might have been bored by the irrelevance and the inconsequence of that protracted discussion. There's no reason for a Lisp/Scheme implementation that transforms every bit of S-expression code it receives in machine code (therefore removing all connection with the conses structure) to lose its homoiconicity. In fact, under this revised definition the homoiconicity is not lost, which makes it a very serious argument in favor of this definition.
TRAC and LISP interleave code and data intimately.
- Or should we say promiscuously? How is the usage of such fuzzy non-concepts going to improve anything?
- Java, etc, interleave code and data too. But there's at least a qualitative difference...

One possible advantage of pursuing several independent lines of difference, like the above, is that maybe several of them together define "homoiconic." If "homoiconic" required "X", "Y" and "Z" attributes, and some language had "X" and "Z", but not "Y", then it wouldn't be "homoiconic."

OK, I'll throw another wrench in the works:

With the AbstractSyntaxTree definition of homoiconic, the .Net languages are homoiconic. That is, C#, JScript, and Visual Basic (VB.Net/VbDotNet) are homoiconic.

At http://msdn.microsoft.com/library/default.asp?url=/library/en-us/netstart/html/cpframeworkref_start.asp we find "This section contains reference documentation of the public classes that constitute the .NET Framework, as well as lexicons for other languages employed in the .NET Framework." including the "CodeDOM Quick Reference"

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconcodedomquickreference.asp

Microsoft's CodeDom is standard for the .Net environment, and for the C#, JScript, and Visual Basic (VB.Net/VbDotNet) languages. It offers AbstractSyntaxTree access to code, along with translation between that (the "DOM") and "string of characters" source code, and on to (platform dependent) machine code.

For example, System.CodeDom.CodeConditionStatement is an "if" statement, and System.CodeDom.CodeIterationStatement is a "for" loop.

They still lack literals; in other words, you cannot write something like:

 CodeExpression myAssignment = 'i=10; ;

But other than that it's very cool. Hope Sun will soon imitate it with Java.

Ok, enough for homoiconicity -- too much noise for nothing really important.

Now you know why I gave up on the topic a year ago. :-)

[CD containing the song "Take Five" follows.]

The biggest bug of all: the Dave Brubeck version is vastly superior: ASIN B000002AGN (since that ASIN link doesn't work: http://www.amazon.com/exec/obidos/tg/detail/-/B000002AGN/)

"Boasting the first jazz instrumental to sell a million copies, the Paul Desmond-penned "Take Five," Time Out captures the celebrated jazz quartet at the height of both its popularity and its powers. Recorded in 1959, the album combines superb performances by pianist Brubeck, alto saxophonist Desmond, drummer Joe Morrello and bassist Gene Wright. Along with "Take Five," the album features [...]"

And if you're going to buy two albums, ignore Amazon's suggested pairing and get Two of a Mind with Paul Desmond and Gerry Mulligan (http://www.amazon.com/exec/obidos/ASIN/B00008VGMU/)

So somebody claims that TCL strings are an abstract syntax tree. This may be true, as the previous claim was made by somebody with superficial knowledge of TCL. To clear things up, please show the TCL function calls (or commands) that take a string and perform a traversal of the AST.

That somebody was me. How does the ability to take a string of Tcl code and perform a traversal of the AST (to the limited extent which such a thing even exists for Tcl) prove anything about the Tcl language? I could do that in any Turing-complete language, for input written in any language with a well-defined notion of an AST.

Tcl strings contain all the same information as any other AST for the code would. And they have the side benefit that they can be executed as code (without even using eval). A string is a very awkward way to represent an AST (from the point of view of traversing/updating it), but I fail to see how a string containing code is not a valid representation of a Tcl AST. What am I missing?
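The "awkward to traverse/update" remark can be illustrated outside Tcl. The sketch below uses Python and its standard ast module purely as a stand-in (ast.unparse assumes Python 3.9 or later, and the class name Rename is made up for the example). It renames a variable twice: once by editing the code as a string, once by editing it as a tree.

```python
import ast

source = "x + max(x, 10)"

# String-level edit: blind replacement also corrupts the unrelated name "max".
naive = source.replace("x", "count")


class Rename(ast.NodeTransformer):
    """Rename one variable by visiting only Name nodes in the tree."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            renamed_node = ast.Name(id=self.new, ctx=node.ctx)
            return ast.copy_location(renamed_node, node)
        return node


tree = Rename("x", "count").visit(ast.parse(source, mode="eval"))
renamed = ast.unparse(tree)                       # back to source text
value = eval(compile(tree, "<ast>", "eval"), {"count": 3})
```

Both versions contain the same information about the program; the difference is only in how much work an update takes, which is the point conceded in the paragraph above.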

The essence of HomoiconicDefinitionTakeFive. That was it: that the language has AST elements as data (either built-in or part of the standard library) and that the language has literals for this type of data. Strings in TCL are string literals, not AST literals the way LISP lists are, for example.