Definition Of Homoiconic Discussion

[Moved from DefinitionOfHomoiconic]

The above statement, and the referenced page, are both controversial. I believe further that it is false that there is an issue due to language evolution, since Lisp is considered universally to be (still) one of the best examples of a homoiconic language, yet its homoiconic core has not changed since around 1960, a little before the term "homoiconic" was even coined.

The issue with language evolution is that in Java the code representation is byte code, which means an array of bytes, and arrays of bytes are part and parcel of the Java language. Byte arrays can be manipulated easily as pieces that define code, for all the kinds of purposes for which TCL might manipulate strings or LISP might manipulate s-expressions. Many successful projects did just that. If you claim that byte arrays are unstructured, then the response is that they are just as unstructured as TCL strings. If you claim that Java has data other than byte arrays, then so does LISP have data other than conses. So 40 years ago somebody could say "code has the same representation as data" and everybody got a hint where it was going. Today there's the source code (in the case of Java it's char[] or String, no less useful than TCL strings), there's the byte code, which would be byte[], and there's the machine code generated by the compiler. The data representation is also unclear: is it the representation in the source code as literals, or the representation in the runtime?

What is clear is that the Lisp approach to homoiconicity is stronger in multiple ways than the languages that approach homoiconicity via raw string evaluation, whether this means that the definition of "homoiconic" should be changed or not.
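The difference between the two approaches can be sketched in a few lines. This is only an illustration in Python, not Lisp or TCL: the nested-list mini-language, `evaluate` and `swap_operator` are hypothetical names made up for this example. The point is that a rewrite over nested lists sees operators as operators, while the same rewrite over raw text sees only characters.

```python
import operator

# Hypothetical mini-language (an assumption for this sketch):
# code is a nested Python list such as ["+", 1, ["*", 2, 3]].
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(expr):
    """Evaluate a nested-list expression recursively."""
    if isinstance(expr, list):
        op, *args = expr
        values = [evaluate(a) for a in args]
        result = values[0]
        for v in values[1:]:
            result = OPS[op](result, v)
        return result
    return expr  # a literal number evaluates to itself

def swap_operator(expr, old, new):
    """Return a copy of expr with every use of operator `old` replaced by `new`."""
    if isinstance(expr, list):
        head, *args = expr
        head = new if head == old else head
        return [head] + [swap_operator(a, old, new) for a in args]
    return expr

code = ["+", 1, ["*", 2, 3]]
rewritten = swap_operator(code, "*", "+")   # structural rewrite

# The same rewrite on raw text cannot tell an operator from any other
# character without parsing first: the string literal is changed too.
text = 'print("2*3 =", 2*3)'
naive = text.replace("*", "+")
```

Here `rewritten` is still a well-formed expression that `evaluate` accepts, whereas `naive` has silently altered the text inside the quoted string as well as the operator.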

Yes, and Java's runtime has raw byte[] evaluation; see the documentation for java.lang.ClassLoader.defineClass, which is no less worthy of praise than TCL's evaluator. Bottom line: if TCL is in, Java has to be in as well.

I (ATS) am not sure about that. Can you manipulate java code in the same way as when you're writing it? I dunno, but when writing it, I don't manipulate byte arrays. OTOH, see the bit of code I proposed on MyNaiveAttemptAtUnderstandingHomoiconicity. If that would work, I'd say it was in (though not everybody will agree with that either).

When I write LISP code, I typically do so by manipulating text in a text editor, not by issuing CAR and CDR commands. So there is some difference between my editing and runtime manipulation of LISP code. -- jtg

TentativeSummary: We have several proposed definitions:

AbstractSyntaxTree must be a FirstClass datatype in the language. This excludes everything except Lisp and Scheme. Much disagreement, since TRAC ('the original homoiconic language') and TCL are excluded.
A reasonable representation of the code is FirstClass in the language. This does include TRAC and TCL, possibly FoxPro. Much debate about Java and byte arrays.
Any kind of access to code at runtime. Much disagreement, since this includes too much.
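To make the three proposed definitions concrete, here is a hedged sketch of the three levels of access, using Python only because its standard library happens to expose all three. It is not a claim about where Python, Java or TCL fall; the variable names are made up for the example.

```python
import ast

source = "x = (1 + 2) * 3"

# Level 2 in the summary: the source text itself is an ordinary string
# that can be handed to the evaluator (the TCL-style route).
namespace = {}
exec(source, namespace)

# Level 1 in the summary: a syntax tree as a first-class value whose
# nesting mirrors the nesting of the source.
tree = ast.parse(source)
assign = tree.body[0]
outer = assign.value   # the multiplication (1 + 2) * 3
inner = outer.left     # the addition 1 + 2, nested inside it

# Level 3 in the summary: "any kind of access", here the compiled form
# as a flat bytes object (roughly comparable to a Java byte[]).
code_obj = compile(source, "<example>", "exec")
raw = code_obj.co_code
```

All three values describe the same program, which is why the choice of level decides which languages a definition lets in.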

Java is clearly not homoiconic. Java byte code arrays represent the machine language of the Java virtual machine, which is a different language, not Java. Even a cursory glance at the definitions of Java byte codes makes this clear. In contrast, Lisp can manipulate nested lists of symbols and literal data that represent Lisp source code directly. In other words, what Java allows you to manipulate is a representation of the output of the Java compiler, whereas Lisp allows you to work with data that represents the input to the compiler or interpreter. (It's also the output of the reader. The reader, being the front-end of the interpretation/compilation process, is hidden as an implementation detail in most languages, but is distinctly identified and accessible in Lisp languages.) TCL's strings also have a very direct relationship to TCL source code, albeit at a lower level of abstraction. -- DanMuller

[Java lets you manipulate the input to the interpreter. (Byte code is typically compiled to machine code at run time these days. But that's a hidden transparent implementation detail.) -- JeffGrigg]

Alas, Java also lets you "manipulate nested lists of symbols and literal data that represent (Java) source code directly." You can do that with Strings out of the box or add a library like ANTLR and do that with ASTs of statements, expressions, etc. Also, a more detailed study of the definitions and intent of Java byte codes reveals that they are little more than an AST of Java source code. The primary design goal was to provide a way to feed Java source to an optimizing compiler at run-time. -- EH

Example disassembly from http://www.javaworld.com/javaworld/jw-09-1996/jw-09-bytecodes.html:

    iconst_0     // 03
    istore_0     // 3b
    iinc 0, 1    // 84 00 01
    iload_0      // 1a
    iconst_2     // 05
    imul         // 68
    istore_0     // 3b
    goto -7      // a7 ff f9

This doesn't look much like Java source code. Any nesting structure is necessarily lost when moving to a linear byte code array; it's only very indirectly implied by branching instructions. Individual loads and stores have no direct analogue in the original source code. There's really no comparison, IMO. The ability to manipulate strings of source code is not in and of itself an indication of homoiconicity, as has already been discussed ad nauseam. -- DanM

FYI: There are decompilers for Java (e.g. JAD) that can recover Java code from Java byte code. Not that I think that makes Java homoiconic. To forestall some further arguments: arguing that the formatting or even the exact ordering of branches is not the same doesn't count. The formatting is lost with Lisp too (e.g. whitespace between tokens), and ordering could be considered part of the presentation or the result of an implied normalization step.

I am not arguing that formatting is significant, but nesting structure is, and I would argue that ordering may be. Decompilers are not news, but they don't generally produce output that closely resembles the original source code in terms of variable names and nesting structure. (Although they should certainly produce something that produces equivalent side-effects and results.) They are not a good replacement for homoiconicity, because it's much harder to write code to manipulate code-as-data when that data doesn't have the form that you expect (i.e., the form in which it was originally written). Even reordering code in computationally insignificant ways would complicate such tasks. Consider Lisp macros that receive fragments of code written by a user according to the input requirements of the macro; if the data that the macro operates on isn't an accurate reflection of the original code/data, life can become immediately much more difficult.
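The argument about lost nesting can be sketched with a toy compiler. This is a hypothetical illustration in Python and says nothing about what any real Java compiler or decompiler does: `compile_expr` and `run` are made-up names, and the "byte code" is just a list of tuples for an imaginary stack machine. Two differently nested source forms flatten to the same instruction sequence, so the original form cannot be recovered from the output alone.

```python
def compile_expr(expr):
    """Flatten a nested-list expression into a linear list of stack instructions.

    Hypothetical toy compiler: supports only "+" and "*" with two or more operands.
    """
    if not isinstance(expr, list):
        return [("push", expr)]
    op, first, *rest = expr
    code = compile_expr(first)
    for arg in rest:
        code += compile_expr(arg)
        code.append((op,))
    return code

def run(code):
    """Execute the linear instruction list on a simple stack machine."""
    stack = []
    for ins in code:
        if ins[0] == "push":
            stack.append(ins[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if ins[0] == "+" else left * right)
    return stack.pop()

# Two different source forms: one flat, one explicitly nested.
variadic = ["+", 1, 2, 3]
nested = ["+", ["+", 1, 2], 3]

flat_a = compile_expr(variadic)
flat_b = compile_expr(nested)
```

Both forms compute the same result, and after compilation they are indistinguishable. A macro that was written to expect one of the two shapes would need the input as it was written, which is the situation described above.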

Earlier on this page, someone makes the argument that TCL's strings are as unstructured as byte codes. Although on the face of it this is true, it's also true that the structure of the source code that they represent is encoded in them absolutely unambiguously, which is not true of Java byte codes. -- DanM

Actually, byte codes are a lot more structured than TCL strings. They don't preserve all the structure, but they preserve the structure that matters for practical applications, which accounts for the success of projects like AspectWerkz (http://aspectwerkz.codehaus.org/weaving.html).

That's wonderful, but it's not evidence of homoiconicity. The structure that's important to this definition is the structure of the source code, obviously. Homo iconic -- same representation. TCL uses the same representation at the level of characters. Lisp does so at the level of list structure and atoms, which are the explicit elements of the language, one step up in abstraction from characters. There is no such defined one-to-one correspondence between elements of Java source code and the elements of Java byte code arrays. (Also, are you implying that Lisp macros which operate on their input code structures are not practical applications? Two generations of Lisp programmers would tend to disagree.) -- DanM

But there is a defined one-to-one correspondence between elements of Java source code and elements of Java byte code. Java's byte codes form an AST translation of the original source code, optimized for loading into a compiler. Java byte codes are easily translated back into source code. -- EricHodges
- If so, I haven't been able to find it in the language specification. (http://java.sun.com/docs/books/jls/third_edition/html/j3TOC.html) And the examples I've seen (such as one given earlier) don't seem to lend themselves readily to an AST interpretation. -- DanM
- Huh? That's the language spec. It doesn't say anything about byte codes. See http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html for the JVM spec. The BCEL docs also do a good job of explaining byte code structure. Also, BruceEckel taught a 1 or 2 day class at SD back in 1997 or so in which he went into some detail about this design feature of Java's byte code language. I'll see if I can find something on the web about it.
- Looked it over briefly. Doesn't seem to say what you say it says. As expected for an intermediate language, it specifies that language. The introduction to the Java language spec says: "The Java programming language is normally compiled to the bytecoded instruction set..." Note 'normally', not 'must'. In complementary fashion, the JVM spec says: "The Java virtual machine knows nothing of the Java programming language..." The relationship is much looser than you're claiming. -- DanM
- No, the relationship is much tighter than Sun admits. Run some Java byte codes through jad (it's free) and look at what it produces. -- EH
- Not relevant to this discussion. You're talking about the characteristics of a particular implementation, not the language as defined in its own standards. You could probably say the same about a particular C compiler's generation of native machine code with all optimizations turned off. -- DanM
- Try it with different compilers, then. You'll have a hard time finding differences in the byte code they produce. Java compilation isn't like C compilation. A Java compiler is really just a preprocessor that spews out byte codes in the easiest to load format. Real compilation goes on at run-time. The byte codes are used much more like P-code or I-code than instructions in practice. -- EH
The comparison was obviously with TCL, and TCL preserves or exposes no structure to the programmer. A char array is not structure; it is dead wood, so to speak. Be sure that TCL parses that into a minimal parse tree, figuring out what is a parameter, what is a procedure name, and what needs just-in-time expansion. So the claim that TCL is homoiconic is very dubious.
- But TCL strings do preserve all the information about the source code. The contention that Java byte codes do the same is dubious, as even the fairly commonly-cited task of, for instance, instrumenting each expression or statement of some source code cannot be achieved, based on what I've seen of the Java language standard. I doubt that you can reliably reconstruct the original statement or expression boundaries. -- DanM
- So your new definition would be: homoiconic is a language runtime that contains an information-preserving encoding of the source code for all its code? Then all language designers need to do to mark their "homoiconic" checkbox is to make sure that all the source code is included as a global char array in the runtime. That's just ridiculous.
- That's a ludicrous statement, obviously provided merely to be argumentative. I nowhere implied that was a complete definition. -- DanM
As for the claim that it's the source code that matters, you should know that a CONS (which is the fundamental structure used to represent code) is not source code; it is an intermediate representation, just like byte code is an intermediate representation. Obviously CONS is closer to the source code, but it is not source code. And if you care to know, LISP does not even guarantee that there's a one-to-one correspondence between all elements of LISP source code and some data structure available at runtime, because unlike TCL strings, LISP code can be compiled till there's no tomorrow. So if we all agree that LISP is homoiconic (because that's bloody obvious) while the TCL and Java claims are dubious, you have serious problems with the definition that you operate with.
- CONS is an important part of representing code, but note of course that any sexpr (even a simple atom, e.g. a number) is considered code in Lisp. Really, that's the key; source code is a textual representation of sexprs, which can be treated as code, and are accessible at run time (more on that assertion in a moment). In TCL, source code is a textual representation of executable TCL strings, which I presume are accessible at run time. (I'm only slightly familiar with TCL.) Calling Java source code a representation of byte codes is, AFAICT, a stretch, without reference to a particular compiler implementation.
- The definition as it currently stands on the linked page acknowledges that a Lisp which loses its sexpr-based representation of code in the course of compilation is a non-homoiconic dialect. This indicts CommonLisp, with its permissive definition of FUNCTION-LAMBDA-EXPRESSION. However, it's worth noting that all Lisps give potential access to the sexpr code representation via macros. Also, READ provides an easy way to convert source code files to sexprs, although that's a side issue to the definition, since anything involving locating files immediately involves concepts outside the language standard. -- DanM
- You are confusing the issue with something that has nothing to do with this discussion. The impossibility of recovering the body of a function means simply that Lisp is not fully reflective. But for reflectiveness we already have a consecrated word: reflective. As to the claim that a compile-based or just-in-time-compile-based Lisp/Scheme environment loses its "homoiconicity", it is hogwash. The other defender of the homoiconic orthodoxy claims that the definition applies to a language, not to an implementation.
- I was commenting only on the definition as it stands on the related page, not expressing an opinion or trying to change any definition. According to that definition, the compilation must be transparent, but if compilation 'loses' the homoiconic code representation, that's obviously not transparent. Since the term was coined and applied to Lisp long before standardization, it's perfectly imaginable that standardized CommonLisp has drifted from what those writers intended by the term. Whether that's the case or not, I really don't know. And personally, I don't agree with the notion that the term must be applied only to a language standard. It's an adjective, it could be used to describe anything you'd like, including a particular implementation of a language -- as long as you're clear what you're applying it to. -- DanM
- You are trying to stretch things based on your personal reading of a particularly unclear and rather dubious definition. Of course, using various rhetorical devices, any definition can be defended no matter how counter-intuitive and downright illogical the consequences are. Suffice it to say that for me such a definition has no legitimacy and is totally counter-productive; it even promotes bad engineering practices, as it is (at least in your interpretation of it) a clear violation of SeparationOfConcerns. You have yet to prove that those writers, in defining the term, intended the same thing as your interpretation of it, but even if you can prove that, it would still be irrelevant. In the little current usage the word homoiconic has, Scheme and Lisp are homoiconic, period, and programmers can exploit the advantages it provides when programming in Lisp/Scheme, much more so than in TCL, no matter what the implementation details of the environment are. But if you want to drag the discussion down to the point where Lisp/Scheme programmers have to be aware of implementation internals (that are part of neither R5RS nor CommonLisp) in order to decide whether "homoiconic" applies, this is just ridiculous, too ridiculous to be worth the effort to convince you it's a bad idea. That's why this discussion ends here.
- Well, if I'm using rhetorical devices rather than logic somewhere, please point them out, because they're unintended. The only thing I feel strongly about here is that Java does not warrant the homoiconic label. I'm perfectly OK with the definition remaining somewhat vague, and I didn't author, nor have a hand in authoring, the definition on the referenced page. I'm uncomfortable with your stance (whoever you are), which seems to be that the word is tautologically defined by what it was used to describe (namely Lisp languages), which renders the term fairly useless. When you mean "Lisp", simply say "Lisp", then, and leave terms like homoiconic to people who want to discuss and compare language features. Although this discussion has gone around in annoying circles, it has been somewhat interesting to see people discuss the related features of various languages in detail. -- DanM
- This very last exchange is a fine example of a useless rhetorical device, as you paint a picture of the opposing point of view in a way that you know isn't true. An alternative definition, according to which LISP/SCHEME are homoiconic independent of implementation details that should not concern programmers, and are not the only conceivable homoiconic languages, while neither Java nor TCL is homoiconic, has been provided. See HomoiconicDefinitionTakeFive.
- I was responding to what you wrote above. The problem with HomoiconicDefinitionTakeFive is that it is a redefinition of the term, possibly excluding the language for which the term was originally coined, as Doug explained there quite clearly. The expansion on that definition also contains a patent untruth: "Once you put the code into a string, you loose the structure of the AST." That's not true, it's there, just not quite as easily extracted as it could be. Obviously text can unambiguously represent structure, since that's how we write code. Don't get me wrong, I like your definition better in some ways, just as I find Lisp's approach more useful than TCL's. But I don't necessarily see that as justification for redefining a term. I also don't see an absolute need to nail it down to the point of complete unambiguity. At some level we're talking about people's evaluations of representations as being "similar" or "dissimilar". -- DanM
- Well, since the old definition is copiously broken, the new definition is not a redefinition but a clarification of which code has the same representation as which data. The code is the source code that can be quoted as code literals without losing structure (a definition that instead constrains runtime implementation choices, which are not the business of programmers, goes against SeparationOfConcerns), and the data are AST elements (because obviously languages need to support other data types than just those used for representing code). It follows that the new definition is not defining something else altogether; it just clarifies the old one in a way that is consistent, useful and avoids interminable discussions. That's all.
- The criterion "without losing structure", which restricts it to ASTs, is self-evident, as otherwise any programming language on mother Earth can put its own source code in a char array or other obvious encoding. The only controversial part is whether a syntactic construct for code literals should be required. Requiring it would exclude the current crop of .NET languages which expose AST data types to the programmer, which is a very useful thing. But on the other hand the presence of code literals makes this facility essentially easy to use for programmers.
If instead you claim that the definition is good (despite its obvious flaws, some people still like to hang on to it) and therefore appreciate that LISP is not always homoiconic, while TCL is always homoiconic, then it is no wonder that "homoiconicity" is a concept that virtually nobody uses, nobody has much use for anyways, but you can keep a useless definition. After all in the bible for LISP/Scheme users (SICP of course) homoiconic is never mentioned, so all you're left with seem to be some inconsequential articles 40 years ago, a mention in passing in the Finkel book, empty bragging rights on part of TCL community, and too much noise to be worth it on wiki.
- And the terms 'sexpr' and 'symbolic expression' are never mentioned in the CommonLisp standard, as far as I can tell. So what? 'Homoiconic' can still be a useful shorthand in discussions, as long as the term doesn't get watered down to uselessness. I happen to think that defining Java as homoiconic does exactly that - we have other useful terms for describing what goes on in Java, terms like 'intermediate code', that could be similarly analyzed and beaten to death, but start out closer to being good descriptions for that realm. So let's use those and be done with doing violence to simple descriptive phrases.
- Defining "homoiconic" to be something that TCL is, while some fully compliant ANSI Common Lisp isn't, makes the term entirely disposable. It's no longer watered down, it's just as good as "gobbledy-gook". And the fact that there are no 2 persons agreeing with the old definition that can provide consistent interpretations of it when examining the Lisp vs. TCL vs. Java case just proves the point: the definition is at best fuzzy.
- See comment above regarding the temporal relationship between the coining of the term and the standardization of CommonLisp. -- DanM
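The instrumentation task raised earlier in this thread (wrapping every statement of some source code) is easy to express when statement boundaries are available as data. A minimal sketch, using Python's standard `ast` module purely for illustration: `TraceStatements` and `trace` are hypothetical names, and nothing here shows whether the same can or cannot be done reliably from Java byte codes.

```python
import ast

source = "total = 0\nfor i in range(3):\n    total += i\n"

class TraceStatements(ast.NodeTransformer):
    """Insert a call to a (hypothetical) trace(lineno) before every statement."""

    def generic_visit(self, node):
        # Transform nested blocks first, then instrument this node's own blocks.
        super().generic_visit(node)
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list) and block and isinstance(block[0], ast.stmt):
                traced = []
                for stmt in block:
                    call = ast.parse(f"trace({stmt.lineno})").body[0]
                    traced.extend([call, stmt])
                setattr(node, field, traced)
        return node

tree = TraceStatements().visit(ast.parse(source))

seen = []                              # line numbers in execution order
namespace = {"trace": seen.append}     # trace is just list.append here
exec(compile(tree, "<traced>", "exec"), namespace)
```

After running, `seen` holds the source line number of each statement in the order it executed, which is only possible because the transformer was handed the statements as they were written.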

Most of the AOP libraries only do point cuts to the granularity of method calls. Some, like AspectWerkz, can also PointCut field get and set references. That, and catching calls to methods are ways to "reach into" and modify the code inside a method. But is this enough to argue that Java byte code in individual method definitions has "sufficient structure?" (AOP tools probably could point cut at a more granular level, but it hasn't been a widely desired feature. At least not yet.) -- jtg

How about this: A language is homoiconic if treating code as data (or vice versa) is Good Style, and is heteroiconic if treating code as data (or vice versa) is either not possible or is a Gross Hack.

So LISP, with its macros, is clearly homoiconic, but anything where you have to invoke a compiler via system() is heteroiconic. Languages with string-based eval might fall into either category, depending on whether the use of eval is a fundamental language feature or a bolted-on kludge.

But one man's fundamental feature is another man's bolted-on kludge. I can think of one language where all of its fundamental features are bolted-on kludges.

JulyZeroFive

CategoryDiscussion