Data Equals Code Depends On Context

We often hear one of two claims:

  DATA = CODE
or
 (NOT (EQUAL? DATA CODE))

See DataAndCodeAreTheSameThing or DataAndCodeAreNotTheSameThing.

Whether data and code are the same depends on how they are used. In many contexts--a HomoiconicLanguage, a UniversalTuringMachine--they are the same (or at least isomorphic); in many other contexts they aren't the same at all. (And in some contexts, the SeparationOfDataAndCode is considered a good thing for security reasons; it ought to annoy users that merely reading their email can infect their computer with a virus or other malware.)
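A minimal Python sketch of the "depends on how they are used" point (an illustration only; the names source, as_data and as_code are made up for this example): the same string value is inspected in one context and evaluated in another.

```python
# The same value, used in two different contexts.
source = "1 + 2 * 3"

# Context 1: the value is treated as data (inspected, never executed).
as_data = {"length": len(source), "tokens": source.split()}

# Context 2: the value is handed to an interpreter and treated as code.
# Calling eval() on untrusted input is exactly the security problem
# mentioned above; here it is applied only to a literal written in place.
as_code = eval(source, {"__builtins__": {}}, {})

print(as_data)
print(as_code)
```

Nothing about the string itself changes between the two uses; only the operation applied to it does.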

This argument seems to assume its conclusion.


{Perhaps one shouldn't try to "execute" email, regardless of whether code is data or not. In other words, don't execute anything unless you know where it came from. By the way, why do we need yet another page on this topic? If you want to talk about security, then create a security-related topic.}

I thought this page tried to synthesize the consensus on the subject in DocumentMode. If we agree that a reasonable consensus is reflected by this title (many people have already expressed this idea in the related pages, and convincing arguments to the contrary have not been produced), then the existing pages suffer from inappropriate titles. DataAndCodeAreTheSameThing and DataAndCodeAreNotTheSameThing presuppose the conclusion in their titles. If the logical synthesis is that DataEqualsCodeDependsOnContext, then I think this is a much better title. The security issues were used just as an example.


When you say that in some contexts data and code "are the same (or at least isomorphic)", you borrow a precise mathematical term. I'm not aware of anybody who has constructed a useful isomorphism between code and data, not even in a HomoiconicLanguage. Everybody should know from StructureAndInterpretationOfComputerPrograms that all data can be encoded (represented) as code (or lambda forms, to be more precise), and obviously all code is encoded as data (even in non-homoiconic languages), but the question remains whether such encodings have the properties that mathematicians associate with the concept of "isomorphism" -- a bijective function preserving some algebraic structure. If anybody knows a particular isomorphism, please post a link here. --CostinCozianu
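For readers who have not seen the two encodings mentioned above, here is a small Python sketch (a transliteration for illustration, not anything posted on the related pages). The closure-based cons/car/cdr follows the well-known construction from StructureAndInterpretationOfComputerPrograms; the second half uses the standard ast module to show source text becoming an ordinary tree of objects. Neither half demonstrates an isomorphism; each only shows that an encoding exists in one direction.

```python
import ast

# Data encoded as code: a pair represented purely by closures.
def cons(a, b):
    return lambda selector: selector(a, b)

def car(pair):
    return pair(lambda a, b: a)

def cdr(pair):
    return pair(lambda a, b: b)

p = cons(1, cons(2, None))
first = car(p)
second = car(cdr(p))

# Code encoded as data: source text parsed into a tree of plain objects.
tree = ast.parse("x + 1", mode="eval")
node_type = type(tree.body).__name__  # class name of the top-level expression node

print(first, second, node_type)
```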

Well, "isomorphism" also means "A similarity in form" in biology and "a close similarity in the crystalline structure of two or more substances of similar chemical structure" in chemistry. But in general I agree with Costin. When you say "up to an isomorphism", the assumption is you're talking mathematically, and this is not a mathematical isomorphism (though I wonder if one could look at the denotational semantics for Scheme and come up with a useful isomorphism to lists under car/cdr/cons and compositions thereof). Better to stay away from technical terms if we're not going to use them for their defined meanings. -- JonathanTang


Unless you can prove the two value-spaces are distinct by some firm, deductive, boolean property, issues of 'isomorphism' are rather moot - or, rather, you'll have failed to prove that 'identity' is not already a correct mathematical isomorphism. So the important question is: given a particular context (e.g. a programming language), how are 'data' and 'code' to be formally distinguished?

Obviously both data and code are described by 'values': strings, numbers, graphs, sets, vectors, matrices and the like - immutable and complete feature-descriptions that are themselves encoded into a finite 'representation' that is ultimately reflected in the physical world as transistor-states or magnetic anomalies on a spinning disk or electromagnetic distortions in space-time or lines of ink upon stacks of processed and firmly-pressed cellulose. Obviously, values of both 'data' and 'code' are communicated and can possess meaning to an interpreter based on the context in which they were received: commands, statements of truth, permits, requests, inquiry, indicators of authority, authentication, etc. "GET <URI>" received on port 80 has a certain meaning, as does assigning a process-counter to a location that is a massive bit-vector within memory. Both data and code receive contextual interpretation by any agent that processes them. So, attempting to distinguish the two on the basis of the mechanism of encoding, the possession of context-dependent semantics, etc. would be largely futile.

My own proposed attempt at distinguishing 'data' and 'code' - based on the only sensible feature I can identify that is actually capable of receiving formal distinction - is that 'data' evaluates in the current programming-language context with complete ReferentialTransparency: no side-effects and no dependence upon any side-effects.

In this sense, 'data' would include not only strings and numbers, but also the results of lazy-evaluations (if they were 'pure' functional) and decompressing strings and translating between one representation of data and another, and the general extents of KolmogorovComplexity. 'Data' still has semantics based upon the context in which it is found or communicated (e.g. representing propositions, statements of authority, names and identifiers, attributed values, etc.). Of course, by this definition any program in a pure-functional programming language (considered independently of its input) is just one big data-description with considerable internal decompression semantics.

'Code', then, would be anything that either causes or depends upon side-effects in order to be processed. E.g. asking a shared variable or object for its current state would always be 'code' because that state can change from one call to the next; something that, when processed in its current context, prints "Hello, World!" to the screen would also be code because it alters the world in a significant manner. These might even be GoodDefinitions, given that they make an applicable and useful distinction that is largely consistent with what most people mean by 'code' vs. 'data'.
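A rough Python sketch of the proposed distinction (the function names squares, next_ticket and greet are hypothetical, chosen only for this illustration): the first function would count as 'data' under the definition above, the other two as 'code'.

```python
import itertools

# 'Data' under the proposed definition: evaluation causes no side-effects
# and depends on none. The same expression yields the same value every time.
def squares(n):
    return [i * i for i in range(n)]

# 'Code' under the proposed definition: the result depends on shared state
# that changes from one call to the next...
_counter = itertools.count()

def next_ticket():
    return next(_counter)

# ...or evaluation causes a side-effect (writing to the screen).
def greet():
    print("Hello, World!")

same = squares(4) == squares(4)            # referentially transparent
differs = next_ticket() != next_ticket()   # state changed between the calls
greet()
```

Python itself does not enforce this boundary; the classification here is by inspection of what each function does.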

Obviously, by these definitions DataAndCodeAreNotTheSameThing. Even this topic's title would need to be rejected: DataNeverEqualsCodeInAnySingleGivenContext, though both lazy nonce thunks with side-effects (code that describes how to get the data but only runs once) and caching (data with representation and position maintained by code for optimizations) would come very close to the conceptual borderline. The primary confusion would arise from the fact that values are often passed, as messages, from one context to another: it would be correct to claim that ValueBecomesDataOrCodeDependingOnContext.
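The "runs once" thunk near the borderline can be sketched as follows (a minimal Python illustration; OnceThunk and load_config are hypothetical names, and the list named log stands in for any observable side-effect). The first force behaves like 'code' under the definitions above; every later force behaves like 'data'.

```python
class OnceThunk:
    """Hypothetical helper: runs compute at most once, then acts like a stored value."""

    _UNSET = object()

    def __init__(self, compute):
        self._compute = compute
        self._value = OnceThunk._UNSET

    def force(self):
        if self._value is OnceThunk._UNSET:
            self._value = self._compute()  # side-effects happen here, once
            self._compute = None           # the 'code' part is discarded
        return self._value

log = []

def load_config():
    log.append("loaded")  # the observable side-effect
    return {"port": 80}

config = OnceThunk(load_config)
a = config.force()  # first force: has a side-effect
b = config.force()  # later forces: no side-effect, same value
print(a, b, log)
```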

If this were the distinction used, one would reject that data and code were isomorphic: it would be impossible to define a reversible, mathematical transformation function that gives data the essential properties of code (that essential property being side-effects) and vice-versa. After all, mathematical functions cannot add side-effects to a given value, and mathematical functions over any given value cannot transform the context of that value. But one might still create homomorphisms with (value,context) pairs, somehow transforming some data-values to code-values, some code-values to data-values, and transforming the context/language to something that can work with the shift and possess the same side-effects. (Such efforts might even be useful for security purposes - executing code on untrusted machines, blind signatures, etc.)


Last edited July 9, 2010.