Separation Of Data And Code

The practice of keeping "code" - instructions for some machine, whether a microprocessor, a VirtualMachine, or a scripting-language interpreter - distinct from data. This is often done for security reasons, to prevent untrusted code (which might compromise a machine) from being executed.

The degree of separation, and the reasons for it, range from hardware-level separation (e.g. to prevent buffer overflows from mucking with the machine instructions, and possibly to improve cache efficiency, since code supposedly changes more slowly) to higher-level separation (e.g. keeping the data in databases, both to secure it better and because data tends to outlive applications - in this view, it is the data that changes more slowly).

The discussion below doesn't focus much on the fuzzy logical distinctions that might exist in any attempt to define 'code' as distinct from 'data'. That's probably a good thing - the wiki will be better off if it keeps such arguments centralized to DataAndCodeAreTheSameThing.


I am one who agrees with the separation. Keeping data and code apart allows one to perform "magical" operations and view-management techniques on the data that mixing the two may hinder (OoLacksMathArgument). I agree that there are minor drawbacks, but IMO the benefits outweigh them.

Data can more easily be shared between different languages and applications if "data space" is considered separate from "behavior space". Further, if you let a single language or application dish out the data, then you generally end up reinventing a (poor) database from scratch to serve it. Even if you disagree with me, my viewpoint might help you understand somebody who is used to the separation. The details of my opinion on this start at http://geocities.com/tablizer/whypr.htm and http://geocities.com/tablizer/core1.htm

Your notion of ControlTables seems to directly contradict this claim.

Good observation. Control tables by definition are not required to contain code (though I admit I have not been clear on that definition), and in practice I don't put code in them very often. I tend to do it more often in debates, because the examples OO proponents choose tend, IMO, to over-emphasize the likelihood of a method being one-to-one with a table row; in practice those one-to-one relationships drift apart. However, if file systems were relational instead of hierarchical, then we might have a whole new ball game on our hands. Even when I do it, the contents tend to be function calls or names only.

Even if one did go hog-wild with code in the tables, the attributes are still processable via regular queries from other languages. Sharable and queriable attributes are the most important thing. The OO approach tends to pull both attributes and algorithms into the realm of one language. Putting code in tables does not change the ability of other languages and paradigms to access the attributes via relational formulas and database tools. IOW, I am really proposing the DB-ification of attributes rather than a separation per se; where the code goes is of secondary importance. Perhaps if databases were better able to handle code, or the practice were more accepted and integrated into tools, then I would not propose a separation. (FileSystemAlternatives)
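
A minimal sketch of what I mean, in Python (all table names, fields, and handlers here are hypothetical): the table holds plain attributes plus a handler *name*, so anything that can query the rows gets at the attributes, and dispatch through the name is a thin, secondary layer.

 # A control table: plain rows of attributes, queryable by anything
 # that can read the table. The "code" column holds only a function
 # NAME, not code itself. (Names and fields are hypothetical.)
 fee_rules = [
     {"category": "book", "tax_pct": 0.0, "handler": "flat_fee"},
     {"category": "food", "tax_pct": 2.5, "handler": "flat_fee"},
     {"category": "wine", "tax_pct": 9.0, "handler": "luxury_fee"},
 ]
 
 def flat_fee(price, rule):
     return price * (1 + rule["tax_pct"] / 100.0)
 
 def luxury_fee(price, rule):
     return price * (1 + rule["tax_pct"] / 100.0) + 5.00
 
 HANDLERS = {"flat_fee": flat_fee, "luxury_fee": luxury_fee}
 
 def charge(category, price):
     # Attribute lookup is an ordinary query; dispatch is a thin layer on top.
     rule = next(r for r in fee_rules if r["category"] == category)
     return HANDLERS[rule["handler"]](price, rule)
 
 print(charge("wine", 20.00))  # 26.80

Another language could still run the equivalent of "SELECT category, tax_pct FROM fee_rules" without knowing or caring what the handler names mean.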


I must work on some unusual systems, then. In most of the software I have been involved with, the code works directly on the data, and the only data in the system is the data that the code operates upon. Perhaps you could provide some examples where the code and data are not coupled?

{What do you mean by "coupled" here? - "Data and code are separate"}

In most applications I work on, it is hard to know ahead of time exactly who or what is going to use which data. Thus, keeping the coupling a bit loose is a good thing in most cases.

If you do not know who is going to use the data, why is it provided? What basis was used to determine which data is to be captured and saved and which data is not of interest?

Generally, if somebody needs a data item that must be persisted and/or shared, then it is added to the DB. I suppose there is an assumption in DB thinking that something which is persisted is also more likely to be shared. It becomes "a fact about the NounModel" of the domain.


I am wondering. Is encapsulation a language-centric and/or an algorithm-centric concept? Perhaps this is a fundamental philosophical battle over whether data should be king or behavior should be king?

I am beginning to warm up to the idea that this is the key difference between OO fans and RelationalWeenies. OO'ers prefer to view everything from a behavioral standpoint. Encapsulation (GateKeeper) is, for the most part, putting a behavioral wrapper around state (data). Relational fans, however, see data, or things treated as data, as more dynamic and virtual: you can apply relational "math" to data (see OoLacksMathArgument). To us, encapsulation busts the ability to apply our "relativity engine" to the data to get ad-hoc or unanticipated views of it. Whether this makes data more vulnerable or "naked" is a complex issue; we have to ask who we are protecting it from, and why.

Maybe there is a trade-off. Maybe we are trading protection for dynamism - making "writers" more dangerous in order to get more flexibility for "readers". Just a thought. I don't agree yet that we are making data "vulnerable". We could use stored procedures as the only write access point if a single writing point is what is truly desired, and there are triggers etc. for validation (see the sketch below, and GateKeeper for more on this). It is not so much the ability to protect that differs, but what gets emphasized or catered to more heavily.
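
A rough Python sketch of that GateKeeper arrangement (all names here are hypothetical): reads stay open and ad hoc, while every write funnels through one validated entry point, much as a stored procedure or trigger would enforce.

 # GateKeeper sketch: the "table" is openly readable, but every write
 # goes through one validated entry point, like a stored procedure
 # guarding a table. (All names here are hypothetical.)
 _employees = []  # the "table"
 
 def query(predicate):
     # Readers get at the raw rows; ad-hoc views cost nothing extra.
     return [row for row in _employees if predicate(row)]
 
 def insert_employee(name, salary):
     # The single write point, playing the role of a trigger/stored proc.
     if not name:
         raise ValueError("name is required")
     if salary < 0:
         raise ValueError("salary must be non-negative")
     _employees.append({"name": name, "salary": salary})
 
 insert_employee("Alice", 52000)
 print(query(lambda r: r["salary"] > 50000))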

A data-head will view the data as the interface or "glue" between behavioral components, while a behavior-head seems to prefer that behavior be the interface or glue, hiding data underneath. There seems to be a bias among the behaviorists, who see data as an "implementation detail to be wrapped away" (encapsulated). This makes me feel that they don't "get" the data viewpoint: data *is* the interface. The debate is not so much about separation of data and code as about whether data or behavior is the key interface.


If you truly want to SHARE data across processing domains, you need the data to be represented (for management purposes) in a model that's independent of the processing domain. A relational data model is a SharedKernel representation that can feed multiple, unrelated BoundedContexts. In this view, relational databases are an AnticorruptionLayer. --StuCharlton


One of the advantages of *not* exposing everything as raw data is that you can pre-process things to prevent obviously (or non-obviously) inaccurate results. A guy I worked with once was creating a very large report and managed to miss a condition on a join. That is perfectly acceptable under relational theory but, in this case, ended up generating a report that was over 70,000 printed pages long. Focusing on the behavior, rather than just tossing data out there to be manipulated, helps protect against problems like this. That's a gross example; there are lots of subtler ones - Oracle's operator overloading with regard to date comparisons, for example, can easily produce subtly wrong output. The focus on behavior means that you also focus on correct, allowable behavior, and the data manipulations that create that behavior are an implementation detail. I may be confused because I have a hard time grokking this so-called data-centric view; to me, standalone data is unimportant and irrelevant - it's only important in the context of what's done with it, which is behavior.
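
To make that failure mode concrete, here is a toy Python rendering of a join (tables and fields are invented for illustration): a join is just a filtered cross product, so dropping the predicate hands back every pairing, which relational theory considers perfectly valid.

 # A join is a filtered cross product; forget the filter and you get
 # every pairing. (Tables and fields are hypothetical.)
 orders    = [{"order_id": i, "cust_id": i % 3} for i in range(1000)]
 customers = [{"cust_id": c, "name": "cust%d" % c} for c in range(300)]
 
 def join(left, right, on=None):
     return [
         {**l, **r}
         for l in left
         for r in right
         if on is None or on(l, r)   # no predicate -> cross product
     ]
 
 good = join(orders, customers, on=lambda l, r: l["cust_id"] == r["cust_id"])
 oops = join(orders, customers)  # the missed condition
 
 print(len(good), len(oops))  # 1000 rows vs. 300000 rows

The 300,000-row result is the toy version of that 70,000-page report.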


This is something I've been thinking about recently, as a result of developing a new framework at my job for use in machine-vision measurement applications. My thoughts on the subject are still somewhat nebulous, but I wanted to throw it on the table. What if one were to develop a uniform, behavior-neutral data representation (maybe based on JSON, XML, HDF, or perhaps a table-like structure) that could hold "any" data, with appropriate labeling and indexing, and then develop behavior objects that, rather than containing data specific to those behaviors, could be passed a reference to this "universal data structure" and act appropriately based on its contents? This way, the data would remain distinct from the OO code, but the appropriate behaviors could be applied when desired? -- MikeSmith
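
A rough Python sketch of that idea (the names are mine, not any real framework): a neutral, labeled structure carries the data, and behavior objects hold no data of their own - they are handed a reference to the structure and act on the labels they understand.

 # Behavior-neutral data: a labeled, nested structure (could as easily
 # be parsed from JSON/XML/HDF). Behavior objects hold no data of their
 # own; they are passed the structure. (All names are hypothetical.)
 measurement = {
     "part_id": "A-1138",
     "points": [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)],
     "units": "mm",
 }
 
 class RangeCheck:
     def __init__(self, label, low, high):
         self.label, self.low, self.high = label, low, high
 
     def apply(self, data):
         # Acts only on the label it knows; ignores the rest.
         return all(self.low <= y <= self.high for _, y in data[self.label])
 
 class Report:
     def apply(self, data):
         return "%s: %d points (%s)" % (
             data["part_id"], len(data["points"]), data["units"])
 
 for behavior in (RangeCheck("points", 0.0, 3.0), Report()):
     print(behavior.apply(measurement))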


Lisp fans are gonna have a fit over this.

No, they won't. We're talking about this at the hardware level, way down in the guts. The idea of this system is to prevent people from filling the heap, or overflowing the stack, with an "egg" (ExploitCode?) and then jumping to it. There is almost no reason to ever execute code from the heap or stack of an application these days (although, once upon a time...) unless you're doing embedded development, in which case it's a different ballgame and YMMV.

Not that it provides 100% security. Someone can still overflow the stack and overwrite the frame's return address to jump to the code for a standard library call (a return-to-libc attack), which is pretty wicked.


This is a hack to work around language/implementation design flaws, especially in 'C'. As noted above, a lisp system (or other environment) may very well want to generate new code and execute it at runtime. Just because this is unusual in mainstream languages at the moment does not mean it isn't a valuable technique....

The way that every lisp I know of implements that kind of code generation is different from what we're talking about here. So don't get up in arms about the concept. Besides, it's supposed to be possible in these systems to explicitly "bless" specific blocks as executable without opening up the rest of the blocks.
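
For what it's worth, here is roughly what that "blessing" looks like from user space, sketched in Python with ctypes and mmap. This only works on Linux/x86-64, and the machine-code bytes and flags are specific to that platform - treat it as an illustration, not a portable recipe.

 # "Blessing" a block: ask the OS for a mapping that is explicitly
 # executable, copy machine code into it, and call it. Everything
 # else on the heap stays non-executable. (Linux/x86-64 only.)
 import ctypes, mmap
 
 code = b"\xb8\x2a\x00\x00\x00\xc3"  # x86-64: mov eax, 42; ret
 
 buf = mmap.mmap(
     -1, len(code),
     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
     prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
 )
 buf.write(code)
 
 # Treat the mapped block as a C function returning int and call it.
 func = ctypes.CFUNCTYPE(ctypes.c_int)(
     ctypes.addressof(ctypes.c_char.from_buffer(buf)))
 print(func())  # 42

On systems that forbid mappings that are writable and executable at once, you would map the block read-write, copy the code in, and then flip it to read-execute with mprotect - that flip is exactly the "bless" step.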

Fair enough, if there is a mechanism to allow for it in systems that need it. This approach is still a kludge.

Like I said, though, most Lisp systems use approaches unaffected by this restriction. There is also an optimization issue: jumping into the heap is the equivalent of grabbing your cache prediction, throwing it against the wall, then beating it with a baseball bat that's on fire - you're going to lose a bunch of time fetching and loading that code. These are not things you want to do.

Ultimately, you're right. It is a kludge to help minimize the impact of mistakes. And as I said before, you can still just jump to a libc function (they're in every program you can think of) - say, exec() - and load the stack appropriately. Until we can write bug-free software, kludges like this are probably necessary. Nothing makes Lisp less vulnerable to these kinds of bugs: I know SBCL and OpenMCL have had overflow bugs found in them that, had they been exposed in a server interface, could have led to a BufferOverflow condition.

True enough, although it is quite unfair to say that Lisp or Java (or many other languages) aren't *less vulnerable* by design. C's string functions, especially, are just a lousy design as far as security goes. Implementation flaws are a separate issue.

To the person whose security is compromised, it doesn't matter where the bug lies. What matters is that there was a bug that allowed for an attack. People who use Lisp in network services or tools should be aware that just because their language helps prevent buffer overflows doesn't mean that their implementation will. There may not be much you can do about it, but being mindful of it and designing things with this in mind should help.

Sure, but I just meant that this sort of problem is endemic in C/C++ programs, and quite unusual in lisp programs. Furthermore, we shouldn't conflate the possibility of implementation errors (which developers *should* be aware of, I agree) with fundamental language design flaws.


Derives from the PrincipleOfLeastPower and the PrincipleOfLeastAuthority.

See Also: LearningObjectOrientedProgramming, TableOrientedProgramming, PeopleWhoDontGetOo, DataCentricThinking

For an opposite view: DataAndCodeAreTheSameThing

