Object Identity Discussion

Here's the raw material of the controversy related to RelationalHasNoObjectIdentity. Refactoring in progress. See also ObjectIdentity.


I found this assertion in CrossingChasms:

There is no true object identity. You must always work with an object that contains a copy of persistent data.

I quote the second page verbatim--it gives no further explanation. This is the easiest thing to refute about ObjectRelationalImpedanceMismatch.

I'll ask KyleBrown, or anybody who can help to further refine the definition of this problem.


To an application programmer using an ObjectOrientedProgrammingLanguage, the true identity of an instance of a class in the address space of the process running the program is derived from the memory location where that instance's state is stored.

No, this is confusing the semantics with a possible implementation. (Not a very common one, since most OO language implementations use copying GarbageCollectors.)

See the JDK 1.3 JavaDoc for java.lang.System.identityHashCode(Object object) and java.lang.Object.hashCode(), for example. This identity is not a function of the values of any instance variables of the instance. This is what Kyle meant. A row in a relational table is identified by the values of its columns and has no other intrinsic identity to the application programmer. This difference is one element of the ObjectRelationalImpedanceMismatch. If this is the easiest thing to refute, please refute it. --RandyStafford
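To make the distinction concrete, here is a minimal sketch in Java. The Person class and the IdentityDemo helper are hypothetical names invented for this illustration; only ==, System.identityHashCode and String.equals are real JDK features. It shows two instances with equal state that are nonetheless distinct objects, and an identity hash that does not change when a field does.

```java
// Hypothetical class, used only for this illustration.
class Person {
    String firstName;
    String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}

class IdentityDemo {
    // True only when both references denote the very same instance.
    static boolean sameObject(Object a, Object b) {
        return a == b;
    }

    // True when two Person instances carry the same field values.
    static boolean sameState(Person a, Person b) {
        return a.firstName.equals(b.firstName) && a.lastName.equals(b.lastName);
    }

    public static void main(String[] args) {
        Person original = new Person("Randy", "Stafford");
        Person copy = new Person("Randy", "Stafford");

        System.out.println("same state:  " + sameState(original, copy));
        System.out.println("same object: " + sameObject(original, copy));

        // The identity hash is tied to the instance, not to its fields.
        int before = System.identityHashCode(original);
        original.lastName = "Changed";
        System.out.println("identity hash stable: "
                + (before == System.identityHashCode(original)));
    }
}
```

A relational row has only the first kind of sameness available to it; the second kind exists only inside the running program.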

If you have an object, an instance of class Person with attributes firstName=Randy, secondName=Stafford, ... SSN='xxx-xx-xxxx' and so on, would your choice for identifying it be "an instance of Person at address 0xaabbccdd"? -- CostinCozianu 2001/07/23

My choice is not the point. The point is the traditional definition of "object identity". --RandyStafford

I trusted you to avoid presenting your choice because you knew that if you chose 'an instance of Person at address 0xaabbccdd', it would have been a design mistake. The traditional definition of object identity is really equivalent to a pointer to a memory area where we store object data and a pointer to a virtual table or an equivalent mechanism that helps us do dynamic method dispatch. The object identity, aka the pointer, has a good purpose in areas other than logical data modelling for large shared databanks, and this has been known to be true ever since DrCodd invented the relational model.

As an example, see what a mess you got into below by choosing the wrong implementation (java.util.Vector vs. java.util.TreeSet) and doing the modeling at an implementation level. A relation such as the one between TaxPayer and TaxEvent can be implemented in a number of different physical data structures, but a developer of a client application should not have to worry at all about the exact physical implementation of a relation, as long as he knows that logically he is dealing with a relation at a conceptual level. The use of an object identity, i.e. a pointer, necessarily belongs to the implementation level. -CostinCozianu

I agree with your points about the definition of "object identity" being equivalent to "pointer", and the usefulness ("good purpose") thereof, and hiding of implementation, but I do not at all see how I "got in a mess" or chose "the wrong implementation". A Vector suits the purpose just fine. --RandyStafford 2001/07/26


Disclaimer - I don't support this approach

Object identity can exist in RelationalDatabases. Instead of using the programming language's support for instance variables, always use SQL SELECT and UPDATE statements to access and mutate instance variables. That way, access to the instance variables is guaranteed to be transactional and, since keys don't change, you can be guaranteed that everyone is operating on the same object.

Of course, this is also an excellent way to bring your network administrator to tears of frustration and your local Sun/IBM/Compaq hardware rep to tears of joy -- MarkAddleman

Aha. Go server-side. This can be done but it implies doing all the mapping up front such that methods run on tuples. TopLink does this. It is a painful thing. Any variant entity creates serious issues of performance simply because successive reads of the database are a heavyweight operation. This is okay when a bunch of objects are involved (indeed recommended) but it is not a happy thing. The mismatch may also be pushed into the code itself by limiting the structure of objects to be isomorphic with a relational database, but this is just unacceptable to many teams. They want collections and trees to be normal, not some twisted relational structure (see TreeInSql for an example heavyweight transformer). --RichardHenderson

How does TopLink map methods to tuples in the general case? I would think you'd have to give it a LOT of meta-information to describe how instance variables in one class relate to instance variables in another... --MarkAddleman

Yes you do. I have seen two versions. One that involved a modeling tool to help define the metadata. The other seemed to be a heavyweight mapping layer for EJBs (which have their own version of metadata). I think there is a discussion around here somewhere on how successful they have been.

The experience I discussed on TopLinkForJavaUsageExperiences didn't involve EntityBean's - it just involved using TopLink to map plain old DomainObject's. And in that experience we weren't mapping "methods" to tuples; just instance variables. Be all that as it may, the pain I described on that page is clear evidence of the ObjectRelationalImpedanceMismatch. --RandyStafford 2001/07/23


First of all, the quote above contains a glaring logical mistake.

With any kind of database, one ALWAYS operates with a copy of the persistent data. Be it object oriented, relational, structured programming, Lisp, you name it: you can't operate "directly" on the physical persistent data, because that data stays on persistent storage. You always operate with a copy, and that copy is handed to you by the database system upon request.

What I think KyleBrown wanted to express is that perhaps in his view it would be desirable (and some object databases and some dumb middleware/application servers really do it) that the whole system (database + client, or database + middleware + client) will make sure auto-magically that at most ONE copy of the "persistent data" will exist in cache at any given time, thus preserving the familiar OO concept of object identity.

You are wrong about what Kyle wanted to express, and it is highly inappropriate, in fact intellectually dishonest and very arrogant, for you to put words in his mouth and thoughts in his head to serve your own derisive arguments. --RandyStafford 2001/07/23

The thing that I put at the beginning of this page is a faithful (copy and paste) citation from CrossingChasms, and it is at best an obvious tautology that expresses no idea at all. Maybe he is in a position to clarify, or maybe he no longer supports it. So I don't put words in his mouth, but I have to interpret what he is most probably trying to say. Do you know better what he was trying to say? -CC

Oh come on. Of course Kyle's quote expresses an idea: it expresses the exact idea that I clarified above, in my first contribution to this page. KyleBrown is a friend of mine and, yes, I know that the idea I clarified above is what he intended with that quote. --RandyStafford

I contend that enforcing this particular approach for the general data access patterns is one of the biggest design mistakes in programming database applications. But I am waiting, hopefully, for KyleBrown to tell us what his real take on the subject is. --CostinCozianu

Strictly speaking, this approach is not supported by the relational model itself, but current relational systems allow you to do just that if you want to. I strongly contend that one should not use this feature of current database implementations.

So in no way do we have to deal with an impedance mismatch here. First, it is currently supported by relational database products, and second, it is a design mistake in the great majority of cases.


Last but not least, the identity of a single copy of a particular "object" in the client "cache" (client memory space) has nothing to do with the relational model per se. The model just doesn't deal with it; it is an implementation issue, and most database products of today allow you to achieve that (they allow you to shoot yourself in the leg if you really want).

No matter whether it is a valid design or not, this issue cannot be the subject of ObjectRelationalImpedanceMismatch. Object identity can exist only within the client application space.

The right thing that you want to have in your design is the logical identity, which is given by the value of the primary key. And you can take this for granted in a relational database, it does enforce logical identity.

Database implementations of today give you one more possibility to shoot yourself in the leg, by making it possible for you to declare a table without a primary key, therefore losing logical identity. However, this is outside the relational model, and it's not worth discussing all kinds of possible design errors here. See AnIntroductionToDatabaseSystems for the whole discussion.


Consider the following DomainModel:

 import java.util.Iterator;
 import java.util.Vector;

 class Taxpayer {
     private Vector events = new Vector();
     private String name;

     public Taxpayer(String name) {
         this.name = name;
     }

     public void addEvent(TaxEvent event) {
         events.add(event);
     }

     private boolean eventsContainsAuditEvent() {
         for (Iterator i = events.iterator(); i.hasNext(); ) {
             if (((TaxEvent) i.next()).isAuditEvent()) {
                 return true;
             }
         }
         return false;
     }

     private boolean eventsContainsFilingEvent() {
         for (Iterator i = events.iterator(); i.hasNext(); ) {
             if (((TaxEvent) i.next()).isFilingEvent()) {
                 return true;
             }
         }
         return false;
     }

     public boolean hasBeenAudited() {
         return eventsContainsAuditEvent();
     }

     public boolean hasFiled() {
         return eventsContainsFilingEvent();
     }
 }

 class TaxEvent {
     public boolean isAuditEvent() {
         return false;
     }

     public boolean isFilingEvent() {
         return false;
     }
 }

 class AuditEvent extends TaxEvent {
     public boolean isAuditEvent() {
         return true;
     }
 }

 class FilingEvent extends TaxEvent {
     public boolean isFilingEvent() {
         return true;
     }
 }

Note that there are no instance variables declared in the TaxEvent hierarchy. I could build a working application with this DomainModel in GemStonej with no further modifications. This is the SimplestThingThatCouldPossiblyWork. I could build a UI that displays all Taxpayers, allows the user to add TaxEvents, and allows the user to see which Taxpayers have filed, which have been audited, and which have done neither.

But I could not map this DomainModel to a relational database without modifications, because there are no instance variables in the TaxEvent hierarchy to translate to columns in a relational table. There is nothing to store other than the basic intrinsic identity of the objects, which includes the identity of their class, which is what allows the UI to display which Taxpayers have suffered which events.

I'm not interested in critiques of whether this is "good design" or not, and I'm not interested in other ways of designing it so that relational persistence could be used. That's not the point.

The point is that this simple example shows the power of object identity, which is leverageable with an object database, but not (in this case) with ObjectRelationalMapping. This difference in power is one of the elements of the ObjectRelationalImpedanceMismatch.

--RandyStafford


With any kind of database one ALWAYS operates with a copy of the persistent data

"I contend that enforcing this particular approach [preserving the familiar OO concept of object identity] for the general data access patterns is one of the biggest design mistakes in programming database applications"

Please enlighten us as to why this is "one of the biggest design mistakes". I beg to differ. One of the nicest features of GemStone is that it provides the appearance of there being only one copy of a persistent object (although under the hood a disk page containing the object, and others, is copied from disk into a memory region shared by Java VirtualMachines). This is nice because it makes (optimistic) concurrency conflict detection very simple. If two clients attempt to change a Taxpayer's name concurrently, the first transaction to commit will succeed and, unlike in an RDBMS, the second will suffer a "write-write conflict". No additional work is required to acquire locks, or declare timestamp or version attributes to select against. Once again, it's the SimplestThingThatCouldPossiblyWork.

I'm not interested in didactic, pedantic, theoretical, academic arguments. I'm interested in pragmatic expedients so that I can spend my time solving business problems. Leveraging object identity is a pragmatic expedient. The relational model doesn't support intrinsic object identity as it exists in ObjectOrientedProgrammingLanguages. On the premise that you want to implement a DomainModel in an ObjectOrientedProgrammingLanguage, this lack of support is one of the elements of the ObjectRelationalImpedanceMismatch.

--RandyStafford

Why?

Your inexperience at building modern information systems is showing here. Of course it is the application's business to implement concurrency policy. For scalability reasons, it is very typical to implement modern information systems with short transactions - meaning that data to be displayed on a user interface is typically read in one transaction, and written in a separate transaction. Before doing the write, one has to check if the as-read data is now stale vis-a-vis the committed state of the database, assuming concurrency conflict detection is critical in the application scenario (as it is for a reservation system, for example). There are a number of strategies that are typically employed to implement this detection - last update timestamps, version numbers, etc. All of these strategies show up in application code - in DomainObject instance variables, in relational schemata, and, most importantly, in extra application logic, typically in the presentation layer, to deal with transactions that fail due to concurrency conflict.

This is not to say that it should be the application code's business to provide mechanisms (such as locking and conflict detection between concurrent transactions) that are already provided by third party products (e.g. application servers, database management systems, object/relational mapping layers) upon which the application is built - clearly it shouldn't be. But those mechanisms are insufficient to meet all the needs of a modern information system with concurrency conflict detection requirements, a short transaction architecture, and a user interface. --RandyStafford 2001/07/23.

Well, if you are so experienced and I am so inexperienced, then you would know that such an approach is easily supported by relational databases: Update ... Set ... Where Key = :myKey and Version = :myVersion. This is a no-brainer. However, the mere fact that an entity's information has not changed is NOT enough to always guarantee integrity. Moreover, transactions often mean a lot more than "update this entity if nobody else has changed it", and in that case application logic might have to do reads (optional, but not with the sole purpose of displaying data on the end user's screen) and writes over several entities, possibly of different types, WITHIN a single transaction boundary. --CostinCozianu 2001/07/23
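The version-check idiom under discussion can be sketched without a real database. The following is an in-memory stand-in only: VersionedTable, Row and updateName are hypothetical names, and the map plays the role of a table with a version column. The updateName method mimics the conditional UPDATE by returning the number of rows affected, which is what application code would inspect to detect a stale read.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a table with a version column (an assumption, not a real DBMS).
class VersionedTable {
    // One row: a name plus the version number it was last written with.
    static final class Row {
        final String name;
        final int version;

        Row(String name, int version) {
            this.name = name;
            this.version = version;
        }
    }

    private final Map<String, Row> rows = new HashMap<String, Row>();

    synchronized void insert(String key, String name) {
        rows.put(key, new Row(name, 1));
    }

    synchronized Row select(String key) {
        return rows.get(key);
    }

    // Mimics: UPDATE ... SET name = ?, version = version + 1
    //         WHERE key = ? AND version = ?
    // Returns the number of rows affected: 1 on success, 0 if the version is stale.
    synchronized int updateName(String key, String newName, int expectedVersion) {
        Row current = rows.get(key);
        if (current == null || current.version != expectedVersion) {
            return 0;
        }
        rows.put(key, new Row(newName, current.version + 1));
        return 1;
    }
}
```

If two clients both read version 1, the first update returns 1 and the second returns 0. Reacting to that 0 is the extra application logic the two sides disagree about.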

and Version = :myVersion

Thank you for helping me make my point. That, Costin, is application code which is not necessary with GemStone for optimistically detecting concurrency conflict. And I agree with you that transactions typically write multiple objects at a time - that is one reason why ObjectRelationalMapping incurs additional response time, due to all the SQL traffic required to sync the database for each object written in a transaction. --RandyStafford 2001/07/26

Look again. There is nothing in my example that "forces" client applications to share memory. Shared memory between VirtualMachine's is a feature of GemStone that provides for more scalable performance than can be achieved by competing approaches. I trust it doesn't come as a surprise to you that Oracle uses shared memory within its process architecture as well. Please do elaborate on how this is "a straight DUMB idea". --RandyStafford 2001/07/23

Oracle uses shared memory ONLY for its internal structures, at an implementation level; otherwise, considering your experience, you probably already know that different Oracle clients do not physically share data areas. It is pretty dumb because it forces me not only to use Java, but also to use GemStone's modified JVM. I guess you have no major objection against DatabaseApplicationIndependence; you give it up because you think you have other advantages. -CostinCozianu

GemStone uses shared memory ONLY for its internal purposes, at an implementation level. There's nothing about using shared memory that forces you to use Java. Yes, if you use GemStonej, then you use Java as your "data manipulation language". Are you saying that using Java in general is "dumb"? Several million downloaders might disagree with you there. As for using GemStone's modified JVM, so what? It's never caused me any problems (except for the fact that TopLink's WeakIdentityMaps don't work with any HotSpot VM, whether from GemStone or JavaSoft - but then, that's another example of the ObjectRelationalImpedanceMismatch). GemStone is a Sun Microsystems JVM source code licensee - their JVM is the JDK JVM, with extra (and very useful) functionality patched in.

And I do have an objection against DatabaseApplicationIndependence - I don't think it works in practice. I think that over time within an enterprise, attempting to achieve DatabaseApplicationIndependence is what results in fragmented, unmaintainable schemata, and databases whose data integrity ends up being enforced from within byzantine application logic that has grown by accretion over the years. I would tend to side with RalphJohnson on this point. And even if I did believe that DatabaseApplicationIndependence was a worthwhile, achievable goal, you're right - I do give it up because I think I have other advantages. My experience has been that getting software into production, on schedule, is hard enough to achieve that we should not burden ourselves with additional challenges such as overcoming the ObjectRelationalImpedanceMismatch. I feel that there are other, more workable approaches to maintaining data consistency within an enterprise's collection of databases that are used by its applications. --RandyStafford 2001/07/26

Oh, no thank you. I know as much about concurrency as I've needed to know to be successful at building systems over the years before requiring correction by your definitive views on the subject. I refer you, for example, to my MultiplexedSessions PatternLanguage at http://wiki.cs.uiuc.edu/VisualWorks/DOWNLOAD/papers/int_vwgs2.zip. And I disagree with your assertion that preserving object identity limits a database [management system's] choice of transactional concurrency policy. GemStone preserves object identity and allows the application programmer to choose between optimistic and pessimistic concurrency control, with multiple types of locks available in the pessimistic case, and with multiple types of conflicts detected in both cases. --RandyStafford 2001/07/23

I'm glad that you referred me to your paper; however, I'd like to say that your paper has no formal definition of what makes an execution history of concurrent transactions correct. Moreover, you don't quote even a single reference on databases or transactions. Does GemStone also allow you to request a lock on a not-yet-created instance? Well, Oracle, Microsoft SQL Server, DB2, Sybase, and the now ex-Informix do this for me automatically. -CostinCozianu 2001/07/23

It was not the aim of the paper to provide a formal definition of what makes an execution history of concurrent transactions correct. Rather, it was the aim of the paper to examine the issues involved in integrating VisualWave and GemStone in an application architecture. With that aim it references exactly what it needs to reference. The reason I referred you to it was to demonstrate to you that I am sufficiently familiar with the subject of concurrency control in information system architectures. --RandyStafford 2001/07/26

Yes, "the insanity". Relational databases let you shoot yourself in the foot more easily and more painfully than does GemStone, with their default last-in-wins semantics. --RandyStafford 2001/07/23

Well, I want to see Gemstone's first figures posted at www.tpc.org, because I don't trust you blindly. --CostinCozianu -2001/07/23

I know you're interested in audited comparative performance benchmark results. But the above exchange is about concurrency, not performance. --RandyStafford 2001/07/26

--CostinCozianu

Excuse me, but I didn't argue that a relational database is not able to detect transactional conflicts. However it has been my experience that, by default, relational databases implement last-in-wins semantics between two concurrent transactions unless application code takes measures to use locking or application-detected concurrency conflict. GemStone is able to detect write-write conflicts, read-write conflicts, and lock acquisition failure conflicts, which have been plenty sufficient for my purposes building systems over the years. Regarding your comments on other pages about me being "shy to respond" on this point, I have better things to do with my time than respond to your irrelevant and incoherent arguments that the ObjectRelationalImpedanceMismatchDoesNotExist. I need no reminding from you that WikiWritersDontGetPaid as you continue to engage in WastingPeople. But my payment is to correct your misrepresentation of the facts, based on lack of experience, about the similarities and differences between GemStone and relational databases, and how they are used, and why the ObjectRelationalImpedanceMismatch exists. Any point that you can't appreciate you deride as dumb, stupid, or bad design. You asserted that the ObjectRelationalImpedanceMismatchDoesNotExist, but you have yet to prove it. The only thing you have proven so far is that you are willing to exhibit rude behavior that is not welcome in this WikiCommunity, as you've done in this discussion and as you did on ItsTimeToDumpCeeSyntax. --RandyStafford 2001/07/23

Well, would you mind if I ask you to take your ad personam attacks to my wiki page CostinCozianu, and perhaps leave a link on each page, so we can really discuss the issue if that interests you at all? Or maybe you can create an AntiCostinCozianu page.


You're right, the above design and domain model is really not that great. I'll come back later on it if you don't want to change anything.

Although I think it is stupid, it can nevertheless be stored in a relational database in several very convenient ways, without any modification, just as you want. --CostinCozianu

OK, please show us one of these "very convenient ways", without adding any instance variables. --RandyStafford 2001/07/23. I did it below. I haven't added ANY instance variables because I don't need ANY instances at all, especially not instances without information.


In case Randy Stafford is wondering why I don't present my solution to the described problem above, I'll let him know that I can't do so because what he presented there is not a domain model.

Two weeks ago on ObjectRelationalImpedanceMismatch you said "a DomainModel, whatever that is" and now you've become such an authority that you declare the above "is not a domain model"? --RandyStafford 2001/07/23

Well, if you want to play at who is the greater authority here, I can refer you to really authoritative people who will not even bother to respond to your arguments about your domain model, considering them too obviously flawed. --CostinCozianu 2001/07/23

Therefore doing some modeling work before I have the full specification of the problem is a waste of time. What exactly is the functionality those classes are supposed to accomplish?

If the functionality is only to show which person filed taxes and which has been audited, and which did both and so on, then the biggest fault of his model is that it is redundant - no need to maintain a collection of events for that. If the application is allowed to delete events from one Person's record (as SHOULD be the case, since a business system should always be able to deal at least with human errors in data input), then the biggest fallacy of what he proposed is that an Event has no logical identity.

An event does not need a logical identity because its intrinsic object identity is completely sufficient - that's the whole point. Adding delete functionality would be trivial - it would require only the addition of a removeEvent(TaxEvent taxEvent) method to class TaxPayer, which delegates to Vector.remove(Object object), which operates on the intrinsic identity of the collection members. --RandyStafford 2001/07/23

And how do you identify that object identity to the user of your system? I guess you forgot this argument. I guess you really need to delegate to Vector.removeAll(Collection c), and not even that would be enough. Dr. Codd discussed all the issues that need to be discussed about object identity way back in 1968. I guess we haven't made that much progress if we are still discussing it now. See AnIntroductionToDatabaseSystems, FundamentalsOfObjectOrientedDatabases. -CostinCozianu 2001/07/23
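A sketch of the removal described above, assuming the Taxpayer class gains a removeEvent method. The method name comes from the discussion; eventCount is a hypothetical helper added only for the demonstration, and generics are used for brevity. Vector.remove(Object) compares elements with equals(), and because TaxEvent does not override equals(), that comparison falls back to reference identity.

```java
import java.util.Vector;

// Stripped-down versions of the classes from the model above.
class TaxEvent { }

class AuditEvent extends TaxEvent { }

class Taxpayer {
    private final Vector<TaxEvent> events = new Vector<TaxEvent>();

    void addEvent(TaxEvent event) {
        events.add(event);
    }

    // Delegates to Vector.remove(Object); with no equals() override in
    // TaxEvent, only the exact instance passed in can match.
    boolean removeEvent(TaxEvent event) {
        return events.remove(event);
    }

    // Hypothetical helper, present only so the effect can be observed.
    int eventCount() {
        return events.size();
    }
}
```

The caller has to hold a reference to the exact instance to be removed; a freshly created AuditEvent matches nothing in the collection, even though it is indistinguishable by state from the ones already there.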

Therefore the user of the hypothetical application will have NO means to identify which exact information is to be deleted.

Yes, he will - see above.

So you expect your user to see a "FilingEvent 0xaabbccdd" label in a list and identify that this is really the information to be deleted? -CostinCozianu 2001/07/23

This is a general and fundamental problem in dealing with objects that have "internal identity", while they don't have any logical identity (such as account no, and so on). The 'object identity' as is defined and used in some ObjectOrientedProgramming models (not all OO models support object identity) has no business in a valid domain model whatsoever.

Therefore RelationalHasNoObjectIdentity is not a problem but a virtue of the relational model. It may be a psychological problem for people who think in OO-only terms, and this will lead them to serious logical mistakes in data modelling and therefore in domain modelling. Please see TheThirdManifesto and AnIntroductionToDatabaseSystems for a thorough discussion of this issue.

If he will clarify his position, I'll be able to offer him one of the many possible solutions for the problem at hand. --CostinCozianu


Judo time :). Perhaps it would be more accurate if we say that objects have no intrinsic identity? It is impossible to say if an object is a copy or not unless it contains within it some logical identifier. Simple physical identity cannot do that. This is a heavy limitation and confuses physical (implicit) with logical (explicit) identity. If you want to separate these concerns, and if you want your data to be internally consistent rather than coincidentally consistent, then each object requires a logical key. Logical associations cannot rely on a memory location, so they must also use these logical keys. This is the relational model, a model where identities are explicitly defined as internal logical concepts and thus may be verifiably consistent and complete. --RichardHenderson.

Exactly right, Richard. The above model is not a model at all, if you put it like that in a database, be that an object database, you are in flagrant contradiction with DatabaseIsRepresenterOfFacts. You represented no facts at all, nothing identifiable to the end user. Let's try to see what the correct solutions are. The risk is that RandyStafford will come back and tell us that the intended use was larger and we might have needed a richer schema, but his objects with no data can't be of much use anyway. So with the reservation that a good schema can be designed only after you know its intended usage, and that storing objects SHOULD NOT BE the intended usage of any schema (not even for an OODB schema!), we'll see how easy it is to model the case presented above (pretty lousy presentation).

If we only want to find out whether a taxpayer has filed or has been audited, we can very easily use two columns, HAS_FILED and HAS_BEEN_AUDITED, both booleans, and attach them to a hypothetical TAXPAYERS table. Of course we have to assume that the database will contain information for at most one year, along with many other assumptions that stem from the fallacy that Randy Stafford wants me to do modeling without even summary information on what he wants to do.

If we are really interested in how many times a taxpayer has filed or has been audited (suppose I can file more than once, or be audited more than once), I'll instead have two columns, NR_OF_FILED_EVENTS and NR_OF_AUDIT_EVENTS, which would be enough to show me whether a taxpayer has filed at all (NR_OF_FILED_EVENTS > 0), and even how many times (essentially this seems to be the ONLY purpose for which RandyStafford wants to keep a vector of events; otherwise two events of the same type are essentially unidentifiable).

And finally, if I have more info on each tax filing and on each tax audit, I won't have ANY extra column at all, because the whole information can be deduced from where I store the primary information on filings and audits, and I DON'T WANT a denormalized schema.

In this way I am within the principle of DatabaseIsRepresenterOfFacts, while Randy Stafford is not - even if he claims he presented a domain model. As can be seen, the facts that I intend to represent in the database are not exactly what RandyStafford really wanted (the image of his precious objects stored on the disk), but they are the essential facts, they are understandable, and they are biased towards the users of my system and not towards the compiler and Java runtime.

This is the difference between an information model (domain model if it so pleases Randy Stafford) and an object model. Not all object models are proper domain models. ---CostinCozianu


The above model is not a model at all, if you put it like that in a database, be that an object database, you are in flagrant contradiction with DatabaseIsRepresenterOfFacts

I don't see that. Just because Randy's implementation does not represent facts as columns in a table does not mean that a model doesn't exist, nor does it mean that information isn't represented. As a matter of fact, I believe his model is 'normalized' in the sense that no proposition is stated multiple times and each can be represented as a tuple (given the proper mapping), thus satisfying the definition of a database in DatabaseIsRepresenterOfFacts.

--MarkAddleman

Randy's model above is denormalized because events of the same type are logically unidentifiable. How do you expect the end-user to understand the following as facts : Mark Addleman had 5 audit events , such as: 0xaabbccdd, 0xaabbccff, 0xaaeeccdd, 0xaabbc1dd, 0x2abbccdd. How is the user suppose to correct the fact that an event has been erroneously introduced twice in the system, choose an address at random ?

The key here, is that in case we decide to store the above objects as such in an OODB, then we don't represent facts about the underlying reality that we are suppose to model, we only represent facts about our internal OO runtime. Those facts have no logical identity , they do not exist in what we try to model, because object identity ultimately doesn't model anything outside a runtime, be it even runtime extended through persitence to images on disk.

The minute you want to pass the barrier of pure example (you can't do that in relational) and you want to actually represent something about what happens to the taxpayer, you'll see that you need some logically identifiable information that will constitute the logical identity of your objects. Then it would be a design flaw to give up the logical identity and base your design on the physical identity of the instance. -CostinCozianu


Are you saying that the defining strength of a RelationalDatabase is its ability to manipulate data independent of application code? ie, some user (probably a DBA) can sit down at a terminal and enter a series of SELECT, UPDATE, INSERT, or DELETE statements.

How did you make that connection? What about different applications, possibly written in different languages, some of them not OO? Shouldn't they be allowed controlled access to the facts? If you have no logical identity, what is your OO application going to present to the user, event 0xaabbccdd?

The defining feature of a rdB is relation support. Its power comes from supporting this. Randy's model above uses native references. Native references are not portable and are unsafe. I like relations. I hear tell of object associations, but if they don't follow the relational keying model then they will break. If the keys are maintained separate to the data, then data integrity is by coincidence rather than assertion.

We could add the appropriate keys to each object in Randy's model to support relational integrity. There would be: <taxpayerKey>, <taxEventKey> (I'm ignoring the polymorphism as a different problem). To relate the values we may build an explicit relation object. It would contain a set of aggregates: <taxpayerKey>, <taxEventKey>.

There are optimisations/specializations where the relationship is simpler than M:N. Specifically, in 1:N associations the relation can be concatenated with the N objects. In Randy's case, this would apply if each event belongs to a single taxpayer. Then the <taxpayerKey> would naturally go in the <taxEvent> object as its owner/parent. Thus an explicit relation object can be optimised out (saving one object/table), and the 1:N relationship is asserted at the same time. It's amazing how many OO-relational mappers screw this up because they don't know if an association is 1:N or M:N.

In Randy's example the Vector inside the Taxpayer is doing the work of the relation object, but hiding it from the Event objects. This is somewhat limiting but okay as long as you don't want to derive <Taxpayer> given a <taxEvent>. Flat-file databases have this problem too.

There's more, but I use the same keying techniques extensively in TreeInSql for those who want more information (sick puppies ;)) --RichardHenderson.
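The keying scheme described above can be sketched in a few lines of Python. This is only an illustration: the class names, the key values and the two helper functions are invented here, and it assumes that keys are simple strings and integers. The M:N case uses an explicit relation (a set of key pairs); the 1:N case folds the owner's key into the event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxpayer:
    taxpayer_key: str
    name: str


@dataclass(frozen=True)
class OwnedTaxEvent:
    """1:N variant: the owner's key is folded into the event itself."""
    tax_event_key: int
    kind: str
    taxpayer_key: str


# Hypothetical sample data.
taxpayers = {"T1": Taxpayer("T1", "Randy"), "T2": Taxpayer("T2", "Costin")}

# M:N: an explicit relation object, i.e. a set of (taxpayer_key, tax_event_key) pairs.
relation = {("T1", 1), ("T1", 2), ("T2", 2)}


def taxpayers_for_event(event_key):
    """Navigate from an event to its taxpayers, which a Vector inside Taxpayer hides."""
    return sorted(t for (t, e) in relation if e == event_key)


def events_for_taxpayer(taxpayer_key):
    """Navigate the other way, from a taxpayer to its events."""
    return sorted(e for (t, e) in relation if t == taxpayer_key)


# 1:N: no relation object is needed; the owner is read straight off the event.
owned = OwnedTaxEvent(3, "audit", "T1")
owner = taxpayers[owned.taxpayer_key]

print(taxpayers_for_event(2), events_for_taxpayer("T1"), owner.name)
```

Because the relation is held separately from both kinds of object, it can be navigated in either direction, which is exactly what the Vector-inside-Taxpayer arrangement gives up.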

The biggest problem that I see with the indiscriminate use of object-relational mappers (I don't consider SQL too tough for a developer to cope with, so I can't find enough justification for them unless they're absolutely perfect at their job) is that they very easily help you do it the object database way: store objects. That is, you have an extension of the runtime, an illusion of infinite memory like Mark suggested in ObjectRelationalImpedanceMismatchDoesNotExist, and it is easy for people to forget about the underlying logical model while concentrating on run-time implementation structures.

One of these structures is the Vector in the example above. By using a Vector, we already have a denormalized model, because we can store an event more than once (store the same object twice). On the other hand, we might store an event twice even if we declare that collection a HashSet, because even if we don't store a physical instance twice, two different object instances might actually represent the same logical identity. And whether we choose HashSet, TreeSet, or even a SortedArraySet (that's a class of mine), that is only an implementation-level choice and should absolutely not make it into the logical design. A relation is a relation, and by definition it is a set (a subset of a Cartesian product); we can think later about how we reflect that set in our OO programming LanguageOfChoice (Vector is hardly a solution here). On the other hand, if we look at the problem itself and see that our users really don't want to know any details about each individual event, but only whether or not the taxpayer has filed and/or has been audited, then having a Vector of Events is more trouble than it is worth, because two boolean variables are enough. -CostinCozianu
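The point about Vector versus HashSet can be tried out in Python, where a list plays the role of the Vector and a set the role of the HashSet. This is a sketch, not Randy's model: the two event classes and the sample date are made up, and it assumes an event is fully described by its kind and day.

```python
from dataclasses import dataclass


class IdentityEvent:
    """Event compared by object identity (Python's default for plain classes)."""

    def __init__(self, kind, day):
        self.kind = kind
        self.day = day


@dataclass(frozen=True)
class ValueEvent:
    """Event compared by its contents; a frozen dataclass gets __eq__ and __hash__."""

    kind: str
    day: str


# A list (the analogue of a Vector) happily holds the same object twice.
audit = IdentityEvent("audit", "2001-07-23")
as_list = [audit, audit]

# A set of identity-compared objects rejects the same instance, but not a
# second instance that describes the same real-world event.
as_identity_set = {audit, audit, IdentityEvent("audit", "2001-07-23")}

# A set of value-compared objects collapses all three into one logical fact.
as_value_set = {
    ValueEvent("audit", "2001-07-23"),
    ValueEvent("audit", "2001-07-23"),
    ValueEvent("audit", "2001-07-23"),
}

print(len(as_list), len(as_identity_set), len(as_value_set))  # prints: 2 2 1
```

Only the last collection behaves like a relation in the mathematical sense, and it does so because the elements are identified by their values rather than by where they live.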


Here's a go at putting the tax model into SQL. First, i'll do it as is, then i'll make some changes so it's slightly more sensible (based on some of the discussion above).

CREATE TABLE taxpayers (
    ssn  DECIMAL(9,0) PRIMARY KEY,
    name VARCHAR(64) NOT NULL
);

-- One row per audit; the row carries nothing but the taxpayer it belongs to.
CREATE TABLE audits (
    ssn DECIMAL(9,0) REFERENCES taxpayers
);

-- One row per filing, again identified only by its taxpayer.
CREATE TABLE filings (
    ssn DECIMAL(9,0) REFERENCES taxpayers
);

Events are recorded by putting appropriate events into the audits and filings tables. So, to check if someone has an audit:

SELECT COUNT(*) FROM audits WHERE ssn = <your SSN> ;

if this is 0, they do not have an audit; otherwise, it tells you how many they have. It might be possible to do this with EXISTS, but i'm no SQL wizard.
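For what it's worth, the EXISTS variant can be sketched with Python's built-in sqlite3 module. The assumptions are spelled out: SQLite stands in for whatever database is really used, the sample rows and the has_audit helper are invented, and a bare SELECT EXISTS (...) without a FROM clause is accepted by SQLite but not by every SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxpayers (
        ssn  DECIMAL(9,0) PRIMARY KEY,
        name VARCHAR(64) NOT NULL
    );
    CREATE TABLE audits (
        ssn DECIMAL(9,0) REFERENCES taxpayers
    );
    -- Made-up sample data: one taxpayer audited twice, one never audited.
    INSERT INTO taxpayers VALUES (111111111, 'Audited Taxpayer');
    INSERT INTO taxpayers VALUES (222222222, 'Unaudited Taxpayer');
    INSERT INTO audits VALUES (111111111);
    INSERT INTO audits VALUES (111111111);
""")


def has_audit(ssn):
    """Return True if at least one audit row exists for this SSN."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM audits WHERE ssn = ?)", (ssn,)
    ).fetchone()
    return bool(row[0])


print(has_audit(111111111), has_audit(222222222))  # prints: True False
```

Note that the two audit rows for the first taxpayer are indistinguishable from each other, which is the weakness of the as-is model that the revised schema below addresses.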

I'm now going to improve the model by attaching a date to each event; i'm also going to introduce primary keys for events (which thus now have 'object identity') and some event-type-specific fields for the events.

CREATE TABLE taxpayers (
    ssn  DECIMAL(9,0) PRIMARY KEY,
    name VARCHAR(64) NOT NULL
);

-- Common supertype: every event is numbered within its taxpayer.
CREATE TABLE events (
    ssn    DECIMAL(9,0) REFERENCES taxpayers,
    number DECIMAL(4,0) NOT NULL,
    date   DATE NOT NULL,
    PRIMARY KEY (ssn, number)
);

-- Subtype tables share the (ssn, number) key with events.
CREATE TABLE audits (
    ssn     DECIMAL(9,0) NOT NULL,
    number  DECIMAL(4,0) NOT NULL,
    auditor VARCHAR(54) NOT NULL,
    PRIMARY KEY (ssn, number),
    FOREIGN KEY (ssn, number) REFERENCES events
);

CREATE TABLE filings (
    ssn    DECIMAL(9,0) NOT NULL,
    number DECIMAL(4,0) NOT NULL,
    code   CHAR(12) NOT NULL,
    PRIMARY KEY (ssn, number),
    FOREIGN KEY (ssn, number) REFERENCES events
);

What i've done here is add a 'number' field to the event tables, which, together with the taxpayer's SSN, comprises a primary key. In order to preserve uniqueness of this key across all kinds of events, i've created an 'events' table; the tables for specific event types use this key both as a primary key and as a link to the events table.

You can still do:

SELECT COUNT(*) FROM audits WHERE ssn = <your SSN> ;

But now you can do

SELECT number, date FROM events WHERE ssn = <your SSN> ;

To get an overview of all the events (although, of course, all you see is the number and date - you can't retrieve type-specific details in a query like this without getting into outer joins, i think).

-- TomAnderson
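The outer-join query hinted at above might look like the following sketch, again run through Python's sqlite3 module. SQLite, the sample rows and the events_with_details helper are assumptions made for the illustration; the table definitions follow the revised schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxpayers (
        ssn  DECIMAL(9,0) PRIMARY KEY,
        name VARCHAR(64) NOT NULL
    );
    CREATE TABLE events (
        ssn    DECIMAL(9,0) REFERENCES taxpayers,
        number DECIMAL(4,0) NOT NULL,
        date   DATE NOT NULL,
        PRIMARY KEY (ssn, number)
    );
    CREATE TABLE audits (
        ssn     DECIMAL(9,0) NOT NULL,
        number  DECIMAL(4,0) NOT NULL,
        auditor VARCHAR(54) NOT NULL,
        PRIMARY KEY (ssn, number),
        FOREIGN KEY (ssn, number) REFERENCES events
    );
    CREATE TABLE filings (
        ssn    DECIMAL(9,0) NOT NULL,
        number DECIMAL(4,0) NOT NULL,
        code   CHAR(12) NOT NULL,
        PRIMARY KEY (ssn, number),
        FOREIGN KEY (ssn, number) REFERENCES events
    );
    -- Made-up sample data: one filing and one audit for the same taxpayer.
    INSERT INTO taxpayers VALUES (111111111, 'Sample Taxpayer');
    INSERT INTO events VALUES (111111111, 1, '2001-04-15');
    INSERT INTO events VALUES (111111111, 2, '2001-09-01');
    INSERT INTO filings VALUES (111111111, 1, 'FORM-A');
    INSERT INTO audits VALUES (111111111, 2, 'Sample Auditor');
""")


def events_with_details(ssn):
    """List every event of a taxpayer with whichever subtype columns apply."""
    return conn.execute(
        """
        SELECT e.number, e.date, a.auditor, f.code
        FROM events e
        LEFT OUTER JOIN audits  a ON a.ssn = e.ssn AND a.number = e.number
        LEFT OUTER JOIN filings f ON f.ssn = e.ssn AND f.number = e.number
        WHERE e.ssn = ?
        ORDER BY e.number
        """,
        (ssn,),
    ).fetchall()


rows = events_with_details(111111111)
for row in rows:
    print(row)
```

Each row comes back with NULL (None in Python) in the columns of the subtype it does not belong to, so the caller can tell a filing from an audit by which column is filled in.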

WARNING: SSN is not a good candidate for a primary key. See: http://www.cpsr.org/cpsr/privacy/ssn/SSN-addendum.html#NewDBs

SSN is a pretty bad candidate key for almost everything except tax related applications. And it is bad mainly for privacy issues: very few entities should have the right to have a hold of our SSN. Other than that, SSN is still a lot better than other forms of identifications for people. But in tax related applications the whole business logic depends on people being identified by their SSN, and there's no other acceptable way of identifying people. So, in this particular example, using SSN is quite the right choice.

It's still not the right choice -- not all "taxpayers" are individuals, thus no SSN. SSNs are not guaranteed unique nor valid. Then there's the whole issue of preferring OIDs vs keys with business meaning. --StevenNewton

Ok, let's restrict it: we're talking here about persons as taxpayers, not businesses as taxpayers. As a matter of fact we'd probably need distinct systems because the laws are very different. The individuals who don't have an SSN (my family members don't) are still able to use a taxpayerID, which has exactly the same format as an SSN. But the whole point that we've been discussing on this page was exactly that OIDs should not be used as a substitute for real identification. If you "identify" the taxpayers (meaning persons) by their system-generated OIDs, it is as if you're not identifying them at all. The simplistic "common wisdom" that you should not use keys with business meaning is generally misguided, and with very few exceptions is an anti-pattern. What we've been discussing on this page (for more than a year already :) was that the inherent "object identity" of objects (which is often replaced by arbitrary generated OIDs in systems dealing with relational databases) provides no identification at all, and thus should be considered an AntiPattern.

Let's take for example two people whose assigned SSN/taxpayerIDs collide because of an error (or worse, a fraud). This is a real problem that should be signaled by the system and resolved by business-specific procedures. Relying on the fact that they might have distinct system OIDs even with the same SSN is like hiding your head in the sand, and can have consequences for the validity of the data. You don't identify people by their OIDs, you identify them by their SSN; that's the inherent business rule in tax laws and regulations. Saying that these 2 business objects are distinct because they have distinct OIDs, while we don't care whether they have distinct SSNs or not, is opening a whole can of worms that might invade your data.
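The collision scenario can be played out with Python's sqlite3 module. This is a sketch under stated assumptions: both table names, the oid column and the sample people are invented, and SQLite stands in for the real database. A table keyed only by a generated OID could of course also declare the SSN UNIQUE; the sketch shows what happens when it relies on the OID alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Identity carried only by a system-generated OID.
    CREATE TABLE taxpayers_by_oid (
        oid  INTEGER PRIMARY KEY AUTOINCREMENT,
        ssn  DECIMAL(9,0) NOT NULL,
        name VARCHAR(64) NOT NULL
    );
    -- Identity carried by the business key, as in the schema above.
    CREATE TABLE taxpayers_by_ssn (
        ssn  DECIMAL(9,0) PRIMARY KEY,
        name VARCHAR(64) NOT NULL
    );
""")

# Two different people who were, by error or fraud, given the same number.
people = [(123456789, "First Person"), (123456789, "Second Person")]

# With OIDs only, both rows go in and nothing is signaled.
conn.executemany("INSERT INTO taxpayers_by_oid (ssn, name) VALUES (?, ?)", people)
silent_duplicates = conn.execute(
    "SELECT COUNT(*) FROM taxpayers_by_oid WHERE ssn = 123456789"
).fetchone()[0]

# With the SSN as the key, the second insert is rejected by the database.
collision_reported = False
try:
    conn.executemany("INSERT INTO taxpayers_by_ssn (ssn, name) VALUES (?, ?)", people)
except sqlite3.IntegrityError:
    collision_reported = True

print(silent_duplicates, collision_reported)  # prints: 2 True
```

The rejected insert is the system "signaling" the problem; what happens next is a matter for the business procedures mentioned above, not for the schema.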

Should the SSN be better? Well, it had better be better, and had better be replaced by genuine public certificates, because if we're going to have a dynamic economy in which trust is paramount, any kind of information system that is not able to identify the actors is useless. And OIDs without business meaning are useless for the purpose of identification in any kind of information system. --CostinCozianu

So, putting aside the fuss over whether a SocialSecurityNumber is a good key (hint: this section screams "RefactorMe to my own page"), does the above SQL contribute anything towards solving this argument? Perhaps the problem is that we are confusing two kinds of identity based on two kinds of thing - there is identity of the object and there is identity of the entity; two instances of Taxpayer are different objects but could represent the same entity (unless the system was built to keep object-identity the same as entity-identity).

As for the identifier question, i rather like the ClarkeIdentifier. -- TomAnderson

In practice, it is easier to assign an auto-generated number to people as an internal working number. A good global (domain) identifier is either elusive or raises too many privacy concerns.

I suggest the material about generated ID numbers be moved to an ID-related topic.


This entire page strikes me as an argument where neither side has addressed the question of what "object identity" is. Among other things, it makes a set of assumptions about both DB's and object systems that may not withstand more careful scrutiny. Let me therefore offer some observations that I don't think have been addressed here (where "system" means a distinct processor, memory address space, and program):


I think again the point of this page was missed. A well designed object model does not need an intrinsic or artificial ObjectIdentity for its domain objects. Domain objects should be uniquely identifiable by their contents, or else you've got yourself some problems.

I don't think I understand your assertion. Suppose I want to, for example, examine the objects that represent the clients who did transactions with my investment firm on May 19, 2002. How does a "well designed object model" retrieve that information from the datastore without "intrinsic or artificial ObjectIdentity" (whatever that means)?

SELECT * FROM CLIENTS WHERE EXISTS (SELECT * FROM TRANSACTIONS WHERE TRANSACTIONS.CLIENT_ID= CLIENTS.CLIENT_ID AND TRANSACTIONS.INVESTMENT_COMPANY_ID = ? AND TRANSACTIONS.DATE='May 19, 2002')

So what do you think you'd need object identity for? When I refer to intrinsic object identity I refer to something that can exist only in object databases, just like the example presented by RandyStafford above, where TaxEvent objects don't have any identifying content. This would be legitimate in an object database while impossible in a relational schema. But a domain model where objects are not distinguishable by their contents is intrinsically bad.

Oh, I see. And presumably when I then want to save this object in the audit records associated with my tax filing, my compensation checks (one for each employee of the firm), my shareholder report, and so on, I merely record your 190 character string and then run the query EACH time I need to reference the information - as opposed to recording a 32, 64, or 128 bit object ID and cached object dereference. I can see how that would dramatically improve my system. And I suppose that if I was instead writing an ECAD system, where each drawing contains hundreds of items drawn from a catalog, your approach would be even better.

Now, come on, Tom, since when have fine ladies and gents like us come to count the bits? You are making a StrawMan here; there's no way a typical candidate key will have 190 characters. And I do hope you won't put the 128 bit Object ID on a shareholder report.

I think Tom makes a good point. The 190 character string he's talking about is your select statement. Without identity every reference invokes that query. Would you really advocate using a query for every reference to domain objects in a CAD system? And why would using identity lead him to displaying it on a shareholder report? -- EricHodges

Actually I can't understand what Tom is trying to say. Why would I want to store a query ? I might want to store some criteria values for that query, but as long as his use case is as clear as mud, I'll refrain from pronouncing. And I would use candidate keys for referencing other "entities", of course. Establishing relations between different pieces of information has never been a problem in relational model, that's why it's called relational after all.

He didn't say "store" the query, he said "run" it. I'd rather let him clarify.

As regards CAD systems, I really do not have any experience programming such beasts.

So your statement at the top of this section ("A well designed object model does not need an intrinsic or artificial ObjectIdentity for its domain objects.") is limited to object models you've encountered? Can you specify the set of object models that don't need object identity?

Most of them. Certainly the typical models for business applications. Even in your modern IDEs (NetBeans, Eclipse) most objects are identified by a meaningful key. As for (the really big) CAD systems, what makes them a little different is some stringent performance requirements and usage patterns that are atypical compared with most other database applications. In any case, even for those there are two layers that you have to analyze: the logical layer and the physical (implementation) layer. Or the "what" and the "how": what information is stored about things in a CAD system, and how that information is stored and retrieved. ObjectIdentity certainly has no meaningful usage in describing the what. I know from my brother-in-law, who does large-scale electronic circuit design for Philips, that he has to be able to identify all the components in the circuit, and certainly it's not by a 128 bit random integer.

I must not understand ObjectIdentity then. I thought object identity is "the what" during the object's lifetime. Any given state is transient, but the object's identity ties all of its transient states together.

Well, it looks like you understand ObjectIdentity but you fail to understand the consequences. Let's say that EricHodges is represented in a computer system by a persistent object having a persistent identity 0xAABBCCDD. The contents of that object consist of { firstname: "Eric"; lastname: "Hodges"; ssn: "111-11-1111" }. In a subsequent run, for whatever hypothetical reasons that may be detailed later, the object with the persistent identity 0xAABBCCDD might have the contents { firstname: "Costin"; lastname: "Cozianu"; ssn: "222-22-2222" }. So what is the what? Is it the identity or is it the contents?

It's still the same object. That object's state has changed. I must still be failing to understand the consequences.

Yep. It is the same physical object. However, it reflects another object in the domain being modeled. So your "what" - the object identity - actually reflects the "how": how the information is stored in an object system. The real what is still the value content of the object. I don't know if you have kids, but how about if we had another object with the content: { firstname: "Eric's kid"; lastname: "Hodges"; ssn: "333-33-3333"; parent: 0xAABBCCDD }. Now, do we or do we not have a problem?

I don't have a problem. Eric's kid Hodges parent's name was changed to Costin Cozianu. It's still the same parent.

Now really :) Including the social security number ? Or is it more likely that the software system screwed up something ?

The software screwing something up is a separate issue. That can happen regardless of whether we use object identity or not. You're the one that said the software changed the values of the object. I assumed you made it do that for a good reason. I don't see how avoiding object identity will reduce bugs.

Software screwing things up is part of the whole package. That's why database management systems have an important role called data integrity.
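The exchange above can be replayed in a few lines of Python. This is a sketch only: the Person class and the directory dictionary are made up for the purpose, and Python's in-memory id() stands in for the persistent identity 0xAABBCCDD.

```python
class Person:
    def __init__(self, firstname, lastname, ssn, parent=None):
        self.firstname = firstname
        self.lastname = lastname
        self.ssn = ssn
        self.parent = parent  # a reference by object identity


parent = Person("Eric", "Hodges", "111-11-1111")
kid = Person("Eric's kid", "Hodges", "333-33-3333", parent=parent)
identity_before = id(parent)

# For whatever reason, a later run overwrites the contents in place.
parent.firstname, parent.lastname, parent.ssn = "Costin", "Cozianu", "222-22-2222"

# By identity, nothing has happened: it is the same object and the kid's
# reference still resolves, now to a different person's data.
same_object = id(parent) == identity_before and kid.parent is parent
kid_parent_ssn = kid.parent.ssn

# If the kid had recorded the parent's SSN instead, the mismatch would be
# detectable: the recorded key no longer matches anything in the directory.
recorded_parent_ssn = "111-11-1111"
directory = {parent.ssn: parent}
dangling_reference = recorded_parent_ssn not in directory

print(same_object, kid_parent_ssn, dangling_reference)
# prints: True 222-22-2222 True
```

Both readings in the thread are visible here: by identity it is "still the same parent", while by value the reference has been left pointing at nothing, which is something an integrity check can catch.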


Let's say user Bob has a Thing (object, tuple, whatever) that he is working with. If the identity of that Thing is determined entirely by its state then if user Sue modifies another Thing so that it has the same state as Bob's Thing, then both Bob and Sue are now working with the same Thing, whether they want to be or not. It's rare that requirements indicate Bob will want Sue to become co-owner of his Thing if her Thing ever has the same state as his at some instant in time.

Well, but you miss that in a relational database two things will be prevented from having the same candidate key. In an object system, maybe or maybe not.

So users will be prevented from ever modifying their Things in such a way that they have the same state as any other user's Thing at the same time unless their Things have some part of their state that keeps them from colliding, correct? I don't think I'm missing that part. I think I'm highlighting it.

Exactly, and that is a good thing. How would you like it if we were at the same bank and I created a bank account, brought it to $0, and then modified the account number of "my account object" to match the account number of "your account object"?

I wouldn't worry about it because I'd use object identity to keep them from colliding.

Yes, now can you look on your checks and tell me whether it's the object identity that is printed or is it the account number ?

It's the account number, but my checks are not objects. They are checks. Each check is different from the other checks, even if I write the same information on them.

[And I promise that the back-end systems used to process each check are written in C++, each check is modeled with one or more objects, and that the account number (together with the ABA number of the bank and the check number) are closely associated with that object identity. Oh, and those systems use overlapping windows, pointing devices, and copy/cut/paste protocols as well.]


It sounds like the anti-identity argument is that the domain model isn't robust enough and should include an owner identifier as part of the Thing's state. That would keep state from colliding accidentally. But now we've burdened many Things with the knowledge of which user (or process, or server, or whatever) can manipulate them. And for what benefit?

No, the anti-identity argument is that the mathematical mapping that should exist between the domain model that reflects the business rules of the users of the system, and the execution artifacts (objects) in the software system is very shaky should you choose to rely on ObjectIdentity.

What's the benefit of the non-shaky mathematical mapping that "should" exist?

The benefit is that the identity of things cannot be screwed up, as can happen in object systems that rely on ObjectIdentity. At this point I'm afraid we are going around in circles like we did with TwoPhaseCommit, therefore I'll kindly ask you to read up on the ObjectIdentity chapter in FundamentalsOfObjectOrientedDatabases.

So by not using unique object identities, identities of things can't be screwed? It seems just the opposite to me.

Well, think again, and please read before you write, because you may be borderline trolling.

I promise I'm not trolling. I just don't understand the aversion to object identity. From what I can make out of FundamentalsOfObjectOrientedDatabases, he's using a "hidden abstract identifier" (section 1.5). [Nah, he's just trolling.]

Try to read chapter 2. Skip to the conclusions if you are impatient. The essence is the ValueIdentifiability principle.

This whole page is one side arguing that there are two kinds of objects, reference and value (the OO guys), and the other arguing that there is only value (the relational guys). Well guess what, the OO guys are right. Maybe that's not the way it should be, which is the point of the relational folks, but it sure as hell is the way it is, and it isn't changing any time soon. Relational simply isn't ready to compete with OO at the application level; maybe it will be someday, but it isn't now, so I don't see the point of most of the arguing.

Can you show an example of a particular relational melt-down?

Sure, just show me how I can attach behavior to a resultset so I can enforce business rules without resorting to passing around a bag of untyped values like rs["field"]. Show me how relational can help me know what's in a container without having to track down the statement that created it, with OO and modern environments I usually need nothing more than a period to see a list of available fields. Relational is great for putting the damn data on the damn screen so to say, but it sucks for allowing the enforcement of complex business rules. Relational solutions simply don't allow us to model a behavioral solution to a problem in the same way OO does, right or wrong, that's quite important to a whole lot of people, myself included.

Please let's avoid voodoo words like "relational solutions simply don't allow us to model a behavioral (sic!) solution". This is nonsense, as you assume that you have a good and useful definition of "behavioral" that everybody should agree to, and that it is presumably incompatible with the relational model. That is quite an unwarranted assumption that you try to pass off as fact. The relational model is a model of data structuring and data manipulation. It largely takes care of the server part in a client/server protocol (for example ODBC, JDBC, etc.). What you do in a client is absolutely your problem, including whether you decide to manipulate the rows you've got back with rs["field"] (smells like DotNet), or whatever else you like.

Then there's this outlandish claim that "relational sucks at enforcement of complex business rules". On the contrary, relational shines at enforcement of complex business rules, because business rules are all about logic (if (condition x) (action y) (action z)). Prolog would be among the best (unfortunately it is not as mature for large database applications). OO languages are possibly among the worst, unless you use OO to implement a rule engine. If you put business rules in methods belonging to domain objects, then whenever the business rules change, the business user has to ask the programmer to reprogram the domain object, and possibly redeploy the system (or at least patch it). Quite a feature, this OO handling of business rules. With a logic programming approach, on the other hand, you'd give the business user a rule editor, so that simple modifications to business rules do not trigger a software modification. This approach has been validated in many business systems.
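As a rough illustration of the "rules as data" idea argued for above, here is a minimal Python sketch. It is not a real rule engine: the rule format, the field names, the thresholds and the action names are all invented, and it assumes each rule is a single comparison on one field.

```python
import operator

# Comparison operators a rule may name; the rule itself stays plain data.
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

# Rules could equally be rows loaded from a table and edited by a business
# user; these particular fields and thresholds are hypothetical.
rules = [
    {"field": "declared_income", "op": ">", "value": 1_000_000, "action": "flag_for_audit"},
    {"field": "filings_missing", "op": ">", "value": 0, "action": "send_reminder"},
]


def actions_for(fact, rules):
    """Return the actions of every rule whose condition holds for this fact."""
    return [
        rule["action"]
        for rule in rules
        if OPERATORS[rule["op"]](fact[rule["field"]], rule["value"])
    ]


taxpayer = {"declared_income": 2_500_000, "filings_missing": 0}
before = actions_for(taxpayer, rules)

# Changing a rule is a data edit, not a change to any domain class.
rules[1]["value"] = -1
after = actions_for(taxpayer, rules)

print(before, after)
# prints: ['flag_for_audit'] ['flag_for_audit', 'send_reminder']
```

Whether this counts as an argument against putting rules in methods is exactly what the thread disputes; the sketch only shows that the rule change did not touch the evaluating code.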

''And speaking of business rules, one area where OO lacks spectacularly is declarative specification of integrity constraints. Here's how it goes from our very own XP/object guru RonJeffries, from http://www.xprogramming.com/Practices/PracDoInObject.htm :''

Voila. No more proof needed that objects are lousy at business rules.

In any case the title of this is RelationalHasNoObjectIdentity, and as the discussion goes in ObjectIdentity and related pages a good object model will always identify its objects by value. This has been justified theoretically by researchers in object databases -- see FundamentalsOfObjectOrientedDatabases, and practically it has been confirmed time and again by episodes like the one told by RonJeffries.


Even the proposed solution in TheThirdManifesto isn't adequate, isn't complete, and is mostly theoretical. Were a perfect implementation of it to exist, most OO folks wouldn't be happy with the flavor of OO it offers, so it's not really the final solution. Object identity exists, get used to it, it isn't changing in the near future.


There are a few known ways to uniquely identify anything in computers

OO seems to want to use either hardware addresses or the every-attribute approach.

'''Nonsense - Python uses the order in which the objects were created, ie. object 1,2,..N. Similar to AutoGeneratedKey in some DBs I've seen - and the same problem exists as with hardware address, because different platforms will not have instantiated the exact same sequence of objects. The fact is that a concept of identity is almost always needed (although relational people often hide the identity in domain-specific concepts such as SIN) unless we're dealing with immutable data. An AutoGeneratedKey has all the weaknesses of a pointer unless all users are fetching from the same DB. Likewise, hardware addresses would work fine provided there was consensus about which computer those addresses referred to, or all computers synchronised this information. The fact is that databases do not have the same speed concerns that OO languages are designed around, and have the luxury of using non-hardware keys for identifying objects. For example, C++ processes a hundred times faster than Python because Python over-relies on strings for looking up members.

And yes, once data is intended to be synchronised over a network, it does become important to use non-local information for global identification - but the nature of this info is irrelevant. SINs, the address of the object on a central server's RAM, whatever. The only problem with memory address is that it becomes an impediment if you wish to move the object around while preserving its identity. ''' -- Martin Zarate


Last edited July 4, 2005.