Relational Object Identity

[RefactorMe: This page needs a new title. maybe "ObjectIdentyPrimaryKeyMapping?"? kind of klunky. How about a simple SurrogateKeys??]

I can't find the original reference, But on one of these pages, someone asserted that when mapping object data into relational databases, data members are pretty straightforward, relationships can be bit tricky, but usually not too bad. Then they went on to say that we shouldn't invent new columns to be a primary key if objects are uniquely identified by their data members.

However, on our system, we have done exactly that. Since our objects are unique if we compare enough of the non changing data members, we could create a primary key from 5 columns .... but that seems like too much. Instead we gave every object a member variable that holds some seemingly meaningless data that happens to be unique (it is filled by a counter at construction.)

Are there any obvious reasons why this might be a bad idea?

How about a simple answer: Create a unique index on the 5 column business key. (In addition to the primary key unique index on the system assigned key.) -- JeffGrigg

True, this would prevent two objects with different object identifiers but identical NameSpace identity. It does not address a symptom of a potentially klunky design. Can we support or argue against the practice of inventing these keys?

There are a lot of nuances to the story you tell, but the essential question you have to ask yourself is whether you can have two objects with identical content but "differing" only in the ArtificialKey? / ObjectsIdentity?. What happens then ? If you can come up with a satisfactory answer, then you've probably got your schema right. However in many situations creating an identity column is a bad solution to the wrong problem. --CostinCozianu

I've had similar feelings from time to time, but I haven't been able to put my finger on why. For starters, hopefully, we'd never have two objects identical in all ways except their ArtificialKey?/ObjectIdentity. If we did, we'd have a problem. Perhaps you'd be willing to elaborate on the bad solution to the wrong problem part.

Now here we hit the nuances. Basically you're right, if you have two objects differing only in their object identity you have a problem.

There's an exception here, in that it is OK to have object with identical content but which differ in the context of the database instance by the way they relate to other objects. This approach has been there for a very long time, because that's how people were identified: like Jesus son of Joseph and Mary and completed eventually with a whole genealogy usually on the paternal side, because you can bet there were more than one Jesus in Israel. So if two objects have identical value content they still have a chance of being valid if the graph of relationships differ in shape and/or content if you extract the two subgraphs of objects related directly or indirectly to the object in question they are not isomorphic , where the isomorphism has to preserve the class of the object in each vertex and it's value content. To make it clearer I'd probably have to draw some examples, but amazingly, that's how people were identified in my country (and other countries) in old ages, like Ion "son of" Maria, and if there was a confusion they'd go to grand parents Ion "son-of" Maria "son-of" Elena, and so on, and, of course, the naming had to be unique only within the space of a village. When people move to other village the head of the family took the name of his village of origin within its name, so there wouldn't be namespace clashes. That's how my last name came to be "Cozianu" which means "from Cozia". But there isn't uncommon to have names like Amariei which means "Of Mary", those guys had more stable ancestors.

But if people have no problem recognizing simple relationship graph patterns, and creating ad-hoc solutions when problems occur, computer systems are different, and essentially we can't have a generalized graph based identification scheme enforced by the database and/or the software systems. Database support for graph algorithms is essentially none (be it relational or OO database). So we have to stick with the general rule that objects should differ in values, either directly or indirectly. The later case comes up with what is called "weak entities" like the well known OrderLine? (what a bad name) which usually is badly implemented in the relational schema as OrderLine?( OrderLine_OID, Order_OID, Product_OID, quantity), when in fact the primary key should be [Order_OID, Product_OID] and having an OrderLine_OID is absolutely non-sensical, but the thing here is that OrderLine? as an object is undistinguishable by content, it is distinguishable by the way it relates to Order and Product, and as a matter of fact it doesn't make a lot of sense to access an Orderline directly, a user will come either from an Order or a Product. But that's for "weak entities", entities whose existence doesn't make sense alone, they always have to be related to other entity. The core entities should be distinguishable by values, directly.

What I have briefly discussed above are instances of the IdentificationProblem?, and that is the essential problem to be addressed when designing a database schema, creating an E/R diagram, or worse, drawing an object model in UML. Every piece of information in the system has to be identifiable (you have to have a way of pointing to it) and more be uniquely identifiable it doesn't help if you end up pointing of two or more undistinguishable objects.

Of course, UML being brain-dead from its nascence is totally in ignorance of this issue as well as many others. And so is the vast majority of OO literature, but amazingly enough, the problem has been extensively studied not by relational experts, but by researchers in OO data models, although the basics of it are found in Dr. Codd's original paper.

Now, of course one might say that attaching an OID to an "object"/tuple solves this problem, but the tricky thing is that the identification process has to start from the logical perspective of the end-user. The premise of OIDs is that they are totally internal, they're never shown to the user. End-users understand values not pointers, and they understand relations. So, when you attach OIDs/ surrogate keys and declare you won the battle you basically HideTheDirtUnderTheCarpet. The wrong problem being addressed is that every table needs a primary key, and what can be more convenient as primary key than a 4 bytes or 8 bytes autogenerated integer? --CostinCozianu

[[More on this later since the subject is too long, and maybe we can come up with a better name, RelationalObjectIdentity is kind of not making a lot of sense. --CostinCozianu ]]

Relationships can be bit tricky.

On the contrary, the tricky part is getting the relationships right in an object model. In a relational schema, expressing relationships between entities is trivial.

Yes. Relationships with ArtificialKey?s or ObjectIdenty?s are quite easy. Say you have a table of Widgets that hold a collection of Widget Parts. If the Widgets Table and the Widget Parts table each have an ObjectIdenty? column, then a third table can be constructed which is a simple relationship of the Widgets ObjectIdenty? Column to the Widget Parts ObjectIdenty? Column. Better yet, if Widget Parts always belong to one and only one Widget, then the Widget Parts table can simply have a column which is a simple reference the owning Widget ObjectIdenty? column: no relationship table required. I imagine that it could be nearly as simple without an ArtificialKey? in each table, but ease of relationships was something that encouraged this implementation.

See ReadingsInFundamentalsOfObjectOrientedDatabases?.

Primary keys need to (A) be unique (B) not allow null (C) not change (D) non BLOB (BinaryLargeObject?) data. Creating an aditional row with a meaningless primary key is very safe, and never "wrong". If you have data which naturally contains requirements for uniqueness, you may wish to enforce that with a unique index. Additionally, if the data naturally contains a primary key there is nothing wrong with using it, either. It might be confusing to have two unrelated unique indecies on a table. To add to clearity, and maybe to reduce the feeling of duplication, perhaps the natural key would be preferable.