Mapping Inheritance Hierarchies To Relational Schemata Involves Compromises

This page is intended as a more clearly-named discussion of one of the problems that partially comprises the ObjectRelationalImpedanceMismatch.

Mapping inheritance hierarchies to relational schemata involves compromises between efficiency and clarity. Some feel that these compromises represent one of the problems that make up the ObjectRelationalImpedanceMismatch. ThreadMode discussion follows.

The relational model does support inheritance, but the problem lies in the efficiency of the available approaches. As I have voluminously documented in CrossingChasms, there are three different approaches to modeling inheritance in a relational database. Simply put they are:

The single-table approach where each class in an class hierarchy maps to a unique "select" statement from a single table (T). E.G. if there are columns a..z then the superclass of the hierarchy (A) might define attributes a..f, a subclass (B) might define g..j, and a subclass of B (named C) might define k..n. Thus to obtain a "A" you do "select a..f from T where..." and to obtain a "C" you do "select a..n from T where..."
The table-per-class approach where for each class (A,B,C) you have a table that only defines the unique parts of that class. Thus T-sub-A defines a..f, T-sub-B defines g..j, and T-sub-C defines k..n. All tables T-sub-whatever share their parent table's primary key. Thus the primary key of T-sub-B contains the primary key columns of T-sub-A and so on. To obtain an "A" you do "select * from T-sub-A where ...". To obtain a "B" you do "select * from T-sub-A, T-sub-B where T.pk-sub-A = t.pk-sub-B AND ..." and so forth. Thus each select of a subclass is a join of the N tables in a path to the root of the class hierarchy.
The table-per-class approach where for each concrete class (A,B,C) you have a table that defines all of the defined and inherited attributes of that class and all of its superclasses. Abstract classes do not warrant tables in this approach.

The problems with these approaches are:

The first approach results in a large number of "null" columns in the database. It also means you've got one monster table, and one that might be difficult to index (e.g. just because you want to search an attribute of one leaf class it doesn't mean everyone wants to search on that.) It also means you're constantly updating that one table as you add/remove new subclasses from the hierarchy.

The second approach has a problem in that it effectively limits the depth of your inheritance hierarchy. Most relational databases will not efficiently join more than four tables.

The third approach results in a situation where any change above a class in the hierarchy results in changing all of the tables that represent the classes that derive from it.

KyleBrown

Before Kyle Brown's CrossingChasms, N. Rishe offered a more thorough approach in SemanticBinaryModel.

K.B. reffers here to "record inheritance" as documented in PrinciplesOfObjectOrientedSoftwareDevelopment and as it is supported by usual OO languages of today (Java, Smalltalk etc...). I'll address his ideas in due time, however I want to state that there are several inheritance models we can talk about.

It can be argued that other inheritance models are of theoretical interest only, as they are not easily supported by common OO languages of today, however they can serve us to better understand and define the domain model of our application, and sometimes those alternative models are better fitted for a particular situation. --CostinCozianu

See LiskovSubstitutionPrinciple, CircleAndEllipseProblem, IsaCircleAnEllipse, TheThirdManifesto, PrinciplesOfObjectOrientedSoftwareDevelopment

{Move these "see also's" below? Is it a response?}

The above observations are all quite valid. The EnterpriseObjectsFramework from AppleComputer uses the above three methods for ObjectRelationalMapping (in EOF the three methods are called single table, vertical, and horizontal mapping). For performance reasons, single table mapping is usually the fastest; vertical mapping is the most normalized and flexible, but involves a lot of joins and so is not recommended for high-performance applications.

In most cases, the characteristics of your object model will determine how best to model inheritance in this fashion. Is your model changing frequently, or not at all? Is is a deep inheritance hierarchy, or shallow, etc.

More information on the way in which EnterpriseObjectsFramework handles these issues can be found at http://developer.apple.com/techpubs/webobjects/WebObjects_4.5/System/Documentation/Developer/EnterpriseObjects/DevGuide/EOsII3.html .

And the ObjectRelationalImpedanceMismatch here is that you need to select one of the above, and each one has distinct disadvantages:

The "kitchen sink" approach (single table) gives the least trouble if you don't mind having very large numbers of NULL columns.
The "one-for-one" tables-to-classes approach (vertial) gives you a highly normalized model, at the expense of performance. And you may even hit the maximum number of tables your RDBMS can handle in a query.
The "leaves only" approach (horizontal) is a good compromise, except that you can't model relationships to abstract classes in the database, as the tables don't exist. Likewise, you can't list all Employee objects easily if Employee has subclasses. (Yes, you can UNION, but are you aware of how many SQL-92 databases don't allow UNION in views or subqueries?)

Yes, ObjectRelationalImpedanceMismatch here -- that's for sure.

KyleBrown wrote Actually, the relational model does support inheritance, but instead the problem lies in the efficiency of the available approaches.

Oh, in that case assembly language supports inheritance too, because I can hack it in that too.

(I don't have the appropriate references handy at the moment, but...) Relational Database modeling has supported "exclusive" and "specialization" relationships in EntityRelationshipDiagram-s for some time -- since long before objects became popular. But the inconvenience and performance costs of querying these hierarchical trees of tables generally discouraged people from using it much.

I recall doing a hierarchy of Person-Employee-Student/Instructor tables in the mid-1980s; implemented in PL/I. (A structured procedural language -- not OO at all.) Our model eventually went about half a dozen layers deep and with nearly a dozen total "Person classes/tables." Supporting this mess got rather annoying in implementation. ;-> -- JeffGrigg

[We used the "table-per-class" (vertical) approach. -- JeffGrigg]

Supporting inheritance at a logical level means supporting an IS-A relation. The IS-A relations generates a hierarachy sometimes, but in the general case we are talking about a DAG (directed acyclic graph), and admitedly facilitating climbing up and going down from trees was not among the initial targets of the relational model.

However, in some relational databases of today you can do just that, such as in Oracle, although CONNECT BY is not quite every one's favourite thing about Oracle. DB2 apparently has a different substitute for it, I'll have to check it myself before.

Climbing trees and going down from it have to be done manually by navigating pointers in OO models, even if the navigation could be facilitated through the use of an iterator. -CostinCozianu

Actually, a hierarchy of tables does not imply recursive descent through data -- just ordinary joins. (We were using Sybase under Unix.) To get all Instructor data, you'd do "select p.*, e.*, i.* from Person p, Employee e, Instructor i where p.id = e.id and e.id = i.id" and then add whatever other joins and conditions that you need. The annoyance is that every time you touch a "person child table," you end up throwing up to half a dozen additional tables and joins into the query. It tends to be kind of complicated and slow. -- JeffGrigg

I've been very disappointed by Oracle's implemention of CONNECT BY: They impose very severe restrictions on using ordinary joins in CONNECT BY queries -- generally making it unusable in applications where I've tried to use it. -- JeffGrigg

Oracle seems to have realized that, and has been slowly improving CONNECT BY over the years. If you get the chance to work with a sufficiently new release, check it out again. --JasonBucata

There is a fourth possibility, though I've never explored this.

Have two tables, one that maintains the properties common to all the object types, and a second that maintains the properties where they diverge. This second table contains fields that describe the object type, the property value, and possibly the property type.

(the following example is totally contrived, and may not make much sense) Consider a Customer class and a VIP customer class. You have the usual CustomerName?, CustomerAddress? and that sort of thing in the main Customers table. These are common to both VIP customers and ordinary customers. However, VIP customers receive an automatic discount on all their purchases (regular customers do not). VIP Customers have the additional property DiscountPercent?. This would appear in the second table as a record like:

{Peronsally, I find such sub-typing of customers an ugly approach. Often who gets what discount or whatever changes. Using a sub-type hierarchy would require moving such code up, down, and all around, whereas if it is just a column switch then little or no code changes to change who gets the discount (assuming it does not later depend on things such as zip-code and other products in the basket). Plus, it is likely that many traits (discounts) are orthogonal and don't fit well into a single hierarchy. But I do realize that some believe hierarchical taxonomies are a good enough UsefulLie to live with some problems and that I am not going to talk them out of it. Generally, though, relational fans do not see as much value in hierarchical taxonomies as OO fans. See EmployeeTypes and SetsAndPolymorphism. Set theory is a better tool for this kind of thing, but polymorphism and sets don't mix.}

customer_ID: (whatever) customerextra_property: "DiscountPercent?" customerextra_value: 15

You might have another table that describes the fields and their data types, where you have "Customer.DiscountPercent?" as "Percentage" or something.

You could then write SQL that sets/retrieves certain properties (or in some cases allow end users to define their own properties, much like MicrosoftOutlook). The only problem I foreseee is you need to do transforms or subqueries, and take a big perforance hit. Still, might be interesting to try.

-- DanNovak

Actually, if you throw out the first table and do EVERYTHING like you've suggested in the second table, you've arrived at an AntiPattern that I've seen in many large companies, especially in the insurance industry. Believe me, the performance hit (not to mention the lack of relational integrity and all of the other boundaries that using the relational model gives you) will soon make this solution your worst nightmare. Although I'll grant you, it is QUITE flexible and easy to change -- you only ever have one database schema :) --KyleBrown

[See GenericDataModel.]

Inheritance needs to be supported explicitly. Avoiding concrete supertypes is a good start. I tend to munge peer types into one table and switch on some type flag. I have used a generic mapping of one tuple per attribute too (see DocQueryInSql) but the query code got very logical-mathematical. --RichardHenderson.

Why Before How

A RelationalWeenie will often view trees and inheritence as the wrong solution. Thus, focusing on how to represent trees in relational is considered less important than why. Is a tree really the right structure? The LimitsOfHierarchies get in the way. Relational tends to see sets as a more flexible and natural taxonomy/classification system than trees.

Can I ask why you do not convert object classes to ADT's (as specified by CJDATE), even if there is stacks of inhertance to consider?

-- Jo.T.

ShorterNameSuggestion: EfficiencyVsClarity?

CategoryHierachy

CategoryRelationalDatabase