Refactoring Friendly Database Design

Costin: Sorry, I wasn't quite done with this page, and I have moved all your inter-textual comments all to the end to allow me to complete it without losing the thread. Please restore theme appropriately as you see fit. [DeleteMe]

From RefactoringWithRelationalDatabases:

Schema private or public? -- From an OO point of view I want to see the database schema as something private--the persistent storage of our private attributes. I want all access to go through the public methods of our classes. Unfortunately, databases aren't always viewed like that. Several applications might share a database without sharing application code. The database schema might be seen as a public interface to the data. This basically means that most database refactoring violates the the OpenClosedPrinciple. I certainly think such a view of databases is bad, but we often have to live with a legacy.

This is really the linchpin upon which relational inertia turns. If developers can design from the ground up and build as they go for a single application they can refactor the schema as often as necessary to meet the requirements of development. I've done it more than once and it's quite lovely to be able to do that. But in the case of existing database schemas development is constrained to accept the problems that arise from having a centralized, unencapsulated resource -- an impressively large collection of global variables if you will -- in order to retain the benefits of relational principles. In other words, RelationalBreaksEncapsulation

This global variable table is now a brittle resource that inertially drags down all development because the conflicting needs of different applications and business processes have to be sorted out and carefully shepherded into complex dependencies by the DBA experts. The cost of change is enormous when there are side-effects and ripples throughout the whole of a corporation's applications, but that's because we have one table, for example ENTERPRISE_GOLD that has a 3-column primary key. Suppose developing an application to meet a certain new business requirement would really become almost trivial if the size or type of one of those columns could be adjusted. It won't happen -- that application will have to work around the existing schema to get its work done (if it can) because it can't go in and change the global variables that every other application depends on.

So the costs of relational databases relate nearly as much to the inertia of sharing global variables as to the ObjectRelationalImpedanceMismatch. Actually refactoring relational schema is not terribly difficult. Suggesting that a centralized globally shared resource be refactored to meet the requirements of one development effort goes against the very grain of how DBAs think.

From DontRefactorDatabases: Try to get them right the first time, it's not that difficult. The refactorings should be minor and very rare. It is a mistake to consider that if refactorings are good and necessary for code structures, to avoid BigDesignUpFront, the same thing also applies to databases.

A most difficult assertion from which to proceed with a discussion of RefactoringFriendlyDatabaseDesign. Interestingly, the cases where I worked on projects where the database was designed and built from scratch for the purpose of the application, refactorings were quite common and sometimes significant as the problem space was explored, the requirements evolved, and the solution emerged. There is no question that it would have been impossible to get it right the first time, because "the first time" evolved into further iterations that exposed new problems and solutions.

Again, from RefactoringWithRelationalDatabases: Go ahead with the database refactoring, making liberal use of scripts for reliability and testability.

Here is a clear consequence of the global variable pool nature of a shared and unencapsulated resource. It's unit tests and regression testing writ enormously comprehensively. If there was code like this, the developer or team would discover very quickly the problems it was imposing on the rest of the work and refactor it until the behavior was flexible enough to allow the development pace to resume.

The challenge to DBAs today is not to create the most perfect schema, but to create a way to interface with relational databases that does not expose the internals of the relational model to the world. Applications coupled tightly with the schema are the ones that most impact the schema when they need refactoring, yet they are also a significant reason why the schema cannot be refactored, when there is more than one application coupled to the schema. Where is the interface to relational databases that allows an application developer to write "pay all the salespeople commissions, unless they are a subtype of salespeople that reports to a paymaster, in which case assign all their commissions to the paymaster" in a way that can allow the definition of "reports to a paymaster" be changed as the business needs change?

From RefactoringWithRelationalDatabases: You may not be able to assume field consultants have DB skills. These techniques, with a simple middle-ware abstraction interface, provide the needed flexibility for simple customization.

The goal here would be to build a set of tools that will let business application programmers spend more of their time and mental effort creating code that implements business requirements and less of it implementing the various pieces are parts that are required to make the language or tool accomplish the basics. Specifically, it would be wonderful to abstract the SQL/relational specifics into a set of services implementing the business view of the data. In Java, for example, have a business applications programmer know that if he wants to add some new code to pay a different sort of commissions to salespeople, that code just needs to call the "pay commissions" service and not worry about the relationships of paymaster/salesperson.

From DontRefactorDatabases: Is it possible that the main reason this holds true is that the interface we have to databases is not one that easily supports refactoring? I don't see any (practical) reason why a database that had an interface which made refactoring easy should not be refactored. The arguments for refactoring code apply just as easily to a database and although I absolutely agree with what you say in terms of a database being a representation of facts rather than a repository for objects I think there may be times when you want to change the way you represent those facts.

If the "global variable pool" were encapsulated with a business view such that all applications went through the same narrow interface, then it wouldn't matter so much that the applications were all doing different things. The interface would take care of transforming the application and business logic into transactions. Changes to the meaning of "reports to paymaster" (and how it is represented) would be isolated such that applications wouldn't need to change in response.

From DatabaseIsRepresenterOfFacts:

you absolutely can't trust the code to maintain your data integrity

in the RelationalDatabase more rules about data integrity (consistency of stored facts) can be stated in a declarative manner, which is safer and easier to verify, as opposed to an imperative manner in an OO programming language. The imperative approach is more subject to application bugs, and no matter how much you trust your programmer capabilities it is harder to verify their correctness then if you just choose to simply declare business rules

These statements, as with the "just do it right the first time", present a difficulty. Perhaps modern relational databases are mature and bug-free enough to be given absolute trust in maintaining data integrity, but that is doubtful. A DatabaseIsSoftwareToo.

Perhaps there's fruitful exploration to be done in the direction of database experts and OO application programmers developing an interface layer that meets the relational rigor yet addresses the application developer's need to spend more time thinking about solving business problems and less time thinking about the relational model.

-- StevenNewton

You can twist object and relational all the way you want, but you're not gonna get around the "large collection of global variables". Here is how things are about database for businesses: DatabaseIsRepresenterOfFacts. If you study the etymology of data, that's where it comes from: datum = a "given", or in modern words a fact. The problem of database design then becomes the problem of creating a well-defined and consistent logical framework to represent the facts about the business, while attending to some other desirable qualities. Database design will tend to evolve towards a global database (global can also be physically distributed but logically global), because upper management will want to have a unitary view about the reality of business. They also want to have DBAs because the database is not the facts per se, it is a representation of the reality, and there's a non-trivial risk of getting the facts wrong, or having a flawed logical framework that will derive a wrong image about the business. Furthermore, the accuracy of facts is sometimes mandated by law, and the inaccuracies can have dramatic business impact. It is therefore well understood that DBAs should function as gatekeepers to one the most important asset of the enterprise: the knowledge of the business reality. Nobody will leave this aspect to be attended by buggy applications anytime soon. Get used to it, and leave with it.

Now the fact that you complain about shared global variables, it's also like me complaining that the water is wet. It is the reality that dictates it. Some facts are shared between different business processes, divisions, etc. Consequently these facts will have to be shared between the applications that deal with them. In the Product table you have facts about products, and the knowledge about those facts is required by the marketing department, the shipping and handling, the ordering system, the accounting staff and so on so forth. Are these facts global ? You bet they are. Are they shared ? Of course. Only if you want to bend the reality, you can implement your dream OO subsystems that will not take into account that certain information is global and shared. Put your dream OO database in the picture and the hard problem will not disappear, it will just become harder to solve or it'll make it easier for you to ignore it, until the disaster may struck. -- CostinCozianu

As pointed above, usually more than an application/system will access different parts of the database and some will unavoidably be shared. Refactoring database is hardly justified because you implemented one story at a time, you did XP, or RuP and didn't get the perception of facts right from the beginning. In the meantime you might have let your guard off, and the production database might contain some inaccurate, inconsistent data. Even if you were lucky for that not to have happened, refactoring a production database (be it even Gemstone or whatever OODB) is no free lunch, no matter how you dream about it. The hard problem is if the old data (old facts) will have a equivalent and accurate representation in the new schema. Furthermore, let's say you solved the first two hard problems, there's a non-trivial amount of time it takes a DBMS to transform its internal physical representations to support the new schema. Many production systems just can't afford that kind of stoppage, because the developers together with the DBAs didn't get it right the first time.

Here I want to ask you what changed from "the first time". Did the business rules change, or your understanding of the business rules became larger in scope and more exact in details ? It is a common place in specialized literature, and my limited experience tends to confirm that business rules change very rarely so radically that a schema restructuring is needed. Some business rules do change more often, but not the ones that define the basic facts about the business. -- CostinCozianu

Then there is the ultimate case, when everything was perfect the first time, designers and DBAs did a great job, but the management decides to restructure some business rules so radically that the system of facts changes dramatically. Then you have the same hard problem, but this time is the management's responsibility and it is OK. They will have to pay cash and in other means (like loss of business) for the refactoring of the database and most likely for redesigning and reimplementing software systems -- CostinCozianu

The answer to the last one is simple: what is "reports to a paymaster" ? It is a binary relation on Employee x Employee ( or SalesPeople? x Paymasters to be more exactly). No matter how you twist it, try to change the representation, refactor, etc, it will still be a binary relation. So let's say that initially you represent that directly, as a table Reports_TO (Employee_ID1, Employee_ID2). Than the requirements change to let's say "reports to" is defined by the rule that "all salespeople in the same branch with the paymaster, who don't have a senior status". Obviously , we can't represent that as a table anymore, unless we want to live up with redundancy. So the answer is simple: have the Reports_TO defined as a view, not a base table. When business rules change, you change the definition of the view, while the application will always refer to the condition "Employee_ID NOT IN (SELECT Employee_ID1 FROM Reports_TO)" or the equivalent "NOT EXISTS (SELECT * FROM Reports_TO WHERE Employee_ID=Employee_ID1)", which is more optimizer friendly for no apparent reason. -- CostinCozianu

I think this is the case of ObjectRelationalPsychologicalMismatch.

For example there's absolutely nothing wrong in having a 3 column primary key in the general case. But if you later want to adjust that and have to face rippling effects, the usual suspect is "ignorance and in data modeling" not "relational inertia". If they're DBAs that doesn't imply that they are database design gurus. Just like with developers, if they're using Java that doesn't mean they are OO experts, and you might see some awful things in the source code from time to time. -- CostinCozianu