Refactoring can be especially challenging when a RelationalDatabase is involved. You can't get your DBA to change the schema every 10 minutes, and many of us are not lucky enough to use something like GemStone for persistence. Furthermore, if you're dealing with existing data, data migration can be difficult, time-consuming, and risky. There are a few different opinions as to how to approach this subject:
Go ahead with the database refactoring, making liberal use of scripts for reliability and testability.
Requires that you have the cooperation of the DBA, or that there is no DBA in the first place. This strategy is a consequence of ExtremeProgramming thinking: If something is expensive, do it all the time and force yourself to make it less expensive. See ContinuousDatabaseRefactoring for specific advice.
Defer changes in the database.
Code is much more flexible than a database, so make your refactorings in your code, and fix your class-to-table mapping to make the code and the database line up. Defer schema changes and defer data migration until you really have to do it. (See ShieldPattern.) See HowToDeferSchemaChanges for specific advice.
Get it right the first time.
Refactoring databases will slow you down no matter how you approach it, and with the proper up-front analysis you should be able to design a schema that never needs refactoring. DontRefactorDatabases.
On the difficulties of data migration
Migration can be straightforward, or it can be unbelievably nasty and high risk. I'd say that frequent, complicated (fear-inducing) data migration will slow refactoring and indicates a brittle environment: a poor fit with the XpFrame. -- KeithBraithwaite
Could you say some more about the forces that determine whether it is straightforward or risky? -- MartijnMeijering
Well, this depends a lot on your database. PostgreSql offers some fairly extensive refactoring options; you can add and delete columns, change constraints, and so on. The more complicated refactorings usually involve several steps: adding new tables or columns without constraints, copying data as necessary, creating new constraints, dropping old constraints, and then dropping old columns or tables. This can get pretty complex if you have extensive constraints, but the scripts are eminently testable. Even better, it can all be done in a single transaction (PostgreSql is one of the few RDBMSes out there that lets you roll back table creates and drops), so you can do a test run against your production database (or a copy thereof) before doing it for real.
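For instance, here is a minimal sketch of that multi-step pattern in PostgreSql, assuming a hypothetical customer table whose free-text city column is being pulled out into its own table (all table and column names are invented for illustration):

  BEGIN;

  -- 1. Add the new structures, without constraints yet.
  CREATE TABLE city (
      city_id SERIAL PRIMARY KEY,
      name    TEXT NOT NULL UNIQUE
  );
  ALTER TABLE customer ADD COLUMN city_id INTEGER;

  -- 2. Copy the existing data across (assumes every customer row has a city).
  INSERT INTO city (name) SELECT DISTINCT city FROM customer;
  UPDATE customer
     SET city_id = (SELECT city_id FROM city WHERE city.name = customer.city);

  -- 3. Create the new constraints.
  ALTER TABLE customer ALTER COLUMN city_id SET NOT NULL;
  ALTER TABLE customer ADD FOREIGN KEY (city_id) REFERENCES city (city_id);

  -- 4. Drop the old column.
  ALTER TABLE customer DROP COLUMN city;

  COMMIT;  -- or ROLLBACK after a dry run against a copy of production

Because the DDL participates in the transaction, a failure at any step leaves the schema exactly as it was.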
Dealing with different versions of software can be more of an issue. Often you can create views that show the new data in the old way, at least closely enough for the old programs to work. If you're concerned about things changing in more drastic ways, you can also create functions in the database to do some of the work and then the function can present a constant or versioned interface to the application code. Sometimes it's just as simple as giving a new column a default value, rolling out the schema change, and then updating the software after that.
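A sketch of the compatibility-view and default-value tactics, continuing the invented customer/city example above (customer_id and the other column names are, again, invented):

  -- Old programs keep seeing the pre-refactoring shape.
  CREATE VIEW customer_flat AS
      SELECT c.customer_id, c.name, ci.name AS city
      FROM customer c
      JOIN city ci ON ci.city_id = c.city_id;

  -- Sometimes a default is all the shielding the old software needs
  -- until it catches up with the schema change.
  ALTER TABLE customer ADD COLUMN loyalty_points INTEGER NOT NULL DEFAULT 0;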
More difficult, and less often needed, would be a procedure such as the following:
There are also the annoying, not-tool-supported technical issues of making database changes: unless you're adding NULLable (or defaulted) columns, you'll have to do a rename-copy-drop on the table.
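A rough sketch of that rename-copy-drop dance, using an invented orders table whose key column needs a type change the tool won't do in place:

  BEGIN;
  ALTER TABLE orders RENAME TO orders_old;

  CREATE TABLE orders (
      order_id    BIGINT PRIMARY KEY,   -- was INTEGER
      customer_id INTEGER NOT NULL,
      placed_on   DATE NOT NULL
  );

  INSERT INTO orders (order_id, customer_id, placed_on)
      SELECT order_id, customer_id, placed_on FROM orders_old;

  -- Indexes, grants, triggers and foreign keys pointing at the old table
  -- all have to be recreated by hand; that is the not-tool-supported part.
  DROP TABLE orders_old;
  COMMIT;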
Who controls the database?
In many organizations, relational database technology is considered to be a strategic corporate resource, requiring centralized control by specially trained database "experts." And, to a large extent, the design, tuning, and operation of relational database schemas requires skills and experience beyond that of your average developer.
So, relational databases are often under the strict control of a centralized group; nothing (other than data) can change in a database without having them do it for you. Some companies take this to extremes, requiring centralized control of development/unit testing databases with the same rigor as production databases. And a few insist that no database change can be made without a face-to-face meeting between members of the development team and members of the centralized DBA team (so they can "do" the database design for you). Needless to say, these policies increase the cost of change and are anti-XP.
-- JeffGrigg
Related: DbasGoneBad
When the developers must share a common database
Also, in many business system development projects, the developers must share a common database. So, if one pair were to change something in the database schema, it would immediately break code and stop the work of all other pairs attempting to work at the same time. (This violates the commonly understood requirement that each developer (or pair) must have their own isolated "play pen" in which to do development and unit testing. But this argument doesn't seem to hold much weight in many corporate development environments.) The problem with "play pens" is that developers often don't know enough database administration to change the schema quickly. They opt for a shared development database so that schema changes have to be done only once, by the most DB-skilled person on the team or a shared DB admin.
-- JeffGrigg
When multiple teams must share a common schema
My advice is DontDoThat. Break the schema (logically) into separate pieces, each of which is under the control of one team. Don't treat the database as a giant global variable, open to all comers. Rather, give ownership of each piece of the data to a piece of the system and make all accesses through that system. This encapsulates the data with behavior and prevents semantic misunderstandings of the data. It also allows each team/module/sub-schema to refactor at will.
The problem is that, in practice, 90% of this behaviour is CRUD (a lot of loadObjectX/saveObjectX methods), complete with an application-specific query language - thus ReinventingTheWheel. Also, in my experience, behaviour (and method signatures) tends to change faster than the underlying data, even when the costs of modifying data and behaviour are the same.
Sharing the database is also the easiest way of integrating various applications (but is not an integration panacea - sometimes you really need to share behaviour). SQL is a joy compared to serious Web Services (SAP SOAP interfaces make a grown man cry...). And giving each application ownership of the data will lead to PointToPointIntegration, which will require a major EnterpriseApplicationIntegration project to reintegrate.
Deployment issues
I must admit that I still doubt it would work to RefactorMercilessly in a large system with relational databases that have actually been deployed. It should certainly be possible for each developer to have a small development and test instance of a database, with a small set of data, but the problem as I see it concerns deployment. Whether you deploy often or not, there will be a disturbance for the customers that is roughly proportional to the number of changes you make in the database schema. If the changes are rolled out often, there will be many down-time occurrences and plenty of evening or weekend work for the DBAs. With fewer rollouts, there will be larger changes at each delivery, and thus longer down times, more to test, and bigger risks...
Some mixed points...
Still, the situation might not be hopeless. There are a lot of database design decisions that affect these issues. Do you have one or a few large databases, or several smaller? Same thing with tables. Perhaps we need to discuss RefactoringFriendlyDatabaseDesign!
Parts of these concerns are only problematic if you touch a few main tables in the system. Most databases I've used have a large number of fairly small tables that could probably be refactored without big problems. On the other hand, it's with the big tables we have a lot to gain...
BTW: How do people handle version/change control of database schemas?
There might also be database content that is delivered by the developers rather than produced by the customers, for instance error messages or calendar data (which days are bank days, etc.). One solution (that I find reasonable) is not to put such data in the database, but to keep it in the code or in text files. At least that leads to a little less database headache...
Version control and change control of database schemas
The way I handle all of these problems (i.e. RefactoringWithRelationalDatabases) is to use SQL command scripts. Instead of using the interactive SQL tool for the database, I devise a script consisting of SQL DDL/DML commands to create (and populate, if necessary) the table/view/what-have-you. Then I just check the scripts into CVS and voila!
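A sketch of what one such checked-in script might look like; the numbered filename and the schema_version bookkeeping table are just conventions I'm assuming here, not something the tools require:

  -- 014_add_contact_mobile.sql   (checked into CVS next to the code that needs it)
  BEGIN;
  ALTER TABLE contact ADD COLUMN mobile_phone VARCHAR(32);
  UPDATE schema_version SET version = 14;   -- single-row bookkeeping table
  COMMIT;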
With this setup, you can even write unit tests for the scripts. If there are dependencies in the schema, you'll have to be more careful in the way you write and call the scripts. I like to handle this with make files, but you may have a better idea. If your RDBMS tool supports it, you can even have make execute the database scripts during the make process.
This setup is especially useful in situations where you know you're going to have to drop/recreate/repopulate your database many times. I've used it successfully with MS SQL Server and Oracle, though I am sure other RDBMS support the technique.
Post Scriptum: If you are working with an existing application/schema (or have already created the schema for your project), many RDBMS include a "build script" command that will build the scripts for your schema automatically. Try it!
Database Extensibility Techniques:
many of us are not lucky enough to use something like GemStone for persistence
A very good point, and for those using an RDBMS there are some database design techniques that will provide some OODBMS-like extensibility and may delay the need for schema refactoring. The downside to these techniques is the difficulty, or outright inability, of querying the extended data.
1) Storing XML in a text column named 'XMLProperties' is a classic way to dynamically extend a database entity without changing the schema. {It is tough to query such XML however. Related: MultiParadigmDatabase}
2) BLOB or image storage. Persisting serialized object data (not object behavior) in an 'ExtendedData' binary image column.
These techniques work well for a table such as 'Contacts' where you will have the obvious columns (name, address, city, phone, etc.) and an extensible column, such as XML or a binary image. When the customer requires that an alternate mobile phone number be added as a contact field, you can quickly extend your contacts data without touching the DB schema.
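A minimal sketch of such a table; the column names and the XML fragment are only illustrative:

  CREATE TABLE contacts (
      contact_id     INTEGER PRIMARY KEY,
      name           VARCHAR(100) NOT NULL,
      address        VARCHAR(200),
      city           VARCHAR(100),
      phone          VARCHAR(32),
      xml_properties TEXT               -- the extension point
  );

  -- The alternate mobile number goes into the blob, not into a new column.
  INSERT INTO contacts (contact_id, name, phone, xml_properties)
  VALUES (1, 'Ann Example', '555-0100',
          '<properties><altMobile>555-0199</altMobile></properties>');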
But don't get too hooked on this practice. If your customer's next request is to generate a report showing all alternate mobile phone numbers in area code 503, standard SQL will not be enough to reliably SELECT the extended data. You'll have to write some extra logic in the object-relational mapping layer.
A good rule of thumb is to keep extended column data storage to 2-3 items. Any more than that is a sure sign that the relational database is in dire need of re-normalizing and that performance is starting to suffer. As long as the extended data is not needed in queries, you can extend indefinitely without touching the DB.
Another Advantage-Disadvantage: Consultants in the field, and their customers, will love the ability to customize the application, but at the expense of complicating migration to newer releases.
-- MichaelLeach
This is bad advice, pure and simple. First of all you fail to see that extending (adding columns) is essentially a lot easier than refactoring, so if you want to add a new column with significance to the application model(s), you have to do just that: add it. If you want your user to dynamically add new properties/data, you can do it relationally, and you won't even have to refactor either the schema or the application to support such a requirement.
If you extend it your way, and later you find out that you need refactoring (to get some data out of the damn blobs and put it the way it should have been from the beginning), well, good luck with the maneuvers.
To put structured data in XML blobs or in binary serialized blobs does lots of harm, and has no benefits compared to the usual solutions, which have all the known advantages (data integrity, manipulation, etc.). Blobs should be strictly reserved for things like images, video, and unstructured text content - things that are a whole, not a bunch of parts that you need to access separately. This is the only real rule of thumb.
Would you care to explain what exactly you will solve, and how you've got your rule of thumb? -- CostinCozianu
Only if you'll explain why binary storage should be strictly reserved for images and video, and why XML in databases is bad. This sounds like a rather feeble maxim to me. -- ML
Well, when you're using an RDBMS, you use it because you want it to solve the problems of data structuring, data manipulation and data integrity, together with a few others. Now when you store structured data in binary or XML, you give up everything a DBMS can and should be doing for you, and move it to be the application's responsibility. It is a bad engineering practice, for all the reasons we are not programming with Cobol indexed files anymore and have DBMSes instead; there's no point in repeating here what any decent book on databases states in the introductory chapter.
The problem is that if you want to make an exception from the rule, first you have to acknowledge it as such. There's absolutely no rule of thumb in here, what you recommend is contrary to basic principles of database design. Then you have to justify what the exception is for. Does it solve a real problem, and are there no better ways? -- cc
Enterprise application developers cannot know field customization demands in advance. You may not be able to assume field consultants have DB skills. These techniques, with a simple middle-ware abstraction interface, provide the needed flexibility for simple customization.
These options are good intermediary tactics until a long-term, FourthNormal tuple-table strategy is needed in the infrastructure. Sure, you can build extensibility into your infrastructure relationally. These are simply faster, cheaper options.
BLOBs have long been used for storing serialized data and many DBMSes now provide intrinsic support for XML anyway. Clinging to legacy relational principles is artless in light of the latest technology IMO. -- ML
Well, you have to define "field customization" for me, but if it is what I think it is, you're wrong that your solution is cheaper and faster than doing it relationally. When the long-term fourth normal tuple-table strategy comes around (do you have anything against the fifth normal form?), your data buried in blobs, be they XML blobs, will give you enough headaches to make you forget about the cheaper-and-faster consideration. Fast food is also cheaper and faster, and is sometimes called junk food for good reasons.
If BLOBs have long been used for storing serialized data, it is also true that design mistakes based on presupposition and "rules of thumb" without any serious arguments have also been plenty. They even made it to diagrams shown as samples of good designs in the books, especially the OO ones, to prove that ignorance reigns supreme in certain circles. So hardly good enough for your lack of arguments.
As to what regards clinging to "legacy relational principles", this refrain was sung in the early 90's by ODBMS vendors. There were tons of latest technologies that just vanished into thin air. Among many other reasons for the fall of ODBMSes, they ignored the few theoretical OO Data Models, which, strangely enough, were building upon the ideas of the relational model. Relational model is a legacy in as much as Newton's mechanics, Maxwell's electrodynamics or classical mathematics are legacy. Do not think that waving the "latest technology" flag is going to solve the database management problems, problems that were already solved by the relational model, while the new kids on the block are trying to reinvent some squared wheels. XML support in databases, well it is some, kind of. It depends on your interpretation of "support". Why don't you store the whole schema in XML then? Because XML ain't no data model, it pretty much sucks at it. -- cc
Interesting rant... Can you point me to any FifthNormal products that you've worked on? I suspect you're starting to waste my time.
Should I take it that you're asking me when I worked on a project where the database schema was in the fifth normal form? Well, it so happens that I've never designed a schema that was not in the 5NF. There's no justification whatsoever for having a schema in 3NF but not in 5NF. If you don't waste my time you know what I mean.
Hello, this is a different interlocutor. You say you never designed a schema that was not in 5NF, but question: were any of the designs actually implemented, or was it an academic exercise and the real implementations all denormalized the design somewhat to make them actually useful?
Yes, they were implemented. Sorry for this rant on my part, but you might be misled by opinions that normal forms higher than 3NF are only good as an "academic exercise" and that "over-normalization" - that's what they call 5NF - is bad for performance. This is wrong, and is based simply on the ignorance of many authors. First of all, the most common schemas, like the ones in the "Northwind" sample database from MicroSoft (with customers, products, orders, order lines, shipping, employees, etc.), are in 5NF. If you just follow your intuition, without any special training in database design, great chances are that you will end up with a schema that is already in 5NF. Nothing magical or academic about that. But sometimes common intuition is not enough, because there might be some tricky parts.
To be in 5NF, wouldn't the Northwind example have to normalize the Employees and Customers tables to be a Person table with a foreign key reference to a PersonType? I also thought 3-5NF would move the contact information away from the Person table and into a ContactProperty table, with Person-Contact intersections in a PersonContactInformation table. The task of refactoring then becomes an exercise in adding new Contact rows instead of Person columns.
My take is that Northwind is more like 3NF and not easily refactorable. But it's simple enough to meet the needs of many solutions that don't need the complexity of further normalization. -- MichaelLeach
Costin is right, Michael is wrong. Look up the definitions of 4NF and 5NF (and 3NF and BCNF while you're at it). Try http://www.inconcept.com/JCM/June1999/becker.html as a start, and remember that 1-5NF are defined in terms of a single table, not the database as a whole. -- MishaDorman
Where the theory of normalization is really important from the practical point of view is in those situations where schemas in 3NF but not in 5NF might appear. By studying a little about database design, you get your eye trained to easily spot those situations and avoid some common pitfalls. Generally you don't want to have schemas that are in 3NF but not in 4NF; they are a nightmare from a very practical point of view. Schemas that are in 4NF but not in 5NF are somewhat rarer, but nevertheless the right solution from the practical point of view is to have them in 5NF.
Unfortunately this subject has been stolen away by a bunch of ignorant but popular authors, and is subject to more myths than reality. -- CostinCozianu
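For the curious, here is a small invented example of the 3NF-but-not-4NF situation mentioned above: an employee's skills and spoken languages are independent facts, so keeping them in one all-key table forces every combination to be stored and kept in step.

  -- In 3NF (every attribute is part of the key) but not 4NF:
  -- skill and language vary independently for a given employee.
  CREATE TABLE employee_skill_language (
      employee_id INTEGER     NOT NULL,
      skill       VARCHAR(50) NOT NULL,
      language    VARCHAR(50) NOT NULL,
      PRIMARY KEY (employee_id, skill, language)
  );

  -- The 4NF decomposition records each independent fact once.
  CREATE TABLE employee_skill (
      employee_id INTEGER     NOT NULL,
      skill       VARCHAR(50) NOT NULL,
      PRIMARY KEY (employee_id, skill)
  );
  CREATE TABLE employee_language (
      employee_id INTEGER     NOT NULL,
      language    VARCHAR(50) NOT NULL,
      PRIMARY KEY (employee_id, language)
  );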
many of us are not lucky enough to use something like GemStone for persistence
Well, you only think you are lucky. GemStone might support some mechanics for refactoring. But refactoring a database, when you already have existing data, is a problem of semantics. If you change the structure of data while the data is in the database, it is not the mechanical operations you have to worry about; it is all about the meaning and integrity of the old data, transformed to fit in the new structure. And whether it's a relational database or an object database that you have there, you're facing the same hard problem - an object database might only help you remain in ignorance of it.
Refactoring a database is no free lunch. You'd better not do it.
Just to toss something else in:
ExtremeProgramming was developed with code in mind. But databases are fundamentally different from codebases. I suspect that certain XP practices like RefactorMercilessly would not be developed if one were inventing XP for databases.
Databases aren't different from codebases. Databases can be converted to code (though often it would be a huge set of source files). -- JeffWinchell
I would submit that manipulating the database's structure would be better compared to using an IntegrationStation in XP; rather than integrating everything all the time, developers integrate code one piece at a time, as needed. Ditto with changing the database: modifying the database is expensive, so there needs to be a certain amount of shielding (see the ShieldPattern).
I'll also recommend BrianSlesinsky's suggestions in ContinuousDatabaseRefactoring. It solves a lot of the practical problems in modifying the database's structure. -- BrentNewhall
(I tried to factor this in as DocumentMode above, but I don't see how.)
I work with database intensive code, and refactor the database schema regularly. How? I work TestFirst, and my tests use an (in memory) AxionDatabase that is created in my setUp, destroyed in my tearDown, and is changed as easily as the code. Only once the schema is settled do we bother to create it in a big iron database. If we need to change it later, we follow the same cycle.
I highly recommend this approach.
There are only a few fundamental ways that a data model can be modified (this applies to OO models just as much as relational models).
Consider again the case of a city / customer table. Further, assume that we have two versions, one merged, the other split into the two tables listed. We can move the data both ways to match the code versions, and know that the database will never be in an inconsistent state. It may lose information; specifically, it will only retain the least common factor of information between all the versions being migrated through. In terms of objects (specifically Java, seeing as it has some use in this matter), this is like having your version control system understand that the SerialVersionUID is directly related to the version of the source for that class, and the class knowing how to use the version control system to get a strategy for updating a serialized version to its version, based on that understanding.
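A sketch of one direction of that move, going from the split customer/city version back to the merged one (column names invented; the other direction is the split shown in the earlier sketch on this page):

  BEGIN;
  -- Fold the city name back into the customer table...
  ALTER TABLE customer ADD COLUMN city TEXT;
  UPDATE customer
     SET city = (SELECT name FROM city WHERE city.city_id = customer.city_id);
  -- ...and retire the split structure (dropping the column also drops its FK).
  ALTER TABLE customer DROP COLUMN city_id;
  DROP TABLE city;
  COMMIT;
  -- What cannot survive the round trip is anything only one version can
  -- represent, e.g. a city row that no customer references.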
Now, there's still the issue that throwing away good data just because it doesn't fit the current version of the model isn't a good thing. If your db doesn't let you roll back transactions, throw it out. If your OO model isn't versioned in some way (through version control or whatever), then throw it out. If the version control you're using for your OO model doesn't do something sensible with the binary data your OO language throws at it, throw it out and get something that at least does binary diffs. This is what they're there for. The data can be brought up to the current version of the representation by your stored update-decisions. That's what _they're_ there for.
And for the record, I know I'm oversimplifying the nature of updating huge datasets. Of course, if you're really adding features and not merely bugs, you're going to have to do that anyway. There's nothing stopping you from developing with a small dataset such that updates are snappy, and then rolling out the update once. I'd be surprised (and quite concerned, actually) if you did this with live data. But even then, the worst case is lost/incomplete information, rather than inconsistent/incorrect information.
Most of the changes I see are of this type [equivalent information], but they aren't just name changes. More often, they are restructurings of data: de-normalize these tables for performance reasons, move columns from 3 existing tables to 1 new one, split this table into 2 tables, etc. These are the most common changes in all the refactorings I've worked with, for OO or relational.
  X A     X B
  -----   -----
  1 a     1 x
  2 b     2 y
  3 c     3 z

  X A B
  -------
  1 a x
  2 b y
  3 c z

  X A     X B
  -----   -----
  1 a     2 x
  3 b     3 y
  5 c     4 z

  X A B
  -------
  3 b y

  X A B
  -------
  1 a x
  2 a x
  3 b y
  4 b y
  5 c z

  X A     X B
  -----   -----
  1 a     1 x
  2 a     2 x
  3 b     3 y
  4 b     4 y
  5 c     5 z

  X A     A B
  -----   -----
  1 a     a x
  2 a     b y
  3 b     c z
  4 b
  5 c
Now, there may well be other transformations which don't change the information content; I just couldn't think of any offhand. But anything that affects the normal form of a table is by definition not an equivalent information transform.
If denormalisations (or other transformations that result in increased scope for inconsistent duplicate data) are done purely for performance or refactoring reasons (i.e. there should be no change in the logical behaviour) and result in "equivalent data", then the transformations should be accompanied (in the logical data schema) by appropriate cross-row or cross-table constraints to prevent the inconsistent duplicates that could otherwise occur. Such a "consistency constrained" but de-normalised schema doesn't change the information content, but does affect the normal form of the table.
Ideally those consistency constraints would be enforced within the physical data schema, but in some cases they are too complex, too hard to implement, or too slow to evaluate within the DBMS, and it may be pragmatic instead to enforce them through triggers or application code. (Recognising, of course, that the resulting procedural [non-declarative] implementation will likely require significantly more testing and cost more to maintain than a typical declarative constraint.)
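One declarative way to get such a constraint, sketched with invented tables: if customer_id is duplicated from orders down into order_line for performance, a composite foreign key keeps the duplicated value consistent with its parent row.

  -- order_id alone is the real key; this UNIQUE pair exists only so the
  -- composite foreign key below has something to reference.
  ALTER TABLE orders
      ADD CONSTRAINT orders_id_customer_uq UNIQUE (order_id, customer_id);

  -- order_line carries a denormalised copy of customer_id; the composite
  -- foreign key guarantees it always matches the parent order's value.
  ALTER TABLE order_line
      ADD CONSTRAINT order_line_matches_order_fk
      FOREIGN KEY (order_id, customer_id)
      REFERENCES orders (order_id, customer_id);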
See also: RefactorSql, DatabaseBestPractices