Refactoring With Relational Databases

Refactoring can be especially challenging when a RelationalDatabase is involved. You can't get your DBA to change the schema every 10 minutes, and many of us are not lucky enough to use something like GemStone for persistence. Furthermore, if you're dealing with existing data, data migration can be difficult, time-consuming, and risky. There are a couple different opinions as to how to approach this subject:

Go ahead with the database refactoring, making liberal use of scripts for reliability and testability.

Requires that you have the cooperation of the DBA, or that there is no DBA in the first place. This strategy is a consequence of ExtremeProgramming thinking: If something is expensive, do it all the time and force yourself to make it less expensive. See ContinuousDatabaseRefactoring for specific advice.

Defer changes in the database.

Code is much more flexible than a database, so make your refactorings in your code, and fix your class-to-table mapping to make the code and the database line up. Defer schema changes and defer data migration until you really have to do it. (See ShieldPattern.) See HowToDeferSchemaChanges for specific advice.

Get it right the first time.

Refactoring databases will slow you down no matter how you approach it, and with the proper up-front analysis you should be able to design a schema that never needs refactoring. DontRefactorDatabases.

: That's exactly what Extreme Programming is not about. XP means being humble enough to admit that probably you will not be able to do it right the first time, but following the XP practices the cost to change from wrong to right will be low. -- FelipeHoffa

Contributors: JeffBay, StephenGryphon, PerrinHarkins, GeorgePaci, GeorgeDinwiddie

On the difficulties of data migration

Migration can be straightforward, it can be unbelievably nasty and high risk. I'd say that frequent complicated (fear inducing) data migration will slow refactoring and indicates a brittle environment: a poor fit with the XpFrame. -- KeithBraithwaite

Could you say some more about the forces that determine whether it is straightforward or risky? -- MartijnMeijering

Well, this depends a lot on your database. PostgreSql offers some fairly extensive refactoring options; you can add and delete columns, change constraints, and so on. The more complicated refactoring usually involve several steps, adding new tables or columns without constraints, copying data as necessary, creating new constraints, dropping old constraints, and then dropping old columns or tables. This can get pretty complex if you have extensive constraints, but the scripts are eminently testable. Even better, it can all be done in a single transaction (PostgreSql is one of the few RDBMSes out there that lets you roll back table creates and drops), so you can do a test run against your production database (or a copy thereof) before doing it for real.

Dealing with different versions of software can be more of an issue. Often you can create views that show the new data in the old way, at least closely enough for the old programs to work. If you're concerned about things changing in more drastic ways, you can also create functions in the database to do some of the work and then the function can present a constant or versioned interface to the application code. Sometimes it's just as simple as giving a new column a default value, rolling out the schema change, and then updating the software after that.

More difficult, and less often needed would be a procedure such as:

put a version number in the database
have the insertion routines look at it and do different things depending
do the above shuffle and increment the version number before you commit the transaction
remove the old code

-- GrahamHughes, with extensive updates by CurtSampson

There are the annoying not-tool-supported technical issues of making database changes: Unless you're adding NULLable (or defaulted) columns, you'll have to do rename-copy-drop on the table.

Drop inter-table relational integrity constraints.
(If your DB supports this...) Rename the table(s).
Create new table(s) a la changed schema.
Perform queries to populate the new tables. (Non-trivial schema changes may require non-trivial SQL queries.)
Recreate all the appropriate inter-table relational integrity constraints.

-- JeffGrigg

Who controls the database?

In many organizations, relational database technology is considered to be a strategic corporate resource, requiring centralized control by specially trained database "experts." And, to a large extent, the design, tuning, and operation of relational database schemas requires skills and experience beyond that of your average developer.

So, relational databases are often under the strict control of a centralized group; nothing (other than data) can change in a database without having them doing it for you. Some companies take this to extremes, requiring centralized control of development/unit testing databases with the same rigor of production databases. And a few insist that no database change can be made without having a face-to-face meeting between members of the development team and members of the centralized DBA team (so they can "do" the database design for you). Needless to say, these policies increase the cost of change and are anti-XP.

-- JeffGrigg

Related: DbasGoneBad

When the developers must share a common database

Also, in many business system development projects, the developers must share a common database. So, if one pair were to change something in the database schema, it would immediately break code and stop the work of all other pairs attempting to work at the same time. (This violates the commonly understood requirement that each developer (or pair) must have their own isolated "play pen" in which to do development and unit testing. But this argument doesn't seem to hold much weight in many corporate development environments.) Problem with "play pens" is that often developers don't know enough database administration to quickly change the schema. They opt for a shared development database so that schema changes have to be done only once, by the most DB-skilled man on the team or a shared DB admin.

-- JeffGrigg

When multiple teams must share a common schema

My advice is DontDoThat. Break the schema (logically) into separate pieces, each of which is under the control of one team. Don't treat the database as a giant global variable, open to all comers. Rather, give ownership of each piece of the data to a piece of the system and make all accesses through that system. This encapsulates the data with behavior and prevents semantic misunderstandings of the data. It also allows each team/module/sub-schema to refactor at will.

Problem is, in practice 90% of this behaviour is CRUD (a lot of loadObjectX/saveObjectX methods), complete with an application-specific query language, thus ReinventingTheWheel. Also, in my experience behaviour (and method signatures) tends to change faster than underlying data, even when costs of modifying data and behaviour are the same.

Sharing the database is also the easiest way of integrating various applications (but is not an integration panacea - sometimes you really need to share behaviour). SQL is a joy compared to serious Web Services (SAP SOAP interfaces make a grown man cry...). And giving each application ownership of the data will lead to PointToPointIntegration, which will require a major EnterpriseApplicationIntegration project to reintegrate.

Deployment issues

I must admit, that I still doubt that it would work to RefactorMercilessly in a large system with relational databases that have actually been deployed. It should certainly be possible for each developer to have a small development and test instance of a database, with a small set of data, but the problem as I see it concerns deployment. Whether you deploy it often or not, there will be a disturbance for the customers that is roughly proportional to the number of changes you make in the database schema. If the changes are rolled out often, there will be many down-time occurrences and plenty of evening or weekend work for the DBA's. With fewer rollouts, there will be larger changes at each delivery, and thus longer down times, more to test, and bigger risks...

Some mixed points...

Views - are a GoodThing that is often forgotten, but usually, you can't update a view that consists of more than one table, so it won't really solve the problem described above. (Actually, some databases, such as PostgreSql, will let you write an arbitrary function to deal with view updates, which effectively lets all views be updatable if you can figure out by what rules you want things updated.)
Downtime - In the systems I have worked with, the large customers have such database sizes that it might take a day or so to load just one big table. To make extensive changes in the database schema more than a few times per year is hardly acceptable then. The XP thing to do would be to refactor as soon as you find that it would improve code simplicity, readability or whatever, and that could cause a schema change in the next minor release, even though this change wasn't crucial to the customer who needed the new release.
Risk - Even with plenty of tests, every change to the schema in a mission critical deployed database system is a risk. People usually try to avoid these things...
Performance - It might well be too expensive performance wise to have fixes in the code for retrieval or insert/update that would affect two (or more?) tables instead of one. (See discussion on A, B and F above.)
Schema private or public? - From an OO point of view I want to see the database schema as something private - the persistent storage of our private attributes. I want all access to go through the public methods of our classes. Unfortunately, databases aren't always viewed like that. Several applications might share a database without sharing application code. The database schema might be seen as a public interface to the data. This basically means that most database refactoring violates the the OpenClosedPrinciple. I certainly think such a view of databases is bad, but we often have to live with a legacy...
Wide deployment - The more customers you have running your system, the worse it gets. Even if you don't make shrink wrap software, you might have a few hundred, or at least tens of installations. There might also be different customizations at different sites. This certainly increases the need for XP practices such as automated test and build and version control, but it also makes me reluctant to change the system more than I must...
License costs - depending on your database it can be very expensive to have more connections and/or more servers.

True, this also depends on what tools we use for programming. With flexible tools and a restricted subset of SQL we might be able to run a completely different RDBMS for our local tests. Also, it seems to become more and more common to give development licenses for free, and just charge money for deployment.

Still, the situation might not be hopeless. There are a lot of database design decisions that affect these issues. Do you have one or a few large databases, or several smaller? Same thing with tables. Perhaps we need to discuss RefactoringFriendlyDatabaseDesign!

Parts of these concerns are only problematic if you touch a few main tables in the system. Most databases I've used have a large number of fairly small tables that could probably be refactored without big problems. On the other hand, it's with the big tables we have a lot to gain...

BTW: How do people handle version/change control of database schemas?

There might also be database content that is delivered by the developers rather than produced by the customers, for instance error messages or calendar data (which days are bankdays etc). One solution (that I find reasonable) is not to put such data in the database, but to keep it in the code or in text files. At least that leads to a little less database headache...

If you do such, it's often difficult to do the equivalent of joins with DB data. If you wanted to run a report showing sums by calendar date for one year, you'd have to run 365 separate queries if the dates come from a file. -t

-- MagnusLyckaa

Version control & change control of database schemas BTW: How do people handle version/change control of database schemas?

The way I handle all of these problems (i.e. RefactoringWithRelationalDatabases) is to use SQL command scripts. Instead of using the interactive SQL tool for the database, I devise a script consisting of SQL DDL/DML commands to create (and populate, if necessary) the table/view/what-have-you. Then I just check the scripts into CVS and voila!

With this setup, you can even write unit tests for the scripts. If there are dependencies in the schema, you'll have to be more careful in the way you write and call the scripts. I like to handle this with make files, but you may have a better idea. If your RDBMS tool supports it, you can even have make execute the database scripts during the make process.

This setup is especially useful in situations where you know you're going to have to drop/recreate/repopulate your database many times. I've used it successfully with MS SQL Server and Oracle, though I am sure other RDBMS support the technique.

: -- PrestonRickwood

Post Scriptum: If you are working with an existing application/schema (or have already created the schema for your project), many RDBMS include a "build script" command that will build the scripts for your schema automatically. Try it!

Database Extensibility Techniques:

many of us are not lucky enough to use something like GemStone for persistence

A very good point, and for those using a RDBMS there are some database design techniques that will provide some OODBMS-like extensibility that may delay the need for schema refactoring. The downside to these techniques is the inability, or difficulty in querying extended data.

1) Storing XML in a text column named 'XMLProperties' is a classic way to dynamically extend a database entity without changing the schema. {It is tough to query such XML however. Related: MultiParadigmDatabase}

2) BLOB or Image storage. Persisting serialized object data (not object behavior) in an 'ExtendedData?' binary image column.

These techniques work well for a table such as 'Contacts' where you will have the obvious columns (name, address, city, phone, etc...) and an extensible column, such as XML or binary image. When the customer requires that an alternate mobile phone number be added as a contact field, you can quickly extend your contacts table without touching the DB.

But don't get too hooked on this practice. If your customers next request is to generate a report showing all alternate mobile phone numbers in area code 503, standard SQL will not be enough to reliably SELECT the extended data. You'll have to write some extra logic in the object-relational mapping layer.

A good rule of thumb is to keep extended column data storage to less than 2-3 21 items. Any more than 3 is a sure sign that the relational database is in dire need of re-normalizing and poorer performance is setting in. As long as the extended data is not needed in queries, you can extend indefinitely without touching the DB.

Another Advantage-Disadvantage: Consultants in the field, and their customers, will love the ability to customize the application, but at the expense of complicating migration to newer releases.

-- MichaelLeach

This is bad advice, pure and simple. First of all you fail to see that extending (adding columns) is essentially a lot easier than refactoring, so if you want to add a new column with significance to the application model(s), you have to do just that: add it. If you want your user to dynamically add new properties/data, you can do it relationally, and you won't even have to refactor either the schema or the application to support such a requirement.

If you extend it your way, and later you'll find out that you need refactoring (get some data out of the damn blobs, and put it the way it should have been from the beginning), well, good luck with the maneuvers.

To put structured data in XML blobs or in binary serialized blobs is doing lots of harm, and has no benefits compared to the usual solutions, which have all the known advantages (data integrity, manipulation, etc, etc ). Blobs should be strictly reserved for things like images, video, unstructured text content, things that are a whole, not a bunch of parts that you need to access separately. This is the only real rule of thumb.

Would you care to explain what exactly you will solve, and how you've got your rule of thumb? -- CostinCozianu

Only if you'll explain why binary storage should strictly be used for images and video and XML in databases is bad. This sounds like a rather feeble maxim to me. -- ML

Well, when you're using a RDBMS, you use it because you want it to resolve the problems of data structuring, data manipulation and data integrity, together with a few others. Now when you stored structured data in binary or XML, you give up everything a DBMS can and should be doing for you, and move it to be the application's responsibility. It is a bad engineering practice for all the reasons we are not programming with Cobol indexed files anymore and we have DBMSes instead, there's no point in repeating here what any decent book on databases states in the introductory chapter.

The problem is that if you want to make an exception from the rule, first you have to acknowledge it as such. There's absolutely no rule of thumb in here, what you recommend is contrary to basic principles of database design. Then you have to justify what the exception is for. Does it solve a real problem, and are there no better ways? -- cc

Enterprise application developers cannot know field customization demands in advance. You may not be able to assume field consultants have DB skills. These techniques, with a simple middle-ware abstraction interface, provide the needed flexibility for simple customization.

These options are good intermediary tactics until a long term, FourthNormal? tuple table strategy is needed in the infrastructure. Sure, you can build extensibility into your infrastructure relationally. These are simply faster, cheaper options.

BLOBs have long been used for storing serialized data and many DBMSes now provide intrinsic support for XML anyway. Clinging to legacy relational principles is artless in light of the latest technology IMO. -- ML

Well, you have to define "field customization" for me, but if it is what I think it is, you're wrong that your solution is cheaper and faster than doing it relationally. When the long term fourth normal tuple table strategy comes around (do you have anything against the fifth normal?) your data buried into blobs, be it XML blobs, will give enough headaches, as to forget about the cheaper and faster consideration. Fast food is also cheaper and faster and sometimes called junk food for good reasons.

If BLOBs have long been used for storing serialized data, it is also true that design mistakes based on presupposition and "rules of thumb" without any serious arguments have also been plenty. They even made it to diagrams shown as samples of good designs in the books, especially the OO ones, to prove that ignorance reigns supreme in certain circles. So hardly good enough for your lack of arguments.

As to what regards clinging to "legacy relational principles", this refrain was sung in the early 90's by ODBMS vendors. There were tons of latest technologies that just vanished into thin air. Among many other reasons for the fall of ODBMSes, they ignored the few theoretical OO Data Models, which, strangely enough, were building upon the ideas of the relational model. Relational model is a legacy in as much as Newton's mechanics, Maxwell's electrodynamics or classical mathematics are legacy. Do not think that waving the "latest technology" flag is going to solve the database management problems, problems that were already solved by the relational model, while the new kids on the block are trying to reinvent some squared wheels. XML support in databases, well it is some, kind of. It depends on your interpretation of "support". Why don't you store the whole schema in XML then? Because XML ain't no data model, it pretty much sucks at it. -- cc

Interesting rant... Can you point me to any FifthNormal? products that you've worked on? I suspect you're starting to waste my time.

Should I take it that you're asking me when I worked on a project where the database schema was in the fifth normal form? Well, it so happens that I've never designed a schema that was not in the 5NF. There's no justification whatsoever for having a schema in 3NF but not in 5NF. If you don't waste my time you know what I mean.

Hello, this is a different interlocutor. You say you never designed a schema that was not in 5NF, but question: were any of the designs actually implemented, or was it an academic exercise and the real implementations all denormalized the design somewhat to make them actually useful?

Yes, they were implemented. Sorry for this rant from my part, but you might be mislead by some opinions that normal forms higher than 3NF are only good as an "academic exercise" and that "over-normalization" - that's what they called the 5NF, is bad for performance. This is wrong, and is based simply on ignorance of many authors. First of all the most common schemas like the ones in the "ndNorthwind" sample from MicroSoft databases (with customers, products, orders, orderlines, shipping, employees, etc) are in 5NF. Just if you follow your intuition, without any special training in database design, great chances are that you might end up with a schema that is already in 5NF. Nothing magical or academic about that. But sometimes common intuition is not enough, because there might be some tricky parts.

To be in 5NF, wouldn't the Northwind example have to normalize the Employees and Customers tables to be a Person table with a foreign key reference to a PersonType?? I also thought 3-5NF would move the contact information away from the Person table and into a ContactProperty? table with Person-Contact intersections in a PersonContactInformation? table. The task of refactoring then becomes an exercise in adding new Contact rows instead of Person columns.

My take is that Northwind is more like 3NF and not easily refactorable. But it's simple enough to meet the needs of many solutions that don't need the complexity of further normalization. -- MichaelLeach

Costin is right, Michael is wrong. Look up the definitions of 4NF and 5NF (and 3NF and BCNF while you're at it). Try http://www.inconcept.com/JCM/June1999/becker.html as a start, and remember that 1-5NF are defined in terms of a single table, not the database as a whole. -- MishaDorman?

Where the theory of normalization is really important from the practical point of view is in those situation where schemas in 3NF but not in 5NF might appear. By studying a little about database design, you get your eye trained to easily spot those situations, and avoid some common pitfalls. Generally you don't want to have schemas that are in 3NF but not in 4NF, they are a nightmare from a very practical point of view. Schemas that are in 4NF, but not in 5NF are somewhat rarer, but nevertheless the right solution from the practical point of view is to have them in 5NF.

Unfortunately this subject has been stolen away by a bunch of ignorant but popular authors, and is subject to more myths than reality. -- CostinCozianu

many of us are not lucky enough to use something like GemStone for persistence

Well, you only think you are lucky. Gemstone might support some mechanics for refactoring. But refactoring a database, when you already have existing data is a problem of semantics. If you change the structure of data, while the data is in the database, it is not the mechanical operations you have to worry about, it is all about the meaning and integrity of old data, transformed to fit in the new structure. And if it's a relational database or an object database that you have there, you're facing the same hard problem, only an object database might help you in being in ignorance.

Refactoring a database is no free lunch. You'd better not do it.

Just to toss something else in:

ExtremeProgramming was developed with code in mind. But databases are fundamentally different from codebases. I suspect that certain XP practices like RefactorMercilessly would not be developed if one were inventing XP for databases.

Databases aren't different from codebases. Databases can be converted to code (though often it would be a huge set of source files). -- JeffWinchell

I would submit that manipulating the database's structure would be better compared to using an IntegrationStation? in XP; rather than integrating everything all the time, developers integrate code one piece at a time, as needed. Ditto with changing the database; modifying the database is expensive, so there needs to be a certain amount of shielding (see the ShieldPattern).

I'll also recommend BrianSlesinsky's suggestions in ContinuousDatabaseRefactoring. It solves a lot of the practical problems in modifying the database's structure. -- BrentNewhall

(I tried to factor this in as DocumentMode above, but I don't see how.)

I work with database intensive code, and refactor the database schema regularly. How? I work TestFirst, and my tests use an (in memory) AxionDatabase that is created in my setUp, destroyed in my tearDown, and is changed as easily as the code. Only once the schema is settled do we bother to create it in a big iron database. If we need to change it later, we follow the same cycle.

I highly recommend this approach.

There are only a few fundamental ways that a data model can be modified (this applies to OO models just as much as relational models):

Equivalent Information - The trivial case of merely changing to another form. This is pretty much limited to operations such as renaming entities (i.e. methods, columns), although a mildly interesting case shows up in splitting or composing tables such that they still share the same primary key set (candidate key sets?).
Removing information - Covers the cases of removing a method or column, but also covers the cases of removing a relation (in the sense of LinksAreContent). In the later case, this would have the effect of reducing the normal form; consider a city table inlined back into a customer table. The customer table will now have all the city information duplicated each place the city is mentioned. Semantically, this represents a statement to the effect that we may no longer be talking about the same city just because it has the same name. Note that this is generally only useful as a means to an end, in the same way that we may introduce duplication in our code with the goal of reducing more duplication than we added.
Adding information - Adding columns, new object types, new methods, new tables, etc. But more than this. Adding new relations, new structures to the model. The important thing to note here is that it may not be possible to do this automatically by simply declaring the nature of the information: you can't add a new column or instance variable by simply stating it's there, you need to specify a value for each row/instance (often achieved through use of a default value). More importantly, you can't extract a new table without specifying what to do with any resulting duplicates. If the only candidate key contains every key in the new table, you're not gaining anything. If we have two logical cities 'Saskatoon', one with a population of '200 000' and the other '250 000', which one do we believe? (This is really already a composite of adding and removing information) This is almost identical to the situation of factoring out some duplicated code, and realizing that they don't do the quite the same thing: the computer can't make the decision of whether that's a bug or a feature, that decision requires understanding of the semantics, an answer to the question "What does it _mean_ to do this differently?".

All updates can be thought of in terms of these basic operations. And here is the problem: in order to support the fast and merciless refactoring that we all know and love (well, present company accepted :p), we have to be able to codify all of our decisions on these matters. "If extract table runs across a would-be duplicate primary key, assume the existing primary key is correct and continue/modify duplicate to avoid conflict/call for help" If these decisions are made with the intent of doing the right thing, and are automatically associated with a particular version of the db, then all data can be migrated to any particular version, with the following guarantee: The database will not contain inconsistent data. It might however contain incomplete or incorrect data, but whatever is there will be consistent; this strategy is inherently conservative. It's vaguely similar to using a checked buffer to keep a string from overflowing: you can't expect to get the full string back by applying the operation in reverse given only the result of that operation.

Consider again the case of a city / customer table. Further, assume that we have two versions, one merged, the other split into the two tables listed. We can move the data both ways to match the code versions, and know that the database will never be in an inconsistent state. It may loose information, specifically, it will only retain the least common factor of information between all the versions being migrated through. In terms of objects (specifically java, seeing as it has some use in this matter), this is like having your version control system understand that the SerialVersionUID is directly related to the version of the source for that class, and the class knowing how to use the version control system to get a strategy to update a serialized version to its version based on that understanding.

Now, there's still the issue of it not being a good thing throwing away good data just because it doesn't fit the current version of it. If your db doesn't let you roll back transactions, throw it out. If your OO model isn't versioned in some way (through version control or whatever), than throw it out. If the version control your using for your OO model doesn't do something sensible with the binary data your OO language throws at it, throw it out and get something that at least does binary diffs. This is what they're there for. The data can be brought up to the current version of the representation by your stored update-decisions. That's what _they're_ there for.

And for the record, I know I'm oversimplifying the nature of updating huge datasets. Of course, if your really adding features and not merely bugs, you're going to have to do that anyway. There's nothing stopping you from developing with a small dataset such that updates are snappy, and then rolling out the update once. I'd be surprised (and quite concerned actually) if you did this with live data. But even then, the worst case is lost/incomplete information, rather than inconsistent / incorrect information.

"Hey, my bank card doesn't work": "Oh, I'm sorry, we'll have it up and running again in 2 minutes while we recover it from a backup"
"Hey, this atm says I have $10 000, my online banking says I have $4 800, and the pizza company is apologizing profusely for accidentally advicing my transaction to $500 000": "Oh, I'm sorry... do you happen to know which one was correct?" (And, what happens if nobody notices for a while?)

-- WilliamUnderwood

Most of the changes I see are of this type [equivalent information], but they aren't just name changes. More often, they are restructuring of data. De-normalize these tables for performance reasons, move column from 3 existing tables to 1 new one, split this table into 2 tables, etc. These are the most common changes in all refactorings I've worked with, for OO or relational.

: Denormalizing a table changes the information content. If it denormalized, then each occurrence of a row from the original table can possibly be changed uniquely. The content is different. Instead of a reference to a single fact, you have a reference to many facts, some of which are coincidentally the same, as described below. Now, you might be able to constrain them back to the same semantics, but at this point you're either reimplementing the original table structure, or introducing bugs. Now, if the app is okay with the multiple-identical facts, then okay, but it's simply not representing the same body of information.
: Now, there may well be other transformations which don't change the information content, I just couldn't think of any offhand. But anything that affects the normal of a table is by definition not an equivalent information transform.
: -- WilliamUnderwood

It changes the content of one table, but not the entire set of tables. The same information lives in the database (or the object model). Only it's location has changed.

: Um, that's changed information. As stated by MishaDorman? above, "1-5NF are defined in terms of a single table, not the database as a whole". Now, if the old table corresponds one to one with the new table, you're right, it doesn't change the normal form, and therefore neither the content, by definition. Denormalization isn't that though. If they're one to one, combining two tables into a single one doesn't change the content, because it doesn't change the normal form. Of course, then there probably isn't much to gain by splitting them unless it is to allow another operation.
: Now, any operation involving tables which are not a one to one correspondence will involve some information change. If we remove the many-to-one / many-to-many by combining them, then you loose the information that the various instances with the same values are referring to the same facts. Again, by definition, because they can now change independently. If you split a single table into two with a many-to-one / many-to-many relationship, you are adding information, specifically that they represent the same fact (... because they can no longer change independently).
: To reiterate, if it's truly only the location which has changed, than the information content hasn't changed. The things you can do with it, and the body of facts you can derive from it hasn't changed. If both tables share the same primary (or candidate) key set(s), the information has not changed. If they don't, then it has.
: And one more time just to make sure that my own mistakes don't cloud my point... A Table is a set of tuples. If splitting a table into two doesn't result in the elimination of duplicates from either one, then it didn't change the information content. Likewise, if in composing two tables each element from each table appears exactly once, then the information content didn't change.
: What I tell you three times is true :), And Thanks for bring the one-to-one operations to my attention... I knew I forgot something :)
: -- WilliamUnderwood

I don't buy it. If I start with X,A in one table and X,B in another table and then denormalize it so that I have X,A,B in the same table, I still have the same information I started with. I've just moved it around. I bring this up because it seems to be the cause of most headaches when migrating data.

: Perhaps an example is in order.

 XAXB
 -----------------
 1a1x
 2b2y
 3c3z

: Joining them results in

 XAB
 ----------------
 1ax
 2by
 3cz

: This is what I meant by one to one. In contrast,

 XAXB
 -----------------
 1a2x
 3b3y
 5c4z

: results in

 XAB
 ----------------
 3by

: Now, you say, no fair, maybe I don't want an inner join. Fair enough, if it's not an inner join, then this is the case I stated before where a candidate key was shared between tables. We're not in disagreement here. But there are other cases:

 XAB 
 ---------------- 
 1ax 
 2ax
 3by 
 4by
 5cz

: Splitting this into {X,A} and {X,B} we get

 XAXB
 -----------------
 1a1x
 2a2x
 3b3y
 4b4y
 5c5z

: Uninteresting. Okay, let's try splitting it instead into {X,A} {A,B}:

 XAAB
 -----------------
 1aax
 2aby
 3bcz
 4b
 5c

: This _does_ have a different information content. We've stated that the property that each unique A has a unique B is not a coincidence. The information is implicit in the constraint that we can no longer change a value of B for a particular value of X. Likewise, rejoining the tables removes the information embodied in the constraint.
: To be more specific, the information is of the form "For every A, there is exactly one B" as opposed to "For every X, there is exactly one B".
: Finally, note that in the first example, the final table {X,A,B} is in 5th normal form, and it therefore isn't 'denormalization'. There simply isn't any duplication there, where as in the second example, the table {X, A, B} has duplication, and is therefore in 1st normal form (or is it second normal?).
: -- WilliamUnderwood

P.S., just wanting to make sure, that the above explanation only applies to relation algebra... not SQL unless you select distinct or otherwise make sure duplicates don't occur.

Now, there may well be other transformations which don't change the information content, I just couldn't think of any offhand. But anything that affects the normal of a table is by definition not an equivalent information transform.

If denormalisations (or other transformations that result in increased scope for inconsistent duplicate data) are done that result in "equivalent data" and are done purely for performance or refactoring reasons (i.e. there should be no change in the logical behaviour) then the transformations should be accompanied (in the logical data schema) by the putting in place of appropriate (cross-row or cross-table) constraints to prevent the inconsistent duplicates that could otherwise occur. Such a "consistency constrained" but de-normalised schema doesn't change the information content, but does affect the normal form of the table.

Ideally those consistency constraints would be enforced within the physical data schema, but in some cases they are too complex/hard to implement/slow to evaluate within the DBMS, and it may be pragmatic instead to enforce them through triggers or application code. (Recognising of course that the resulting procedural [non-declarative] implementation will likely require significantly more testing and cost more to maintain than a typical declarative constraint.

CategoryAnalysis CategoryExtremeProgramming