Continuous Database Refactoring

Assumption: the big corporate DataBaseAdministrator? is either on your side or non-existent. The process I outline will *not* work if you don't control the schema yourself either by modification or by a very responsive DBA.

"First thing we do is kill all the DBAs." ... I think Shakespeare meant to say that. Next, we realize that DatabaseIsSoftwareToo. -- JeffWinchell

Here's what I've been a part of on my XP team:

We create a script called "setup" which makes a copy of the production database(s) into a development sandbox. (For us, Integration is just another sandbox.)

"But Wait!", I can hear you cry. It's too slow!

In order to conquer the speed problem for large databases, you have to figure out your schema, and look for the large logging and other fast-growing tables, and when doing the database copy, you copy the schema from production, but you don't copy the data out of those tables. Or you only copy the most recent X rows, where X is related to the data chunks necessary for the system to function normally in the sandbox.

Relational integrity will likely be compromised by these simple sampling techniques. It's not too hard in RDBMS that has a system catalog to develop a generic solution that solves this. -- JeffWinchell

Now, right after a release to production, you have a pristine copy of the (truncated) production database. This means that not only do you have the most recent production schema, you also have any data anomalies that may be creeping into your production data.

The next part of the solution involves a minor enhancement to the setup script. After running the database copy, the setup script calls out to a "apply_schema_patches" program which basically has coded all the modifications to the schema that have been made since the last release, as well as inserts for any persistent data in new schema items, and any other data modifications that need to be done to bring the production database into line with the current TopOfBranch? code.

As you develop code and you find that you need to change the schema, there are two main paths.

1. if the schema change is simple: load your standard sql manipulation tool (sqlplus and derivatives) and make your schema change. Copy that schema change sql into your apply_schema_patches program, surround it with the appropriate code in the language the apply_schema_patches file is written in. Get the tests running, check in the code changes and the changed apply_schema_patches program, and voila, you have a program that you can use to update the schema on integration, get the tests passing again, and then the other developers can use the program to update the schema for themselves as well.

2. If the schema change is complex (rare): more particularly, when you have to make data migrations that can't be done in straightforward sql, the process is a little more time-consuming. The reason it is more time consuming is because in this case you make your changes to the apply_schema_patches program, and run the full setup program again to actually execute the schema changes and data manipulation. Assuming it works, and the tests all pass, you are golden. If not, you have to fix the apply program and rerun setup. In the rare cases where you have to do this, it can be time-consuming, and therefore, you might want to explore options as below.

In both cases, you end up with several advantages.

your sandbox is clean. There are no tricky, complicated bits to deal with the schema differing from the code as follows.
the integration remains trivial as in normal XP practice. There isn't a costly step to integration that requires the deferred database schema change.
you have a program that automates the schema changes that can be run during your release downtime while the production code is being updated to the latest revision
that program is very well tested by the continual integration and developer updating
that program is tested with the most current production data multiple times per day

Generally, you have to face the pain eventually, and like ContinuousIntegration, I think you'll find that in general, ContinuousDatabaseRefactoring is a better idea than deferring database refactorings.

-- Jeff Bay

How do you test for performance issues? At some point to you copy a larger chunk of data over?

-- DaveTauzell

Generally, we use test data in those tables, since it is faster to generate than copying over the whole data-set. Occasionally a performance problem with this tables will slip by for a while, but it gets caught in profiling very quickly. Once we know where a performance problem exists, tests pin that section down so that they don't creep back into being a problem again.

If we suspect a problem in this area, the tools we built allowed for the option of copying over the whole production database in order to sample a run with full data.

-- JeffBay

I have some experience of this. All systems have release cycles of some sort. As you say, in a corporate environment this is a heavyweight operation, and rightly so since there is significant risk. So what you need is two development systems. One is the QA environment where release procedures can be rehearsed and a full regression suite can be run. Then you've got the development box which can be unstable as you like. In this environment nothing is sacred under loose consensus. We used group e-mailing to warn of impending downtime and all operations were guaranteed to complete fast or be rolled back. I was the moderator of all database changes which for me meant splitting the development box into about five clones which could be booked out for high speed development. We got a full database build time, from disk upwards, down to about an hour allowing interactive integration debugging with repeated combing. All SQL was written as stored procedures (very important!). I suppose the point is to get yourself an environment you have complete control over, and work it harder than it ever has before :).

-- RichardHenderson

Here's how it works for my company:

Prerequisite: we are a very small company and are able to take our production database offline for short periods (long enough to do an upgrade), so that's not a problem for us. We aren't doing XP but I do try to encourage changes in that direction. We also have all code that touches the database in source control, so we can find the application code affected by a schema change.

We give each developer his own database account, and are fortunate that the production database is still small enough that we can copy it into a developer account. (That's unlikely to be the case much longer though - at some point we will have to use excerpts.)

Once the production database was deployed I created a script called "upgrade.sql". The specification of this script is that is should be able to upgrade any existing version of the database to the current development version. (The script can be run multiple times. If a change has already been made, it does nothing.)

The script contains PL/SQL similar to this pseudo-code:

  if not table_exists('foo'):
create table foo (...)

  if not column_exists('bar','baz'):
alter table bar add (baz ...)

Anyone who wants to make a schema change does so by adding some commands to the end of upgrade.sql. They test the upgrade script in their own environment and also test the application code and make any necessary changes. When they check in their code, they also check in upgrade.sql. A developer checking out the new code brings their schema up-to-date by running upgrade.sql.

When the code is deployed to production, we take the system offline and run the upgrade script just like we would in the developer environment. We know the upgrade script has been well-tested because the developers have been using it to upgrade their databases using a copy of the production data.

Of course not every schema change is as easy as adding a table or column. But the rule is that you write the data conversion routine when you change the schema. The schema change isn't complete without a data conversion script. (Just like it's not complete without unit tests.)

The upgrade script gets longer as people make changes so it gets unwieldy after a while. From time to time I move the oldest changes to another file, since they've already been applied for all known copies of the database.

(As the DBA I tend to make most of the additions to upgrade.sql, but the system would also work if people wrote more data conversion routines themselves.)

-- BrianSlesinsky

My approach is pretty similar to the ones described above, though there are a few differences.

I keep a full copy of DDL for the schema under revision control, along with data for lookup tables (complete if it's a small number of rows, or a reasonable sample if it's large) and data for tests. This loads in a few seconds to a minute or so, depending on the project, making loading up a fresh database for testing easy.
I generally don't make production data available to all developers, since it's usually sensitive data, often protected by privacy laws.
I use database update scripts as described above.
For one larger project, I built a schema diff program that would tell me what the differences were between two schemas, and I'd use this to diff a dump of the production database schema (not including data) and the DDL under revision control. This gave me confidence that I was actually testing against the schema that my production code was using.
It's definitely much, much better to do many small releases, when you're doing this kind of stuff. If you're clever, and it's important enough, you can find a fair number of ways to keep things running straight through upgrades. (I often do database upgrades in the middle of the day.) It costs more in compatibility code and testing, but less downtime always costs more.

-- CurtSampson

Try VMWare to solve the large database copy problem. We create a VM with the database, run VMWare in undo or non-persistent mode to try out the proposed code changes, and when we shutdown the VM, we can discard or keep those changes to the database.

Before using VMWare, we just had a background task that would copy one or a few copies of the database (4-6 GB of data) that we could use for testing when we needed it. The time it takes to do some tests, write some code is similar to the time it takes to do these background copies, so effectively little time is spent waiting for a new copy of the db to test. It does require more diskspace, but diskspace appears to be absurdly cheap for most situations. -- JeffWinchell

CategoryExtremeProgramming | CategoryAdoptingXp