Denormalization Is Ok

I work at a company that is doing the following:

This creates an amazing amount of duplication, as you can imagine. I grabbed a sample set of data, and 95% of it was redundant.

When a coworker and I undertook the task of making the data relational, we hit some serious organizational resistance. It turns out that the decision to denormalize was on purpose (!). The justification was that it's easier to query ("Get me all the products that are red").

Am I missing something? It seems to me that if a denormalized view is useful, it should be generated from an AuthoritativeSource that's normalized.

I think I'm going to be in the midst of a data management nightmare in a few months.


Agreed. Denormalization should be a last resort if nothing else can be done to speed things up. It's a basic relational database design principle that data should be stored OnceAndOnlyOnce.

-- DavidPeterson


Or perhaps these are different products, that just happen to be similar in attributes other than colour. Where is the denormalisation then?

By normalising, you are saying that conceptually, these colours are kinds of the same product, or subclasses of a product. But perhaps these concepts are not the most useful.

(More likely you are dead right, but I thought it was worth saying.)

I've seen this done because they didn't want to add an index. Pretty funny, until my last contract, where they bought an E64K (a mega-buck, Cray-equivalent Unix server) with gigabytes of memory rather than add an index. Indexes must be heap scary to some people.


In DataWarehouse building, denormalization seems to be an acceptable technique. For instance, you may have summary tables that contain information that you could get with a query, but not so quickly. The StarSchema and other techniques seem to flout NormalForm regularly. (Or am I wrong? We're implementing a DataWarehouse soon, so I've been reading up, but maybe those with more experience could correct me?) --PhilGroce


If your data isn't changing then DenormalizationIsOk, but if your data is changing then you should be very wary about denormalizing - it's so easy for duplicated data to become out of sync. Writing insert, update and delete triggers to update multiple tables is not straightforward and needs careful planning and testing.
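For what it's worth, here is a minimal sketch of the kind of trigger work the above is talking about (Python with sqlite3; the products/order_lines tables are made up for illustration). Once a column is duplicated, every write path to the original needs something to propagate the change, and missing even one path lets the copies silently drift apart.

 # Sketch only: a denormalized copy of products.name lives in order_lines,
 # so a trigger is needed to keep the copies in sync on update.
 import sqlite3

 db = sqlite3.connect(":memory:")
 db.executescript("""
     CREATE TABLE products (
         product_id INTEGER PRIMARY KEY,
         name       TEXT NOT NULL,
         colour     TEXT NOT NULL
     );
     -- Denormalized: product_name is a copy of products.name.
     CREATE TABLE order_lines (
         line_id      INTEGER PRIMARY KEY,
         product_id   INTEGER NOT NULL REFERENCES products(product_id),
         product_name TEXT NOT NULL,
         qty          INTEGER NOT NULL
     );
     -- Without this trigger (and similar ones for every other write path),
     -- the copies drift apart after the first UPDATE.
     CREATE TRIGGER products_name_sync
     AFTER UPDATE OF name ON products
     BEGIN
         UPDATE order_lines
         SET product_name = NEW.name
         WHERE product_id = NEW.product_id;
     END;
 """)

 db.execute("INSERT INTO products VALUES (1, 'Widget', 'red')")
 db.execute("INSERT INTO order_lines VALUES (1, 1, 'Widget', 10)")
 db.execute("UPDATE products SET name = 'Widget Mk2' WHERE product_id = 1")
 print(db.execute("SELECT product_name FROM order_lines").fetchone())  # ('Widget Mk2',)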

It may be an issue of short-term versus long-term. Normalization is the better long-term bet because it allows new requirements to be added with minimal fuss. If you optimize the schema for a given current (short-term) usage, you may indeed speed up that one operation, but you open yourself to risk down the road. I have seen a lot of companies with messed-up schemas, and it costs them. Revamping would be a huge investment because it would impact a lot of applications, so they live with bad schemas. Perhaps FutureDiscounting was overdone. I think it may be better to find a compromise, such as periodic batch jobs that create read-only tables for special, high-performance needs. That way the live production tables can remain clean. --top
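As a rough sketch of that compromise (Python with sqlite3; the products/product_colours/stock tables and the product_report name are hypothetical), a periodic batch job might rebuild a read-only reporting table from the normalized live tables, so the denormalized copy never receives direct writes:

 # Sketch only: the live tables stay normalized; a reporting table is rebuilt
 # on a schedule and treated as read-only by applications.
 import sqlite3

 db = sqlite3.connect("shop.db")

 def rebuild_product_report(conn):
     """Recreate the denormalized read-only table from the normalized source.
     Assumes the normalized tables products, product_colours and stock exist."""
     conn.executescript("""
         DROP TABLE IF EXISTS product_report;
         CREATE TABLE product_report AS
         SELECT p.product_id,
                p.name,
                c.colour,
                s.units_in_stock
         FROM products p
         JOIN product_colours c ON c.product_id = p.product_id
         JOIN stock s           ON s.product_id = p.product_id;
     """)
     conn.commit()

 # Run from cron or another scheduler, e.g. nightly:
 # rebuild_product_report(db)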


You can refactor towards a normalized database by moving the data into new, normalized tables and then building a view with the name of the old denormalized table, so that existing queries keep working.
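A small sketch of that refactoring (Python with sqlite3; product_flat and the other names are invented for illustration): the data moves into normalized tables, and a view under the old table's name keeps existing queries, such as "get me all the products that are red", working unchanged.

 # Sketch only: a compatibility view replaces the old denormalized table.
 import sqlite3

 db = sqlite3.connect(":memory:")
 db.executescript("""
     -- New, normalized tables.
     CREATE TABLE products (
         product_id INTEGER PRIMARY KEY,
         name       TEXT NOT NULL
     );
     CREATE TABLE product_colours (
         product_id INTEGER NOT NULL REFERENCES products(product_id),
         colour     TEXT NOT NULL,
         PRIMARY KEY (product_id, colour)
     );

     -- Compatibility view under the old denormalized table's name.
     CREATE VIEW product_flat AS
     SELECT p.product_id, p.name, c.colour
     FROM products p
     JOIN product_colours c ON c.product_id = p.product_id;
 """)

 db.execute("INSERT INTO products VALUES (1, 'Widget')")
 db.executemany("INSERT INTO product_colours VALUES (1, ?)", [("red",), ("blue",)])

 # Old-style query still works unchanged:
 print(db.execute("SELECT name FROM product_flat WHERE colour = 'red'").fetchall())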


It is generally accepted that a DatabaseIsRepresenterOfFacts. This has implications for how we interpret a normalized schema vs. a denormalized schema.

A schema is fully normalized when we cannot state a fact more than once. A denormalized schema, on the contrary, allows us to state a fact more than once, which means the same fact has to be recorded (and kept up to date) in several places, and the copies can come to contradict each other.

I agree with the thrust of what you are saying, but I didn't think that this was the usual definition of normalized. Normalisation (1NF, 2NF, 3NF, BCNF, 4NF, 5NF; also 6NF if you use DateAndDarwensTemporalDatabaseApproach?) avoids _particular_ kinds of redundancy and update anomaly, not all kinds -- it doesn't preclude "roll up" running totals in a separate table, for example, because it deals with relations individually. Also, a schema may be "unnormalised" but still prevent a fact from being stated more than once using additional constraints (and such a schema may be preferable to a normalised one, even as a logical schema, because normalisation beyond 3NF (IIRC) may make certain constraints harder to write). I think we need a different term for a schema (relations+constraints) in which it is impossible to be inconsistent. "Consistent" itself is rather overused, perhaps, so maybe "consistency-constrained"?

Therefore a denormalized schema is never justified on logical grounds. The usual justification for a denormalized schema is that it helps performance, but this is hardly a good justification.

{If the boss/owner wants a faster system, you either give it to them or lose your job to somebody who can. Purists are not treated very well in the work world, for good or bad. Denormalization for performance takes advantage of the fact that any given cell generally has a read-to-write ratio of roughly 10:1 or more in practice. As long as this lopsided relationship exists, it appears that denormalization, or at least "join-and-copy", is an economical way to gain performance. It is roughly analogous to caching on a larger scale: if you read the same item over and over, it makes sense to keep a local copy instead of fetching it from the "master copy" on each read.}

{{*First*, (even with current SQL DBMSs) there are better ways to achieve this than corrupting the logical data model: materialised views (indexed views, for SQL Server), for example. *Second*, this is the worst kind of premature optimisation -- not only are you spending the effort when it might not actually help (you've made the data bigger, and therefore slower, than it should be in cases when you don't need the "joined" columns), but in the process you are corrupting perhaps the most important public interface in the system, on which there will be many, many dependencies (what happens when a new, more performance-critical app using the same data wants an incompatible "denormalisation"?). *Third*, in many cases, any performance benefits are lost in the additional validation needed on update to enforce consistency ("cache coherency", if you want to think of it as a cache) between the duplicates -- although queries may be 90% of the load, updates have a disproportionate effect (due to the "strength" difference between X and S locks, and because the "denormalisation" often means that the consistency checks are expensive, as they need to read several rows).}}

Denormalization biases the schema towards one application. The denormalized schema better suits the chosen application and makes life harder for the others. This might be justified; e.g. a DataWarehouse needs a denormalized schema.

Why do you think that a DataWarehouse needs a denormalized schema? A denormalized schema is automatically a redundant schema, and a redundant schema serves no logical purpose compared to the equivalent non-redundant schema.

What it serves is a physical purpose: to be able to answer certain queries faster, for example by grouping related information closer together on the disk, or by precomputing the results of frequently asked queries (or partial results needed for the same purpose) and storing them on the disk. However, this all comes at the expense of a greater risk of breaking logical correctness, because physical concerns surface at the logical level. The traditional approach of denormalizing data warehouses was encouraged by the poor implementation of the relational model in commercial SQL databases. The relational model mandates a sharp separation between the logical level and the physical level, which is rarely the case in current DBMSes, although they are slowly making progress. For example, there will almost always be a 1-1 relationship between logical structures (tables) and physical structures (table extents, the type of row stored in physical pages).

The right solution to data-warehouse kinds of problems is to have controlled redundancy and more flexibility in storage at the physical level. For example, pieces of information from different tables that are accessed together can be stored in the same physical disk page; the same piece of information can even be stored several times in various pages, together with other related information. However, this can be done automatically by the DBMS, while the user still sees only one logical structure (the initial table), so the user can only insert/update/delete the fact once. This is not possible in a denormalized schema. The piece of the database kernel that controls the redundancy at the physical level can be thoroughly tested and doesn't run the risk of corrupting data, unlike all the ad-hoc user programs that are needed to maintain a redundant schema.

Recently, various SQL databases have been making progress towards this kind of flexibility in the physical layout through the use of clustered tables (grouping information from several tables in the same physical location to avoid the join penalty), table partitioning (breaking a table apart into several physical locations, either by data range or by groups of columns), and materialized views (storing and maintaining precomputed results for important queries).

-CostinCozianu

Agreed. DataWarehouse approaches like StarSchemas are doomed to be obsoleted (WithinFiveYears?? Perhaps even today, for many DW applications) by improvements to logical-physical separation in main-stream RDBMS products.


See DatabaseIsRepresenterOfFacts

