Star Schema

Stars: A Pattern Language for Query Optimized Schema by StevePeterson.

Available in PatternLanguagesOfProgramDesign I and online. http://c2.com/ppr/stars.html

A star schema is optimized for the kind of data analysis typically required in a Decision Support System. The star schema pattern language presented here attempts to provide a method by which business analysts can develop a schema that is easy to query.


I'm going to go out on a limb and suggest that a well-normalized database (with respect to the concepts of the domain) is easy to query, and vice versa. No, I don't have the formal education of some of the gurus around here, but I've read enough to suggest that the two go hand in hand. On top of this, I'm going to suggest that many people are inflexible about what they consider the true domain objects in a system. Therefore, arguments about denormalization frequently boil down to:

1. A question of the classification of domain objects.
2. A question of the correct use of keys.

or at least, they should.


Umm - no. A well-normalized database is good for having an accurate representation of the details of your data. In reporting, you frequently need to group data across relatively coarse ranges of multiple dimensions, including time, and to output the same kinds of aggregations over and over. OLTP systems (used for data entry and editing) are usually normalized, and OLAP systems (used for reporting and decision-support lookups) are usually groups of star-schema data marts built from queries of the normalized data.

In a star schema, you are usually looking at data that has been aggregated from the original, normalized data into a number of cubelets by chopping dimension space into the smallest useful reporting units and ignoring all dimensions (attributes) that will not be used in report queries. The data is usually further denormalized by representing the same data in multiple star schemas at different levels of granularity (e.g. by day, by month, by quarter, and by year), for speed when doing coarser-grained reports and the top levels of drill-down. It is OK to have data duplicated in such a schema because it is read-only and is populated from a snapshot of the normalized data at a point in time.
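
A minimal sketch of the idea in SQL, using hypothetical names throughout (sales_fact_daily, sales_fact_monthly, date_dim, and all their columns are assumptions for illustration, not part of any particular warehouse):

  -- Daily-grain fact table: one row per (date, product, store) cubelet,
  -- carrying only the dimensions and measures the reports will use.
  CREATE TABLE sales_fact_daily (
    date_key     INTEGER NOT NULL,  -- surrogate key into date_dim
    product_key  INTEGER NOT NULL,
    store_key    INTEGER NOT NULL,
    units_sold   INTEGER NOT NULL,
    sales_amount DECIMAL(12,2) NOT NULL
  );

  -- A small date dimension is assumed, mapping each day to its month.
  CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,
    month_key INTEGER NOT NULL  -- e.g. 200701 for January 2007
  );

  -- Coarser-grained copy of the same data, precomputed for monthly
  -- reports and the top levels of drill-down.
  CREATE TABLE sales_fact_monthly (
    month_key    INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    store_key    INTEGER NOT NULL,
    units_sold   INTEGER NOT NULL,
    sales_amount DECIMAL(12,2) NOT NULL
  );

  -- The monthly table is loaded from the daily one (or straight from
  -- the normalized source) during the load run, not at query time.
  INSERT INTO sales_fact_monthly
  SELECT d.month_key, f.product_key, f.store_key,
         SUM(f.units_sold), SUM(f.sales_amount)
  FROM sales_fact_daily f
  JOIN date_dim d ON d.date_key = f.date_key
  GROUP BY d.month_key, f.product_key, f.store_key;

The duplication between the two fact tables is exactly the read-only, snapshot-based denormalization described above.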

Note that this also separates the concerns of aggregating and report querying, which is good. When querying the data, you just need to know what kinds of aggregate data you have, not how the aggregates are made; when aggregating the data, you don't need to consider what combinations of dimensions are being queried, sorted, etc., just what aggregates are going into the OLAP system to support the queries one might want to run subsequently.
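
Continuing the hypothetical tables above, a report query only has to know that a monthly-grain fact table exists; how it was aggregated is the load process's concern (product_dim is another assumed dimension table, keyed by product_key):

  -- Total sales by product category for one month, straight off the
  -- precomputed aggregate; no knowledge of the source schema needed.
  SELECT p.category, SUM(f.sales_amount) AS total_sales
  FROM sales_fact_monthly f
  JOIN product_dim p ON p.product_key = f.product_key
  WHERE f.month_key = 200701
  GROUP BY p.category;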

-- SteveJorgensen

Why is this [star schema] better than simply defining some views on the normalised schema for the common queries? That way, you maintain the power and flexibility of the relational model and the normalised schema, and you also get the "ease of use" and consistent application of the aggregating rules. (I'm assuming the "normalised schema" includes the history needed, so it may still be different from an existing OLTP schema.) Note that I'm talking about the logical view here, not physical aspects such as performance -- those can be taken care of with materialised views, clustering, and other tuning of the physical DB.
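
For contrast, a sketch of the view-based alternative the question describes, again with hypothetical names (orders and order_lines standing in for the normalised schema; whether the DBMS can materialise the view is a product-specific detail):

  -- An ordinary view over the normalised schema encapsulates the
  -- aggregating rules; a materialised view would add the performance.
  CREATE VIEW monthly_sales AS
  SELECT EXTRACT(YEAR FROM o.order_date)  AS sales_year,
         EXTRACT(MONTH FROM o.order_date) AS sales_month,
         l.product_id,
         SUM(l.quantity)           AS units_sold,
         SUM(l.quantity * l.price) AS sales_amount
  FROM orders o
  JOIN order_lines l ON l.order_id = o.order_id
  GROUP BY EXTRACT(YEAR FROM o.order_date),
           EXTRACT(MONTH FROM o.order_date),
           l.product_id;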


A star schema has a fact table (ThirdNormalForm) surrounded by dimension tables (SecondNormalForm). A star schema uses surrogate keys and several techniques for representing data and keys that change over time. (Many source-system keys do not change as the data changes; a surrogate key is therefore needed to distinguish the current source data from the old source data.)
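
A sketch of the surrogate-key technique (this is the "type 2" slowly-changing-dimension approach from RalphKimball's books; customer_dim and its columns are illustrative assumptions):

  -- Dimension rows carry a warehouse-owned surrogate key plus the
  -- source system's natural key. When source data changes, a new row
  -- with a new surrogate key is inserted; existing fact rows keep
  -- pointing at the old row, so history is preserved.
  CREATE TABLE customer_dim (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key
    customer_id    VARCHAR(20) NOT NULL,  -- natural key from the source
    customer_name  VARCHAR(100),
    region         VARCHAR(50),
    effective_date DATE NOT NULL,
    expiry_date    DATE,                  -- NULL while current
    is_current     CHAR(1) NOT NULL       -- 'Y' on the current version
  );

Facts recorded while a customer was in an old region continue to join to the old dimension row, so historical reports are not silently rewritten when the source changes.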

A formal education is not required, but RalphKimball and associates have written up a lot of experience that's easy to read.


Kimball's "Data Warehouse Toolkit" is worth a read. The accompanying CD contains his query tool, StarTracker?, which embodies some interesting concepts. Alas, StarTracker? is a 16-bit Visual Basic application... For a time I worked with a firm that did a 32-bit version; I wrote a companion metadata descriptor tool called StarManager to replace Kimball's hypertrophied .ini file. The ultimate plan was to support temporality in a way most data repositories don't; the operational data store is frequently amnesiac -- Kimball used to call them "twinkling databases." Sad to say, the company "went 404" some few years back, and we never had the chance to address the main challenge.

