Thin Versus Wide Table Definition

This is an attempt to make a working definition of "thin" versus "wide" tables, as discussed in FearOfAddingTables. This topic should not attempt to attach a value judgment to each practice, since there are already topics on that.

Candidates:

candidate A: A "wide" table is a table that has as many columns as possible without duplicating "actual" data (including keys) unnecessarily. "Actual" is stated to distinguish from the practice of counting Null's as "data". (If you count Nulls as data and try to factor repetition of them out, then it generally leads to "thin" tables.) -t
- Note that a denormalized table is not the same as a "wide" table.
candidate B: A table is "thin" or "narrow" if it is in 5NF (http://en.wikipedia.org/wiki/Fifth_normal_form) or higher. Tables not in 5NF or higher are not "thin" tables. Most schema have some tables in 5NF (esp. the many<->many relationships), so a discussion of "thin" tables is really discussing the disciplined normalization of tables to at least 5NF.
candidate C: A table is "thin" or "narrow" if it is in 6NF, having no non-trivial 'join dependencies' whatsoever. (http://en.wikipedia.org/wiki/Sixth_normal_form ; http://en.wikipedia.org/wiki/Join_dependency). That is, whenever two different columns are each implied by a candidate key, they are placed in separate tables. Advantages: very strict definition, fully blocks EAV tables (requires splitting EAV tables into multiple tables). This is also the point at which the normalization readily supports meta-data, such as temporal extensions (http://en.wikipedia.org/wiki/Temporal_database), source tracking, and security extensions. (Support for these extensions is why 6NF was introduced in the first place.)

A good definition should be unambiguously, and ideally mechanistically, identifiable. Does the above qualify?

Agreed, but there will be disagreements over terms like "unnecessary" and "key". In my experience, our design assumptions are often based on what will change in the future. We may add some duplication to a design to prepare for certain future changes, for example. These are difficult to quantify in an objective way because they are based on past experience, which is probably not documented in detail, or may not be a big enough sample size to be "statistically kosher". Also, what is a "key" and what is data may be debatable. "CA" in a US-State table could be seen as data (an abbreviation) or as a key, or both. Trying to define everything perfectly will just lead to endless fractal LaynesLaw recursion. -t

Actually, "key" has a very clear and formal definition. It's the rest that I found unclear.

You mean "unnecessary"?

{That'd be a start. Also problematic: your use of the words 'duplicate' and 'data'.}

I meant that the definitions of the normal forms have vague words in them; words that require domain interpretation.

[I find that to be a surprising assertion. Could you provide an example?]

To avoid risk of source bias accusations, please paste in your favorite definition for 5th or 6th form, and I'll take it from there.

[The ones on WikiPedia are adequate, and more importantly, easily accessible to everyone and can be referenced without pasting.]

Problems with candidates:

candidate A:

Concept of "duplicate" is a strange one at the definition level. Duplication is typically understood as a physical property, not a logical one. Whether keys are duplicated under-the-hood or not is an implementation detail involved in a battle between space savings vs. cpu savings vs. synchronization costs vs. duplication for redundancy vs. simplicity of implementation. Even if TopMind does mean something like "duplication at the logical level", even that is a very strange concept since, logically, every view introduces that sort of duplication. TopMind will need to carefully define what is meant by "duplicate".
- The "user" may not know whether they are looking at a physical model or a logical model, and implementation may be swapped without any outward visible evidence.
- Indeed. But that's a point against your candidate definition.
- The power of relativism can gum up definitions tied to "physical" thinking. Thus, mentioning "under the hood" may take us down a winding, messy path. -t
- You keep using that word, 'relativism'. I do not think it means what you think it means.
TopMind uses the word "data" to describe individual values. But, in the RelationalModel, each full tuple in a relation is one 'datum'. Domain values - including keys, numbers, and strings - are not 'data'. Any candidate definition about tables should probably adhere (where feasible) to the RelationalModel's terminology.
- I'm using "common" terms instead of academic terms. I apologize if this creates confusion.
- That's so fucking dishonest. If you're sorry, change you're behavior. If you're not sorry, don't apologize.
- I didn't mean to imply I was inherently wrong, only that it may have resulted in confusion. I don't wish to get into a debate about debating here.
"unnecessarily" may have far too much dependence on context to properly distinguish fat tables from thin tables. It also may turn into a TuringTrap if interpreted literally, with complex workarounds being used to justify that one does not, in fact, 'need' to duplicate a certain data element.

candidate B:

Perhaps too 'wide', still. 5NF allows EAV tables, for example, with multiple attributes per 'entity' key. Candidate C does not suffer this problem.
5NF itself has a weakness in that determining 5NF requires analysis of the domain (or domain model) rather than just the tables.

candidate C:

Shares the 5NF weakness about analysis of the domain rather than the tables. OTOH, the greater extreme to which this is taken in 6NF may be of some benefit: the schema itself will precisely represent the 'dependency' relationships between properties, and will correspond directly to a domain relationship model.

I vote for candidate C. 6NF is a formal description of the "thin table" mantra: never combine in a table that which can be reproduced by a view.

Making every table be just two columns, key and value, would "satisfy" it then. (Sometimes "key" will be compound).

Are you assuming that all relationships in a domain model are binary?

To be Clintonian again, I have a problem with "are" here. I'm hesitant to say that there is any "hard" coupling between any two candidate columns (or sub-columns) in the domain. The relationship in practice is dynamic, situational, relative, and may change over time. Mentally, I view the associations as probabilistic, not Boolean. It's roughly comparable to the "mechanical" view of physics versus the quantum view. Even with dates, I've seen apps that only want to use the month portion, for example. Thus, in that app's domain view, the day doesn't exist. What we seek in practice is the most UsefulLie. --top

Formulaic approach:

  Score = (wn * N) + (wk * K)

  Where 
  N = Null duplication
  K = Key duplication
  wn = weight of null duplication
  wk = weight of key duplication

In the "thin" table view, "wn" tends to be high and "wk" tend to low, while vise versa for "wide" tables. The closer to zero the result is, the more it fits that viewpoint.

Another way to look at this is to plot it out:

  N|Wide..........
  u|..............
  l|..............
  l|..............
  -|..............
  d|..............
  u|..............
  p|..........Thin
  .+--------------
  .....Key-Dup....

(Dots added to prevent TabMunging)

Thus, if we have lots of "Null duplication" but very little key duplication, then the table is considered "wide", or leans heavily toward "wideness". If we have lots of key duplication but very little "null duplication", then we are heavily in the "thin" camp.

I don't know of anybody proposing the upper right (except very sloppy designers), and I don't think the lower left is technically possible in most cases. Thus, in practice, designs (for designers who care) tend to be somewhere on a roughly strait line drawn between "Wide" and "Thin" on that chart.

  N|Wide..........
  u|..\...........
  l|....X.........
  l|......\.......
  -|.......P......
  d|.........\....
  u|...........\..
  p|..........Thin
  .+--------------
  .....Key-Dup....

"X" marks what I see in practice (ignoring bad data-duplication, which is not contentious here.) "P" marks where OOP proponents tend to be in my observation.

--top

See: NormalizationRepetitionAndFlexibility