Normalization Repetition And Flexibility

Based on a discussion in FearOfAddingTables

OnceAndOnlyOnce should be the primary design driver in my opinion. Some count lots of nulls as "duplication", I don't. They are simply "not there". You can't duplicate empty space. Duplicate keys count more than duplicate nulls because keys are values, nulls are lack of values.

Normalization is simply the algorithm for achieving OnceAndOnlyOnce. -- DaveVoorhis

Wider tables do NOT violate OAOO. By my calcs, skinny ones do because you repeat the one-to-zero-or-one keys over and over.

Do wider tables violate 3NF? What about 5NF? Foreign keys are not considered a violation of OnceAndOnlyOnce, because they are the means by which references are established. Also, if you're not creating related "skinny" (ugh, I hate that term) tables, how do you enforce consistency in column values? -- DV

{DV is correct here. Your 'calculations' would only be meaningful if foreign keys are always to be counted as violations of OnceAndOnlyOnce, which would be unreasonable. Besides: DatabaseIsRepresenterOfFacts. In that sense, it is most critical to ensure that facts are represented OnceAndOnlyOnce, which means at least 3NF. My own preferences are a bit beyond that: I believe that, in an ideal database, each row in a table should represent 'One And Only One' fact - i.e. no non-trivial dependencies whatsoever (SixthNormalForm?) + an additional idea of 'no assumptions' - all facts need to be explicitly represented; e.g. 'pizza place delivers vegetarian deluxe to your neighborhood' doesn't imply 'pizza place exists' unless it is either (a) explicitly encoded as a constraint, or (b) explicitly encoded as an entailment rule under exists ('exists(x) :- (x) delivers (y) to (z)'). Assumptions violate the 'One and Only One fact' rule, and they also act as semantic baggage that make the database far less flexible.}
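
As a minimal sketch (table and column names invented for illustration, in the same typeless SQL-ish pseudocode used in the snippets further down), the two encodings mentioned in (a) and (b) might look like:

 -- (a) the assumption made explicit as a constraint: every delivery fact must reference a stored parlor
 create table pizzaParlor (parlor primary key);
 create table delivers (parlor references pizzaParlor, variety, area);
 -- (b) the assumption made explicit as an entailment rule: existence is derived from the delivery facts
 create view parlorExists as select distinct parlor from delivers;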

But why would nulls be counted as "duplication" but not FKs? Both FK and nulls can be virtual or physical, depending on implementation. It appears to be a double standard. And "fact" often depends on the user. EverythingIsRelative. Let specific queries define the fact in the context needed. Or, put domain classification info into meta/classification tables. -- top

{Nothing DV or I said implies NULLs are counted as duplication. Nulls cause their own set of problems, and we oppose them for reasons that are uniquely NULL. If DatabaseIsRepresenterOfFacts, nulls represent no fact. They're a hack - a placeholder for a potential fact given a conjunction of many facts as seen in 3NF. In SixthNormalForm? there would be no NULLs anywhere in the database since it would be pointless to represent them except by the absence of those rows that would otherwise contain them. But to create any 'view' that joins many tables where not all facts are known, it may be necessary to specify some slots as being 'unknown' by some name or other (if one doesn't wish to compute a value for the position). This 'unknown' is exactly what a NULL represents. As a consequence, the domain of the column in the view or query is slightly different from the domain of the original fact table, because it includes this explicit NULL. I wouldn't oppose explicit and query-programmer-controlled forms of 'NULL' found in views, though I think that even in views, distinct NULLs should be described as unique variables (as described further in response to Lars). Indeed, that's exactly what 'variables' also mean: placeholders for unknown values.}

Nulls are nulls.. no one knows what they are. In your application logic, you probably duplicate the code each time trying to figure out what they are and what they do.. they aren't clear. They don't tell a story. They are simply like shrugging shoulders. Very dangerous for a computer to shrug its shoulders. Whether nulls are duplicate or not duplicate shouldn't even be discussed here. Nulls are another issue with another can of worms crawling out of them. -- worms-on-head-mind... err... MopMind?

{What a NULL represents can vary, just as what the number 17 represents can vary - based on context, it could be an age or part of a calendar date or the size of a small hash table. However, that does not mean people don't know what 'NULL' is. In a database, a NULL is a special extra value made available in each domain with the (original) intent of representing the absence of a fact. For the most part its use is in alignment with its original intent. However, the semantics of NULL as normally implemented violates the original intent, making it a hazard to utilize. This is revealed when it comes time to perform joins and queries and computations - technically, 'UnknownA' == 'UnknownB' should be 'UnknownC : Boolean', but 'UnknownA == UnknownA' should be true. So, if we MUST use NULLs, then every NULL is correctly implemented only as a unique NULL - or essentially a unique variable identifying a particular unknown (which, if it became known, could presumably be made to immediately percolate across all the computations performed in preparation of the view). But doing this correctly in the data storage itself is an unnecessary complexity because we don't, fundamentally, have any reason to represent NULLs directly in the DataBase - the only place we ever need them is in views. A KnowledgeBase? would be a different story, where knowing what you don't know and knowing how it relates to other things you know and don't know is one of the two big facets of Knowledge that you can explicitly represent (of four - know what you know, know what you don't know, don't know what you don't know, and don't know what you know).}
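
For a concrete illustration of the equality problem just described: in standard SQL three-valued logic an unknown never tests equal even to itself (runnable as-is in DBMSs that permit SELECT without a FROM clause):

 -- 'UnknownA == UnknownA' ought to be true, but NULL = NULL evaluates to UNKNOWN,
 -- and a CASE condition that is UNKNOWN falls through to the ELSE branch.
 select case when null = null then 'equal' else 'not provably equal' end;
 -- yields 'not provably equal'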

Nulls: Storing data in your application is generally what nulls cause.. usually some programmer specifies that NULL means not applicable or not entered in his application logic.. and this isn't future proof. Worse, sometimes nulls aren't described anywhere.. some database engineer just assumes everyone knows what nulls are. That is another topic.

Whether empty (null) columns are "bad" and whether they are "duplicates" can be separated into two different issues. As far as duplication, empties don't violate OnceAndOnlyOnce anymore than the extra foreign keys needed by thin tables. I'd even say less, but for the sake of argument I'll take a wash for now. (I like the word "empty" instead of "null" because the way existing RDBMS implement/treat nulls is poor in my opinion.) -- top

{Implemented properly, every logically distinct NULL would be a unique variable representing an unknown value and, consequently, certainly not be duplicate. The use of NULLs to represent not-applicable should be exchanged for a unique domain-value representation of not-applicable. NULLs, as implemented, are bad.}

Re: 'NULLs, as implemented, are bad' - Well, that's something we can both agree on. However, we'd probably fix it differently. Somewhere there is a topic(s) dedicated to such. If I encounter them, I'll link them here. -t


Foreign keys are not considered a violation of OnceAndOnlyOnce, because they are the means by which references are established. -- DV

What do you mean they are not duplicate? If you repeat them more, then you repeat them more. Seems open-and-shut to me. If design A represents the same info as design B, but B repeats a given foreign key 10 times while the A does it 5, then it is more duplicative with regard to that key.

What you view does not matter (i.e. think as in relvars and relations, not as in two different tables with duplicate data.. or even worse, think of a pointer.. it is kind of similar, but, I shouldn't say that or open that can of worms). It could even be implemented using pointers so that you avoid the duplication at the implementation level - but worrying about that isn't important. Something that is worse than duplicating it is storing this data in your application logic and managing it there.. so now people cannot query this info! Because it isn't in a table. And if it is in a wide table, people cannot query the information about your meta data as you call it. Opening up a thin table to store the data is your meta data and this is what we don't seem to grok, for whatever reason.

But the same could be said about nulls. In existing RDBMS, the DBA usually has to "deal with" all those extra FKs. As we discussed in FearOfAddingTables, maybe if the tools allowed virtualization/automation of such, it would not be an either/or choice.

{You are confusing implementation and model. A foreign key gives you a meaningful relationship between data. NULLs do not. If you are concerned about duplication on the physical layer, RDBMSs can be designed to denormalize tables and intern large key values on the physical layer. Your OnceAndOnlyOnce semantics should be applied to representation of facts (DatabaseIsRepresenterOfFacts) on the logical layer without concern as to duplication of domain-values.}

No, I am not. Meaningful information about relationships and meaningful information may be two different things, but it doesn't change the duplication argument.

{You make an assumption that I do not take for granted. Could you provide one example of meaningful information that is not about a relationship? To clarify: 'meaningful information' is information that someone could utilize to support a decision in the real world, as opposed to communications protocol overhead.}

Lack of a (known) relationship can be very useful information also. When you are boarding an international air flight, the lack of relationships to "suspicious" items is what allows you to get on the plane.

{I grant that this is meaningful information. But it is still about a relationship, and thus does not serve to justify your assumption. Could you try again?}

Graphs and values are representationally-interchangeable. "Is" versus "isn't" about relationships is thus not useful. - t

{I'm not following your argument. Under which conditions does a graph or other value constitute meaningful information?}


Re: "You are confusing implementation and model." (above)

Models are in our head, and vary from head-to-head. Implementation is an objective representation of a given model, and really the only thing we can objectively compare, for good or bad. Thus, I would like clarification on that statement. -t


PageAnchor pizza4

Often it's a matter of possible future changes. Take the pizza example at:

http://en.wikipedia.org/wiki/Fourth_normal_form

If we assume that pizza features will always be independent of delivery location, then the suggested adjustment makes sense. But often one cannot really know. Distant delivery may mean some items get soggy or dry, and thus the store doesn't want to give that option for distant locations anymore because they want to maintain their reputation for quality.

The first approach is more flexible even if it does introduce some redundancy. It can serve both purposes adequately. But the second (4th normal form) cannot handle one of these two cases without calling in a DBA and possibly a programmer. It performs one well but the other not at all. Thus, I'd judge the original as more flexible, but at the cost of violating OAOO. It's a matter of trade-offs. Sometimes violating OAOO is the right choice I'd have to say. (SoftwareDevelopmentIsGambling). Without more info on the probability of change, one cannot make a fully-informed proper call. -- top
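
For reference, a sketch of the two designs being compared, using shortened versions of the column names from the wikipedia example (typeless pseudocode as elsewhere on this page):

 -- "first" design (the original table): one row per delivery fact; can record per-area menus
 create table delivers (parlor, variety, area);
 -- "second" design (the 4NF decomposition): assumes varieties offered and areas served are independent
 create table parlorVariety (parlor, variety);
 create table parlorArea    (parlor, area);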

{Technically, the table in that example is already in 4NF (and 5NF and 6NF) if pizza parlors are allowed to or capable of deciding that they'll deliver different items to different areas. Further, it's a thin table simply by virtue of not representing any conjunctive facts (i.e. it represents "(pp) delivers (variety) to (location)", not "(pp) delivers (variety) AND (pp) delivers to (location)"). I'll write up my understanding/definition of wide vs. thin tables above.}

This doesn't contradict my point. When deciding what is the "proper" way to design a system, the answer often depends on intended and future usage patterns. It's not practical to keep refactoring the data as management or marketing changes its mind over time about which variations are allowed. Ads are simpler if each shop can deliver the same thing, but production people may grumble that such uniformity is impractical or too costly and that some shops should be allowed to exclude some items. Getting fresh avocados may be difficult for a rural shop. Such "domain battles" are common. It comes down to the kind of thing found in DecisionMathAndYagni. - t

{I'll grant that any decision depending upon foresight is 'gambling'. Whether or not it is practical to refactor over time really depends upon which tools you have available to you. A great deal of the 'refactoring problem' (CodeChangeImpactAnalysis) lies with SQL's model for DataManipulation and DataDefinition?, the lack of runtime adaptivity, the associated straggler updates problems and so on. SQL relies upon a bunch of hacks for trigger-based mutable views. Solving many such runtime flexibility issues in a secure and generic manner has been of considerable interest to me (far more so than issues of normalization).}

We have agreed that if tools made it easier to have multiple viewpoints, both "wide" table and "narrow" table views, then it's not an either/or choice and each view can be used per feature/application/user convenience. Unfortunately, most RDBMS don't support such well, especially on the "write" side of things. Given the limits of existing tools, I lean toward lower normalization. See TableViewDistinctionArchaic. -t


How do you decide when wide tables are too wide?


I suggest these definitions of 'wide' table (also GodTable) and 'narrow' (thin/skinny) table, which reflect how I've been using the terms thus far:

WideTable?: A 'wide' table refers to any table for which rows represent 'conjunctive' facts or relations. A traditional 'entity' table from EntityRelationshipDiagram and EntityRelationshipModeling is a particularly common example of this, consisting of (entity-id, attribute1, attribute2, ..., attributeN). One can note that there is no inherent or functional relationship between, say, attribute1 and attribute2 except as they both are aspects of whatever 'thing' in the world is identified or named by entity-id. As a consequence, this table can be said to contain conjunctive facts, essentially of the form isEntity(fred) AND attribute1(fred,fred's attribute1) AND attribute2(fred,fred's attribute2) AND (... so on ...).

NarrowTable?: A 'narrow' table refers to any table for which rows represent realizations of (strictly) exactly one abstract proposition; that is, records cannot represent any 'conjunctive' facts. Every 'column' in the table must be critical to the representation of a single fact. Note that there is no inherent limit on the number of columns in a 'narrow table'; all limitations are based on the semantics of these columns with regards to representation of data.

While 'narrow' table isn't defined in terms of normal forms, there is only one normal form guaranteed to be a 'thin table': 6NF - which forbids join dependencies on any proper subsets of attributes. 5NF still allows for join dependencies so long as those are implied strictly by the candidate key(s). Any table in 6NF will be a Thin Table.

But thin tables can be a lower normal-form so long as they don't semantically represent multiple facts... e.g. in the pizza-delivery table in the wikipedia 4NF example (pizza-parlor,pizza-variety,delivery-area) (http://en.wikipedia.org/wiki/Fourth_normal_form), the fact represented at each row is (pizza-parlor) will deliver (pizza-variety) to (delivery-area), not (pizza-parlor) will deliver (pizza-variety) AND (pizza-parlor) will deliver to (delivery area). The latter of these would be a semantic violation of 4NF. It is worth noting that if the wikipedia's 4NF example had even one actual case of a pizza-parlor delivering a different pizza based on delivery-area, it would prove the table was already in 4NF (and it would also happen to be in 5NF and 6NF). It is very likely that any examples TopMind or others can come up with regarding these 'semantic' future changes are already thin tables, and would (often) already be in 6NF if they fulfilled that 'special' exception that just hasn't been bothered to show up yet. (However, I'll go into a full change-analysis later that shows that failing to make the 'correct' assumption here still leaves the NarrowTablePhilosophy? quite flexible, and certainly no less flexible than wide-table philosophy.)

It should be clear that we all work with at least some thin tables - even TopMind certainly uses thin tables for most or all of his one->many and many<->many relationships, since doing so is part and parcel to EntityRelationshipModeling as it is traditionally taught. The 'thin-tables' approach simply treats this as a higher-priority design criterion and applies it even to the one->(zero|one) attributes.

The most common 'wide' tables in use by well-trained or experienced DBAs are exactly those corresponding to the 'Entity' tables, which (as mentioned in the above example) tend to contain an 'entity-id' then a sequence of One->(Zero|One) attributes. In the cases of 'Zero', these large Entity tables result in NULLs being placed in that location of the column (NULLs that don't even work correctly, no less... but that is a discussion for another page, or perhaps the top of this one...). These clearly represent conjunctive relationships. It is worth noting that 'thin' tables will never have an 'entity' table that carries more than just the identifier - all it can have is isEntity(fred). It can't say anything else about fred because then it would no longer be a 'thin' table.
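
A small sketch of the contrast just described, with hypothetical names (typeless pseudocode as in the snippets further down):

 -- wide 'entity' table: one row conjoins several facts; a missing attribute becomes a NULL
 create table employee (emp_id, name, phone, parking_space);
 -- narrow tables: one row = one fact; a missing attribute is simply an absent row (no NULLs)
 create table isEmployee (emp_id);
 create table empName    (emp_id, name);
 create table empPhone   (emp_id, phone);
 create table empParking (emp_id, parking_space);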

Now to represent two basic philosophies:

WideTablePhilosophy?: Wider is better. If it makes cars more stable, and it works for my kitchen table, it will also work for my RDBMS. Wherever possible, conjoin as many facts as possible into a single table. 'Entity' tables, especially, are fine targets for adding new columns. NULLs aren't too painful; even though they could be implemented better, I can live with them. Optimization is important - if I break tables up, I'll need to join them together later, and joins are expensive. Joins are not so expensive as to justify allowing one->many and many<->many relationships into the table, though. And I like having everything under just one name - sure, there are dozens or hundreds of columns, but I can access it all by requesting the table and 'drill down' by specifying whichever columns I want. To me, that beautiful wide table almost IS the database. Besides, I'm comfortable with this - it all seems 'simple' to me. I made it work for my last project; I can make it work for my next one. I'll pay no attention to those noises from the men in the ivory tower - they're doing some sort of collective MentalMasturbation and I really don't want to look too closely; besides, I have a headache.

Things the WideTablePhilosophy? person doesn't say: When making my arguments about simplicity, I'll just ignore those columns I shaved off to avoid sticking one->many and many<->many relationships in the same table as 'special exceptions' to my philosophy - everything important is one->(zero|one) anyway, so I can usefully 'drill down' and access all the important stuff. Hmm... you know that tendency we wide-table people have for adding "just in case" columns just in case someone needs to add some data we hadn't thought of? yeah, the columns that confuse the fsck out of everyone. Ignore it. Oh, and don't pay attention to how I go scrambling about at the beginning of every project attempting to figure out all the attributes I'll need - I don't think you could possibly do any better because I've never imagined a world any different. Also, I've occasionally noticed that when I need to make changes and I add a column to a wide table, or split a column because I later learn it is a many<->many or one->many relationship, I often need to go back and edit every query in the entire world that does an insert or update to my table. But that's okay... I control all the queries anyway.

Oh, and for the moment, I'm going to call 'bloat' on anything related to integrity, security, temporal data and other metadata, etc. 'bloat' because they've never entered my thoughts before and it's easier to ignore them than it is to conclude I should have been thinking about them all along (and my darwinian protect-my-food-source fish-genes are doing a lot of this talking for me). Besides, they must have significant trade-offs... right? Right?!

NarrowTablePhilosophy?: DatabaseIsRepresenterOfFacts, and facts, themselves, should receive FirstClass treatment in databases - in particular, they shouldn't be shoved like sardines into a tiny tin table. Ideally, every fact should receive the exact same treatment - this keeps things pure and simple. It just turns out there are a very great many advantages of doing this. Among these advantages: (1) there are never any NULLs, (2) there are never any update anomalies & integrity constraints never need application support, (3) there's suddenly plenty of space to add a lot of meta-data about each and every fact... in particular, to the right of that fact in the table, supporting: (3a) bitemporal databases (time entered, time deprecated, time true start, time true end), (3b) per-fact read & deprecate security (via capability requirement), (3c) derivation information - how the fact was acquired (e.g. derivation tree or agent that added it), (3d) truth-values (and real support for OpenWorldAssumption), (3e) confidence-values (and real support for knowledge bases and expert systems). (4) Better concurrency for most transactions due to finer grain - those that don't update the whole panoply of tables will only touch a few tables thus reducing need for waits or chance of time-space collision for optimistic transactions, (5) Thoughts less influenced by 'optimize now!', (6) YagNi compliant - and sufficiently flexible refactoring or changes even at runtime (e.g. if the pizza-parlor delivery example had been broken down into makes(pizza-parlor,pizza-variety) and delivers(pizza-parlor,delivery-area), then we later learn that not all pizza-parlors deliver all pizzas to all places, we simply create a new table deliversPizza(pizza-parlor,pizza-variety,delivery-area), populate it with the join of the original two tables, then turn those original tables into views on this new table - all as a single transaction - supposing we don't want to get creative about it and have "relations" that are partially extensive and partially derived (which is fully legit in the RelationalModel)), (7) Easy and very well-proven model to support infinite relations (logic functions, essentially - often exactly what you'd see in a LogicProgramming language) and even update tables by use of logical union or additional statements of entailment, (8) relations-as-types if you want them (infinite unary relations can easily operate as types, and vice versa) thus homogenizing even the concept of 'domain' - especially in combination with higher-order relations (relations that contain relvar names or relation-descriptors themselves). (9) More efficient, precise, and meaningful applications of 'triggers' - in particular, triggers NEVER used to maintain integrity (because they aren't needed for that task), and triggers could always be extremely precise as to which tables they need to touch and watch for updates. (10) Homogeneity of representation - 'one->many' and 'many<->many' relations are given exactly the same FirstClass treatment as 'one->(one|zero)' relationships, meaning that people don't have any chance of acting like the simple-minded Wide Table Philosophy persons that begin thinking a 'single' table carries everything of importance and they can reach it all by 'drilling down'. Add this to relations-as-types, and everything becomes homogeneous except for values (domain values). 
If you add relation-values, then everything is 100% homogeneous and a full SymmetryOfLanguage is achieved (and we 'academics' can toss bones near a monolith and let the MentalMasturbation begin).
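
As a sketch of the change-scenario in point (6) above - shortened names for the pizza example, typeless pseudocode, and assuming a DBMS with transactional DDL - the whole refactoring could be one transaction:

 begin;
 create table deliversPizza (parlor, variety, area);
 insert into deliversPizza                        -- seed with the join of the old tables
   select m.parlor, m.variety, d.area
   from makes m join delivers d on d.parlor = m.parlor;
 drop table makes;
 drop table delivers;
 create view makes    as select distinct parlor, variety from deliversPizza;
 create view delivers as select distinct parlor, area    from deliversPizza;
 commit;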

Things the NarrowTablePhilosophy? often forgets to mention: There aren't any RDBMSs that make the thin-table approach optimal, yet - we need, in particular, query and index-optimizations to pre-join or denormalize tables at the physical layer (these optimizations are already well known and mathematically proven) and the ability to easily add tables (as easily as we currently add WikiWords or new rows to a table - including support for transactional addition of rows and manipulation of views) and extend 'views' (e.g. adding new entailment-facts to 'exists(x)'). Current RDBMSs simply won't do. Also, we'll need a good TableBrowser, and the RDBMS will need to keep a lot of information about which tables it has and their relative constraints, such that we can easily construct useful 'views' for the user (update-able views, ideally). Tools like this exist, but nothing that anyone would consider of great quality. In particular, the ability to rapidly construct, edit, and destruct views is quite important, and editable views need to make clear which columns can be edited because they have a variable associated directly with them (you can't edit a column that is the result of certain forms of calculation). Finally, we really, truly, honestly need to replace the DML & DDL; SQL is awful, period, but even more so for the narrow table philosophy - joins need to be syntactically cheap and very easy to both write and comprehend, and support for relvar names and complex types (including relation-values) would be highly appreciated. The best query languages for Narrow tables seem to be those developed initially for LogicProgramming languages, such as PrologLanguage, MercuryLanguage, and DataLog. So, basically, we're asking everyone to ultimately trash or significantly update: (a) all the RDBMSs, (b) all their knowledge of the most common DDL/DML in this world (SQL), (c) any training they may have received in EntityRelationshipModeling, etc. - of course this idea is going to be popular!

We NarrowTablePhilosophy? people view 'homogeneity', 'purity', and 'SymmetryOfLanguage' as 'simple'. We think the Wide Table Philosophy people don't really grok 'simple', certainly not as Einstein meant it, no matter how much lip-service they pay to it. We 'academics' are of the opinion that you'd find our changes great once they were in place - we believe that even TopMind would think them simple and great and wouldn't be able to imagine using wide tables if the narrow table philosophy had already been incumbent when he received his training. We know that overturning any incumbent and established base is so difficult that you generally need something many times better than the current solution to do so... or enough money to market it as many times better - we know this because we run into this wall over and over and over and over.

Fortunately, NarrowTablePhilosophy? actually is many times better, subject to the proper automated optimizations, access to an even semi-decent TableBrowser, and selection of a massively better DML/DDL. This is especially true when it comes to meta-data, which will become more and more important to businesses (as they learn about it) and pretty much every other domain - it is of extremely great import for data mining, for example. It is also especially true in a wide array of domains that RDBMSs have had a difficult time penetrating. I have no doubt that Narrow Table Philosophy will very likely be victorious. Unfortunately, it looks like we won't be receiving help from the fish-brained people.

Relative Costs of Change: (under construction)


Do you believe it acceptable to violate 3NF?

If you're not creating related "skinny" (again, I hate that term) tables, how do you enforce consistency in column values? If foreign key values represent a violation of OnceAndOnlyOnce such that you should use "wide" tables, do you believe code block delimiters (e.g., "{" and "}" in JavaScript, C/C++, Java, etc.; "begin" and "end" in Pascal or Pascalesque languages, etc.) represent a violation of OnceAndOnlyOnce, hence you should use one large function/procedure definition instead of multiple small ones? -- DV

So... You generally favour normalization to 3NF, you use ConstantTables to constrain enumerated column values, you realize that ridding all duplication is usually impossible (hence some foreign keys must be okay), and you regard tables with a blatant violation of OnceAndOnlyOnce as too wide. When, then, do you actually create wide tables? Is it possible that there's been some sort of vastly miscommunicative ViolentAgreement here all along, in which the normalized "skinny-table" schemas we create are actually the same as the normalized "wide-table" schemas you create? Is it also possible that the terms "wide" and "skinny", having no quantified degree associated with them, actually mean nothing and are nowt but a source of confusion and ambiguity? In other words, your wide table might be my skinny table? -- DV

I've already agreed that ridding all duplication is practically impossible. Perhaps we need a better definition of what "skinny tables" are. For example, is it all about ridding nulls, as some have implied, or are there domain-oriented data integrity issues involved? -- top

Alternatively, perhaps we can dispense with the terms "SkinnyTables?" and "wide table" entirely, as they have no generally-accepted definition, nor do they have any quantified or objective meaning. As such, use of these terms only leads to confusion and quarrels. -- DV

Agreed. What is the best name for the Fabian null-ridding technique?

Probably "Pascal's technique for eliminating nulls" or "Darwen's technique for eliminating nulls" or "Smout's technique for eliminating nulls" or "Duncan's technique for eliminating nulls", and so on, depending on whose technique it is. (The latter two, to my knowledge, are as-yet unpublished.) When I get some time, I'll create a page to summarize them. -- DV

Skinny tables are not just for reducing nulls.. but a normalization technique in general.. as shown in ConstantTable. Just call them thin tables since they have nothing to do with human skin.. or normalized tables.

What's a "skinny table"? What's a "thin table"? Really, we should avoid this ad-hoc terminology. Looking back, some of the debate appears to be due to differing interpretations of these undefined (and probably redundant, or at least unnecessary) terms. -- DV

It is a table with only a few columns.. for example consider a bitmask table or the one demonstrated in ConstantTable. It is narrow or thin because it only has very very few columns (usually less than 5). But, this leads to table thinking instead of relvar. What does one call a tuple that only has two positions in it? We have to at least name this as some sort of pattern. I prefer thin or narrow over skinny. It's harder to say something like that table that Darwen uses in that PDF file. Rather it is easier to say something like thin table. But, if we can find a better name, I'm all for it. It still needs to be patterned, regardless.
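
For illustration only (hypothetical names), a "thin" constant/enumeration table of degree two, along the lines of the bitmask example:

 create table permissionBits (bitValue, meaning);
 insert into permissionBits values (4, 'read');
 insert into permissionBits values (2, 'write');
 insert into permissionBits values (1, 'execute');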

Why five columns? Why not six? Or three? There is already a term for this quantity -- degree. A tuple with only two positions in it has a degree of two. I think it will be far more productive to either speak in terms of specific degrees -- especially for interesting tuples/relations, such as those with degree 0, degree 1 or degree infinity -- or attempt some consensus on what a "thin table" or a "wide table" means. I doubt that's possible, especially as tables/tuples/relations with degree > 1 and less than infinity are all equivalent (setting keys aside for now.) The only distinctions are psychological. -- DV

Sure, so something like TwoDegree? or SmallDegree?.. etc. Wide table is simply an AntiPattern... a huge table that has way too much unnormalization. And wrong thinking - people should stop thinking in tables so much. Products like MySQL have forced people to think in "spreadsheets" instead of relations.

Spreadsheets have forced people to think in "spreadsheets". I'd hardly blame MySql for that, as schemata defined for MySql can be as normalized as for any other DBMS. I do agree, however, that the term "table" should be deprecated, as it is ill-defined. Now then: A huge (define "huge") relation may be perfectly normalized, and what does "small" mean? The degree is orthogonal to normalization once degree > 1. A relation with degree < 2 is inherently in (at least) 5NF. -- DV

myIsam doesn't support (or didn't for many many years) foreign keys and cannot be normalized. It is a spreadsheet (I'm just dissing it). As for what a wide and thin table is.. well in real world examples when we normalize: typically we take some data and put it into a narrow table compared to the table which we are normalizing. We start putting some data into smaller tables (compared to the other table). That's just a point of reference, that when we normalize, generally, more narrower tables are made, and hence not as much width in those tables versus the major table we were normalizing. These narrow tables may grow with time, and will need to be once again normalized - and most likely after normalizing there will be more narrower tables than before when unnormalized. Maybe it can be explained in degrees instead of narrowness and thinness.

Normalization is unrelated to the presence or absence of foreign key constraints. As long as JOINs are possible (and it would be a pretty useless SQL or relational DBMS that didn't permit JOINs), normalization is viable. As for your use of "thinness" or "narrower" etc., these may be handy informal ways of specifying relative degree between relations, e.g., one can say that relation A is narrower than relation B as a way of saying that degree(A) < degree(B). It is not an absolute measure. -- DV

Coding all the important database constraints into your application logic? You consider it acceptable normalization? I mean, even an excel spreadsheet can be normalized.. if you use enough visual basic macros!

Database constraints are orthogonal to normalization. Normalization is not a measure of the strength of a DBMS, it is a set of algorithms for reducing redundancy and update anomalies in a logical database design. It is not physical; the result of normalization is what becomes a physical implementation. A database schema can exist only on paper and still be fully normalized. Paper, I will note, also lacks foreign key constraints or database constraints of any kind. -- DV

That orthogonal word is overused, I'm afraid, and you are using it for mental masturbation. The orthogonal word is quite often used when we are discussing academic nonsense like how relational model doesn't apply to constraints and blah blah. And you are close to making the "quote of the month" on Fabian's site. Keys are logical. And paper is not what we are discussing.. we are discussing implementing a database into a product - and if the product doesn't support ways to enforce relations and referential integrity.. I'm afraid that paper won't help us. If we were discussing paper, we wouldn't be discussing MySql. Foreign keys and constraints can exist on paper. I'll provide you with some interesting quotes. "Integrity is a central, critical database, not application function for all sorts of reasons that I explain in my book. Application-enforced integrity is, essentially, data management without DBMS. That's what we did before we had DBMSs, and we invented DBMSs because that proved not to be cost-effective." -- Fabian

"I presume you are talking about enforced referential integrity? Why does it have to be enforced? You can have relations [between tables] without them being enforced. Many of the larger database applications, such as SAP R/3 and Peoplesoft don't use enforced foreign key constraints because of speed issues." -- Fabian's Quote Of The Month

There appears to be some misunderstanding here. I have in no way deprecated database constraints. They are vital. However, they have nothing to do with normalization. Failure to normalize is not due to a limitation of any database product, nor is it promoted by any database system feature. It has nothing to do with database products. If it's not done properly, it is a failure of the database designer to do a proper design job. -- DV


RE: Failure to normalize is not due to a limitation of any database product, nor is it promoted by any database system feature. -- DV

I'd argue that 'failure to normalize' IS promoted by typical database system lack-of-features (or, more accurately, normalization is discouraged). In particular, the lack of powerful and assured physical-layer optimizations that cross tables can make it difficult for algorithmic-complexity-aware programmers to further normalize even in the face of potential errors. Most systems-language programmers need guaranteed optimizations, like guaranteed TailCallOptimization, before they feel comfortable trusting the system to 'get it right'. I certainly do. A very common perceived (and generally accurate) optimization is that denormalization on the logical layer (or, equivalently, resisting higher-level normal forms) will result in the same structure on the physical layer, and that by this one avoids costs for joins that would be required if the tables were further normalized. However, if one could simply suggest the Database to effectively 'denormalize' certain tables at the physical layer or otherwise be prepared for a great many join-queries over a certain common subset of tables in such a manner as they could ensure the costs of joins on certain other tables were exactly O(1) rather than O(N) (which would be typical of a join of two tables hash-indexed on the same key) or O(N*log(N)) (which would be typical of a join of two tables tree-indexed on the same key), they would then feel comfortable using the higher level normal forms. For someone who wants the ability to dynamically add new tables of data, this would be even better if the database could simply figure out such optimizations based on the queries it sees. This could be done by any number of mechanisms at the physical layer, of course - e.g. having the same index carry pointers into several tables, and having the query-optimizer take full advantage of this; 'physical layer denormalization' refers only to ensuring that once a shared primary key is found, that joins using that key would cost no more (at least algorithmically) than would having constructed the same logical denormalization in the first place. I strongly suggest you design cross-table optimizations into RelProject; the degree to which automated optimizations shape and influence higher level code is not something many designers think about, but it is significant, so optimizations ought to make doing the 'correct' thing the 'best' choice.
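
Typical current products do not offer the guaranteed, declarative "denormalize at the physical layer" facility suggested above; the closest common stand-in is a materialized (pre-joined) structure maintained alongside the normalized logical schema. A hedged sketch in PostgreSQL syntax, with hypothetical table names:

 -- the logical schema stays normalized; this only adds a pre-joined physical structure
 create materialized view customer_wide as
   select c.id, n.last_name, a.street, p.phone
   from isCustomer c
   left join custName    n on n.id = c.id
   left join custAddress a on a.id = c.id
   left join custPhone   p on p.id = c.id;
 create unique index on customer_wide (id);      -- needed for concurrent refresh
 -- refresh materialized view concurrently customer_wide;   -- must be rerun to stay current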

Duly noted. In stating that "failure to normalize is not due to a limitation of any database product", etc., I meant that failure to normalize is not due to lack or presence of (say) foreign key constraints, or any other typical DBMS construct. Any modern DBMS will let you create precisely the tables/relvars you want, whether based on a fully-normalized design or not. However, performance is indeed a concern, and perhaps sometimes overly a concern. For example, I once worked on an interactive database-driven desktop application with an expert data modeler. She studiously normalized the logical design to at least BCNF, then denormalized the implementation back to effectively 1NF on the mistaken assumption that desktop DBMSes (the Jet database engine, in this case) couldn't possibly be fast enough to support JOINs in an interactive context. She wasn't convinced otherwise until I implemented the design in BCNF and demonstrated - on real data - that the response times were more than adequate. Given her mainframe background and lack of PC experience (at the time) I suppose this isn't surprising. But I digress. The RelProject currently implements no optimization, and guarantees the worst possible performance in every case. JOINs between two relations, for example, are O(N*M). This is partly due to premature optimization being the root of all evil, and partly due to my intent to use Rel as a research platform for exploring various ideas I have for database optimization. For the educational settings in which Rel is currently used, that's turned out to be perfectly acceptable. For practical applications, I will have to implement some reasonable amount of optimization. -- DV

Keep in mind that "Premature Optimization is the Root of all Evil" applies only to hand-implemented optimizations (like your expert data modeler did with her design). Nothing about the rule applies to automated optimizations, and you don't need to optimize your optimizer-code to get the full benefits. However, I agree that it can wait until Rel is being shifted towards production settings.

Re "premature optimization": Well, yes, I was being a bit facetious, but I'm sure you know what I mean. Walk first, run later, etc., and avoid the paralysing infinite-regress problem of not deploying <x> until the <y> optimization is working, which is dependent on <p> which should have <z> optimization, which won't work until <q> is implemented, and so on. Or, just brute-force <x>, get the damn thing out the door, and have a student write <y> later as a thesis project. -- DV


PageAnchor "Dorthy": (context: deletion of unnecessary one->(one or zero) attribute relation)

Please clarify this. A given query needing to reference column X is generally independent of the wide/thin model. Thus, the need to rid it from a query is the same regardless. The usage of X is due to a domain task need, not the underlying table model. For example, if a tax verification task needs social-security number, that is because there is a business need for it, regardless of which table or data model it's stored in. If we later don't need it, then we clear out the reference either way. If we are using asterisks, then it's usually a situation where we wouldn't change it anyhow. (The list of comparisons is a good start. Kudos.) -- top

While the usage of X is a domain thing independent of the underlying table model, the layout or location of X within the table model will depend on which philosophy is being followed. Because X is a (zero|one) attribute for some entity, it is the perfect target to be sucked into a 'wide' table under the WideTablePhilosophy?, while it would go into an entirely new table in the NarrowTablePhilosophy?. (See also the approach to having added X to begin with.) If you think about it, after you delete a column from a 'wide' table, you certainly need to run around and touch all the queries that update or insert to that table, and you would be wise to touch even those that query it just to be sure you missed nothing. Since this is a 'wide' entity table, that means every query associated with that entity. Compare this to the NarrowTablePhilosophy?, where you have a unique table essentially attrX(entity-id,x-value). You only need to touch the queries that insert, update, or query table 'attrX', and since 'X' has been determined to be unnecessary, there aren't likely to be many queries, updates, or inserts that target 'attrX' explicitly, which means you can get away with deleting 'attrX' without too many worries even if you don't do a search for queries related to X - this is especially true if you added attrX recently and have no reason to believe it in use. (Note that timing doesn't help the WideTablePhilosophy? people since they had to touch all the inserts and updates for the entity-table even when adding X in the first place.) Anyhow, this gives NarrowTablePhilosophy? an advantage when removing X in most cases (fewer queries to touch), and an even stronger advantage in special cases (recently added). Does this help clear things up? (If so, I'll move this clarification text above.) Do you still have reason to dispute it?

There seems to be a misunderstanding somewhere here. Let's start from square one. When I am designing a task for an app, I ask myself, "what pieces of info do I need from the DB to carry this task out? Hmmm, I need X, Y, and Z to do the necessary computations". Whether X, Y, and Z are in one table or 3 does not affect the task's need for these 3 items. The decision to use XYZ is NOT based on schema design. If the task needs these 3 items then it needs these 3 items. If they were not in the DB, then the task would have to get them from a user prompt or a file maybe. They are "necessary inputs". Removal is the same. It's true we may end up touching a different query or sub-query in a given task, but we still have to change our queries within the task. -- top

You seem to be restricting your thinking to read-only queries. Anyhow, I edited the above to be a bit clearer.

No, I am not. A distinction needs to be made between having to change the routine/task and having to change an existing query. From a finger-wise editing standpoint, the effort to remove a column from INSERT query A (wide) versus having to delete INSERT query B (thin) is about the same. It's a wash. (The cost of having to have multiple insert queries will ding other metrics.) If you are not counting finger and eye movement as your metric, then what are you counting?

Only if you simultaneously assume (a) full control over the queries (i.e. none of them run on a client's machine to which you don't have access), (b) ready access to edit them, (c) file location and access time ~= 0, (d) edit-time >> 0, should the 'finger-wise' editing standpoint even come into play. And it should be clear that 'd' is firmly violated in this case. I must laugh a bit; I find it almost ludicrous that you even mention it. Post-deployment costs are high on a per-change basis. The wide-table philosophy is hurt relative to the narrow-table-philosophy because the wide-table-philosophy can require making changes to programs completely unconcerned with 'X' (generic inserts and selects PLUS those that explicitly touch X), whereas the narrow-table-philosophy guarantees this isn't a problem (you only ever touch the table with X if you actually want X).

You seem to be talking in generalities that have a lot of unstated assumptions behind them. And, I am NOT "dictating" that finger movements etc. be the prime metric; I am just trying to figure out what metrics you are actually using to compare by stating my assumptions and then letting you correct them. If you have a more efficient way to check my assumptions against yours, I'm all ears. -- top

Nothing among my unstated assumptions was outside what I'd expect any computer scientist with field experience programming APIs & protocols of any sort (including database schema) to understand quite well. Bringing up 'finger movements' as even a potential prime metric still strikes me as ridiculous... e.g. would you ever even think of measuring how much effort it takes to change the HTTP protocol in finger and eye movements? No? Well, then why would you think it for deployed schema changes?

Again again again, the "finger-movement" suggestion was merely to prod you to clarify (or begin to state) the metrics you are using above. If the metrics are not clear to the reader of your chart, then the chart is useless. I'm just the messenger. You seem more interested in bashing me than making your text clearer. You want to invent problems. CodeChangeImpactAnalysis offers some suggestions. -- top

I'm willing to specify assumptions for any particular thing on request, but I do have a right to expect you are of the experience level at which you typically portray yourself.

State Your Metrics

My metrics for which particular thing? You could look in the defs/cost assumptions section above. CodeChangeImpactAnalysis applied generically has been part of the above metric, but (as noted above) often is only a -moderate- consideration relative to the costs post-deployment of even getting the rights and capabilities to change or impact the code.


Conversion Tax?

There is an issue that is being ignored above: wide tables fit the user usage patterns better. The user view of the data tends to fit the wide-table philosophy, and thus there are fewer translation/conversion steps. It is somewhat analogous to the OO/relational impedance mismatch where the "shape" of objects being different from relational tables creates the need for a messy translation layer. For example, most CrudScreen edit forms save to just one wide table and we thus need only one insert statement. With thin tables, such a form would have to write to multiple tables. (I assume that you are not suggesting a thin-table-friendly app redesign.) Some RDBMS have updatable views, but these complicate the design, and may be slower. Plus, if you have a view that maps thin tables to the user view (wide), then you have two places to update if you add or remove a column from the "entity" view: one in the actual table, and one in the view. Under the wide table, you only have to change the single table itself.

As a general rule: The closer your internal idioms match domain/user-side idioms, the simpler the design.

-- top
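
For concreteness, a sketch (hypothetical names) of the kind of "wide" edit-form view a thin-table schema would expose. In most current products a multi-table view like this is not automatically updatable - writes typically need INSTEAD OF triggers or similar - which is exactly the tooling gap discussed below:

 create view customerForm as
   select c.id, n.lastName, a.street, a.cityRef, a.zip
   from isCustomer c
   left join custName    n on n.id = c.id
   left join custAddress a on a.id = c.id;
 -- the CRUD screen reads (and, given suitable triggers, writes) customerForm,
 -- while the stored tables stay narrow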

Consider a situation where your users are viewing a website that offers services to several states in the USA. Customers want to find all the cities available in Michigan that your website serves. A wide table doesn't help. You should have your cities in normalized form. It all depends what your users are doing. Consider your customers want to find all the cities that your website has available that start with the letter 'H' in Michigan, because they can't remember the exact name of the city but remember that it might start with H. A normalized table that has cities related to state helps here. Some huge wide table CRUD screen is poor view for many situations - and one should not design his databases so that they map to some huge ugly wide CRUD screen that customers have to put up with. At least, most clearly designed web GUIs use things like checkboxes, edit boxes, text areas, etc that are not thrown in some horrible CRUD screen. Rather they should be placed on the page downward in a nice way.. such as the way people fill out forms on the web where they submit data.

I get the feeling that the type of applications you design are huge wide CRUD screens where people directly enter boolean values and NULLs, instead of using eye candy like check boxes, radio selections, drop downs, etc that are spread on the screen in what is called a view or GUI. Do you really let people enter the data into the database directly, without any eye candy and human interface? What happens when these humans have to view all the cities in a certain state only? Your wide table ain't gonna match up to what the human needs.

If your application is one huge ugly CRUD screen where people enter booleans and prefer scrolling sideways instead of downward (very inhuman) then I can see where your viewpoint is coming from - actually, I can't see where your viewpoint is coming from - as it sounds like the type of apps you develop are excel spreadsheet views of everything, and it makes little sense to me.. since most GUIs are not like one huge ugly excel spreadsheet spanning 3 pages wide. Most GUIs are for humans that like to search things, update pieces of things, find cities available, find facts relating to certain data, etc. Consider even Midnight Commander when you open up the dialog box to view available chMod commands - these bitmasks/mods should be in a thin table or an enumeration, not shoved in some wide table that we cannot sanely access, since the bitmasks are hidden away and cannot be queried in their own bitmask (thin) table!

I think your CRUD application experience has distorted your entire view on what the relational model and normalization is all about. I do however remember going around to houses as a child once selling chocolate bars, and we had this big spreadsheet-like form where the person buying the chocolate bar would fill in their Name, Address, Phone, Chocolate Bar Count, City, State. But if the user ever wants to query what states we sold chocolate bars in.. they are better in their own (thin) table for the view. So the argument that thin tables are worse for the view is nonsense when you are querying the data and searching it - which frankly lots of applications do. And, I can't say I've seen many poorly designed apps that allow one to view the entire wide table as one huge wide screen where I have to scroll sideways 5 pages.

    Table: Products
    -----
    productID
    prodDescript
    price
    ...

    Table: Customer
    ---------
    customerID
    lastName
    firstMiddle
    street
    cityRef
    zip
    ...

    Table: Cities
    ------------
    cityID
    cityName
    state

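Against this schema, the earlier search example is a single query (assuming state holds two-letter codes):

 select cityName from Cities where state = 'MI' and cityName like 'H%';
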
{I do expect updatable views (see the "Things the NarrowTablePhilosophy? often forgets to mention" section). We can actually achieve O(1) performance for join-views precisely equivalent to the views you see for wide-tables, in addition to extending that to many<->many tables, given intelligent design of table indexes, so the performance will be no worse than the wide-table-solution (at least not algorithmically). And you only need to update explicitly specified views; I rather expect that views of this sort would often be runtime-constructed based on queries to the RDBMS meta-data.}

{And I won't trade simple design for simple interface and operation and data-management. It's a bad idea. It would be like choosing BrainfuckLanguage because it's simple to implement. I consider such decisions 'simplistic'.}

Lisp would be a better example. Lisp fans brag about the power of simple concepts as the atomic building blocks. I admire Lisp, but would not want to use it in team-based production environment (LispIsTooPowerful).

{If when you said: "As a general rule: The closer your internal idioms match domain/user-side idioms, the simpler the design.", what you meant was: As a general rule: The ability to apply arbitrary layers of translation between the user and the program results in a simpler design, then I'd agree that Lisp is a better example. And, to a certain degree, I might even be inclined to agree with it. But somehow, after reading the rest of your statement, I'm disinclined to believe you were promoting translations. You're promoting keeping a simplistic design in order to avoid translations - e.g. to exactly represent what the user sees and interacts with inside the database. As such, I think BrainfuckLanguage is, indeed, the better example.}

{RE: "(I assume that you are not suggesting a thin-table-friendly app redesign.)" - I'm not planning on redesigning apps at the moment, no (though NoApplication is certainly on my agenda, I consider it an independent concern). However, I do expect to use a more multi-table friendly DML/DDL. As noted above, SQL has got to go for the NarrowTablePhilosophy?.}

With better/overhauled RDBMS, I may use more thin-table design also. We should probably limit this to existing "typical" RDBMS because possible future fixes are wide open. Anyhow, even if a view maps multiple small tables to a single "writable" wide table, the double-maintenance issue above still stands: you have to change both a view and a table for many changes. -- top

That depends on whether the view itself is generated by meta-query or not, really (no reason views can't be generated by the TableBrowser). And I don't like limiting my thinking to "typical" RDBMS, but I can see some merit in it. I'll agree that, limited to "existing" RDBMS and lacking any need to maintain meta-data or temporal information, I'd often choose wide-tables in 5NF.

I think we are almost reaching a "tool problem" consensus: TableViewDistinctionArchaic -- top


PageAnchor: "Paperback Gone"

Here, we are removing column "d" (paper-back flag) because we don't need it.

 //---- Snippet A
 qry1 =  select a,b,c,d from wideTable;  // removal 1
 ...
 print(qry1.d);  // removal 2

 //---- Snippet B
 qry1 = select a,b,c from foo;
 qry2 = select d from thinTable;  // removal 1
 ...
 print(qry2.d);  // removal 2

 //---- Snippet C
 qry1 = select * from wideTable;
 ...
 print(qry1.d);  // removal 1

 //---- Snippet D
 qry1 = select * from foo;
 qry2 = select * from thinTable;
 ...
 print(qry2.d);  // removal 1

 //---- Snippet E
 qry1 = select * from foo;
 qry2 = select d from thinTable;  // removal 1
 ...
 print(qry2.d);  // removal 2

(Insert statement issues follow roughly the same pattern as the non-asterisk examples.)

Honestly, top, your example needs considerable work. What's with having two queries each time? And who would ever 'select d from thinTable' (which would return, in any proper relational database, exactly one of: {}, {true}, {false}, {true,false})? If you need 'd', you'd grab it in a join. Where the wide-table solution fails is 'select *' when you don't need 'd', and inserts when you don't know or don't care about 'd'. Where are these represented above?

I skipped the join and view examples because they didn't seem materially different from the wide-table example. Anyhow, this is your opportunity to respond with allegedly more representative code examples. (Disclaimer: the above are meant to be scenario studies and not code practice recommendations. But do note that the DBA may not be able to set coding standards.) -- top

Really? "Didn't seem materially different" to you, eh? Your version of science seems to be following your gut feeling & burgerkinging it rather than anything more rigorous. Frankly, by skipping cases you entirely misrepresented the issue... built yourself a little straw house. For some huffing and puffing:

Wide Table Solution: Cases

 create table foo (id,a,b,c,d,e);
 select * from foo; ... uses 'd' explicitly
 select * from foo; ... don't care about 'd'
   -- sub-case 1: code is fragile and depends on existence of 'd' column
   -- sub-case 2: code is robust and can chug along happily without 'd'
 select b,c,e from foo; ... (can't use d)
 select b,c,d,e from foo; ... (must use d; no other reason to ask for it)
 (note: ignoring, for the moment, the issue of 'd' appearing in a 'select ... where' clause;
  however, it is obvious that these would also need to be repaired in the event of d's elimination).
 insert into foo values (new-id,new-a,new-b,new-c,new-d,new-e)
   -- sub-case 3: code has explicit knowledge of 'new-d'
   -- sub-case 4: code is just injecting a default value for 'new-d' as a code placeholder
 insert into foo(id,a,b,c,e) values (new-id,new-a,new-b,new-c,new-e)
 insert into foo(id,a,c,d,e) values (new-id,new-a,new-c,new-d,new-e)
Narrow Table Solution: Cases (noting that SQL is a poorly designed DML for narrow tables)
 create table isFoo(id);
 create table fooA(id,a);
 create table fooB(id,b);
 create table fooC(id,c);
 create table fooD(id,d);
 create table fooE(id,e);
 select (anything) from isFoo,fooA,fooB,fooD where ... (massive join on isFoo.id) ...
   must use 'd' (either in query or post-query) - no other reason to ask for fooD.
 select (anything) from isFoo,fooA,fooC,fooE ... (massive join on isFoo.id) ...
   must NOT use 'd' because 'd' is only available if one asks for fooD
 atomic { insert into isFoo values (new-id); 
          insert into fooC values (new-id,new-c);
          insert into fooD values (new-id,new-d); 
          ... }
 atomic { insert into isFoo values (new-id); 
          insert into fooA values (new-id,new-a);
          insert into ... excluding fooD }
For the narrow table solution, there are strictly fewer cases; all those that include 'd' must do so explicitly by including 'fooD', and they are equivalent to the wide-table cases that explicitly request 'd'. Most importantly, there is no narrow-table equivalent to sub-cases 1 or 4, which are the problem patterns for the column-removal scenario. Where the DBA cannot (or through lack of foresight does not) rule those out by demanding that certain coding standards be adhered to (which, as you note, is the case often enough to be worth mentioning), the narrow table solution is better. Generically, the narrow-table solution is better in some cases, and no worse in all others, for the removal-of-a-column scenario, making it objectively better overall (for this scenario).
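
As a concrete sketch of sub-case 1, in the same pseudocode style as the snippets above (the 'fooArchive' table and helper names are hypothetical):

 //---- Sub-case 1 sketch: code that never names 'd' but still depends on its
 //     presence, e.g. by copying whole rows positionally.
 qry1 = select * from foo;                         // columns today: id, a, b, c, d, e
 ...
 insert into fooArchive values (qry1.currentRow);  // fooArchive was built with 6 columns
 writeFixedWidthLine(qry1.currentRow);             // exports 6 fields per line
 // Drop 'd' from foo and both lines break (or silently mis-align), even though
 // the code never mentions 'd'.  With the narrow tables, 'd' is only visible if
 // the code explicitly joins fooD, so this accidental dependency cannot arise.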

It just happens that thin tables have won as either 'no worse and in some cases better' or 'significantly better' (e.g. for temporal data and security) in EVERY scenario I have ever tried them in, under the assumptions of a better DML (based off DataLog, perhaps, with special support for left-joins), a good TableBrowser that supports editable views, metadata for automatic construction of useful editable views, and an optimizing RDBMS that supports joining thin tables on shared identifiers in O(1). The lack of good tools and a good DML are the main things dragging them down.
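
For a rough idea of what such a DML might buy, here is the 'massive join on isFoo.id' from the cases above next to a hypothetical Datalog-style rule (illustrative notation only; not an existing query language):

 -- SQL today, over the narrow tables:
 select f.id, a.a, d.d
 from isFoo f
   join fooA a on a.id = f.id
   join fooD d on d.id = f.id;
 -- Hypothetical Datalog-style equivalent (one rule, no explicit join conditions):
 --   result(Id, A, D) :- isFoo(Id), fooA(Id, A), fooD(Id, D).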

I'm still skeptical. My experience differs.

No, top. You seriously lack experience with narrow table solutions; hell, you barely even grok them and your 'examples' of narrow-table solutions have consistently missed the mark. You cannot legitimately claim your experience differs. All you can claim is that you're comfortable and satisfied with wide-table solutions.

The few I had encountered were difficult to work with, inflexible, and slow. If they were exceptions to the rule, then I lived a coincidental life. The downsides were obvious and substantial, and any benefits were too obscure or indirect to detect. Based on other topics, the types of applications you have worked on appear to differ from those I have worked on. Thus, it may just be different domain or specialty experience. Further, the fact that "narrow" tables are relatively uncommon suggests that others found them a pain also. There's also the possibility that it's a TechniqueWithManyPrerequisites - in other words, something that takes lots of skill, practice, and the right org environment to "do right", and doing it wrong is worse than not doing it at all. -t

{What were those few narrow table solutions that you encountered? What made them difficult to work with, inflexible, and slow?}

I have already described these in the various "table width debate" topics. In the name of OnceAndOnlyOnce, I do not wish to repeat them. It would perhaps be nice if I provided topic names and/or page-anchors, but I'm not going to do that today.

{I cannot find any specifics.}

Sigh. Fewer joins, and simpler insertions and updates are two. Smaller name-space of tables is another. I realize name-space-size may be a subjective preference or differ per mind, but I work better with a smaller name-space, all else being equal. We've been over these already. -t

{If smaller namespace, fewer joins, and simpler inserts/updates are the deciding factors, then why normalise at all?}

To get rid of data duplication. Fully de-normalizing also makes inserts/updates difficult because then we have to manage data duplication. And if your tables fit domain nouns closely, then there is a smooth and/or natural mental mapping between domain nouns and tables: employees, vendors, paychecks, orders, help-desk-tickets, etc. Heavy normalization clutters up the name-space with piddly attributes mixed among primary domain nouns.
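
For illustration only (hypothetical schema), the name-space contrast being described:

 -- Domain-noun tables (one name per primary noun):
 --   employees, vendors, paychecks, orders, helpDeskTickets
 -- Heavily decomposed equivalent (one name per attribute cluster):
 --   employees, employeeName, employeePhone, employeeHireDate,
 --   vendors, vendorName, vendorRegion, vendorContact, ...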

{Just getting rid of data duplication generally gets you to BCNF. Do you then consider the anomalies that can result from not achieving higher normal forms to be worth it?}
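
(For concreteness, a standard illustration of the kind of anomaly presumably meant here, using hypothetical names - a table in BCNF but not 4NF:)

 -- Skills and languages are independent multi-valued facts about an employee,
 -- yet are stored together, so every skill repeats once per language:
 create table empSkillLang (
   empId    integer,
   skill    varchar(30),
   language varchar(30),
   primary key (empId, skill, language));
 -- Adding one language means inserting one row per existing skill; missing one
 -- row leaves an inconsistent state.  The 4NF decomposition avoids this:
 create table empSkill    (empId integer, skill varchar(30),    primary key (empId, skill));
 create table empLanguage (empId integer, language varchar(30), primary key (empId, language));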

I would consider such on a case-by-case basis rather than Always-Do-X. I do believe you are exaggerating the frequency of the "anomalies" you talk about; or else your specialty areas bring you to different problem ratios. They are infrequent compared to other problems in my personal observation.

{Considering them on a case-by-case basis is reasonable. You did not appear to be advocating considering normalisation on a case-by-case basis; you appeared to advocate no higher normal forms, by strongly arguing against "thin" tables.}

I don't see that in my text. For example, near the top, I say OnceAndOnlyOnce should be the primary factor. I didn't say "only" factor. If you find a problem phrase, point it out at the spot of the problem and I'll cheerfully review it.


Are there really that many DBAs out there who are only DBAs? On several internet projects I've worked on (large and small), and on several offline projects, the programmers/developers are also the DBAs.

I've seen a wide mix. Big telecoms and power utilities I did contracts for had dedicated DBAs. Most Oracle shops do as Oracle is usually used for bigger projects (as it costs an arm and a leg) such that a dedicated DBA is almost necessary. -- top

Hmmm, the fact that it is costing an arm and a leg may mean that it is not purely necessary and that this is just a way for consultants to make money. However, I've also found similar problems with dedicated graphics designers who don't know a thing about HTML or programming or templates or databases (and some don't even know how to write emails properly; they only know graphics, etc.). I guess I come from a smaller/medium project-size background, but even on larger projects the communication between developers and the database admins (who are usually developers too) must be very tight in order for the project to work out.

[I'm afraid I'll have to back top on this one, sort of. I currently work at a company that has at least two full-time DBAs, and they are busy enough that asking them to write the applications as well would be counterproductive. My previous employer didn't have any, but several of our clients did, and only one also had time to spend writing applications. The only one that had any free time was the one using Oracle.]

Do you use the word "afraid" because you think I might let it go to my head, expanding my ego, making me more arrogant and annoying? Nah, never happens. See, I started out fully arrogant and annoying to avoid such drift ;-) -- top

I still have my doubts - for example, the companies these people work with - do they have employees that spend all their time on C2 wiki and just pretend to be programmers and DBAs? Are you part of the company that is lacking time - because I sure see a lot of people wasting at least 8 hours per day on this wiki. Okay, not wasting - spending time. That includes myself, although IveLeftToCode is a good pattern to follow when your company decides it needs a dedicated DBA - because maybe the developers just aren't working hard enough.

What I have trouble seeing is how a DBA can modify a database without notifying all of his application programmers (which may take just as much work as if he were a developer himself and could navigate through the code in his mind and on his screen). I.e. they must be a very tightly knit development team - no matter whether wide tables or thin tables are used. If a dedicated DBA modifies the database, it doesn't matter whether the database is fully normalized or not - the developers still need to be very carefully notified and extremely cautious that the application isn't broken. In cases such as where you have an un-normalized wide table with Image1, Image2, and Image3 instead of an images table, it's even worse for the application, BTW.
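
(To make the Image1/Image2/Image3 point concrete, a hypothetical contrast:)

 -- The repeated-column design being criticized:
 create table product (
   id     integer primary key,
   name   varchar(100),
   image1 varchar(255),
   image2 varchar(255),
   image3 varchar(255));
 -- versus a separate images table, which handles any number of images:
 create table productImage (
   productId integer references product(id),
   seq       integer,
   url       varchar(255),
   primary key (productId, seq));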

[One can use views. In the RDBMSs I've used, the applications can't tell a view from a table (unless they look at the meta data). As such, the DBA can create views that present the data the way the application is expecting while still allowing the DBA to control the way data is actually stored.]
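
A minimal sketch of that, with hypothetical names - the application keeps querying "customer" while the DBA has split the physical storage underneath:

 create table customerBase    (custId integer primary key, name varchar(100));
 create table customerAddress (custId integer primary key references customerBase,
                               street varchar(100), city varchar(50));
 create view customer as
   select b.custId, b.name, a.street, a.city
   from customerBase b left outer join customerAddress a on a.custId = b.custId;
 -- Applications that 'select ... from customer' are unaffected by the split;
 -- with an updatable (or INSTEAD OF trigger-backed) view they can write to it too.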

{As I understand it, one job of a dedicated DBA is to receive change requests complete with valid technical reasons, sit on them for a few days (or weeks), then tell you 'NO!' for a bunch of valid political reasons. Or... at least that's my understanding after having worked with one. DBAs won't often be modifying the logical database - not without cooperation of the programmers. It is almost certainly NOT a good idea to have a full DBA sitting on the database in the infancy of the application or service, though it is likely smart to have someone (be it a good programmer or DBA) who understands both the relational model and the implementation & optimization & logical consistency details.}

[The DBAs I've worked with have had security and performance as their primary jobs. The stuff involving schema changes is a very small portion of it, and was always developer driven, with the developer responsible for identifying points where the change would break the applications.]

A "good" DBA will ask good questions and explain their reasoning well rather than just say "no". Good designs will consider and balance BOTH the here-and-now app-specific requirement AND the longer-term view that (useful) data tends to outlive applications. It should be a lively give-and-take.


Enough talk, time for a scenario test: PublicationsExample


Discussion continued at NormalizationRepetitionAndFlexibilityDiscussion because this grew TooBigToEdit.


CategoryInfoPackaging, CategoryAbstraction, CategoryDatabase


AprilZeroEight and DecemberFourteen

