Based on a discussion in FearOfAddingTables
OnceAndOnlyOnce should be the primary design driver in my opinion. Some count lots of nulls as "duplication"; I don't. They are simply "not there". You can't duplicate empty space. Duplicate keys count more than duplicate nulls because keys are values; nulls are the lack of values.
Normalization is simply the algorithm for achieving OnceAndOnlyOnce. -- DaveVoorhis
Wider tables do NOT violate OAOO. By my calcs, skinny ones do because you repeat the one-to-zero-or-one keys over and over.
Do wider tables violate 3NF? What about 5NF? Foreign keys are not considered a violation of OnceAndOnlyOnce, because they are the means by which references are established. Also, if you're not creating related "skinny" (ugh, I hate that term) tables, how do you enforce consistency in column values? -- DV
{DV is correct here. Your 'calculations' would only be meaningful if foreign keys are always to be counted as violations of OnceAndOnlyOnce, which would be unreasonable. Besides: DatabaseIsRepresenterOfFacts. In that sense, it is most critical to ensure that facts are represented OnceAndOnlyOnce, which means at least 3NF. My own preferences are a bit beyond that: I believe that, in an ideal database, each row in a table should represent 'One And Only One' fact - i.e. no non-trivial dependencies whatsoever (SixthNormalForm?) + an additional idea of 'no assumptions' - all facts need to be explicitly represented; e.g. 'pizza place delivers vegetarian deluxe to your neighborhood' doesn't imply 'pizza place exists' unless it is either (a) explicitly encoded as a constraint, or (b) explicitly encoded as an entailment rule under exists ('exists(x) :- (x) delivers (y) to (z)'). Assumptions violate the 'One and Only One fact' rule, and they also act as semantic baggage that make the database far less flexible.}
But why would nulls be counted as "duplication" but not FKs? Both FK and nulls can be virtual or physical, depending on implementation. It appears to be a double standard. And "fact" often depends on the user. EverythingIsRelative. Let specific queries define the fact in the context needed. Or, put domain classification info into meta/classification tables. -- top
{Nothing DV or I said implies NULLs are counted as duplication. Nulls cause their own set of problems, and we oppose them for reasons that are uniquely NULL. If DatabaseIsRepresenterOfFacts, nulls represent no fact. They're a hack - a placeholder for a potential fact given a conjunction of many facts as seen in 3NF. In SixthNormalForm? there would be no NULLs anywhere in the database since it would be pointless to represent them except by the absence of those rows that would otherwise contain them. But to create any 'view' that joins many tables where not all facts are known, it may be necessary to specify some slots as being 'unknown' by some name or other (if one doesn't wish to compute a value for the position). This 'unknown' is exactly what a NULL represents. As a consequence, the domain of the column in the view or query is slightly different from the domain of the original fact table, because it includes this explicit NULL. I wouldn't oppose explicit and query-programmer-controlled forms of 'NULL' found in views, though I think that even in views, distinct NULLs should be described as unique variables (as described further in response to Lars). Indeed, that's exactly what 'variables' also mean: placeholders for unknown values.}
Nulls are nulls.. no one knows what they are. In your application logic, you probably duplicate the code each time trying to figure out what they are and what they do.. they aren't clear. They don't tell a story. They are simply like shrugging shoulders. Very dangerous for a computer to shrug its shoulders. Whether nulls are duplicate or not duplicate shouldn't even be discussed here. Nulls are another issue with another can of worms crawling out of them. -- worms-on-head-mind... err... MopMind?
{What a NULL represents can vary, just as what the number 17 represents can vary - based on context, it could be an age or part of a calendar date or the size of a small hash table. However, that does not mean people don't know what 'NULL' is. In a database, a NULL is a special extra value made available in each domain with the (original) intent of representing the absence of a fact. For the most part its use is in alignment with its original intent. However, the semantics of NULL as normally implemented violates the original intent, making it a hazard to utilize. This is revealed when it comes time to perform joins and queries and computations - technically, 'UnknownA == UnknownB' should be 'UnknownC : Boolean', but 'UnknownA == UnknownA' should be true. So, if we MUST use NULLs, then every NULL is correctly implemented only as a unique NULL - or essentially a unique variable identifying a particular unknown (which, if it became known, could presumably be made to immediately percolate across all the computations performed in preparation of the view). But doing this correctly in the data storage itself is an unnecessary complexity because we don't, fundamentally, have any reason to represent NULLs directly in the DataBase - the only place we ever need them is in views. A KnowledgeBase? would be a different story, where knowing what you don't know and knowing how it relates to other things you know and don't know is one of the two big facets of Knowledge that you can explicitly represent (of four - know what you know, know what you don't know, don't know what you don't know, and don't know what you know).}
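To make that comparison-semantics point concrete, here is a minimal SQL sketch (the 'person' table and 'middle_name' column are invented for illustration) of how NULL = NULL evaluates to UNKNOWN rather than TRUE under standard three-valued logic:

 -- Hypothetical table: 'middle_name' is unknown for both rows.
 CREATE TABLE person (
   id          INTEGER PRIMARY KEY,
   middle_name VARCHAR(40)    -- NULL here stands for "unknown"
 );
 INSERT INTO person VALUES (1, NULL);
 INSERT INTO person VALUES (2, NULL);

 -- Returns no rows: NULL = NULL evaluates to UNKNOWN, not TRUE,
 -- even when both NULLs would stand for the same unknown value.
 SELECT a.id, b.id
 FROM person a JOIN person b ON a.middle_name = b.middle_name;

Even the pairing of row 1 with itself is dropped, which is the 'UnknownA == UnknownA should be true' case described above; workarounds like IS NOT DISTINCT FROM instead treat every NULL as equal to every other NULL, which is not the same thing either.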
Nulls: What nulls generally cause is data semantics being stored in your application - usually some programmer specifies that NULL means "not applicable" or "not entered" in his application logic, and this isn't future-proof. Worse, sometimes nulls aren't described anywhere; some database engineer just assumes everyone knows what nulls are. That is another topic.
Whether empty (null) columns are "bad" and whether they are "duplicates" can be separated into two different issues. As far as duplication, empties don't violate OnceAndOnlyOnce any more than the extra foreign keys needed by thin tables. I'd even say less, but for the sake of argument I'll call it a wash for now. (I like the word "empty" instead of "null" because the way existing RDBMS implement/treat nulls is poor in my opinion.) -- top
{Implemented properly, every logically distinct NULL would be a unique variable representing an unknown value and, consequently, certainly not be duplicate. The use of NULLs to represent not-applicable should be exchanged for a unique domain-value representation of not-applicable. NULLs, as implemented, are bad.}
Re: 'NULLs, as implemented, are bad' - Well, that's something we can both agree on. However, we'd probably fix it differently. Somewhere there is a topic(s) dedicated to such. If I encounter them, I'll link them here. -t
Foreign keys are not considered a violation of OnceAndOnlyOnce, because they are the means by which references are established. -- DV
What do you mean they are not duplicate? If you repeat them more, then you repeat them more. Seems open-and-shut to me. If design A represents the same info as design B, but B repeats a given foreign key 10 times while A repeats it 5 times, then B is more duplicative with regard to that key.
How you view it does not matter (i.e. think in terms of relvars and relations, not in terms of two different tables with duplicate data - or even worse, think of a pointer; it is kind of similar, but I shouldn't say that or open that can of worms). It could even be implemented using pointers so that you avoid the duplication at the implementation level - but worrying about that isn't important. Something that is worse than duplicating it is storing this data in your application logic and managing it there, so that now people cannot query this info, because it isn't in a table. And if it is in a wide table, people cannot query the information about your meta data, as you call it. Opening up a thin table to store the data is your meta data, and this is what we don't seem to grok, for whatever reason.
- {I'm having some difficulty tracking what you are attempting to say.}
But the same could be said about nulls. In existing RDBMS, the DBA usually has to "deal with" all those extra FKs. As we discussed in FearOfAddingTables, maybe if the tools allowed virtualization/automation of such, it would not be an either/or choice.
{You are confusing implementation and model. A foreign key gives you a meaningful relationship between data. NULLs do not. If you are concerned about duplication on the physical layer, RDBMSs can be designed to denormalize tables and intern large key values on the physical layer. Your OnceAndOnlyOnce semantics should be applied to representation of facts (DatabaseIsRepresenterOfFacts) on the logical layer without concern as to duplication of domain-values.}
No, I am not. Meaningful information about relationships and meaningful information may be two different things, but it doesn't change the duplication argument.
{You make an assumption that I do not take for granted. Could you provide one example of meaningful information that is not about a relationship? To clarify: 'meaningful information' is information that someone could utilize to support a decision in the real world, as opposed to communications protocol overhead.}
Lack of a (known) relationship can be very useful information also. When you are boarding an international air flight, the lack of relationships to "suspicious" items is what allows you to get on the plane.
{I grant that this is meaningful information. But it is still about a relationship, and thus does not serve to justify your assumption. Could you try again?}
Graphs and values are representationally-interchangeable. "Is" versus "isn't" about relationships is thus not useful. - t
{I'm not following your argument. Under which conditions does a graph or other value constitute meaningful information?}
Re: "You are confusing implementation and model." (above)
Models are in our head, and vary from head-to-head. Implementation is an objective representation of a given model, and really the only thing we can objectively compare, for good or bad. Thus, I would like clarification on that statement. -t
PageAnchor pizza4
Often it's a matter of possible future changes. Take the pizza example at:
http://en.wikipedia.org/wiki/Fourth_normal_form
If we assume that pizza features will always be independent of delivery location, then the suggested adjustment makes sense. But often one cannot really know. Distant delivery may mean some items get soggy or dry, and thus the store doesn't want to give that option for distant locations anymore because they want to maintain their reputation for quality.
The first approach is more flexible even if it does introduce some redundancy. It can serve both purposes adequately. But the second (4th normal form) cannot handle one of these two cases without calling in a DBA and possibly a programmer. It performs one well but the other not at all. Thus, I'd judge the original as more flexible, but at the cost of violating OAOO. It's a matter of trade-offs. Sometimes violating OAOO is the right choice, I'd have to say. (SoftwareDevelopmentIsGambling). Without more info on the probability of change, one cannot make a fully-informed, proper call. -- top
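For reference, a rough SQL sketch of the two designs being weighed here (the names approximate the Wikipedia example; this is not its exact DDL):

 -- Design A (the table before decomposition): flexible, but redundant
 -- whenever varieties and areas really are independent.
 CREATE TABLE pizza_delivery (
   parlor        VARCHAR(50),
   variety       VARCHAR(50),
   delivery_area VARCHAR(50),
   PRIMARY KEY (parlor, variety, delivery_area)
 );

 -- Design B (the suggested 4NF decomposition): valid only while "what a
 -- parlor makes" stays independent of "where it delivers".
 CREATE TABLE parlor_variety (
   parlor  VARCHAR(50),
   variety VARCHAR(50),
   PRIMARY KEY (parlor, variety)
 );
 CREATE TABLE parlor_area (
   parlor        VARCHAR(50),
   delivery_area VARCHAR(50),
   PRIMARY KEY (parlor, delivery_area)
 );

If the "soggy toppings" rule ever arrives, the first design simply omits the offending (parlor, variety, area) rows, while the second has nowhere to record the exception without a schema change - which is the trade-off being argued.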
{Technically, the table in that example is already in 4NF (and 5NF and 6NF) if pizza parlors are allowed to or capable of deciding that they'll deliver different items to different areas. Further, it's a thin table simply by virtue of not representing any conjunctive facts (i.e. it represents "(pp) delivers (variety) to (location)", not "(pp) delivers (variety) AND (pp) delivers to (location)"). I'll write up my understanding/definition of wide vs. thin tables above.}
This doesn't contradict my point. When deciding what is the "proper" way to design a system, the answer often depends on intended and future usage patterns. It's not practical to keep refactoring the data as management or marketing changes their mind over time about which variations are allowed or not allowed. Ads are simpler if each shop can deliver the same thing, but production people may grumble that such is not practical or is costly, and that some shops should be allowed to exclude some items. Getting fresh avocados may be difficult for a rural shop. Such "domain battles" are common. It comes down to the kind of thing found in DecisionMathAndYagni. - t
- [Added comment] I notice that systems intended to be sold/rented to many customers within a general niche wire their schemas for the more flexible version of possibilities even though many customers may not use such a feature. They usually don't make custom schemas for each customer. The customers who don't use the extra layers or flexible parts are stuck with schemas more complicated than they, a particular company, need. Thus, a company making pizza delivery software for many shops of different brands would probably design it to have toppings potentially dependent on (limited by) delivery location even if it makes more work for shops that don't need that dependence. Or perhaps make a config switch to shut it off from the user perspective, but the internal table management is still more complicated. I recommend SimulationOfTheFuture to help with such decisions in general. -t
{I'll grant that any decision depending upon foresight is 'gambling'. Whether or not it is practical to refactor over time really depends upon which tools you have available to you. A great deal of the 'refactoring problem' (CodeChangeImpactAnalysis) lies with SQL's model for DataManipulation and DataDefinition?, the lack of runtime adaptivity, the associated straggler-updates problems, and so on. SQL relies upon a bunch of hacks for trigger-based mutable views. Solving many such runtime flexibility issues in a secure and generic manner has been of considerable interest to me (far more so than issues of normalization).}
We have agreed that if tools made it easier to have multiple viewpoints, both "wide" table and "narrow" table views, then it's not an either/or choice and each view can be used per feature/application/user convenience. Unfortunately, most RDBMS don't support such well, especially on the "write" side of things. Given the limits of existing tools, I lean toward lower normalization. See TableViewDistinctionArchaic. -t
How do you decide when wide tables are too wide?
- A blatant violation of OnceAndOnlyOnce of values.
- Bad decision. DatabaseIsRepresenterOfFacts, not of values. This would fail you for the pizza place above (lots of duplicate values).
- If it's difficult to change the schema, such as in a boxed app purchased by pizza companies with different topping-per-location patterns, then the lower normalization buys us more flexibility.
- This claim of yours is not justifiable. First, if it ever turns out you need the flexibility, then the Pizza example is already in 6NF, so there isn't a case of "lower normalization" buying you flexibility. Further, if you decide that tables are too wide based upon duplication of values (OnceAndOnlyOnce of values), you would certainly have attempted to split the table even when it shouldn't have been split, based on massive duplications of names and such - keep in mind that the multi-valued normalizations are already encouraged in ERM. In any case, the pizza example constitutes a 'thin table' since the fact it represents has no top-level conjunctions (see 'narrow table' definitions, below).
- So it "depends on the domain" whether the duplication is "legitimate". Thank You.
- That isn't an entirely correct interpretation. It depends on semantics of a table row and domain constraints whether there is any "duplication" in OnceAndOnlyOnce representation of facts at all - there is never an issue as to whether there is 'legitimate' duplication if there simply isn't any. And basing your decision on 'OnceAndOnlyOnce of values' is STILL a bad decision.
- I've given a fairly realistic scenario above. It's a matter of optimizing myriads of trade-offs and considering the domain, not a matter of any absolute do-or-die principle. I've made my point sufficiently.
- You're sadly mistaken if you honestly believe you have properly demonstrated that the scenario you provided in any way supports the point you are attempting to make. You can't just throw a scenario on the table, TopMind, and pretend it supports what you're saying. You've certainly failed to demonstrate any negative quantifiable or qualifiable trade-offs, and the various normalization forms have always considered the domain - anything from 2NF+ involves domain semantics and domain functional dependencies. Perhaps we should demand you prove competency that you even know what you are talking about.
- {I think Top has proven, often and well, that he doesn't know what he's talking about. TopOnTypes and Top-on-OO are sufficient evidence; this is just icing on the cake. Which is sad, really, because I often see the germs of sound ideas underneath the rhetoric. Top's ideas might have some value if they were soundly presented and defended -- especially with rigorous reference to theory, and evidence of an in-depth understanding thereof. As it stands, they come across as amateurish, arrogant, poorly-educated and ill-considered.}
- Science should guide our decisions more than theory. This is because it is difficult for humans to distinguish between MentalMasturbation and practical results. And you have not shown anything close to science. Scenario analysis is the closest thing we have available for us on C2, but you guys seem resistant. Either you are afraid of scrutiny, or think that clever ideas are magically immune from science. You think you have a GetOutOfScienceFreeCard?. You don't. -- top
- {Theory, at least that mentioned here, is pure science founded in mathematics and logic. What you appear to advocate is the application of social science and psychology to computing, which is fine, but it needs to be rigorous and controlled in order to have credibility. Unless you can back up your claims with rigorous and controlled studies, they are naught but arm-waving of the worst sort.}
- Bullsh8t. Math is not science and science is not math. There is a relationship, but not equivalence. Math documents a model, but does not prove it against reality. Science is what ultimately connects nifty ideas to reality. And Psychology is not rigorous, unfortunately. See DisciplineEnvy. -- top
- So are you using psychology as your GetOutOfScienceFreeCard?? What an effin' hypocrite you are. Science creates a model, TopMind, and the hard scientific theories (the ones with measures and formal qualifications) are always founded in mathematics and logic. If you think this is 'Bullsh8t', it's probably because you're a 'big8ted sh8t' who lacks significant comprehension of real science OR real math. But that's just my opinion, albeit one shared by enough people who interact with you to warrant an ObjectiveEvidenceAgainstTopDiscussion.
- Mathematics and logic are perhaps a necessary ingredient, but not necessarily a sufficient one, for good science.
- They are sufficient when dealing with already formalized computational models, such as programming languages and CPUs. We know exactly how these things interact, down to the hardware level and up to the high-level programming languages - i.e. we can assert that the mathematical models of these are as close to perfection as any model can achieve, and those cases where it isn't constitute manufacturing flaws or compiler bugs. Knowing this, we can, in fact, use math and logic for 80%+ of all observations. The only ones it can't touch are psychological and statistical observations over human users and politics (e.g. noting that humans lack perfect foresight, and humans make typographical mistakes and logic errors, and that every human ever observed does so - we've never, in fifty years, found a programmer that gets everything perfect on the first try, and noting that businesses prefer 'plug-in' programmers, and noting that most humans don't like learning many different languages creating StumblingBlocksForDomainSpecificLanguages). These observations are necessary to help shape policy and language-decisions, and they can't be diminished... but they also rarely need to be quantified (qualification is generally sufficient to make appropriate decisions - even if we DID quantify these things, there is simply NO place we could ever plug in the numbers). Your own 'ideal' regarding the use of empirical observation in ComputerScience fields is one I consider entirely naive. You need to use more math and logic, and limit observational science to the places you can demonstrate it will help.
- Logic tells us that the "limiting factor" is often the human mind, NOT the machine. Humans must grok and manage their models to use them effectively as tools. Computers don't care whether we use thick or thin tables, they just follow orders; it's the humans that care. You use "math and logic" with a SovietShoeFactoryPrinciple flaw such that you focus on the low-hanging and/or narrow factors. Observational science is one tool to guard against SovietShoeFactoryPrinciple. Complete science doesn't allow you to skip the empirical step just because it's hard or expensive. You earn the right to call me names only when you complete all the steps, not just those that are easy or that you like. - t
- The user may want to switch on or off location-dependent toppings without having to hire a DBA. App-level duplication detection can handle this better than having someone or something change the schema (and re-import it). Sometimes you guys act like you have no field experience with the messy real world: you shoehorn things into your ideal model and let somebody else deal with conceptual misfits that invariably pop up (or keep hiring you back for yet more mula). -- top
- {Funny... I've been thinking exactly the same about you. You remind me of a lot of middle-manager-turned-programmers I met in the late 80s during the ExBase boom. Like you, they tended to vehemently deprecate proven theory and best practices. Of course, I suppose there's little pressure to do otherwise when the average application is deployed in one location and has at most ten tables and maybe a handful of users.}
- They probably had been burned by pompous charlatans.
- {Yes, they'd probably been burned by each other.}
- If you actually comprehended the ideal models, it would be clear to you that there isn't an issue - if the pizza parlors are even capable of delivering different pizza-varieties to different locations, the example from http://en.wikipedia.org/wiki/4NF is (as a schema) already justifiably in 6NF; it's just awaiting the exceptional datum that proves the rule. It seems to me that you use arguments like this as straw pillars to support and justify your approach: creating even dirtier models, embedding conceptual misfits either intentionally or retroactively, arguing the offhand chance that they might fit better someday. YagNi. If you can justify it for domain reasons, you're likely already normalized. If you can't, then you're just being sloppy. I think that sort of sloppiness is unprofessional.
- You had your chance to point out such with specific examples. Show that you are smart, don't claim it. Everybody on the gaddam web thinks they are a genius.
- That would come across as intelligent if I hadn't pointed it out with a specific example. As is, you just come across as blind.
- [That would not have been funny if you had just spelled gaddam genious crectly http://www.urbandictionary.com/define.php?term=genious.]
- This topic is getting too long for FireFox to mark spelling in a timely manner.
- [Then what happens if your customer needs four images of the pizza in the database and you only have 3 placeholder spots.. normalizing the database so images are in another table solves the issue. Flexibility is exactly what normalization solves! Or say someone has pizza toppings that they want.. are you going to give them a wide table of all the pizza toppings - I think not. Rather the toppings will be in a normalized table.]
- I don't know which scenario you are referring to. I am talking about the scenario at http://en.wikipedia.org/wiki/4NF
- If most are boolean, some kind of "classification or feature" table may be in order instead of bunches of boolean columns.
I suggest this definition of 'wide' table (also GodTable) and 'narrow' (thin/skinny) table, which reflect how I've been using them thus far:
WideTable?: A 'wide' table refers to any table for which rows represent 'conjunctive' facts or relations. A traditional 'entity' table from EntityRelationshipDiagram and EntityRelationshipModeling is a particularly common example of this, consisting of (entity-id, attribute1, attribute2, ..., attributeN). One can note that there is no inherent or functional relationship between, say, attribute1 and attribute2 except as they both are aspects of whatever 'thing' in the world is identified or named by entity-id. As a consequence, this table can be said to contain conjunctive facts, essentially of the form isEntity(fred) AND attribute1(fred,fred's attribute1) AND attribute2(fred,fred's attribute2) AND (... so on ...).
NarrowTable?: A 'narrow' table refers to any table for which rows represent realizations of (strictly) exactly one abstract proposition; that is, records cannot represent any 'conjunctive' facts. Every 'column' in the table must be critical to the representation of a single fact. Note that there is no inherent limit on the number of columns in a 'narrow table'; all limitations are based on the semantics of these columns with regards to representation of data.
While 'narrow' table isn't defined in terms of normal forms, there is only one normal form guaranteed to be a 'thin table': 6NF - which forbids join dependencies on any proper subsets of attributes. 5NF still allows for join dependencies so long as those are implied strictly by the candidate key(s). Any table in 6NF will be a Thin Table.
But thin tables can be a lower normal-form so long as they don't semantically represent multiple facts... e.g. in the pizza-delivery table in the Wikipedia 4NF example (pizza-parlor,pizza-variety,delivery-area) (http://en.wikipedia.org/wiki/Fourth_normal_form), the fact represented at each row is (pizza-parlor) will deliver (pizza-variety) to (delivery-area), not (pizza-parlor) will deliver (pizza-variety) AND (pizza-parlor) will deliver to (delivery-area). The latter of these would be a semantic violation of 4NF. It is worth noting that if the Wikipedia 4NF example had even one actual case of a pizza-parlor delivering a different pizza based on delivery-area, it would prove the table was already in 4NF (and it would also happen to be in 5NF and 6NF). It is very likely that any examples TopMind or others can come up with regarding these 'semantic' future changes are already thin tables, and would (often) already be in 6NF if they fulfilled that 'special' exception that just hasn't been bothered to show up yet. (However, I'll go into a full change-analysis later that shows that failing to make the 'correct' assumption here still leaves the NarrowTablePhilosophy? quite flexible, and certainly no less flexible than wide-table philosophy.)
It should be clear that we all work with at least some thin tables - even TopMind certainly uses thin tables for most or all of his one->many and many<->many relationships, since doing so is part and parcel of EntityRelationshipModeling as it is traditionally taught. When using a 'thin-tables' approach, one simply treats thinness as a higher-priority design criterion.
The most common 'wide' tables in use by well-trained or experienced DBAs are exactly those corresponding to the 'Entity' tables, which (as mentioned in the above example) tend to contain an 'entity-id' then a sequence of One->(Zero|One) attributes. In the cases of 'Zero', these large Entity tables result in NULLs being placed in that location of the column (NULLs that don't even work correctly, no less... but that is a discussion for another page, or perhaps the top of this one...). These clearly represent conjunctive relationships. It is worth noting that 'thin' tables will never have an 'entity' table that carries more than just the identifier - all it can have is isEntity(fred). It can't say anything else about fred because then it would no longer be a 'thin' table.
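As a concrete illustration of the two shapes just defined, here is a hedged SQL sketch using an invented 'customer' entity (none of these names come from the discussion above):

 -- 'Wide' entity table: each row conjoins several independent facts
 -- about one customer; a missing fact shows up as a NULL.
 CREATE TABLE customer (
   customer_id INTEGER PRIMARY KEY,
   name        VARCHAR(60),
   birth_date  DATE,           -- NULL when unknown
   loyalty_no  VARCHAR(20)     -- NULL when the customer has none
 );

 -- 'Narrow' equivalent: one table per fact, no NULLs anywhere;
 -- absence of a row is the absence of the fact.
 CREATE TABLE is_customer (
   customer_id INTEGER PRIMARY KEY
 );
 CREATE TABLE customer_name (
   customer_id INTEGER PRIMARY KEY REFERENCES is_customer,
   name        VARCHAR(60) NOT NULL
 );
 CREATE TABLE customer_birthdate (
   customer_id INTEGER PRIMARY KEY REFERENCES is_customer,
   birth_date  DATE NOT NULL
 );
 CREATE TABLE customer_loyalty (
   customer_id INTEGER PRIMARY KEY REFERENCES is_customer,
   loyalty_no  VARCHAR(20) NOT NULL
 );

In the second form, "fred has no loyalty number" is expressed by the absence of a customer_loyalty row rather than by a NULL.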
Now to represent two basic philosophies:
WideTablePhilosophy?: Wider is better. If it makes cars more stable, and it works for my kitchen table, it will also work for my RDBMS. Wherever possible, conjoin as many facts as possible into a single table. 'Entity' tables, especially, are fine targets for adding new columns. NULLs aren't too painful; even though they could be implemented better, I can live with them. Optimization is important - if I break tables up, I'll need to join them together later, and joins are expensive. Joins are not so expensive as to justify allowing one->many and many<->many relationships into the table, though. And I like having everything under just one name - sure, there are dozens or hundreds of columns, but I can access it all by requesting the table and 'drill down' by specifying whichever columns I want. To me, that beautiful, wide table almost IS the database. Besides, I'm comfortable with this - it all seems 'simple' to me. I made it work for my last project; I can make it work for my next one. I'll pay no attention to those noises from the men in the ivory tower - they're doing some sort of collective MentalMasturbation and I really don't want to look too closely; besides, I have a headache.
Things the WideTablePhilosophy? person doesn't say: When making my arguments about simplicity, I'll just ignore those columns I shaved off to avoid sticking one->many and many<->many relationships in the same table as 'special exceptions' to my philosophy - everything important is one->(zero|one) anyway, so I can usefully 'drill down' and access all the important stuff. Hmm... you know that tendency we wide-table people have for adding "just in case" columns just in case someone needs to add some data we hadn't thought of? yeah, the columns that confuse the fsck out of everyone. Ignore it. Oh, and don't pay attention to how I go scrambling about at the beginning of every project attempting to figure out all the attributes I'll need - I don't think you could possibly do any better because I've never imagined a world any different. Also, I've occasionally noticed that when I need to make changes and I add a column to a wide table, or split a column because I later learn it is a many<->many or one->many relationship, I often need to go back and edit every query in the entire world that does an insert or update to my table. But that's okay... I control all the queries anyway.
Oh, and for the moment, I'm going to call 'bloat' on anything related to integrity, security, temporal data and other metadata, etc. 'bloat' because they've never entered my thoughts before and it's easier to ignore them than it is to conclude I should have been thinking about them all along (and my darwinian protect-my-food-source fish-genes are doing a lot of this talking for me). Besides, they must have significant trade-offs... right? Right?!
NarrowTablePhilosophy?: DatabaseIsRepresenterOfFacts, and facts, themselves, should receive FirstClass treatment in databases - in particular, they shouldn't be shoved like sardines into a tiny tin table. Ideally, every fact should receive the exact same treatment - this keeps things pure and simple. It just turns out there are a very great many advantages of doing this. Among these advantages: (1) there are never any NULLs, (2) there are never any update anomalies & integrity constraints never need application support, (3) there's suddenly plenty of space to add a lot of meta-data about each and every fact... in particular, to the right of that fact in the table, supporting: (3a) bitemporal databases (time entered, time deprecated, time true start, time true end), (3b) per-fact read & deprecate security (via capability requirement), (3c) derivation information - how the fact was acquired (e.g. derivation tree or agent that added it), (3d) truth-values (and real support for OpenWorldAssumption), (3e) confidence-values (and real support for knowledge bases and expert systems). (4) Better concurrency for most transactions due to finer grain - those that don't update the whole panoply of tables will only touch a few tables thus reducing need for waits or chance of time-space collision for optimistic transactions, (5) Thoughts less influenced by 'optimize now!', (6) YagNi compliant - and sufficiently flexible refactoring or changes even at runtime (e.g. if the pizza-parlor delivery example had been broken down into makes(pizza-parlor,pizza-variety) and delivers(pizza-parlor,delivery-area), then we later learn that not all pizza-parlors deliver all pizzas to all places, we simply create a new table deliversPizza(pizza-parlor,pizza-variety,delivery-area), populate it with the join of the original two tables, then turn those original tables into views on this new table - all as a single transaction - supposing we don't want to get creative about it and have "relations" that are partially extensive and partially derived (which is fully legit in the RelationalModel)), (7) Easy and very well-proven model to support infinite relations (logic functions, essentially - often exactly what you'd see in a LogicProgramming language) and even update tables by use of logical union or additional statements of entailment, (8) relations-as-types if you want them (infinite unary relations can easily operate as types, and vice versa) thus homogenizing even the concept of 'domain' - especially in combination with higher-order relations (relations that contain relvar names or relation-descriptors themselves). (9) More efficient, precise, and meaningful applications of 'triggers' - in particular, triggers NEVER used to maintain integrity (because they aren't needed for that task), and triggers could always be extremely precise as to which tables they need to touch and watch for updates. (10) Homogeneity of representation - 'one->many' and 'many<->many' relations are given exactly the same FirstClass treatment as 'one->(one|zero)' relationships, meaning that people don't have any chance of acting like the simple-minded Wide Table Philosophy persons that begin thinking a 'single' table carries everything of importance and they can reach it all by 'drilling down'. Add this to relations-as-types, and everything becomes homogeneous except for values (domain values). 
If you add relation-values, then everything is 100% homogeneous and a full SymmetryOfLanguage is achieved (and we 'academics' can toss bones near a monolith and let the MentalMasturbation begin).
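A sketch of the runtime refactoring described in point (6), assuming a DBMS with transactional DDL (PostgreSQL-style syntax; 'makes' and 'delivers' are the two original narrow tables from the pizza example):

 BEGIN;
   -- New table for the finer-grained fact.
   CREATE TABLE delivers_pizza (
     parlor        VARCHAR(50),
     variety       VARCHAR(50),
     delivery_area VARCHAR(50),
     PRIMARY KEY (parlor, variety, delivery_area)
   );

   -- Populate it from the join of the two original narrow tables.
   INSERT INTO delivers_pizza (parlor, variety, delivery_area)
     SELECT m.parlor, m.variety, d.delivery_area
     FROM makes m JOIN delivers d ON m.parlor = d.parlor;

   -- Turn the originals into views so existing queries keep working
   -- (making them updatable would still need INSTEAD OF triggers in
   -- most products, which is the "hack" complained about earlier).
   DROP TABLE makes;
   DROP TABLE delivers;
   CREATE VIEW makes    AS SELECT DISTINCT parlor, variety       FROM delivers_pizza;
   CREATE VIEW delivers AS SELECT DISTINCT parlor, delivery_area FROM delivers_pizza;
 COMMIT;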
Things the NarrowTablePhilosophy? often forgets to mention: There aren't any RDBMSs that make the thin-table approach optimal, yet - we need, in particular, query and index-optimizations to pre-join or denormalize tables at the physical layer (these optimizations are already well known and mathematically proven) and the ability to easily add tables (as easily as we currently add WikiWords or new rows to a table - including support for transactional addition of rows and manipulation of views) and extend 'views' (e.g. adding new entailment-facts to 'exists(x)'). Current RDBMSs simply won't do. Also, we'll need a good TableBrowser, and the RDBMS will need to keep a lot of information about which tables it has and their relative constraints, such that we can easily construct useful 'views' for the user (update-able views, ideally). Tools like this exist, but nothing that anyone would consider of great quality. In particular, the ability to rapidly construct, edit, and destruct views is quite important, and editable views need to make clear which columns can be edited because they have a variable associated directly with them (you can't edit a column that is the result of certain forms of calculation). Finally, we really, truly, honestly need to replace the DML & DDL; SQL is awful, period, but even more so for the narrow table philosophy - joins need to be syntactically cheap and very easy to both write and comprehend, and support for relvar names and complex types (including relation-values) would be highly appreciated. The best query languages for Narrow tables seem to be those developed initially for LogicProgramming languages, such as PrologLanguage, MercuryLanguage, and DataLog. So, basically, we're asking everyone to ultimately trash or significantly update: (a) all the RDBMSs, (b) all their knowledge of the most common DDL/DML in this world (SQL), (c) any training they may have received in EntityRelationshipModeling, etc. - of course this idea is going to be popular!
We NarrowTablePhilosophy? people view 'homogeneity', 'purity', and 'SymmetryOfLanguage' as 'simple'. We think the Wide Table Philosophy people don't really grok 'simple', certainly not as Einstein meant it, no matter how much lip-service they pay to it. We 'academics' are of the opinion that you'd find our changes great once they were in place - we believe that even TopMind would think them simple and great and wouldn't be able to imagine using wide tables if the narrow table philosophy had already been incumbent when he received his training. We know that overturning any incumbent and established base is so difficult that you generally need something many times better than the current solution to do so... or enough money to market it as many times better - we know this because we run into this wall over and over and over and over.
Fortunately, NarrowTablePhilosophy? actually is many times better, subject to the proper automated optimizations, access to an even semi-decent TableBrowser, and selection of a massively better DML/DDL. This is especially true when it comes to meta-data, which will become more and more important to businesses (as they learn about it) and pretty much every other domain - it is of extremely great import for data mining, for example. It is also especially true in a wide array of domains that RDBMSs have had a difficult time penetrating. I have no doubt that Narrow Table Philosophy will very likely be victorious. Unfortunately, it looks like we won't be receiving help from the fish-brained people.
Relative Costs of Change: (under construction)
- (Meta) Definitions and Assumptions: (since TopMind somehow held the ridiculous notion that 'fingerwise-editing' considerations were valid) [I was simply restating my interpretation of your statements. Your metric is not clear.]
- deployment: A schema (or API of any sort) is 'deployed' after there is code outside of ready control that depends upon it. This can, depending on the situation, include application code deployed to client machines, code written by groups other than the DBA, or even code with tight Q&A related to a code-freeze. Deployed code resists change in a real dig-in-your-heels sort of way, no matter how rational the change - people refuse to update clients, it takes a long time to tell people who have written code that touches your database what has changed, it is difficult to fix server-code for the new schema, etc.
- cost assumptions: The cost of performing a change to a schema is dependent on many things. Prior to going live or deployment, it can be considered near free - one is just moving boxes around on a single piece of paper describing the schema, and one has not meaningfully 'shared' this piece of paper with anyone else so there is no need to keep anyone else up to date. Once a database is live - has real data - there is an additional cost of creating the actual change within the database - e.g. performing the necessary join and inserting the necessary data; this, in the context of everything else, is a relatively minor cost. Further, one must repair local queries, views, insert/update statements, procedures, triggers, local application or server code, etc. - and ultimately test these changes - in accordance with the new database schema. This can actually be quite expensive, especially since it may require changing unit tests and running them with the new schema and so on. And once a schema is deployed, the cost for a change becomes remarkably high - the cost of the edits themselves become only moderate relative to the cost for communicating the need to edit or getting permissions to perform the edit. After deployment, the cost of change can be considered nearly prohibitive if the change 'breaks' existing code, but if one must break code then it is the number of times one breaks the code that really determines cost (whether or not it breaks 'badly').
- sharing: In the very best case, you can assume that just you, or you and the people in a small, tight-knit group, are the only persons writing queries for your schema - this applies to a common but fundamentally limited class of database scenarios. In this case, you have the ability to hand-check all queries of the database, to enforce certain coding styles, to enforce usage through a particular API, etc. In the broader sense, however, databases can be made available for querying by people outside your control - e.g. business partners. This creates an example of a 'deployed' schema, but has the additional and significant effect that you cannot assume good programming habits on the part of the database users. In fact, you can probably assume the opposite: sloppiness, laziness, fragile code, etc. In these cases, any defense or protection that can be acquired by 'good' programming habits can be defenestrated. E.g. for 'wide' tables, you can eliminate some changes required on 'removal' of a column by ensuring you always use 'select explicit,columns' and 'insert into table(explicit,columns)' instead of 'select *' and the generic 'insert'. This does NOT make the wide-table solution equivalent to the narrow-table solution; it only means that the wide-table solution can, with effort and self-discipline, be made "no worse" than the narrow-table solution for a limited application of that scenario.
- Scenario: I just learned I need a new one->(one or zero) or one->one attribute
- WideTablePhilosophy?: Find the appropriate entity-table. Add a new column and initially load it with a default value (NULL in the case where NULLs are allowed). Then update this table against a join of this table and the one with the new data (if one is available), diligently specifying all the columns from the original entity table (maybe dozens) and the new value from the table containing the new value. If there is no table of values, then just add a new column and add any known values by hand. After the new column is added, visit every piece of DML in the world that inserts or updates to this table and make sure it takes the new attribute into account. Also check regular queries if any of them are known to be sensitive to the addition of columns. DBA Cost: Table update, New query to perform join, Cost to check all queries in existence.
- NarrowTablePhilosophy?: Add a new table newAttribute(e,attr) and load it with the known set of values (if one already has a table of the values, that would be quite easy, since everything you need is there). Then declare the appropriate constraints: newAttribute(e,attr) requires isEntity(e); newAttribute(e,attr) has candidate-key(e), etc. (Or, as a flexible alternative, specify 'isEntity(e) :- newAttribute(e,_)' depending on whether you want dynamic entities or static entities.) In the one->one case, either add an 'explicit' assumption (e.g. assume(attr,value) :- newAttribute(e,attr),known(e),unknown(attr) - representation depends heavily on DDL) or attempt to add a requirement that 'isEntity(e) requires newAttribute(e,attr),known(attr)' and find the necessary data for the failure cases.
- Cost Comparison: Populating the data is (slightly) more expensive for the wide-table solution due to the greater amount of text needed to create an update query that accounts for the data already there, though use of good tools can reduce the data-population costs to near equal. The narrow table solution always requires specification of constraints, but this is also easy with good tools (e.g. one-lining the constraints, foreign-key specs, and defaults - especially for very common patterns like this one, such additions would be near trivial). The great difference in cost comes for the WideTablePhilosophy? after deployment. If the schema is widely deployed, the cost of checking any queries, including insert and delete queries, could be prohibitive. These costs in the WideTablePhilosophy? can be mitigated by addition of "just in case" fields, with a trade-off semantic cost of adding confusion and undocumented 'features' to the system (especially if multiple pieces of data are conjoined into one string in one just-in-case field). Prior to deployment, costs are not meaningfully different; post-deployment, the narrow table solution wins hands down.
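A hedged SQL sketch of both approaches for this scenario, using an invented 'employee' entity table (an 'is_employee' table in the narrow scheme), a new 'phone' attribute, and a hypothetical staging table 'new_phone_data' holding the values to load; the Datalog-style constraints above have no direct SQL spelling, so ordinary key and foreign-key constraints stand in for them:

 -- WideTablePhilosophy: widen the entity table, then back-fill the values.
 ALTER TABLE employee ADD COLUMN phone VARCHAR(20);   -- new column starts out NULL
 UPDATE employee
    SET phone = (SELECT n.phone
                 FROM new_phone_data n
                 WHERE n.employee_id = employee.employee_id);
 -- ...then audit every INSERT/UPDATE (and any SELECT *) that touches 'employee'.

 -- NarrowTablePhilosophy: add a new fact table plus its constraints.
 CREATE TABLE employee_phone (
   employee_id INTEGER PRIMARY KEY REFERENCES is_employee,  -- at most one phone per employee
   phone       VARCHAR(20) NOT NULL
 );
 INSERT INTO employee_phone (employee_id, phone)
   SELECT employee_id, phone FROM new_phone_data;
 -- Existing queries are untouched; only new code ever mentions employee_phone.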
- Scenario: I just learned I don't need this one->(one or zero) or one->one attribute
- WideTablePhilosophy?: If you delete the column in the entity-table, you'll need to touch all SELECT* and general INSERT queries to ensure they don't break along with any that explicitly SELECT or UPDATE the attribute. Alternatively, you can allow it to remain around as cruft.
- NarrowTablePhilosophy?: For this one you'd have a unique table for the attribute, and you'll only need to touch those DML statements that query, insert, or update to that table. Alternatively, deprecate the table (logical delete in a bitemporal database) so it just comes back as 'empty'.
- Cost Comparison: Post-deployment, the wide table solution has a strictly greater potential for cost, though intelligent design of queries can help minimize this cost and keep it near-equal to that of the narrow table solution. If any inserts or queries or application code needed to be touched after adding the attribute, one will certainly need to do so again upon removing the attribute. The narrow-table solution has the advantage that one never needs to touch existing queries when adding the attribute (since the table is unknown initially, no queries could possibly be affected), which means that you only ever, very strictly, need to touch the queries that were modified or created to explicitly include the attribute after the attribute-table was added. This, overall, results in much better cost prospects for the narrow table solution. However, with some care and attention, the cost for deletion can be reduced to approximately equal in the wide table solution. Pre-deployment, costs aren't meaningfully different. And letting cruft build up has the same cost either way: zero for the DBA, and some potential confusion for everyone else.
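The corresponding removal, in the same hypothetical terms as the sketch for the previous scenario:

 -- WideTablePhilosophy: drop the column, then chase down every
 -- SELECT *, INSERT, and UPDATE that assumed it existed.
 ALTER TABLE employee DROP COLUMN phone;

 -- NarrowTablePhilosophy: drop (or merely deprecate) the fact table;
 -- only statements that explicitly named employee_phone can break.
 DROP TABLE employee_phone;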
- (Dispute at PageAnchor "Dorthy") - modified above; should this be resolved?
- Notes of Wisdom: For both the WideTablePhilosophy? and NarrowTablePhilosophy?, avoid jumping at the opportunity to delete data. People are fickle. Ask them again in a month if they really don't need it. If there isn't data already there, don't worry about it too much - the NarrowTablePhilosophy? guys at least can drop it immediately (adding the table later is cheap), but the WideTablePhilosophy? people may wish to keep it around... "just in case".
- Scenario: I just learned I need a new one->many or many<->many fact
- WideTablePhilosophy?: Create a new one->many table or many<->many table, and add associated constraints.
- NarrowTablePhilosophy?: Create a new one->many table or many<->many table, and add associated constraints.
- Cost Comparison: Identical. It's worth noting that the wide table philosophy creates a narrow table for this instance.
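For completeness, the DDL both camps would write here is essentially the same; a hypothetical many<->many 'employee has skill' fact might look like:

 -- One row per (employee, skill) fact; no NULL-able "skill1, skill2, ..."
 -- columns in sight under either philosophy.
 CREATE TABLE employee_skill (
   employee_id INTEGER NOT NULL REFERENCES employee,   -- 'is_employee' in the narrow scheme
   skill_id    INTEGER NOT NULL REFERENCES skill,      -- 'is_skill' in the narrow scheme
   PRIMARY KEY (employee_id, skill_id)
 );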
- Scenario: I just learned this one->(zero or one) fact should have been a one->many fact or many<->many fact
- WideTablePhilosophy?: Create a new table for the one->many fact, populate it with the existing data, then remove the one->(zero or one) representation of that fact, following all the steps of removing the one->(zero or one) attribute, except be sure to keep the values.
- NarrowTablePhilosophy?:
- Scenario: I just learned this one->many fact or many<->many fact could have been a one->(one|zero) fact
- WideTablePhilosophy?:
- NarrowTablePhilosophy?:
- Scenario: I just learned this one->(zero|one) or one->one attribute needs a modifier
- WideTablePhilosophy?:
- NarrowTablePhilosophy?:
- Scenario: I just learned I need to have old data for auditing purposes
- WideTablePhilosophy?:
- NarrowTablePhilosophy?:
- YagNi compliance
- WideTablePhilosophy?:
- NarrowTablePhilosophy?:
Do you believe it acceptable to violate 3NF?
- I don't want to commit to an absolute always/never. I would say "usually not".
If you're not creating related "skinny" (again, I hate that term) tables, how do you enforce consistency in column values?
- What kind of consistency?
- Ensuring that a column value does not vary unnecessarily. E.g., that if a user can specify a colour, ensuring or encouraging it to be entered the same way for every row, rather than as "Blue" on row1, "blue" on row2, then "blu.", "bleu", "bl.", "BLUE", etc.
- ConstantTable and constraints. (Yes, a ConstantTable is indeed a "skinny table".)
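A minimal sketch of that arrangement in SQL (the 'colour' and 'widget' tables are invented for the example):

 -- The "skinny" constant table enumerates the legal spellings once.
 CREATE TABLE colour (
   colour_name VARCHAR(20) PRIMARY KEY
 );
 INSERT INTO colour VALUES ('blue');
 INSERT INTO colour VALUES ('red');
 INSERT INTO colour VALUES ('green');

 -- Any table recording a colour must reference it, so misspellings
 -- are rejected by the DBMS rather than cleaned up later.
 CREATE TABLE widget (
   widget_id INTEGER PRIMARY KEY,
   colour    VARCHAR(20) NOT NULL REFERENCES colour
 );

Misspellings such as "blu." or "bleu" fail the foreign-key check instead of being cleaned up later in application code (case variations additionally depend on the column's collation).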
If foreign key values represents a violation of OnceAndOnlyOnce such you should use "wide" tables, do you believe code block delimiters (e.g., "{" and "}" in JavaScript, C/C++, Java, etc.; "begin" and "end" in Pascal or Pascalesque languages, etc.) represent a violation of OnceAndOnlyOnce, hence you should use one large function/procedure definition instead of multiple small ones? -- DV
- They count as duplication, but not the only duplication. One has to weigh all the stuff, not just parenths. (How much they count in code volume metrics is an open issue. Most seem to agree they shouldn't count as much as variables.) In practice there is usually a WaterbedTheory trade-off in what kind of duplication is accepted, because ridding all duplication is usually impossible.
So... You generally favour normalization to 3NF, you use ConstantTables to constrain enumerated column values, you realize that ridding all duplication is usually impossible (hence some foreign keys must be okay), and you regard tables with a blatant violation of OnceAndOnlyOnce as too wide. When, then, do you actually create wide tables? Is it possible that there's been some sort of vastly miscommunicative ViolentAgreement here all along, in which the normalized "skinny-table" schemas we create are actually the same as the normalized "wide-table" schemas you create? Is it also possible that the terms "wide" and "skinny", having no quantified degree associated with them, actually mean nothing and are nowt but a source of confusion and ambiguity? In other words, your wide table might be my skinny table? -- DV
I've already agreed that ridding all duplication is practically impossible. Perhaps we need a better definition of what "skinny tables" are. For example, is it all about ridding nulls, as some have implied, or are there domain-oriented data integrity issues involved? -- top
Alternatively, perhaps we can dispense with the terms "SkinnyTables?" and "wide table" entirely, as they have no generally-accepted definition, nor do they have any quantified or objective meaning. As such, use of these terms only leads to confusion and quarrels. -- DV
Agreed. What is the best name for the Fabian null-ridding technique?
Probably "Pascal's technique for eliminating nulls" or "Darwen's technique for eliminating nulls" or "Smout's technique for eliminating nulls" or "Duncan's technique for eliminating nulls", and so on, depending on whose technique it is. (The latter two, to my knowledge, are as-yet unpublished.) When I get some time, I'll create a page to summarize them. -- DV
Skinny tables are not just for reducing nulls, but are a normalization technique in general, as shown in ConstantTable. Just call them thin tables - since they have nothing to do with human skin - or normalized tables.
What's a "skinny table"? What's a "thin table"? Really, we should avoid this ad-hoc terminology. Looking back, some of the debate appears to be due to differing interpretations of these undefined (and probably redundant, or at least unnecessary) terms. -- DV
It is a table with only a few columns - for example, consider a bitmask table or the one demonstrated in ConstantTable. It is narrow or thin because it only has very few columns (usually fewer than 5). But this leads to table thinking instead of relvar thinking. What does one call a tuple that only has two positions in it? We have to at least name this as some sort of pattern. I prefer thin or narrow over skinny. It's harder to say something like "that table that Darwen uses in that PDF file"; rather, it is easier to say something like "thin table". But if we can find a better name, I'm all for it. It still needs to be patterned, regardless.
Why five columns? Why not six? Or three? There is already a term for this quantity -- degree. A tuple with only two positions in it has a degree of two. I think it will be far more productive to either speak in terms of specific degrees -- especially for interesting tuples/relations, such as those with degree 0, degree 1 or degree infinity -- or attempt some consensus on what a "thin table" or a "wide table" means. I doubt that's possible, especially as tables/tuples/relations with degree > 1 and less than infinity are all equivalent (setting keys aside for now). The only distinctions are psychological. -- DV
Sure, so something like TwoDegree? or SmallDegree?.. etc. Wide table is simply an AntiPattern... a huge table that has way too much unnormalization. And wrong thinking - people should stop thinking in tables so much. Products like MySQL have forced people to think in "spreadsheets" instead of relations.
Spreadsheets have forced people to think in "spreadsheets". I'd hardly blame MySql for that, as schemata defined for MySql can be as normalized as for any other DBMS. I do agree, however, that the term "table" should be deprecated, as it is ill-defined. Now then: A huge (define "huge") relation may be perfectly normalized, and what does "small" mean? The degree is orthogonal to normalization once degree > 1. A relation with degree < 2 is inherently in (at least) 5NF. -- DV
- People want to see and edit information in a format comfortable to them. There is nothing wrong with that, and ideally relational should be able to deliver (remember the BurgerKingPrinciple?). It's just a matter of mapping on an as-needed basis. -- top
- Please clarify: what the hell does burgerking-ing mean? Do I have to grab an image and mock it on the SkinnyTables? page?
- I think he means "have it your way", for which there are two standard mechanisms: queries (with JOINs), and VIEWs. I hope Top is not of the (mis)impression that the schema must mirror the user's desired view of the database. -- DV
- The closer it is to the DB user's preference, the less they'll have to transform it. Fitting the most common view is indeed a factor to consider (among others). --top
MyISAM doesn't support (or didn't for many, many years) foreign keys and cannot be normalized. It is a spreadsheet (I'm just dissing it). As for what a wide and thin table is: well, in real-world examples, when we normalize we typically take some data and put it into a narrower table compared to the table which we are normalizing. We start putting some data into smaller tables (compared to the other table). That's just a point of reference: when we normalize, generally, more and narrower tables are made, and hence not as much width in those tables versus the major table we were normalizing. These narrow tables may grow with time, and will need to be once again normalized - and most likely after normalizing there will be more, narrower tables than before when unnormalized. Maybe it can be explained in degrees instead of narrowness and thinness.
Normalization is unrelated to the presence or absence of foreign key constraints. As long as JOINs are possible (and it would be a pretty useless SQL or relational DBMS that didn't permit JOINs), normalization is viable. As for your use of "thinness" or "narrower" etc., these may be handy informal ways of specifying relative degree between relations, e.g., one can say that relation A is narrower than relation B as a way of saying that degree(A) < degree(B). It is not an absolute measure. -- DV
Coding all the important database constraints into your application logic? You consider that acceptable normalization? I mean, even an Excel spreadsheet can be normalized... if you use enough Visual Basic macros!
Database constraints are orthogonal to normalization. Normalization is not a measure of the strength of a DBMS, it is a set of algorithms for reducing redundancy and update anomalies in a logical database design. It is not physical; the result of normalization is what becomes a physical implementation. A database schema can exist only on paper and still be fully normalized. Paper, I will note, also lacks foreign key constraints or database constraints of any kind. -- DV
- 2NF and higher are constraint-dependent, at least on the semantic constraints (whether or not those are ever explicitly represented in the database). This doesn't contradict you, of course, but it's worth pointing out.
- And I'll note that 'physical implementation' is also independent of the 'result of normalization'. Were I designing an RDBMS, I'd make it easy for the database to decide upon or take suggestions to perform denormalization-based optimizations (i.e. data existing in the physical-layer in a pre-joined fashion but logically provided in the original table design fashion). This would (where the optimization would actually be selected) reduce runtime join costs and possibly storage costs where primary keys are shared by many tables as foreign keys into another, and NULLs in the physical layer would not compromise the logic layer.
That orthogonal word is overused, I'm afraid, and you are using it for mental masturbation. The orthogonal word, is quite often used when we are discussing academic nonsense like how relational model doesn't apply to constraints and blah blah. And you are close to making the "quote of the month" on Fabian's site.
- If you're going to subject yourself to this sort of mental self-abuse, you ought to at least do it behind closed doors; it's ghastly. There isn't a damn thing you can do to go about proving your claims that 'the orthogonal word is overused' (how do you show a word is overused?) or that 'DV is using it for mental masturbation' (he most certainly is not). My own impression is that you are sputtering uselessly, habitually spitting out AdHominems and denigrating anything remotely rigorous as 'academic nonsense' and 'MentalMasturbation' because (a) you'd rather emit blind denials than ever admit concession of a point, and (b) you aren't thinking clearly enough to table it. If you can't come up with something intelligent to say, keep your mouth shut; in the name of justice, people are entitled to abuse you a bit (with biting comments, at least) if you act like an idiot and simultaneously insult them - especially if you do so out of foolishness and pride rather than true stupidity.
Keys are logical. And paper is not what we are discussing.. we are discussing implementing a database in a product - and if the product doesn't support ways to enforce relations and referential integrity, I'm afraid that paper won't help us. If we were discussing paper, we wouldn't be discussing MySql. Foreign keys and constraints can exist on paper. I'll provide you with some interesting quotes.
- It is true that constraints are an important part of databases. DV has said nothing to deny this. However, database constraints ('Consistency' or 'Integrity') are still a completely separate issue from normalization. Atomicity, Isolation, and Durability are also distinct from normalization. They are no less important for being orthogonal; they're simply independent.
"Integrity is a central, critical database, not application function for all sorts of reasons that I explain in my book. Application-enforced integrity is, essentially, data management without DBMS. That's what we did before we had DBMSs, and we invented DBMSs because that proved not to be cost-effective." -- Fabian
"I presume you are talking about enforced referential integrity? Why does it have to be enforced? You can have relations [between tables] without them being enforced. Many of the larger database applications, such as SAP R/3 and Peoplesoft don't use enforced foreign key constraints because of speed issues." -- Fabian's Quote Of The Month
There appears to be some misunderstanding here. I have in no way deprecated database constraints. They are vital. However, they have nothing to do with normalization. Failure to normalize is not due to a limitation of any database product, nor is it promoted by any database system feature. It has nothing to do with database products. If it's not done properly, it is a failure of the database designer to do a proper design job. -- DV
RE: Failure to normalize is not due to a limitation of any database product, nor is it promoted by any database system feature. -- DV
I'd argue that 'failure to normalize' IS promoted by typical database system lack-of-features (or, more accurately, normalization is discouraged). In particular, the lack of powerful and assured physical-layer optimizations that cross tables can make it difficult for algorithmic-complexity-aware programmers to further normalize even in the face of potential errors. Most systems-language programmers need guaranteed optimizations, like guaranteed TailCallOptimization, before they feel comfortable trusting the system to 'get it right'. I certainly do. A very common perceived (and generally accurate) optimization is that denormalization on the logical layer (or, equivalently, resisting higher-level normal forms) will result in the same structure on the physical layer, and that by this one avoids costs for joins that would be required if the tables were further normalized. However, if one could simply suggest the Database to effectively 'denormalize' certain tables at the physical layer or otherwise be prepared for a great many join-queries over a certain common subset of tables in such a manner as they could ensure the costs of joins on certain other tables were exactly O(1) rather than O(N) (which would be typical of a join of two tables hash-indexed on the same key) or O(N*log(N)) (which would be typical of a join of two tables tree-indexed on the same key), they would then feel comfortable using the higher level normal forms. For someone who wants the ability to dynamically add new tables of data, this would be even better if the database could simply figure out such optimizations based on the queries it sees. This could be done by any number of mechanisms at the physical layer, of course - e.g. having the same index carry pointers into several tables, and having the query-optimizer take full advantage of this; 'physical layer denormalization' refers only to ensuring that once a shared primary key is found, that joins using that key would cost no more (at least algorithmically) than would having constructed the same logical denormalization in the first place. I strongly suggest you design cross-table optimizations into RelProject; the degree to which automated optimizations shape and influence higher level code is not something many designers think about, but it is significant, so optimizations ought to make doing the 'correct' thing the 'best' choice.
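To make the kind of physical-layer "suggestion" described above concrete, here is a minimal sketch. The 'cluster tables' statement is entirely invented for illustration - no claim is made that any current product accepts it, though Oracle's table clusters are roughly in this spirit - and the table names are hypothetical. The point is only that the logical schema stays normalized while the join cost is addressed underneath:
-- Logical design: fully normalized narrow tables.
create table isFoo (id int primary key);
create table fooA  (id int primary key, a varchar(50));
create table fooD  (id int primary key, d varchar(50));
-- Hypothetical physical-layer hint: co-locate rows of these tables on the
-- shared key so a join over them costs no more, per entity, than reading
-- one wide row. (Invented syntax - illustration only.)
cluster tables isFoo, fooA, fooD on (id);
-- Queries are still written against the normalized logical tables:
select fooA.a, fooD.d from isFoo natural join fooA natural join fooD;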
Duly noted. In stating that "failure to normalize is not due to a limitation of any database product", etc., I meant that failure to normalize is not due to lack or presence of (say) foreign key constraints, or any other typical DBMS construct. Any modern DBMS will let you create precisely the tables/relvars you want, whether based on a fully-normalized design or not. However, performance is indeed a concern, and perhaps sometimes overly a concern. For example, I once worked on an interactive database-driven desktop application with an expert data modeler. She studiously normalized the logical design to at least BCNF, then denormalized the implementation back to effectively 1NF on the mistaken assumption that desktop DBMSes (the Jet database engine, in this case) couldn't possibly be fast enough to support JOINs in an interactive context. She wasn't convinced otherwise until I implemented the design in BCNF and demonstrated - on real data - that the response times were more than adequate. Given her mainframe background and lack of PC experience (at the time) I suppose this isn't surprising. But I digress. The RelProject currently implements no optimization, and guarantees the worst possible performance in every case. JOINs between two relations, for example, are O(N*M). This is partly due to premature optimization being the root of all evil, and partly due to my intent to use Rel as a research platform for exploring various ideas I have for database optimization. For the educational settings in which Rel is currently used, that's turned out to be perfectly acceptable. For practical applications, I will have to implement some reasonable amount of optimization. -- DV
Keep in mind that "Premature Optimization is the Root of all Evil" applies only to hand-implemented optimizations (like your expert data modeler did with her design). Nothing about the rule applies to automated optimizations, and you don't need to optimize your optimizer-code to get the full benefits. However, I agree that it can wait until Rel is being shifted towards production settings.
Re "premature optimization": Well, yes, I was being a bit facetious, but I'm sure you know what I mean. Walk first, run later, etc., and avoid the paralysing infinite-regress problem of not deploying <x> until the <y> optimization is working, which is dependent on <p> which should have <z> optimization, which won't work until <q> is implemented, and so on. Or, just brute-force <x>, get the damn thing out the door, and have a student write <y> later as a thesis project. -- DV
PageAnchor "Dorthy": (context: deletion of unnecessary one->(one or zero) attribute relation)
Please clarify this. A given query needing to reference column X is generally independent of the wide/thin model. Thus, the need to remove it from a query is the same regardless. The usage of X is due to a domain task need, not the underlying table model. For example, if a tax verification task needs a social-security number, that is because there is a business need for it, regardless of which table or data model it's stored in. If we later don't need it, then we clear out the reference either way. If we are using asterisks, then it's usually a situation where we wouldn't change it anyhow. (The list of comparisons is a good start. Kudos.) -- top
While the usage of X is a domain thing independent of the underlying table model, the layout or location of X within the table model will depend on which philosophy is being followed. Because X is a (zero|one) attribute for some entity, it is the perfect target to be sucked into a 'wide' table under the WideTablePhilosophy?, while it would go into an entirely new table in the NarrowTablePhilosophy?. (See also the approach to having added X to begin with.) If you think about it, after you delete a column from a 'wide' table, you certainly need to run around and touch all the queries that update or insert to that table, and you would be wise to touch even those that query it just to be sure you missed nothing. Since this is a 'wide' entity table, that means every query associated with that entity. Compare this to the NarrowTablePhilosophy?, where you have a unique table essentially attrX(entity-id,x-value). You only need to touch the queries that insert, update, or query table 'attrX', and since 'X' has been determined to be unnecessary, there aren't likely to be many queries, updates, or inserts that target 'attrX' explicitly, which means you can get away with deleting 'attrX' without too many worries even if you don't do a search for queries related to X - this is especially true if you added attrX recently and have no reason to believe it in use. (Note that timing doesn't help the WideTablePhilosophy? people since they had to touch all the inserts and updates for the entity-table even when adding X in the first place.) Anyhow, this gives NarrowTablePhilosophy? an advantage when removing X in most cases (fewer queries to touch), and an even stronger advantage in special cases (recently added). Does this help clear things up? (If so, I'll move this clarification text above.) Do you still have reason to dispute it?
There seems to be a misunderstanding somewhere here. Let's start from square one. When I am designing a task for an app, I ask myself, "what pieces of info do I need from the DB to carry this task out? Hmmm, I need X, Y, and Z to do the necessary computations". Whether X, Y, and Z are in one table or 3 does not affect the task's need for these 3 items. The decision to use XYZ is NOT based on schema design. If the task needs these 3 items then it needs these 3 items. If they were not in the DB, then the task would have to get them from a user prompt or a file maybe. They are "necessary inputs". Removal is the same. It's true we may end up touching a different query or sub-query in a given task, but we still have to change our queries within the task. -- top
You seem to be restricting your thinking to read-only queries. Anyhow, I edited the above to be a bit clearer.
No I am not. A distinction needs to be made between having to change the routine/task and having to change an existing query. From a finger-wise editing standpoint, the effort to remove a column from INSERT query A (wide) versus having to delete INSERT query B (thin) is about the same. It's a wash. (The cost of having to have multiple insert queries will ding other metrics.) If you are not counting finger and eye movement as your metric, then what are you counting?
Only if you simultaneously assume (a) full control over the queries (i.e. none of them run on a client's machine to which you don't have access), (b) ready access to edit them, (c) file location and access time ~= 0, and (d) edit-time >> 0, should the 'finger-wise' editing standpoint even come into play. And it should be clear that (d) is firmly violated in this case. I must laugh a bit; I find it almost ludicrous that you even mention it. Post-deployment costs are high on a per-change basis. The wide-table philosophy is hurt relative to the narrow-table philosophy because the wide-table philosophy can require making changes to programs completely unconcerned with 'X' (generic inserts and selects PLUS those that explicitly touch X), whereas the narrow-table philosophy guarantees this isn't a problem (you only ever touch the table with X if you actually want X).
You seem to be talking in generalities that have a lot of unstated assumptions behind them. And, I am NOT "dictating" that finger movements etc. be the prime metric; I am just trying to figure out what metrics you are actually using to compare by stating my assumptions and then letting you correct them. If you have a more efficient way to check my assumptions against yours, I'm all ears. -- top
Nothing among my unstated assumptions was outside what I'd expect any computer scientist with field experience programming APIs & protocols of any sort (including database schema) to understand quite well. Bringing up 'finger movements' as even a potential prime metric still strikes me as ridiculous... e.g. would you ever even think of measuring how much effort it takes to change the HTTP protocol in finger and eye movements? No? Well, then why would you think it for deployed schema changes?
Again again again, the "finger-movement" suggestion was merely to prod you to clarify (or begin to state) the metrics you are using above. If the metrics are not clear to the reader of your chart, then the chart is useless. I'm just the messenger. You seem more interested in bashing me than making your text clearer. You want to invent problems. CodeChangeImpactAnalysis offers some suggestions. -- top
- So, are you saying I should mention stupid stuff like "the maximum number of characters on a line" to prod you to clarify (or begin to state) your metrics as you often fail to do?
- There have been times where people misinterpreted something I wrote into something really silly, and stating that silly scenario did indeed help me realize where the mental branching was happening.
- Or did you honestly think "finger movement" was a valid metric? If a reader of any chart is blind or illiterate in the language used, then that chart is useless to that reader whether or not it contains useful information. I'm quite certain that it is NOT my task to teach you to read. You seem more interested in blaming your lack of comprehension on other people than you do in acquiring it yourself.
- How hard would it be to select some familiar scenario, say something from the CampusExample or PublicationsExample, and make some dummy SQL in app code highlighting the alleged claim of "more to change"?
- You tell me: how hard would it be to prove these are in any way representative examples? Publications example doesn't even have a clear purpose or goal in the page you provided. Is it supposed to be a listing of periodicals published? or a card-catalog? or an inventory? How can you ever compare two models when you don't even know what the heck they're supposed to be modeling?
- The main purpose of comparing scenarios here is to understand what kind of things the other person is imagining. Perhaps your coding style is so different from mine that the example is not applicable to one or the other. But that alone may be useful info. If your assumptions depend on coding style, we've learned something. As far as PublicationsExample, assume a general library. I've worked on corporate libraries where they had how-to books, financial statements from competitors, etc.
- You are not communicating where this alleged "more change" happens. I don't know what you are counting to come up with that claim. The thoughts in your head are not coming down sufficiently onto the written page.
- Go re-examine the cost assumptions section above. When you can ask clear and specific questions, I'll be happy to answer them.
- I find it wishy-washy and non-committal. At points you seem to be suggesting that merely the need to change is sufficiently costly because of all the bureaucracy that goes with it. Thus, "have to make any change" seems to be your metric. That's an okay start, but you still are not communicating why skinny tables would prevent more changes. Like I keep saying, which resources are used by a program/task is mostly driven by info need, not the source of the info. -- top
- I've only started filling in the list above. Post-deployment, costs indeed are largely determined by the need to make any change. And your claim that "which resources are used by a program/task is mostly driven by need, not the source of the info" is correct, but largely irrelevant: necessary changes to programs or tasks as a consequence of altering the source ARE, both in practice and (trivially) from definition, driven by changes to the source of the info and determined by use of those sources. So, unless there is a good reason you "keep saying" it, you should stop saying it.
- Let's start with something specific: can you provide a single specific biz-logic scenario where a removal change would be required in the wide-table solution but not the narrow one?
- Easily. Add a random one->one attribute to 'publication' such as, perhaps, the boolean for 'hardback'. Then decide that it isn't needed, or needs to be replaced with something that allows more readily for 'electronic|softback|hardback'. For both the narrow solution and wide-table solution, any queries that explicitly needed this 'hardback' column will, of course, need to be touched. However, in addition to those, any wide-table solution, task, or application that performs a 'select *' or a generic 'insert' will ALSO need to be touched; application-code is often fragile and can break if an unused column is removed or inserted. Therefore, there are changes required after a removal from the wide-table solution that are not required after a removal from the narrow one. Admittedly, you can avoid these particular costs (for the removal case in the wide-table solution) IF you control all the application code and the queries within it and you fastidiously ensure that there are NO generic 'insert into table values ...' (and you always use 'insert into table(attr1,attr2,attr3) values...' instead) and that there are NO generic 'select * from table' placed anywhere near fragile code. But that would require small projects that are not widely representative of the use of relational databases.
- See example at PageAnchor: "Paperback Gone". I don't see the wide table creating more update spots under a variety of variations. You seem to be complaining more about asterisks than wide tables. (Let's save the asterisk debate for another day). Also, small apps are more common than large apps such that the total lines of code is probably roughly even between them. But, app-size is not an issue here yet anyhow.
- Let's just have you leave it at "I don't see" - it really does sum up your acceptance of your own ignorance. The 'asterisk' debate is NOT a separate issue from wide tables. And control over app-code is the issue; any single programmer can only have full control over a relatively small application - it's a consequence of finite lifespan.
I'm willing to specify assumptions for any particular thing on request, but I do have a right to expect you are of the experience level at which you typically portray yourself.
State Your Metrics
My metrics for which particular thing? You could look in the defs/cost assumptions section above. CodeChangeImpactAnalysis applied generically has been part of the above metric, but (as noted above) often is only a -moderate- consideration relative to the costs post-deployment of even getting the rights and capabilities to change or impact the code.
Conversion Tax?
There is an issue that is being ignored above: wide tables fit user usage patterns better. The user's view of the data tends to fit the wide-table philosophy, and thus there are fewer translation/conversion steps. It is somewhat analogous to the OO/relational impedance mismatch, where the "shape" of objects being different from that of relational tables creates the need for a messy translation layer. For example, most CrudScreen edit forms save to just one wide table, and we thus need only one insert statement. With thin tables, such a form would have to write to multiple tables. (I assume that you are not suggesting a thin-table-friendly app redesign.) Some RDBMSes have updatable views, but these complicate the design and may be slower. Plus, if you have a view that maps thin tables to the user view (wide), then you have two places to update when you add or remove a column from the "entity" view: one in the actual table, and one in the view. Under the wide table, you only have to change the single table itself. (A sketch of both approaches follows below.)
As a general rule: The closer your internal idioms match domain/user-side idioms, the simpler the design.
-- top
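To make the "conversion tax" point concrete, here is a minimal sketch of what saving one CRUD edit form looks like under each philosophy. All table and column names are invented for illustration:
-- Wide-table philosophy: the edit form saves with a single statement.
create table publication (id int primary key, title varchar(100), price decimal(6,2), hardback int);
insert into publication (id, title, price, hardback) values (123, 'Some Title', 9.95, 1);
-- Narrow-table philosophy: the same form writes several tables,
-- ideally inside one transaction.
create table isPublication (id int primary key);
create table pubTitle    (id int primary key, title varchar(100));
create table pubPrice    (id int primary key, price decimal(6,2));
create table pubHardback (id int primary key, hb int);
begin; -- or 'start transaction', depending on the DBMS
insert into isPublication values (123);
insert into pubTitle      values (123, 'Some Title');
insert into pubPrice      values (123, 9.95);
insert into pubHardback   values (123, 1);
commit;
Whether that extra bookkeeping is a real cost, or disappears behind an updatable view, is exactly what is being argued below.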
Consider a situation where your users are viewing a website that offers services in several states of the USA. Customers want to find all the cities available in Michigan that your website serves. A wide table doesn't help; you should have your cities in normalized form. It all depends on what your users are doing. Suppose your customers want to find all the cities your website serves in Michigan that start with the letter 'H', because they can't remember the exact name of the city but remember that it might start with H. A normalized table that relates cities to states helps here. A huge wide-table CRUD screen is a poor view for many situations - and one should not design a database so that it maps to some huge ugly wide CRUD screen that customers have to put up with. Most cleanly designed web GUIs use things like checkboxes, edit boxes, text areas, etc., that are not thrown onto some horrible CRUD screen; rather, they are laid out down the page in a sensible way, the way people fill out forms on the web when they submit data.
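A minimal sketch of that city lookup, assuming a normalized Cities table like the one listed further down this page (cityName, state):
-- All cities served in Michigan:
select cityName from Cities where state = 'MI' order by cityName;
-- Cities served in Michigan whose names start with 'H':
select cityName from Cities where state = 'MI' and cityName like 'H%';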
I get the feeling that the type of applications you design are huge wide CRUD screens where people directly enter boolean values and NULLs, instead of using eye candy like check boxes, radio selections, drop downs, etc that are spread on the screen in what is called a view or GUI. Do you really let people enter the data into the database directly, without any eye candy and human interface? What happens when these humans have to view all the cities in a certain state only? Your wide table ain't gonna match up to what the human needs.
If your application is one huge ugly CRUD screen where people enter booleans and prefer scrolling sideways instead of downward (very inhuman), then I can see where your viewpoint is coming from - actually, I can't see where your viewpoint is coming from - as it sounds like the apps you develop are Excel-spreadsheet views of everything, and that makes little sense to me, since most GUIs are not one huge ugly Excel spreadsheet spanning three pages wide. Most GUIs are for humans who like to search things, update pieces of things, find available cities, find facts relating to certain data, etc. Consider even Midnight Commander when you open the dialog box to view the available chmod settings - those bitmasks/modes belong in a thin table or an enumeration, not shoved into some wide table where they are hidden away and cannot be queried in their own bitmask (thin) table!
I think your CRUD application experience has distorted your entire view of what the relational model and normalization are all about. I do, however, remember going around to houses as a child selling chocolate bars, and we had a big spreadsheet-like form where the buyer would fill in their Name, Address, Phone, Chocolate Bar Count, City, and State. But if the user ever wants to query which states we sold chocolate bars in, the states are better off in their own (thin) table for that view. So the argument that thin tables are worse for the view is nonsense when you are querying and searching the data - which, frankly, lots of applications do. And I can't say I've seen many poorly designed apps that let one view the entire wide table as one huge wide screen where I have to scroll sideways five pages.
- I don't know what you are imagining here. It doesn't match what I do. I suggest you break these into one example per section and we can deal with them one at a time instead of brief descriptions all crammed into one monster paragraph. -- top
- Are you suggesting that I modularize, like a MopMind? would do? Why not just put it all in one wide paragraph or use sentence anchors?
- Consider that one has to, for example, make a drop-down box that lists all the available cities in a certain neighborhood... a normalized table helps here. One huge ugly CRUD screen for all data entry and searches just isn't flexible or helpful for developers working on long-term solutions (maybe I'm too web-oriented, and the chocolate-bar data-entry spreadsheet form applies to some applications).
- Consider that we want to find all the homes we sold chocolate bars to. We do not want to include duplicates. The homes are better off in their own normalized table, since the homes do not relate to the chocolate bar type purchased! When we want to view only unique homes... we have to have unique homes.. not unique customers. So again, the view of a thin table helps when we want to query the homes, since the homes relate to the homes, not the chocolate bars or the person's first name. So the view, being flexible, rather than fixed.. is better when it can access normalized data.. crud screens seem better for fixed wide spreadsheet style views. I've always thought of databases as flexible - create any view I want. You seem to map the data to the application. But what happens when you need to grab statistics about what relates to what? Then the view starts to map much better to the thin tables!
- Both of these examples seem to assume tables unnormalized even at the value level. I don't make those kinds of tables. You seem to assume an extreme form. "Addresses" and "Products" (candy) are different entities. Same with "Cities".
- [then what do you mean by wide tables and thin tables, because it sure isn't clear what it is you are talking about when you say wide and thin. Surely, thin doesn't just mean the "missing data" null solution that you saw in the Darwen PDF file..]
- A typical schema may resemble:
Table: Products
-----
productID
prodDescript
price
...
Table: Customer
---------
customerID
lastName
firstMiddle
street
cityRef
zip
...
Table: Cities
------------
cityID
cityName
state
- Above are thinner tables, Top, and what is your view on this? Bad? Good? It isn't clear, because you don't make it clear as to whether a wide table is XYZ or whether a thin table is ABC. I'd take it maybe one step further and put the street/zip in its own table so you could find info on the houses, if needed, btw.. but then the phone number could apply to a person, and not the house, and this is where we must start to make compromises. But the idea is not to put ALL the above data into one huge table! Are you suggesting it is put in one huge table? It's not clear what you mean by wide and thin, and this was Dave's point about Degrees. Frankly the terms wide and thin reek of EverythingIsRelative.
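For what it's worth, the "wide" presentation the user expects still comes straight back out of the thinner tables above with a join; a minimal sketch using the listed columns:
-- One row per customer, with the city and state spelled out.
select c.customerID, c.lastName, c.firstMiddle, c.street, ci.cityName, ci.state, c.zip
from Customer c
join Cities ci on ci.cityID = c.cityRef;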
- Like I said before, domain issues are the primary driver for my design, not any hard-and-fast normalization rule. As far as addresses go, see ContactAndAddressModels. There is a wide range of design options there, depending on how "fancy" one wants to get. If the company merely uses addresses for identity purposes, then the high end may be overkill. But the issue that's most contentious is creating thinner tables merely to remove nulls. I'd like more scenario-based evidence for this. Yes, nulls are annoying (as poorly implemented in most RDBMSes), but so are thin tables created merely to get rid of them. -- top
{I do expect updatable views (see the "Things the NarrowTablePhilosophy? often forgets to mention" section). We can actually achieve O(1) performance for join-views precisely equivalent to the views you see for wide tables, in addition to extending that to many<->many tables, given intelligent design of table indexes, so the performance will be no worse than the wide-table solution's (at least not algorithmically). And you only need to update explicitly specified views; I rather expect that views of this sort would often be runtime-constructed based on queries to the RDBMS meta-data.}
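A minimal sketch of such a join-view, reusing the invented narrow publication tables from the earlier sketch; whether the view is actually updatable, and whether each join probe is O(1), depends on the particular DBMS and on indexing the shared id:
create view publicationWide as
select p.id, t.title, pr.price, h.hb as hardback
from isPublication p
left join pubTitle    t  on t.id  = p.id
left join pubPrice    pr on pr.id = p.id
left join pubHardback h  on h.id  = p.id;
-- Applications read (and, where the DBMS permits, write) publicationWide
-- instead of touching the narrow base tables directly.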
{And I won't trade simple design for simple interface and operation and data-management. It's a bad idea. It would be like choosing BrainfuckLanguage because it's simple to implement. I consider such decisions 'simplistic'.}
Lisp would be a better example. Lisp fans brag about the power of simple concepts as the atomic building blocks. I admire Lisp, but would not want to use it in team-based production environment (LispIsTooPowerful).
{If when you said: "As a general rule: The closer your internal idioms match domain/user-side idioms, the simpler the design.", what you meant was: As a general rule: The ability to apply arbitrary layers of translation between the user and the program results in a simpler design, then I'd agree that Lisp is a better example. And, to a certain degree, I might even be inclined to agree with it. But somehow, after reading the rest of your statement, I'm disinclined to believe you were promoting translations. You're promoting keeping a simplistic design in order to avoid translations - e.g. to exactly represent what the user sees and interacts with inside the database. As such, I think BrainfuckLanguage is, indeed, the better example.}
{RE: "(I assume that you are not suggesting a thin-table-friendly app redesign.)" - I'm not planning on redesigning apps at the moment, no (though NoApplication is certainly on my agenda, I consider it an independent concern). However, I do expect to use a more multi-table friendly DML/DDL. As noted above, SQL has got to go for the NarrowTablePhilosophy?.}
With better/overhauled RDBMS, I may use more thin-table design also. We should probably limit this to existing "typical" RDBMS because possible future fixes are wide open. Anyhow, even if a view maps multiple small tables to a single "writable" wide table, the double-maintenance issue above still stands: you have to change both a view and a table for many changes. -- top
That depends on whether the view itself is generated by meta-query or not, really (no reason views can't be generated by the TableBrowser). And I don't like limiting my thinking to "typical" RDBMS, but I can see some merit in it. I'll agree that, limited to "existing" RDBMS and lacking any need to maintain meta-data or temporal information, I'd often choose wide-tables in 5NF.
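As a rough sketch of "generated by meta-query": a TableBrowser (or a small script) can pull the narrow tables' layout from the standard information_schema catalog and emit the join-view from it. The 'pub%' naming convention and the folding step are invented for illustration; information_schema.columns itself is standard SQL:
-- List the attribute tables for the 'publication' entity by naming convention.
select table_name, column_name
from information_schema.columns
where table_name like 'pub%'
order by table_name, ordinal_position;
-- The tool then folds this result into a 'create view ... left join ...'
-- statement like the publicationWide sketch above.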
I think we are almost reaching a "tool problem" consensus: TableViewDistinctionArchaic -- top
PageAnchor: "Paperback Gone"
Here, we are removing column "d" (paper-back flag) because we don't need it.
//---- Snippet A: wide table, explicit column list
qry1 = select a,b,c,d from wideTable; // removal 1
...
print(qry1.d); // removal 2
//---- Snippet B: thin tables, explicit column lists
qry1 = select a,b,c from foo;
qry2 = select d from thinTable; // removal 1
...
print(qry2.d); // removal 2
//---- Snippet C: wide table, asterisk
qry1 = select * from wideTable;
...
print(qry1.d); // removal 1
//---- Snippet D: thin tables, asterisks
qry1 = select * from foo;
qry2 = select * from thinTable;
...
print(qry2.d); // removal 1
//---- Snippet E: thin tables, asterisk plus explicit column
qry1 = select * from foo;
qry2 = select d from thinTable; // removal 1
...
print(qry2.d); // removal 2
(Insert statement issues follow roughly the same pattern as the non-asterisk examples.)
Honestly, top, your example needs considerable work. What's with having two queries each time? And who would ever 'select d from thinTable' (which would return, in any proper relational database, exactly one of: {}, {true}, {false}, {true,false})? If you need 'd', you'd grab it in a join. Where the wide-table solution fails is 'select *' when you don't need 'd', and inserts when you don't know or don't care about 'd'. Where are these represented above?
I skipped join and view examples because they didn't seem materially different than the wide-table example. Anyhow, this is your opportunity to respond with allegedly more representative code examples. (Disclaimer: the above are meant to be scenario studies and not code practice recommendations. But do note that the DBA may not be able to set coding standards.) -- top
Really? "Didn't seem materially different" to you, eh? Your version of science seems to be following your gut feeling & burgerkinging it rather than anything more rigorous. Frankly, by skipping cases you entirely misrepresented the issue... built yourself a little straw house. For some huffing and puffing:
Wide Table Solution: Cases
create table foo (id,a,b,c,d,e);
select * from foo; ... uses 'd' explicitly
select * from foo; ... don't care about 'd'
-- sub-case 1: code is fragile and depends on existence of 'd' column
-- sub-case 2: code is robust and can chug along happily without 'd'
select b,c,e from foo; ... (can't use d)
select b,c,d,e from foo; ... (must use d; no other reason to ask for it)
(note: ignoring, for the moment, the issue of 'd' appearing in a 'select ... where' clause; however, it is obvious that these would also need to be repaired in the event of d's elimination).
insert into foo values (new-id,new-a,new-b,new-c,new-d,new-e)
-- sub-case 3: code has explicit knowledge of 'new-d'
-- sub-case 4: code is just injecting a default value for 'new-d' as a code placeholder
insert into foo(id,a,b,c,e) values (new-id,new-a,new-b,new-c,new-e)
insert into foo(id,a,c,d,e) values (new-id,new-a,new-c,new-d,new-e)
Narrow Table Solution: Cases (noting that SQL is a poorly designed DML for narrow tables)
create table isFoo(id);
create table fooA(id,a);
create table fooB(id,b);
create table fooC(id,c);
create table fooD(id,d);
create table fooE(id,e);
select (anything) from isFoo,fooA,fooB,fooD where ... (massive join on isFoo.id) ...
must use 'd' (either in query or post-query) - no other reason to ask for fooD.
select (anything) from isFoo,fooA,fooC,fooE ... (massive join on isFoo.id) ...
must NOT use 'd' because 'd' is only available if one asks for fooD
atomic { insert into isFoo values (new-id);
insert into fooC values (new-id,new-c);
insert into fooD values (new-id,new-d);
... }
atomic { insert into isFoo values (new-id);
insert into fooA values(new-id,new-a)
insert into ... excluding fooD }
For the narrow table solution, there are strictly fewer cases; all those that include 'd' must do so explicitly by including 'fooD' and are equivalent to the wide-table cases that explicitly request 'd'. Most importantly, there is no narrow-table equivalent to sub-cases 1 or 4, which are the problem-patterns for purposes of removal of a column. Where the DBA cannot (or through lack of foresight does not) rule those out by demanding certain coding standards be adhered to (which, as you note, is the case often enough to be worth mentioning), the narrow table solution is better. Generically, the narrow-table solution is better in some cases, and no worse in all others, for the removal-of-a-column scenario, making it objectively better overall (for this scenario).
- I'm still trying to decrypt case 1, but for case 4, you don't have to mention columns not used in INSERT statements.
- You must think me remarkably dense if you didn't believe I already know about explicit columns on inserts after reading the final two wide-table insert cases I had written above. The ability to do this doesn't make the problem case disappear.
- Your usage of "placeholder" left some questions.
- Obviously they were answered questions, so they weren't "left".
- If you have to mention it because its in the form, then it would be there for thinville also.
- Indeed - places where 'd' is explicitly referenced would be there for both wide-tables and narrow-tables (via 'fooD'), and therefore thin tables are no worse than wide-tables in these cases. This was mentioned above.
- Further, I often use insert-statement generators that automatically skip empty/blank columns. SQL is pretty good about not requiring you to mention columns not used (but sucks in other ways). There would be one "save" reference either way for the INSERT scenario that would have to be removed. Or are you counting the fact that for INSERT statements you have two places in a given statement for the widie - the name before the Values clause and value itself. You are indeed right about this. However, this is a flaw of SQL syntax because it should allow UPDATE-like pairing. And again I often use statement generators that limit it to one statement per column. I'll give you minor points on that, but only as a syntax design flaw in SQL. -- top
- Sigh. Waving your arms a bit and claiming some self-discipline doesn't make the basic issue go away. Self-discipline only helps your case so long as you are the only person who ever writes queries for your schema. And, no, the finger-wise editing shit related to SQL syntax and two places for removal of the 'd' column are so minor that I'm not even counting them.
- If you are not counting it, then what are you counting for INSERT scenarios to claim advantage on #4? (You are right about counting the clause generators. My bad. But, others have mentioned IDE's, I would note.)
- Fundamentally, there is always the potential of someone performing such an 'insert' as seen in sub-case 4. This can happen out of laziness (for really-wide tables, specifying all the columns twice may be too much a hassle). This can happen for a number of other reasons (preference of programmers outside your control, for example - you can't stop other people from doing things you consider stupid any more than all my reasoning power can prevent you from doing things I consider stupid). This 'INSERT' code breaks when you delete column 'd' even though it could have been written so it wouldn't break. This means that narrow-tables are better - more advantageous - in these specific situations where they prevent this pointless breakage, such as when the DBA lacks full control over all the queries; see also the sharing section under assumptions and definitions, above.
- I thought you said you were not counting the "splitness" of INSERT syntax. Have you changed your mind, or am I misreading the above? I don't care if you do, I just want to make sure thats the case.
- The finger-wise editing of the 'split' INSERT syntax is not counted as a significant cost. However, the tendency for people to lazily perform the insert with all the columns is real and has a significant impact on code-change costs.
- If they include such out of habit, they may also include all your column-tables out of habit. Happy Pastey syndrome wouldn't be limited to just one approach. However, I won't challenge this one even though we ding it for different reasons. You can keep that point. But note that you are slipping back into psychology for your arguments ("...tendency for people to lazily perform..."). Things will quickly degenerate if we go that route because either psychology is a "soft" science, and/or we'll project our own habits into the decision because we own only one body.
- That 'psychology' backpedaling of yours isn't very strong. The reason people use 'select *' doesn't matter - the issue of laziness is merely illustrative to help you grasp that people can and do apply many different solutions to the same problem for reasons you might disagree with. Heck, some might even do it even for reasons you'd initially agree with (DoTheSimplestThingThatCouldPossiblyWork - some interpret 'select *' as 'simpler' than 'select a,bunch,of,columns'. If it breaks later, they say, you can fix it later - even post-deployment... just call everyone using your database up and force them to change their code! You often promote behavior in the name of simplicity that I consider simplistic in the greater context.) What matters is that the removal of column 'd' can break code that never explicitly touches 'd' - not even to print it. This also means that 'd' isn't even available for searches via refactoring browser, and you can't be sure that removing 'd' only requires you touch code that explicitly touches 'd'. In the narrow-table solution, even if programmers habitually loaded all the tables with their query, you'd always have explicit mention of the table 'fooD' in any code that might break because 'fooD' is removed.
- Things will quickly degenerate if we go that route because either psychology is a "soft" science, and/or we'll project our own habits into the decision because we own only one body. If we stick to things like statements changed or functions changed or contiguous segments changed, it's more objective. But I agree that psychology matters even if its difficult to quantify. However, it won't settle any arguments, but rather gum them up. The implication was that the benefits of slim were objective. ObjectiveEvidenceNeverFound still stands.
- The only concern of objectivity here is that we both agree to consider low-cost-of-change better than high-cost-of-change; the psychology and reasons for people taking certain paths aren't particularly relevant - not unless you can guarantee they will do what you consider wise (i.e. that the negative cases are impossible.) Theorem: If you demonstrate that X is better than Y for at least one case C in scenario S, and that X is no worse than Y for all other cases in scenario S, then you have proven that X is objectively better for scenario S. The reasons and psychology behind case C are entirely irrelevant; so long as case C is not impossible, this theorem is valid; it holds because being better at one thing and no worse at others is always better overall.
- Well, there's a problem with your theorem, but it's not relevant just yet because you have not showed it net objectively better.
- Oh? Put your money where your mouth is, TopMind. Find something that contradicts the notion that being equal in all ways except qualifiably better in one is a negative property.
- I have shown at least one metric that asterisking helps: not having to alter a routine if the app code itself does not "use" the removed column. What you are counting to counter that is not net clear yet. Further note that not all scenarios are necessarily equal in frequency and even per-shop (per WaterbedTheory). Thus, to assign them equal weights is premature. -- top
- Eh, TopMind, you obviously didn't think that one through. The asterisk is NOT helping here; or, more accurately, the ONLY help it is providing is saving a few keystrokes in the initial creation of the query, which just happens to be an entirely different scenario than the one under fire. Queries that explicitly request columns associated with app-code that does NOT "use" the removed column ALSO do not have to be altered when you remove the column BECAUSE they never request that column in the first place. Therefore, come time to remove the column, the 'select *' offers you NO ADVANTAGE over such code. This just happens to include all the narrow-table code that doesn't require the column. In addition, you're (in a remarkable display of stupidity) assuming all cases of 'select *' are sub-case 2 - that application code is ALWAYS 100% robust and can ALWAYS survive the removal of a column without breaking. Not only is this trivially false (I've seen plenty counter-examples and could write some in a jiffy if you really need it); it isn't even uncommon. There are many cases where you have to alter the routine when the app code itself does not "use" the removed column because the app code breaks anyway after that column's removal. While the asterisk might help a tiny bit in the initial scenario of writing the query in the first place, when it comes time to remove a column it is objectively no better or worse for every single case. And, sure, not all scenarios are equal in frequency. But that fact doesn't help YOUR point in any way.
- Sorry, I misspoke. Snipers did not shoot at me in Bosnia after-all. Anyhow, the metric I meant to use was number of statements that have to be changed, not functions. To add or remove a column, you have to change the query and the app code; but asterisking only has to change the app code (to make use of it). (Again, we are not judging the merit of asterisking here, only acknowledging that it occurs in the field.)
- Ah, so it all comes down, for you, to finger-wise movements. I'll give you that the asterisk saved you from moving your fingers to touch the query statement. Instead you just need to handle broken application code - sometimes even application code that never touches 'd'. But don't mind that! You saved the time it takes to delete "d," from "select id,a,d,e"! That's surely a significant boon relative to the costs of either fixing the application code to exclude 'd' and dealing with the extra complication - the potential for broken, fragile application code that never used 'd' in the first place! Pffft. As I said, finger and eye movement metrics don't count for much - not unless you can offer good reason for them to be significant relative to the costs of finding the right code to fix and the even greater (post-deployment only) cost of acquiring capability to fix it. If you were to ask: hand-entry of data would be significant, as would hand/eye repair of rows that failed a translation due to poor logical consistency and excessive use of poorly implemented NULLs. Deleting an extra "d," along with a bunch of OTHER scattered uses of "d" in application code associated with the query? Not so much - it's just one extra place to touch, in a highly associated grouping of code (query + response handler), of at least two; it is not even algorithmically significant! It, quite simply, doesn't even begin to make up for costs of fixing application code that breaks simply because it used 'select *' and was a little on the fragile side (e.g. using column offsets, or storing to a fixed-width array of strings).
- It is simply not clear to me what metric you are using. When you are almost clear, it appears to be related to volume of code that needs changing, or some variation of "finger-movements". When you are not clear, it sounds like psychological arguments in disguise. Again, I've coded for both table sizes and don't find wide ones causing cancer and killing puppies. You need to work harder to articulate where the allegedly extra problems will be. -- top
- I've never used psychological arguments as a metric, just to help illustrate a reason you can't casually dismiss a valid metric that you had been fastidiously ignoring. My metric is asymptotic cost of change per scenario, measured in both potential and necessary (potential >= necessary). In practice, this is closer to volumes of code that need changing - not absolute number, but relative portions. The cost of having to edit code in one solution that would not need edited in the other is a great deal higher by this metric than is the cost of having to delete one additional "d," along with at least one explicit use of said 'd' in application (indeed, the 'd,' is cheap and has no effect on asymptotic cost at all). You are, perhaps, used to thinking in 'finger-movements' so you attempt to translate what I'm saying into it - I suppose if you only have that hammer, everything looks to you like a nail. The cost of changing a volume of code internally is, in post-deployment situations, often less expensive than achieving the capability to change it (which may require contacting all sorts of people who have written queries for the database). In pre-deployment (or non-deployment situations), the cost of changing each volume of code and performing all unit-tests and passing everything is pretty much the sum of the costs (albeit not minor). I do assume that queries are written to go along with application or other code that utilize them. As a general consequence of that, fixing the queries themselves is a tiny fractional cost of fixing the application code, firing up the appropriate development environment, and testing the change via any unit tests or analysis. When you keep tooting your horn about a tiny savings in a tiny fractional cost of the total change, I keep rolling my eyes and yelling at you; it's penny-wise and pound foolish. The relative potential cost of having to change application code that breaks that never even touched 'd' is far, far greater than any such savings.
- It's not just moving fingers, but includes the time to find the right spot to edit, and the risk of unintentional damage. The more "spots" you have to edit, the more probability of error, in general. I've accidentally made an "as" clause by accidentally deleting the comma while doing other edits in SELECT clauses, for example ("as" is not explicitly needed in many dialects.) Finger movements are often tied to other issues. Maintenance requires a lot of editing and eye-checking of tedious details. Over-complicating the design can make such add up. - t
- Sub-case 1 is similar (where 'd' isn't even used by the post-query code but elimination of the 'd' column breaks the post-query code).
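A minimal sketch of sub-case 1, in the same pseudocode style as the snippets above; the positional column access is the invented detail. The code never uses 'd', yet dropping the 'd' column still breaks it:
//---- Sub-case 1 sketch: fragile 'select *' code that never touches 'd'
qry1 = select * from wideTable; // columns come back as (a,b,c,d,e)
...
print(qry1.column(1)); // 'a'
print(qry1.column(5)); // 'e' today; after 'd' is dropped only four columns
                       // remain, so this positional access breaks even
                       // though 'd' was never referenced here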
- Usage of an asterisk is a bonus feature. You don't have to use it if you don't want to. Plus, it makes it so that one does not have to change the SELECT statement(s) at all, only delete column usage from the app code using it.
- From the perspective of code breaking upon removal of a column, it's a negative feature. The fact that people can and do use it because they want to is, relative to this scenario at least, harmful. And changing application code is considered among the most expensive impacts of change, so it's hardly a good thing; it's especially bad where the app-code breaks because the 'd' column is missing even though it wasn't using the 'd' column explicitly (i.e. sub-case 1).
- You cannot asterisk-ize such skinny tables, meaning that each column reference must be explicit. Thinville does have to touch both a query and the app usage. So wide gains points there. So far it's a wash: some points are lost on INSERT's stupid syntax, but some are gained on asteriskability.
- The narrow table solution ONLY needs to touch those queries and applications that explicitly request the 'fooD' table. I.e. said queries must explicitly request a special table just to get the 'd' column. This means they'd be of equal cost to the wide-table solution in those cases where 'd' is required for domain purposes. HOWEVER, you're forgetting to add to the wide-table costs all the post-'select *' application code that doesn't use column 'd' but breaks anyway when 'd' is removed (i.e. sub-case 1), and all the 'insert into foo values' code where column 'd' was included but was not part of a form or anything (sub-case 4). You cannot legitimately drop said things even if you use great self-discipline and always select/insert by columns, because DBAs for wide-tables are not always in the position to control all query code. If anything, points are lost in this scenario due to the 'asteriskability' for wide-tables.
- The app itself using "d", such as in a Print statement, has to be changed under both. Thus, app usage is not a difference maker. So, I don't know what breakage you are referring to for SELECT.
- Tell me, 'TopMind', can you truly not envision (or even remember) a situation where, say, some C or C++ application code that performs a 'select *' but that never uses 'd' will break after the 'd' column is removed? Are you trying to say that these situations - those mentioned in sub-case 1 and 4 - are impossible? Are you willing to defend this implication of yours? Or are you just pretending to be an idiot to better irritate me?
- I'll answer the question if and only if you apologize or remove the rudeness.
- And THAT behavior is one of the childish things that makes me greatly dislike you. Not only do you force me to repeat myself umpteen times in umpteen different ways just to get you to barely grasp a point; you also get all prissy when I get irritated at you for doing so. Nah, you don't need to answer, TopMind. I know the correct answers to the above questions are "yes, but I never thought of it" and "no, I'm not pretending to be an idiot."
- Your writing style is poor. That is not my fault. And no, that's not my answer to the C++ question. (And yes, sometimes my writing style is also poor, but I don't fault the reader to such an extreme as to call them names.)
- You're nearly illiterate due to your habitual avoidance of educational material. That is not my fault. And that was my most flattering version of your answer to the C++ question - the ones that would make you seem utterly stupid are: "it's their fault; they could have been smart about it and gone to extra effort to make their code robust", and "I'd never use C++". Either of these would show you really don't comprehend the scenario.
- Neither of those match my withheld response, but the problem being language-specific may have merit. Thanks for giving me more ammo for possible later use.
- You're welcome to try using it. I look forward to once again demonstrating how foolish you're being since code robustness isn't a language-specific issue and C++ is just illustrative of a language that makes it easy to produce fragile code. And I'll note, it doesn't matter what your response would have been; no doubt you could surprise me with how foolish you can be. The fact that you just now started thinking it language-specific still demonstrates that you barely grok anything.
- There are probably weird practices that can break anything and everything. The fact that I cannot anticipate them all does not make me dumb. You can't either, regardless of whether you want to be mortal or not. In practice it's about weighing trade-offs: "should one bloat up their source code or design in case somebody uses an odd practice?" is the more useful question. The kid who rides his/her bicycle with 10 layers of padding and helmets is more likely to die of heat-stroke, or to miss spotting a car because they can't turn their head fast enough due to the padding. It's all about weighing choices and their probabilities. I suspect this is another case of your fastidious designs causing the problems described in BloatInducedReadingConfusion and/or SafetyGoldPlating. [Changes made after replies below. -t]
- {Whose fastidiousness? Here, you're making reference to BloatInducedReadingConfusion and SafetyGoldPlating as if they are proven, recognised, and well-established industry factors. They aren't. They're things you made up, that have no recognition in the industry, and both pages are full of contention which call into question the credibility of their use here. By contrast, GoldPlating is a proven, recognised, and well-established industry factor, but that's not what you're talking about here, is it?}
- Re: "well-established industry factors" - More FakeIndustryCanons? Tsk tsk tsk.
- {You mean you've never heard of GoldPlating? Or the phrase it's probably derived from with a similar meaning, "gilding the lily"? You've never experienced developers or engineers adding needless extras to a product? You must have, and they're well-recognised. On the other hand, BloatInducedReadingConfusion and SafetyGoldPlating are things you made up. So far, no one recognises them but you, and yet you're trying to use them to substantiate an argument, above, as if everyone should recognise them and accept them as fact.}
- I'm not following. There are plenty of nicknames for factors, I don't dispute that; but the existence or lack of a nickname doesn't make the factor "weigh more" or an official canon. Some nicknames are used more simply because they are mentally catchy and/or easy to remember, not necessarily because they are more important. "Bloat" is also a nickname for violating parsimony and/or code-size conservation, by the way. I didn't invent/introduce the term "bloat" into coding circles. YagNi is also closely related to "bloat", and I didn't invent that either. If a wiki topic focuses on a specific kind of bloat, then it's rational to call that topic "bloatX" or "bloat and X" or the like. I'm not doing anything weird or odd.
- PageAnchor topic-disclaimers-01
- {That's a reasonable assessment of bloat, but what is considered bloat is a value judgement, and BloatInducedReadingConfusion is based entirely on your personal assessment of it. Thus, your use of BloatInducedReadingConfusion -- as if it was a generally-accepted principle -- to substantiate an argument is highly questionable.}
- A lot of factors are value judgements. "User friendly" is mostly a matter of value judgements, but it is a widely discussed topic.
- {User friendly is a widely recognised concept. Your BloatInducedReadingConfusion -- where "bloat", as I recall, was the usual OO scaffolding discussed in AvoidDirectAccessOfMembers -- and SafetyGoldPlating are only recognised by you.}
- They are topic titles, not "official industry phrases". You appear to be protesting about nothing. If you wish to put a disclaimer at the top of such titles, I'm okay with that as long as you follow these reasonable guidelines:
- 1. You don't ONLY do it to topics I'm involved with. That would be unfair and misleading to general wiki readers. Label content, not people.
- 2. Get WikiZen approval/consensus before you add disclaimers to a large volume of topics. You are not The King of this wiki.
- {They are topic titles that are being used to substantiate an argument, as if the statement implied by the title is unquestionably true and the content wasn't contentious.}
- As if? Please elaborate. Is this as-if-ness only a characteristic of my wiki titles? If so, do you have the sifting evidence? If not, why complain to just me? Your complaint is ill-defined. I highly suspect personal bias against me.
- {I have no bias against you; I am biased against weak arguments shored up by questionable evidence posed as strong evidence. I used "as if" colloquially, when I really meant "you're doing this." You're the only wiki participant I've seen who regularly treats page titles as reference to established fact when the page content is highly contentious. Above, you wrote "I suspect this is another case of your fastidious designs causing BloatInducedReadingConfusion and/or SafetyGoldPlating," as if BloatInducedReadingConfusion and SafetyGoldPlating were well-known, generally-accepted, widely-understood, factual factors of software development, which your correspondent was clearly causing by being fastidious, when it's actually a specious argument based purely on evidence-free opinion.}
- I didn't fucking "pose" anything. If you interpret it that way, it's because you interpret English in an outlier fashion, or are biased. I see nothing wrong with my text. I cannot read your weird mind to know why it interprets things that way. Wiki titles are ONLY wiki titles. If you hallucinate them as something else, then see a doctor.
- {Read your own quote again. If you didn't intend to use the BloatInducedReadingConfusion and SafetyGoldPlating pages as evidence to support your point, then what did you intend?}
- Referencing another topic is NOT implying they are "well known". That's completely and utterly ridiculous. You are protesting textual factoring? Idiot.
- {Why don't you answer the question? You weren't drawing attention to BloatInducedReadingConfusion and SafetyGoldPlating as topics to review or consider, nor were you using them to refactor text. You were using them as tacit evidence, as if their truth is a given, to construct what appears to be little more than an AdHominem attack.}
- I'm not sure what you mean when you say "as if". Please clarify that phrase or find a different way to say it. I will agree the phrase could be interpreted as an AdHominem attack. However, that has NOTHING to do with any plot to trick readers into thinking the referenced topics are common/official concept names.
- {Why are you evading the question? I already explained "as if", and I'm not claiming you're trying to "trick readers into thinking the referenced topics are common/official concept names", I'm claiming you're trying to trick readers into believing your assertions are established fact when they aren't.}
- I have no fucking idea how you came to that interpretation. NONE. You did not explain "as if" well.
- {It's a pretty fair assumption that if you write, "I suspect this is another case of your fastidious designs causing BloatInducedReadingConfusion and/or SafetyGoldPlating," then you believe BloatInducedReadingConfusion and/or SafetyGoldPlating are factual, and you want the reader to believe they're factual too. No?}
- What exactly do you mean by "factual"? And what does factual-ness have to do with commonly recognized concept names? How does common-ness relate to factual-ness, or does it? I don't understand your argument at all. It's very fuzzy to me, and frankly should be split off from this topic as it's bloating it up with arguments about arguments.
- {What did you intend by writing, "I suspect this is another case of your fastidious designs causing BloatInducedReadingConfusion and/or SafetyGoldPlating"?}
- I intended what I said. It reads fine to me. I don't see a problem beyond the fact it can be construed as an AdHominem attack, which doesn't seem to be the primary complaint against it.
- {So you intended it to be read that being fastidious results in two truths? I haven't even started on the AdHominem yet; that's next.}
- Two truths? I don't understand.
- {Sure, that BloatInducedReadingConfusion and/or SafetyGoldPlating exist and result from fastidiousness.}
- The debate on whether they "exist" belong in the topics themselves (or "discussion" offshoots).
- {Indeed it does. However, by using them in a sentence like "I suspect this is another case of your fastidious designs causing BloatInducedReadingConfusion and/or SafetyGoldPlating", you make it appear that their existence has been confirmed and accepted, when it hasn't.}
- Re: "You make it appear". How so? This seems another form of your puzzling "as if" claim. It appears to be a manifestation of interpretation inside of your mind rather than anything directly in my text. -t
- {It appears that way because obvious non-facts in their place wouldn't make sense. "I suspect this is another case of your fastidious designs causing the world to tilt off its axis and/or universal continuous flatulence," doesn't make sense. It only makes sense if BloatInducedReadingConfusion and/or SafetyGoldPlating are intended to be read as true.}
- I'm sorry, but your analogy is not helping me understand your complaint. It didn't serve its intended purpose. I can't relate it to any context, so I don't see what it's intended to do. I've read it 4 times.
- {Your statement states that fastidiousness results in two things. If the two things are false, it doesn't make sense. If the two things are in question, it doesn't make sense. It only makes sense if the two things -- BloatInducedReadingConfusion and/or SafetyGoldPlating -- are intended to be read as true.}
- They are titles of (alleged) problems, not Boolean values.
- {Be that as it may, the statement only makes sense if the alleged problems are assumed to be more true than false.}
- How can problems be "true"? Do you mean "real", as opposed to imagined?
- {Yes. I.e., the statement only makes sense if the existence of the alleged problems is assumed to be more true than false.}
- Like I already said, that debate belongs in the topic itself.
It just happens that thin tables have won as either 'no worse and in some cases better' or 'significantly better' (e.g. for temporal data and security) in EVERY scenario I have ever tried them in, under the assumptions of a better DML (based on DataLog, perhaps, with special support for left-joins), a good TableBrowser that supports editable views, metadata for automatic construction of useful editable views, and an optimizing RDBMS that supports joining thin tables on shared identifiers in O(1). The lack of good tools and a good DML is the main thing dragging them down.
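For readers unfamiliar with the shape of a thin-table design, here is a minimal sketch using Python's sqlite3 module (the schema is a made-up illustration, not taken from the discussion). Each optional attribute lives in its own table keyed by the shared identifier, missing facts are simply absent rows rather than NULLs, and a LEFT JOIN reassembles the familiar wide shape on demand:

 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.executescript("""
     CREATE TABLE emp       (emp_id INTEGER PRIMARY KEY);
     CREATE TABLE emp_name  (emp_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
     CREATE TABLE emp_phone (emp_id INTEGER PRIMARY KEY, phone TEXT NOT NULL);

     INSERT INTO emp       VALUES (1), (2);
     INSERT INTO emp_name  VALUES (1, 'Alice'), (2, 'Bob');
     INSERT INTO emp_phone VALUES (1, '555-1234');  -- Bob's phone is simply not recorded

     -- A view reassembles the 'wide' shape; NULL appears only in this derived
     -- result, never in the stored fact tables.
     CREATE VIEW emp_wide AS
         SELECT e.emp_id, n.name, p.phone
         FROM emp e
         LEFT JOIN emp_name  n ON n.emp_id = e.emp_id
         LEFT JOIN emp_phone p ON p.emp_id = e.emp_id;
 """)

 for row in conn.execute("SELECT * FROM emp_wide ORDER BY emp_id"):
     print(row)
 # (1, 'Alice', '555-1234')
 # (2, 'Bob', None)

Whether the extra joins are cheap enough in practice depends on the optimizer and tooling, which is exactly the caveat stated above.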
I'm still skeptical. My experience differs.
No, top. You seriously lack experience with narrow table solutions; hell, you barely even grok them and your 'examples' of narrow-table solutions have consistently missed the mark. You cannot legitimately claim your experience differs. All you can claim is that you're comfortable and satisfied with wide-table solutions.
The few I had encountered were difficult to work with, inflexible, and slow. If they were exceptions to the rule, then I lived a coincidental life. The downsides were obvious and substantial, and any benefits were too obscure or indirect to detect. Based on other topics, the types of applications you have worked on appear to differ from those I have worked on. Thus, it may just be different domain or specialty experience. Further, the fact that "narrow" tables are relatively uncommon suggests that others found them a pain also. There's also the possibility that it's a TechniqueWithManyPrerequisites. In other words, something that takes lots of skill, practice, and the right org environment to "do right", and doing it wrong is worse than not doing it at all. -t
{What were those few narrow table solutions that you encountered? What made them difficult to work with, inflexible, and slow?}
I have already described these in the various "table width debate" topics. In the name of OnceAndOnlyOnce, I do not wish to repeat them. It would perhaps be nice if I provided topic names and/or page-anchors, but I'm not going to do that today.
{I cannot find any specifics.}
Sigh. Fewer joins, and simpler insertions and updates are two. Smaller name-space of tables is another. I realize name-space-size may be a subjective preference or differ per mind, but I work better with a smaller name-space, all else being equal. We've been over these already. -t
{If smaller namespace, fewer joins, and simpler inserts/updates are the deciding factors, then why normalise at all?}
To get rid of data duplication. Fully de-normalizing also makes inserts/updates difficult because then we have to manage data duplication. And if your tables fit domain nouns closely, then there is a smooth and/or natural mental mapping between domain nouns and tables: employees, vendors, paychecks, orders, help-desk-tickets, etc. Heavy normalization clutters up the name-space with piddly attributes mixed among primary domain nouns.
{Just eliminating data duplication generally gets you to BCNF. Do you then consider the anomalies that can result from not achieving higher normal forms to be worth it?}
I would consider such on case-by-case basis rather than Always-Do-X. I do believe you are exaggerating the frequency of the "anomalies" you talk about; or else your specialty areas bring you to different problem ratios. They are infrequent compared to other problems in my personal observation.
{Considering them on a case-by-case basis is reasonable. You did not appear to be advocating considering normalisation on a case-by-case basis; you appeared to advocate no higher normal forms, by strongly arguing against "thin" tables.}
I don't see that in my text. For example, near the top, I say OnceAndOnlyOnce should be the primary factor. I didn't say "only" factor. If you find a problem phrase, point it out at the spot of the problem and I'll cheerfully review it.
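As an aside, here is a small sketch of the kind of update anomaly referred to above, again in Python/sqlite3 with a hypothetical schema: when a department's location is repeated on every employee row, an update that misses a row leaves the database contradicting itself, which the normalized layout rules out by construction.

 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.executescript("""
     -- Denormalized 'wide' layout: each employee row repeats the department's location.
     CREATE TABLE emp_wide (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, dept_loc TEXT);
     INSERT INTO emp_wide VALUES
         (1, 'Alice', 'Sales', 'Bldg A'),
         (2, 'Bob',   'Sales', 'Bldg A');
 """)

 # Sales moves to Bldg B, but the update is accidentally applied to only one row.
 conn.execute("UPDATE emp_wide SET dept_loc = 'Bldg B' WHERE emp_id = 1")

 rows = conn.execute(
     "SELECT DISTINCT dept_loc FROM emp_wide WHERE dept = 'Sales'").fetchall()
 print(rows)  # two conflicting locations for the same department

 # A normalized layout would store the location once, e.g. in a dept(dept, dept_loc)
 # table keyed by dept, so this partial-update inconsistency cannot arise.

How often this bites in practice, and whether guarding against it is worth the extra tables and joins, is the case-by-case trade-off being argued here.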
Are there really that many DBAs out there who are only DBAs? On several internet projects I've worked on (large and small), and on several offline projects I've worked on, the programmers/developers are also the DBAs.
I've seen a wide mix. Big telecoms and power utilities I did contracts for had dedicated DBAs. Most Oracle shops do as Oracle is usually used for bigger projects (as it costs an arm and a leg) such that a dedicated DBA is almost necessary. -- top
Hmmm, the fact that it is costing an arm and a leg may mean that it is not strictly necessary and that this is just a way for consultants to make money. However, I've also found similar problems with dedicated graphics designers who don't know a thing about HTML or programming or templates or databases (and some don't even know how to properly write emails; they only know graphics). I guess I come from a smaller/medium project-size background, but even on larger projects the communication between developers and the database admins (who are usually developers too) must be very tight in order for the project to work out.
[I'm afraid I'll have to back top on this one, sort of. I currently work at a company that has at least two full-time DBAs, and they are busy enough that asking them to write the applications as well would be counterproductive. My previous employer didn't have any, but several of our clients did, and only one also had time to spend writing applications. The only one that had any free time was the one using Oracle.]
Do you use the word "afraid" because you think I might let it go to my head, expanding my ego, making me more arrogant and annoying? Nah, never happens. See, I started out fully arrogant and annoying to avoid such drift ;-) -- top
I still have my doubts - for example, the companies these people work with - do they have employees that spend all their time on C2 wiki and just pretend to be programmers and DBAs? Are you part of the company that is lacking time - because I sure see a lot of people wasting at least 8 hours per day on this wiki. Okay, not wasting - spending time. That includes myself, although IveLeftToCode is a good pattern to follow when your company decides it needs a dedicated DBA - because maybe the developers just aren't working hard enough.
What I have trouble seeing is how a DBA can modify a database without notifying all of the application programmers (which may take just as much work as if he were a developer himself and could navigate through the code in his mind and on his screen). I.e., they must be a very tightly knit development team, no matter whether wide tables or thin tables are used. If a dedicated DBA modifies the database, it doesn't matter whether the database is fully normalized or not - the developers still need to be very carefully notified and extremely cautious that the application isn't broken. In cases where you have an un-normalized wide table with Image1, Image2, and Image3 instead of an images table, it's even worse for the application, BTW.
[One can use views. In the RDBMSs I've used, the applications can't tell a view from a table (unless they look at the meta data). As such, the DBA can create views that present the data the way the application is expecting while still allowing the DBA to control the way data is actually stored.]
- (For database retrievals, this is generally true. For database updates, VIEW support is variable. Some DBMSes do not allow VIEW updates at all, others implement updates according to vendor-specified internal rules, and some require the database developer to specify explicit rules, i.e., what INSERT, UPDATE, and DELETE mean when applied to a given VIEW. There is not yet a universal set of VIEW update rules that will allow any reasonable VIEW to be treated as a virtual table. This is still a matter of academic research and debate.)
- [While universal rules would make the job of setting up and maintaining those view easier, it isn't necessary to solve the problem (how to change the database without modifying all programs). All you need is a database with updatable views, even if specifying how to update is proprietary and manual. Obviously the solution doesn't work if the RDBMS doesn't support updatable views in any form.]
- (What I forgot to mention was that among DBMSes that implement vendor-specified internal rules, it is typical to find VIEWs which should theoretically be updatable but aren't updatable at all, and DBMSes that support user-defined update rules are (unfortunately) rare. In short, using VIEWs to isolate applications from the database is often impractical in update contexts.)
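To illustrate both points, here is a small sketch with Python's sqlite3 (the table and column names are made up): the application's 'SELECT *' cannot tell the view from the original table, but an UPDATE through the view fails unless the database developer supplies an explicit rule - in SQLite's case, an INSTEAD OF trigger - which matches the caveat above about variable VIEW update support.

 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.executescript("""
     -- The DBA splits the physical storage...
     CREATE TABLE customer_core  (cust_id INTEGER PRIMARY KEY, name TEXT);
     CREATE TABLE customer_email (cust_id INTEGER PRIMARY KEY, email TEXT);
     INSERT INTO customer_core  VALUES (1, 'Alice');
     INSERT INTO customer_email VALUES (1, 'alice@example.com');

     -- ...but presents the old wide shape to applications as a view.
     CREATE VIEW customer AS
         SELECT c.cust_id, c.name, e.email
         FROM customer_core c LEFT JOIN customer_email e ON e.cust_id = c.cust_id;
 """)

 print(conn.execute("SELECT * FROM customer").fetchall())  # reads work transparently

 try:
     conn.execute("UPDATE customer SET email = 'a@example.org' WHERE cust_id = 1")
 except sqlite3.OperationalError as e:
     print(e)  # e.g. "cannot modify customer because it is a view"

 # SQLite only allows updates through a view via an explicit INSTEAD OF trigger:
 conn.executescript("""
     CREATE TRIGGER customer_upd INSTEAD OF UPDATE OF email ON customer
     BEGIN
         UPDATE customer_email SET email = NEW.email WHERE cust_id = NEW.cust_id;
     END;
 """)
 conn.execute("UPDATE customer SET email = 'a@example.org' WHERE cust_id = 1")
 print(conn.execute("SELECT email FROM customer_email").fetchall())  # now updated

The same split works for retrieval in virtually any RDBMS; how much of the update side you get "for free" depends on the vendor, which is why views only partially isolate applications from schema changes.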
{As I understand it, one job of a dedicated DBA is to receive change requests complete with valid technical reasons, sit on them for a few days (or weeks), then tell you 'NO!' for a bunch of valid political reasons. Or... at least that's my understanding after having worked with one. DBAs won't often be modifying the logical database - not without cooperation of the programmers. It is almost certainly NOT a good idea to have a full DBA sitting on the database in the infancy of the application or service, though it is likely smart to have someone (be it a good programmer or DBA) who understands both the relational model and the implementation & optimization & logical consistency details.}
[The DBAs I've worked with have had security and performance as their primary jobs. The stuff involving schema changes is a very small portion of it, and was always developer driven, with the developer responsible for identifying points where the change would break the applications.]
A "good" DBA will ask good questions and explain their reasoning well rather than just say "no". Good designs will consider and balance BOTH the here-and-now app-specific requirement AND the longer-term view that (useful) data tends to outlive applications. It should be a lively give-and-take.
Enough talk, time for a scenario test: PublicationsExample
Discussion continued at NormalizationRepetitionAndFlexibilityDiscussion because this grew TooBigToEdit.
CategoryInfoPackaging, CategoryAbstraction, CategoryDatabase
AprilZeroEight and DecemberFourteen