Oh Look! An 80KB ThreadMess!! How exciting - I can't wait to read it ...
Why, it needs to be........normalized
No, it is better in a wide and long view, with as much data crammed into the table as possible, even if all the chit-chat doesn't relate to the subject. Just put some page anchors or cursors in the database, and also make sure to access the data procedurally - scanning through it all in sequential order.
If that's an accurate summary of my opinion, I'll eat a plastic fast-food movie promotional action figure. Without ketchup even.
- [I'd pay to see that. Not much, mind you, but still...]
See TopDrunkOnFastFood
ThinVersusWideTableDefinition is an attempt to define the key issue of the debate.
I've seen the downsides of over-tabling. I tend to lean toward fewer tables, and am thus bothered a bit by the suggestion to make more tables. I suspect people are trying to make tables fit OOP, when it should perhaps be the other way around. Is this another HolyWar at play? How about some examples and scenarios to explore?
In fact, if you lean toward fewer tables, that could even be seen as more object oriented... with a good ObjectRelationalMapper (Hibernate, to use a RealWorld example), you can save a complete object graph, equivalent to a huge number of tables, into a single table... that, of course, is not considered good practice, but it is certainly possible... and easy to do.
I believe having few tables is bad from a purely relational point of view, because by reducing the number of tables, you make foreign key integrity relationships less meaningful. (If you store cities, countries, and flavors of ice cream in a single table, and then add a foreign key relationship to it from another table that only needs ice cream flavors, the database will not be able to help you avoid putting a city in the place of an ice cream flavor.)
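For illustration, a minimal sketch of that point (the table and column names here are made up):

-- dedicated table: the foreign key can only ever reference a flavor
create table flavors (flavor_id integer primary key, flavor_name varchar(50));
create table sundaes (sundae_id integer primary key,
  flavor_ref integer not null references flavors(flavor_id));
-- godtable: the "same" foreign key accepts any row, cities and countries included
create table godtable (godid integer primary key, type varchar(20), name varchar(50));
create table sundaes2 (sundae_id integer primary key,
  flavor_ref integer not null references godtable(godid));
-- nothing stops a sundae from referencing the godid of 'Paris'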
Having fewer tables also makes it harder to write queries, because you have to remember to filter stuff before writing joins:
select cities.* from godtable as cities, godtable as states, godtable as countries where cities.type = 'city' and states.type = 'state' and countries.type = 'country' and cities.parentgodid = states.godid and states.parentgodid = countries.godid and countries.name = 'China'
or
select cities.* from (select * from godtable where type = 'city') as cities, (select * from godtable where type = 'state') as states, (select * from godtable where type = 'country') as countries where cities.parentgodid = states.godid and states.parentgodid = countries.godid and countries.name = 'China'
are worse than the easier to read, easier to write, and perhaps even more efficient:
select cities.* from cities, states, countries where cities.stateid = states.stateid and states.countryid = countries.countryid and countries.name = 'China'
Don't you think?
Of course, if you have an ObjectRelationalMapper, you can tell the ObjectRelationalMapper that the "type" column should be used as a discriminator. Map the god table to 3 classes (city, state and country), explain how they are related, and then write the query like this (note that with a good ObjectRelationalMapper you could write the queries with the exact same syntax regardless of the underlying existence (or not) of a GodTable):
select c from cities c where c.state.country.name = 'China'
Or even like this, if you choose to do so (but I prefer the abbreviated method):
select city from city, state, country where city.state = state and state.country = country and country.name = 'China'
Of course, you could achieve similar effects using CREATE VIEW to simulate 3 tables, but if you do so, why not simply have 3 tables from the beginning? You also have the problem that your DBA might not want to grant you privileges to create views. Requiring special privileges to create views just to make your queries more readable is one of the well-known SqlFlaws.
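For the record, a minimal sketch of that view-based workaround, assuming the godtable columns used in the queries above:

create view cities as select * from godtable where type = 'city';
create view states as select * from godtable where type = 'state';
create view countries as select * from godtable where type = 'country';
-- the earlier query can then be written as if three separate tables existed:
-- select cities.* from cities, states, countries
--   where cities.parentgodid = states.godid
--   and states.parentgodid = countries.godid
--   and countries.name = 'China';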
I am not sure what the above country example is trying to demonstrate. A generic "location" table may be useful in the case of an international database in which the number of "levels" varies per country. Concepts such as "state" and "county" may not be applicable to other (non-US) countries' conventions. Of course, "it depends".
Exactly, it depends. The above country example is trying to demonstrate that, by default, unless you get help from an ObjectRelationalMapper, it is harder to write queries when you have a single GodTable (like "location"). I agree with you that a location table could be helpful for an international database (or maybe not, because SQL is not that good at handling recursive trees), but the point is that unless you're sure you actually are going to get a specific advantage by creating a GodTable, you should prefer to split stuff into different tables, because while having few tables seems to make your design simpler, in fact it is only moving that complexity to another place (your queries).
In this case, if it is for a US company that does not have "odd" requirements, it generally makes sense to have STATE, COUNTY, and CITY tables (although county is not used very often by commercial firms and is usually skipped). But I have seen borderline cases where regions/areas specific to the domain (marketing or internal area carve-ups) could perhaps be better served by a locational GodTable of sorts. At least the drawbacks versus the benefits are close enough that I could not fault either decision. Dealing with ever-changing domain areas can get sticky, and a GodTable just may be a bit more flexible.
But I have seen examples where the name of the GodTable is something like "GLOBAL_CATALOG", and after the developers add locations to it, and see it works fine, they start adding all kinds of stuff to this GLOBAL_CATALOG (like Colors and auxiliary Status Information (stuff like "Open, Close, Busy, Blocked, Paid, InTransit?")), and finally they decide to store the names of their favorite ice cream flavors in it.
I've also kicked around the idea of having one table for the smallest partition (such as a specific retail store), and then another table to map those specific spots to general categories or groupings such that they could even overlap if need be. Location info is often not a perfect tree and different departments may have different overlays/regions. Thus, this is the "set-centric" view.
table: Spot
-------
spot_id
spot_name
etc...
table: Spot_group_links
-------
spot_ref // f.k. to Spot table
group_ref // f.k. to Locat_groups table
table: Locat_groups
--------
group_id
group_name
group_type // ex: tax, marketing, probation, etc. (optional)
etc...
The groups perhaps could also be non-locational, or quasi-locational. The boundaries can get sticky in real-world stuff.
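A sample query against the sketch above (the specific group values are hypothetical), listing every spot that belongs to a given marketing grouping:

select Spot.spot_id, Spot.spot_name
from Spot
join Spot_group_links on Spot_group_links.spot_ref = Spot.spot_id
join Locat_groups on Locat_groups.group_id = Spot_group_links.group_ref
where Locat_groups.group_type = 'marketing'
and Locat_groups.group_name = 'Northwest Region';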
Separate State/County/City tables might be a bad idea if you're attempting to represent hierarchical locations (because, as you mention, there are places where these overlap, though at least counties and states don't overlap). E.g. the 'City' table should not have an entry about which 'State' it is in unless you wish to treat as two or more cities every city that lands on a state border. A single table expressing a relationship between entities and locations (this entity (which might be a location) participates in that location-entity) is, semantically, the better way to go... even for expressing that a particular City 'overlaps' a particular state. Then, if necessary, the 'State' table may carry facts exclusive to the state (flag, motto, symbols, governor, etc.).
Unfortunately, this also is where Relational (esp. relational calculi and query languages) breaks down a bit: having one relation-table be associated with foreign keys from several different tables. A single 'GodTable' for relating entities and their locations should, semantically, be used for everything from States and Counties to Employees and Vending Machines, each of which have their own facts and associations. However, making it work well in relational (in practice or theory) can be a pain. In practice, use of true 'GodTable's that relate lots of different kinds of entities seems to work better in the very flexible Logic programming languages, like Prolog, where there conceptually exists a true-to-the-word 'relation' for every single predicate (i.e. a full, potentially transfinite and uncountable, set of N-ary tuples with semantic meaning derived from both the name of the predicate and placement in the tuple). Logic programming languages, and their query and data specification mechanisms, don't offer many constraints or implications about how one predicate will relate to another; anything with a shared value will do so long as it is described in a list of predicate truths.
FearOfAddingTables shouldn't be any higher than FearOfAddingPredicates. However, the price paid for flexibility and correctness is usually space and speed. At which point does optimization cease being premature? Obviously, to most of you RelationalWeenies, the answer is 'the moment I arrive'. You don't at all consider jumping straight to one of the popular 'relational' languages to be a premature optimization of a predicate-logic specification for data. Nor do you feel it appropriate to 'add a new table' to, say, add the 'official state anthem' (that can be lumped into the 'state' table within which you've already lumped 'flag' and 'motto' and 'capital'). Rant rant rant rant rant rant rant. I feel a little better now. I'm quite fond of optimization myself, but would prefer to see some real advances in full DBMS systems (you know, the sort with ACID transactions and concurrency? and maybe automatic distribution and replication?) that are focused more upon correctness and flexibility than upon speed; the DBMS itself should be capable of deciding when tables should be joined in representation as part of query-optimization and space-savings; users should feel that they can add a thousand tables (one unary-table for the 'IsState' predicate, one binary table for 'EntityMotto' relationship, one binary table for 'EntityFlag', etc. where a State is one 'Entity-Kind' that can have flags and mottos) all without significant loss of speed or efficiency.
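A rough SQL sketch of the predicate-per-table idea described above (the column names and sizes are illustrative only):

-- unary predicate: IsState(entity)
create table IsState (entity_id integer primary key);
-- binary predicates: EntityMotto(entity, motto) and EntityFlag(entity, flag)
create table EntityMotto (entity_id integer primary key, motto varchar(200));
create table EntityFlag (entity_id integer primary key, flag varchar(100));
-- "states with their mottos and flags" is then a join rather than a wide row:
select s.entity_id, m.motto, f.flag
from IsState s
join EntityMotto m on m.entity_id = s.entity_id
join EntityFlag f on f.entity_id = s.entity_id;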
For ultimate flexibility, may I suggest the MultiParadigmDatabase, which is essentially one big dynamic-row GodTable. (If you want entities, you add an Entity attribute to a "row".)
(moved discussion to MultiParadigmDatabase)
(Moved from RelationalIsTooAbsolute)
I think it's because amateur database developers find creating tables scary, and try to avoid doing it whenever possible. (Tongue {somewhat} in cheek...) -- DV
I am not an amateur, but I disagree with the myriad-table approach. The debate can be found in FearOfAddingTables.
Having searched the literature, I fail to find "myriad-table approach" mentioned anywhere. I can only assume, therefore, that it has no academic or engineering basis, and that aberrations must be entirely attributable to CREATE TABLE phobia or philia. Otherwise, proper normalisation will -- in an automated fashion -- give you the requisite number of tables, no more and no less, every single time. -- DV
Have you found anything that "proves" the opposite? And "proper normalization" has subjective components to it, such as "related". Ultimately everything is related, at least gravitationally. The "related" we use is from our own mental models, which are UsefulLies, but still lies. OnceAndOnlyOnce is the best guide in my opinion, but there are situations where even that is not sufficient, because there are different ways to score the weight of the "violations" when there are trade-offs involved. I've seen heated debates about whether the "nulls" in sparse tables should be counted as "duplication". You get philosophical funbies such as, "How can they be duplication when they are stand-ins for something that doesn't exist? You can't duplicate something that doesn't exist. Can lack of existence be duplicated? Is empty space between galaxies all just a bunch of duplication of emptiness?" Fun stuff, but it never gets settled. -- top
Proves the opposite of what? "Proper normalization" is not subjective. It is an algorithmization of OnceAndOnlyOnce based on functional dependencies. If I recall correctly, Date described it as "the automation of common sense". If your functional dependencies are unclear, then your analysis is insufficient. Go back to the users and ask questions. If it's still unclear (rare, but happens when the users don't know, either), this will make a difference in a few tables; on a large schema, a very small and manageable percentage. This does not represent some philosophical approach to data modeling, because there isn't one. You're either modeling the real-world facts -- to support user requirements -- in a normalised manner, or you aren't. If you aren't, then you should be -- modulo the occasional (and hopefully, deeply begrudged) denormalisation to achieve acceptable performance within budgetary constraints. -- DaveVoorhis
I've seen situations where there are trade-offs and no clear answer, especially when trying to anticipate future change patterns. I cannot remember the details at the moment. But, let's see if OnceAndOnlyOnce can pick the "proper" RelationalGuiDilemma.
If you can't identify the functional dependencies among the data elements in your "future change patterns", then they have no business being bodged into tables until you do. As for your RelationalGuiDilemma, is there something about it (I've only glanced) that precludes using the standard supertype-subtype pattern? E.g., supertype (widget_id_pk, common_attribute_1, common_attribute_2); derivedtype1 (widget_id_pk, unique_attribute_1, unique_attribute_2); etc. That is the standard and time-tested approach in such situations. Of course, SQL systems with table inheritance (e.g., PostgreSql) might be an alternative. -- DV
There are various approaches as described there, and none are perfect. In other words, we are weighing trade-offs with no mathematically proven single "right" solution. That's my point. -- top
There are standard approaches to standard problems, and normalisation stands as the accepted algorithm for eliminating redundancy and update anomalies. However, all modeling decisions (especially those that denormalise) must be made in light of the requirements. I see no requirements stated on the RelationalGuiDilemma, only a request for a static model. What is the intended purpose of that thing? -- DV
You go right ahead and publish these "standard problems" along with the mathematical proof they are objectively better, and let's see how it survives public review. -- top
[It already has been published. The database model did not exist long ago. It took lots of public review and proving to realize the relational model was the way to go for large database management. It took someone to prove and invent the DBMS in the first place... otherwise we might still be using linked lists or arrays for large databases (each of us with our own application-driven database). The relational model was proven (to the extent to which TrulyTrue things can be proven, to sensible people) and is the reason we have databases today (even if they don't follow it perfectly, the proof helped people at least scheme up something like it... if it were not proven we might have just linked lists, arrays, and other reinventions... it is just a pattern).]
You are assuming here that "the relational model" automatically results in skinny tables and that skinny tables have been proven objectively better. You are not being very specific.
Please stop using white trash talk such as chopping, choppers, skinny tables, etc. Next thing you know you are going to claim that Types are just tattoos.
Foreign keys, relations, normalization, et al have all been proven to be better than one wide Excel spreadsheet. As the project gets larger, relations, normalization, foreign keys, etc. become more important in maintaining the data. Being able to query a bitmask table for the available bitmasks, or being able to look at a bitmask enumeration Type, is simply a much better long-term solution than hiding fields in a huge wide table without any clear specifications or relations (most likely the specifications and relations are instead stored in your application logic). Running a query over a huge wide table just to find out the five types of bitmasks you offer isn't sane... and this has been proven. In fact, if you can't find this info out quickly, you can't find out what your data is related to, because you aren't practicing the relational model. Over time, using the relational model does automatically result in tables that are not as wide -- despite your claim about skinny tables (could we please stop the white trash talk?). You are confusing the relational model with TableOrientedProgramming. You should stick to TableOrientedProgramming and just write off the idea that you will ever understand relations.
Compromise - Sets are not orthogonal
I believe somewhat of a consensus or compromise has developed on other topics spawned from this one. Whether one's view is a wide-table view or a narrow-table view should be just that: a view. What is underneath the hood does not have to matter to the query language user. Relational has the potential to be flexible enough that one can use whatever "named group" of columns they want. It's just sets. In theory, named column sets can overlap: it does not have to be either/or. However, the limits of existing RDBMSes (such as limited updateable views and wimpy column dictionaries), combined with DBAs who are scarce, underfunded, or uncaring, make this goal difficult in practice. The battle above is mostly about which to give priority given the existing limits. --top
Fine, as long as the schema is in (at least) ThirdNormalForm. There may be some debate over whether or not some higher normal form is the minimum acceptable, but there should never be a debate over whether less than 3NF is acceptable or not. It isn't. -- DaveVoorhis
Another anecdote: I'm implementing a smallish intranet app to replace an MS-Access one that some complained was too complicated. It tracks PC's along with hardware and software. One thing I decided to do was to have a single table for software and hardware, rather than make a separate Software table and Hardware table. One of the reasons I did this was that the distinction can be blurred. For example, is a dongle that comes with configuration software truly just one or the other? Is a USB thumb-print reader for security access to a specific app software or hardware or both? I don't want to hard-wire such classification into the system prematurely. If I'm on vacation or get hit by a bus, there may be nobody around to adjust the columns if a software-only field becomes shared (or vice versa). There's not exactly spare staff around here to do the new-column dance all the time. --top
- It's quite possible that software and hardware do represent two user interpretations of the same concept -- which might be "assets", "inventory" or "resources" -- that are distinguished by an attribute, perhaps "AssetType". Then it's fine. On the other hand, if you have combined two distinct concepts in a single table, then this approach is wrong. It may cause update anomalies, and will certainly represent meaningless "facts", such as that Microsoft Word is a "hardwaresoftware" or that Microsoft Word consumes unknown (null) watts. Meaningless "facts" inevitably mean more effort at a reporting and/or data-entry level to filter them out or convert them into something meaningful. If there are common attributes found in distinct concepts, they should be factored into a supertype table with hardware and software as subtype tables. Monolithic user views, if needed, should be reconstituted from the tables using VIEWS, not by hardwiring them into a denormalised schema in an attempt to re-create a spreadsheet in a database. Finally, the poor TruckNumber of DBAs on your team should never be an excuse for bad data modeling -- unless you feel that the possible difficulty of finding a plumber to replace a tap that might break should be an excuse to always drink out of the toilet bowl.
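- A minimal sketch of that supertype/subtype split with a reconstituting view (the table and column names here are invented for illustration):

create table asset (asset_id integer primary key, asset_name varchar(100), asset_type varchar(20));
create table hardware (asset_id integer primary key references asset(asset_id),
  serial_number varchar(50), watts integer);
create table software (asset_id integer primary key references asset(asset_id),
  license_key varchar(50), version varchar(20));
-- the "one big table" the users see is reconstituted, not stored:
create view asset_wide as
select a.asset_id, a.asset_name, a.asset_type,
  h.serial_number, h.watts, s.license_key, s.version
from asset a
left join hardware h on h.asset_id = a.asset_id
left join software s on s.asset_id = a.asset_id;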
- There is indeed an "item type" or "asset type" (haven't settled on name yet). And if need be, validation and column presentation can change based on those.
- Validation and column presentation change based on the "item type" column? What do you mean? -- DV
- For example, not showing a serial-number field on the data entry screen when the sub-category is "open-source software".
- How do you know you're not making the client application unduly complex by making the schema (incorrectly) simple? Which do you find easier to deal with -- a complex schema and a simple client-side application, or a simple schema and a complex client-side application? -- DV
- Well, that is indeed true. Often it's difficult to get sufficient DBA service, and a work-around is to shift some of the burden to the app. Plus, the DBA's head is usually not as deep into the domain as the app dev's, so the DBA is not as ready to make thoughtful domain-centric decisions with regard to schema design. --top
- Great, you have just described exactly what an ObjectRelationalMapper does. Since it is often difficult to get sufficient DBA service, and you realize your database model is too different from your domain, and you don't have the authorization to modify it, you ask the ObjectRelationalMapper to modify it for you (in this case, for example, any decent ObjectRelationalMapper makes it really easy to map a single HardwareSoftware? table to 2 classes, and you can even specify which fields go to the Hardware class and which fields go to the Software class... and configure validation rules for your Hardware and Software classes (all of this OnceAndOnlyOnce and centralized, instead of scattered all over your application code)). Of course, you may not like the idea of using ObjectRelationalMappers?... (while perhaps you are building one each time you are building an application, especially if you are coding in an OO language) perhaps what you need here is a DatabaseModelToDomainModelMapper?? (in other words, a way to define "views" and "tables" and everything that normally goes inside the database, but outside the database, so that you do not need to ask the DBA for permissions) -- LuxSpes
- ''But object relational mappers do scatter everything all over your application code... including the database objects being placed all over code instead of making say SQL templates and SQL schema scripts that are separate (when you need to duplicate the DB, have an SQL script handy, for any programmer - not just the one using the ORM mapping in the language!).
- Any good ObjectRelationalMapper is able to generate the DDL that matches the objects (of course, I haven't seen one that does that too for validation and business rules, but LINQ could evolve in that direction)
- Object relational mappers are kind of like HTML widgets that aren't maintainable separately - how come the ORM folks are so fond of maintaining HTML in templates separately from the application logic - but when it comes to the database, they put it all in one pile with the objects? For example, why not have html.div.box mapped to the business object too? Why store that in separate templates? The ORM folks believe in taking all the complexity out of the database and putting it into the application - yet they don't believe in taking all the HTML complexities and putting it into the application too?
- A database is like another module that can be queried by several people - not just programmers that know ActiveRecord. The object relational mappers make the database no longer a module... rather the database acts as if it is inside your app as just another business object, sigh. Better, would be something like TutorialDee and RelProject where at least you don't use business objects - but truly relational relvars (which are just instances of a DatabaseType)! Another issue with object relational mapping is that there is really no relational happening - objects aren't sets of rows/tuples that can be queried - so how can I query my object relationally? It kind of defeats the purpose, of a database - sadly.''
- A good ObjectRelationalMapper allows you to query your objects, using a language like JavaPersistenceQueryLanguage?; it is in fact very similar to Sql, but with object oriented extensions. With LINQ or EOF you can even apply your queries to transient objects.
- The above, by the way, represents generally-accepted wisdom in data modeling. Therefore, deprecating it represents an extraordinary claim, so you'll need to counter it with extraordinary evidence in order to be convincing. (See FraudulentMindset) So far, your anecdote is not convincing. -- DV
- But I've seen too many cases where they get the slices (abstractions/classifications) wrong. It is hard to undo those and you are stuck with a mess. A messy wide table is easier to deal with than messy skinny ones. I think the real problem is one or both of:
- 1. It's hard to get domain abstractions right. You have to understand the domain well, and one cannot learn both a domain well and domain modeling in general well, because the second requires changing domain focus often, precluding the first.
- The underlying problem here, in my opinion, is that many people doing data modeling lack business experience. Inventory/accounting/asset-management/human-resource/payroll systems tend to look like inventory/accounting/etc. systems, and are subject to similar forces, regardless of domain. People lacking that experience, regardless of domain, often get the abstractions wrong and create incorrect models that aren't suitable for any inventory/accounting/blah/blah/blah purpose. -- DV
- There's not enough vetting going on. BetterOnlyIfDoneRight. That's why I prefer a poor wide table over poor narrow ones.
- Sounds irrational. While a well chosen wide table might be better than a poor narrow one, you have NO justification for saying 'poor wide tables' are better than 'poor narrow ones'. Nor have you done an analysis on the relative ease of constructing poor wide tables vs. poor skinny ones. A simple degrees-of-freedom analysis on ways to mess up would indicate it is much easier to make poor wide tables than poor narrow ones.
- Less specificity allows the application itself to manage groupings and categories instead of having to work around the ones hard-wired into the schema. It is easier to add your own specificity when none exists in the original than to work around specificity incorrectly imposed. It is easier to change the image on a blank piece of paper than on one with a picture already on it. --top
- That still doesn't sound like anything promoting wide tables over skinny ones.
- I've a growing suspicion that your objection is not to "narrow tables" per se -- you'd probably find a set of 5NF relations perfectly intuitive, because they are. I suspect your objection is to attempts at normalisation that are simply incorrect, and that have erroneously created a set of tables that each contain portions of (possibly multiple) entity types. In short, perhaps you actually prefer good normalisation to bad normalisation, which is as it should be. -- DV
- 2. Bleep changes. New stuff comes along: new technology, new companies, new categories, new delivery methods, etc. Over-specification is less flexible than underspecification in these circumstances.
- I'm not clear what you mean here... Do you mean achieving 3NF or BCNF is over-specification? Either the schema is in 3NF, or it isn't. If it isn't, it should be. If that's not it, could you give a simple example of an over-specified schema vs. a properly specified one? We might be in ViolentAgreement here. -- DV
- Response to original claim: Wide tables are over-specified in these circumstances. TopMind is making the wrong assumption on which, between skinny tables and wide tables, requires more forethought and specification.
- 3. Over-specification creates job security for the over-specifiers. Thus, there may be some bias going on.
- Claims of deliberate fudging to create job security are awfully difficult to prove. Among good data modelers, I've yet to see it -- the driving forces are almost invariably normalization (typically 3NF or BCNF) and the system requirements in equal measure. -- DV
- I believe much of it to be subconscious, not deliberate. People have an inborn propensity to protect their source of income. A fish looking for food puts its needs ahead of those of its food. True, we are not fish, but still driven by a similar kind of self-protection. Darwinian evolution dictates that genes protect like genes. And, I have never seen anyone punished for producing a poor schema. If anything, it was job security.
- You have never seen anyone punished for producing a poor schema... Alright folks, that's it. I declare this person a troll. He is using this wiki to waste our valuable time so that he can detour us from spending time on important things. By the way, I've never seen anyone with Aids or Cancer have any sort of health problem, ever. It's all in the mind.
- Well it's probably true. I, too, have never seen anyone punished for producing a poor schema - except in the self-flagellation sense... but that only applies if they are the ones stuck with using their poor schema. Anyhow, Top isn't a troll; he honestly believes the stuff he says. He's more like a fish, subconsciously protecting himself and his livelihood by advocating the 'simplicity' of solutions that are actually of greater total complexity. It's his subconscious that makes him focus on how to make a tiny subset of a system 'simplistic' and forbids him from seeing how its effects ripple outwards to everything else. So maybe you can go trolling for some fish.
- If I am subconsciously mentally filtering or magnifying some anecdotes over others without knowing it, what makes you think you are not doing the same? Humans are biased, period. Nobody lives in the center of truth. I attempt to give an honest answer. And in my observation, you seem to incorrectly assume there are tons of spare DBA's around to hard-wire business validation logic into the schemas in order to double-enforce stuff. It rarely has happened in my life. I simply call it as I remember it. --top
- To answer your first question: I introspect. Regularly. You do not. I know and acknowledge my own biases, limitations, prejudices, bad habits, and motives, even when I can't be bothered to change them. You can't say the same. I second-guess myself, re-read my statements, challenge my own views and allow them to mutate. You latch onto ideas like they were babies, think mutation an aberration - I've even heard you use ideas like "throwing out the baby with the bathwater" regarding your own fledgling ideas. To you, your own mind is a fuzzy little dunghole into which you never deeply peer; you do not Know Thyself. Oh, and I rarely use anecdotes - I like them as examples for explanation of concept (e.g. UserStories), but never as proof.
- I HAVE changed my style over the years as I learned new things. For example, I realized that one-size-fits-all generic frameworks are an unobtainable goal. I've since shifted to smaller tinker-and-toss HelpersInsteadOfWrappers that don't marry me to a monolithic ball-and-chain. I've also come to MostHolywarsTiedToPsychology? gradually. I too used to be an absolutist. I learned I was wrong. If you want to see me change my mind with regard to things like thin tables, then show them being better than fat ones in realistic scenarios. Try for at least 5 and hopefully around a dozen scenarios. You gotta show show show, not talk talk talk. Overwhelm the reader with realistic scenarios, not braggings about how smart you are or how open-minded you are. Everybody thinks they are smart and open-minded, such that claiming such is tired news and sways diddly squat. See HowToSellGoldenHammers. I keep telling you that X is the best way to change my mind, but you keep selling Y instead. It appears to be stubbornness on your part for trying Y over and over even though it fails and I tell you it will fail.
- ''You asked how I know I'm not mentally magnifying anecdotes? I gave you an answer. I don't really see how you even began to think I was trying to sell any ideas outside the context of the question based on how 'open-minded' I am. Don't ask irrelevant questions if you don't want off-topic answers, TopMind. Anyhow, I'll claim the same: "If you want to see me change my mind with regard to things like wide tables, then show them being better than thin ones in realistic scenarios. BUT with an extra caveat: the scenario needs to include a wide array of operations (full life-cycle, preferably) because TopMind has a habit of going so far past 'simple' that he reaches 'simplistic' and fails to consider anything outside his tiny little 'one-action-scenario' as relevant." I don't care so much for '5-12' scenarios; it is better in my mind to create one representative scenario and explore it thoroughly. And, TopMind, I don't consider changing your mind worth anything. Why? Because, to the best of my knowledge, everyone else already disagrees with you, and most of those who might agree are dinosaur-soldier-DBAs not really thinking about much beyond keeping their jobs. Even if you told those people that 'TypesAreSideFlags' or mentioned 'FlirtDataTextFormat' to them, they really wouldn't give a damn. How much could YOU put to use even if you did understand TypeTheory as well as I? or even as well as BenjaminPierce? You might appreciate types a bit more, but you probably couldn't use them in practice because you'd be limited by your query languages and RDBMS implementations. I, on the other hand, represent a mind shaped by academia with a love of languages, logics, and simplicity - if you could convince me of anything at all, you could convince a very large fraction of post-graduate academics in similar fields. I happen to be working on a WikiIde and a new language that could become remarkably influential. DV happens to be working on RelProject, and will ultimately shape the optimization decisions that further shape how programmers are influenced to shape data. RK has BlueAbyss. Etc. Convincing designers and architects to change their ideas can have a big impact on the future. So, frankly, I'd be much more interested in convincing someone like me or DV or RK of any idea I might come across than you - it might, then, actually have a chance of impacting the future - have a chance of having been worth the effort of communicating and properly supporting the idea. You are really not worth that effort.''
- If you want an example to analyze up the wazoo, then perhaps PublicationsExample may fit the bill.
- As of yet, PublicationsExample is barely an example. You have in no way attempted to analyze it or explore it in any significant manner. However, I'll consider demonstrating an extreme-skinny-tables approach for you.
- You are free to ask questions.
- Responding to the rest: I believe the role of the DBA will cease, as the technologies mature, to include certain aspects of integrity and security and even the creation and deletion of tables. These will be moved more into the programmers' hands and simultaneously improve in efficacy by use of better security models (CapabilitySecurityModel) and a greater degree of metaprogramming (to associate application-language types to DBMS domains and constraints). After that happens, DBAs would still be important for maintaining data recovery (e.g. remote backups of data), availability, and consulting for schema design. Schema design will be reduced considerably by a greater focus on skinny tables and SixthNormalForm?, and most optimizations that a trained DBA can perform today (and there aren't very many at the schema level) will ultimately be automated - even choice of indexes can be based upon what programmers initially profile as 'likely' queries and what the database actually experiences. Availability concerns will reduce over time due to Grid Computing and web-service-hosting-agencies that support distributed operations and distributed duplication of any data marked critical. Recoverability concerns will ultimately be reduced by growing use of bitemporal databases in combination with the redundant-distributed backups and just plain whole-service-backups provided by the web-service-hosting-agencies. And I don't just believe this will happen; I think it should happen. I don't expect it to happen fast, of course, but I anticipate that within three decades we'll be largely migrating in this direction.
- Yeah, but you know that little Einstein rule that stated something about simple, but not simpler... in other words, making things simple doesn't mean chopping out stuff we actually need. Consider BrainFuck, which is simple enough to conquer the world. A wide table is just too simple... simpler than simple. It's so simple that it leads to complex application logic. A problem with ActiveRecord in RubyOnRails is similar... they claim that the application code should contain all the DB complexities. Sigh.
- "As simple as possible, but no simpler" is all about keeping necessary complexity as part of every model in a system of models such that the other models in the system don't need to handle artificial complexity. A model that lacks necessary complexity should be considered simplistic, not simple. Simplistic models are complexity multipliers because they force complexity to be added to a large number of other models in the system. See AddingEpicycles. Most of my objections to Top's ideas (those I object to or make suggestions on - including TypesAreSideFlags and FlirtDataTextFormat) are ultimately the result of this problem of being 'simplistic', it being obvious to me (for reasons I can and do provide), and it being completely irrelevant to him (since Top has not yet grokked the difference between simple and simplistic).
- Again again again, when I SEE a good dose of these "complexity multipliers" for realistic scenarios, I may believe you. Where are they? You just talk a lot, you don't show. I've worked with some systems I considered over-engineered, and they were messes. I would rather have a simple mess than a complex mess. Maybe you are the 5% that can do heavily-engineered systems right. But it is hard to tell the realbies from the phonies without knowing what to look for. PaulGraham beat multiple competitors using dynamic and meta techniques and became a zillionaire from it. There's at least one data point. --top
- You have a very biased notion of what constitutes a 'realistic scenario'. And I would also rather have a simple mess than a complex mess. BUT I know this means having lots of little tables with well-defined or at least obvious dependencies, not a big GodTable with who-knows-how-many non-obvious dependencies. As far as complexity multipliers, you do it with half your ideas, and I diligently point them out to you; if you can't take my words as hints towards constructing a half-dozen scenarios as easily as I can without first forming words, then you're just screwed, aren't you? You'll always be handicapped mentally if you can't run scenarios in your head, and you'll be handicapped in debates if you can't immediately synthesize and run some counter-scenarios in an attempt to come up with REAL counter-arguments; whining "all this talk! it doesn't mean anything to me because I literally don't understand it!" doesn't cut it. It simply isn't a valid form of debate. And since you're usually a HostileStudent, I'm definitely not going to be kind to you and go to the effort of providing '5-12' samples without you asking really nicely.
- I am generally polite. I only "flare up" if personally insulted multiple times. If you don't want to show thorough scenarios to illustrate the magic powers of thin tables, then so be it. You have simply admitted you haven't done your part above. If you admit you could do more and didn't, how does that make me a HostileStudent? Looks more like a LazyTeacher? to me. I've seen bad thin-table designs causing real problems. If we don't have a representative of a good thin-table design, then it's a unicorn as far as skeptical readers are concerned. --top
- This "LazyTeacher?" has, over the course 'doing more' in many subjects, learned that you are a HostileStudent all of the time. You seem to believe you get a fresh start in each new subject. That would not be a belief that in any way reflects reality.
- Whatever. You believe that I am the problem and I believe that you are the problem. In my mind I am merely being scientific: Show X actually happening, don't claim it. Rockets have to fly to prove themselves; you can't just keep bragging about your blueprints. I am not the bad-guy.
- There have been a great many things you claim are true in your mind. And they probably are - in your mind at least. But didactic examples aren't scientific unless both implemented and examined against well formalized tweaks to the same example (i.e. laboratory settings, controls, etc.) They're only didactic, and technically worse than rigorous analysis when it comes to making or backing claims. You might not be a bad-guy, but if you aren't then you're dreadfully naive.
- "Rigorous analysis" is NOT a substitute for testing an idea in production.
- I agree. But there is the opposite truth, too: "Testing an idea in production" is NOT a substitute for rigorous analysis. Testing an idea in production is like one unit test - a single point of data; it can improve confidence in something already shown by analysis, but by itself carries very little meaning. If you think otherwise, then I can only assume you've never bothered to acquire an education upon or analyze methods of analysis. And the human element in software systems is ONLY relevant at the Human Computer Interface (which does happen to include programming languages) - it should never dictate design, architecture, or policy past that point except insofar as desired features (properties) of the HCI can only be met by certain structures or architectures (e.g. you can't build an HCI debugger without a certain amount of reflection in the system). The moment you use it as a reason for a particular design, you had better be able to demonstrate that your assumptions about how the HCI relates to the underlying architecture actually hold - not just 'say' they hold as you typically do.
- They both have their place. One tends to answer "IF" and the other "why". And your assumptions about HCI not applying to system design appear to assume unlimited labor [DemandForEvidence]. If labor were cheap and plentiful, then overengineering to prevent every last possible thing might be warranted. But because labor is limited, HCI can make or break a system. If a schema is hard to read, it may result in mistaken human decisions by a designer. One of the biggest differences between anal people and non-anal people is that the anal people pretend like they have forever to devote to something trivial, such that they overdo something and never get to other things. --top
- Everything I said indicates HCI is important; features desired for HCI pretty much define a system. But, except insofar as you are provably acquiring said features by certain design decisions, you're being extremely dishonest when you claim your design decision is for HCI purposes.
- If you think it is, then we have a huge disconnect here. The rocket must fly in real space before it gets the grade. (We use the closest approximation of production software: scenario analysis and sample code.) There is no rigorous analysis for the human element. You cannot accurately model an individual's behavior to software code via equations. If you could, then I'd be asking HAL to open the pod door.
- You've never provided any proof for your production ideas, and it is generally impractical to have people fly in on airplanes and watch a real production system - hence we use references and case studies that show the production software succeeded and that a method succeeded, etc. What you are doing here is claiming that somehow your personal opinions are magically more scientific than those of other people who have provided references and justifications. You've not backed up any of your words with production-system proof or case studies. We'll just take your word for it? I've provided sample code on many of the pages that I contribute to, and PDF files that provide sample code, real-world case studies, real-world solutions that solve a problem, etc. Mostly people ignore the case studies, PDF files, real-world examples, etc. and just continue to argue for the sake of arguing. Top hasn't provided third-party sources and case studies (not just personal ones, third-party resources).
- Objective proof that something is "better" outside of performance? No you have not. You mistake mental experiments as truth.
- What the heck has performance got to do with any of it? I'm talking about case studies, third party resources, links to other sites, code samples in general (on the pages I create not just related to database talk). You claimed that you were being more scientific. Search for your little quote. I've seen no science.
- It's just talk. They ignore the downsides, for example, thinking the up-sides that they mention are sufficient. They don't invite counter-scenarios, for example. They are too sellsy.
- [They may be; others provide screenshots of code, PDF files, case studies. Don't confuse them with the entire world. Top provides some code samples, but I've not seen as many third-party links from you, third-party code samples, etc. They (not me) however sometimes also use the third-party resources that I provide - meaning they don't have to always provide third-party references if I've already done so. And note that I am not in agreement with all their talk about OOP and exceptions being better (especially for API's such as operating systems, where exceptions and OOP simply don't work so swell due to the side effects and implications OOP and exceptions cause in an API or OS).]
- Where's your best evidence for thin tables? Ignore the types debate for now, just tables.
- {The objective proof of "skinny tables" is inherent in the definition of fifth normal form and the anomalies it avoids. <5NF == update anomalies; >=5NF == no update anomalies. QED. Extend that to 6NF if you're using a temporal database system...} -- DV
- One of you agreed that removing *all* redundancy is not always warranted. It was also illustrated with the pizza example (I forgot where I put that). Domain design decisions may suggest leaving some redundancy in place. You guys appear to be contradicting yourselves. If redundancy removal were the ONLY criterion for all design (which should have been stated as a given), then you would perhaps be right.
- so-called 'TopMind', you are utterly wrong even in your own field. 5NF isn't "just about redundancy removal" - none of the normal forms are. All normal forms are defined in terms of Functional Dependencies and Semantic Constraints. Every single one of them. Review them, if you must, if they aren't just "academic nonsense" and "MentalMasturbation" to an almighty brain such as yours (which is capable of imagining that at least everything is true in his mind). Redundancy removal is a consequence of normalization, and (therefore) normalization is a means by which to achieve redundancy removal. Another way to remove redundancy is to delete all your tables - there is no redundancy at all in a database without any data. It just so happens that normalization is a form of removing redundancy that doesn't lose you any information or the ability to represent the facts you wish to represent. So, obviously the goal is not JUST to remove all redundancy (because if that were our goal, we'd just DoTheSimplestThingThatCouldPossiblyWork and delete the tables); rather, the goal is to achieve OnceAndOnlyOnce representation of facts - to eliminate redundancy without losing data or changing the set of facts we wish to represent. And that should have been obvious to you from the start. And note that OnceAndOnlyOnce representation of domain values isn't on the agenda at all.
- Just because one can maximize 2 or 3 factors does not mean the results say one should always maximize JUST those factors. In the pizza example, could you confidently say that the higher-normalized form is always the right form? See SovietShoeFactoryPrinciple. I don't dispute that high normal forms maximize the factors that the authors are measuring. What I dispute is the idea that there are no other factors, generally human and economic factors, that are also important even if they are difficult to quantify.
- Seems okay to me so far, though I'm having troubles seeing how you think it relevant to the greater context. After all, if I can't quantify or qualify human or economic factors, then neither can you - which means you can't possibly use them in any rational argument you're attempting to present to support or defend your side of things. That said, I think various human and economic factors can be quantified (given number) or at least qualified (given formal descriptors and/or booleans), such as: need for repetition in expression, security, availability, accessibility, tolerance to human fault, measures for damage control, etc.
- Reply below at PageAnchor 953.
- {Top, in the real world, exceptions certainly do occur, such that one needs to denormalise to 1NF, create duplicate code, or violate whatever rule, generally-accepted principle, or best practice you care to name. That does not, however, mean the rules, generally-accepted principles, best practices or proofs are weak, incorrect, invalid, or incomplete. It merely means there exist special circumstances -- which often make for interesting anecdotes or case studies -- that should be documented in terms of their deviation from accepted practice (with strong justification, obviously) rather than a weak deprecation of accepted practice in general.} -- DV
- That's why I'd like to explore normalization for specific scenarios, such as the PublicationsExample. That way we don't have to compare generalizations with generalizations. If that example needs a tuning, as one of you suggested, then let's tune it.
- Ease of measuring and level of importance are not necessarily related. Let's say relevant factors include ABCDEF. However, DEF are difficult to compute with, so we ignore them and only maximize factors ABC. You may get some nice "proofs" just by using ABC. However, those proofs failed to factor in DEF, and are thus incomplete. --top
- Your assumption that factors DEF weren't considered is, most likely, a wrong one. But you can name them if you think otherwise, and I can go to my list-of-reasons-that-thin-tables-are-better and see if they were considered. For example, economics WAS considered, in the form of thin-table solutions being far better suited to YagNi. So was security, integrity/correctness, optimization, semantic purity, human-factors such as the ability to easily browse and manipulate tables through HCI, etc. Whatever you think I'm missing has probably been considered, so unless you can actually name something where you can offer evidence of relevance and I can't respond with evidence of how said relevant thing has been considered and is better or not relevantly worse - I'm not inclined to believe your position is more than just a bunch of hot air and yelling that I could have, maybe, possibly not taken a vague something into account in deciding my view regarding thin tables.
- Plus, there's the open issue of whether nulls should count as "redundancy", if redundancy is your sole criterion. Proofs are easy when there are only one or a few factors. However, reality usually injects far more factors than math models can handle. I think you guys WANT a simpler world that simply does not exist, so you blind yourself to the difficult factors, such as human psychology. (This should probably be moved to the redundancy topic.) --top
- {Quibble, quibble, quibble... You asked for evidence, you got it. We know we're right; you know you're right, so where does this nonsense end?} -- DV
- Per usual, it probably ends with top appealing to EverythingIsRelative, ThereIsNoAbsolute?, and ObjectivityIsAnIllusion - the ultimate ways to say: "I don't care what you think. I don't even care what the world is. I'd much rather live in my own imagination."
- WTF? I asked for proof, not just evidence. There's evidence for just about every design approach because there are many factors involved such that at least one factor will likely favor any personal favorite. This should be understood.
- You received proof, not just evidence. But we can only point you at the proof, not make you accept it.
- [There are several PDF files and third party references that agree with DV, with case studies, real world examples (even ConstantTable is a thin table). He just doesn't provide the URL links because he is sick of quibbling with you, and I already provided many links several other places. Evidence, by the way - is a form of proof - according to the dictionary. Where is your proof, by the way, top? Or is proof irrelevant to you, even though it seems to be so relevant to you? Wait, did I just contradict myself?]
- {Indeed. I grow weary of this. Top isn't learning anything from us, we're not learning anything from Top, and there's more than enough material to counter Top's views -- lest the uneducated be mis-informed. I'm sure he could claim the same. Isn't there code we should be writing?} -- DV
- [I've never seen TopMind reference another source/scientific/case study once. And yet he claims that other people are not scientific. Let's add this to the TopHypocrisy? repository.]
- You should know what my response to this would be by now. --top
- I'll give you some help: I never claimed my favorite techniques are objectively better. I am not an absolutist, unlike you. ProgrammingIsInTheMind.
- Apparently you're a 'not absolutist' who doesn't comprehend the English language. Many of your claims are stated in an absolute manner; that you just happen to not be an absolutist does not affect the nature of your claims. An example of a claim stated in an absolute manner: (1) "ProgrammingIsInTheMind", (2) "I never claimed my favorite techniques are objectively better", (3) "I am not an absolutist, unlike you", (4) "TypesAreSideFlags", (5) "A messy wide table is easier to deal with than messy skinny ones.". I could come up with a great many more examples. Do you honestly believe you've not made (a great many) absolute claims? Perhaps that is why you get into so many battles where you give the great impression of arrogance by using 'I think' and 'In my opinion' as though they had any validity whatsoever. I'm reminded of the phrase: "I do not think it means what you think it means."
- Redundant - is this a web forum or mailing list and not a wiki?
- Huh?
- Never attribute to malice (or even self-interest) what can be adequately explained by incompetence... :) -- DV
- They tend to be related. If there is no monetary motivation to improve in an area, it won't happen.
- 4. Wide tables are Over-Specified. With 'narrow' tables, one simply adds a new table any time facts that don't fit into an old table need to be added to the DataBase. But with wide tables, I need to think ahead of time about which facts I'll need in the wider table. If I don't, I pay the price every time I need to go through and update every DML query that updates or inserts rows in the big tables.
- 5. Skinny tables are YagNi compliant - one simply dumps facts into the database. Conjunctive facts with the same referent simply get broken up into individual facts. When taken to the extreme, as seen in logic-programming languages, there are never any 'domain abstractions' and thus nothing to 'get right'.
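- A small sketch of what "breaking a conjunctive fact into individual facts" looks like in practice (the table and column names here are invented):

-- conjunctive fact: "employee 7 is named Ann AND was hired 2007-03-01 AND earns 50000"
-- skinny-table version: one table (and one row) per elementary fact
create table emp_name (emp_id integer primary key, emp_name varchar(50));
create table emp_hire_date (emp_id integer primary key, hire_date date);
create table emp_salary (emp_id integer primary key, salary integer);
insert into emp_name values (7, 'Ann');
insert into emp_hire_date values (7, date '2007-03-01');
insert into emp_salary values (7, 50000);
-- a new kind of fact later (say, parking space) becomes a new table, not an ALTER TABLE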
Top seems to embed a lot of what I'd consider erroneous assumptions into his arguments; responding to some of them implies agreement - a bit like trying to answer: "have you stopped beating your wife yet?"
MuAnswer. Wide tables are the over-specified ones, and skinny tables aren't domain-abstractions. The more extreme normalization forms don't even allow for domain abstractions - there are no 'entities', just unique 'identifiers' that just happen to be common across a set of facts and tables.
Well, and what if it goes the other way? What if, while you are in a coma, they decide that software and hardware are pretty different stuff, and that they need lots of relations from other tables to the HardSoftWare? table, and they want the database to help, because, for example, it makes no sense to connect... I don't know, a table about voltage consumption... to software? Referential integrity could have helped you here, but since you have mixed Hardware and Software, you won't receive any help. Also, whenever you want to print reports that are only about software, or only about hardware, you need to filter your results, all the time... as the system evolves it ends up having a view that for most (all?) uses and purposes wraps the fact that there is a HardSoftWare? table, and every programmer deals with the Hardware view and the Software view... after a while, somebody arrives and asks: why don't we have separate tables? It makes no sense to mix them, it serves no usable purpose, and it unnecessarily complicates our queries... and the answer is: That is the way it has always been. Enter the era of FearOfRefactoringTheDatabase? because of FearOfAddingTables.
Based on experience, I judged that as the less likely scenario. Having accidental extra fields is less of a problem than accidentally missing needed fields. One is an inconvenience, the other is a data stopper. --top
- It is only a stopper if adding new tables is non-trivial...
- If there are no programmers or DBA's around to make the change, then obviously that is the case. Staff priority shifts and turn-over make that a non-trivial fear for most shops. If that differs from your experience, I don't know what to say.
- You can't imagine the world any different than your experience, eh? Imagine an RDBMS design where RelVars were dynamic and just considered empty before being loaded. I don't find it difficult.
- Why, MultiParadigmDatabase (precursor to DynamicRelational) does just that. But it does not exist yet in practice. I thought we generally agreed that if existing products offered better schema virtualization, our differences in opinion would be much narrower because fat/thin would no longer be either/or.
- I apparently wasn't party to whatever conversation where "we generally agreed". And I'd still reject MultiParadigmDatabase and DynamicRelational (due to integrity and domain issues), but I can see why you bring them up. But there are systems - in practice - that make runtime creation of new tables quite easy. You've even worked with a few. 'Thin table' solutions are based around systems where adding new tables is just as easy as inserting new rows into a table. If you make assumptions about the security environment - in particular that everything must be bottlenecked through a DBA - then you will (by even rudimentary analysis) need to over-specify and rely on great foresight, just as you do for your wide-table solutions - in particular, you need to do a lot of thinking up-front to make sure everything is there that the users will possibly ever need. The more extreme thin-table design solutions (especially those deriving from logic-programming languages) take a YagNi approach - just add whichever tables you need (at the moment) to represent the facts you have (at the moment), and add more tables if you learn you need more tables (in the future); no foresight is required. Adding a table in such a system doesn't need to be any more difficult than adding a new WikiWord to the WikiWiki or adding a new function to a Ruby program - and where security is an issue (IMO, it always is), security solutions are provided via either access lists or (better) capability-requirements at any or all of: the whole 'database' of table names, each individual table, and potentially each fact.
SkinnyTables?, Data stopper, bad chopper, skinny timmy. Enough. This person is insane - the white trash needs to end, now.
You have a round-about way of asking "please clarify". What I meant is that an extra field that is irrelevant to a particular situation is going to be less of a problem for the data entry operator and/or report reader than the lack of a field that they need.
PageAnchor: 953
Re: "...if I can't quantify or qualify human or economic factors, then neither can you - which means you can't possibly use them in any rational argument you're attempting to present to support or defend your side of things."
Well, that's a very interesting philosophical question. Think of it this way: would you want somebody to organize your (physical) desk and closet and garage based on formulas optimized for factors that could be measured from various models or pooled studies, or based on what you feel is comfortable for *your* work-style and habits? Software models are to serve humans first and machines second. They are like a custom domain-specific work-bench to solve a problem or provide a service. -- top (ctd.)
{If I could give a formal description of *my* work-style and habits, sure, I'd base it on that. But that requires the ability to qualify my work-style and habits. Indeed, putting one's preferences into code decisions can be one form of saying what those preferences are, but it's foolish indeed if one doesn't even know one's own preferences - in that case, stick with what is proven to work and decide later if there are specific things you can 'qualify' as not liking about it -- db}
One of you suggested that wide tables are done for "laziness". Obviously, people must feel comfortable doing such for *some* reason. [{Habit comes to mind.}] The suggestion is that it creates more effort in the long run, but I don't see it on net. I see lots of tables as clutter, just like clutter on a desk. The less stuff on a desk you have to sift through and grok, the faster and cleaner one's thinking. -- top (ctd.)
{Tables aren't "on your desk", so the analogy doesn't hold. With Relational, you always have a "view" that is constructed automatically from at least one table. Maybe if you had a magic button that could construct things "on your desk" into whichever formation you could specify, your little metaphor would apply. Besides, be it 100 tables vs. 100 columns - you've got the same amount of stuff to sift through and grok either way. -- db}
I've worked with both types of design, and my summary gut feeling is that thin is not in. Wide tables allow one to "drill down" incrementally. You get the general entity concept first (the "wide" table), and then filter out what you don't need via column and row filtering. It's a hierarchical thought process more or less. (Yes, there are LimitsOfHierarchies, but we're talking mostly about a mental map, not a formal taxonomy. In fact, trees are overused because they are mentally catchy.) -- top (ctd.)
- Being in or not (as in fashion, or whatever) might land you on Fabian's quote of the month page. This has nothing to do with being in or being hip and satisfactory with the coworkers or whatever it is you are trying to be. The relational model is not what you describe - and you should stop talking about your table oriented programming model (if it is even a model) as if it is relational - because it isn't. Over the past years reading through this wiki, I've come across tens of pages (maybe hundreds) where you appear to be defending the relational model, when in fact you have absolutely no relation to the relational model whatsoever. It is very confusing, misleading, etc. You ought to defend your table oriented model as the TOP model or something else, and stop speaking of the relational model in any communications. The relational model is also not about gut feelings or pokes that I feel on my ribs or anything like it. Top model (if it is a model) may be about that, but the relational model has basically nothing to do with what you are defending.
- We can save the RelationalDefinition battle for another day. We have enough on our plate right now. --top
{With 'wide' tables you'll also have a bunch of scattered 'exception' tables (because wide tables are very non-homogeneous) that carry all the one-to-many and many-to-many relationships, and as a consequence you still don't get "the general entity concept" - just a fractured view of it. I'd much prefer to have good tools for constructing useful views on the fly than 'pretend' I'm getting some significant benefit out of wide tables. I'll also note that the value of the 'entity' concept has always been in question - it may be worth its own little debate. Thin table approaches, at least those with which I'm most familiar, don't embrace this 'entity' concept... they sort of kick it aside in favor of
DatabaseIsRepresenterOfFacts. The closest they come is recognizing semantic facts like 'exists(identifier-goes-here)'. -- db}
With skinny tables it's like trying to remember the name of an obscure person: there's no mental drill-down available; the mind has to do a sequential search. Now, I perfectly agree that if better schema sifting and virtualization tools were available, such may not be an issue. But, to misquote Rumsfeld, we have to fight the war with the army we have, not the army we want. -- top (ctd.)
{I'll agree that the most popular RDBMSs aren't ideal for the thin-table approach (partially due to the spiral-effect: nobody outside the academic community uses thin-tables, so there isn't a lot of visible money-benefit of building one, so nobody builds one, so nobody outside the academic community uses thin tables ...). That can reasonably affect your current actions, but I don't believe it a great argument for why wide tables are better. And, yes, a thin-table approach would rely on the ability to rapidly identify or learn of tables. Among other things, first-order-logic as part of queries would be useful in thin-tables, making meta-data (data about the database and the data) into FirstClass data. This requires the ability to use a column-name as an attribute-value, grab it out as part of a query, and query upon it. Your proposed DataDictionary(ies) follow this pattern, but similarly lack much support from modern RDBMSs. Anyhow, what do you think Rumsfeld would have said should he have had the ability to construct armies out of well-arranged ones-and-zeroes simply by spending a few $million to hire some smart and experienced people to do it for him? -- db}
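(Most current SQL products do expose a little of this already - a sketch only, using the standard INFORMATION_SCHEMA catalogue; exact names and coverage vary by product.)
 -- Treat schema as data: which tables carry a 'customer_id' column?
 SELECT table_name
 FROM information_schema.columns
 WHERE column_name = 'customer_id';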
Note that I don't disagree with all thin tables. The one that bugs me the most is doing it merely to get rid of nulls. Nulls are too volatile and trivial a concept to tie key design decisions to. One shouldn't have to rework bunches of tables just because a column becomes "required" instead of optional, or required for different circumstances. -- top (ctd)
{I can't imagine needing to rework a bunch of thin tables because a column becomes 'required', either. It might help to know my conception of 'thin tables' in its extreme form is essentially that which I've derived from my study of logic programming languages and automated theorem proving: exactly one fact per row, or ConjunctiveNormalForm? transformed into rows on a table. A typical entity-table basis for databases has (ENTITY-ID,LIST-OF-ATTRIBUTES). E.g. if there are three attribute-columns 'wears', 'makes', and 'eats', then this single row would represent three conjunctive facts wears(fred,what-fred-wears) AND makes(fred,what-fred-makes) AND eats(fred,what-fred-eats). The 'thin-table' approach would break this into three tables (wears, makes, eats) each with a binary relationship. However, there are non-binary relationships that cannot be expressed in this manner, e.g. where an abstract proposition (a predicate) has zero, one, or more than two parameters. 0-ary tables are simply boolean variables (essentially), unary tables are straightforward classifiers (like 'exists(x)' or 'isNumber(y)'), and ternary+ tables really need all three-or-more entries (e.g. delivers(bob,pizza,fred)). There are NO entity tables that also carry facts about the entity. Ever. The closest you can come is an extensive list of entity-identifiers of a single type (isPublication(y), isPeriodical(z), exists(x)). Thin tables, in the more extreme form, entirely reject EntityRelationshipModeling. Instead, their best practices focus on OnceAndOnlyOnce explicit representation of facts (e.g. you'd never say 'is("publication",x)' AND say 'isPublication(x)', though you could say 'is("publication",x) :- isPublication(x)' - i.e. use implied or entailed facts... which (in this case) turns 'is' into a query or view). And there are ways of specifying schema and constraint requirements (e.g. that 'isPublication(x) requires known(y),hasTitle(x,y)') - db}
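(A minimal SQL rendering of that decomposition, keeping the hypothetical wears/makes/eats predicates; table names and types are illustrative only.)
 -- Wide 'entity' form: one row bundles three conjoined facts about fred.
 CREATE TABLE person (person_id VARCHAR(20) PRIMARY KEY,
                      wears VARCHAR(40), makes VARCHAR(40), eats VARCHAR(40));
 -- Thin form: one table per predicate, one row per fact.
 CREATE TABLE wears (person_id VARCHAR(20), item VARCHAR(40));
 CREATE TABLE makes (person_id VARCHAR(20), item VARCHAR(40));
 CREATE TABLE eats (person_id VARCHAR(20), item VARCHAR(40));
 -- Ternary predicates just get a third column; they never fold back into an entity table.
 CREATE TABLE delivers (courier VARCHAR(20), item VARCHAR(40), recipient VARCHAR(20));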
Like I said before, that's like purchasing the decorations for a house based on the clothes you happen to have on. Big design decisions should be tied to "big" concepts, those that are as invariant as possible. Empty columns often don't qualify. (Note that I am not happy about the way current RDBMS implement nulls, but that's another topic.)
Perhaps we can sum this up in a more general rule: Don't make hard partitions in designs unless the hard partition exists in the domain itself. Otherwise, changes will be painful and risky. (Originality disclaimer: I've heard similar rules from other software engineering literature, but cannot recall any specific names or sources from the top of my head.) --top (end.)
{It seems you are making assumptions about the virtues of your own choices again. Can you present any convincing arguments and evidence that the 'wide' Entity table is a 'softer' partition over the concept-space than are the thin tables? Or are you just assuming it? -- db}
I can appreciate that more tables, rather than fewer, may represent a visualisation problem. This is certainly true when you have to deal with a project that has hundreds of tables. That does not mean, however, that you should accept or allow the update anomalies that result from denormalisation. That would be like deliberately writing bad code because you have to program in Notepad. Instead, use (or create) tools that facilitate visualising complex schemata. The "relationships" view in Microsoft Access is surprisingly good for this, especially in terms of following foreign key relationships, though I've long wished for a more sophisticated schema management and visualisation tool that would permit zooming in and out, showing/hiding only the tables that are related by foreign keys to a specified table, and defining arbitrary groups of tables -- e.g., Accounts Receivable, General Ledger, Inventory, Billing, etc. -- and hiding/showing group details (i.e., the individual tables within a group, or the attributes within the tables within a group) as needed. I started working on it -- as a kind of ER-modeller/DBA tool from Hell -- around the turn of the century, but, alas, the pressing demands of other projects have largely squeezed it out of the running. I will come back to it, though -- it would make an ideal administrative front-end for the RelProject. -- DV
We'd have to look at specifics to see if there are really "update anomalies". Not all "problematic" wide tables result in the typical update issues. The devil's in the details.
If your design is not, for example, in 5NF when 5NF is warranted -- or in 3NF, period -- then there are, by definition, update anomalies. These will occur unless you can guarantee the affected tables will never be updated. One of the values of the normal forms is that they can guarantee, for each level, which anomalies will not occur. That means you don't have to look at the "specifics", or rely on conditions (such as "<x>, <y> and <z> tables will never be updated") that hold true today but may not hold true tomorrow. -- DV
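(A small illustration of the sort of anomaly meant, using a made-up orders table that violates 3NF because customer_city depends on customer_id rather than on the key.)
 CREATE TABLE orders (order_id INT PRIMARY KEY,
                      customer_id INT,
                      customer_city VARCHAR(40),
                      amount DECIMAL(10,2));
 -- If customer 42 moves, every one of their order rows must be touched;
 -- miss one and the database holds two contradictory "facts" about the same customer.
 UPDATE orders SET customer_city = 'Leeds' WHERE customer_id = 42;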
Moved reply to SafetyGoldPlating.
Maptional?
"Relational" means roughly "based on tables". However, the "thin table" approach is more akin to maps (dictionaries) than tables. Perhaps the thin-table approach should be called "mapational" or "maptional". (Reminds me of the Muppet Show theme song: "Muppetational" is used to rhyme with "sensational" and "inspirational" IIRC.) --top
- Can you support your (rough) definition of "relational" as "based on tables"? It seems to leave out most of the important parts of the RelationalModel (which IIRC doesn't make any mention of tables in its purest form). Do you think DrCodd or most RelationalDatabase researchers or practitioners would support such a definition? Perhaps we need a TopDefinitionForRelational? before continuing.
- There are different aspects of DrCodd's model that one can emphasize or diminish to support their HobbyHorse. Thus, it would probably be a fruitless exercise in argument loops. --top
- If finding a definition of RelationalModel and Maptional is a fruitless exercise, then shall we delete this section?
- What is the purpose of introducing this term?
- There seems to be an assumption that lack of a precise or agreed-upon definition is reason to not ponder something. I consider this an exploratory section where ideas are kicked around. BrainStormFirstCleanLater. I introduced the term as a working definition to avoid repeating the longer description each time. --top
- Ah, I think I understand now. Based on this and the discussion below, Maptional is used as an easier way to refer to databases that contain only tables with only two columns where one column is the primary key?
- [Close. To be complete, I expect you'd also need to add the constraint that there are no other keys or indices.]
"Thin table" approaches still allow for n-ary relations of arbitrary degree 'n'. The rule determining the degree is essentially that one row = exactly one complete fact (i.e. no conjunctive 'AND' clauses implicit in the predicate reading of the fact represented by the row because those can always be fully reconstructed by use of table joins). As such, likening 'thin tables' to 'maps' seems an invalid analogy
even before one starts to consider the variation in needs for indexing (maps are indexed on exactly one key, but thin tables could be indexed by any subset of the powerset of columns) and the common difference in query and manipulation languages (maps don't typically support joins, intersects, unions, views, etc.). --
WhiteHat
- Well, set expressions/operators can certainly be added to maps. My concern is that if most of the tables are thin, then perhaps one might as well bite the bullet and endorse maps as the "atom" of this thin table philosophy. It may simplify the math/model to always "know" one is working with just maps. Keeping atoms simple sometimes makes the model cleaner. Just something to think about. If the majority of your tables are essentially maps, then perhaps making them all maps allows one to ditch notation baggage and confusion related to supporting the possibility of 3 or more columns, allowing a more compressed notation/model. (I'm not endorsing maptational, just pointing out something about the model.) Similar issues of atom support at RelationalLispWeenie. --top
- It is true that relational operators can be added to maps, but then you wouldn't just have a map anymore... sort of like supporting 'dequeue' for a stack results in something other than a stack. 'Maps', and all objects, are defined at least as much by their accessors as by their underlying structure. But you are wrong about switching to 'maps' keeping things 'simpler'; indeed, in practice people who utilize maps to emulate tables often find they need to maintain two maps, carrying identical data, with the indexing reversed. Essentially, you'll have user-maintained indexes instead of system-maintained ones. It would be far simpler (for the user moreso than for the language implementor) to have 'tables' be the primitive and allow the user to simply define which indices are needed... just as is done in modern database systems. -- WhiteHat
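(For instance - a sketch only, reusing the hypothetical delivers table from the earlier sketch: declare whatever access paths are needed and let the DBMS keep them consistent, instead of hand-maintaining one reversed map per access path.)
 CREATE TABLE delivers (courier VARCHAR(20), item VARCHAR(40), recipient VARCHAR(20));
 CREATE INDEX delivers_by_courier ON delivers (courier);
 CREATE INDEX delivers_by_recipient ON delivers (recipient, item);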
I must say I am surprised that
TopMind, who is acutely aware of these differences and often proselytizes over the relative benefits of the table-based systems over common collection objects like maps, even suggests this 'maptional' analogy... it is almost enough to make me suspect more disingenuous designs. --
RedHat
Please clarify. I didn't endorse "maptational".
- No, but you did liken an approach to databases that you disagree with to "maptional" despite the clear and well known insufficiencies of maps, thereby implying that "thin tables" are similarly insufficient and no better than maps. I.e. you either made a low, subtle, backhanded insult to the efficacy of "thin table" approaches or you displayed ignorance in an area of your own expertise and were thus unaware of the low, subtle, backhanded insult. Which is truer? -- RedHat
(By the way, WhiteHat and RedHat appear to be the same person based on IP address. I almost suspect an attempt to make it look like more than one person is "ganging up on me". But it's possibly just paranoia on my part and just a burp in the wikiware.)
--top
It appears to be a posting style judging from the user pages.
- Correct. Different hats for different viewpoints, even from the same person. -- WhiteHat
But they were not cross-judging each other.
[Note: A related MapTational topic has been moved to TheAdjunct as part of an EditWar compromise.]
Partial Agreement?
I thought somewhere in this ThreadMess (or another) we generally agreed that "thin versus wide" could be merely a view on the same schema/meta information if our tools and/or DB engine design were sufficiently powerful and/or properly designed. One could switch between a "thin" view and a "wide" view of tables as needed. A difference remains over what the default should be when using the half-@ss tools we currently have at our disposal. (For example, existing support for editable views is often poor.) -top
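(To make the idea concrete - a sketch only, reusing the hypothetical wears/makes/eats tables from the earlier sketch: a "wide" presentation can be layered over thin tables with an ordinary view, and whether that view also accepts INSERTs and UPDATEs is exactly where today's editable-view support tends to fall short.)
 -- Assumes the wears/makes/eats tables from the earlier sketch already exist.
 CREATE VIEW person_wide AS
   SELECT w.person_id, w.item AS wears, m.item AS makes, e.item AS eats
   FROM wears w
   LEFT JOIN makes m ON m.person_id = w.person_id
   LEFT JOIN eats e ON e.person_id = w.person_id;
 SELECT * FROM person_wide WHERE wears = 'fedora';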
Incidentally, how would naming-collisions be avoided with so many tables even if the multi-view approach was taken? With so many potential tables, things may start to get crowded without some good conventions. -top