I'm coining an awkward term as a thought experiment. All these HolyWars over "pure" relational versus the acceptance of "duplicate rows" (AKA "bags" in data-structure-speak) in existing RDBMS got me thinking about the implications of such. What practical benefits are we giving up by giving in to bags? (Note that this does not preclude the existence of unique keys as an option, an added "constraint".) Is "bag math" sufficiently messier or less powerful than map-math (AKA "pure relational")? Is WaterbedTheory at play? --Top
SQL is, of course, "bagatational".
One practical benefit we give up by "giving in" to bags is that certain query optimisations which are possible on true relations are impossible with bags. Query expression re-writes that may be used on relations will produce duplicate (or more duplicate) rows when applied to bags. Presumably, duplication has meaning in a bag-oriented system (otherwise we wouldn't need duplicates), so this is obviously unacceptable.
- Can you offer concrete examples of the query optimizations which are possible on true relations but impossible with bags? --AnonymousDonor
- Sure. Here, \ represents difference and u represents union. Given sets A, B and C then A \ (B u C) = (A \ B) \ C. Obviously, an optimisation strategy can choose to execute either A \ (B u C), or (A \ B) \ C, depending on which is lower cost. Given bags P, Q and R then P \ (Q u R) != (P \ Q) \ R. By using bags, we lose the ability to optimise by transforming the one expression into the other. See J. Albert's (1991) "Algebraic Properties of Bag Data Types" paper, which supplies the relevant operator definitions and is related to the topic of this thread at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.97.7657 or download it here: http://www.vldb.org/conf/1991/P211.PDF
- {But they are not really bags or don't have to be bags internally. See topic "Internal Versus External Bags" under BagAtationalDiscussion. Thus, the above may be moot.}
A second practical benefit we lose is the ability to detect and prevent double-entry errors. Presumably, if we want duplicate rows in a table, then duplication has meaning. In a system that allows duplicate rows, that meaning is hidden. The existence of a duplicate obviously means something (otherwise we wouldn't be allowing duplicates), but we choose not to indicate it. So, if we accidentally double-enter a row, how do we distinguish that erroneous extra row from a meaningful duplicate?
'''In the same way in which we distinguish errorneous extra row in a table that has autoincrement primary key. Hardly. Uniqueness doesn't help to prevent errors by itself.
Also, what makes you think that duplication must mean something? Sometimes we just don't care about it at all. Perhaps we only want to know if a record exists or if it is before/after some other record. If YAGNI uniqueness, then don't do it. --Grey'''
[Conversely, if you aren't gonna need duplicate rows, why support them? -DavidMcLean?]
- I'm not against the idea of a shop- or database-specific config setting to not allow them. -t
- I am, on the basis that the config setting propagates complexity throughout the query engine, particularly the optimiser. There are reasons to import duplicates when they occur in the world outside the database, but then you coalesce or uniquely tag duplicates so none are stored in the database. There are reasons to eliminate columns when sending data from the database to the outside world, which acceptably allows duplicates to be emitted. I've yet to see a compelling reason -- or any reason, for that matter -- to maintain duplicates in a database.
- I believe we've spent something like 6 pages going around and around on such cost/benefits. I doubt going to 7 or 472 pages will solve it such that I'm satisfied to let a given shop decide if they want to forbid bags or not rather than insist they be universally forbidden per The Gods of Logic and Parsimony. -t
Continued at
BagAtationalDiscussion
(Discussion moved from HowOtherQueryLanguagesAddressSqlFlaws)
- SmeQl: As currently defined, tables are bags rather than sets. Duplicate rows are allowed unless the Unique operation is used.
- Con: In the RelationalModel, duplicate rows are meaningless, redundant, and can cause erroneous results if used as the basis for totals, counts, and other summaries or aggregate results. What purpose is a duplicate fact? See DatabaseIsRepresenterOfFacts.
- Sometimes we have to be "ugly" to match real-world ugliness. If the domain wants a bag, give it a bag. Another "purity" fight here? Sometimes the user simply doesn't want to see the damned key. Why are you forcing them to see it? If I ask the computer to not show the key, the thing better not show it or I'll go buy another that obeys me. If it lectures me about purity, I'll smash its screen. Further, using bags as results may be more efficient in some cases.
- Except for the last sentence, which is accurate but of negligible concern, that's a nice rant.
- As a compromise, it could perhaps add the restriction that only the "final" result can be a bag. In other words, a bag cannot be fed into another relational operator. Another way of saying this is that a result "table" must have a primary key defined to be used by subsequent operators. - t
- This is essentially how a tuple ARRAY in TutorialDee is used.
- I find arrays too close to "table" to bother making a different "type". You pulled an IS-A when HAS-A would be better abstraction and concept sharing.
- TutorialDee's ARRAY is perhaps a bit of a misnomer. It is not a general-purpose array. It is more akin to a SQL cursor but with randomly-accessible rows, hence the name ARRAY.
- Does it really need two different "hard" types? Sure, the underlying implementation may change if different attributes are chosen for a given "table", but the language user perhaps shouldn't have to care about implementation, which may change anyhow in the future when better or alternative implementations are found.
- The distinction between RelVar and ARRAY is necessary, because an ARRAY allows duplicates but cannot be persistent and supports limited operations related to its cursor-like array-flavoured nature, whilst a RelVar does not permit duplicates but can be persistent and supports the full RelationalAlgebra. They can no more be considered alike, equivalent or interchangeable than a SQL table can be considered to be a SQL cursor or vice versa.
- Duplication and being persistent are insufficient reasons for inventing a new type that's so similar to tables. And, duplication and persistence are potentially unrelated to each other. Why not make these attributes be like two independent toggle switches? I find your justification for separate "types" poor. It may also make using most of the language with existing RDBMS tables difficult. But we can AgreeToDisagree and move on.
- I think you're letting the name "ARRAY" lead you into thinking it's a table or table-like thing. As I mentioned before, "ARRAY" is probably a misnomer. Read TheThirdManifesto (the book, not the Wiki page or the Web site). You'll see that ARRAY is essentially a replacement for SQL's CURSOR. ARRAY and RelVar are as similar as chalk and cheese, which makes your comments about toggle switches and difficulty with existing RDBMS tables almost completely non sequitur.
- A decent query language should be able to return a non-unique result set to the user without the need to use something like cursors to pull it off. If you need cursors to pull it off, that's a ding against it in my book. -t
- {Why do you believe formatting a result-set for the end-user is the responsibility of the query language? Seems to me like that feature should be in a separate (or layered) 'presentation formatting' language.}
- The original purpose of query languages was to answer ad-hoc questions like "How many green widgets did the blue customer buy in June?", not so much for internal processing. And, it's a common need in association with queries.
- {Indeed, the purpose of a query language is to obtain an answer, not to format it for display.}
- But what's so hard about allowing the equiv of a SELECT list to simply exclude the primary key(s)? Just simply allow it rather than start a purity holy war about how the Sun will rise in the west if you allow it.
- {What's so hard about allowing contractors to simply mortar our walls with bull feces? It isn't as though the Sun will rise in the west if you allow it. What's so hard about kicking a puppy? Lift your leg then punt. Just simply allow it rather than start a holy war against puppy punting. What's so hard about treating these arguments as credible? Simply allow them rather than complain about sophistry, relevance, or hand-waving. I'm being reasonable, which to me means the opposite of being anal.}
- It is "reasonable" to you, not to me. It's a tool, not a cop, court, and jail. Humans should spank humans, not machines. And a bag is not always the "wrong domain model" anyhow. -t
- A decent query language should not allow duplicate tuples/rows, at least without having to use something like cursors to force them. If you deal with bags by default, that's a ding against it in my book.
- Again, we disagree on that point.
- Yeah, but you are unremittingly, unceasingly, and indubitably wrong in every conceivable way. I'm surprised you're still able to post, what with the staggering burden of profound incorrectitude weighing down your limbs. See BagAtational for a few more reasons why bags should be bagged.
- I see tradeoffs, not a free lunch. Discussion continued in BagAtational. [EditHint: perhaps this whole sub-topic should be moved there.] -t
See Also: DynamicRelational, RecordBasedDatabase, BagSetImpedanceMismatch, BagVersusSetControversyRoadmap
CategoryDatabase
AprilTen