This topic explores reasons why one would want to violate "pure" relational within a "relational" query language or relational tool.
Potential "violations" include:
- Customer does not want to see the keys in results
- Non-removal of duplicates results in faster response (purity sacrificed for performance)
- Sorting (relational does not directly "recognize" or support order)
- Special requirements, such as "preserving" original row-count.
Discussion originally in a now-defunct topic called "Sql Union is Odd":
The SqlLanguage "Union" operator is a curious animal, especially with regard to uniqueness. It is quite practical and needed in the real world, but from a purism-of-relational perspective, it can be difficult to reconcile.
Consider two tables of the format:
table: "Sample"
--------------
key1
key2
value
If key1 + key2 defines the primary key, what if we union two tables of this format and some of the keys overlap? If we redefine the result table to have 3-column key (key1, key2, value), then we can avoid duplication if "value" is unique, but ideally we'd want a way to know this is going to happen. How we want duplicates handled (or ignored) generally needs to be considered on a case-by-case basis.
UNION selects unique rows. UNION ALL selects all rows. By the way, the UNION operator in the RelationalAlgebra only selects unique rows, because no relation may contain duplicate tuples. Indeed, the notion of a "duplicate tuple" should be meaningless -- what point is there in representing the same fact twice? See TheThirdManifesto.
I encountered a case in which the "value" was a count. Take for example the domain of sales statistics where one is adding up all sales.
table: "sales"
----------
store_ID
product_ID
quantity_sold
If UNION is used instead of UNION ALL, then the final count can be wrong even though it is possible to have duplicate rows being union'ed. For example, they may be by month, and we remove month when we do the union. We may have records like this:
SELECT store_id, product_id, qty_sold FROM jan_sales
UNION ALL
SELECT store_id, product_id, qty_sold FROM feb_sales
store_ID..product_ID..quantity_sold
-------------------------------
57...........1234..........7 // from January
57...........1234..........7 // from February
Etc...
This situation would present itself regardless of the query language I believe. In this case we want to
keep duplicates. I suppose we could work around it by keeping the month, but for practical reasons we may exclude it to reduce space yet preserve the total row count as a verification technique. Either way, one has to be careful when using UNION.
- ...in SQL [only].
- Lack of "ALL" in other languages makes one have to consider uniqueness issues. UNION ALL models putting one deck of cards on top of another: conceptually very simple. The "unique" union requires one abandon "physical" thinking. (This may be a good thing, but is "odd" regardless for the uninitiated.)
--top
[Top, if your argument about "being careful" boils down to "I can use this operator incorrectly", then I think you should spare us the dedicated WikiPage and sum up your entire line of reasoning with "BadCodeCanBeWrittenInAnyLanguage". And what you want above is NOT to keep duplicates, but rather to differentiate rows using implicit data you have described as '// from January' and '// from February'. If you want to think of records as 'physical' objects, stick a unique auto-num in each row (thus giving rows ObjectIdentity). And all of relational, along with any other paradigm, is "odd" for the uninitiated. There is nothing natural about procedural programming, for example. See TeachMeToSmoke.]
- It documents some questions and issues that may come up when one uses UNION. I don't know why that seems to bother you. Suppose you are teaching an SQL class and a student asks, "when would somebody want to use UNION ALL?" You could send them here. If you want a (yet another) different name for the topic, then just say so. And I'm not clear on what you are getting at with the month comments. The reasons for removing them (and things like them) are given later.
- [Calling a feature 'odd' or 'unintuitive' says more about you than it does about the feature. BadCodeCanBeWrittenInAnyLanguage, so an argument that a feature is 'odd' merely because you can use it to write bad code is utterly vacuous. That's why your argument bothers me: it is utterly vacuous, and vacuous arguments in general bother me.]
- It's not clear to me what you are complaining about. Do you wish to change the title of the topic? If so, what do you suggest. Further, people's relationship to tools may be just as worthy of discussion as the tool itself.
- [What you have here is at best an instance of SqlAntiPatterns or SqlFlaws and, at worst, is a StrawMan attack on SQL that is actually totally pointless repeat of the generally broader complaint that the computer does what you told it rather than what you wanted it to do. As of yet, I've seen no justification that this is a 'pattern' or 'flaw' at all. It's looking a great deal more like a StrawMan attack. Such vacuous attacks on any language or language feature deserves complaint, whether it be MyFavoriteLanguage or one I find many flaws in (such as SQL).]
- "Odd" does not necessarily mean "bad". Some entertainers are "odd" and we like them that way. Again, if you don't like the topic name, feel free to suggest an alternative. Would "Things to consider when using sql UNION" be better? I find it too long. It appears to me that you are complaining for the sake of complaining. If a non-Top being made the statement, I doubt you would complain. Wiki is full of informal titles.
- ["Odd" connotes "bad". And it seems as though you feel your completely vacuous argument would be redeemed by changing its title. How's this for a title: ThingTopMindFindsConfusingAboutSql?.]
- How about a few other topic suggestions, Doc. Perhaps a bigger issue is that SQL does not require unique result sets (unless you use the DISTINCT keyword, or DISTINCTROW in some dialects). One doesn't have to include the primary key in the SELECT statement. The result is thus not necessarily relational. Related: AlwaysUseSelectDistinct.
- [The inconsistency between SELECT (implicitly all unless marked distinct) and UNION (implicitly distinct unless marked all) would be reasonable to drop into 'SqlFlaws'. However, I don't believe it would merit a full page like SqlUnionSelectInconsistencyIssue?.]
- Do you universally agree it is a "flaw" of SQL?
- It's certainly a questionable, and somewhat error-prone inconsistency, i.e., a flaw. SQL would no doubt have benefited from defaulting to DISTINCT behaviour for both SELECT and UNION.
- Conceptually I wouldn't complain about that as a default. But, that could result in a noticeable performance hit. "Flaw" is too strong a word. It's one of those trade-offs that will anger one group no matter what is chosen. (Performance-weenies can be just as strong-willed as purity-weenies.)
- "Flaw" is an accurate reflection of the inconsistency, not the choice of default. Anyway, true performance-weenies would solve the problem with hand-optimised assembly language. Hence, leaning toward purity and consistency over performance should have been a reasonable choice. Allowing duplicate rows in order to improve performance would then have been a matter of adding clearly-recognisable keywords (e.g., 'SELECT ALLOW DUPLICATES' or 'UNION ALLOW DUPLICATES', much as 'INDEX ...' et al. is added to table definitions in order to improve performance.
- Hand-assembly is asking too much. Given the very limiting equipment environment that existed when SQL was created in the late 70's, it was a rational choice. Those shopping for products often will take performance over purity anyhow. Whether they are "wrong" is business/economic issue with lots of variables.
- The reference to hand-optimised assembly was intended to be humorous. Perhaps the choices the SQL designers made over three decades ago were reasonable at the time, but this only highlights the fact that SQL is at worst obsolete and at best suitable only as a low-level, machine-generated protocol for communicating with legacy DBMSes.
- No standard is perfect. Still, "flaw" is too strong a word, and at this point we'll just have to AgreeToDisagree. It looks like a standard practitioner-versus-purist debate now. They rarely end pleasant.
- [I gain the impression that TopMind isn't grokking the scope of the stated "flaw". The flaw described above refers only to the inconsistency in default behavior. Put it this way: there is a high chance that the reason TopMind believes SqlUnionIsOdd? is because the default behavior of UNION doesn't match the default behavior of SELECT. It could be resolved either way (with the "other" way being to have UNION act as UNION ALL by default, and requiring UNION DISTINCT to be specified if you want to lop away duplicates). The points raised about optimization and whatnot might be valid points in another argument, but are ultimately irrelevant to discussion of whether the inconsistency in SQL syntax and default behavior is a flaw.]
- Okay, I mis-interpreted it. I read "inconsistency" to mean the "sins" of duplication, not in comparing SELECT's behavior to UNION's behavior. I apologize. Post-fupah, I agree it is a "flaw". However, does anybody claim a relational query language shouldn't offer the "duplicate" view option for SELECT and UNION (or their equiv.)? And, I suggest the topic be renamed "allowing query duplicates" or the like. -t
- A relational query language, i.e., one that faithfully and strictly implements only the RelationalModel, shouldn't offer the "duplicate" option because relations, by definition, cannot contain duplicate tuples. Allowing duplicate tuples in relations is as incorrect (and as meaningless) as allowing the characters 'x' and 'y' in integers. However, it is reasonable for a relational language to provide -- in addition to relations -- non-relations (e.g., tables, bags, etc.) and associated operators that do permit duplicate tuples/rows. TutorialDee, for example, defines an ARRAY type that permits duplicate tuples.
- [We already have topics: DuplicatesAreBad, AlwaysUseSelectDistinct, SelectDistinctIsaCodeSmell, DatabaseIsRepresenterOfFacts, DatabaseIsRepresenterOfEntities, etc. If you're asking for my opinion and practice (i.e. what I'm implementing in my own language), I've rejected support for duplicates in relations. I'm not even certain where I'd fit them in... relational queries in my language are derived from Datalog rather than SQL and each 'query' returns a whole database (a record of one or more potentially interrelated relations) rather than a single relation. My language also embeds relational as a primitive tier in a functional layer. Attempting to support duplicates in relations would only introduce undesirable complexity, more verbose syntax, extra concerns for programmers to track mentally, and likely some LanguageIdiomClutter with the higher layer (e.g. functional and procedural) facilities, all to no significant benefit.]
- [Besides, I suspect supporting duplicates directly in relations themselves might be useful for optimizations in the short run but harmful for performance in the 'whole system' sense... i.e. when making higher-order use of a relation, a join on a relation that maintains duplicates can be exponentially more expensive than one that removes duplicates. Instead of putting the decision to tolerate duplicates in the relations themselves, I suggest you should put that decision into the browser. This would allow for a few duplicates when browsing 'lazily' produced relations to slip through for performance when a decision in a higher non-relational layer decides it can tolerate a few duplicates without introducing error, and do this without imposing duplicate-handling semantics and performance costs and syntax into the relational layer. I.e. I would suggest that tolerating duplicates (along with 'order by' and a weaker, heuristic 'try to order by') should be a function of the cursor and transport layer, not of the relational and query layer.]
- [If the above ideas confuse or intrigue, I'd be happy to discuss them in detail below. Just quote the appropriate passages.]
Curiously, having worked with "true" relational languages and their underlying concepts for almost a decade, I probably find "physical" thinking about databases as alien as you find abandoning "physical" thinking. You may view UNION ALL (in SQL) as modelling putting one deck of cards on top of another. I view UNION (in the RelationalAlgebra) on two decks of cards as equivalent to saying, "There is an Ace of Spades", "There is a two of Spades", "There is a three of Spades", "There is an Ace of Spades", etc. I.e., it is a recitation of facts, in which a "duplicate fact" is meaningless.
Remember that a relational DatabaseIsRepresenterOfFacts. As such, omitting the month (or other unique key) fails to represent the fact that sales occurred in a particular month. The query would be better represented as:
SELECT store_id, product_id, qty_sold, 'jan' as month FROM jan_sales
UNION
SELECT store_id, product_id, qty_sold, 'feb' as month FROM feb_sales
I'm not sure what space is reduced by excluding the month, except perhaps in pathologically vast cases.
- I gave an over-simplified example. There may be many more columns in practice needed to "preserve" uniqueness. For example, if 3 extra columns are needed to preserve uniqueness, then we have approximately doubled the size of the results. Such is not uncommon for large or messy databases (fixing the mess may be beyond our control).
- Indeed, there may. However, preserving the columns is an accurate reflection of facts. Removal of columns results in a representation of incomplete information, akin to eliminating the primary key of a RelVar or table. This may not be a problem if the UNION query is used only for (say) reporting, but may cause problems when it is (for example) joined with another table or query result.
- Like any tool/feature, if not used properly, it can result in problems. In practice, slimming the data set for data-transfer or storage purposes is a common need. We just have to do it so as to not "damage" something. I could probably live just fine without "ALL", but in some cases it saves processing steps. And as you mentioned, it may be more efficient in some cases. We trade purity for speed.
- [If you needed to slim the result, a 'pure' way would be to ship across the wire only the facts you require... in this case the quantity sold per product per store: do a join on product ID and store ID, introduce a third column that is the sum of the two qty_sold columns, then project the three columns you need to ship across the wire. (Remember, we have no functional requirement to preserve row count.) As to whether you're saving any processing with UNION ALL... that depends on whether or not you'd be doing the same processing on your side. Claims that purity costs performance often seem to be made with myopic attention to small 'aspects' of the problem.]
- Please clarify. A code-volume shoot-out may be interesting.
- [What is it you think a code volume shootout would accomplish? And what relevance would it have to the above?]
- Such problems are trivially avoided by accurate implementation of the RelationalModel. There are other ways to achieve "slimming the data set" without compromise, such as introduction of a surrogate key.
- Yes, but that is more steps, space, and perhaps time.
A more practical consideration is the fact that UNION (as opposed to UNION ALL) may need to perform duplicate detection. This can, in some cases, present unacceptable performance overhead. However, regarding duplicate rows as "acceptable", without clearly understood and well-managed reasons to permit them in very limited circumstances in SQL databases (true relational databases do not permit duplicate tuples), will inevitably result in all the problems that duplicate rows can cause, such as unintended JOIN results.
SQL provides a choice that a more "pure" language may not provide. Whether having this choice is "good" or not is a sticky philosophical problem. Or more specifically, another one of those economic/business choice versus "purity" debates.
This "choice" that is available in SQL (along with support for NULL) often results in unintended or unexpected query results that cannot occur in a more "pure" implementation of the RelationalModel, without any benefit except a marginal gain in terseness (which is offset by the overall needless verbosity of SQL.)
That said, most debates over "purity" vs pragmatics are in fact a debate over priorities. The purists favour (for example) provability or future scalability/modifiability over immediate pragmatic concerns. The pragmatists favour (for example) short-term expediency or gross simplicity over provability, etc. In the absence of a clear benefit of one view over the other (which tends to only be evident in specific cases), the debate is inevitably a religious one.
I'll leave the "null" issue to other topics.
Reading between the lines here, I suspect much of the debate is a side-effect of the fact that Top didn't know there was a UNION ALL until he saw it on this page.
That's not true. I've been using it for years. An accusation a day keeps the civility away. It's also irrelevant since those above are arguing that it be removed or avoided.
Sorting
The RelationalModel does not directly support "sorting" as we generally know it. However, in TqlQueryOperators, the "order by" operator gets around this by generating a sequence number column. The "client" (or output operation) can then use this sequence number to physically re-order (or check order of) the results according to the sequence numbers. The Order-by operator uses the primary key to "settle ties" when generating the sequence. --top
Whilst the notion of "order" in a relation is meaningless, TutorialDee defines an ORDER operator that accepts a relation and an order specification as operands and returns an array of ordered tuples.
I felt there was no need to create a new collection "type". Adding a sequence number allows order info to be present without creating NonOrthogonalLanguageFeatures. --top
I presume you do not implement relations, but tables (i.e. bags), yes? The equivalent TutorialDee operator to your "order" is RANK, which as I recall is discussed in TheThirdManifesto but not part of the grammar.
If you are referring to DynamicRelational, that has no significant relation to SMEQL/TQL. SMEQL does not specify type systems and dynamism, attempting to distance itself from that decision. Further, the techniques described in DynamicRelational are not necessarily non-relational as described in TupleDefinitionDiscussion. -t
I think there may be a misunderstanding here. I wasn't referring to DynamicRelational but to TQL, nor was I referring to type systems. I was referring to the nature of your tables. I am assuming you do not implement relations, because if you provide ordered tuples/rows (or duplicate tuple/rows), you are (by definition) implementing structures that are not relations. Or do you only provide the ORDER operator that returns a column of ranks, and rely strictly on the client to use the ranks to sort the ResultSet? In that case, you could implement relations. Do you?
To be practical, the query system should have a way to return ordered rows to the user. In the physical world the result must have some kind of order whether we want it to or not (unless maybe you "print" by dropping record capsules into a bowl of liquid simultaneously, but that's still a 2D ordering.) Thus, every relational system must eventually "violate" the no-order rule in order to "deliver" the result to the outside world. If we have to do this because physics forces us, we might as well give a "smart" or "usable" order. As far as duplicates, that is something I have delayed addressing. --top
- [Your conclusion does not follow from your premises. You assert that the relational system must violate the no-order rule. This is not true. While at some point relations must be ordered for serialization, neither the ordering nor the serialization need to be implemented as part of the relational system. Layered language approaches achieve this naturally. Sorting and ordering can be placed in the 'higher' layer with transactions, cursors, and other non-relational things.]
- The key is "to be practical". More on this below. -t
- [Your conclusion does not follow from your premises INCLUDING your "to be practical" hypothesis.]
- You always say that, but the details usually prove you wrong.
- *Boggle* As (mainly) an observer in these things, it's clear that it's you who is proven wrong again and again and again and again, over the same old things every time, with an almost masochistic degree of obliviousness on your part.
Indeed, this is why TutorialDee defines an ORDER operator that returns an ARRAY of ordered tuples and a LOAD operator that returns an ARRAY of tuples in arbitrary order. Hence, there is no violation of any "no-order rule", because ORDER returns an ARRAY rather than a relation. (Note: Published editions of TheThirdManifesto, up to and including the 3rd edition, combine ORDER and LOAD into a single operator. Subsequent work (unpublished at the time of this writing in February 2009) suggests separating ORDER and LOAD, and deprecates LOAD for converting a relation to an array.)
[In my layered language, translating a list to a relation may readily be performed in the functional layer, but transforming a relation back into an ordered set can only be performed in the procedural layer (which is a full tier above the functional layer, but still isn't TuringComplete). The separation is driven by keeping the functional layer deterministic, which could not be achieved with 'ORDER' or 'LOAD' operators. (That said, an aggregation operation could perform the translation if, and only if, there was a total order on the elements... though I expect it would be difficult to prove associativity and commutativity for that aggregation function.)]
One problem with letting the "wrapper" do the sorting is integration. In TqlExampleOne we can see how "sequence-based" ordering is used to perform "top N" grouping computations. This ability means that the "query engine" knows how to sort. If we let a wrapper do it, then the wrapper must either duplicate the effort (OnceAndOnlyOnce violation), or share a library with the query engine. If the second, why beat around the bush and just merge and be done with it? It would be "closely related" to the query engine if the second, and thus not really "separate". It's too much dancing around JUST to avoid a ticket from the Relational Purity Police. It is generally more practical if a query system provide common services asked of "collection engines"; at least until the time somebody figures out how to make such services plug-and-play, efficient, and still be language neutral. -t
- [There is a difference between layered languages and simple wrappers that is mostly described in terms of integration. Layered languages are designed to integrate from bottom up and top down. And I think you're begging the question regarding your "top N" grouping example... that is, you're assuming that the ability to perform sorts must be part of the 'query' layer. The 'issue' you raise simply doesn't exist with different assumptions.]
If I am not mistaken, the layers in the author's "layered language" are internally structural. I would assume (and hope) the syntax would be integrated such that the layers would be invisible unless there is an explicit user-driven need to perceive them.
There's probably a way to separate SMEQL's "OrderBy?" operation from the rest of the operations using creative packaging and labeling of parts, but it appears to be an empty victory so far.
[The victory only appears empty if you fail to recognize the opportunities it presents. In the case of ordering for relations in particular:
- without ordering it is easier to optimize indexing (i.e. using hashes)
- Ordering is only done if explicitly requested. I did not suggest hard-wire ordering into the data storage technique other than making it relatively efficient when asked for. If you use hashes but need to order stuff often, then the hashes may not be the best choice. It may be a violation of OnceAndOnlyOnce to have one index for speed and another for sorting requests. Generally it's more effective to just have one index per column. A B-tree index will serve both purposes, look-up and sorting, at roughly a "B-" grade level, while a hash will serves look-ups at "A" but sorting at "D" (roughly). There are special needs where redundant indexing types per column is effective, such as data warehouses where there's very few writes and plenty of RAM and disk. But, generally B-trees are the best known compromise between different usage patterns. -t
- Further, having a Sort command merely available does not itself dictate index type, but rather usage patterns should. Not having a Sort command in your language does not mean sorting is never needed. If it is more efficient for two sub-systems, such as your relational system and array system, to share a single index, then a mechanism to allow it should exist. -t
- I do not suggest that ordering is never needed. Putting ordering concerns into another layer only makes sorting unavailable to relations, unavailable to intermediate views and queries; ordering is still available to higher-layer facilities such as relation browsers and cursors. In order to defend ordering being in the relational layer, you need to defend ordering (and reasons for explicitly requesting ordering) for intermediate queries and views and such as well as for primary relations and tables. Your above points are irrelevant as a defense for supporting ordering in the relational system itself.
- I am only suggesting that the parts be able to use existing or new indexes without having to reinvent the wheel or duplicating data bytes JUST to maintain conceptual separation. I am skeptical that this can be done without some integration or common interface coordination. Thus 100% separation is probably an idealistic pipe-dream. Sorting may not be "in" the relational engine, but it should at least be heavily allied to avoid mentioned problems. -t
- Your skepticism is noted. I'd agree to a statement that the ability to annotate 'suggestions' to a query optimizer or regarding table layout, for example, would be quite useful. E.g. parts of a query could be marked for laziness, and other parts for full or partial sorting, etc. Or one could suggest that tables be laid out in order to optimize a particular set of queries with given frequencies. So long as these are suggestions and not mandates, they avoid the need for a SufficientlySmartCompiler without any sacrifice of purity or the ability of the system to choose more appropriate optimizations than you suggested.
- without ordering it is easier to perform distribution and redundancy (since machines don't need to maintain extra ordering information, can actually distribute using partial hashes)
- without ordering one can more readily support deltas and subscriptions to views (since such deltas don't need to carry extra information about 'movement' and 'positioning' of tuples in a relation)
- the absence of ordering makes it possible to define lazy joins and cross products (since one doesn't wait until every entry is produced)
- the queries themselves are more reusable as intermediate results since there are no issues of you 'wasting' an intermediate ordering of a query.
- the ordering itself or heuristics for what constitutes 'best answers' can now be manipulated independently of the queries,
- novel ordering concepts become easier to define, test, and optimize independently of the queries, etc. Example: lazy heuristic ordering, where one prioritizes lazily produced answers but doesn't wait for all answers to be produced (useful for embedded AIs and logic processing with constant memory overhead)
[When trading away 'purity', one is often trading away opportunities provided by said purity. Paradoxically, when trading away purity for performance one is often trading away opportunities to optimize for performance. In the short term this might be a win. In the long run, the optimizer starts beating even skilled humans on a regular basis.]
This tends to only be the case if the system is "clean" from start to finish. The messier the reality, the less the highly-abstracted techniques work effectively. If one does not have control over the entire system, which is often the case, then they can only work with the parts given to them as is. An example of this tendency can be found in AutoKeysVersusDomainKeys where Costin appears to assume (demand?) that an organization be "clean". -t
What is being described and implied here is not enforced high-abstraction but composability, in which generalised primitives (rather than specific-purpose high-level constructs, as in SQL) may be effectively combined to express high abstraction where needed, but the individual components are sufficiently low-level to maintain flexibility. A good example is the RelationalModel, in which the low-level RelationalAlgebra operators (and their even more primitive "building blocks", e.g., ExtendedSetTheory) may be used in infinite combination to implement both low-level and high-level mechanisms. See PrimitivesAndMeansOfComposition.
FebruaryZeroNine