Query Language Going Outside Relational

This topic explores reasons why one would want to violate "pure" relational within a "relational" query language or relational tool.

Potential "violations" include:


Discussion originally in a now-defunct topic called "Sql Union is Odd":

The SqlLanguage "Union" operator is a curious animal, especially with regard to uniqueness. It is quite practical and needed in the real world, but from a purism-of-relational perspective, it can be difficult to reconcile.

Consider two tables of the format:

  table: "Sample"
  --------------
  key1
  key2
  value
If key1 + key2 defines the primary key, what if we union two tables of this format and some of the keys overlap? If we redefine the result table to have 3-column key (key1, key2, value), then we can avoid duplication if "value" is unique, but ideally we'd want a way to know this is going to happen. How we want duplicates handled (or ignored) generally needs to be considered on a case-by-case basis.

UNION selects unique rows. UNION ALL selects all rows. By the way, the UNION operator in the RelationalAlgebra only selects unique rows, because no relation may contain duplicate tuples. Indeed, the notion of a "duplicate tuple" should be meaningless -- what point is there in representing the same fact twice? See TheThirdManifesto.

I encountered a case in which the "value" was a count. Take for example the domain of sales statistics where one is adding up all sales.

 table: "sales"
 ----------
 store_ID
 product_ID
 quantity_sold
If UNION is used instead of UNION ALL, then the final count can be wrong even though it is possible to have duplicate rows being union'ed. For example, they may be by month, and we remove month when we do the union. We may have records like this:

  SELECT store_id, product_id, qty_sold FROM jan_sales
  UNION ALL
  SELECT store_id, product_id, qty_sold FROM feb_sales

store_ID..product_ID..quantity_sold ------------------------------- 57...........1234..........7 // from January 57...........1234..........7 // from February Etc...
This situation would present itself regardless of the query language I believe. In this case we want to keep duplicates. I suppose we could work around it by keeping the month, but for practical reasons we may exclude it to reduce space yet preserve the total row count as a verification technique. Either way, one has to be careful when using UNION.

--top

[Top, if your argument about "being careful" boils down to "I can use this operator incorrectly", then I think you should spare us the dedicated WikiPage and sum up your entire line of reasoning with "BadCodeCanBeWrittenInAnyLanguage". And what you want above is NOT to keep duplicates, but rather to differentiate rows using implicit data you have described as '// from January' and '// from February'. If you want to think of records as 'physical' objects, stick a unique auto-num in each row (thus giving rows ObjectIdentity). And all of relational, along with any other paradigm, is "odd" for the uninitiated. There is nothing natural about procedural programming, for example. See TeachMeToSmoke.]

Curiously, having worked with "true" relational languages and their underlying concepts for almost a decade, I probably find "physical" thinking about databases as alien as you find abandoning "physical" thinking. You may view UNION ALL (in SQL) as modelling putting one deck of cards on top of another. I view UNION (in the RelationalAlgebra) on two decks of cards as equivalent to saying, "There is an Ace of Spades", "There is a two of Spades", "There is a three of Spades", "There is an Ace of Spades", etc. I.e., it is a recitation of facts, in which a "duplicate fact" is meaningless.

Remember that a relational DatabaseIsRepresenterOfFacts. As such, omitting the month (or other unique key) fails to represent the fact that sales occurred in a particular month. The query would be better represented as:

  SELECT store_id, product_id, qty_sold, 'jan' as month FROM jan_sales
  UNION
  SELECT store_id, product_id, qty_sold, 'feb' as month FROM feb_sales
I'm not sure what space is reduced by excluding the month, except perhaps in pathologically vast cases.

A more practical consideration is the fact that UNION (as opposed to UNION ALL) may need to perform duplicate detection. This can, in some cases, present unacceptable performance overhead. However, regarding duplicate rows as "acceptable", without clearly understood and well-managed reasons to permit them in very limited circumstances in SQL databases (true relational databases do not permit duplicate tuples), will inevitably result in all the problems that duplicate rows can cause, such as unintended JOIN results.

SQL provides a choice that a more "pure" language may not provide. Whether having this choice is "good" or not is a sticky philosophical problem. Or more specifically, another one of those economic/business choice versus "purity" debates.

This "choice" that is available in SQL (along with support for NULL) often results in unintended or unexpected query results that cannot occur in a more "pure" implementation of the RelationalModel, without any benefit except a marginal gain in terseness (which is offset by the overall needless verbosity of SQL.)

That said, most debates over "purity" vs pragmatics are in fact a debate over priorities. The purists favour (for example) provability or future scalability/modifiability over immediate pragmatic concerns. The pragmatists favour (for example) short-term expediency or gross simplicity over provability, etc. In the absence of a clear benefit of one view over the other (which tends to only be evident in specific cases), the debate is inevitably a religious one.

I'll leave the "null" issue to other topics.


Reading between the lines here, I suspect much of the debate is a side-effect of the fact that Top didn't know there was a UNION ALL until he saw it on this page.

That's not true. I've been using it for years. An accusation a day keeps the civility away. It's also irrelevant since those above are arguing that it be removed or avoided.


Sorting

The RelationalModel does not directly support "sorting" as we generally know it. However, in TqlQueryOperators, the "order by" operator gets around this by generating a sequence number column. The "client" (or output operation) can then use this sequence number to physically re-order (or check order of) the results according to the sequence numbers. The Order-by operator uses the primary key to "settle ties" when generating the sequence. --top

Whilst the notion of "order" in a relation is meaningless, TutorialDee defines an ORDER operator that accepts a relation and an order specification as operands and returns an array of ordered tuples.

I felt there was no need to create a new collection "type". Adding a sequence number allows order info to be present without creating NonOrthogonalLanguageFeatures. --top

I presume you do not implement relations, but tables (i.e. bags), yes? The equivalent TutorialDee operator to your "order" is RANK, which as I recall is discussed in TheThirdManifesto but not part of the grammar.

If you are referring to DynamicRelational, that has no significant relation to SMEQL/TQL. SMEQL does not specify type systems and dynamism, attempting to distance itself from that decision. Further, the techniques described in DynamicRelational are not necessarily non-relational as described in TupleDefinitionDiscussion. -t

I think there may be a misunderstanding here. I wasn't referring to DynamicRelational but to TQL, nor was I referring to type systems. I was referring to the nature of your tables. I am assuming you do not implement relations, because if you provide ordered tuples/rows (or duplicate tuple/rows), you are (by definition) implementing structures that are not relations. Or do you only provide the ORDER operator that returns a column of ranks, and rely strictly on the client to use the ranks to sort the ResultSet? In that case, you could implement relations. Do you?

To be practical, the query system should have a way to return ordered rows to the user. In the physical world the result must have some kind of order whether we want it to or not (unless maybe you "print" by dropping record capsules into a bowl of liquid simultaneously, but that's still a 2D ordering.) Thus, every relational system must eventually "violate" the no-order rule in order to "deliver" the result to the outside world. If we have to do this because physics forces us, we might as well give a "smart" or "usable" order. As far as duplicates, that is something I have delayed addressing. --top

Indeed, this is why TutorialDee defines an ORDER operator that returns an ARRAY of ordered tuples and a LOAD operator that returns an ARRAY of tuples in arbitrary order. Hence, there is no violation of any "no-order rule", because ORDER returns an ARRAY rather than a relation. (Note: Published editions of TheThirdManifesto, up to and including the 3rd edition, combine ORDER and LOAD into a single operator. Subsequent work (unpublished at the time of this writing in February 2009) suggests separating ORDER and LOAD, and deprecates LOAD for converting a relation to an array.)

[In my layered language, translating a list to a relation may readily be performed in the functional layer, but transforming a relation back into an ordered set can only be performed in the procedural layer (which is a full tier above the functional layer, but still isn't TuringComplete). The separation is driven by keeping the functional layer deterministic, which could not be achieved with 'ORDER' or 'LOAD' operators. (That said, an aggregation operation could perform the translation if, and only if, there was a total order on the elements... though I expect it would be difficult to prove associativity and commutativity for that aggregation function.)]

One problem with letting the "wrapper" do the sorting is integration. In TqlExampleOne we can see how "sequence-based" ordering is used to perform "top N" grouping computations. This ability means that the "query engine" knows how to sort. If we let a wrapper do it, then the wrapper must either duplicate the effort (OnceAndOnlyOnce violation), or share a library with the query engine. If the second, why beat around the bush and just merge and be done with it? It would be "closely related" to the query engine if the second, and thus not really "separate". It's too much dancing around JUST to avoid a ticket from the Relational Purity Police. It is generally more practical if a query system provide common services asked of "collection engines"; at least until the time somebody figures out how to make such services plug-and-play, efficient, and still be language neutral. -t

If I am not mistaken, the layers in the author's "layered language" are internally structural. I would assume (and hope) the syntax would be integrated such that the layers would be invisible unless there is an explicit user-driven need to perceive them.

There's probably a way to separate SMEQL's "OrderBy?" operation from the rest of the operations using creative packaging and labeling of parts, but it appears to be an empty victory so far.

[The victory only appears empty if you fail to recognize the opportunities it presents. In the case of ordering for relations in particular:

[When trading away 'purity', one is often trading away opportunities provided by said purity. Paradoxically, when trading away purity for performance one is often trading away opportunities to optimize for performance. In the short term this might be a win. In the long run, the optimizer starts beating even skilled humans on a regular basis.]

This tends to only be the case if the system is "clean" from start to finish. The messier the reality, the less the highly-abstracted techniques work effectively. If one does not have control over the entire system, which is often the case, then they can only work with the parts given to them as is. An example of this tendency can be found in AutoKeysVersusDomainKeys where Costin appears to assume (demand?) that an organization be "clean". -t

What is being described and implied here is not enforced high-abstraction but composability, in which generalised primitives (rather than specific-purpose high-level constructs, as in SQL) may be effectively combined to express high abstraction where needed, but the individual components are sufficiently low-level to maintain flexibility. A good example is the RelationalModel, in which the low-level RelationalAlgebra operators (and their even more primitive "building blocks", e.g., ExtendedSetTheory) may be used in infinite combination to implement both low-level and high-level mechanisms. See PrimitivesAndMeansOfComposition.


FebruaryZeroNine


EditText of this page (last edited February 19, 2009) or FindPage with title or text search