Tql Chained Join

Chained Joins Considered to solve TqlLacksaNaturalJoin.

One thing to perhaps consider is "chained" joins where natural joins are done in a chain. The table names could be put into a Column Table. Example:

  r = chainJoin(cols('employees, office, department, jobTitle'))
Or to use the new "shortcut" approach:

  r = chainJoin([employees, office, department, jobTitle])
Note that the names could be virtual tables as well as actual tables.

The above would be equivalent to:

  r1 = natJoin(employees, office)
  r2 = natJoin(r1, department)
  r3 = natJoin(r2, jobTitle)

// Alternative r1 = natJoin(employees, office, "inner") r2 = natJoin(r1, department, "left") // left outer join r3 = natJoin(r2, jobTitle, "right")

If the native DB engine does not support natural joins, I suppose a TQL processor can have the option of using a DataDictionary that contains the reference information needed for natural joins. Or, a Column Table could carry the join expressions. A ponder-item. But remember that this language is meant to allow user-defined (or at least DBA-defined) operations. Thus, you can make one and share it if it does not come with the original box.

A basic "Join Dictionary" may resemble:

 joins (table)
 ----------
 joinID
 description

joinTables (table) ----------- jtID // key [naming needs pondering] joinRef // foreign key to "joins" table joinType // inner, outer, etc. tableA tableB

joinColumns ----------- jtRef columnA // column from table A columnB // column from table B displaySequence
This only supports "equal" comparisons. A fancier version would allow more operations, or expressions in addition to or instead of column matches. But version 1.0 should keep it simple in my opinion.

-- top


Another approach may be just to define the "linkage" columns ahead of time (similar to the join dictionary above), and then indicate "inner", "left-outer", "right-outer" etc. between the table pairs for the actual join step. However, I haven't found a convenient syntax or interface for this yet. A related issue is whether natJoin (or similar op) can use column naming conventions if there is no join dictionary entry for a given pair of tables. There are a lot of paths we can take here, but it's hard to know which is the most practical. "Manual" joins can still be used when needed such that we don't have to cover all cases, just the more common ones. Also note that a "generated" (virtual ad-hoc) table, such as the result of joining 2 tables, cannot use a join dictionary because it's specified by table, and this info is "lost" for virtual tables. This is not a show-stopper, but does limit our options. -t


Declarative Versus Imperative

Why "chain" join? The word "chain" implies a particular sequence, and some special case beyond a "normal" join. The whole point of the RDBMS and believe it or not, SQL, is to model such things declaratively. That is, you tell it what you want, not how to get it. SQL falls down at being declarative in many spots, but that's no reason to copy or even add more damage to it. If join in your language can't even take multiple specification args, you've created a language significantly less powerful than SQL. Is that your intent?

I am not clear on what you are suggesting. It is not necessarily a literal time-based sequence. It is like having nested math operators where the inside operators are considered to be done "first". If you change the order it changes the meaning. See Example below. Thus, it is no more bound to temporal concerns than pure math, at least as far as what it is requesting. Implementation of course is bound by time, but optimization may choose a different order than what is literally requested as long as it gives the expected answer.

  x = a(b(c(y)))
  x = c(b(a(y)))  // not necessarily the same as first equation in math either
Oh fudge, I smell a LaynesLaw fight over "declarative" and "imperative" :-)

-- top

Perhaps one can illustrate it this way. Suppose we have this algebraic formula:

  x = a + a + a + a + a
There are (at least) two ways to go about evaluating this. The first is to use the usual algebraic order-of-operations and precedence rules and sequentially process it as stated. However, a smart tool may internally refactor the above to:

  x = a * 5
The result would be equivalent to the sequential solution, but more efficient. The formula asks "what", not "how". In other words, one can view it in a sequential perspective. However, it can also internally be refactored. The closer a query language is to math, the easier it is to apply these kinds of optimizations. I don't see where TQL is any less mathematical than SQL. But the meat of the argument is that it is easier to mentally view TQL in a sequential way if one wants to (but one is not forced to). This allegedly makes it easier to understand because one can use sequential experience to analyze it if so desired.

One example where SQL seems to have factoring problems is the form:

   SELECT ....
   WHERE x in (select count(*) from foo)
     AND y in (select count(*) from foo)
Ideally, this should be more like:

   // Bastardized SQL
   myFooQuery = (select count(*) from foo);
   SELECT ....
   WHERE x in (myFooQuery)
     AND y in (myFooQuery);
TQL easily provides this sort of refactoring without having to create DBA-side views (which can be a technical and political hassle). Note that a smart RDBMS may still see that the SQL in the first example has repeated sub-queries and perform the sub-query only once, essentially refactoring the internal execution to resemble the second example.


In reference to top's claimed equivalence between two expressions (chainJoin vs. three natJoins placed in -one particular- arbitrary order)p:

[These] aren't expressively equivalent even if the result of evaluating or processing them is equivalent. Any particular task can be expressed a number of different ways - e.g. RelationalCalculus vs. RelationalAlgebra, a list of instructions on how to do something vs. a list of things that need to be done; it's the nature of the expression, not the nature of the task, that determines the expression is declarative vs. imperative.

They can overlap. I doubt there is any pure test for declarative versus imperative, and it is merely another one of those LaynesLaw traps.

Some people have fuzzy understandings of everything, but that doesn't mean the concepts aren't pure and can't be understood. Imperative expressions describe communications (to be executed), and functional expressions describe values (to be evaluated). Neither is declarative. Declarative expressions are all about avoiding the "how", and thus are necessarily terms of consequents (the "what"). Declarative statements over communications describe effect or final state, and are called "goals". Declarative expressions over evaluations describe a value in terms of its properties. This isn't to say that declarative, imperative, and functional don't interact - most expressions will leverage some component of the others. E.g. you can communicate values (so imperative will usually involve some functional, even if just to add numbers), and you can calculate actions. You describe properties of values in terms of functions, and you describe properties of state or effect in terms of sensory data and communications. They are all connected. But, nonetheless, they are fundamentally different, as are all approaches to computing them. An easy way to recognize Declarative vs. Functional or Imperative is that Declarative will -always- involve a searching phase (really 'searching' for values, 'planning' for effects). It's simply a necessary step when you've described things in terms of consequents.

How would you classify this?:

  SHOW ALL INTEGERS IN x THAT SATISFY THE EQUATION x = 2 + 2
Which in practice is no different than:

  PRINT 2 + 2
They are interchangeable, at least for some equations.

Classifications: I'm assuming that "SHOW" and "PRINT" are synonyms since you say these are interchangeable. The whole expressions with SHOW and PRINT are both classified as 'imperative' because they describe an action (a communication) to be taken.

Each of them contains a subexpression that describes the value to be printed. The former subexpression, "ALL INTEGERS IN[sic] x THAT SATISFY THE EQUATION x = 2 + 2", is declarative. The latter subexpression, "2 + 2" is functional.

The evaluation mechanisms are, appropriately, disparate - the former requires a search to find all values that satisfy (x such that 'x is an integer' and 'x = 2 + 2'), while the latter requires the application of functional rules that will ultimately transform '2 + 2' into '4'. Admittedly the search will be simple in this case; unifying 'x = whatever' is almost always simple when the 'whatever' is a closure. (More difficult would be "ALL INTEGERS x THAT SATISFY THE EQUATION (x^3 - 4x^2 + x - 4 = 0)" or "ALL INTEGERS x THAT SATISFY THE EQUATION (fib(x+2)=13)".)

You're right that these are interchangeable. Indeed, you can interchange one value-expression with another whenever they evaluate to the same thing, and such expressions exist in abundance. However, that two expressions evaluate or execute as the same thing does not make them equivalent... especially not expressively equivalent. 'Imperative' and 'functional' and 'declarative' are all statements about the nature of expression and approaches to evaluation or execution, and they are not statements about the result of said evaluation or execution. Thus, the fact that two expressions are interchangeable does not affect whether they are, individually, imperative, functional, or declarative.

Consider that one could get the same imperative statements you listed above (which evaluate to 'PRINT 4') by saying: "PUT 2 into Register1, PUT 2 into Register 2, ADDR (1,2)->3, ADDTO (3,48), MOVR(3,1), PUTC" in some made-up assembly, though a little cleanup might be needed to restore the registers. Similarly, one could get the same value as your TqlChainedJoin with a big, 500kB BrainFuck program after hacking direct-disk-access into the implementation. These might be interchangeable, but I'd hope you'd not say they are equivalent.

Equivalent in the specific sense, general sense, or some context?

But what about something like the following? "Result" could be a simple value or a list of results (a data structure with values or perhaps even new equations if not enuf info is available to produce a real-number value). I see the distinction as continuous. PRINT above simply has more constraints about what it accepts than SHOW or SOLVE. PRINT is simply a limited "solver" that only accepts expressions that return scalar. We could internally implement a C-style PRINT statement by using a more general-purpose equation or expression solver, but rejecting/crashing any result that is not a single scalar.
      result = solveEquation(.......);
        or
      result = simplify(x + 4 - 1); // result is "x + 3" because x is undefined
        or
      result = simplify(4 - 1);  // result is "3", the same thing simple languages give you
[EditHint: Move above declarative discussion portion to DeclarativeDefinition or the like] (Maybe DeclarativeProgramming? Or one of its many branches?)


Back to the issue of whether the ordering matters for all "types" of joins: I don't really know. But consider this: what if somebody wanted to invent a custom join for their shop in which join order does matter? If we make a rule that "sequence is ignored" for chained joins, then it may break these situations. I'm tempted to at least supply the sequence info to the "internal" engine, and if it needs it, it has it, if not, then ignore it. TQL/SMEQL would be more flexible to future changes or custom additions this way.

In math, loss of a commutative property relationship(s) does not by itself turn an operation into "declarative". However, not being commutable or not knowing whether a given operation is commutable reduces the choices available to an optimizer. A user-defined-function definition language could perhaps allow a function writer to specify communativeness in relation to other operations, but that's kind of a "heavy" feature. It's probably simpler to say that user-defined functions are assumed to be non-commutative and are treated as such by the optimizer. If a vendor wants to add commutative speficific enhencements when SmeQl becomes big in 2025, they can. I won't stop them :-)

-- top


Column Scope

The issue of column scope needs to be addressed. Some column names may be in multiple referenced tables. The rule perhaps should be that precedence is by the order that the tables are listed. Thus, if "id" is in both table A and table B, and A is first in the list, then A's "id" is returned and B's ignored. Work-around's if a needed column overlaps are being considered (other than explicit joins plus "Calc"). -- top


See also: CategoryTql, TqlRoadmap, AlternativesToNaturalJoins


EditText of this page (last edited February 17, 2012) or FindPage with title or text search