Example Relational Join on Tree Equality using SMEQL
A "tree" in our tree-kit will be defined by convention as any table, virtual or actual, that has at least a numeric column called "parent_id" and a primary key called "key". (If the columns are not actually named that, then we can use the Calc operator to rename them. Also, some relational engines can automatically find or require a known key, such that the "key" specification may not be needed. However, we cannot rely on that feature across vendors.)
To identify any tree, we supply the table name and the root ID. This also implies that the same table can potentially "hold" multiple trees, and multiple over-lapping trees. Thus, we don't need a different table for each tree.
Let's introduce a domain function called "treeEqual" that compares the equality of two trees and returns 1 if a match or zero if not a match. It uses the convention we've established for trees here.
(A variation could be a query-level operation, not just a domain operation, but that is not necessary for our illustration. Also, "treeEqual" does not compare on the ID, only the values of matching column names. If we wanted to compare on ID, we could alias a copy of the ID using SMEQL's Calc first. Same with value column matching. SMEQL makes aliasing easy.)
Example "treeEqual" domain expression:
treeEqual(tableA, 47, tableB, 333)
Here are the layouts for our test tables:
bigTreeA // table name
--------
key
parent_ID
myStringA
bigTreeB
--------
key
parent_ID
myStringA // same name as prior table to simplify example
treeListA // table with list "A" of trees
----------
parent_id
descriptA
treeListB // table with list "B" of trees
----------
parent_id
descriptB
SMEQL Join:
r = join(treeListA, treeListB, (treeEqual(bigTreeA,
a.parent_id, bigtreeB, b.parent_id)))
Internal join processing table (abbreviated):
a.pid|b.pid|descrA|descrB| (join clause) | (result)
----------------------------------------------------------------
4 | 3 | zog | gleep| treeEq(bt_A, 4, bt_B, 3) | 1
4 | 19 | mig | drog | treeEq(bt_A, 4, bt_B, 19) | 0
23 | 3 | foo | bar | treeEq(bt_A, 23, bt_B, 3) | 0
23 | 19 | foo | blah | treeEq(bt_A, 23, bt_B, 19) | 1
Etc...
(The "A" and "B" table naming happens to match Join's alias system of "a" and "b" left/right convention.)
One of the most generic (but slow) ways to implement joins is to create a CartesianProduct of the two+ tables, and let the condition (such as a WHERE clause) select which sub-set to use. (One could emulate a left join by appending a null row to right table and putting an "or null" in the condition...I think.)
Old style SQL joins, before the JOIN clause was introduced, had this kind of feel. Although, under the hood shortcuts are taken for simpler expressions to avoid having to do actual complete cartesion products of the tables involved.
Here, the relational engine lets the "domain math engine" execute the expression for each row. But before the relational engine hands off a given row to the domain math, the requested row values are subtituted in the expression, as shown in the "join clause" portion of our internal table. (It may not actually need to store each intermediate expression; for it's only shown in table form for illustration purposes here.)
Only those rows with a True result value (non-zero) would be returned to the result set. The result would be only those trees that match, something like:
descripA descriptB
-------- ---------
zog gleep
foo blah
zork blah
etc...
(Actually, "parent_id" from the first table of the join would also be returned, per SMEQL rules; but it's not shown here to avoid confusion.)
--top
Where is treeEqual coming from? You've waved a magic wand and introduced it, but it's an important question.
- It's a UserDefinedFunction? more or less (or DBA defined). I thought you agreed that the implementation of such was not a key issue at [insert topic].
- I agree that the details of implementation don't matter, but that's not at issue here. At issue here is how the implementation gets to the database, whether new such tests can be composed flexibly as part of a query. At issue is whether such implementations can be made generic, such that they aren't implemented once-per-variation and they have a consistent interface to allow reuse and composition. Those things matter.
- And that's why we try to test them with practical examples.
- No. That's why a good LanguageDesigner tries to test them with examples that push the limits of what they view as likely to appear in practice, since the maker of the RDBMS tool and language can't fully anticipate actual usage in advance.
- Let's agree one should do both. But note that optimizing for extreme cases may de-optimize for the 90% cases. For this reason, practical testing is more important.
- Can you justify that percentage? Or did you just make it up? I do agree that practical testing is important, but aiming for it first smells of PrematureOptimization.
- I am illustrating a point, not citing double-blind studies. And, over-emphasizing ANY metric may lead to premature something. If the extreme cases make the tool overall better for practical purposes, then you have nothing to worry about anyhow. But I'm skeptical this is the case. Languages optimized for academic programming contests (or those that shine well in such contests) often prove impractical in the field. They deal with high-level abstractions well, but fail to handle nitty-gritty exceptions-to-the-rule well. -t
- I agree to some degree, though I suspect that it is far more often the lack of libraries and dedicated cross-compilers that hurt 'academic' languages on the field than it is inability to construct such libraries. Codebase inertia isn't something overcome merely by having a cleaner language.
- If you want bigger libraries you need more users, and if you want more users you need to find a better way to "sell" your ideas to PRACTIONERS. You may find us practitioners annoying because we don't speak your language and don't put as much stock in "ivory tower" works; but if you want what you want, you're stuck with our kind and need to produce realistic coded demos. -t
- [As a practitioner, I would appreciate it if you would stop pretending to speak for me. As a practitioner, I would much rather see a practical implementation of something vetted by the "ivory tower" than your collection of hacks and work-arounds. Defining a tree as a table that has a column with a certain name?!? I feel soiled just reading it.]
- Reinventing all my fav CollectionOrientedVerbs and DB behaviors from scratch for each and every collection "type" and not be able to readily share existing nodes and indexes across these makes me feel soiled, stuck in EncapsulationHell?. -t
- Encapsulation countered in RelationalTreesAndGraphsDiscussionTwo. Reinvention of CollectionOrientedVerbs is resolved by not reinventing them - instead convert (possibly lazily) to the collection type for which the CollectionOrientedVerbs are defined.
- Oh, I agree that one must "sell" ideas to practitioners. What I don't do is make your mistake of believing the language is at fault simply because it doesn't gain market traction. It takes more than incremental improvement to gain market traction in a system where inertia favors the incumbent. Keep that in mind when it comes to your "extensions" to the RDBMS as well, because those will fail for the same reason.
Relevantly, if I have a tree where some of the elements are trees of a different type in another table (e.g. a tree that contains integers and trees of strings), how would I modify treeEqual to work?
That would largely depend on the "domain math engine". In tight-typed system, it would probably generate a "type mismatch" error. In others it may try to convert per type conversion rules, based on the types defined in the schemas. I don't see how this question is a DB-oriented question.
Ummm... I think you failed to grok the problem. This isn't a "type mismatch" error (if it were merely a type mismatch, I'd agree with you that the problem here is contrived). Instead, I meant a properly typed tree of integers containing trees of strings. In a functional language, the an example type might look similar to:
define Tree A B = tree A (Tree A B) (Tree A B) | leaf B
define TreeOfIntegersAndTreesOfStrings = Tree Integer (Tree String Unit)
In your approach, this would result in a couple different tables:
intTreeTable
------------
node_id
parent_id //(refers to intTreeTable.node_id)
nSequenceNumber // 0 for left tree, 1 for right tree
bLeaf // true for leaf, false for tree node
nMyInteger // integer associated with tree node (unless this item is a leaf)
strtree_node_id // parent of tree of strings (ref. to strTreeTable.node_id)
strTreeTable
------------
node_id
parent_id //(refers to strTreeTable.node_id)
nSequenceNumber // 0 for left tree, 1 for right tree
sMyString //(string associated with a node)
// nothing needed to support leaves since they contain only 'Unit'
I hope the above better explains/clarifies the situation to you. My question was: how do I apply or modify 'treeEqual' to work for the intTreeTable, since I don't simply want to compare one 'strtree_node_id' to another - I need to know whether those are 'treeEqual' as well. I do not consider this an irrelevant question; at the moment, I see it as a significant flaw in your approach, made even more difficult since you're working with 'temporary' tables and such. It seems your approach only supports "shallow" trees, where the columns don't contain references to other DomainValue tables.
What does this have to do with comparing trees in the original example? Are you changing the example? First you wanted to see it do X, now you want it to do Y and complain that I didn't illustrate Y. And Y is still ill-defined. -t
The "original example" (tree of integers) is one instance of a "problem" (performing joins for equality between trees in relational). Solving the example but failing to solve the problem is a failure on your part, and is not an attempt at deceit or moving goal posts on mine. I also believe your case is hurt by your conjuring "treeEqual" out of whole cloth. I believe that providing you another example of the same problem will help you see why your original solution was simplistic and insufficient, and I suspect that you'll resist creating new domain functions every time you're presented with an example, which will help you focus on a real solution.
Are you asking how a given node can participate in being parts of different trees at the same time? -t
No, that really isn't what I'm asking. I asked how to perform treeEquals when one tree contains another tree (or graph, or matrix, or other complex value). By your question, are you implying a belief that strTreeTable should be folded into intTreeTable, perhaps to fit it into the mold supported by 'treeEqual'?
- Yes. Because like I said below, one often cannot control the source/format of the originating data. Things will often be in different formats to begin with. One of your nested trees may be from another company that never heard of our internal trees before contact. Thus, instead of striving for accurate pre-definitions (essentially IS-A), we focus on the minimum necessary for "treeEqual" to work (and still be relational). Some conversion or pre-processing of the data may be in order. Maybe a type-centric approach would work in a very controlled, pristine environment, but I don't encounter those very often. Maybe if you can find a way to make "contains" be virtual or situational, you may have something. Then recast your favorite FP operations, like Fold, into a relational-like form (they don't seem far off), and we may even find convergents and end all the bickering. -t
- I think your justification (cannot control data) contradicts your conclusion (pre-process data to fit it into the monolithic table with preset column names). And I'll suggest that type-centric approaches tend to create controlled, pristine environments, and that pre-processing data as you suggest is no less an AdapterPattern even when you're dumping the result data into strings or universal EAV tables. I wouldn't mind exploring ways to express values relationally in a standard manner, but I think it would be more profitable (and more practical) to support generic indexing over value sub-structure that can precisely meet the needs of the user. I.e. rather than explicitly breaking stack nodes and indexing on 'color' just in case you later need to find all red nodes, simply store the stack values normally and tell the RDBMS to index with a function that returns a relation from color to node and stack, or tell the RDBMS which queries it is likely to see so it can optimize for them.
- How about a nice coded demonstration for a domain/case that readers can relate to, not just lab AI experiments.
- If you think coding up a non-trivial project will help you better work out your thoughts, then go ahead. But I think it wiser to explore stuff on paper first.
- My thoughts? You mean to demonstrate your claims that it would make real software simpler and cleaner.
- I don't believe that code clarifies things. I think code, unless simplified to "lab toy" levels, invariably muddies up whatever you were trying to clearly explain. I demonstrate my claims with simple problems and solutions that I can justify. So, yes, the idea that a coded demonstration would be "nice" is entirely yours.
- Most coders find code clearer and more precise than English. There are too many concepts that English makes ambiguous or fuzzy. For example, "put a tree into a table cell". What exactly does "into" mean here? A reference (address/ID) is put in the table cell, or all the node data also? Code and data dumps tend to make this more explicit, or at least provides a more exacting framework from which to compare the options. -t
- Oh, I agree that code is precise, but too often people will mistake properties of the implementation for properties of the system being implemented. And sometimes the ambiguity of English is usable. For example, "put a tree into a table cell" allows ALL of the possibilities you named, plus a few (including hybrids), and none of them are incorrect; rather, some are simply more or less efficient than others, which allows a variety of implementations, some of which will be more or less optimal. Code and data dumps allow you to see one "point" in a space when one really wishes to discuss the whole space. Code does have its uses in proofs, but I suspect you will try to apply it more broadly than it really deserves.
- The upsides are better than the down-sides. People may envision magic generic machines in their head, but such visions are hard to communicate. And as you like to point out, there's lots of room for HandWaving in such notation-free notions (intentional and unintentional). A compromise may be some kind of pseudo-code. It's fine by me as long as one is willing to explain it when questions are asked.
- Upsides > (Downsides - Time and Energy Spent without balancing Recompense) may be true. I don't mind pseudo-code solutions, that's what I really meant by favoring simple problems and solutions.
If you mean "how do we represent and compare a tree of mixed types"? In practice, we often don't have a choice of how the source data is represented. The user of the tables is quite often not the original builder. We must perform our operations on existing structures, or make copies/views of them and reshape them to our fitting. Thus, I would first ask: how is the original source data represented? One approach for comparing is converting the integers into strings, something like an
AttributeTable with a Parent-ID. Another approach is to have a string column and a separate integer column in our virtual table node. One or the other may not be populated (null or empty) in a given node. -t
I'm not asking "how do we represent and compare trees of mixed types" - that's implied, since in the general case all trees are of mixed types (node vs. leaf, value held in node vs. value held in leaf). You can presume the original structure was a tree-form DomainValue that contained no explicit 'node id' or 'parent id', and that someone took your advice and destructured it into tables. Possibly, they could use one table for each 'type' of tree, so a tree containing trees might contain two tables. But it seems you're favoring TheMostComplexWhichCanBeMadeToWork approach of stuffing everything into one big table (FearOfAddingTables?) - an approach that may seem viable (if brutishly ugly) until I start putting references to Matrices and other complex types into the tree.
Where did I say anything about stuffing things into one-big-table? You mean for converting diverse sources?
- I ASKED: are you implying a belief that strTreeTable should be folded into intTreeTable, perhaps to fit it into the mold supported by 'treeEqual'? YOU ANSWERED: Yes. RESULTING CONTEXT: Two tables stuffed into one big table.
- YOU SAID: One approach for comparing is converting the integers into strings, something like an AttributeTable with a Parent-ID. Another approach is to have a string column and a separate integer column in our virtual table node. One or the other may not be populated (null or empty) in a given node. RESULTING CONTEXT: Just one table for two types of trees with either varying interpretations of data or extra, unused columns.
- For both cases, you favored stuffing things into one big table. I don't really care whether it's about "converting diverse sources" or not, since I consider that mostly to be your illegitimate HandWaving excuse for not looking for other answers to 'Tree Integer (Tree String Unit)'. After all, 'diverse source' was not part of the original problem statement. The real issue is that I can't use "treeEqual" in the general case of one tree table containing a reference to another such table, so I'm forced to merge all tree tables into one big table just-in-case a new tree table chooses to add references to the others.
- What exactly is "unit" in your example?
- Unit is a type with only one value, unit. Since it has only one value, unit can be encoded in ceiling(lg(1)) = zero bits.
- And "stuffing" is misleading. A view that comes from multiple sources is not necessarily an independent copy.
- So your version of "treeEqual" requires one first perform a query to construct a monolithic view? Ugh. Well, it's better than an independent copy, but it still is slow, not composable, not reusable, and rather complex for comparisons between slightly different tree types. I wonder how one avoids 'node_id' conflicts when combining multiple tables into one 'view'...
- A good optimizer could potentially not copy anything, but merely provide us a virtual view of a single table despite complicated underpinnings.
- A naive optimizer might do that... but there is a balance between performing copies and maintaining the structures to avoid them.
- And nesting is generally anti-relational.
- No. It isn't. On that point you are simply dead wrong. Complex or "nested" domain values aren't even a little bit anti-relational. They are anti-TopMind's-concept-of-relational, but that's hardly relevant.
- We reference, we don't nest.
- Relational relates, doesn't reference. There is a difference. I would suggest that "auto-numbering and surrogate IDs are anti-relational" is a far more defensible position than your own.
- Your reasons for insisting on it appear contrived to me. If you need it in special domains, fine, but we shouldn't ruin relational in general just to fit a few odd domains.
- Your arguments against use of complex structured DomainValues appear to me like WishfulThinking and HandWaving. And I agree that we shouldn't ruin relational just to fit a few odd domains, but the ruining of relational is not at stake here.
- YOU are HandWaving because you think your anecdotes are good enough. They are not. Stop waving and start coding.
- What I think is that there is a lot more than 'anecdotal' evidence for use of types and structured values from a broad variety of languages, domains (including math, science), and existing systems, and that until you start "table-izing" strings you're a hypocrite.
- And they largely stink when scaled past a certain point because one has to reinvent lots of stuff for them that come out of the box with a DB. And most of them are very custom per app/need, not generic kits because a full-functioned generic kit starts to look like a DB of sorts (GreencoddsTenthRuleOfProgramming).
- [I don't know what you mean. Please explain.]
- Every node should be sharable with every other "kind" of structure if we want to, use existing indexes, and be able to participate in existing CollectionOrientedVerbs without implementing them from scratch for each new structure. (We've been over this already and have a {contentious} draft list somewhere.) Encapsulation can be frustrating. -t
- [If you need to explicitly share every node with every other "kind" of structure, explicitly use existing indexes, etc., and find encapsulation frustrating, it sounds like you've attempted to use type systems (or their equivalent) to represent value collections rather than types. Attempting to represent types using collection mechanisms is equally frustrating.]
As far as measuring complexity, we'd have to check it against real-world situations. If you think all the input sources will be from a pristine consistent type system, I'll suggest you are dreaming. People providing info don't live just to make YOUR life easier. The conversions will likely be ugly either way.
It doesn't matter whether sources are "from pristine consistent type systems". People will provide an AdapterPattern either way - which means that cannot be argued as a point in favor of your argument. And you're wrong about people providing information: people provide information based on the formats that make processing that information easier. That's why you have "forms" in businesses, and "fill in the bubble" tests.
Adapters can always be made with enough effort. And views that provide a single table viewpoint to multiple diverse structures *is* a form of adapter.
Correct. Which means we both need to use adapters, and therefore you can't argue one approach worse than the other on that account.
I'm arguing that RDBMS are a longer-tested (road-tested), more familiar, and better understood tool. I'm just proposing extending them, while you are basically starting your "type DB" from scratch. Maybe a mature type-DB would be powerful and flexible, but it's economically logical to attempt to extend what already exists rather than start from scratch (unless a big show-stopper is discovered). We could do both, but the relational-based approach is likely to be ready first because it's not starting at the starting line and people are already familiar with RDBMS.
Saying your approach is "relational-based" as though the other is not is a lie, or at least a gross misrepresentation of which you should be ashamed. Use of structured DomainValues in an RDBMS still results in an RDBMS. What does need changed is the DataManipulation language - SQL isn't designed for extension to structured DomainValues. The one advantage you might have is some ability to compile your SMEQL to SQL - one-time implementation costs are lower for you.
By "structured domain values" do you mean domain values that are structures? The "ed" versus "es" needs to be cleared up here. Structure "types" are generally "opaque" to relational, as you would say, and making them more relational would make them less typish.
I refer to DomainValues constructed of more primitive values, or equivalently whose type is constructed of more primitive types and may be recursive. Examples include records, tagged unions, and even relations (a relation as an attribute value), etc.. Most of them, with the unique exception of relation, are 'opaque' to relational operators, though use of a function that returns some aspect of a value as a relation would allow profitable use of relational operators on arbitrary values (especially useful as multiple return values, like the square root of a number). Not certain I'd say relations are less 'type-ish', but I suppose one might call it more or less so based on the degree to which constraints are enforced.
Re: "And I'll suggest that type-centric approaches tend to create controlled, pristine environments" - I suspect that if your ideas made their way into practice, you'd find that people view "types" different enough from you that the consistency you seek to make things play together nicely will not really be there after the first developer leaves. You are assuming your philosophies will be taken up all or nothing. Once buzzwords wear off, people tend to mix-and-match according to personal internal psychology.
I'm making no such assumptions. You're imagining something very different from what I said. I did not say "global" pristine environments, for example, and I fully believe that there will be AdapterPattern operations between environments. I do not require "all or nothing" adoption any more than any introduction of new standards requires it.
See above about adapters.
MarchZeroNine