Bag Atational Discussion

Continuation of BagAtational:

A second practical benefit we lose is the ability to detect and prevent double-entry errors. Presumably, if we want duplicate rows in a table, then duplication has meaning. In a system that allows duplicate rows, that meaning is hidden. The existence of a duplicate obviously means something (otherwise we wouldn't be allowing duplicates), but we choose not to indicate it. So, if we accidentally double-enter a row, how do we distinguish that erroneous extra row from a meaningful duplicate?

Allowing duplication and using it "all the time" (paraphrased) are two different things. I never suggested "using it all the time", rather allowing it as an option. The real world is not programmed by a fastidious god(s) such that we need to be able to handle true domain bags without a lot of fussing.

I already gave an example of a customer/partner request that primary keys not be included in data sent to them. Another scenario is records from logging devices. Suppose the logging devices don't send sufficient info to guarantee a unique record, maybe because whoever set them up were lazy about making sure the ID's are set correct, and/or it only sends seconds instead of milliseconds, etc. I've worked in many projects where we have to accept customer or partner data as is. We were not permitted to lecture them on relational purity or whatnot. In another topic I describe how we found out that a state goofed and accidentally made duplicate car license plates. We can't send them to relational jail; plus the plates were already out there in the wild such that jailing them wouldn't fix it.

As far as performance, if it's possible for the RDBMS to know that all tables involved have a unique primary key-set, then it can use uniqueness in it's computational assumptions. I don't dispute that using bags in a given relational process may result in performance loss (or at least am not challenging that right now). But IF the involved tables are known to be unique, why would the mere option of allowing "bag tables" (outside of the current concern) noticeably impact performance of queries that only involve unique-slated tables? You seem to imply the existence of the option of bags alone dooms performance, which is almost as weird as quantum physics concepts. I request justification or clarification.

--top

Adhering to the RelationalModel and eliminating duplicates within the model does not prevent trivially outputting duplicates, via either an ARRAY (in TutorialDee) or explicit code, should this be what your client demands in a report or file. Can you show any circumstance outside of that where maintaining duplicates within the database makes sense, and that cannot be trivially and more appropriately addressed with surrogate keys and the like?

Sure, sticking a surrogate key (such as auto-increment integer) on our "sloppy logger" example will de-bag our table. In fact, in many if not most SQL-DBMS, an internal unique row ID is created anyhow. They just have properties that makes them unsuitable for display, such as post-deletion reuse. Thus, most existing RDBMS are not actually using the dreaded bag anyhow and you may be fussing over nothing. I just don't think a query language should have to do goofy programming to simply remove primary keys from the final output. -t

You haven't answered the question: "Can you show any circumstance [...] where maintaining duplicates within the database makes sense, and that cannot be trivially and more appropriately addressed with surrogate keys and the like?" Obviously, the "sloppy logger" can simply define a surrogate auto-number key.

{Top's argument regarding the "internal unique row ID" is non-sequitur. I'm not "actually using" the internal implementation details. As a client to a pseudo-relational service, I'm "actually using" the pseudo-relational interface.}

My point is that "bag" is a viewpoint in practice. If the internals really need to see unique rows for performance reasons, they can if they so choose.

{The performance issues are associated with the external semantics, and are unrelated to the internal implementation. (In particular, duplicates multiply across joins.) The performance issues for bags, though, aren't nearly so severe as they are for 'ordered' bags (aka Arrays), which are what TopMind really promotes. My own concern about bags is more semantic: the proliferation of duplicates has unclear meaning, and depends heavily upon the 'access path' to certain data. (Bags sacrifice some of the AccessPathIndependence achieved by the RelationalModel.)}

{Why does Top believe the query language should be responsible for "final output"? And why would he wish to enable this for intermediate compositions? Shall I expect SmeQl to print formatted HTML? Am I to expect the ability to perform relational operations, such as joins and unions, on HTML-ized views? To what degree should formatting outputs - hiding certain columns or marking negative numbers to show in red - be the responsibility of a query language?}

I generally expect the query language's abilities to stop at the "cell level". In other words, it should be sufficiently powerful enough to deliver the "cells" that I want in the order I want (as a grid/matrix). Sub-cell formatting such as borders or color are outside of the realm of the query language. In other words, "Here are the columns I want and here are the rows I want". Why would I want a crippled SELECT statement that forces me to include keys? You are asking the formatting tools to have a more powerful/practical SELECT ability than your query language. If it's useful for the formatting tool, then why is it not useful for the query language? Why the "shift"? Formatting tools are lower-class citizens that are allowed to get down and dirty with the mucky bags, but your query language is aristocracy that needs the purity of uniqueness? -t

{Nobody is forcing you to include keys, TopMind. If you project a relation to remove the keys, you'll only lose any multiplicity. If you only wish to exclude keys from a user's view, it should be as simple as hiding the key column.}

So you agree it should be easy to hide a key column from the final result? And please define "multiplicity".

{Sure, you should be able to hide a column from a report just as easily as hiding a column in a spreadsheet. And the reference to 'lose multiplicity' above refers to duplicates that will not exist in a proper relational projection that removes the key from a table. I did not believe saying you'd 'lose duplicates' is appropriate since, by assumption of 'project a relation', there are no duplicates to lose prior to projecting away part of the key.}

{And perhaps you should spend some time answering your own question: when IS the formatting useful within the context of the query operation? Ignore the formatting for the end-user or report for the moment, and answer: where would this feature be useful for sub-queries, that the end-user will never see?}

Your formatting question is not clear. Controlling which columns to include and in which order is a useful feature of a query tool, especially for ad-hoc queries that are not part of a formal application. I often use ad-hoc queries to inspect data such as hunting for bad data or suspicious patterns. I don't want to have to go to a second tool just to lop off columns that make it too wide for the screen or change the display order. That slows me down and costs the company real money.

{Nobody is saying the different languages need to be in different tools. SQL involves several languages in the same tool, i.e. for transactions and access control and cursor management, for data definition, and for data manipulation, and even for procedures and triggers. These languages interact in various ways, and are often somewhat layered. TutorialDee has been brought up before as using 'ARRAY' as a presentation format. Having been faced with such well-known counter-examples, repeatedly, why do you persist in maintaining the straw-man argument that a second language is going to "slow you down" or force you "to go to a second tool"? Is obstinate ignorance part of your grand strategy for convincing others?}

As far as sub-queries, the only examples I can think of right now are times where the key was compound and it slowed down processing and made the query verbose to include it all. What SQL engines usually do it assign an internal surrogate key, a kind of "row ID" per above. I simply took advantage of that feature to avoid carrying around the baggage of the compound key into the rest of the query. In other words, the query engine or RDBMS provides a de-facto surrogate key, and I took advantage of it without having to actually see it. (For this reason, SQL RDBMS are not really BagAtational after all, as described above.)

Another case is where the table is controlled by another party and they didn't include a proper key. My stuff can only read it. It is somewhat similar to the logging example above. I couldn't practically say, "You've committed a relational sin, and therefore I won't do business with you." If people get word that Rel DB's don't play well with the existing world, they will fail.

{RelationalDatabase will work fine with the existing world. If someone provided data with duplicates, of course you'd need to reformat it for relational: either remove the duplicates, or add a row number. Formatting data to make it consumable by your database should be nothing new to you, TopMind. Would you condemn SQL as being unsuitable for the existing world because sometimes you need to format input data? No? Well then why engage in hypocrisy by attempting to condemn true relational for the same reason?}

But you are making the task harder by being personally obsessed with purity.

{If I provide you data in XML, do you claim that dumping the data into 'tables' is "making the task harder by being personally obsessed with tables"? Or are you just being a hypocrite at the moment?}

I don't see your example as equivalent. It's apples to oranges. It's not about which base tools we use, but about the features of them. Due to real world situations AND compatibility with existing RDBMS, it appears rational to accept the option of semi-bag tables until the point where clear downsides are shown. I don't want to give up useful features become some people are baggophobics or puriphiles.

There's probably a good reason why early SQL DBMS allowed semi-bags. They had working lab examples of early relational systems and could see the impact of strict versus loose when they tested it on real-world data and queries.

Early SQL DBMSes allowed "semi-bags" (!?) because there was a (largely mistaken) belief that eliminating duplicates would represent an unacceptable performance and resource hit. That's the only reason.

What you propose is not nearly as trivial as leaving out keys from a SELECT statement. Does REL give out a "Purity Error" and dial the Purity Police?

Actually, what the RelProject does is almost precisely as trivial as leaving out keys from a SELECT statement. It's a WRITELN statement, by the way, and REL is a company that makes subwoofers. Rel is a name, not an acronym. (Though I do plan to buy a REL subwoofer at some point, just for the name. {And because I'm not that happy with my KEF subwoofer.})

As noted above, it's trivial to add unique IDs when importing data and trivial to remove them when outputting data. There is no reason to maintain duplicates between these two endpoints. Let me emphasise that: In a system based on the RelationalModel, there is no reason to maintain duplicates between data input and data output.

Again as noted above, one of the main reasons for avoiding duplicates is because they represent semantics but lack any description. Duplicates mean something, but what? And how do we distinguish them from erroneous double-entries?

Again again again, if it causes problems, then don't use them. I'm only asking for an option, which you say costs gazillion dollars to fulfill.

Absolutely it causes a problem. It adds complexity, both for the user and the developer. Setting aside user issues for now, to provide the capability to preserve duplicates from input to output in the RelProject would require a parallel set of data structures (bags, instead of relations), a parallel set of operators (to work on bags instead of relations), and an entire bag-oriented query optimizer. Given that there's no reason to preserve duplicates between data input and data output, I will not be implementing bags.

Existing RDBMS don't either: they have internal surrogate keys.

How (and whether) duplicates and/or row IDs are handled internally varies from DBMS to DBMS. Given that there's no reason to preserve duplicates between data input and data output, this is a non-issue in Rel. Neither duplicates nor row IDs are required (nor even desirable) anywhere between the data input and data output points.


Internal Versus External Bags

Moved from OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy:

["Kind of topics???" Is there a theme to our arguments? I hadn't really noticed, other than to note your dislike of object-orientation and truly-relational databases, and an inclination to disagree with anything that isn't promoting ExBase or some close cousin to it. These days, that's pretty much everything. I've posted links to two articles that deprecate bag-oriented databases, one here and one on the BagAtational page.]

That was about machine performance, not the software creation and maintenance (query language) issues. And further, they are not really bags and/or can be converted to sets under the hood, and you conveniently deleted that issue in what appeared to be DisagreeByDeleting. A more practical performance question would be how explicit artificial primary keys (auto-keys) compete against internal primary keys and/or manufactured keys in cases where there's no natural domain key. I don't propose one use bags if there is a natural and strait-forward domain key. As I said in the stuff you deleted, it's not about bags versus sets, but one flavor of sets versus another flavor of sets.

[It wasn't DisagreeByDeleting, it was refactoring and removing all the cruft that added absolutely nothing to the points made at the very top of BagAtational. Anyway, if bags aren't really bags, then this and the entirety of BagAtationalDiscussion are moot. We can, as it turns out, dispense with bags. We can import them and convert them to sets, we can export sets as bags, but in between -- within the relational system -- we don't need bags. It's what I've said all along. I'm glad you agree with me.]

I disagree with your assessment that it "added absolutely nothing to the points made at the very top...". Anyhow, it can be a "bag" to the user but be a set internally, and usually is. When you say "we", are you talking about the internals of the system, or the user? It is possible to give a set to the internals and a bag or list to the user.

As far as the title of the page, perhaps it needs a revisit, but won't change the issues being compared.


Efficiency Relativism

As far as the optimization claims, it's a trade-off. Certain types of operations are more efficient under either. For example, inserts are usually cheaper under bags because the system doesn't have to check for duplicates. (There may be special implementations where the difference is small or non-existent, but they have other performance trade-offs.) I'm pulling another EverythingIsRelative on this one. No structure has been blessed by the gods more than another. (Under MaspBrainstorming I claim that maps may be "better" than nested lists as a base structure, but it's generally a WetWare claim, for I don't claim any fundamental law of nature favors them so far.) --top


See Also: DynamicRelational, RecordBasedDatabase, BagSetImpedanceMismatch


CategoryDatabase


AprilTen


EditText of this page (last edited April 18, 2011) or FindPage with title or text search