Why Do Existing Rdbms Allow Bags

Why do the existing RDBMS vendors allow their RDBMS to have bag tables (optionally allow duplicate rows) and process bags in queries despite going against Dr. Codd's original recipe? Does anybody know of historical materials about the decisions?

And why don't they reduce their usage with say warnings and keywords required to permit them in queries? Or at least have a "bag reduction" config settings that would require warnings and/or outright ban bags? The levels could include:


That's an interesting question. Historically, there appear to be a number of reasons:

1. In the early days of the RelationalModel, there was considerable scepticism about it, including (but not limited to) its insistence on disallowing duplicate rows. It was felt that too much RelationalModel-ness -- whose benefits might not be recognised or appreciated by the largely non-technical audience that make purchasing decisions -- might limit sales.

2. In the early days of SQL, in-depth understanding of the RelationalModel was limited. A typical interpretation of "table" held that it was merely the already-familiar file of records with a new name. Imposing any further deviations from that familiar territory seemed strange, risky, and probably unnecessary.

3. SQL was touted as a replacement for database systems that typically allowed duplicate records, and so precluding duplicates was felt to be too big a change.

4. It was often assumed that the technical process of eliminating duplicates would be too computationally expensive. The BS12 optimiser demonstrated it was possible to avoid invoking the duplicate-coalesce machinery unless absolutely necessary, and for it to involve negligible overhead, but the BS12 team were not encouraged to publish their results. Thus, it was often incorrectly assumed that duplicate-coalescing would represent unreasonable overhead. Notably, that erroneous assumption persists.

5. It was felt that whilst preventing duplicates certainly had benefits, there might be some as-yet-unimagined application which would require duplicate rows to be stored in the database. It would be commercial suicide to have a database system that couldn't do something that a competitor could.

Modern SQL DBMSs persist in supporting duplicate rows for all the same reasons, plus the added weight of needing to support both legacy databases and a SQL "standard" that specifies allowing duplicate rows.

In effect, SQL DBMSs do have the 'ALLOWBAG' keyword, but as its inverse: It's the DISTINCT keyword, which explicitly forces duplicate rows to be coalesced.

Warnings about bags are unlikely to be implemented for a simple technical reason: Interacting with SQL DBMSs is almost universally via ODBC, JDBC, and various proprietary but semi-standard equivalents to ODBC/JDBC. The vast majority of these are only designed to return, for a given query, a RecordSet or an error. There is no provision for returning both a RecordSet and a warning. However, many database design tools will nag the developer if he or she fails to define a primary key on a new table. Some nag the developer if he or she fails to identify a primary key when linking an external table.

{"Warning" is perhaps a misnomer on my part when taken to a technical level. I envision that it would produce an outright error, along with a message about missing the ALLOWBAG key-word (and perhaps a link/URL to explain more). This serves as a de facto warning in that a query writer is "warned" but has the option of continuing by adding the keyword. -t}

In that case, it would be a reasonable addition to a SQL implementation. Obviously, it would not be required in a true relational DBMS.

{I find it a reasonable compromise, especially as it helps with compatibility. Others will disagree. -t}

Are you referring to its addition to a SQL implementation? Or applying it to a true relational DBMS?

{It applies to just about any table-based DBMS that needs to communicate with outside systems and data. Another option is a DBMS-wide switch that prevents all bags, regardless of ALLOWBAG keyword. If your site decides to forbid bags completely, just set the master switch. One gets the error message, "This system has been set to not allow bags. Please contact the database administrator...". The system setting could pick from choices 1, 2, or 3 from the opening options above. "1" would usually be used for installations that have a lot of existing query code and no budget for cleaning and rewrites.}

I think it would be far simpler to simply convert bags to relations on import, and (if necessary) convert relations to bags on export. Rather than risk misusing or omitting ALLOWBAG (or its inverse, DISTINCT) throughout the database, disallowing bags forces any and all bags to be converted to relations on import/link. The number of places where an incorrect decision can be made are thus reduced, as consideration of how to deal with bags is now restricted to the point of import/link, rather than potentially being associated with any and every query. It also means that when you're constructing complex queries that may be based on views, you don't need to know whether the views have ALLOWBAG turned or not -- which obviously may have an impact on the correctness of your query. You can simply write your query, safe in the knowledge that no view can ever produce duplicate rows.

{The query language may not even "know" if it's "exporting" or not. It goes to a "device ID" without necessarily knowing or caring. If it has to know or care and change behavior based on that, it's overly complex, inconsistent, and confusing. It violates the spirit of virtualness. Any GateKeeper belongs at the database engine, not the query language.}

In most modern DBMSs, you can't separate the query language from the database engine. The query language controls the database engine, the sole source of data in the database comes via the query language, the sole means of obtaining data from the database is via the query language, and the database engine hosts the query language.

As for whether a query language is "exporting" or not, if it's exporting to a particular, specified, known-to-the-DBMS target, it should be done explicitly via statements and keywords in the query language. If exporting is going to be done by some external client of the DBMS, then it's out of the hands of the DBMS. What happens once data leaves the DBMS is not, and cannot be, our concern.


It could be that checking for duplicates is harder to implement, and causes a performance hit - but this is just speculation. If programmers want to get their product (like mysql) shipped quickly, they may take shortcuts and implement something half assed. As mysql matured, it got more relational features over time. This tells me that relational is harder to implement and violating relational is much easier to implement.

If you are going to allow bags, why not allow trees, subtables, cells inside cells (multiple values inside cells). All these violate relational/normalization like bags and offer convenience for people who want to do such things. Why not have some keyword that you can turn off/on multi value cells with? If you make your database a "JackOfAllTrades" (do everything and follow no specific model), it can become a mess, with too many features that don't offer anything significant to the table. --AnonymousDonor

Actually, trees, subtables, cells inside cells and even multiple values inside cells[1] do not violate the RelationalModel. Duplicate rows do violate the RelationalModel, because relations -- by definition -- do not contain duplicate tuples. However, nothing in the RelationalModel restricts what types a relation's attributes may be. They can be trees, sub-tables, lists, vectors, graphs, graphics, pictures, XML, executable files, audio, or anything else (and everything else) that may belong to a type. The power of the RelationalModel lies in organising values into tuples and relations and being able to manipulate tuples and relations algebraically. The RelationalModel does not dictate what should or should not be a value.

[When you put cells inside cells, it is not normalized - so instead of saying violating relational model, maybe I should be more precise and say "multi values in cells is not good normalization". Semi colon delimited data inside a single cell ("apple;orange;banana") is not good, because that data cannot be queried sensibly. Putting rows, multi cells, objects, or xml inside cells is inferior to having the data normalized and query-able. Normalization does dictate what should or should not be a value, and normalization is part of the relational model. A simple class (like a struct) without any methods, is very much like a multi cell storage device.. so as soon as you put that class or struct inside a cell, you are violating normalization. You can't easily get facts out of your database when it is not normalized. If the database is very hard to get facts out of then it defeats the purpose of being a database since DatabaseIsRepresenterOfFacts.]

"Normalized" is not a singular state but a series of definitions. The definition for FirstNormalForm is ambiguous. What, for example, is the qualitative difference between "colon delimited data inside a single cell" and, say, a collection of characters inside a single cell? The latter is commonly known as a string. Why is it acceptable to put a string in a cell but not colon-delimited data? Think carefully about your answer.

FirstNormalForm is simply the statement that data should be organised by relations. It isn't about categorically precluding multiplicity in a cell. That's fine and often necessary. (Even an integer is multiple bits in a cell!) It was intended that "multi-valued data" describe non-relations, whereas FirstNormalForm describes relations. FirstNormalForm is really nothing more than a prescription for the RelationalModel.


Re: "If you are going to allow bags, why not allow trees, subtables, cells inside cells..."

This is a SlipperySlope kind of argument, which is generally considered a fallacy. And, we could indeed allow tree-oriented operators "on top of" tables (see RelationalAndTrees). I have nothing against that. But nested cells are usually not needed to model most domains; there are other ways to model the same thing usually. One generally doesn't want to "hard-wire" in one relationship at the expense (difficulty) of others. You reference instead of nest so that you can more easily create other simultaneous references (relationships). -t


[1] Current thinking largely regards FirstNormalForm as at best ambiguous, and at worst meaningless.


MarchTwelve


EditText of this page (last edited April 15, 2012) or FindPage with title or text search