Why do the existing RDBMS vendors allow their RDBMS to have bag tables (optionally allow duplicate rows) and process bags in queries despite going against Dr. Codd's original recipe? Does anybody know of historical materials about the decisions?
And why don't they reduce their usage with say warnings and keywords required to permit them in queries? Or at least have a "bag reduction" config settings that would require warnings and/or outright ban bags? The levels could include:
- 1. Allowing bags without warning (current situation for most vendors)
- 2. Not allow bags in queries or tables without an explicit "ALLOWBAG" keyword or the like.
- 3. Not permit any bags under any circumstances
That's an interesting question. Historically, there appear to be a number of reasons:
1. In the early days of the RelationalModel, there was considerable scepticism about it, including (but not limited to) its insistence on disallowing duplicate rows. It was felt that too much RelationalModel-ness -- whose benefits might not be recognised or appreciated by the largely non-technical audience that make purchasing decisions -- might limit sales.
2. In the early days of SQL, in-depth understanding of the RelationalModel was limited. A typical interpretation of "table" held that it was merely the already-familiar file of records with a new name. Imposing any further deviations from that familiar territory seemed strange, risky, and probably unnecessary.
3. SQL was touted as a replacement for database systems that typically allowed duplicate records, and so precluding duplicates was felt to be too big a change.
4. It was often assumed that the technical process of eliminating duplicates would be too computationally expensive. The BS12 optimiser demonstrated it was possible to avoid invoking the duplicate-coalesce machinery unless absolutely necessary, and for it to involve negligible overhead, but the BS12 team were not encouraged to publish their results. Thus, it was often incorrectly assumed that duplicate-coalescing would represent unreasonable overhead. Notably, that erroneous assumption persists.
5. It was felt that whilst preventing duplicates certainly had benefits, there might be some as-yet-unimagined application which would require duplicate rows to be stored in the database. It would be commercial suicide to have a database system that couldn't do something that a competitor could.
Modern SQL DBMSs persist in supporting duplicate rows for all the same reasons, plus the added weight of needing to support both legacy databases and a SQL "standard" that specifies allowing duplicate rows.
In effect, SQL DBMSs do have the 'ALLOWBAG' keyword, but as its inverse: It's the DISTINCT keyword, which explicitly forces duplicate rows to be coalesced.
- Note that DISTINCT must be used with care. Sometimes the domain has duplication that means something in the domain. Tossing duplicates may result in incorrect results. KnowYourData?. -t
- That's why "modal" DBMS behaviour -- in which it sometimes deals with bags and sometimes with relations depending on per-query keywords or some global switch -- is risky. Omitting DISTINCT when it should be used results in error. Including DISTINCT when it shouldn't be used results in error. Modality introduces complexity and the opportunity for error. It is therefore preferable to entirely dispense with bags inside the DBMS, thus forcing the developer to only -- but always -- consider bags at the point of import/link or export.
- The real-world sometimes requires compromises away from the ideal. Anyhow, let's not reinvent that debate here.
- I'm curious to know what real-world situations would force you to maintain bags inside the DBMS, as opposed to importing/linking them as relations and exporting relations as bags.
- The primary one is systems we don't have schema-editing control over, yet it's our job to use them as-is. See BagNeedScenarios for other possible scenarios.
- If you're using a true relational system like Rel, you can import or link bags from their external source without altering them. Therefore, there is no need to edit any schema. BagNeedScenarios justifies being able to import/link bags to relations, or export bags from relations, but it doesn't appear to justify maintaining bags inside the DBMS.
- We need query languages to work smoothly with multiple database engines. We can't assume a monolithic environment. That's not how it is in reality.
- Indeed. But keep in mind that the majority of query language implementations only work with one database engine directly, the native engine that hosts the language implementation. In DBMSs that support connections to database engines other than the native engine, it's typically done via explicit connection specifications (e.g., connection strings.) Adding a keyword to these specifications to generate a key or whatever, and thus prepare the external data for use in the native engine, represents negligible overhead.
- I thought you were against keywords to manage bag control? If your support is circumstantial, I'd like to see refinement in the rules and the reason behind them.
- I am opposed to a keyword that let you choose, on a query-by-query basis, whether duplicate rows can be produced or not. "DISTINCT", SQL's keyword for this, is an abominable way to correct an abomination -- the fact that duplicate rows can appear everywhere in SQL. Such a keyword forces the developer to make decisions in far more places than is actually needed. As I've argued all along, keywords at the point of importing/linking external data sources to convert bags to relations, or at the point of export to convert relations to bags, are all that is necessary. There is no need to deal with bags anywhere else, so there is no need for special keywords anywhere other than at import/link or export.
- I am not a fan of SQL's current approach either. "DISTINCT" is not the converse of ALLOWBAGS, like you have suggested elsewhere, because disallowing them does not prescribe an adjustment technique, which DISTINCT does. And as I said elsewhere, making a distinction between "input/output" and "inside" a database is anti-virtual, creating an artificial two-worlds linguistic distinction, which the language may not be able to "enforce" anyhow. The Unix concept of "pipe" is powerful because it doesn't make a distinction between device or destination "types". If certain "devices" don't allow stuff, than add a GateKeeper in front of it, don't depend on the content creator/use for protection or cleaning. -t
- By precisely your pipe analogy, you lend support for placing mechanisms for bag<-->relation conversions strictly at the boundaries of the system. By placing them strictly at the boundaries of the system, the DBMS need contain only one type of "pipe" -- that which emits relations. That, in effect, is precisely your GateKeeper. By allowing bags and relations to be present anywhere within the DBMS, you effectively define two kinds of "pipe" -- those that emit relations, and those that emit bags. And, there is no GateKeeper to prevent mixing bags and relations. As for the distinction between "DISTINCT" and "ALLOWBAGS", it's immaterial. In both cases, it is a keyword that can be (or is) needlessly peppered throughout query and view definitions, resulting in a potentially confusing mix of bags and relations, when it could and should be as simple as including at most one keyword on import/link to turn a bag into a relation, or at most one keyword on export to turn a relation into a bag. That means one keyword per data source or target, rather than potentially one keyword (or not) per query.
- Like I said, we cannot control existing systems, but must work with them as-is, and they mostly permit bags. The de-facto standard is bags. I'm not sure how your keyword quantity math is being reckoned here. "one keyword per data source or target" is too large a granularity because the vast majority of their tables/views may have known primary keys and thus we don't want bags "on" by default.
- Sorry, I omitted "at most". It's at most one keyword per data source or target. If the source has a primary key, no keyword is required. If the target doesn't require a bag, no keyword is needed.
- How does this differ from the ALLOWBAG approach?
- First, my understanding of your ALLOWBAG proposal was that it could be applied to any query. I propose that a keyword to convert a bag to a relation only be allowed at the point of import/link of an external data source. That inherently limits the number of locations in a database where a decision about bags vs relations needs to be considered.
- Like I said above, you lose virtualness, and with it consistency and flexibility when you treat one "source" different than another. For example, perhaps we want to create a bag for export purposes, but this requires several steps. We don't want to limit which step is allowed to baggify a result set, otherwise we limit our processing choices.
- But my approach doesn't treat one "source" different from another. It unifies the treatment of all sources so that they are all relations. Your approach allows you to treat one "source" as a bag and another "source" as a relation, and requires two separate but parallel algebras, one for bags and one for relations.
- The vast majority of existing RDBMS already support both. It's perhaps not the ideal, but it's the environment we are stuck with in this incarnation of the universe. We had this argument already.
- The vast majority of existing RDBMSs support SQL. Rel does not support SQL. If we're going to depart from SQL, we're going to depart from its flaws, too.
- That's a practical flaw that will doom Rel from wider use. It would be like offering a new minicomputer that didn't support FORTRAN in the mid 1970's, or a new smartphone without an HtmlStack browser. If Rel is open-source, somebody will probably add SQL anyhow if it caught on.
- Perhaps. That remains to be seen. It is true that C++ retained backwards compatibility with C and that is no doubt a significant part of its success. On the other hand, Java, C#, Python and LUA haven't retained backward compatibility with anything, and they're popular. As Rel is OpenSource, anyone is free to add SQL support if they like.
- When that happens, it will then have the issues I describe. Until then, it's technically "protected" via lack of such a feature. A house without doors or windows (or holes) is indeed "protected" from unwanted intrusion better than one with. Whether anybody would actually want to live there is another matter.
- Any SQL implementation in Rel would presumably translate SQL to TutorialDee, so everything that holds true for TutorialDee would hold true in the SQL(-ish) implementation.
- But that leaves both in the same boat: both SQL and Rel-language could read/write bags to bag-allowing DBMS engines, and neither SQL nor Rel-language could read/write bags to non-bag-allowing DBMS engines.
- That's fine. The difference between Rel and non-Rel SQL DBMSs is that the output from Rel -- i.e., its query results -- are not subject to errors due to duplicate rows.
- But back to the original sub-point above: "you lose virtualness". Suppose you wanted to simulate an external bag for testing and/or training purposes, for example. The "ideal" of virtualness/swappability is at least as powerful as the ideal of "set purity". It's a case of conflicting ideals. Why should your ideals trump mine? We both believe they have practical ramifications. It's not that I am against "purity" per se, I just rank them different than you. -t
- What does "you lose virtualness" mean, exactly? In particular, what is "virtualness" and how do you measure it? Why would you use a relational database to simulate a bag for testing purposes? That's as peculiar as using a spreadsheet to simulate a word processor. It sounds like you're clutching at straws here.
- Bag-enabled DBMS's are the "QWERTY" of table-land: they are here and we must work with them. If you limit the "type" of I/O allowed in one "place" versus another, then you lose the ability to swap and virtualize I/O sources/results, including simulating a bag environment.
- Bag-enabled DBMSs are certainly common and indeed we must work with them. I have shown that it's trivial to work with them whilst gaining provable reduction in the potential for errors. Not only is Rel bag-friendly, its unification of bags and relations into relations gains consistency without any loss of capability. What does it mean to "simulate" a bag, anyway? Rel can trivially emit bags, which is surely more than sufficient.
- Scenario: Create a "local" table that acts like a remote one for testing purposes. We want to test our query/code that is intended to be used with an existing bag table on a remote server, but cannot use the production system for extensive testing/debugging.
- If you need to create a test scenario for an external presumably-SQL DBMS, you're not going to use Rel anyway. You're going to use another instance of your SQL DBMS. Your example is contrived.
- You seem to assume most companies/orgs will always give one lots of time, equipment, and money to install and fiddle for one-off or (would be) small projects. Perhaps you work for a trillionaire with money to burn? Defense contractor, by chance? Perhaps one could use MS-Access as a "simulation" for another bag-friendly DBMS, but it's still two different servers one has to connect up and switch back and forth between for certain kinds of testing and debugging. One org I worked made connecting databases between servers a pain because of security measures. ODBC requires certain ports, and if the routers block such ports, you can't hook them up without a red-tape "security review".
- I don't see anything here to do with debate over bags vs relations in Rel. This sounds more like it has to do with organisational politics and whether or not you're allowed to run separate production, development, and test servers.
- Organizational politics affect what resources and time we have available. I don't select tools based on the assumption of unlimited resources.
- Fair enough, but I still don't see what that has to do with the bags vs relations in Rel debate. If you need to simulate something a SQL DBMS does, use a SQL DBMS. Rel is no more appropriate for simulating SQL DBMS behaviour than MySQL is appropriate for simulating Excel.
- Ultimately it will have to be tested on a representative SQL DBMS, but a flexible tool would facilitate quick-and-dirty approximations for initial or repetitious debugging or testing. By repetitious, I mean cases where one has to tweak something, try again, re-tweak, etc. Certain kinds of bugs require such to find work-arounds. Anyhow, we'll never agree on this. I vote to have an ALLOWBAG flag option that works the same (as possible) regardless of internal or external and will LetTheReaderDecide on which approach they prefer.
- I know the sort of bugs that require you to tweak something, try again, re-tweak, etc. They're the sort of bugs caused by duplicate rows. Anyway, there's nothing in Rel to stop you from emitting bags to the external world as often as you like.
- You're right, we'll probably never agree on this. (At least we agree on that!) I vote to have an optional keyword on import/link of external data that generates a key when there are (or may be) duplicate rows in the source data, and an optional keyword to export a relation as a bag (i.e., exclude specified attributes.) Thus, in summary, I vote that Rel never allow duplicate rows as input to any query.
- The analogy given in another topic was the pipe concept from Unix. It wouldn't be nearly as useful if different sources were picky about the "kind" of pipe content they accepted. For example, if you couldn't redirect console output to a file because files were not permitted to contain certain character patterns, then you couldn't log or check console output offline: you could only study and debug the output while stuff is scrolling by on screen because the file system wouldn't permit a certain "kind" of pipe. It is virtualness and swappability that makes pipes such a powerful tool. One "device" or source/destination can act like another.
- As I've already explained, it is unifying bags and relations under relations that is equivalent to pipes. Keeping them distinct is like having pipes for character streams and channels for video streams. If you pass a video channel to a text filter, you get errors. That's akin to what happens if you pass bags to certain SQL operations. You get errors. Your (non-)universal "bag pipe" merely perpetuates a pointless opportunity for errors that is endemic in SQL.
- I consider bags the more generic (super-set) data structure rather than a different "kind" of thing, and thus more sharable with less technical drama in practice.
- You appear to be conflating data exchange with data processing. The RelationalModel facilitates data processing, and as we've already seen, does so with the the possibility for fewer errors than using bags in SQL. Fortunately, as we've also already seen, data exchange between pure relational and non-relational systems is trivial. Do you really consider a keyword or two at the boundaries of a true relational system to be "technical drama"? And, if you do, do you consider them to be more "technical drama" than the risk of errors due to duplicate rows?
- "Conflating" is the virtualness/swappability attribute. You are forcing an unnecessary distinction for the sake of perceived purity. Why make a "boundary" even more of a boundary?
- I am not forcing an unnecessary distinction for the sake of perceived purity. I am recognising an inherent and fundamental distinction (bags are not relations) for the sake of guarantees of actual correctness. You are ignoring an essential distinction for the sake of perceived -- and only perceived, not actual -- flexibility. Why ignore a fundamental boundary that already exists and risk errors?
- How are you measuring "essential"? Perceived? I guess I hallucinate life. You are a purple rabbit with a horse's tail.
- I meant "essential" as in "of or relating to essence", not as in "required". I meant "perceived" exactly the same way you did when you wrote "perceived purity."
- It's not "essential". You only claim it is.
- It is "essential" in that the essence of the distinction between a bag (in SQL, i.e., a table) and a relation is that a relation cannot have duplicate rows. A bag can. That is the essence -- i.e., the essential characteristic -- of the distinction between relations and (table) bags.
- If you approach them as words instead of as tools.
- What do you see as the essential distinction between tables and relations if you approach them as tools instead of words?
- Second, imagine two scenarios:
- 1. Given an external data source p, let p specify ALLOWBAG. Let there be views a, b, and c that reference p. With ALLOWBAG on p, we have to check views a, b, and c to determine whether incorrect results can be generated as a result of duplicate rows in p. The number of points where we need to check for possible errors due to duplicate rows is equal to the number of views that reference p.
- The right place in the "tree" to put it depends on the situation. It may be possible that only one of the 3 views, say "b", is actually a bag. In that case, that view and only that view would need ALLOWBAG. If instead "p" was a bag table, say an event log without a sequence number, then any query that uses it would need ALLOWBAG to access it (if using the ALLOWBAG convention). Further, we may not have control over p in the organization. We need a finer granularity because what we may be able to control and not control will vary widely.
- I think Scenario 2, below, clearly addresses why that is not ideal.
- 2. Given an external data source p, let p specify a keyword that converts bags to relations. Let there be views a, b, and c that reference p. Given that p is a relation, we do not need to check views a, b, and c to determine whether incorrect results can be generated as a result of duplicate rows in p, because there cannot be duplicate rows in p. The number of points where we need to check for possible errors due to duplicate rows is 0.
- That's a misnomer. Duplicates are not necessarily "incorrect", but rather a bag instead of set. A bag is not an "incorrect" data structure. Anyhow, for reasons given in other "bag" topics, we may not want to force p (or don't have permission) to look like a set via temporary ID numbers or whatnot. I'm resisting the temptation to reinvent those arguments here. We are going in circles again. The query or view designer should have a choice on what it presents based on domain needs (if given sufficient access).
- Duplicates in and of themselves are not necessarily incorrect, but applying a bag to certain relational operators or queries when you think you are applying a relation may result in unexpected (i.e., incorrect) results. The query or view designer loses no overall choice, but is only constrained as to where the choice must be made, and gains certain benefits from that constraint.
- And new risks, such as a temporary/dummy ID that can potentially be mistaken for a permanent key. You are trading one risk for another.
- That risk is trivially mitigated by making all temporary IDs derived from an UNSTABLE type, with appropriate obstacles in place to avoid emitting UNSTABLE type values. Thus, it trades a serious risk of producing incorrect results for an inconsequential risk of producing correct results with a stray temporary ID -- which a developer will have to make a significant and deliberate effort to emit from the DBMS.
- You assume query writers and users will know or care about the application of the UNSTABLE type. That assumption hasn't been tested over time in production yet. (Hmmmm, a tag.)
- True, but query writers and users don't have to know or care about the application of the UNSTABLE type, because it will be automatic as a result of applying a generated key.
- Automatic? Why would you make artificial key generation automatic?
- I meant that a generated key will automatically be of UNSTABLE type.
- Scenario 2 is clearly preferable to Scenario 1 because 0 <= n, where n is the number of views that need to be checked for possible errors due to duplicate rows.
Warnings about bags are unlikely to be implemented for a simple technical reason: Interacting with SQL DBMSs is almost universally via ODBC, JDBC, and various proprietary but semi-standard equivalents to ODBC/JDBC. The vast majority of these are only designed to return, for a given query, a
RecordSet or an error. There is no provision for returning both a
RecordSet and a warning. However, many database design tools will nag the developer if he or she fails to define a primary key on a new table. Some nag the developer if he or she fails to identify a primary key when linking an external table.
{"Warning" is perhaps a misnomer on my part when taken to a technical level. I envision that it would produce an outright error, along with a message about missing the ALLOWBAG key-word (and perhaps a link/URL to explain more). This serves as a de facto warning in that a query writer is "warned" but has the option of continuing by adding the keyword. -t}
In that case, it would be a reasonable addition to a SQL implementation. Obviously, it would not be required in a true relational DBMS.
{I find it a reasonable compromise, especially as it helps with compatibility. Others will disagree. -t}
Are you referring to its addition to a SQL implementation? Or applying it to a true relational DBMS?
{It applies to just about any table-based DBMS that needs to communicate with outside systems and data. Another option is a DBMS-wide switch that prevents all bags, regardless of ALLOWBAG keyword. If your site decides to forbid bags completely, just set the master switch. One gets the error message, "This system has been set to not allow bags. Please contact the database administrator...". The system setting could pick from choices 1, 2, or 3 from the opening options above. "1" would usually be used for installations that have a lot of existing query code and no budget for cleaning and rewrites.}
I think it would be far simpler to simply convert bags to relations on import, and (if necessary) convert relations to bags on export. Rather than risk misusing or omitting ALLOWBAG (or its inverse, DISTINCT) throughout the database, disallowing bags forces any and all bags to be converted to relations on import/link. The number of places where an incorrect decision can be made are thus reduced, as consideration of how to deal with bags is now restricted to the point of import/link, rather than potentially being associated with any and every query. It also means that when you're constructing complex queries that may be based on views, you don't need to know whether the views have ALLOWBAG turned or not -- which obviously may have an impact on the correctness of your query. You can simply write your query, safe in the knowledge that no view can ever produce duplicate rows.
{The query language may not even "know" if it's "exporting" or not. It goes to a "device ID" without necessarily knowing or caring. If it has to know or care and change behavior based on that, it's overly complex, inconsistent, and confusing. It violates the spirit of virtualness. Any GateKeeper belongs at the database engine, not the query language.}
In most modern DBMSs, you can't separate the query language from the database engine. The query language controls the database engine, the sole source of data in the database comes via the query language, the sole means of obtaining data from the database is via the query language, and the database engine hosts the query language.
As for whether a query language is "exporting" or not, if it's exporting to a particular, specified, known-to-the-DBMS target, it should be done explicitly via statements and keywords in the query language. If exporting is going to be done by some external client of the DBMS, then it's out of the hands of the DBMS. What happens once data leaves the DBMS is not, and cannot be, our concern.
It could be that checking for duplicates is harder to implement, and causes a performance hit - but this is just speculation. If programmers want to get their product (like mysql) shipped quickly, they may take shortcuts and implement something half assed. As mysql matured, it got more relational features over time. This tells me that relational is harder to implement and violating relational is much easier to implement.
If you are going to allow bags, why not allow trees, subtables, cells inside cells (multiple values inside cells). All these violate relational/normalization like bags and offer convenience for people who want to do such things. Why not have some keyword that you can turn off/on multi value cells with? If you make your database a "JackOfAllTrades" (do everything and follow no specific model), it can become a mess, with too many features that don't offer anything significant to the table. --AnonymousDonor
Actually, trees, subtables, cells inside cells and even multiple values inside cells[1] do not violate the RelationalModel. Duplicate rows do violate the RelationalModel, because relations -- by definition -- do not contain duplicate tuples. However, nothing in the RelationalModel restricts what types a relation's attributes may be. They can be trees, sub-tables, lists, vectors, graphs, graphics, pictures, XML, executable files, audio, or anything else (and everything else) that may belong to a type. The power of the RelationalModel lies in organising values into tuples and relations and being able to manipulate tuples and relations algebraically. The RelationalModel does not dictate what should or should not be a value.
[When you put cells inside cells, it is not normalized - so instead of saying violating relational model, maybe I should be more precise and say "multi values in cells is not good normalization". Semi colon delimited data inside a single cell ("apple;orange;banana") is not good, because that data cannot be queried sensibly. Putting rows, multi cells, objects, or xml inside cells is inferior to having the data normalized and query-able. Normalization does dictate what should or should not be a value, and normalization is part of the relational model. A simple class (like a struct) without any methods, is very much like a multi cell storage device.. so as soon as you put that class or struct inside a cell, you are violating normalization. You can't easily get facts out of your database when it is not normalized. If the database is very hard to get facts out of then it defeats the purpose of being a database since DatabaseIsRepresenterOfFacts.]
"Normalized" is not a singular state but a series of definitions. The definition for FirstNormalForm is ambiguous. What, for example, is the qualitative difference between "colon delimited data inside a single cell" and, say, a collection of characters inside a single cell? The latter is commonly known as a string. Why is it acceptable to put a string in a cell but not colon-delimited data? Think carefully about your answer.
FirstNormalForm is simply the statement that data should be organised by relations. It isn't about categorically precluding multiplicity in a cell. That's fine and often necessary. (Even an integer is multiple bits in a cell!) It was intended that "multi-valued data" describe non-relations, whereas FirstNormalForm describes relations. FirstNormalForm is really nothing more than a prescription for the RelationalModel.
Re: "If you are going to allow bags, why not allow trees, subtables, cells inside cells..."
This is a SlipperySlope kind of argument, which is generally considered a fallacy. And, we could indeed allow tree-oriented operators "on top of" tables (see RelationalAndTrees). I have nothing against that. But nested cells are usually not needed to model most domains; there are other ways to model the same thing usually. One generally doesn't want to "hard-wire" in one relationship at the expense (difficulty) of others. You reference instead of nest so that you can more easily create other simultaneous references (relationships). -t
[1] Current thinking largely regards FirstNormalForm as at best ambiguous, and at worst meaningless.
MarchTwelve