Using Bags For Privacy Purposes

Continued from BagSetImpedanceMismatchDiscussionTwo

Further, sometimes it's done for security/privacy reasons. You may want to supply a subset of employee info to an outside firm, but omit the employee number because it's not needed for the purpose of the request and a possible privacy violation.

Newspapers are asking for such info of government agencies of late to see if gov't employees are over-paid because Rush Limbaugh etc. are making it a political issue. (Rush and the subject of "bags", how fitting.) A newspaper doesn't need to see the employee number. The data supplier could generate a dummy key as it's being "saved", but it would be an unstable key and lead us to the same problem. The only full solution I see would be to make a surrogate key that "stays with" a given employee for the duration of the record in the employee master table. But that puts an extra burden on the data supplier. It's another key and index that has to be carried around. It's a lot of work just to satisfy the Set Gods.

Further, the newspaper could use it to make inferences about promotion patterns because the key makes an entity instance track-able over time, which is not the stated goal of sharing the info. The gov't agency, employees, and unions will want to supply only the bare minimum info necessary to satisfy the request. A stable keys goes outside of that. -t

  Salary_Info table
  ---------
  first_initial
  last_name
  rank
  job_title
  years_experience
  college_years
  yearly_salary

True, the paper may be able to also make promotion inferences using a guess-a-tron to match prior data sets with say 90%+ accuracy, but still nobody will want to give them more than absolutely necessary.

Using a true relational system like the RelProject, the above can be easily generated as output from a relation or imported/linked and a generated key produced to input it as a relation. There's nothing here to indicate why we'd want to actually keep it as a bag. You haven't indicated, anywhere on this Wiki, why the RelProject -- or any other true relational system -- should need, or even want, to manipulate bags beyond being able to import bags into relations and export relations into bags in order to trivially accommodate the occasional flawed external system.

Yes I have, you just ignore them and dismiss economic models for reasons that escape me. Our tools are to do work and deliver product to customers, not make ivory-tower zealots happy. You have a "purpose" problem. Rel is unnecessarily complex in having arrays being separate from tables, etc. It's type-happy. And it's not just about external systems: a query language that wants to challenge SQL should be able to be hooked into an existing RDBMS's engine without having to rewrite existing schemas and views. Being bag-phobic makes this more complicated.

No, what you've shown is only that bags may need to be imported or exported. You've not shown that they need to be maintained or manipulated as bags. And why would they need to be? If you import a CSV file into a DBMS, do you need to keep it as a CSV file in the DBMS?

Neither side "needs" anything (in most cases). "Need" is thus not a differentiating concept.
Fine. If you import a CSV file into a DBMS, do you want to keep it as a CSV file in the DBMS? Likewise, you've not shown that there's any impetus to maintain or manipulate bags in a DBMS.
Yes I have: somebody else has a bag table and tells you not to change the table layout and just do the job asked of you. Oracle-style RDBMS are not going away anytime soon. Our tools should accept that fact.
Importing or linking a "bag table" into Rel does not require changing the table layout at its source, so why would anyone care if you add a generated key in Rel? If you're allowed to use Rel at all, you're almost certainly going to be allowed to add a generated key in Rel.

It's been explained to you multiple times that ARRAY is closer to SQL's CURSOR construct than any persistent structure like a table or RelVar. It is disengenuous to persist in using deliberate "misunderstandings" to try to make a point. A query language that challenges SQL needs to address SqlFlaws. The RelProject does so, and people are using it for practical purposes; no existing schemas or views need be rewritten to link to external SQL DBMSs. At worst, the occasional keyword is needed to identify how external systems should be interfaced. That seems more reasonable than risking incorrect query results in order to avoid the occasional keyword, doesn't it?

I've given a suggestion that would greatly reduce accidental primary key dropping without outright forbidding bags. Your position is an extreme one. And you conveniently ignored the risk of a temporary key being mistaken for a permanent one. Pretending like that risk doesn't exist doesn't make it non-existence. Your bias is showing.

I showed how temporary keys can be trivially excluded in reporting tools, etc., by making them all derived from an UNSTABLE type. This is a non-issue. Furthermore, you've conveniently ignored my inescapable point: "At worst, the occasional keyword is needed to identify how external systems should be interfaced. That seems more reasonable than risking incorrect query results in order to avoid the occasional keyword, doesn't it?" Finally, you haven't dealt with the fact that allowing both bags and relations, in parallel, in a DBMS increases complexity. Complexity, unless definitively necessary, should obviously be avoided.

Who says it's a "report"? Why should we distinguish between one transfer/communication method and another? Why are RDBMS to be "blessed" as holy, but not other techniques? The journalists may load the CSV data into their RDBMS. We probably won't know; we just deliver what was agreed to. And how is UNSTABLE different enough from a keyword to matter? And a person using data not from Rel won't know it has an "unstable" key. Rel is a WalledGarden.

{If they can't answer the question about what they want as a csv file, that is their problem - you ship them a basic CSV file that you can easily create, and if that isn't good for them, they will complain and ask you for a different CSV file, or they will modify the csv data themselves. If they want to customize or change the data to a slightly different format, they can do so. I've dealt with CSV files before that aren't perfect; so I change the column names, fiddle with the data types in the schema that they shipped me, etc. That's another thing - are they shipping a schema with the csv file or just plain csv data with no schema? A lot of important details can be lost without a schema telling you information about the database constraints. CSV files aren't perfect, people can modify the data layout to their needs. }

But "give them what they want and only what they want" would apply to the database side also. If you are a data programmer why should you put in extra info that nobody asked for? If you are out to "fix the world" and inject temporary keys to "make the universe right", wouldn't that also apply to CSV's? If you wish to argue with users and technicians and aggressively insist on "no bags", then to be consistent you should evangelize against bags in all forms.

{The CSV file format doesn't even dictate that you need column names in the file - aren't column names useful in databases? You can optionally put column names at the top of the csv file and your database can import those column names. Are you therefore saying that databases shouldn't even have column names, because the csv file might not have column names so why should databases have column names? Basing your database on csv is bass ackwards. Csv is a physical flexible file format, much too general purpose. Are column names just nanny state dictating your columns? Should columns always be name-less? CSV format could be extended to allow a schema inside the csv file, but since the schema can be shipped separately it might be better to keep csv the way it is.}

If columns names are left off, hopefully it's done for a good reason. Any tool can be misused. Some database tools will still import non-column-named CSV files if I'm not mistaken. They use dummy names such as "field01, field02, field03, etc.". It's a good way to get something up and going in short notice. One can fix the column names later. For example, suppose you only need 4 out of 30 columns in the imported table for a project with a big deadline. You could use the tool to import it using the "field01..." convention, rename the 4 fields of near-term interest, finish your project, and adjust the other 26 column names during a less hurried time. This is called flexibility: LimpVersusDie. Of course such usage depends on the domain. I wouldn't want too much of it for say medical or space applications. It's better to be anal-to-the-max in those domains, even if it takes more time and resources. -t

{But often business apps need to be anal too - for example a bank wants to have correct transactions. A police station bloody well better be correct about the crimes it is documenting - one mistake in the database and that could put a person in jail for a crime he didn't do. It could be as simple as a boolean being set to "Tru" instead of "True" due to a spelling mistake that puts a person in jail for life.. if the database is just strings or blobs, "tru" is an acceptable cell value, which could throw off the entire logic of a query. With an AnalBucket that only accepts TRUE or FALSE and not "tru" or "fall", the app has more integrity. The idea that business apps need not be reliable or "anal" is nonsense.}

Word.

I developed business applications for 25 years. I never encountered a circumstance where "correct" wasn't a top priority. If "correct" didn't matter, they wouldn't have bothered obtaining data in the first place, let alone expended the resources to analyse it.

Well, that differs from my experience. Companies quite often do not wish to spend extra resources, such as more personal or more time, to gain incremental amounts of quality. But it depends on the domain. PickTheRightToolForTheJob. If you are dealing with money or crime, yes you have to be far more careful than say Google losing one out of every 10,000 email messages. They are giving the service away for free such that the risk of losing one out of every 10,000 customers is not a reason to spend say 50% more on their database to get ACID etc. It may be we each pick different flavors of domains and project based on our personality and experience. Getting "good enough" quality on the cheap may be where I shine the most. Somebody with a fastidious personality may not do well there, being a better fit for banking and medical life-support systems. But I do adjust my style and sometimes language based on the nature of the app. And video games are known to be chalk full of bugs. Apparently it's not profitable to spend extra time and money to tidy them up. For the most part, customers in the USA do not value quality at the expense of price. You could argue that they "should", but we are not God and cannot reboot the world to re-engineer humanity. --top

{DevilsAdvocate and OffTopic: The only time I've found "correct" becomes murky (correct not a top priority), is with graphics or gui. Sometimes I've sat around debating for hours with myself whether or not to use a gray OK button, or a black one. There is no correct artistic quality to an application. To avoid this problem, I settle on defaults.. I use the default operating system colors instead of my own.}

Indeed, HCI issues may be murky, but the issue here is correct vs incorrect data. It is not murky at all. The simple fact is that BagAtational systems like SQL permit errors due to duplicate rows. Those errors simply cannot occur in true relational systems. However, true relational systems impose no more additional effort over BagAtational systems than adding a keyword or two to correctly import or export bags when these are (occasionally) required. The advantage, in all conceivable respects, clearly lies with true relational systems. The notion that adding a keyword or two (in rare conditions!) represents a significant economic cost, or that generated keys escaping into the wild represent a significant business risk, is ludicrous.

What language has such a feature? Rel didn't the last time we came this issue. I smell exaggeration. Going through extra steps to get a bag when needed can create errors. More steps is often more errors. And you haven't shown that mistaking temporary keys for stable keys is sufficiently rare. You only claim it. Why should I or the reader "just believe it" because you claim it over and over repeatedly in multiple spots redundantly? And you ignore the warning technique I suggested, comparing everything to SQL's way instead.

A version of Rel, currently in testing and tweaking and to be released in a few weeks, has high-level facilities import/link bags and add a generated key via TutorialDee extensions. It's always had low-level facilities to import/link bags as RelVars, and it's always had high-level constructs to emit bags.

You made it sound like it was simple in existing version. It appears you are throwing features at the problem now. WaterbedTheory at work. You have an "output system" that is kind of reinventing the table processing system, but is somewhat different. Thus, you have duplication of concept in order to maintain your "purity".
It is in the existing version, and it's simple. It's about to get simpler. I have no idea what you mean by "'output system' that is kind of reinventing the table processing system" or "duplication of concept".

You haven't shown that mistaking temporary keys for stable keys would ever occur. I doubt it would.

That's your opinion. You exaggerate the occurrence of bag problems in my opinion. Out of all the big foulups with SQL I've seen and done, it ranks fairly low.
Relative to all errors made, it ranks fairly low amongst casual SQL developers because they make plenty of other mistakes. Relative to all errors made, it ranks very high among day-in day-out SQL developers, because it's such an easy mistake to make. It's the sole reason why SQL experts advocate strict use of SELECT DISTINCT. It is a category of errors that is eliminated by Rel and other true relational DBMSs.
DISTINCT can add a lot of processing to some queries because it has to sort the entire result set in order to detect duplicates. I'd like to have a CHOICE as a developer to favor performance over "protection" if the situation calls for it. SQL does make it too easy to make duplication mistakes and I'd like to see the warning system added, but still I want a choice. I'm the master and the machine is the slave. You have it backwards. The slave can make recommendations; that's fine. But me, the human, should get final say.
Rel is not SQL. The processing to prevent duplicates represents negligible overhead. If you want to have absolute choice and control over the machine, I recommend machine language. Otherwise, you will have to accept the design choices of the software author. Rel, by design, is an implementation of the RelationalModel. The RelationalModel doesn't maintain bags in the database.
That's almost like saying, "Since the inventors of the airplane, the Write brothers, used canvas for wings, all airplanes MUST USE canvas to be called an "airplane" and be allowed to fly.
Sorry, not following you. An implementation of the RelationalModel is an implementation of the RelationalModel. I don't know why you think it should be otherwise, unless you are fundamentally opposed to the RelationalModel itself.
I want tools that maximize my productivity. What they are called and how they are classified is quite secondary. Apparently Oracle and IBM felt the same way, at least feature-wise.
Notably, and perhaps notoriously, Oracle and IBM did not implement the RelationalModel. Because there are so few implementations of it, until I developed Rel, most developers didn't even have the opportunity to compare true relational and SQL DBMSs. Have you compared true relational to SQL DBMSs to gauge their relative productivity?

Your so-called "warning technique" is not even worth comment.

That's your opinion. I disagree.
In your estimation, what ratio of bag errors are from intentionally using bags versus accidentally using bags? If most are from accidents, then my warning suggestion will make a big improvement. (SQL doesn't warn.) If it's mostly intentional, then even Rel won't prevent much because you just admitted above you made it easier to import and export into bags such that the data programmer can create bags and/or move stuff into bags relatively easily if they want to using your new handy dandy Eaz-EE-Bag™ improvements in Rel. Welcome to Bagland, Buddy, what took you so long? -t
The majority of bag errors are from accidentally creating bags. But in a true relational DBMS, "accidentally using [generating] bags" simply can't occur. In SQL, "accidentally using [generating] bags" can occur any time a SELECT statement is used without the DISTINCT keyword. Rel makes it easy to import/link bags as relations because the outside world contains bags. Rel makes it easy to export bags because the outside world occasionally requests bags. Rel, however, makes it impossible to keep bags inside the DBMS, because there is no reason to keep bags inside the DBMS, and maintaining bags inside the DBMS can lead to errors.
If you say most are accidental, then my warning suggestion should be quite helpful. Yet, you dismissed it (without giving specifics). That doesn't sound logical. Please explain.
You are suggesting that a warning be provided whenever a bag is created. However, there is no reason to maintain bags inside the DBMS, so Rel doesn't allow bags in the DBMS. They can't be created, except explicitly, deliberately, and only for export to other systems. Therefore, there is no need to warn about something that cannot be created accidentally and doesn't exist internally!
You seem to be skirting the issue. If Rel can "easily interface" with existing bag-based RDBMS by importing and exporting bags (which is being made even easier thanks to alleged fancy new features), then if a data programmer intentionally wanted to put in or extract bags to such systems with Rel, they can. It's not much different than SQL, just more steps. You call it "output" or "formatting", but the result is the same either way such that labeling it "formatting" doesn't change the result. I'm not talking about Rel's built-in database engine, but using existing RDBMS. Let's try to state this another way:

     If most bags are accidental Then
         warning system will greatly help
     Else // most are intentional
         Rel has features to do what SQL does for existing RDBMS
     End If

I think you may be misunderstanding something. There is no reason to allow bags inside the DBMS, so Rel does not allow bags inside the DBMS. It can import them into relations, or export them from relations, but it does not keep them or manipulate them. Rel only keeps or manipulates relations.
I'm not taking about internal here.
That doesn't change my response.
Then your response makes no sense. We are talking to "somebody else's" bags. They exist and we can't control that fact, yet must talk to them (import and export) because it's our job to. Rel appears to fully allow that and thus doesn't prevent those bags. For accessing bagged tables/views we can't directly control, it's only a matter of degree of difficulty that separates 1) SQL, 2) my warning plan, and 3) Rel.
Rel can't prevent the outside world from using bags. It can, however, eliminate bags from within. SQL allows errors due to duplicate rows. Your "warning plan" allows errors due to duplicate rows. Rel eliminates errors due to duplicate rows.
For schemas one can't control, Rel allows extracting from and putting data in bags, correct? Thus, it CAN generate bags. It's just extra steps compared to other techniques.
It can export bags to external targets, but it can never perform operations on bags internally. No relational algebra operator in Rel can ever see a duplicate row.
SQL can't put bags into your special database engine either. Thus both languages allow bags and don't allow bags in the same places. Think about that.
Sorry, I'm not following you at all. SQL DBMSes allows duplicate rows in query results and in its databases. Rel doesn't allow duplicate rows in query results or in its databases.
Yes Rel does. You agreed it can import and export to bag tables/views in other DBMS, did you not? You just gave it a different name (such as "output" or "formatting"). However you didn't provide a mechanism for not using the "output" as input to another DBMS. Where is the spanker? That's nearly impossible to prevent on the query language side, and may not even be controllable from a given machine.
I can't control what other DBMSs do with my output, nor do I care. Why would it concern me what other people do with the output from my system? Rel can import/link from external bags to internal relations. Rel can export from internal relations to external bags. There are no internal bags, and hence no duplicate rows can exist inside Rel. You may be viewing Rel like this:

       [ bag -> relation -> bag ]

Rel is actually like this:

       bag -> [ relation ] -> bag

What matters to me, and what I can control, is what happens inside the '[ ]'s
But you are not technically changing what is possible in general; only, making it more steps. Thus, my initial statement is still essentially correct: "Both languages allow [stored] bags and don't allow [stored] bags in the same places" and it's "only a matter of degree". -t
Sorry, I don't understand. In SQL, I can generate duplicate rows as query results or store duplicate rows in internal tables and therefore cause erroneous results -- due to duplicate rows -- in other SQL queries. In Rel, I cannot generate duplicate rows as query results or store duplicate rows in internal tables (known as RelVars), and therefore I cannot cause erroneous results -- due to duplicate rows -- in other Rel queries. In SQL, the input to a query might contain duplicate rows. In Rel, the input to a query can never contain duplicate rows.
In Rel one CAN generate bag results and save/send it to a DB if that DB engine allows bags. And same with extraction. You appear to be playing language games here. "As a query result" is largely irrelevant. It doesn't matter whether the bag comes out of one's mouth, nose, asshole, or penis; it's still a bag. It can "output" bags, and if one provides or finds a transport mechanism, that "output" can be input to another DB. In other words, at least one path exists to take Rel results and save them into a target DBMS if that target DBMS accepts bags. (If it doesn't accept bags, SQL cannot do it either.) Do you need to see the links where you admitted that? -t
- {It can output bags, but it could also output XML too! Don't you understand the difference? Just because it can output XML if you wanted, doesn't mean that it now becomes an XML based database. It is relational, but has XML output possibility. There is a big difference in what you can output, and how the database actually stores the data for QUERIES. For queries it is relational - whereas for OUTPUT you could spit out XML, CSV, bags (of shit), anything you want. The fact that it takes an extra step to make something bag-atational, is the database saying to you "bags are not what we internally use, but you can have them as an output if you want", whereas in SQL "bags are good! they are on by default and are used internally! They are great things!}"
- For one, I'm not promoting SQL's approach. I've proposed a 3rd approach nicknamed the "warning approach". Thus, a comparison to SQL is not the full story. I agree that "on by default" is bad. I don't see a significant conceptual differences between a "query result" and say XML output. It's just different technique to move bits around and adapt them to different I/O mechanisms. These should be orthogonal issues so that we can use different transport mechanisms to more easily hook up diverse systems. This is an important feature for productivity; a bigger bottleneck than "bag errors" in my experience. It may be possible to pull an Apple and only use proprietary or language-specific transport mechanisms to "protect integrity", but many orgs don't wish to make that trade-off: an "integrity centric" WalledGarden versus something that makes sharing info easier. Still, it's a matter of degree, for even then, Rel doesn't outright forbid bag creation, but just makes it more difficult. My assertion still stands; you haven't knocked it off the hill yet.
- Rel doesn't forbid exporting bags. Rel categorically forbids using bags as input to Rel queries, thus eliminating a category of errors caused by duplicate rows.
- And SQL cannot put bags into the Rel Database Engine either because there is no driver to do it, and/or because the Rel Database Engine would probably reject bags anyhow regardless of the data supplier IF there was a driver. Thus, that feature is not a difference maker. Move bags to/from a non-Rel-DB-engine: both SQL and Rel can do it. Move bags to/from Rel-DB-Engine: Neither Rel nor SQL can do it. El Sameo -t
- Rel links/imports data from SQL DBMSs (via JDBC), but converts bags to relations so that errors due to duplicate rows cannot exist. Queries used within Rel will not have errors due to duplicate rows. Data exported from Rel will not have errors due to duplicate rows. SQL, on the other hand, allows errors due to duplicate rows.
- "Inside" the query language they may be different. I'll agree with that. However, they both can import bags and create bags from or for systems that allow bags. And, neither can export bags to systems that don't allow bags.
- Of course. They are both DBMSs, intended to do what DBMSs do. Only, Rel is less likely to cause errors due to duplicate rows.
- Less likely to cause certain classes of errors, perhaps. Less likely to cause errors in general, that's another story.
- I think you'll have to construct a land-markedly persuasive argument to convince me or any other proponent of the RelationalModel that the errors eliminated by strictly adhering to the RelationalModel are fewer than the errors caused by strictly adhering to the RelationalModel.
Sorry, not following you. I don't know why you consider something so obvious even worth mentioning. Obviously, I have no control over what external systems may do with data provided to them by Rel. I do have total control over what Rel can do with data provided to it by external systems, including ensuring that no duplicate rows will ever be input to a Rel query. I know of no SQL system that provides the same guarantee.
You are over-focusing on what Rel calls "query data". It's a vocabulary-only victory, for Rel can still "leak" bags. It has multiple orifices and just because one doesn't leak bags doesn't mean they all don't. It's like putting all your guards at one gate and bragging about how well that ONE gate is protected. While technically true, it misses the bigger picture that your compound overall still has vulnerable spots to Attila the Bag. Fastidious people often over-focus on spot-victories over the bigger picture. I've worked for such people, and my dad is one, and I observed their successes and failures and the reasons for it: they don't balance competing needs very well because they "stick" to a few narrow factors and don't realize they are sticking. They can still be productive, but need to be managed by somebody who knows how to unstick them in a gracious manner when appropriate. -t
I'm not sure what point you're trying to make. Rel and SQL are both query languages, so how does one "over focus" on "query data"? What does it mean to "leak" a bag? What is "query data"? Anyway, let me explain how SQL and Rel conceptually differ: In SQL, the input to a query can be a bag, so a certain category of errors can result. In Rel, the input to a query cannot be a bag, so a certain category of errors cannot result. In all other ways, Rel and SQL can accomplish exactly the same things.
- I thought you stated somewhere it can accept bags as long as an opening step adds a dummy sequence number or the like. Perhaps I remember it wrong? Even if I did, it's not that hard to create a PRE-processor that sequences the rows of a bag. However, I'm skeptical such would noticeably reduce total quantity and cost of mistakes. For one, it conjures up our old friend: the temp key mistaken for a perm key. (However, this one is probably knowable, unlike some of the other temp key scenarios. But it's still more info to screw up and confuse dazed, inattentive, and overworked programmers.)
- Dazed, inattentive, and overworked programmers are far more likely to make mistakes due to duplicate rows. Highly focused and attentive database developers already make such mistakes. It's good that Rel eliminates that category of mistakes.
- And the warning option would rid at least 90% of such in my estimation, without outright banning (internal) bags. It's a very nice compromise as far as most compromises go. A given query statement wouldn't accept, process, or produce a bag unless the query programmer explicitly gives it permission via a keyword(s).
- So you'll allow an extra keyword on queries to allow bags, but you don't want an extra keyword on import/link or export to accommodate bags???
- Is it really that trivial in Rel? If it was, then it's not "preventing" much.
- It's preventing the entire category of possible errors in query results that can be caused by duplicate rows. Therefore, it's preventing a lot, with the only downside being the occasional keyword on import/link or export.
- {So if mysql offers the option to export XML, that makes mysql XML based and just as evil as an xml database? Wtf are you smoking? You're just ArguingForTheSakeOfArguing}
- I'm not sure I understand your question. I see no reason to make a significant distinction between "query result output" and any other kind/format of output a query language may generate, at least in terms of defining "integrity", "purity", or "protection" capability. One cannot really control what something else (such as a database programmer) does with such info once it's generated by the query language. The "nozzle type" can only be considered a format control, not a "safety" mechanism against bags. It has nothing to do with database classification. The scope of what Rel can control is thus quite limited, and its protection power against bags has been exaggerated.-t
  - Rel only guarantees that a certain category of errors -- those due to duplicate rows -- are eliminated inside the DBMS. Nothing has been exaggerated about that, because that's precisely what Rel does, what SQL doesn't do, and what I've been claiming all along. However, given that a DBMS exists solely to produce correct query results, I don't think the significance of that should be casually dismissed, either. And, of course, we certainly can't dismiss the significance of DBMSs in general. But what happens to the data after it's been emitted by a DBMS is neither the concern nor under the control of the database developer. There's nothing Rel or any other DBMS can do about the data once it's left the building, so to speak.
    - Re: "...guarantees that a certain category of errors [duplicates] are eliminated inside the DBMS." - No. That's wrong. They are only eliminated inside the query language's internal processing. Rel (the query language) can still send bags to a DBMS that allows bags, just like SQL; and SQL canNOT send bags to a Rel DBMS (engine) because a Rel DBMS engine does not allow bags (when finished). See Figure 4729. The scope of the "prevention" is smaller than you claim.
    - I think there may be some misunderstanding here. Rel provides the same ability to import and export data that is available in any other DBMS, like MySQL, PostgreSQL, MS SQL Server, Sybase, DB2, Oracle Database, and so on. The difference is that all of these SQL implementations permit a category of query errors that result from duplicate rows. Rel does not. As for your Figure 4729, I'm not clear what you intended to show, but it tells me (as I just stated) that Rel provides the same ability to import and export data as any other DBMS. If that's your point, then I agree. The difference lies in whether a DBMS allows query errors due to duplicate rows. Rel does not. No query in Rel will ever generate an error due to a duplicate row, because views, queries and RelVars (whether linked to external data or internal) cannot produce or contain duplicate rows. Queries in SQL can generate errors due to duplicate rows in views, queries, and tables (whether linked to external data or internal). That is the sum and total of my claim, it's what I've claimed all along, and it is a significant point because the sole purpose of every DBMS is to produce accurate query results.
    - Plugging a dummy ID will not necessarily prevent mistakes. Mistakes are made by humans, not the query language (for the most part). If you don't understand the nature of your data you will make mistakes regardless. The error rate is matter of degree. And I've suggested a warning mechanism be added to SQL or SmeQl to reduce inadvertent use of bags. But we are getting off the point of the scope of the "protection": it's limited to inside the query language. You haven't provided any refutation, only talking about the alleged wonderfulness of the inside protection when that was not the issue I was describing.
    - Ah, but that's the point: Everything you do with a modern DBMS is "inside the query language". When you evaluate a query, it's obviously "inside the query language", but when you import data into a DBMS from a file or another DBMS or an application, or export data from a DBMS to a file or a report generator or a client application, it's still "inside the query language" because these are done via the query language. Of course, once data leaves the DBMS we have no control over it. We can control what goes on inside the DBMS, however, and that's all query execution. Everything a DBMS does -- aside from a handful of administrative utilities found in some DBMSs to bulk load or back up a database -- is done via query execution. There is no "outside" the query language! Therefore, a DBMS with a query language that prevents a category of errors (like Rel) is obviously preferable to a DBMS that allows a category of errors (like all the SQL ones).
    - You admitted Rel can both import and export bags. Thus, Figure 4729 stands. You haven't said anything new. Implementing I/O in a query language doesn't change the resulting charting characteristics of I/O interaction with DBMS. And you keep ignoring the warning option, comparing to existing SQL instead.
    - Yes, Rel can import and export bags. In both Rel and SQL DBMSs, an exported bag must be based on an underlying query. The fundamental difference between Rel and SQL DBMSs, however, is this: In SQL, the underlying query can have errors due to the presence of duplicate rows. In Rel, the underlying query cannot have errors due to duplicate rows. So, although Rel and SQL DBMSs both export bags, a bag exported from Rel is more likely to be correct than a bag exported from a SQL DBMS.
    - As for the warning option, it's not needed, because a relational query language does not need duplicate rows. Why warn about what isn't needed? Furthermore, the standard means for interacting with a DBMS is via ODBC, JDBC, or some proprietary equivalent to ODBC/JDBC. Given a query, these return either a RecordSet or an error. There is no provision to return a RecordSet and a warning. Finally, providing support for bags inside the DBMS (i.e., other than at import and export) means having to implement an entire algebra, similar to the RelationalAlgebra but not the same, in order to support bags. It would also mean having to support storing both bags and relations. Given that there is no need to support bags other than at import and export, implementing it would be a pointless effort.
    - WhyDoExistingRdbmsAllowBags explains the "warning" technique in more detail. I agree it may be extra implementation effort to support both sets and bags efficiently, but the engines already exist for the most part; we're not starting from scratch. If another query language became popular, most DB vendors would hook it into the existing engine rather than start over or support 2 products. The world is already wired to be dual. -t
    - If there was a clear justification for maintaining bags in the DBMS -- as opposed to importing/linking bags as relations, and exporting relations as bags (when needed) -- then I would consider support for bags as you describe. Otherwise, it just seems like pointless perpetuation of a historical bad idea.
    - The topic is not schema design here, but connecting up multiple data sources, many of which we have no control over the schema or source DBMS. Perhaps bag support is a form of QwertySyndrome, but from the perspective of a given org or developer, it may be the better option. We are paid to solve specific problems, not fix the world.
    - I'm not sure why you're bringing up schema design. I was talking about connecting up multiple data sources, many of which (maybe even all of which) we have no control over the schema or source DBMS. That's why, in connecting to these data sources in Rel, it's possible to link or import the source data so that the source continues to maintain a bag, but Rel will see it as a relation.
    - You'd get the same behavior from SQL because the Rel database engine (RDE) doesn't allow bags. This means that some kind of conversion step is needed if one wants to communicate with the RDE via SQL. Thus, again, it's not a difference maker between the approaches. "It's possible" under both.
    - It's only possible because Rel doesn't allow bags. Rel requires that any external bag be converted to a relation. I don't know what you mean by "'it's possible' under both." What's possible? Under both, what? I think I'm either grossly misunderstanding you, or perhaps you're suggesting something that DBMSs don't do. They don't have some means for injecting data into each others' storage engines. DBMSs that can talk to other DBMSs at all (many don't) use ODBC/JDBC/etc links to request data via queries, or update data via "update queries" aka updates.
    - If the Rel engine doesn't allow "outside" query languages to insert info, then that's a limitation of the Rel engine. If you want to call that a "protection feature", be my guest. Others will call it a limitation. If it does allow it, then it will likely only accept unique records such that SQL wouldn't be able to "put a bag in it" even if the SQL writer wanted to; and in that sense what it "emits" to the engine is no different than Rel (the language) for comparison purposes. Is there a third option I am missing?
    - Perhaps there's confusion over terminology. I don't distinguish "the Rel engine" from Rel, and I don't know what you mean by "'emits' to the engine". Do you access "the PostgreSQL engine" or "the MySQL engine" or "the SQL Server engine", or do you access PostgreSQL, MySQL, or SQL Server? As such, when the Rel JDBC/ODBC drivers are complete, it will provide precisely the same degree of ability to allow external clients to update a database that SQL does, except the query language will be Tutorial D instead of SQL and all tables (aka RelVars) must have a primary key.
    - Barring at least minimal SQL may likely limit its ability to connect through existing tools. SQL is currently the Lingua Franca of cross-platform table communication for good or bad. CSV is another, but brings up many of the same issues.
    - Users who need minimal SQL generally use SQL. Rel appeals to developers who wish to break away from SQL. I do recognise, however, the value of legacy support. I'm sure if there's sufficient demand for SQL compatibility, someone will add it.
  - {You could use a javascript pop up window "copy and pasting has been disabled" HaHaOnlySerious}
- {So if rel can output XML the protection against it becoming an XML/tree based database has been exaggerated}
- I am not understanding why you are asking that question. Is "it" the query language or the DBMS?
- In modern DBMSs, these are effectively indistinguishable. Query languages like Rel and SQL do not (generally) exist outside DBMSs, and DBMSs do everything via the query language. The query language is used to make assertions of truth via updates, and determinations of truth are made via queries. A DBMS provides the infrastructure in which to execute the query language.
- Fine, but what's that have to do with XML and "becoming"?
- You appear to be suggesting that because Rel emits bags, it is no better than SQL and is effectively BagAtational. Your correspondent is pointing out the apparent error in your argument, by drawing attention to the fact that if Rel emitted XML, it wouldn't be an XML database. Likewise, just because Rel emits bags, it isn't BagAtational.
- Having a useful tool is more important to me than "classification certification". My argument is not about classification. And BagAtational may be a non-discrete state anyhow.
- I agree, having a useful tool is more important than any other consideration. It's what drives me to support faithful implementation of the RelationalModel. The debate between the RelationalModel vs SQL/BagAtational is not about "classification certification", it's about guarantees of desirable behaviour.
- No, it's about productivity. Internal purity is only one of many factors to weigh together.
- Yes, it's about productivity. Guarantees of desirable behaviour mean I don't have to spend time determining, or checking to see, whether desirable behaviour is extant or not.
- But you may have more work-around code to check. Tradeoffs. Weigh them smartly.
- Do you genuinely believe that having to worry about bag<-->relation conversions at the boundaries of the system represents "more work-around code to check" than having to verify that queries with "ALLOWBAGS" are not only generating correct results, but that other queries -- for which views with ALLOWBAGS are their source of data -- are correct too?
- I'm not sure what you are comparing. The subject seems to have drifted somewhere.
- You seem to be implying that eliminating bags from the DBMS requires more "work-around code to check", when in fact the opposite is effectively true. If you allow bags, you (typically) have far more code to check that query results are correct. Any JOIN, UNION, etc. query that uses a bag as input could potentially generate incorrect results. If you exclude bags, and require that they be imported/linked as relations, then the only "code to check" is at the point of import/link.
- No. My main point is that in a typical org (that I'm familiar with) we often have to use as-is a wide variety of databases and destination requirements for which we cannot change the schema. I am not assuming we have a wide choice in what we can "lock down" bag-wise. "Here's what we give you. Now use it and shut up." My tool design suggestions are based on this assumption.
- Why do you think you need to change a schema in order to import or link a bag as a relation? If I link to a duplicate-allowing table named B in PostgreSQL as a RelVar named B in Rel, I don't have to change B in PostgreSQL.
- Are you talking about plugging bags with temporary/dummy ID's again?
- I sure am. It's an excellent solution. The only downside -- that a developer might confuse a temporary ID with a permanent one and output it from the DBMS -- is trivially dealt with by making all temporary IDs derived from an UNSTABLE type, with appropriate obstacles in place to avoid emitting UNSTABLE type values.
- My skepticism over UNSTABLE is described in WhyDoExistingRdbmsAllowBags.
And perhaps you are putting the GateKeeper in the wrong place. It may just belong at the table/view design side, not in the query language.
I don't know what you mean.
The protection against bags "coming in" should be at the database engine level, not at the query language side. That way different tools can use the engine without sneaking a bag in. You wish to protect the database, not so much individual DB tools, right?
Regardless whether you are using Rel natively or accessing the Rel database engine via ODBC/JDBC, bags cannot be sneaked in. External bags are converted to relations on import/link; bags can only be generated as exports. Bags cannot exist inside any manifestation of Rel, in any form, anywhere.
So the Rel database engine has a GateKeeper against bags independent of the Rel language?
No more (or less) than any other DBMS. The GateKeeper is the Rel language, which is TutorialDee. Data goes in and out through TutorialDee queries and resulting ResultSets. That's the same as almost every SQL DBMS, in which data goes in and out through SQL queries and resulting ResultSets. Some DBMSs provide bulk loaders but Rel doesn't, yet.
Most existing DBMS have "bulk load" type of utilities that probably are not implemented in the query language of the system. Parsing a billion INSERT statements over and over doesn't seem very efficient. But then again, I never lifted up the skirt to see.
Parsing a billion INSERT statements isn't efficient. Note I wrote that Rel doesn't have a bulk loader yet. When it does, it won't permit bags, either.

Now: How, really, would "going through extra steps" to create a bag "create errors"? This is getting ridiculous. Surely you're just quibbling to preserve some idealistic adherence to bags for the sake of "bag purity", or some such, aren't you?

All else being equal, the more programming/query code needed to carry out a task, the more chance for making an error. Perhaps I focus on "productivity purity", eh? -t

So you'd rather allow the developer to frequently create inadvertent and subtle errors that could be pervasive throughout the database, than entirely eliminate those and allow the developer to rarely create explicit and obvious errors only at the boundaries of the system?

You haven't demonstrated that bag errors are more common than other kinds of errors that the tradeoffs bring. Frequency and magnitude of error is indeed important, but your assessment of both differs from mine, especially if the warning system is put in place. You need to flesh out why you think X is more common than Y. I'm not going to just take your word for it, you need to present an explanation for frequency and impact assessments. I cannot read your mind. Repetition of your summary assessment is not sufficient.

In most production databases, are the bulk of tables and queries referring to tables in the database, or outside the database?

"The"? A good many of the systems/processes I work on talk to and connect to a variety of databases and sources; some I can control schema/layouts and some I can't. Thus, tools that work effectively with a variety of databases and data-sources in which we the coder/querier have a varying range of control over are a good thing. A fair amount of unchangable sources are going to be bags either because the schema has bags or the info we can access omits the primary key or info about the primary key.

You haven't answered the question.

I guess I'm not understanding your question then.

You don't have to understand it, just answer it.

The answer is 53.

I can only interpret such an evasive response as clear evidence that you do understand the question and you know the point I intend to make about it, but are unable to provide a cogent counter-argument. In other words, you know that in the vast majority of databases, most queries refer to content within the database rather than external data sources.You also know it is reasonable to assume that errors will be proportional to the relative frequencies of types of query, i.e., errors from queries that reference internal data will contribute more to the overall error count than errors from queries that reference external data. (We'll ignore for now the fact that in this case the former are more subtle and difficult to find than the latter.) Therefore, if there is a mechanism that prevents errors in internal queries, at the expense of introducing errors in external queries, it makes sense to implement the mechanism because it will reduce the overall number of errors.

Personally, I think this is either quite weakly stated, or else I outright disagree. I tried once, to discourage the complacent notion that these 'generated keys' are a mechanism that 'prevents errors'. I'd prefer something like 'makes them easier to fix'. In which case, there would be alternatives, conceivably. I'm not against these generated keys, but you're arguing me out of my complacency about them. I don't take them to be a panacea. That they prevent errors, is not the way, exactly, to put the case for 'generated keys', in my view. Refusing to allow duplicate rows without addressing the situation somehow, should be an easy point--really, why bother to argue for relational theory? And, note, that 'why bother to argue for relational theory' is a little different than 'I want to prevent errors, and so etc.' Relational theory is technical. The arguments for it, do not look like this page, they're mathematical. I'm tempted to say something like don't argue for it, just implement it.

Generated keys are a mechanism that prevents errors by allowing the RelationalAlgebra to be used, instead of requiring some SQLesque hybrid relational/bag algebra. "Makes them easier to fix" does not accurately reflect the issue, because it is truly the case that a category of errors that are possible in SQL cannot occur in a faithful implementation of the RelationalModel. I.e., there's nothing to be "easier to fix" if you can't make a mistake in the first place.

To express my disatisfaction, I'll repost this quote from further above: 'Anyway, let me explain how SQL and Rel conceptually differ: In SQL, the input to a query can be a bag, so a certain category of errors can result. In Rel, the input to a query cannot be a bag, so a certain category of errors cannot result. In all other ways, Rel and SQL can accomplish exactly the same things.' I'd prefer to just leave it at 'In Rel, the input to a query cannot be a bag'. Then, we could focus on how messy that statement already is. What is 'the input to a query'? Query languages allow manipulation and retrieval of data. I'd prefer your point to be, that you're interested in a simple, powerful QL, that has a strong formal foundation based on logic. That's what you want to support. And it will be something, that is not intended to be used for complex calculations, is not 'turing complete', it will support easy, efficient access to large data sets. It will allow for much optimization.

I don't think there's any difficulty understanding, informally, what "input to a query" is.

That's not enough? The focus, for me, should be on what forms the basis for SQL. Relational algebra, and relational calculus. Now, if a query is applied to relation instances, then the result of a query is also a relation instance. I'd put somewhere along about here, something about how the schema for the result of a given query is fixed. Determined. By what? By definition of query language constructs. Relational algebra has some basic operations. It also has, I suppose, additional operations. But each operation returns a relation. That means, that the algebra is closed. And that means, that operations can be composed. Which all sounds quite different, doesn't it, to 'In Rel, the input to a query cannot be a bag, so a certain category of errors cannot result.' Not, I think, actually, the point, I can't wring all this happy crap I'm talkin' about, out of that.

Actually, neither RelationalAlgebra nor RelationalCalculus form a basis for SQL, which is precisely the problem. They are arguably both an inspiration for SQL, but it deviates heavily. SQL allows duplicate rows, for example. However, the focus on this page is not on adherence to the RelationalModel per se. You appear to be suggesting that it is a goal in itself, and that it can be defended mathematically. Be that as it may, it doesn't say -- in pragmatic terms -- why the RelationalModel is a GoodThing. On this page, we've been debating it from a practical point of view. In short, not using the RelationalAlgebra means using a bag-based algebra, which permits errors due to duplicate rows. (Of course, that's essentially true by definition!) SQL is notorious for this, and I gave an example on BagSetImpedanceMismatchDiscussion.

The happy crap that I'm talkin' about, is relational algebra. These are just preliminaries. What a query Is Applied To, that's better than 'the input to a query etc.'. What is a query applied to? schemas of input relations--that's the 'input', perhaps..I dunno, maybe I'm just being a bit more patient than you, slowing down. Was this what you were trying to explain? And what about schemas of input relations for a query, something about them being 'fixed'? And the result of such a query is a relation instance, which is to say, the result of such a query is *also* a relation instance? So? So the schema for the result of a given query is also fixed. Oh yeah, well then what is it determined by? By, see above, it is determined by definition of query language constructs. I'm sure I'm wrong to smugly feel, that this probably sounds to everybody else like Martian to you people, jeez, this is the subject you're debating. These are preliminaries. Before we get into Selection, so certainly before we get into union, intersection, set-difference, we'd get into projection, which I haven't seen this word used here? Because, of course, we all know about projection. So of course, incidentally, we know why projection operator has to eliminate duplicates. I think top is crying for help, he wants to understand this arcane priesthood wisdom, and either you don't have the patience, or it's just actually too technical for all of us.

Before we get into projection, we must understand relational algebra. In any case, the relational model has *rigorously defined* query languages. That point, is simply part of a summary of relational algebra. I don't think I've cleared anything up one bit, have I? :-)

I'm sure Top is well aware of the technical aspects of the RelationalAlgebra. Why it is superior to some bag-based algebra is the issue here, and I argue here that it is superior -- among other things discussed elsewhere -- because it categorically precludes errors due to duplicate rows.

{This is a case where backwards compatibility is evil. Connecting to a variety of databases and remaining compatible with them is a goal of mega backwards compatibility and mega flexibility. Surely there is a pattern describing the evils of backwards compatibility on this wiki somewhere? It goes something like this: we need to remain compatible with ms-dos so let's build windows 98 on top of ms-dos instead of building a proper operating system that replaces ms-dos. Uh oh! How about building a database that remains backwards compatible with ExBase, but also runs on Dos... and truly relational is just a PipeDream because backwards compatibility with SQL is king. Or remaining backwards compatible with old BASIC code which used lots of GoTo and line numbers.. At some point you have to abandon backward compatibility in favor of a new more sterile technology. Luckily CSV offers a way to remain still somewhat backwards compatible since CSV is just dumb text.}

This is a matter of weighing trade-offs. I've spent a good amount of time revising stuff to fit new languages/paradigms/tools/interface. It's costly economics-wise and shouldn't readily be dismissed. And in my opinion Microsoft was wrong to discontinue the VB6 language so suddenly. They could have offered both for a while so existing apps would still have some support. In other words, phase it out gradually rather than yank the rug out all at once.

Figure 4729:

Query language can send or receive a bag to/from a DBMS engine that:

 Query.|.DBMS that...|.DBMS that DOESN'T
 Lang..|.allows bags.|.allow bags 
 ------|-------------|------------------
 SQL...|.....YES.....|........NO........
 Rel...|.....YES.....|........NO........

(Dots to prevent TabMunging)

MarchTwelve