Bag Need Scenarios

These are five scenarios where a "bag" may be needed or helpful instead of unique rows. (Related to BagAtational.)

Scenario Summary:

Counter-argument summary:

Assume we are using a true relational DBMS (see the RelProject, for example) which does not permit duplicate rows in tables (RelVars)...

A summarized outline is in the works under BagNeedScenariosReWork.

It's the same stuff again. Do we really need to prolong this, let alone repeat it?

I'm trying to summarize it. In other words, a compact version of the "same stuff".

Like the stuff that's already above, yeah?

I would put it there, but this page is TooBigToEdit. As mentioned on BagNeedScenariosReWork, when the summary settles, a topic re-factoring may take place.


Scenario 1: Limited Access to Large Log Table

You are given read-only access to an event log table created by a customer/partner without an auto-sequence key or any other (accessible) unique key. The events are just "dumped" into the table with no guarantee of uniqueness. The logging device IDs were supposed to be unique, but due to sloppiness or lack of maintenance (such as a repairman doing a reset and skipping device ID re-entry, leaving the factory default), they are not unique in practice. And the time-stamps emitted by the logging devices are rounded to a tenth of a second. The table is too large to practically copy in its entirety to a local workspace. The working assumption here is that you, a data user, don't have the political power and/or rank to fix the loggers "right" or change the table structure.


Scenario 2: Ad-Hoc Column Trimming

You are doing ad-hoc querying to hunt down a suspect data pattern. Chopping out compound keys makes the listing fit on screen and/or paste better into a spreadsheet or eNotepad.
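For illustration, a minimal sketch of the kind of trimming meant here, with a hypothetical table and columns:

 -- The full compound key (region, branch, order_date, order_seq) makes
 -- each row too wide to scan. Chop the key columns out:
 SELECT status, error_flag
 FROM orders
 WHERE order_date >= DATE '2009-01-01';
 -- The listing now fits on screen, but identical (status, error_flag)
 -- rows repeat, i.e., the result is a bag.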


Scenario 3: Contracted Delivery Columns

A marketing service that your company contracts with to study customers is asking for a list of addresses and a purchase count of a given product from your customer database. For security purposes, your legal team has negotiated to omit the customer ID and name from the list, and the columns present, and only those, have been clearly defined in the contract. It's thus possible some addresses may be duplicates, because different customers may live in the same household, or due to typos such as a missing apartment number. These won't necessarily hurt the analysis, because approximate location ("neighborhood") is sufficient for the marketing purpose.

(For those using the non-keyed data on the receiving end, see BagSetImpedanceMismatchDiscussion)
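A rough sketch of the contracted extract described above, with hypothetical table and column names; once the customer ID and name are projected away, nothing guarantees that the remaining rows are unique:

 -- Count purchases of the product per customer, then drop identifying columns.
 SELECT c.street, c.city, c.zip, COUNT(p.purchase_id) AS purchase_count
 FROM customers c
 JOIN purchases p ON p.customer_id = c.customer_id
 WHERE p.product_id = 12345
 GROUP BY c.customer_id, c.street, c.city, c.zip;
 -- customer_id appears in GROUP BY but not in the output, so two customers
 -- at the same address with equal counts yield duplicate rows.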


Scenario 4: Sales Summary

A retail company is sending out a CD to managers with a small reporting application that provides interactive QueryByExample reports of prior year's sales from different perspectives. Urban managers can get it off the intranet, but there are still sufficient numbers of rural stores with spotty or no access. (Another take on this is given at PageAnchor "scenario_four_alpha".)

Normally only a product catalog is sent on the CD, but the techies concluded that the reporting app could also fit on the same CD with some adjustments.

For this project, a summary statistical table is produced with this schema:

 Table: sales_sum
 ----------------
 store_id    // retail store identifier
 month       // date rounded down to 1st of month
 prod_categ  // primary product category
 repeat_cust // repeat customer flag: Y or N
 sales       // dollar amount of sales

The first four columns make up the primary key. However, disk space was at a premium on the CD, so the primary key was eliminated, creating essentially a "bag".

Individual indexes on the first 3 columns remain. Slower repeat-customer-related queries were noted, but deemed acceptable considering the tight space.
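In SQL terms, the trade-off described here might look like the following sketch (the column types are assumed):

 CREATE TABLE sales_sum (
   store_id    INTEGER,       -- retail store identifier
   month       DATE,          -- rounded down to the 1st of the month
   prod_categ  VARCHAR(20),   -- primary product category
   repeat_cust CHAR(1),       -- repeat customer flag: Y or N
   sales       DECIMAL(12,2)  -- dollar amount of sales
   -- PRIMARY KEY (store_id, month, prod_categ, repeat_cust) was dropped
   -- to save space on the CD, making the table effectively a bag.
 );
 CREATE INDEX ss_store ON sales_sum (store_id);
 CREATE INDEX ss_month ON sales_sum (month);
 CREATE INDEX ss_categ ON sales_sum (prod_categ);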

An explicit primary key objectively takes up more disk space, almost as much as the table itself. This is a bit smelly. Purity is not free here.

What would you have done if the catalogue was so large it wouldn't fit on the disk at all? What if there was only room for an index on one of the three columns? You've certainly shown a "need" for a bag in a situation that just happens to be delicately balanced between sufficient space for a relational implementation and not enough space for any implementation at all. That makes it the very epitome of contrived. Very, very dubious, Top.

[TopMind, this entire scenario is you mistakenly conflating indexing (and a particular implementation thereof, no less!) with uniqueness constraints. FAIL!]

[As a note, there are techniques to index on all five columns at once with trivial additional space costs. Do a little research on kd-trees and R-trees if the subject interests you.]

A given RDBMS does not necessarily separate the two nor provide space-friendly alternatives. It's not practical to provide key/constraint implementations tuned for every possible domain-specific scenario. Sure, given enough time, one can research different DB engines and perhaps find one that does, and convert any vendor-specific SQL over, but that's an idealistic assumption. Plus, using a different DB increases cross-training costs. Staff has to know how to use a given RDBMS engine/format. Managers often want PlugCompatibleInterchangeableEngineers, and a new "weird" database brought into the mix will go against their wishes. -t

Further, why check "constraints" when it's read-only after release? There aren't likely to be any INSERTs done after it's shipped. Usually the file(s) are replaced in their entirety for such a product. One can run a simple query to detect any duplicates before shipping. -t
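For example, the pre-ship duplicate check could be as simple as the following sketch against the sales_sum schema given above:

 -- Any row returned is an accidental duplicate; an empty result means
 -- the shipped "bag" happens to contain no duplicates.
 SELECT store_id, month, prod_categ, repeat_cust, COUNT(*) AS copies
 FROM sales_sum
 GROUP BY store_id, month, prod_categ, repeat_cust
 HAVING COUNT(*) > 1;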

[I was about to make the same point. It was very silly of you to call something a 'bag need scenario' when it lacks duplicates, does not need duplicates, and cannot be later modified to include duplicates.]

Bags don't require duplicates to actually be bags. An empty data-structure is still a data structure. However, I will agree this is a gray area because defining a data structure by what it "does" versus what's in it can be sticky. -t

[Duplicates are required to demonstrate that bags (multi-sets) are necessary or helpful relative to sets. That 'gray area' you've wandered into hardly seems relevant.]

In fact, I've seen similar tables in which last-minute corrections were added by simply "appending" the deltas to adjust existing records, creating what would be a duplicate key. As long as all report queries use GROUP BY, which this type of summary reporting usually requires, it will work.
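A sketch of that append-a-delta technique against the sales_sum schema above; the figures are made up:

 -- Shipped row:  (17, '2008-06-01', 'GARDEN', 'N', 1200.00)
 -- Late correction: sales were actually 1450.00, so append the difference,
 -- duplicating the would-be key:
 INSERT INTO sales_sum VALUES (17, DATE '2008-06-01', 'GARDEN', 'N', 250.00);

 -- GROUP BY reports still come out right because the copies are summed:
 SELECT store_id, month, prod_categ, SUM(sales) AS sales
 FROM sales_sum
 GROUP BY store_id, month, prod_categ;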

[What a silly way to handle deltas.]

What would you recommend instead? This is the least intrusive and the easiest to find and undo. If you reprocess the whole shebang, you risk some silly mistake or command typo. In my case, it was only about 10 records, so the size difference would be negligible.

[You also subjected your process to command typos by not scripting it? Good work, TopMind. If the RDBMS supports some non-intrusive approaches, use them, otherwise reprocess; either way, deltas require both inserts and deletes. But we are drifting far afield of the contrived and irrelevant scenario at this point.]

All technical issues aside, my objection to this scenario is that you can always come up with a hypothetical constrained situation to justify any "need". Why a CD? Why not data storage on an RFID tag to be scanned by a low-capacity hand-held device? How about a low-capacity promotional USB key that can be plugged into any USB-capable device including phones and car stereos? How about limited storage on a satellite or space-shuttle experiment? And so on. There are innumerable one-off situations that require a special set of circumstances. Whilst these may warrant bags or any number of other constrained strategies, what they do not do is justify providing bag support in relational systems in general. You still haven't presented any argument to support the notion that a relational system benefits from supporting bags in addition to relations, except for being able to import/link duplicate records at the input, and export duplicate records at the output. Can you show any general justification for representing data in bags within the system boundaries? Can you show any justification to have a 'bag' data type in addition to a 'relation' data type on systems capable of supporting relations?

Note that you wrote at the top of this page, "[t]hese are [...] scenarios where a 'bag' may be needed or helpful instead of unique rows." If we're dealing with a space-constrained situation that forces us to eliminate unique rows as an option, then this isn't a case where a bag may be needed or helpful instead of unique rows. "Instead" implies the option to choose. You've deliberately constructed the scenario to eliminate the option to choose. Thus, this is a case where a bag is the only possible option, by deliberate construction. As such, your scenario is almost identical to this one: "After one of Chris Date's lectures, your boss got into a heated argument with Chris over allowing duplicate rows in tables. As a result, your boss refuses to let you use anything that implements the true relational model and requires that every table allow duplicates. Therefore, bags are needed." Well, no. Bags aren't needed, bags are being forced. The distinction is significant.

[I must disagree. That isn't a significant distinction. Something being forced by real circumstances is the very essence of being needed. Top's error here is attempting to pin the requirement on the peculiarities of a particular RDBMS implementation, hypothetical or not. (If he wanted to be a little more obvious, he could produce another scenario: hypothetical RDBMS X forces duplicates on you always; that is, it literally checks for duplicates and refuses to do work, except for inserts, until at least one of its relations contains a duplicate. Therefore, bags are needed or helpful.) I also disagree with your assumption that space-constrained situations can force us to utilize relations with duplicated rows. Something to keep in mind is that, the more space-constrained the RDBMS is, the less you need indexes (indexes are all about upwards scalable performance). At a small scale, an RDBMS could be efficiently implemented by linear searches on arrays. Additionally, it is a mistake to confuse representation (how the RDBMS organizes the bits) with the semantics (what this information means within queries or upon export). It is perfectly legit, if you wish, to use position in a linear array to store information regarding an autonum column; that would not introduce 'duplicate rows', though it is of questionable suitability if rows can be deleted and the autonums can't be reused.]

What I've tried, and perhaps failed, to make clear by my use of the term "forced", is that if you ignore all the technical issues, this is a purely contrived circumstance wherein Top has attempted -- in a somewhat roundabout way -- to state "you need a bag because I say you need a bag."

The bottom line is that removing artifacts that prevent duplicates saves space. If you don't hire a bouncer for your loft party, you can save money; the trade-off being you risk having riff-raff around. Depending on the circumstances, you may decide it's worth it, such as making the party more interesting at the expense of having to buy new furniture afterward. Maybe you decided your furniture is out of style anyhow. Having a choice can be good and allowing bags offers choices. -t

[Bad analogy. The trade-off is that new riff-raff won't be recognized and removed with the same efficiency as a dedicated bouncer might manage. And if you want choice: there are many, many approaches to handling duplicates. Indexes are not even among the best of them (they trade space for time). Feel free to introduce the choice there, at the point of optimizations tuning. No need to allow bags in order to introduce choices.]

Well, let me temper that a bit: having a choice can sometimes help. This does not mean that a gazillion choices are necessarily good. I'm illustrating specific places where choice helps. Whether it's worth the downsides in other situations is a larger question than this scenario itself. I'm not addressing every possible variable of the bag issue here. -t

[Just because a choice is moved off into a different operator does not mean the choice is gone. You don't need to mourn the choice. You only need to learn its new address.]

If the "operator" doesn't have to fit on a tight CD, it simplifies life.


Scenario 5: Fouled-Up Key

I once worked on a state government project as a contractor whereby the state had once accidentally issued duplicate license plate numbers. Rather than recall the plates, they decided to put out a notice to law enforcement agencies and to live with the overlap. However, as contractors we had to work with their data and had to live with "bags". The data set was too big and changed too often to make a local copy with surrogate keys, barring a big expenditure and extra staff. The state may have created their own internal unique key, but as contractors we didn't have access to it.

Note that the issue is from the perspective of the contractor. Perhaps the state should have cleaned up the mess rather than live with junky data, but I'm not assuming king-level control of all systems. In the real world you have to work with stupid people and stupid systems.


These scenarios are based on my actual experiences. Some are simplified to keep the examples explainable without giving a domain lesson. -- top


Link Specifics

Suppose we are querying for the most recent week's worth of logs. What comes back from ODBC? Does it have a unique identifier? If so, is it relative to the original table, or only the current request? Is it usable by TD/Rel? What if a driver only sent CSV to TD/Rel of just the requested columns? Where is the key stored while the client is chomping on the data set? Does something have to scan the entire source table to make unique keys? It's generally fairly easy to get a unique working surrogate key for a single given query: just number the row-set starting with "1" and on up. However, that's not usable across different queries, such as if we want to get updates, like any straggler logs that arrived late because a station's network was down and the loggers queued them. I don't know if one-time-use is sufficient for TD/Rel. I don't have to ask most of these questions of existing RDBMSes because they can just stuff it into a bag. (Auto-sequence surrogates in the copy can optionally be added.) It does appear that TD/Rel are finicky compared to current RDBMSes. It's like dogs versus cats: dogs will eat just about anything, but cats are finicky, AKA high-maintenance.
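On engines that support standard window functions, the "number the row-set starting with 1" idea can be written directly; a sketch, with hypothetical log-table names and dialect-dependent date arithmetic:

 -- Attach a one-time surrogate key to a keyless result set.
 SELECT ROW_NUMBER() OVER (ORDER BY log_time) AS rn,
        device_id, log_time, event_code
 FROM event_log
 WHERE log_time >= CURRENT_DATE - 7;   -- date arithmetic varies by dialect
 -- rn is stable only within this single result; re-running after straggler
 -- logs arrive can hand the same event a different number.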

Based on the earlier statement, all of these questions have already been answered; the rest was conjecture on your part.


You've still not shown any need for bags. Let's summarise:

You've shown that there is a need to generate files with duplicate records to meet client requirements.

You've shown that there is a need to import or link to files and tables with duplicate records.

You've shown that you'd like to be able to display relations without keys shown.

So, where are bags needed?

Nothing is "needed" in the absolute sense. Per above, I should have perhaps used the word "helpful" instead of "needed".

[You did use the word 'helpful'. We have yet to see a case where bags are non-trivially helpful (ooh, boy! I saved five characters relative to a particular syntax - and that's really a feature of the syntax and not bags!). We have many instances where bags are harmful in non-trivial ways (optimizations being the only one a narrow-minded bigot is likely to acknowledge, because duplicate-entry errors, resilience under distribution and concurrency, idempotence, and such can be dismissed with wild HandWaving).]

YOU are HandWaving. You have not demonstrated openly-dissect-able practical significant downsides. You just plain have not. Notions in your head don't count.

[Downsides of bags are the same as ever: potential for duplicate-entry errors, certain optimizations are not possible without a lot of extra context, lack of idempotence on concurrent or disrupted inserts and deletes, duplicates multiply across direct or indirect self-joins. Are you claiming these are insignificant? Or are you denying they are downsides? And Note: remember that this topic is about YOUR burden of proof: showing that bags have openly-dissect-able practical significant upsides.]

I gave 3 domain scenarios to dissect, your turn now...

[You say that as though they weren't already dissected and found wanting.]

Where are the 3 domain narratives?

[I am waiting for a specific, clear, objective denial from you. I already asked the questions above.]

Denial of what? I don't know if the problems you claim with duplicates are "insignificant". That's not my burden. I'm not the Chicken Little of duplicates. The burden of evidence is on Chicken Little.

[My questions are written above. I placed them in bold for you. The problem is that you have, in fact, been dismissing these issues as idealist nonsense. You cannot pretend now that you are uncommitted. I would like you to make your commitment clear, and objective, such that we have something to target with any scenarios we develop.]

I cannot answer them in terms of how often they are a practical issue, or the practical implications, such as how often and in what circumstances they cause significant problems. I have insufficient knowledge to apply occurrence probabilities. I don't question that there are probably worst-case scenarios, but every tool has worst-case scenarios. The devil in the computation of "significant" is in the probabilities presented by the domain and in human psychology in terms of interacting with a given tool. -t

How often do you expect to have to import external data with no identifiable key, either via the ODBC/JDBC driver or in the data itself, such that it would not be possible to represent it without bags or relations with auto-generated keys?

How often do you expect to have to export or otherwise present data with any and all keys removed?

I thought you meant your scenarios for duplicates causing problems, not my scenarios. Let's make sure we are talking about the same questions before we continue. For the record, I would NOT call the problems/scenarios I listed here "significant". "Relatively minor" is probably the phrase I would use. Although it matters more in some shops/domains than others. -t

[TopMind, beyond the clear hypocrisy regarding attempts at 'worst-case scenarios', you and I define 'significant' differently. To me, a 'significant problem' is a systemic one - it cannot be solved in-the-large by local changes. (This doesn't ignore syntax; I consider inability to abstract, resulting in a need for BoilerPlateCode, to be a systemic issue.) I am willing to force developers to (occasionally) change their habits to solve significant problems, and I see all forms of standardization as instances of this. To you, 'significant' seems to mean anything that forces you to change your habits or make local changes - i.e. it is a matter of "psychology" and "familiarity", and when you say 'significant' you seem to mean 'personally significant'. I find all of your scenarios to be insignificant: none demonstrates a systemic problem. Based on your response above, I suspect you would say the same of mine - at least until the implementation fails you, which might not happen if you don't stress the limits of concurrency, availability, scalability, and performance. The problem is, a lot of people aren't you, and databases really are stressed to their limits in practice. I suspect you have rarely ever had to rewrite large projects due to performance, concurrency, or architectural issues, but that is not uncommon in the industry. Scalability and concurrency issues have affected almost every large website, ever, often costing the site millions of dollars in traffic (e.g., eBay and Yahoo). Massively multi-player games must often run continuously, 24/7, which means there is never a moment for batch processing. Your self-centered 'psychology-oriented' viewpoint is, I believe, ultimately insignificant and impractical in the real world. Yet, I know you are prepared to dismiss the scenarios wholesale as 'relatively' insignificant by use of subjective assertions regarding probability and 'WherePsychologyMatters'. I won't do that dance with you again. I ask you to take a position that either (a) clearly and objectively rejects the points we have made (so you can't resort to your usual subjective nonsense arguments to counter objective claims), (b) clearly and humbly requests an education on a specific point (so you can't deny being a student on the subject), or (c) admits honestly that, while you might not fully understand the point, you simply won't care until it becomes personal by bashing you over the head (so we can at least drop the subject).]

First of all, if it's about "scalability" and high-end performance, why can't you create a reasonably-easy-to-describe scenario from such a situation that illustrates the power of "no bags"? You talk in general terms, lacking details and specifics. Enough brochure-talk; give 3 sufficiently-detailed scenarios. If you don't feel like it, fine, just say so. Just don't expect readers to take your claims at face value and then insult them for not reading your favorite library of obscure books. I did my part with my 3 scenarios and I expect you to reciprocate.

[Answer your own questions, seriously. You ask why I cannot offer a "reasonably-easy-to-describe scenario" that "illustrates" an issue such as "scalability". It's an issue of ArgumentByLabToy. Unless you understand the math, you won't understand the significance of the example. And, thus, describing the scenario is insufficient; I'd need to describe the math to you and, frankly, that isn't something I can reduce to your level of competence unless you were heavily motivated to understand the significance of the example. Considering that your agenda would be to deny the significance of the example, I don't consider you worth the effort. Besides, your three examples didn't actually demonstrate anything significant, even by your own admission. What is there to reciprocate?]

 // Example: "I-love-options"
 If (all tables of this query are sets) {
   load and run set-based optimizer;
 } else {  // some are bags
   use hybrid processing;
 }

Second, there's a lot more to app development than BigIron apps. There's lots of small- and medium-sized niches out there where machine performance is not the primary cost/resource factor, but rather software creation and maintenance effort. It doesn't make sense for these to adopt practices meant to cater to "the machine" over "the human" if "the human" is the most expensive resource. PickTheRightToolForTheJob.

[Indeed. And, yet, you've offered no reason to believe that the bag-friendly approach is more suitable for these domains. But true RelationalModel promises to prevent duplicate-entry errors that might cause double-purchases and credit-card expenditures and such, and thus reduces maintenance and compensation issues. That promise seems helpful to me, as a human.]

I spend a lot of effort producing and managing "marshalling code", for example. That is the interface and conversion between systems and apps. Tools that simplify that hit one of the bigger time-suckers in my world. -t

[Yeah, I also do a lot of serialization and parsing. I do not believe the RDBMS is the right place to tackle this problem (since it hits a lot more tools than the RDBMS). But, until a better solution is a commodity, it sure would be nice to have an RDBMS with dedicated import/export utilities flexible enough to support a variety of common and ad-hoc formats. Why, it would be a shame to depend upon inflexible vestigial DBMS features that have the potential to cause scalability, performance, and maintenance issues. (And what's with all those unfriendly gotchas? Why can't I abstract my SELECT INTO for use within another SELECT, or use it in a VIEW?)]

Once you learn that certain features are only available at the final/end query, then it will no longer confuse you. If that's the kind of stuff that trips you up, then perhaps you should consider another career, like grumpy high-school math teacher.

Would it make you more comfortable to have a differently-named operation/clause for "output selects"? "outSelect", or "format" or something?

[Why yes! What a fantastic idea, TopMind! If you're digging for names, I'll suggest 'EXPORT'. Maybe something like EXPORT myTable ORDER(ASC a) EXCLUDE k FORMAT CSV TO "myfile.csv". But, once I had an EXPORT, it would seem like SELECT was a bunch of LanguageIdiomClutter - two operators attempting to do the same thing. So, it would probably be best that EXPORT can work with tables produced by embedded SELECT statements. Why, something like this would have kept me from ever getting tripped up, keep me from accidentally using the wrong feature at the wrong place.]

And, I don't expect a query language to offer super-fancy rainbows, flowers, and parsley; but I do expect it to provide the basics of common export and output, including "showing" only the columns I ask it to show.

That's what Rel does. Just don't expect the projection operator to be overloaded such that it both emits duplicates and doesn't emit duplicates. Projection, like every other RelationalAlgebra operator, does one and only one thing. Other statements are used for importing/linking and outputting data.

I vote for overloading. Unity of activity matters more to classification than unity of "type" (type relational versus type bag), in my opinion. I wish to focus on the task/duty at hand, not noun taxonomies of the underlying parts. If you disagree, so be it.

Top advocates polymorphism! Hell experiences cold snap; freezes over!

That's an implementation detail. The user of the query language won't see whether polymorphism or IF's or whatnot is actually used under the hood. Whether there are even distinct "types" may be an implementation detail (HAS-A-primary-key versus IS-A-bag).

[Polymorphism is not an implementation detail. The implementation detail is whether polymorphism is implemented with if conditionals or a VTable or whatnot.]

Please clarify. There are not necessarily multiple "types" here.

[Implementation details are exactly the set of things I cannot distinguish in terms of FunctionalRequirements. Polymorphism, when it exists, is never an implementation detail. (Potential point of confusion: you seem to be assuming that 'polymorphism' means 'subtype polymorphism'. OperatorOverloading is technically a form of polymorphism.) Now, whether or not polymorphism exists in this case is arguable; I just wanted to comment, more generally, that you misused the phrase "implementation detail" by applying it to a feature that is by its own definition part of the interface.]

It can be both, but let's save that squabble for another day if it's not material to this.

[Interface vs. implementation may be layered, true, and thus are relative to position within the stack or protocol. But you still misused the term.]

I disagree, but put it in the haggle-queue for another day if it's not pivotal to this topic.


Style note: I will sign my messages with "-t" rather than rely exclusively on italics etc. I may also mark the prior "Anonymous donor" to separate it. This topic is getting too long for me to track where I use italics and where I don't. It grew out of sync for some reason.


Top wrote:

 // Example: "I-love-options" (internal to DB engine)
 If (all tables of this query are sets) {
   load and run set-based optimizer;
 } else {  // some are bags
   use hybrid processing;
 }
Yeah, I could provide that in the RelProject. What would a user gain from all that additional development effort on my part, including considerable thought to define hybrid optimisation strategies that are provably less performant than strictly using relations? Nothing. What would a user gain from learning the distinction between "duplicate-emitting" projections and "non-duplicate-emitting" projections? Nothing. What would a user gain from having to learn not to make mistakes by choosing these incorrectly? Nothing. What would a user gain from avoiding all that complexity? Much. Users always benefit from not being subject to unnecessary complexity.

What does a user lose from having to use relations instead of bags? Nothing, but she might occasionally have to employ an extra keyword when exporting data in order to eliminate keys and retain duplicates.

Top, you have presented no justification for representing data in the database using bags. If you have a good reason to represent data using bags -- as distinct from importing, exporting, or displaying it -- then you may be onto something. Otherwise, you have only defended what appears to be a marginal, infrequent need to import data containing duplicate records, or output (including displaying) data containing duplicate records. In the former case, detecting the need to add an auto-number key can be automatic. User effort: 0. In the latter case, excluding keys from the output (regardless of output mechanism) requires a keyword or two. User effort: negligible.

Therefore: There is no reason to maintain data in the database using bags, even as an option, hence "hybrid processing" is pointless.

I'll return to these questions later. But first, let's finish up addressing your performance claims, the main reason for that pseudo-code sample. -t

See the example near the top of BagAtational, which identifies a significant optimisation transformation available to relations that is not available to bags. There are others.

PageAnchor no_bags_in_this_one has NO BAGS. You are "solving" the wrong problem. -t

Sorry, not following you here. Your pseudo-code is clearly selecting on bags vs sets.

That's the point. Your lovely set-friendly optimization is still available if a given query has no bags in it. Determining such is not part of the tight inner loops of actual query execution, just one-time selection(s) up-front.

So... Let me get this straight: You want me to implement an entire bag-oriented query optimiser or a hybrid bag+relation query optimiser, which will inevitably deliver poorer performance than the pure relational query optimiser, and require that users understand the performance and possible-inadvertent-duplication implications of using bags and learn a set of options to choose bags vs relations, all to avoid having to learn as little as one keyword to eliminate keys and preserve duplicates on those exceptional occasions when a client demands output stripped of unique IDs? That's what you're asking for, because you've certainly not shown any justification to represent data using bags within the database. --AnonymousDonor

I see no reason why there has to be two sets of explicit operators in the query language. The other issues I will address later. -t

There is a fundamental reason for having two sets of explicit operators in the query language. It's because they are completely distinct operations, more different than '+' and '-'. One is relational projection, a well-defined mathematical operator on a relation that returns a relation. The other is an option on an export/output statement that removes specified columns from the emitted data. It isn't even an operator, really. They are as different as chalk and cheese.

Fancy math under the hood is an implementation detail.

That's like claiming the difference between addition and calculating a square root is an implementation detail. It's a fundamental conceptual distinction, and it's explicit, not "under the hood". -AnonymousDonor

No, it's conceptually very similar, at least when looking at tables through "domain eyes". Even the few times I have made one of those alleged planet-ending "duplication errors", the results were almost indistinguishable from what "proper" key management would have returned. -t

Yet, if you try to apply that "conceptual similarity" in a naive fashion, you quickly discover that a "projection" that returns duplicates causes errors when used with relational operators. In other words, a superficial and almost coincidental commonality, i.e., "pick some columns", disguises a more fundamental conceptual distinction. Claiming that they're "conceptually very similar" is as silly as claiming that '-' and '/' are "conceptually very similar" because both symbols use a single, straight line.

Concepts are in the head. The computer doesn't otherwise care; it just follows rules. And it's not coincidental: there are certain similarities in many if not most data structure operations among different data structures. And "disguising" is sometimes a good thing; it's part of abstraction, in fact: disguising complex things as simple things in order to give us a UsefulLie can simplify our work and grokkability. The key is finding the balance/threshold beyond which there is too much abstraction. And your "straight line" analogy is poor. Try again. -t

By the way, wouldn't life have been better if you couldn't have made duplication errors at all? Or can you simply not abide typing an extra keyword for those three or four times in an entire career that you have to emit duplicate records sans keys?

Many if not most of them could be reduced by minor adjustments to SQL, such as UNION defaulting to meaning UNION ALL, with something like UNION UNIQUE for the other alternative. This would be more consistent with SELECT's "bag first" behavior. SQL has some stupid defaults. Whatever replaces SQL hopefully has smarter defaults (if bag-able).
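To spell out the suggested change (UNION UNIQUE is hypothetical syntax, not existing SQL):

 -- Today's SQL: UNION de-duplicates by default; ALL must be requested.
 SELECT city FROM vendors UNION SELECT city FROM customers;      -- set result
 SELECT city FROM vendors UNION ALL SELECT city FROM customers;  -- keeps copies

 -- The proposed default, consistent with SELECT's bag-first behavior:
 -- SELECT city FROM vendors UNION SELECT city FROM customers;        -- would keep copies
 -- SELECT city FROM vendors UNION UNIQUE SELECT city FROM customers; -- de-duplicates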

And I will live with occasional "purity errors" if it sufficiently simplifies common code. For me, errors caused by MISreading long-winded code are on the average greater than those caused by "purity" faults. I like purity too, but not at the expense of error-causing lack of expressiveness. It's all about weighing trade-offs.

[SQL is really long-winded. That means, if I'm following TopMind's logic, that bags are bad. I hate it when impurity causes all this verbosity. Really, we should trade-off: more purity, for less verbosity. That's what Rel did.]

No, it pushed conversion/interface issues to other tools and limited its ability to use existing RDBMS engines.

No, it didn't. It was one of your SQL examples that pushed conversion/interface issues onto TOAD, as I recall. Specific examples have shown that Rel incorporates import/link/export capabilities that SQL does not. Rel is unlimited in its ability to link to existing RDBMS engines. If a JDBC driver exists, Rel can connect to it.

By "use" I meant hook into and act as, or even be, a native query language, not merely import or export, or something in-between and goofy like MS-Access's "links". -t

I do not understand. What does it even mean to say one RDBMS engine has limited ability to "use" another in that sense? I would not blame SqLite for being a bad query language for Oracle. Are you confusing Rel with TutorialDee?

--AnonymousDonor

Let me see if I can clarify this. There is the (semi-)standard "SQL" and there are vendor-specific dialects. SQL the semi-standard can more or less be used on all RDBMSes. Any challenger to SQL should strive for something similar: a common base language that could potentially work on all vendors' RDBMSes with, hopefully, minimal fuss.

People don't want to toss their existing RDBMS and start over if by chance a better query language started displacing SQL. They would want their RDBMS to be able to use the new language.

In fact, as long as the replacement language is not protected by some proprietary restriction, vendors would end up adopting some form of it for their engines anyhow. Thus, if TD/Rel took off, vendors would create a version that works on their existing engines whether you want them to or not. And, this may include adapting it for bags. You cannot stop the baggification if and when TD/Rel becomes successful.

In some ways this can all be compared to RISC machine language: the momentum of existing CISC was just too much such that RISC appears to have lost the race even though it was theoretically cleaner (although ARM is still hanging around nicely in small-device niches). IBM and HP/DEC made a pretty good try at RISC, but have since pulled the plug because they couldn't get sufficient performance boosts to compete with x86 due to the economy of scale of x86.

I'm not talking about the theoretically best technology here; I'm talking about the likely behavior of the market. Thus, if you want to control how TD/Rel works with bags, do it early, or somebody else will do it for you.

--top

[The obvious solution would be simply to import/link tables from SQL data models using the same (to the user) mechanisms it does for remote tables, albeit with a slight tweak for the local URI. If TD/Rel becomes successful, it will be in part because people acknowledge the value of its purity. I agree on general principle that one should plan for success. However, I do not believe this to be a problem scenario.]

"Simply"? Non-native "linking" has not proven efficient and reliable beyond trivial data needs.

 return linkNative(L) if uriIsLocalToRDBMS(L) else linkRemote(L);

[There's your code. It is simple. No need for HandWaving about efficiency or reliability. No need for Oracle or MySQL to invent a new syntax for linking, or introduce extra options to the user, should they ever follow up on that (slim, but gratifying) possibility of implementing RelProject's variation of TutorialDee to ride DV's coattails. Well, they might change it anyway, for VendorLockIn, but they do not need to do so.]

I've been working on this sort of thing for years, including developing conventional multi-database driven clients via ODBC/JDBC, heterogeneous database systems, ETL systems, database IDEs, and now support for linking external DBMSes within Rel. Two things based on that experience:

Doing a join using two linked tables can be a bear because it cannot do the look-ups on the original host without some clever parsing. One runs into a lot of problems similar to RemoteIterationIssue.

Please define "can be a bear", i.e., is it difficult? Is it impossible? Does it not work reliably on all DBMSes? Etc. What do you mean by "cannot do the lookup-ups on the original host without some clever parsing." Define "clever." While you're at it, "clever parsing" of what? What do you mean by "lot of problems?" Etc. The search phrase you're looking for is "distributed heterogeneous databases." Happy reading, and please come back only when you've done some of it.

Are you saying that linked tables have just as much access to indexes and host-side filtering that native ones do? I'm not in a mood right now to produce a scenario demonstrating the inefficiencies and limits of linked tables. Maybe another day. -t

I'm not sure what point you're trying to make in this context. There appears to be some misunderstanding, but let me attempt to clarify some things:

There are standard ways of linking to DBMSes. ODBC and (to a lesser degree) JDBC are the standard APIs supported by most DBMSes, but there are also vendor-specific APIs like Oracle's OCI and Microsoft ADO.NET. All of these work the same way: Statements in a database language -- typically some SQL dialect -- are sent via the API to the DBMS and result sets or error and/or status messages are returned. All distributed heterogeneous database systems access individual DBMSes via these APIs. As such, no heterogeneous database system has any more or less "access to indexes and host-side filtering" than any other. Note that Rel, by providing links to external DBMSes, is a heterogeneous database system (i.e., it can link to a variety of external DBMSes) in addition to providing native tables aka RelVars. Some other DBMSes provide this facility. Oracle, for example, provides Oracle Generic Connectivity and Oracle Transparent Gateway to (relatively) seamlessly connect to external DBMSes in addition to using its native table format.

Now, it is always the case that a local connection to a native database is going to have access to internal structures that a remote connection will not. That much is obvious. However, for the users of the heterogeneous database system this makes no difference. For example, given a Rel system, assume the following two RelVar definitions (simplified for illustration):

 VAR myVarNative REAL RELATION {name CHAR, address CHAR} KEY {name};
 VAR myVarExternal JDBC postgresqldb TABLE customers MAP {name CHAR(thename), address CHAR(addr)} KEY {name};
A user can query either myVarNative or myVarExternal using TutorialDee syntax, for example:
 myVarNative WHERE name = 'Dave'
...or...
 myVarExternal WHERE name = 'Dave'
The former will perform the query using Rel's internal query processor, optimiser, and data storage and retrieval mechanisms. The latter will be translated by Rel into "SELECT DISTINCT * FROM customers WHERE thename = 'Dave'" and executed against the appropriate external DBMS. Therefore, the external DBMS's query processor, optimiser, and data storage and retrieval mechanisms will be used. In both cases, the performance of the query will be dependent on the performance of the relevant query processor, optimiser, and data storage and retrieval mechanisms. In neither case does the existence and/or use of bags, relations (or the lack of either anywhere in any system) have any relevance at all, and all heterogeneous DBMS systems work this way. Rel is no different from any other heterogeneous database system. From a client's point of view, the only difference between Rel and (say) Oracle is that Rel uses Rel's TutorialDee dialect and Oracle uses Oracle's SQL dialect.

The issues generally start to show up during JOINs, not single-table operations. Again, I'll provide more details some other day. Think about this: where is the primary processing of joins happening, on the client or the server, and why?

In the worst case -- e.g., "myrelvarx JOIN myrelvary" where myrelvarx and myrelvary reside in different databases and are of similar cardinality -- the entire tables must be retrieved from their respective databases and JOINed within the heterogeneous system. This is an issue common to all distributed heterogeneous database systems. It is certainly not unique to Rel or true relational systems and (again) has nothing to do with bags vs relations. This is now wandering off the topic of this page.

Even if the two tables are on the same host, it still may have to retrieve entire tables if you link instead of doing it natively. This would put a link-to-translate-bags language at a disadvantage to a native language if one or both is a bag-table. (Damn, you suckered me into this sub-debate.)

If you're JOINing tables on different databases, whether on the same host or not, you may have to retrieve entire tables.

If you're JOINing tables on the same database, you can always employ the native JOIN defined by that database's DBMS.

This has nothing to do with whether a database language supports bags or not. Note that any auto-generated keys used to convert remote bags to local sets will be applied to the ResultSet -- e.g., the result of employing a remote database's native JOIN -- and not the remote database's source tables.

Any "disadvantages" incurred by creating any heterogeneous database system are the same for all heterogeneous database systems, regardless of language or data model. The only reason for providing a link-to-translate-bags mechanism in Rel is to be able to use Rel as a heterogeneous database system. It's the same reason Oracle provides Oracle Generic Connectivity and Oracle Transparent Gateway. Using Rel as a heterogeneous database system and using Oracle's OGC/OTG have exactly the same "disadvantages". Both Rel and Oracle OGC/OTG have the same advantage: a single DBMS and database language can be used to seamlessly access a variety of DBMSes and/or a number of databases.

Once again, this has nothing whatsoever to do with the bags vs relations debate.


PageAnchor: inherently_slow

Let me ask for clarification regarding machine performance. Agree or disagree: Human error/judgment issues aside, a DBMS that optionally allows bags in addition to sets does not inherently perform noticeably worse than a set-only DBMS. Your argument about performance issues has to do with bags inadvertently slipping in and "polluting" the optimizer. It's not the existence of the option/choice itself of including bags in the DB that slows the computer down. Is this correct? -t

Arbitrary expressions consisting of operations on bags will be slower on average than equivalent expressions on relations due to the inability to apply certain optimisation strategies. However, the mere existence of the option/choice itself does not represent a performance hit.
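One concrete instance of such a strategy, sketched in SQL: when every input is known to be duplicate-free, a DISTINCT over a projection that retains a key can simply be discarded; over a bag, it cannot be:

 -- Suppose emp is a relation with key (emp_id).
 SELECT DISTINCT emp_id, dept FROM emp;
 -- Set-only engine: emp_id guarantees uniqueness, so DISTINCT is a no-op
 -- and the query reduces to a plain scan.
 -- Bag-capable engine: without the key guarantee, a sort or hash
 -- de-duplication step must still be executed.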

That once again raises the question that still has no answer from you: Why provide bags as an option at all? We've shown that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity (a keyword). However, providing bags as an option within the database definitely adds complexity. So, why provide the option? We might as well provide a BrainFuck implementation in the database too, for all the benefits that have been shown for providing bags in the database.

[To the extent that an optimizer must depend upon locally available information (i.e. when dealing with abstraction or modularity), the mere option/choice can hinder optimization. Under the assumption that the optimizer has global information about all parts of the query, the option/choice will not necessarily hinder performance, but still greatly increases complexity (complexity to handle the extra options, plus complexity to manage the information to make the right decisions). Historically, complexity hinders development, testing, and validation of optimizers... which may indirectly impact performance.]

See the RISC/CISC analogy above.

[CISC technologies work on the same principle that leads to AttributesInNameSmell. That is, certain combinations of useful features can be optimized. But I do not see a valid analogy here, in this section about performance.]


[1] Often such documents do not explicitly state a column order or lack thereof. But out of professional courtesy, one usually follows the order in the written specification. Paper has an inherent order whether we want it to or not. In fact, so does any digital file, regardless of encoding technique. Thus, it's difficult to get a true bag. Nature sort of forces it into a list. If you want a bag, then generally it has to be done by suppressing or hiding internal physical ordering from the interface. And even that's difficult, because it will generally deliver the info in a time order, a sequence. Maybe if the interface were 100% parallel, then there would be no time-of-delivery order.


Why provide bags as an option in a Relational DBMS? We've shown that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity (a keyword). However, providing bags as an option within the database definitely adds complexity. So, why provide the option?

I disagree it will be simple, but I am not going to repeat the entire arguments all over again here. We'll wait until TD/Rel comes out and kick the tires then. Done. -t

Ok, let's assume, for the time being, that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity. Why provide bags as an option in a Relational DBMS?

Are you talking about the query language, or the new RDBMS? A new query language that is not friendly when used with/for *existing* RDBMSes will face resistance. I see no real problem with making the DB engine bag-free, but the query language is another matter. Oracle could add an option switch to their existing engines to disallow bags as well, if it becomes a PR issue.

Assume that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity. Assume the database engine is bag-free. Is there any reason to support bags in the database language?

Well, okay, but you know I'm going to say that's a tall "assume" right now.

"Well, okay" what? Do you mean you agree there is no reason to support bags in the database language?

No. I disagree with the stated assumptions.

That's fine. I know you disagree with what the assumptions state. That's why the word "assume" is used. We can assume whatever we like, even if it is purely hypothetical, nonsensical, unreasonable, or simply unrelated. E.g., we can "assume infinite bandwidth" or "assume there exists a perfectly spherical cow of infinite density" or "assume that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity." The last one is an example of "simply unrelated". I am intentionally separating what goes on inside the DBMS from what goes on at its import/link and output boundaries.

Anyway, let's set that aside and look at my question another way: Imagine that our requirements specify, for (say) security reasons, a database environment that is self-contained. Input will be entirely via data-entry forms written in C# that connect to the DBMS via ODBC; output from the DBMS will be strictly on-screen or printed reports via some CrystalReports-like tool. I.e., this is a fairly typical desktop-oriented database-driven application environment sans import/export. If we choose a pure relational DBMS to meet our requirements, do you see any reason to provide bags in addition to relations?

In the future the requirements may change to integrate better with other RDBMSes, or they may decide to dump the non-SQL-based system and switch to SQL. CrystalReports and similar tools use SQL under the hood for most database interaction. If you only have TutorialDee, how will you interact with such tools? I suppose you could make a simple single-SELECT SQL interface that calls more complex TD under the hood, but that's a little on the silly side and may still keep you from using fancier SQL-based UIs. It's kind of silly to dump a standard used by many tools for minor shortcomings. It's almost like making your own web browser that is based on something other than HTTP because HTTP has a little something that bothers you.

Did I mention TutorialDee in this section? I did not. I wrote, "a pure relational DBMS." Such a DBMS could even support a dialect of SQL more adherent to the Standard than that provided by most commercial SQL DBMSes. CrystalReports and other tools could trivially support it. TutorialDee is not the issue here. Please, stop being evasive and answer the question.

My apologies. I didn't notice the non-statement. I just noticed you didn't mention Philadelphia either. However, it does shift the question toward how this "special" dialect of SQL deals with various issues. The different ways to go about it could probably fill up another topic, and make us each call each other another 50 unique names or so.

Please, stop being evasive and answer the question. I have highlighted it, above, in case you've forgotten it.

I don't think I can anticipate all the potential downsides to make a good estimate. I've never lived a life without the option of bags.

I see. Is it possible that some of your objections to bagless systems are based on a fear of the unknown?

No, it's based on my experience with trying to forecast problems down the road with complex systems. I can already foresee some minor annoyances working with outside systems and data having bags. There are probably things I cannot foresee, given that it's difficult to mentally compute all possible usages and query kinds. The more complex the system (more combinations, for example), the more likely things get skipped in estimates.


PageAnchor: scenario_four_alpha

This is to clarify and illustrate some of the statements in scenario 4 (above).

Boss: The mass-mailing of the thumb drives to potential clients will commence in two weeks. This is an important sales drive. However, our budget is being cut and we'll probably have to purchase the 512k drives instead of the 1-gig drives. Thus, we will probably have to dump some of the items we planned to include. Will the Cleveland sample-data query system you built fit in 512k? I need to know fairly soon how much of the other material we need to cut to make room for it.

You: Well, I may be able to remove some of the indexes, but it would slow down certain queries.

Boss: We'll look bad if too many queries run too slow, losing potential customers. Which index do you think would have the least impact on speed?

You: Let's see... well, the primary key index can possibly be tossed. It's pretty big because of the way our stats are laid out. (Sample schema already given.) It takes up almost as much space as the data itself.

Boss: What does it do?

You: It mostly just prevents accidental duplicates on our part. It does help a few kinds of queries on the first element of the key run faster, but I can replace it with a single-column index.

Boss: Is there another way to check for duplicates?

You: We could write a query set that would let us know if there are accidental duplicates. We just have to remember to run it before we send out the final cut.

Co-worker: Maybe we can switch to a different database engine that prevents duplicates without such a big index. ABC 3.0 allegedly has this property.

Boss: We already have enough database engines around here to support. When one of you leaves for that big raise, we are left training somebody in all of those. I'm hesitant to add another one. If we did go with ABC 3.0, how much time would it take?

You: It would cut it close to our deadline, I estimate. I don't recommend changing the engine this late in the game. We already did most of the testing, which would have to be repeated.

Co-worker: I'm pretty sure I can burn the midnight oil to get ABC 3.0 going before the deadline. I'd like to gain some knowledge on it anyhow so that we have another tool approach in the future.

Boss: No, I don't want to risk new problems this close to the deadline, and add another DB engine to learn and support. Tossing the primary key and remembering to run the duplication-checking utility sounds like the simplest fix with the least risk. Maybe next year we can look at alternatives if we have space problems again, but at this time we need to focus on getting the current stick ready.

Good heavens. I haven't seen juvenile "story" illustrations since junior high school.

You've given a completely contrived example, along with some unrelated editorialising. Here's a much simpler conceptual analogue:

Me: I can write our new DBMS so that it doesn't require bags.

Boss: Do it and you're fired!

Me: Ah, you're right as always -- our DBMS requires bags.

Anyway, if space is a concern (a) you can use the BTree (or other data structure) itself to store the data, and/or (b) eliminate the duplicates in the source data. Of course, if this is a query system I presume it's read-only, in which case it's probably a good idea to remove duplicates in the source data -- before putting it on the USB stick -- for performance and space reasons. Since it's read-only, we don't need any mechanisms to detect duplicates.
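Regarding point (a), some SQL DBMSes offer this directly. For instance, Oracle's index-organized tables store the rows inside the primary-key BTree rather than in a separate heap, so the key is enforced without keeping a second copy of the key columns (a sketch in Oracle syntax):

 CREATE TABLE sales_sum (
   store_id    INTEGER,
   month       DATE,
   prod_categ  VARCHAR2(20),
   repeat_cust CHAR(1),
   sales       NUMBER(12,2),
   PRIMARY KEY (store_id, month, prod_categ, repeat_cust)
 ) ORGANIZATION INDEX;
 -- The table itself is the BTree on the key: no separate heap, no separate index.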

Overhauling the DB engine (or data structure) was addressed already. And remember, we want multi-indexing in most query utilities. I'm not sure how much we'd have to reinvent if we did a hand-rolled indexing system. And as I explained elsewhere, I don't know what you mean by "contrived". It's not a technically-precise term.

Re: "Since it's read-only, we don't need any mechanisms to detect duplicates". Yeah, in other words it can be bag! Thanks for agreeing with me.

Huh? How did you get from "we don't need any mechanisms to detect duplicates" to "it can be a bag"? We can construct a database that contains no duplicates. Since it's read-only, we can't add records. If we can't add records and there are no duplicates, there are no bags.

As we already discussed somewhere, we may want to add last-minute adjustments that would make it have duplicates. I've added last-minute adjustments to similar "summary" tables many times. Ideally we'd reconstitute it, but sometimes one has deadlines. Further, a bag that happens to contain no duplicates is still a bag. It fits the definition of a bag far more than a set.

I bet some of your last-minute adjustments have introduced erroneous duplicates.

Perhaps. I and/or my boss made an economic decision, weighing risk versus time versus resources, etc. The work environment is generally such that AllRoadsLeadToBeeMinus. I usually try to run some quick check-sums to help ensure there are no glaring errors, and that if there are errors, they'd most likely be small and "local". -t
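(For instance, the kind of quick check-sum meant here might be nothing fancier than the following, run before and after an adjustment and eyeballed for large shifts; column names per the scenario-4 schema:)

 -- Quick sanity totals; compare before/after last-minute edits.
 SELECT COUNT(*)                 AS row_count,
        SUM(sales)               AS total_sales,
        COUNT(DISTINCT store_id) AS store_count
 FROM   sales_sum;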


Re: "Boss: Do it and you're fired!"

I'd be willing to bet a paycheck that 95% of all bosses don't give a fudge as long as stuff usually works. I've had a few bosses who do what I'd call "obsess" or "fixate" on certain things, but bags weren't one of them.

I see. I, however, would bet a paycheck that no matter what anyone writes, you'd find some reason to quibble with it even if quibbling adds absolutely nothing to the discussion at hand.

Projection. Your insistent set-only stance is silly and quibbly from my perspective, I would note. It's as if you're saying that your quibbles are better than my quibbles.

I defend a set-only stance because I like maximum optimisation and minimum opportunity for erroneous duplicates either in entered data or query results.

As for quibbling, did you really need to question the motives of my fictional "boss" in something that is clearly intended to be nothing more than an illustration of "story" structure?

I generally based it on what I believe is most likely to happen; what a "typical" boss would do in that circumstance. If you disagree with my probability assessment, be my guest. Just don't be rude about it.

I don't agree or disagree with your probability assessment. I simply wonder what it has to do with anything. It's as if I'd mentioned someone had a coffee in the evening as part of some tale about the evolution of the relational model, and you interjected to point out that drinking coffee in the evening will keep you awake, which is bad because you need sleep to be most effective. True, but who cares? It's irrelevant to the discussion at hand.

I'm trying to illustrate the work environment that would put one in a position similar to the scenario. In other words: what would YOU do in that person's shoes, and why? It also illustrates the limits of one's control over the decision choices/paths, per role, in a typical company. Some of your statements made me question, or at least puzzle over, your familiarity with a typical business work environment: what one has control over, what one doesn't, how one would likely communicate options to managers, and the likely responses. I'm trying to put you in a business simulator to see what your responses are likely to trigger as far as manager reaction. We often don't get to solve technical issues alone on an island; we're part of a team and organization whether we want it that way or not. Whatever solution you suggest has to be compatible with the given environment.

Sorry, you've completely lost me here. Your comment appeared to be non sequitur, and it still appears non sequitur. I can't imagine why you'd see the "boss" as anything but the slightly-comical, vaguely-Dilbertian illustration it was intended to be. I'm even less capable of imagining why you'd think your comment should represent some sort of business lesson. If that was the intent, it was even more out of place.

I'm trying my best to communicate with what is, to me, a strange foreign mind. I did not intend to fail at such communication. There seem to be some underlying assumptions that we are not communicating, and I hoped that putting it into a step-by-step situation in a "field" setting would help tease out those underlying assumptions.


Top's arguments remind me of a mistake I've seen made several times. One instance in particular caused ongoing headaches for years in a product that I worked on. We were developing a database-based shrink-wrapped Windows desktop product. It was to replace a DOS application that used flat files for data storage.

There were many improvements to the schema of the new product's data, which variously reflected lessons learned from the old product, support for features that didn't exist in the old product, and leveraging of the better technology that we were using. An important part of the application was a converter that imported data from the old application. A big stumbling block was that the old data often couldn't satisfy the constraints imposed in the new system without significant additional work, and in some cases might even have needed additional input from a user. An argument developed within the team over whether to relax constraints during the import process.

The question seemed patently absurd to me. You deal with these issues at the system boundary, so that the rest of the system can take advantage of the assurances provided by these constraints. If you don't, then the constraints have little or no value. Inexplicably, the manager (who actually was technical, if a bit out of practice, and usually made better decisions) sided with the more junior staff working on the import feature. We turned all constraint checking off during import.

We paid for this decision many times over in the following years. Even though new data could not be entered that did not satisfy constraints, the system had to be littered with code to handle the old data as a special case. Numerous bugs arose in the field and had to be "fixed" with such code. New customers weren't plagued by these problems, but old customers were, and the erosion in confidence may have prolonged the support cycle for the old product. I can't quantify the costs with any precision, but they were considerable.

If you consider "no duplicates" as a constraint, then some of Top's scenarios are exactly analogous. (In fact, some of the constraints alluded to were, exactly, uniqueness constraints.) And he is arguing for making this mistake on a grand scale. He justifies this by belittling the value of the constraint.

E.g. the term "purity" is used as if it were a disadvantage. On the contrary, reliable adherence to any constraint gives guarantees to a programmer about the state of the data, reducing the number of possible states that have to be catered to by code. This has profound beneficial effects. It's at the heart of type systems, in fact. And this topic is closely related to types. Nobody argues against the ability to have bag-like data in a system; the argument is over what should constitute a relation in a relational system. (The argument is also absurd because there's a whole body of knowledge built around relational systems, and nobody's going to redefine that in order to accommodate one person's insistence on changing a fundamental definition.)

Top also underestimates the cost of duplicates. In my experience, duplicate errors are made fairly frequently in SQL systems, and when summary operations are involved, often cause more than cosmetic problems and are hard to diagnose.
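For instance, a hypothetical minimal case: one accidentally doubled row silently inflates an aggregate, and nothing in the output hints at the cause.

 -- Hypothetical illustration of a duplicate-inflated summary.
 CREATE TABLE orders (cust_id INT, amount DECIMAL(10,2));
 INSERT INTO orders VALUES (1, 50.00), (1, 50.00), (2, 30.00); -- first row doubled by mistake
 SELECT cust_id, SUM(amount) AS total
 FROM   orders
 GROUP  BY cust_id;
 -- cust_id 1 reports 100.00 instead of 50.00. A key constraint
 -- would have rejected the duplicate at entry time.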

-- DanMuller

This reminds me that QualityIsFree.

Ultimately it's still an economic decision and your boss made a decision based on his/her best estimate of the trade-off in problems. If your organization had gone with the tighter re-import decision, it perhaps could have angered existing customers enough to sink your business. Since we don't have a parallel universe to observe that alternative in action, we can only speculate at this point what the actual costs and drawbacks of that would be.

At least the "loose data" decision allowed your company to survive, or else it wouldn't be around long enough to have developers who complain about the annoyance of the legacy features. It thus passed at least one test: it didn't sink the product.

If one applies FutureDiscounting, the path taken may be the right decision per current finance practices. Under FutureDiscounting, avoiding near-term losses often outweighs paying more in the long run. (Some argue that FutureDiscounting shouldn't apply to IT, or should at least apply less, but that's another debate.)

Windows Vista, by many accounts, sank itself into suckage lore largely because security was given higher priority than backward compatibility. Microsoft struggles with backward compatibility versus better security all the time (MicrosoftWindowsBeyondRepair). Consider another MS product: the Xbox was rushed to market with known flaws, but by many accounts this rush allowed Microsoft to get a foothold in the game market at the right time, just before competitors released their next generation of consoles.

There are SafetyGoldPlating horror stories and there are also slipshod slop horror stories. Sometimes WorseIsBetter. There is no up-front right spot on the "slop level" slider. Picking the right level on this slider requires skillful timing and domain experience.

Anyhow, these are tough decisions where it's difficult to model the future precisely and test the impact of different decisions, especially from an economic standpoint. I can respect your viewpoint and I can respect your boss's viewpoint regardless of whether I agree or not. Let's agree to respect each other on this difficult issue.

We can disagree without calling each other "sloppy" or "anal" or "detached from the organic business reality", etc. It's okay to be confident you are right, but don't be rudely insistent unless you have pure math or logic behind you. (If I violated these social rules with your content, I apologize.) However, with economic choices, pure math or logic usually is not available. Economics, like psychology, is still a "grey science".

Often in the absence of good, hard scientific tests (outcome analysis), people fill in the gaps with their personality traits and quirks, and this is possibly why such issues become so heated. Some people are naturally fastidious and others naturally loosey-goosey. Each of these groups may project these personal traits into such estimates.

(I believe I'm middle-of-the-road, at least in my shop. On this particular wiki, I'd probably be considered a bit toward the loosey-goosey side, but that may be because of my domain/specialty.)

I wasn't there for your DOS-to-Windows rewrite, and so didn't see the details that would allow me to make an informed vote, so I am not picking a side in your scenario. At least it sounds like you gave the issue good, hard thought and had probing discussions.

My main point is that the bag in scenario 4 provided a viable business option that wouldn't exist without it. (Switching to a new database engine also carries mistake risks, I would note, because one tends to make more mistakes with new tools.) If a technology option provides management with more viable options, that is generally a good thing. You may not agree with management, but that may be stepping outside of your actual role. In my book, something that supplies more options to owners and decision makers is a plus (all else being equal), and I will score it as such in my working IT decision-making models. If you wish to argue that's irrational, be my guest, but please make it a new topic. -t

Re: "Top also underestimates the cost of duplicates. In my experience, duplicate errors are made fairly frequently in SQL systems, and when summary operations are involved, often cause more than cosmetic problems and are hard to diagnose."

Well, I just don't see it that often except as a side-effect of something else. Usually if it happens, it's a result of a bad schema design, insufficient input validation, a poorly-skilled query author who shouldn't be writing queries in general, unavailable info on primary keys, or a sacrifice made for backward compatibility (similar to above). Perhaps you can present the scenarios of actual screw-ups so we can dissect what went wrong and why.

Also, if we could reboot the world and existing RDBMSs had started out with the no-bag rule, then existing tools and data sources would be set-friendly, and some of the above issues would go away. But bags and related practices have too much of a foothold already and have become a de-facto standard. QwertySyndrome wins in this case.

In short, the organic-versus-planned decision is still a grey art to measure, and anybody who pretends it's an exact science is kidding themselves.

--top

{I've been in the same position as DanMuller, having to port an old DOS-based application driven by a proprietary database to one driven by a SQL DBMS that strictly enforced constraints, including no duplicates anywhere. We ran into the same problem -- legacy data sometimes violated constraints in the new database. There was some debate over whether to relax constraints or not, but we chose not to. What won the argument was the reality that the data was a business asset whose value was increased by ensuring its correctness under the new constraints, and that relaxing constraints would result in possible long-term support and reporting difficulties which would cut into our bottom line and our clients' bottom line. We wrote a conversion program that involved the user in the conversion process to resolve ambiguities, eliminate duplicates, fill in missing data, etc. Some users did kvetch about the time the process took and the effort involved, but none genuinely complained, because we took care to inform users that the process was necessary to ensure the correctness of their valuable data and to help avoid possibly expensive problems in the future. I've never regretted the decision, especially as conversion problems would only ever be a one-time issue, whereas relaxing constraints would have meant possible ongoing problems.}

I'd argue that as technicians, we are often not qualified to make such final decisions. We'd like to think we are, but to be frank, those who eat and breathe the business side are better qualified than those who mostly eat and breathe technical issues. Our job is to present the technical options as accurately as we can. --top

{In that role, I was the CEO of the software company.}

You'd still be if you were any good :-)

{On paper, I still am. However, I essentially retired 10 years ago, moved to a foreign country, bought a massive house on the country edge of a quaint village with my girlfriend -- a former actress/model, but now working as a computer geek because it's what she loves -- and have settled down to a life of research, teaching, and amusement. I attribute almost all of my success to faithfully adhering to the pure relational model and avoiding bags.}

What about using the carrot instead of the stick? Keep the old version going for those who don't want to convert. If they really want the new features and new GUI, they will eventually decide to bite the bullet and clean up their data. (Offer free cleaning tools/reports.) Eventually most probably will. Some might argue they'd switch vendors entirely, but it's still easier to convert to something that's almost the same than to start over, as long as your product is roughly on par with the competition. Offer them a loyalty discount. I've been on the customer side of the fence, and key system upgrades can be very stressful for a company. If you make it easier for them to convert on their terms, they will be grateful. --top

{Completely off-topic. However, our clients were allowed to convert to the new system at their leisure.}

It's nice to have that option.


...Ummm, my scenario for bags is when there is something meaningful that can happen when keys are identical. That is, is there a useful "conflation" that can be taken advantage of to make a smaller table? This could be as simple as a union-merge, but things really get interesting when a more meaningful conflation function can be used. But no one seems to organize their data in ways that make use of such functionality. The only real use is when the relationships between data are so interconnected that they are better represented as a pure graph with no tables at all. Like a forest of sprouts, over towers of trees. This means it's useless for institutional data, but perfect for a WikiDb? engine. --MarkJanssen

Can you give an example of something meaningful that can happen when keys are identical? Is it something that cannot be achieved by replacing the duplicates with a {key, quantity_of_duplicates} pair? I'm not sure what you mean by "conflation function", etc.
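(Concretely, the replacement suggested here might look like the following sketch, which collapses duplicate rows into an explicit count; the table and column names are hypothetical:)

 -- Turn a bag into a set by making multiplicity explicit.
 CREATE TABLE events_set AS
 SELECT device_id, event_time, COUNT(*) AS occurrences
 FROM   events_bag
 GROUP  BY device_id, event_time;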


This page is TooBigToEdit

Indead! It confuzes my spail chekker.


See Also: ComplexityOfOutputtingDuplicateTuplesInTutorialDee, BagAtational


AprilTen NovemberTen

CategoryDatabase

