Bag Need Scenarios Re Work

This is an attempt to produce a summarized or more compact version of BagNeedScenarios. It is not necessarily intended to replace BagNeedScenarios, but rather to provide an easier-to-digest version. If the experiment succeeds, then this topic will be renamed BagNeedScenarios and the original renamed BagNeedScenariosDiscussion.

Two versions of the summary are now given due to irreconcilable differences between the pro-bag and anti-bag camps.

Summary Version A - "Anti-Bag" Camp [under construction]

1 - Limited Access to Large Log Table which may contain duplicates due to imperfect logging equipment.

Solution: The logging devices should be fixed to provide unique device ID's. (Note: If the RDBMS exposes a unique row id (e.g. Oracle's ROWIDs), then including the row id in the query is an example of this solution).
- Why it might not work: if they are difficult to access, such as weather sensors in remote areas, it may not be economical. See scenario #4 for a related statement on economics and decisions.
Solution: The mechanism that provides linkage to (or import of) external data can generate a unique id.
- Con: This can be an extra step.
- Con: Query user may not have access to create/write new work-tables (with added ID's), or the bandwidth to make a local copy within practical time limits. For example, most queries for a given user may be summary queries. Only the summed results go over the network, not each row.
  - Counter: The Query user doesn't need to create/write new work-tables (with added ID's). The bandwidth is the same either way since in either case, the exact same data in the exact same format is being sent across the network.
    - Please clarify.
- Con: Auto-gen numbers would likely change anyhow on the next copy.
  - Counter: The unique id is for internal processing. There is no need for it to persist beyond the current copy.
- Con: If the table is large, then making a local copy just to add auto-numbers may not be practical. Typical queries may only be interested in a small portion.
  - Counter: If only a small portion is of interest, then only generate ids for that portion. For data sets that are too large to include a local copy of the interesting data, one will need to break the data up into smaller portions anyway. Generate unique ids for those portions.
- Pro: Many RDBMS already provide an internal unique row key that can serve the same purpose as a generated ID for many data-sets[1]. It can be used without changing or copying the "original" table or subset. Not only does this make creating unique ids trivial (if the RDBMS's keys are exposed), it is also proof that the technique is viable.
Solution: The mechanism that provides linkage to (or import of) external data can coalesce duplicates and generate a "count" attribute.
- Con: This can be an extra step.
- Con: If the table is large, then making a local copy just to coalesce duplicates may not be practical. Typical queries may only be interested in a small portion.
  - Counter: If only a small portion is of interest, then only coalesce duplicates for that portion. For data sets that are too large to include a local copy of the interesting data, one will need to break the data up into smaller portions anyway. Coalesce duplicates for those portions.
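The "generate a unique id on import" solution can be sketched as follows. This is only an illustration under stated assumptions: the function name with_surrogate_ids and the sample log rows are hypothetical, and a real import mechanism (or an exposed row id such as Oracle's ROWID) would play this role inside the DBMS rather than in client code.

```python
def with_surrogate_ids(rows, start=1):
    """Lazily pair each incoming row with a sequential id.

    Hypothetical helper: the ids are only meant for the current pass over
    the data and are not expected to persist beyond it.
    """
    for row_id, row in enumerate(rows, start=start):
        yield (row_id, *row)


# Made-up log rows; the second one is a repeated reading from a flaky logger.
log_rows = [
    ("sensor-7", "2009-03-01T12:00", 21.5),
    ("sensor-7", "2009-03-01T12:00", 21.5),
    ("sensor-9", "2009-03-01T12:00", 19.0),
]

# Only the portion of interest needs to be consumed; nothing is copied up front.
keyed = list(with_surrogate_ids(log_rows))
```

Because the helper is a generator, ids are produced only for the rows actually read, which is the point made in the counter about handling only the portion of interest.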

2 - Ad-Hoc Column Trimming for Fit

Solution: Handled by the output presentation mechanisms of the true relational DBMS or its client applications.
- Con: This can be an extra step.
- Con: The existing client software and/or query browser may not have such a mechanism outside of controlling the SELECT column list or its equivalent (which was generally sufficient with SQL).
  - Counter: A true relational DBMS isn't going to be using SQL (or other query language that allows bags). If the client is using SQL, then the translation layer (that already has to be there) can do the work of converting the set returned by the DBMS into the bag required by SQL. If the client is using the DBMS's native language, it can use EXPORT (or equivalent).
    - {Please clarify. What translation layer?}
    - The one that converts the SQL into something the true relational DBMS understands. (Whether this is another query language or an API is immaterial).
    - So it's going to "translate" from bags to sets? That could be a lot of overhead in many situations.
    - What makes you think so?
    - We discussed this already at length somewhere. Adding a new query language or service on the database server is NOT going to change the nature of the client software. Granted, if you have a custom client for a given RDBMS or query language, then of course you can add whatever the heck you want to it. My point is that EXISTING tools in the field may not have such a mechanism because they didn't need it (with SQL) because they can accept bags as output from the RDBMS.
    - I find it difficult to imagine a client that can accept a bag but can't accept a set. I find it even more difficult to imagine that a business which invests in using a true relational database system like Rel is going to be stymied by some crude, pre-existing client. More likely, having made the decision to invest in true relational technology and its attendant benefits, it will be highly motivated to appropriately deal with duplicates by eliminating them at their source and engineering appropriate output mechanisms in the unlikely (and rare) event that they need to be produced. Anyway, Rel supports emitting both relations (no duplicates) and ARRAYs (a TutorialDee structure that allows duplicates, which is typically used as an analogue to SQL's CURSOR mechanisms), so it is possible to emit bags -- without undue circumlocution -- in the unlikely (and rare) event that they need to be produced.
    - Re: "I find it difficult to imagine a client that can accept a bag but can't accept a set" - the problem is accepting bags, not accepting sets. Anything that can accept a bag can also accept a set with little or no modification, but not the other way around. As far as tossing existing tools to get the allegedly wonderful benefits of no-bag DB's, I believe you exaggerate the practical benefits of such, but I won't repeat that debate yet again here. As far as using ARRAY or what-not, that's fine as long as existing clients can accept the results, a bag. Either way you end up with a bag at the client. A SELECT statement just makes it easier, more familiar to users of internal SELECT's (less training costs/time), and less code. ARRAY/cursors is just a very round-about way to do what SELECT does easily. Don't reinvent the wheel by making it crappier. Explicit looping is anti-query-language, taking us back to the FORTRAN and COBOL days. What's wrong with a non-iterative bag-producer command for output/export reasons? Call it BAGSELECT if you want. If you want to rally against internal bags, then forbid an RDBMS from using SELECT etc. to make internal bags, but not prevent "output" bags. Use your alleged mass brilliance to adjust what already exists instead of redoing it in a verbose way. (Most RDBMS don't allow query users to define CURSOR's, by the way.) -t

3 - Contracted Delivery Columns

This is a variant of 2.

4 - Compact Sales Summary

Con: Compaction does not necessarily imply forgoing duplicate protection/elimination
- Counter: while there may be ways to compress it or switch to a different data engine if given enough time, it may not be economical and/or there could be insufficient time to re-work it.
  - Counter-Counter: that argument is a total straw-man. TopMind is projecting his ideas of database implementation on this while clearly ignorant of the subject.
    - {I changed the wording. Is it more acceptable to you now?}
Con: and the scenario borders on "you need a bag because I say you need a bag."
- Counter: Management may decide to increase risk to avoid additional time (rework delay) or labor. That's not a developer/designer's final choice. The owners and their proxies make the final economic trade-off decisions on factors such as risk, time, labor, cost, etc., not technicians. Bagginization gives them an additional choice.
  - Counter-Counter: the relationship between 'baginization' and risk, time, labor cost, et cetera is unknown. There have been no studies on the subject. Therefore, the notion that baginization allows rational trade-offs on these properties is a straight-forward lie promoted by TopMind to defend his bags.
    - If the minimum level of evidence requires an OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy, then your side has nothing either. Works both ways, bub. There is no OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy that proves that optional allowance of bags causes gobs of errors in practice. -t
    - My 'side' has not made any arguments regarding risk, time, labor cost trade-offs of 'baginization'. Anyhow, TopMind has now implied that lack of an OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy is a fine excuse for making shit up. I tag Top's response here another candidate for ObjectiveEvidenceAgainstTopDiscussion.
    - Sigh. How about if I word it: "It gives the decision makers a choice that wouldn't exist otherwise." That's an objective statement. As far as what they do with that choice, that may be up to them.
    - Heated objection moved below under PageAnchor "economic_A".

Please keep responses brief, no more than about 50 words. Link or PageAnchor to relevant longer discussions or descriptions.

Summary Version B - "Pro-Bag" Camp [under construction]

1 - Limited Access to Large Log Table which may contain duplicates due to imperfect logging equipment.

Solution: The logging devices should be fixed to provide unique device ID's. (Note: If the RDBMS exposes a unique row id (e.g. Oracle's ROWIDs), then including the row id in the query is an example of this solution).
- Why it might not work: if they are difficult to access, such as weather sensors in remote areas, it may not be economical. See scenario #4 for a related statement on economics and decisions.
Solution: The mechanism that provides linkage to (or import of) external data can generate a unique id.
- Con: This can be an extra step.
- Con: Query user may not have access to create/write new work-tables (with added ID's), or the bandwidth to make a local copy within practical time limits. For example, most queries for a given user may be summary queries. Only the summed results go over the network, not each row.
  - Counter: The Query user doesn't need to create/write new work-tables (with added ID's). The bandwidth is the same either way since in either case, the exact same data in the exact same format is being sent across the network.
    - Please clarify.
- Con: Auto-gen numbers would likely change anyhow on the next copy.
  - Counter: The unique id is for internal processing. There is no need for it to persist beyond the current copy.
  - Reply: The "current copy" may not need unique ID's either. It's a waste of resources to add them if not used.
- Con: If the table is large, then making a local copy just to add auto-numbers may not be practical. Typical queries may only be interested in a small portion.
  - Counter: If only a small portion is of interest, then only generate ids for that portion. For data sets that are too large to include a local copy of the interesting data, one will need to break the data up into smaller portions anyway. Generate unique ids for those portions.
  - Reply: Similar to above, a given "portion" may not have a need for unique ID's. Don't carry an umbrella to the golf course unless there's a reasonable chance of rain.
- Pro: Many RDBMS already provide an internal unique row key that can serve the same purpose as a generated ID for many data-sets[1]. It can be used without changing or copying the "original" table or subset. Not only does this make creating unique ids trivial (if the RDBMS's keys are exposed), it is also proof that the technique is viable.
Solution: The mechanism that provides linkage to (or import of) external data can coalesce duplicates and generate a "count" attribute.
- Con: This can be an extra step.
- Con: If the table is large, then making a local copy just to coalesce duplicates may not be practical. Typical queries may only be interested in a small portion.
  - Counter: If only a small portion is of interest, then only coalesce duplicates for that portion. For data sets that are too large to include a local copy of the interesting data, one will need to break the data up into smaller portions anyway. Coalesce duplicates for those portions.
  - Reply: The comments above about extra processing apply.
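The "coalesce duplicates and generate a count attribute" solution, and the reverse step of turning the counted set back into a bag for a client that expects one, might look roughly like this. The names coalesce_with_count and expand_to_bag and the sample rows are hypothetical; this is a sketch of the idea, not how any particular DBMS or import layer does it. It also says nothing about whether the duplicates are genuine or coincidental, which is a property of the data.

```python
from collections import Counter


def coalesce_with_count(rows):
    """Collapse identical rows into one tuple each, appending a count."""
    counts = Counter(rows)  # rows must be hashable, e.g. tuples
    return {(*row, n) for row, n in counts.items()}


def expand_to_bag(counted_rows):
    """Inverse step: repeat each row according to its trailing count."""
    for *row, n in counted_rows:
        for _ in range(n):
            yield tuple(row)


# Made-up log rows; the first two are identical.
log_rows = [
    ("sensor-7", "2009-03-01T12:00", 21.5),
    ("sensor-7", "2009-03-01T12:00", 21.5),
    ("sensor-9", "2009-03-01T12:00", 19.0),
]

counted = coalesce_with_count(log_rows)
round_trip = sorted(expand_to_bag(counted))
```

Both steps are a single pass over the rows of interest, so the cost under debate here is that pass itself (the "extra step"), not any loss of information.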

2 - Ad-Hoc Column Trimming for Fit

Solution: Handled by the output presentation mechanisms of the true relational DBMS or its client applications.
- Con: This can be an extra step.
- Con: The existing client software and/or query browser may not have such a mechanism outside of controlling the SELECT column list or its equivalent (which was generally sufficient with SQL).
  - Counter: A true relational DBMS isn't going to be using SQL (or other query language that allows bags). If the client is using SQL, then the translation layer (that already has to be there) can do the work of converting the set returned by the DBMS into the bag required by SQL. If the client is using the DBMS's native language, it can use EXPORT (or equivalent).
    - {Please clarify. What translation layer?}
    - The one that converts the SQL into something the true relational DBMS understands. (Whether this is another query language or an API is immaterial).
    - So it's going to "translate" from bags to sets? That could be a lot of overhead in many situations.
    - What makes you think so?
    - We discussed this already at length somewhere. Adding a new query language or service on the database server is NOT going to change the nature of the client software. Granted, if you have a custom client for a given RDBMS or query language, then of course you can add whatever the heck you want to it. My point is that EXISTING tools in the field may not have such a mechanism because they didn't need it (with SQL) because they can accept bags as output from the RDBMS.
    - I find it difficult to imagine a client that can accept a bag but can't accept a set. I find it even more difficult to imagine that a business which invests in using a true relational database system like Rel is going to be stymied by some crude, pre-existing client. More likely, having made the decision to invest in true relational technology and its attendant benefits, it will be highly motivated to appropriately deal with duplicates by eliminating them at their source and engineering appropriate output mechanisms in the unlikely (and rare) event that they need to be produced. Anyway, Rel supports emitting both relations (no duplicates) and ARRAYs (a TutorialDee structure that allows duplicates, which is typically used as an analogue to SQL's CURSOR mechanisms), so it is possible to emit bags -- without undue circumlocution -- in the unlikely (and rare) event that they need to be produced.
    - Continued at ComplexityOfOutputtingDuplicateTuplesInTutorialDee.

3 - Contracted Delivery Columns

This is a variant of 2.

4 - Compact Sales Summary

Con: Compaction does not necessarily imply forgoing duplicate protection/elimination. One could convert to a different database engine.
- Counter: while there may be ways to compress it or switch to a different data engine if given enough time, it may not be economical and/or there could be insufficient time to re-work it, and/or it would result in more training upon staff turnover to have more database engines supported by the shop.
Con: and the scenario borders on "you need a bag because I say you need a bag."
- Counter: Management may decide to increase risk to avoid additional time (rework delay) or labor. That's not a developer/designer's final choice. The owners and their proxies make the final economic trade-off decisions on factors such as risk, time, labor, cost, etc., not technicians. Bagginization gives them an additional choice (factor trade-off profile).
  - Counter-Counter: the relationship between 'baginization' and risk, time, labor cost, et cetera is unknown. There have been no studies on the subject. Therefore, the notion that baginization allows rational trade-offs on these properties is a straight-forward lie promoted by TopMind to defend his bags.
    - If the minimum level of evidence requires an OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy, then your side has nothing either. Works both ways, bub. There is no OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy that proves that optional allowance of bags causes gobs of errors in practice. Your viewpoint is not the default truth. -t
    - My 'side' has not made any arguments regarding risk, time, labor cost trade-offs of 'baginization'. Anyhow, TopMind has now implied that lack of an OfficialCertifiedDoubleBlindPeerReviewedPublishedStudy is a fine excuse for making shit up.
    - Sigh. Your side has implied they are not important factors. How about if I word it: "It gives the decision makers a choice that wouldn't exist otherwise." That's an objective statement. As far as what they do with that choice, that's up to them.
  - (Heated objection moved below under PageAnchor "economic_A".)

Please keep responses brief, no more than about 50 words. Link or PageAnchor to relevant longer discussions or descriptions.

Removed from above:

Counter: If only a small portion is of interest, then only generate ids for that portion. For data sets that are too large to include a local copy of the interesting data, one will need to break the data up into smaller portions anyway. Generate unique ids for those portions.
- Con: For one-off and ad-hoc queries, this may provide no net benefit. (There's no dispute that heavily-used/referenced data-sets should receive a primary key under normal circumstances.)

You've lost track of the context. You were trying to back up your claim that Bags provided an advantage somewhere. A "con" that they don't hurt sometimes hardly qualifies.

Solution: The mechanism that provides linkage to (or import of) external data can coalesce duplicates and generate a "count" attribute.
- Con: We may not know if they are truly duplicates or coincidences. The "count" technique mentioned may result in untested assumptions being made.
- Con: As the scenario is stated, there is not enough information to know whether similar records are really duplicates or coincidences. Thus, the "count" solution mentioned will not be viable.

This is a property of the data set. It's why the first solution (fix the data source) is preferred. If you can't fix it, you have to deal with that issue regardless of whether or not your data processing engine uses bags or sets.

1 - Limited Access to Large Log Table which may contain duplicates due to imperfect logging equipment.

Solution: Dismiss this as an invalid problem. It is essentially: "We need duplicates because I've assumed they are already part of our technology!" We also need pigeons in our databases to better fit with the pigeon data transport network.

Sometimes you have to deal with data sources that aren't ideal. People do make bad decisions and sometimes you're stuck with the aftermath.

The question on BagNeedScenarios is whether the technology should support people in making these 'bad decisions'. Unless you can show that bags were 'helpful or necessary' in creating Scenario 1 in the first place, it's a fine example of circular reasoning: you're assuming your conclusion in order to prove your conclusion.

The question (as I read it) was whether or not you needed to support bags inside the data management system. The issue raised by scenario 1 is what you do when an external source of data is a bag. To me, that is a real concern that can't be handwaved away by saying "only accept sets as data sources". --AnonymousDonor

Are you assuming there is a clear boundary on 'inside' vs. 'outside' of a 'data management system'? Scenario 1 assumes that bags are part of the data-management system - the database storing the log. I do agree that a data management system should accept many sources of data, but TopMind's argument amounts to: "we need bags because I've assumed we have bags stored in the data management system". The proof is in his further argument: all of his defenses for this point are complaints that one needs 'extra steps' to translate between data management systems, ignoring that this is subsumed by the same import step any data management system would have (including any import from relational Rel to baggy SQL). At the very least, every single one of his 'extra step' complaints is based on circular logic. Since he doesn't have any other points, my summary of the 'problem' he presents is correct and the whole scenario should be mentioned in ObjectiveEvidenceAgainstTopDiscussion.
Yes, I'm making that assumption. First, this discussion grew out of Top's claim that tools like Rel should internally include support for bags. Rel (and tools like it) have clearly defined boundaries. Second, the scenario is interesting only if the log is external to the system. If the log is internal, the scenario degenerates into the "you'll use bags because I say so". Having an external data source be a bag is plausible because you can't force others to make good decisions.
Regarding Top. Yep, the best he's done so far is to claim that there might be an extra step each time the data has to cross the boundaries of the system.

I'm not clear on the above "circular reasoning" paragraph either. The author seems to consider the scenario in which we can reboot the world and start the history of RDBMS' and related tools over. Perhaps we can evaluate each scenario under the world-reboot situation and with the as-is situation. I've generally considered the as-is situation. -t

Under TopMind's laughable excuse for logic, we would also conclude that the world needs traffic jams and smog because they are part of the situation as-is. I agree that bags are part of the 'as-is situation', but that was never under contention. The argument is that bags are necessary or useful. If TopMind wants to argue that bags are more necessary or useful than traffic jams and smog, then he can't depend upon the as-is situation; he'll need to justify bags from first principles. By pointing to bags as part of the as-is situation, one can only justify a much weaker claim: that a data management system must be able to import from bag-like data sources. And that has already been demonstrated.

Like I said above, we should evaluate BOTH the scenario where we can restart history and where we cannot restart history and only control a small corner of the world. And yes you have demonstrated that non-bag systems can deal with external bags, but sometimes with extra steps and/or overhead. In other words, there is a conversion tax. Whether this tax is "worth it" or not is the real debate, not its existence. -t

Bag Pro's and Con's Summary

Con: Sets are in theory subject to better optimization

Counter: Bags are only suggested in cases where good domain keys are not easy to obtain/use. In such situations, artificial auto-keys may not help.
Counter: Most RDBMS have an internal unique row identifier that can make for an internal de-facto set.

Con: Allowing bags risks inadvertent duplicates

Counter: bags are only suggested in limited cases, not as a universal practice.
- "[L]imited cases", eh? That would make them "marginal, rare, or even exceptional", wouldn't it?
- I don't perceive those words as meaning the same. For example, people with "limited mobility" are not statistically "rare". They are called limited in relation to a baseline reference of "normal" mobility. -t

Pro: Bags are more compatible with existing database and query tools.

Example: The SELECT clause (or its equivalent) alone can be used to exclude primary key columns as desired. If set output is required, then a secondary mechanism may be needed to provide such exclusion. Existing tools based on the existence of this trait may not have a secondary mechanism because they didn't need it. (The secondary mechanism is often referred to as a "formatter" or similar in discussion.)
- "Formatter" currently appears on 32 pages other than this one. None of them have anything to do with this topic, or any "secondary mechanism". As you've conveniently ignored, relational projection and output formatting are only related in the most superficial (and essentially irrelevant) way. Their conflation is one of the SqlFlaws.
- You missed "or similar". Anyhow, ComplexityOfOutputtingDuplicateTuplesInTutorialDee debates extensively the relationship and/or overlap, or lack thereof, between relational operations and "formatting". As a rough summary, there are benefits to letting them overlap where they may overlap, especially if bags are permitted. It's a more organic viewpoint of the relationship of tools and techniques; yours being too idealistic in my opinion. Bags simply better match the real world in many cases. -t
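To make the SELECT example concrete, here is a small sketch of what happens when a key column is left out of the column list under bag semantics versus set semantics. The project function, its keep_duplicates flag, and the sales rows are all hypothetical; the flag merely stands in for the difference between SQL's SELECT and a relational projection.

```python
def project(rows, columns, keep_duplicates):
    """Keep only the named columns from a list of dict rows.

    keep_duplicates=True mimics a bag-returning SELECT;
    keep_duplicates=False mimics a set-returning projection.
    """
    projected = [tuple(row[c] for c in columns) for row in rows]
    if keep_duplicates:
        return projected
    # Drop repeated tuples while preserving first-seen order.
    return list(dict.fromkeys(projected))


# Made-up rows; sale_id is the key column being trimmed from the output.
sales = [
    {"sale_id": 1, "product": "widget", "amount": 5},
    {"sale_id": 2, "product": "widget", "amount": 5},
    {"sale_id": 3, "product": "gadget", "amount": 7},
]

bag_result = project(sales, ["product", "amount"], keep_duplicates=True)
set_result = project(sales, ["product", "amount"], keep_duplicates=False)
```

Here bag_result keeps three rows and set_result keeps two. Whether producing the three-row form belongs to the query language or to a separate output step is exactly what the two camps dispute.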

Meta Discussion

Your "cons" are highly unconvincing. Each example (other than 3, which explains nothing) either demonstrates some technical misunderstanding, or addresses a marginal, rare, or even exceptional case compared to what DBMSs are mainly used for. Even if such circumstances do occur, the penalty for eliminating bags in the DBMS is negligible at best. However, providing support for bags inside the DBMS deprecates optimisation and invites a category of errors (inadvertent duplicates) that are almost invariably more serious than making some developer click an option or type a keyword in order to import or export a bag from a true relational DBMS.

You've said that all already and I disagreed then and I disagree now. The "bad things" that non-bags protect one from are also "marginal, rare, or even exceptional" in my experience. Those who do make multiple duplication errors tend to make many other kinds of conceptual errors also. Anyhow, let's focus on summaries here. Maybe we can re-work those arguments into a summary also. And, your optimization arguments are also suspect in practice.

I see. So now not only are you an expert in DBMS design and query optimisation, you know for a fact that inadvertent duplicates are marginal, rare, or even exceptional? What "experience" has led you to hold that view?

It likely wouldn't improve the optimization of any of the given scenarios (for things the scenario user has control over).

PageAnchor: economic_A

(TopMind actually believes this is a rational counter, as opposed to buzzword laden bullshit.)

I object to your rudeness. If you have a usable counter-argument, please present where appropriate.

(I object to your fallacy. Your "counter" wasn't usable in the first-place; it was first class HandWaving. If you have a usable counter-argument, please present it. There is no need to counter your bullshit.)

I honestly don't believe it's a fallacy; and even if it was, that's not an excuse to be rude. Tools and techniques that give users/decision-makers/owners more options are generally a good thing, even if you personally disagree with their final selection. I know you are upset with my content, but you are not helping with communication by expressing your feelings in such a way. I InviteModeration to help resolve this issue.

(Your implied claim - that 'baginization' is offering an objective (and therefore rational) trade-off decision on risk, time, labor, cost, etc. - is utter bullshit. And that earns rudeness towards you, you crank - you should be raked across coals, laughed at, treated as the 'butt' of a joke for presenting that sort of 'counter' without doing extensive research beforehand. If you want civilized discussion, you first need to play by the rules and present a reasonable argument. Instead, you befoul WikiWiki with page upon unending page of your irrational tripe. BagNeedScenarios was enough - you were shot down there, and it doesn't need 'rework'.)

I'm sorry, but I don't understand your complaint. It seems perfectly rational to me. I cannot see the computations in your head that make it clear where I am "going wrong" from your point of view. I cannot read your mind. You have to articulate the alleged problem via specific and clear text. I only see what looks like an immature, emotional response from you. Does somebody else want to make a try at it? Would an informal "economic simulation" be of any help?

Removing the primary index in that example gives one lots of extra immediate space with only a slight risk and no need to convert to and learn a different database. If that is not a legitimate benefit to at least consider as a company choice, then perhaps I really am as fucked in the head as you claim. It's flat-out common sense to me. It's just every-day obvious. I cannot make it anymore obvious. It's frustrating that you see it as somehow bad or wrong or sinister. Fuckitofia Arrrrgvile. The only rational explanation I have for your reaction is that you have an obsessive and idealistic personality which ranks "purity" above any other factor out of an inborn nature. We just view the world so differently and weigh factors so differently that we will never see eye-to-eye. And, I do believe a jury of randomly-selected practitioners would mostly side with me. -t

Revisit: Are you claiming that it does not provide such options to decision makers, or that providing those options doesn't matter? As far as "extensive research", you haven't done extensive research for your side either. The default is not your position such that me not providing extensive research does not make your position the truth by default. What kind of extensive research do you want anyhow? At least let's find out what we agree on in the economic argument before doing research that doesn't affect either side's viewpoint regardless. Clearly there's an economic and time and training cost in abandoning typical shop DB's such as Access and MySql for something that may be optimized for this particular scenario. Couldn't we try to agree on a rough figure, or is that not the kind of thing that's bothering you to begin with? Communicate. -t

Title Misnomer II

None of the above -- even allowing for a certain amount of imagination and tolerance of the highly-contrived Scenario #4 -- has identified a bag need. "Need" strongly implies that bags are required, and that without bags a particular problem cannot be solved, or that it can only be solved with bags. Are there going to be any bag need scenarios, or are there going to be more pointless quibbles over bag want scenarios?

As already discussed, the topic name is a misnomer. They are not needed in an absolute sense, for work-arounds usually exist. However, we are studying the net costs/benefits, not absolute need. Sets are not absolutely required either; they are just a useful tool when applied skillfully.

Be that as it may, you haven't even identified where bags would be desirable. At best, you've described scenarios where bags might mean eliminating a keyword or two at development time, but have entirely failed to address the fact that this increases the possibility of erroneous duplicates with no way to distinguish them from legitimate duplicates.

I disagree. I've made a decent case. LetTheReaderDecide instead of restating your disagreement over and over again. As far as the EXPORT keyword, as pointed out it would typically not be available in current app and query environments as they are set up now, and thus is not a real "fix". See below.

You seem to start with the assumption that sets are the default and only extraordinary evidence would dethrone sets. That's a false assumption. In practice, actual RDBMS vendors made bags the default. -t

Where have I started with that assumption, and what does it have to do with this page?

That's what your writing implies in indirect ways, at least by my interpretation.

Re: "(Note: If the RDBMS exposes a unique row id (e.g. Oracle's ROWIDs), then including the row id in the query is an example of this solution)."

I don't know if it's necessary for many queries. It's available for use if desired, but it would be silly to force its usage.

You aren't making any sense. Please clarify.

Let's back up a bit. What will including the ROWID solve? It can serve as a temporary surrogate primary key, but what else?

It turns the bag into a set. Therefore, there is no need for the data management engine to deal with the data as a bag internally.
Technically (internally) it may be a set, but conceptually it's still a bag. For example, if duplicate records really did get into the logger event table due to a bad logger, they'd still have unique internal row ID's, but still be "domain duplicates". -t
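
A minimal sketch of the two positions above, assuming SQLite's implicit rowid as a stand-in for Oracle's ROWID. The table name log_event, its columns, and the sample readings are invented for illustration. Including rowid makes every result row distinct, yet grouping on the domain columns alone still shows the repeated event:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical logger table; SQLite adds an implicit rowid column to it.
conn.execute("CREATE TABLE log_event (sensor TEXT, reading REAL, logged_at TEXT)")
conn.executemany(
    "INSERT INTO log_event VALUES (?, ?, ?)",
    [
        ("s1", 20.5, "2010-12-01T00:00"),
        ("s1", 20.5, "2010-12-01T00:00"),  # accidental repeat from a faulty logger
        ("s2", 18.0, "2010-12-01T00:00"),
    ],
)

# With rowid included, no two result rows are equal, so the result is a set.
with_rowid = conn.execute(
    "SELECT rowid, sensor, reading, logged_at FROM log_event"
).fetchall()

# Grouping on the domain columns alone still reveals the "domain duplicate".
domain_duplicates = conn.execute(
    "SELECT sensor, reading, logged_at, COUNT(*) FROM log_event "
    "GROUP BY sensor, reading, logged_at HAVING COUNT(*) > 1"
).fetchall()
```

Whether the second query's result counts as a bag problem or a data-quality problem is exactly what the two sides disagree on; the sketch only shows that both observations hold at once.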

Further, the existence of ROWID makes such result sets more comparable to lists than bags.

{How so?}

Every entry still has a unique "position" just by being in the data structure. However, nothing beyond that is "enforced". You can have A,B,B,C for example. This can be technically converted into a nominal set: {1,A}{2,B}{3,B}{4,C}; however, from the conceptual side of the domain, it still has the properties of a list, not a set. -t
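
The A,B,B,C conversion can be sketched in plain Python, assuming positions are numbered from 1 as in the example. The variable names are illustrative only:

```python
from collections import Counter

rows = ["A", "B", "B", "C"]  # the list from the example above

# Pairing each entry with its position yields a nominal set of unique pairs.
nominal_set = {(pos, value) for pos, value in enumerate(rows, start=1)}

# Dropping the position again shows the duplicate at the domain level.
domain_bag = Counter(value for _, value in nominal_set)
domain_set = {value for _, value in nominal_set}
```

Here nominal_set has four members, while domain_set has three and domain_bag records B twice.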

{Sorry, not following you. Not seeing any relevance, either.}

Sigh. I'll try to think of another way to say the same thing.

EXPORT Keyword Availability Problem

Moved to RelExportDiscussion

Why do you keep moving my "count" counter-argument away?

Read the statement following the new location of your "counter argument".

I'm sorry, but I cannot find a response to it, at least not one that's clearly related.

The immediately following statement is, "This is a property of the data set. It's why the first solution (fix the data source) is preferred. If you can't fix it, you have to deal with that issue regardless of whether or not your data processing engine uses bags or sets." Not sure why you couldn't find it.

Agreed, one has to "deal with it", but adding a count is not necessarily the right way to deal with it. It's a suggestion, not a fix. -t

The thing is, adding a count doesn't deal with the issue either. But, in this case, that's a good thing. We don't want the processing engine to decide what's the best way to clean up the data.

Then remove that bullet item or merge with above. Its issues may not be different from a generated key anyhow. Maybe it just needs a little re-wording to make it agreeable to both parties.

Two Summaries

The attempt at a summary is not very successful. Neither party is happy with the results and some EditWars have ensued. Instead, I propose that two summary outlines be created on the same page; one for each "side" of the debate (pro-bag versus anti-bag). This will hopefully avoid edit-wars and let each side present their viewpoint. As a good-will gesture, I'll let the pro-bag outline be first. (Dec. 2010) -top

On "Contrived"

It has been stated, or at least implied, that the examples are "contrived" and therefore "useless". They are based on actual cases with some changes to simplify the examples. But regardless, I'm not sure that being "contrived" makes them useless anyhow. For example, when writing software it may occur to me that if the user presses the Esc key at a certain spot, the program may crash. I thus may catch that situation and handle it in code. Technically, pressing the Esc key is a "contrived" scenario since it never actually happened. But that's not a decent reason to ignore that scenario. What is important is the probability of it happening. If the user of the word "contrived" wants to argue for low probability, that's fine by me. But please be clear on what is meant by "contrived". It's a somewhat vague word. -t

Top, give it a rest. I have far, far better things to do than trot out the same arguments again and again and again. Don't you?

I am convinced, more than ever, of the need for pure relational systems which will support bags at the boundaries (i.e., link to bags but represent them as relations, import bags into relations, and perhaps export bags) but never support bags inside the system. I won't waste my time arguing; I will spend my time building them. -- DaveVoorhis

Being "convinced" is the easy part. Humans come by it at a snap. I thought the discussion was an interesting way to explore the "purism versus practicality" issue that keeps popping up around here. Your language/tools could fit in better with existing systems and tools if you weren't so insistent on resisting bags/lists. "The boundary" can be a place where IT people waste a lot of time. Further, forbidding bags in a DB engine and/or query tool could be a configuration switch. Encourage purity, but don't force-feed it. As a user, I want to tell the tool what to do, not the other way around. -t

[1] Internal row numbers should not be used with "live" data, as they may be reused upon deletes. One has to know the nature of the data source to rely on them.
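
As one illustration of the reuse risk, here is a small sketch using SQLite, assuming a table declared without AUTOINCREMENT. Other engines may behave differently, and the table and values are invented. Deleting the row with the highest rowid and then inserting a new row hands the same rowid to the new row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with no explicit key, so SQLite assigns rowids itself.
conn.execute("CREATE TABLE log_event (sensor TEXT)")
conn.executemany("INSERT INTO log_event VALUES (?)", [("s1",), ("s2",), ("s3",)])

# Remember the rowid of the last row, then delete that row.
old_id = conn.execute(
    "SELECT rowid FROM log_event WHERE sensor = 's3'"
).fetchone()[0]
conn.execute("DELETE FROM log_event WHERE sensor = 's3'")

# The next insert gets max(rowid) + 1, which is the rowid just freed.
new_id = conn.execute("INSERT INTO log_event VALUES ('s4')").lastrowid
```

After this runs, old_id and new_id are equal, so anything that cached the old rowid would now point at a different event.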

NovemberTen