Bag Set Impedance Mismatch Discussion

Long, threaded material moved from BagSetImpedanceMismatch regarding how to treat "bags" received from systems outside of our control: give them an artificial key, or leave them as bags (outside of specialized temporary work sets).


Note that one may not be able to re-associate the same record with the same artificial key on a different day. In other words, if we obtain any updates in the future, any key assigned on "your side" should be considered temporary (unless a fancy value-based guessing system is put into place).

For example, if you receive a monthly copy of an Employee table, but the employee number/ID is omitted for some reason (such as legal reasons), and you assign a "working key" of 123 for a given month to a given record, there would be no easy way to assign 123 again to the same record in next month's copy. Perhaps you could match on name, but names change (marriage) and are not unique. At best one could make a guess-A-tron, but it won't be 100% reliable. And perhaps name is omitted (hidden from you) also if it's not needed for your particular task. This is similar to Scenario #3 in BagNeedScenarios, but we are on the receiving end of the data in this scenario.

[If you don't need to be able to connect the "same" employee across months, then there is no need to ensure that same working key is assigned each time. If you do need to be able to connect them, then the "guess-A-tron" is required regardless of which way this issue is decided.]

For clarification, the Guess-A-Tron was mentioned only as an option for consideration so that the impact of the scenario on option choices is understood and considered. But, it's not a "given" of the scenario. Nor is the need to re-match keys across cycles. Nor is the need to even assign a temporary key. The only given is that the data comes over without any known unique key and that it's used for some unstated purpose on "your side". If your recommended activity/handling varies depending on how often the data comes and what is done with it, then please state the assumed usage or period. We can consider specific cases as well as general ones.

[First of all, I didn't treat the need to re-match keys as a given (nor the need for a "guess-A-tron"). Note the use of "if" in the above. Secondly, there was a stated purpose for the artificial keys. It was "significantly improved optimisation, conceptual simplicity and guaranteed avoidance of accidental duplicates". You have to go back a couple of pages to find it, but that happens sometimes when we split off pages.]

It was just clarification. And "stated purpose" of a proposal versus the givens of the scenario are two different things. As far as "improved optimisation, conceptual simplicity and guaranteed avoidance of accidental duplicates", I disagree with those as stated, at least as a net benefit. But I already described why elsewhere. They are all situational and involve trade-offs. The purpose of these topics is to describe and weigh the trade-offs. To state them as open-and-shut in a summary area is misleading. -t


"Bag" Not Quite the Right Term

"Bag" is a semi-misnomer in practice because many if not most RDBMS have a unique internal row identification system, and often expose the row-ID to the "query user" via a special column name or function. The real issue is the "permanency" of such. In some cases it may only be unique for a given query result. If we re-query the same data at a later time, we may get a different ID for the same row (if same-ness can be determined if we had a God-view of the system). In others it's unique across a table, but new records may re-use ID's from prior deleted rows. Thus, referencing such ID's as foreign keys could present some surprises. Further, certain forms of cross-server load distribution may render the internal ID unreliable in various ways. Perhaps the term "key-unstable" can be used instead of "bag".

Adding surrogate keys to a "bag" query result set (such as to force rows to have a unique key) may also produce unreliable keys such that the same record may have a different ID upon next query. For example, a given stored procedure or database view may not have a unique key and/or may not supply a unique key for various reasons (security, efficiency, politics, app vendor rules, bad design, etc.). Thus, "bags" may not come from just key-less tables, but also from transformation services that don't "send" a primary or unique key. In medium and large companies, one often will not have the option of having such redesigned to emit a primary or unique key. One is a data user, not a data designer, for many existing systems. There is no known generic way to "auto-guess" a primary key that is stable over time for such situations.

Since most result sets are returned in a linear fashion, we can assign unique row numbers by simply using the "row count" as the primary key. However, these numbers are not stable across queries or over time. For example, if record #2 is deleted overnight, then #2 will be a different record the next day. Similarly, if we change the query criteria, the numbers will fall differently. We could study the data and perhaps come up with a fairly reliable guessing mechanism to determine a unique key in many cases, but still may never have a 100% guarantee. -top
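
A rough sketch of that "row count as working key" idea, assuming an engine with window functions and CREATE TABLE AS; the table and column names are hypothetical:

 -- Materialize a local working copy with a generated row number as its key.
 -- row_num is meaningful only for this copy; re-running the extract (or
 -- changing its criteria) may number the "same" rows differently.
 CREATE TABLE work_copy AS
 SELECT ROW_NUMBER() OVER (ORDER BY event_time, device_id) AS row_num,
        src.*
 FROM   received_events AS src;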

In such cases, a generated key does not need to be stable over time. For queries on data without a guarantee of uniqueness, a generated key only needs to be stable over the lifetime of a given query. Systems that need to work with possibly-duplicate or "keyless" data are designed to be aware of this.

How are they "aware" of it? Please clarify.

Relational systems that need to work with possibly-duplicate or "keyless" data do so by generating a key that does not need to be preserved beyond the lifetime of any given query that uses the data. Of course, a trivial objection is that if the keys are transient, you can't -- in the long term -- unambiguously identify any specific row in the source data. That's true, but in the long term you couldn't have identified any specific row anyway, because the data has duplicate rows.

That's why I suggest it may be better to live with "bags" in heterogeneous systems: there's no transient key to be mistaken for a real key, reducing the chance of somebody creating bad links. And it means fewer columns, and thus less clutter. Keep in mind this is a different issue than a "set-only" top-to-bottom system where primary keys are always enforced. This would be in a mixed environment that uses existing RDBMS conventions.

So-called "clutter" and some glancing possibility of "bad links" are negligible issues compared to the potential for incorrect query results and undetected duplication in SQL systems. That's why use of primary keys and including "SELECT DISTINCT" in every SELECT query are widely regarded as "best practice" in the SQL world. Furthermore, there are no advantages to "bags" that are worth mentioning. Their use, in any context, should be actively deprecated. To "live with 'bags'" in some fashion may currently be inevitable, but must be discouraged from becoming perpetual. Use of "bags" should be afforded all the disdain we give careless indentation in structured programs, lack of well-formedness in XML, bad spelling in general, or an ill-timed fart.

Again, I'm not addressing a new system here in which "uniqueness" can be controlled from the beginning. I'm mostly referring to working with existing systems which may not contain or provide unique keys in some cases for multiple reasons, often beyond ones control. And I do believe a "fake" key is potentially worse than no key. We'll probably just have to AgreeToDisagree and LetTheReaderDecide. My assessment of the error-probability trade-offs for query users is simply different from yours. The user mistakes caused by mistaking a temporary or query-local key for a real key or a reliable key are at least as likely as those caused by working with non-keyed result sets, in my professional assessment. If you know you are in a fog, you tend to be more cautious than if you don't know you are in a fog, yet obstacles are still hidden. You claim to be removing "undetectable duplication", but you are not. Further, SELECT DISTINCT can be a performance buster, and may still return wrong results. To use our common event log scenario, two identical events may happen at the same time, within the clock's or field's rounding error range. SELECT DISTINCT would incorrectly toss one. Formalizing a guess does not make it a non-guess.
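
To sketch that event-log point (the table and columns are hypothetical, and a dialect that accepts multi-row VALUES is assumed):

 CREATE TABLE EventLog (event_time TIMESTAMP, device_id INT, event_code CHAR(4));
 INSERT INTO EventLog VALUES
   ('2007-03-01 10:02:24', 7, 'TEMP'),  -- two genuine events land in the
   ('2007-03-01 10:02:24', 7, 'TEMP');  -- same rounded time-slot

 SELECT DISTINCT event_time, device_id, event_code FROM EventLog;  -- one row: a real event is tossed
 SELECT COUNT(*) FROM EventLog;  -- 2: the bag preserves both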

Mistaking a "fake" key for a real key is relatively harmless. It clearly doesn't appear in the source data, and at worst JOINs with nothing. My reference to "undetectable duplication" refers to the original data sets. In the absence of any key, how do you know whether two identical rows are two genuine entries, or one entry duplicated? I'm afraid I don't follow your references to fogs and guesses. Sorry.

What do you mean "doesn't appear in the data source"? Transmission error checking? See next paragraph. And it can join with stuff if one starts inadvertently referencing the pseudo-key in other tables.

What I mean by 'doesn't appear in the data source' is that the generated key doesn't appear in the data source. If the same generated key appears in multiple places, then there's nothing wrong with referencing it in other tables.

That mistake is less likely to happen if there is no key to reference (although it may depend on specific requirements, per below). Sure, "Just remember and document" can be used to avoid problems, but similar suggestions can also apply to the counter down-sides. Thus, it's not relatively harmless, as you claim.

Re: "In the absence of any key, how do you know whether two identical rows are two genuine entries, or one entry duplicated?" In bag-friendly systems/tools that preserve non-keyed records, duplicates will normally be preserved. If there is a byte burp during transmission, one wouldn't know either from pseudo-keys because they are often added at the receiving side. (Pseudo-keys shouldn't be a substitute for transmission error detection anyhow.) If one starts writing to the data set (which is a full or partial copy of the master "bag" source), sure it may create problems, but so can pseudo-keys. One would have to study the specific domain requirements to assess which is the most error-prone. If it serves a specific and known domain purpose, I have no problem with the assignment of pseudo-keys, but I won't add them out of some purity itch alone because they may be mistaken for real keys.

Let's take a little scenario to dissect further. Suppose I grab a portion of an event log from another system managed by another department as part of a monthly routine. The view I'm assigned to use emits no primary key. Now suppose I add a pseudo-key to my sub-copy because I read somewhere that it was a best practice. A colleague working on a different project is asked to research something using the log copy I made. They send off some emails that use the pseudo-key to identify the records that may be of importance to a customer/user that had a question/issue.

On the drive in to work the next morning I realize that I did the query wrong, forgetting the XYZ criteria to filter out a product type that was recently transferred to a different jurisdiction. I rerun the copy query, re-add the pseudo-key, and replace the "bad" file or table with the fixed set. I may not know that my colleague had used the pseudo-key in his email message. If there was no pseudo-key, the colleague may have used a different means to identify the record(s), such as a full row listing. I mention to him that I redid the batch, but it may not "click" for him that such an action may change the pseudo-key. Because of this, my colleague used and sent out a bad identifier. -t

If your colleague can disambiguate records with a full row listing, then the full row is the key and no generated key is needed. If there are duplicate rows, presumably your colleague will use (in your words) "a different means to identify the record(s)". In other words, your colleague will use a generated key. If you have duplicate records and need to identify specific records, there is no escaping it: You need a generated key.

Nope: there can be two identical records. Event logs just record events, not caring whether repeat events fall into the same time-slot. (Or perhaps query users are not given more detailed info that would distinguish them.) The row listing is for humans to find them again (using queries or QueryByExample-like tools), not necessarily automated processes.

Even if you use relative references on a row listing, like "the third 28 degree temperature reading from warehouse B at 10:02:24" or "the first record where Bill logs into the lab computer" or "the seventeenth row from the top of the print-out on page eight", those are all generated keys. They uniquely identify a record, row, or tuple for the lifetime of a query. In this case, the "query" might be executed entirely by human beings, but it's still a query that uses a (human-) generated key.

Who said they need to give ordinal descriptions/positions? Note that they could use a GROUP BY and a "repeat_count" column so that the records are unique by all columns, but that's extra processing, and/or an index on the concatenation of every column takes up as much space as the data itself, if not more (depending on the engine architecture). And this is assuming they have the space, permission, and resources to make and re-work their local copy(s), and that they need to dig in the data often enough to justify the extra space/steps. (The resource issues are somewhat similar to scenario #4 in BagNeedScenarios.) It's extra time and space just to satisfy the Purity Gods. I'm sorry, but it looks like anal-retentiveness to me. Keep in mind if designing the system from the beginning, I'd request an auto-number or something be put on events. However, we are talking about working with existing data and systems here over which we may have limited control.
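
A rough sketch of that GROUP BY / repeat_count workaround, again with the hypothetical EventLog, just to show the extra step involved:

 -- Collapse exact duplicates into one row per distinct value combination,
 -- carrying the multiplicity along; the result is a set keyed by all columns.
 CREATE TABLE event_log_rollup AS
 SELECT event_time, device_id, event_code, COUNT(*) AS repeat_count
 FROM   EventLog
 GROUP  BY event_time, device_id, event_code;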

Huh? I'm simply pointing out that as humans we already generate keys to identify duplicate rows in a "row listing". You wrote, "The row listing is for humans to find [records] again ..." and I responded by showing you how humans "find [records] again". Having relational systems do essentially the same thing -- generate keys to uniquely identify otherwise-duplicate rows -- is not to "satisfy the Purity Gods", but to gain the well-established benefits of relational purity: Improved automated optimisation and reduction of inadvertently conflating intentional duplication with accidental duplication.

Humans can deal with ambiguity in a domain-acceptable fashion, such as footnoting caveats on reports and emails. Computers cannot. And I don't know why you keep bringing up optimization; it's a non-issue here for reasons already given. If anything, some of your work-arounds are extra steps that are anti-optimization.

I don't know what "footnoting caveats on reports and emails" has to do with this, and I don't know why you keep dismissing optimization and the possibility of inadvertently conflating intentional duplication with accidental duplication.

In a footnote, one can say, "There were multiple records with identical values. However, we counted such as a single record for this particular report because we cannot verify whether it's actually multiple events or a communications glitch." There may be different reports that count them all. For example, if you are trying to gauge system and staff work-loads, you may want to count the duplicate events; but if you are trying to gauge actual domain events, you want to roll up duplicates and count them as a single event because past studies showed that say 60% of duplicates were due to communication glitches rather than actual events. Counting things different ways for different purposes is common, not just for bags, I would note. And ultimately it's the managers' or owners' decision, not technical staff's anyhow.
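
For example, against the same hypothetical EventLog, the two report styles differ only in whether identical rows are rolled up:

 -- Workload view: every received record counts, duplicates included.
 SELECT COUNT(*) AS records_processed FROM EventLog;

 -- Domain-event view: identical rows are counted once, with the caveat
 -- footnoted on the report as described above.
 SELECT COUNT(*) AS distinct_events
 FROM  (SELECT DISTINCT event_time, device_id, event_code FROM EventLog) AS t;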

There is nothing in a pure relational system that precludes aggregation and counting. What is expressly forbidden in a pure relational system is retaining duplicate rows as indistinguishable duplicates, or generating indistinguishable duplicate rows. This has no impact whatsoever on managers' or owners' decisions.

And optimization is a non-issue because internally most RDBMS have a unique row identifier that can be used for optimization. Sometimes it can be used and sometimes not, depending on join patterns etc. But re-working data to supply a surrogate key also costs resources, such that the net benefits need to be weighed on a case-by-case basis.

Supplying a generated key (it isn't surrogate, because if there is no key, it can't be a surrogate for something that isn't there) costs negligible resources. Maintaining duplicates anywhere within a relational system -- which makes it a non-relational system by definition -- opens up the potential both to incorporate errors and potentially generate them. That is unacceptable.

Untrue. But to explain why would repeat existing material.

I'll agree that IF we can control the entire chain of data from cradle to grave such that everything has nice keys (with unicorns, daisies, and balloons), then yes sets are more efficient than bags. However, in this particular case we are considering living with others' "dirty" data. -t

In living with "dirty" data, it is unacceptable to let the "dirty" part -- duplicates -- into the system. Otherwise, the problems are potentially propagated.

Creating fake keys has its own downsides. If you need temporary keys for a specific process, that's fine. It's just often too risky to "keep" them longer than the duration or scope of the specific process.

The downsides of retaining generated keys are straightforward management issues, and, of course, there is no impetus to retain generated keys any longer than is necessary. However, the downsides of maintaining duplicate rows inside a "bag"-based, relational-inspired system are often subtle, sometimes complex, and frequently difficult to detect. Can you guarantee that you won't inadvertently introduce erroneous duplicates in JOINs or summarisation operations when you can't control whether source data has duplicate rows or not? In a pure relational system, you can implicitly offer such guarantees of correctness. In a non-relational system -- e.g., SQL that allows duplicate rows -- you can only approach such levels of guarantee, and never quite meet them, by carefully auditing every relevant query and trying -- to the best of your ability -- to make sure they will generate correct results whether the source data has duplicate rows or not.

I've been around and in databases for more than two decades, and I don't see the level of problems you claim, at least not enough to counter the downsides of fake (unstable) keys. You disagree, with counter experience. So be it. We all have different experiences about how people will react to technologies and related ambiguities. I suspect you are more influenced by conceptual purity than by the WetWare and behavioral patterns of those human technicians using data and/or tools with such issues. As a compromise, I suggest query tools/languages that discourage the use of bags, such as requiring explicit keywords etc. to produce a bag, but I don't want tools/languages that outright forbid them, at least not when interacting with existing data and products that don't provide "clean" keys or meta-data on the keys.

Actually, it is precisely my awareness of "behavioral patterns of those human technicians using data and/or tools with such issues" that motivates me to promote systems that inherently avoid the sort of errors that humans are prone to make. I favour purity not for its own sake, but because purity -- in this case, at least -- makes it less likely for "those human technicians" to make difficult-to-detect but potentially dire mistakes.

An unstable key can likewise be "difficult to detect". You generally can't tell by looking at it. I realize there are trade-offs being made here about which risk to accept over another, but adding unstable keys is NOT the clear winner. Query developers with reasonable experience generally know to be cautious when given bags to work with, but the same is not true of unstable keys. Not having info about the primary key or knowing there is no primary key is information in itself. If I don't know the key, then I know that I don't know the key. (Did I say that right? Please excuse my Rumsfeldness.) However, I cannot tell that "stability" is missing from a key just by looking at the schema and the data (unless I happen to catch it in the act of changing). In the bag case I know what I don't know. But in the unstable key case, I can't readily tell whether it's stable, and it's not a common enough practice to make asking that question a standard behavior. And even if I did ask, there may not be an answer available. In the bag case, I have the piece of information that I don't have knowledge of the key (if it even exists), but in the unstable key case I don't know that I'm missing information.
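
About the only way to "catch it in the act" is to compare dated snapshots of the same extract on the suspect key. A rough sketch, with hypothetical snapshot tables and an illustrative comparison column (it's only a heuristic, since the non-key data may also have changed legitimately):

 -- Same key value, apparently different record contents between two runs:
 -- a hint that the key was re-dealt rather than the data legitimately updated.
 SELECT a.maybeKey
 FROM   snapshot_may  AS a
 JOIN   snapshot_june AS b ON b.maybeKey = a.maybeKey
 WHERE  a.customer_name <> b.customer_name;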

A generated key is no more "difficult to detect" or "missing information" than page and line numbers on a row listing. That's because page and line numbers are a generated key. Identifying generated keys, or dealing with inadvertent use of some generated key as if it was permanent, is trivially and obviously an order of magnitude easier to identify and less harmful than inadvertently creating erroneous duplicate rows by JOINing tables that have unexpected duplicate rows in one or more of the source tables.

Sorry, but I don't see how it is "trivially and obviously an order of magnitude easier". Please explain. Incorrect links are at least as evil as an incorrect count. And page and line numbers have been problematic in my experience. For example, items get deleted from spreadsheet task lists and some managers don't like numerical gaps. Thus, it's renumbered, and somebody often has an outdated copy at a meeting and they get confused until people figure out what happened and then everybody has a good chuckle. Similar with page numbers. Experienced managers (burned in the past) will say something like, "Make sure you have the latest copy dated such and such." Fortunately that happens on a small scale such that it's fairly easy to get back on track after a stumble. But with thousands+ of records it may not be.

There's an old saying:

  "To err is human. To really foul things up you need a computer."
You answered the issue yourself: "[T]hat happens on a small scale such that it's fairly easy to get back on track after a stumble." Misinterpreting a generated key might occasionally result in some light meeting-room embarrassment, but it doesn't result in overtly incorrect reporting. Furthermore, it's a human problem, not a technical one. (Perhaps a better choice of aphorism would have been, "To err is computational, but to really foul things up you need a human.") Next time, configure the report to exclude the generated key; the totals will be correct whether the generated key is printed on it or not.

      PRINCIPLE RK:
      If you have something that looks like a key, 
      there is a risk others will mistake it for a 
      real (stable) key and inadvertently start using 
      it as a foreign key in other tables or data sets.

Much more subtle and insidious are queries like:
 SELECT customerID, SUM(amount) AS balance FROM CustomerTransactions GROUP BY customerID
Seems innocuous, doesn't it? But what if CustomerTransactions is:
 CREATE VIEW CustomerTransactions AS 
  SELECT Customers.customerID, amount FROM Transactions, Customers 
  WHERE Transactions.customerID = Customers.customerID
It too seems innocuous. However, if duplicate rows are permitted and Customers inadvertently contains a duplicate row, 'balance' will be incorrect in the first query's result. If you don't immediately see why, then you see my point. (Try it to see why it's incorrect.) Of course, this is a trivial example. Actual occurrences in the field tend to involve complex chains of VIEWs, many JOINs, a few UNION ALLs and so on. It is precisely these kinds of errors -- which I argue are often overlooked because they're subtle -- that relational purity will help eliminate.
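
For anyone who does want to "try it", a minimal sketch with hypothetical data (assumes a dialect that accepts multi-row VALUES):

 CREATE TABLE Customers    (customerID INT, name VARCHAR(40));
 CREATE TABLE Transactions (customerID INT, amount DECIMAL(10,2));
 INSERT INTO Customers    VALUES (1, 'Acme'), (1, 'Acme');  -- accidental duplicate row
 INSERT INTO Transactions VALUES (1, 100.00), (1, 50.00);

 -- The view joins each transaction to BOTH copies of customer 1, so the
 -- first query reports balance = 300.00 instead of the correct 150.00.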

You are switching to new examples. If I had full control of the system of your example, I would indeed avoid a view that emitted duplicate keys (although there may be exceptions to the rule). This bypasses the issue that arises when one doesn't have control over the entire system from cradle to grave. Let's finish our existing scenarios before introducing more.

Actually, these queries represent precisely the scenario we've been discussing, and are a clear example of what I've been arguing all along: that allowing duplicate rows permits subtle and difficult-to-find errors. Imagine that Customers is linked to an external data source over which we have no control, and that the source can occasionally deliver duplicate rows.

Under those conditions, a system that permits duplicate rows will generate incorrect results from the first query whenever a duplicate appears in Customers. In a pure relational DBMS, by contrast, the linkage to the external data source over which we have no control forces the database developer to consider the possibility of duplicate rows and deal with them appropriately at the boundary. Furthermore, the second query in a pure relational system (i.e., SELECT DISTINCT by default) will eliminate duplicates, which means all transactions for the same customer and the same amount will be amalgamated. Therefore, the second query would have been properly designed in the first place under a pure relational system, and Customers would be linked to its external data source in a manner that prevents duplicates. Thus, under a pure relational system, results would be consistently correct whether or not there are duplicates in the source data for Customers.
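
A minimal sketch of what linking Customers to its source "in a manner that prevents duplicates" might look like, assuming a hypothetical external_customer_feed:

 CREATE TABLE Customers (customerID INT PRIMARY KEY, name VARCHAR(40));

 INSERT INTO Customers (customerID, name)
 SELECT DISTINCT customerID, name
 FROM   external_customer_feed;
 -- A repeated customerID with conflicting names now fails the key constraint
 -- loudly at load time instead of silently doubling balances downstream.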

It still appears you are introducing a new scenario with different issues and contexts. I will not entertain it until the original scenario is fully addressed.

Are we not discussing "working with existing systems which may not contain or provide unique keys in some cases for multiple reasons, often beyond ones control"? Those are your words from above, and are precisely the scenario I've illustrated with my two queries, above.

I'm still not sure what your point is. I have insufficient info about the domain environment to plot and present a course of action. In the case of accounting, if we have reason to believe that the IDs are unstable (a customer ID may not be the same domain customer upon a future query), we could refuse to do business with them because the risk of misplaced money may be higher than the owners want to deal with. Or we could build all kinds of versioning and/or tracking layers for CYA purposes to produce a reliable "artificial key"[1]. However, this ramps up the complexity of the system and costs.

The risks and costs that our managers/customers may be willing to deal with for accounting versus an event log (original scenario) may be very different. This is a similar issue to PageAnchor: scenario_four_alpha in BagNeedScenarios. The decisions about the efforts and costs of preventative features versus the risks are NOT up to the technician in non-trivial matters. Using a DB engine that "must have" a key (stable or temporary) doesn't change this equation. Artificial keys are indeed possible to make. However, that doesn't necessarily solve the issues surrounding them. Making a database's rules happy doesn't necessarily make the customer happy nor prevent all human error. You can't paper over inherent domain ambiguity simply by throwing GateKeeper rules at the DB. That's a naive stance.

Introducing artificial keys creates risks that don't exist without the artificial key, and they are often not "trivial", as you claim (Principle RK). I'm not saying that lack of a key has zero risks, only that it is a trade-off. -t

My point, as I stated above, is that (I quote myself)"[i]dentifying generated keys, or dealing with inadvertent use of some generated key as if it was permanent, is trivially and obviously an order of magnitude easier to identify and less harmful than inadvertently creating erroneous duplicate rows by JOINing tables that have unexpected duplicate rows in one or more of the source tables."

I have shown that allowing duplicates into the DBMS encourages the possibility of generating subtle errors in query results, and I have shown (indirectly) that even naively attempting to put DISTINCT in all SELECT queries does not necessarily produce correct results. I have also shown that the queries involved in promulgating such errors appear, superficially, to be correct. I maintain, therefore, that allowing duplicate rows invites errors that are subtle and difficult to detect and correct. However, mis-using a generated key is invariably straightforward to detect and correct. Furthermore, it is no different to misusing any column as a key when it shouldn't be used as a key, which makes it orthogonal to the issue of allowing duplicate rows or not.

I don't understand how you are ascertaining "trivial" and "straightforward to detect and correct". Further, it may be the case that subtle errors also produce subtle problems while blatant errors cause huge problems. A nuke going off in your city is certainly an "easy-to-detect" problem, but the cleanup can be a bear (or turn you into a bear). You seem to be arguing that a nuke is less problematic than mercury leaking into the water supply because the nuke is quicker to be noticed. And, the magnitude of link problems may vary widely per circumstance. There may not even be a way to fix bad links if not enough info was kept to reconstruct them.

Re: "it is no different to misusing any [other] column as a key" - It's the probability of making that mistake that's the sets it apart. For example, if you run a GROUP BY with an occurrence count, it's unlikely that any other column will closely resemble a key, and even if it did, most of the time the column name will make it clear it's a domain field. Thus, the chance of a given column numerically resembling a primary key (pattern) AND having a key-like column name at the same time is relatively rare compared to an artificial key. Examining the artificial key by name and data likely will offer little or no clues that it's unstable.

   /* see if candidate column resembles a key */
   SELECT maybeKey, COUNT(*) AS occurs
   FROM targetTable
   GROUP BY maybeKey
   HAVING COUNT(*) <> 1
   ORDER BY maybeKey
   /* may need tweaks for null detection */
What do you think a generated key is going to be named? PERMANENT_PRIMARY_KEY_USE_ME_USE_ME_USE_ME? I seriously doubt even the most absent-minded, incautious, neophyte database developer will misuse a generated key more than once, but inadvertent duplicate rows from a scenario like the one I've shown above -- which is purely the result of allowing duplicate rows into the DBMS -- can and do catch out seasoned professionals. Because some naive beginner might expose a generated key on a report, and a manager might refer to it, is no reason to continue allowing -- and I shall use your metaphor here -- mercury into the database drinking water in the form of duplicate rows. I see no "nuke going off" here either way, but given a choice between conditions that permit seasoned professionals to make mistakes vs conditions that permit beginners to make mistakes, I'll accept the latter every time.

You seem to believe a developer brought in to use database data is going to know it's an artificial/temporary key. I thought my prior statements made it pretty darn clear that often one doesn't know (and often can't know in their allocated time). Being "sloppy with keys" is not the primary scenario I'm describing (though it certainly does happen). As a contractor at more than a dozen organizations, I'd say roughly only a quarter give the developer (me) a reasonably-documented schema description (DataDictionary) for any given table. And it's about half for "permanent" places I've worked. Many tables have been around for 3 or more generations of DBA's etc. and documents get lost, or the DBA/architect is "too busy" to lend knowledge. You haven't described how they would know it's not a stable key.

As far as the column naming, quite often it's terse, such as "ID", "Record_Num", "Row", "Line_Num", "Tracking_ID", "Reference_Num", etc.

We've been going around and around on this with too little new progress on each round. Let's just say our experience differs as far as how likely certain mistakes are to occur and LetTheReaderDecide. I'm just calling it as I observe and remember it. For whatever reason, I saw/remember different behavioral and/or political patterns than you have. Perhaps you have better people skills and know how to coax documentation from DBA's etc. Still, that's a situational WetWare issue. Let's call it a year. --top


I've struggled a bit, to understand the nature of this long, threaded debate. Supposedly, it concerns how to treat "bags" received from systems outside of our control: give them an artificial key, or leave them as bags (outside of specialized temporary work sets). Apparently, the idea is that we're working w/"bag" query result sets, defined as result sets where rows do not have a unique key.

And, there's some kind of disagreement going on here, there are two sides to this debate. It is, perhaps, agreed that: '..there are no advantages to "bags" that are worth mentioning. Their use, in any context, should be actively deprecated. To "live with 'bags'" in some fashion may currently be inevitable, but must be discouraged from becoming perpetual. Use of "bags" should be afforded all the disdain we give..'

However, what about this point?: 'For example, a given stored procedure or database view may not have a unique key and/or may not supply a unique key for various reasons (security, efficiency, politics, app vendor rules, bad design, etc.). Thus, "bags" may not come from just key-less tables, but also from transformation services that don't "send" a primary or unique key. In medium and large companies, one often will not have the option of having such redesigned to emit a primary or unique key.'

In which case, the important point is asserted to be this: 'And I do believe a "fake" key is potentially worse than no key.' And, ok, maybe so, but to the degree that I understand this hypothetical scenario, I see the relevance of a larger point, that most developers aren’t SQL experts, and that most of the SQL that gets used is inefficient, hard to maintain, and sometimes just plain wrong. In my experience, chances are good that an application's database layer contains problems, and I can even say something about what kind. And so, although I've tried to parse through this page, I don't think, in the end, that it 'gets interesting'. Most interesting to me, in the end, is the multiplication of so many terms here, 'unreliable key', 'surrogate key', 'artificial key', 'transient key', 'temporary or query-local key', 'generated key', 'real key' ('or a reliable key'), '"fake" key', 'unstable key', 'pseudo-key'. That's not offered as a necessarily complete list (I'm inclined to count 'artificial/temporary key' separately?). And then, we get statements like this: 'Supplying a generated key (it isn't surrogate, because if there is no key..)' Good luck, man.

There are probably only two people in the world who care enough about this issue to debate it in such depth, and they did, all over this page. Or, I should say, we did. As one of the two original participants, I have to say you've given a pretty fair summary. It's unfortunate my opponent used so many different modifiers before the word "key" (I kept it mainly confined to "generated key" or "key") but I know what he meant in every case. I wouldn't expect anyone else to know, however, unless they'd spent as much time as we have -- now measurable in years -- debating the minutiae of designing and implementing alternatives to SQL.

The debate can be summed up as follows: I argue that duplicate rows should never be allowed inside a relational DBMS, and that generating a key for otherwise keyless source data is the lesser harm. Top argues that there are cases where duplicate rows ("bags") should be tolerated, and that a generated key is the greater harm.

{I should have established a common working-set of terms for such at the beginning. My apologies. I would note that it's hard to tell which topics will keep expanding and which stop growing early. My time-machine has a Microsoft glitch. Generally most refer to a single kind of key: an artificial key that is "unstable" in that the same record may have a different key on a different day. "Artificial key" is not sufficient because auto-numbers are also artificial, but are usually "stable". --top}

{I disagree with those two position summaries. This particular page is mostly about dealing with data given to us from an "outside" source that lacks a known key. And even "within", there are compromises that can be made. Further, a distinction needs to be made between existing RDBMS engines and/or app system implementations that currently use bags in the established code base, and "new" engines or RDBMS. For example, I would expect a new query language designed to replace or supplement SQL for established systems to be able to communicate with bag-centric nodes and data sources and/or API's that may supply a "stable" key. I would consider making a classification system, but scenarios are better due to all the possible combinations in my opinion. --top}

Curious. I've been arguing all along that duplicate rows should never be allowed inside a relational DBMS, and have been debating with you under the (mis?)perception that you've been arguing that there are cases where duplicate rows should be allowed inside a "relational" DBMS. Whether data sources are inside or outside, or DBMSs are new or old, etc., is -- IMHO -- rather immaterial in light of such fundamental constraints or lack thereof. The initial context was certainly about dealing with external data sources, but it only highlighted the issue re duplicates being allowed or not. If we allow external data sources to be introduced unconditionally, then we must allow duplicate rows. If we must disallow duplicate rows, then a key may have to be generated by the interface between our DBMS and the external source. The fundamental issue, therefore, is about allowing or disallowing duplicate rows.

I guess I misunderstood what you meant by "inside". I'm not even sure that "inside" and RDBMS even needs to be stated in your two scenarios. Would your recommendation for non-RDBMS usage change, and why?

The issue in this topic is how to handle data that comes to us with no known stable key, with the assumption we cannot change that fact due to organizational or other issues. That supplied data lacks a known stable key is a "given" for the sake of this topic. I added more details to the intro at the top to clarify. --top

.....

Well, then, what I make of this is that it might clear things up a bit to hammer on the distinction between a database, and not-a-database. There are perhaps borderline cases, but we're more interested in large-volume, high-transaction databases anyways, aren't we? I feel underemployed, worrying about what happens when people haven't embraced the advantages of using a real database--who are these people? Let them learn their lesson. There didn't used to be any computers, hey use a typewriter I care. So. I have truly missed the point, eh? I see that the page ends with this: 'The issue in this topic is how to handle data that comes to us with no known stable key, with the assumption we cannot change that fact due to organizational or other issues. That supplied data lacks a known stable key is a "given" for the sake of this topic. I added more details to the intro at the top to clarify.'

Answer, how to handle this data, is put it in a database. Example, you sign up at a health club, they put your info in the database. They don't wonder what the primary key is in the DMV's database, although your info may also be there. If I understand top's position correctly, and he's saying put data in a database and give it a key, then I agree. This is kind of like rebasing--the data may be in some other database, but we treat that as pure irrelevant speculation, it doesn't matter where the data came from. Envision yourself as a DBA for a company. Your company has other companies as customers. Each customer has several contacts. Each contact has several addresses and phone numbers. Each phone number has a call log. Each customer places many orders. Each order includes products. Some products have a service log. Whatever whatever..

Whatever the proposed architecture for this, rule 1 is every table has a primary key.

I note how much we're probably all agreed on (but a few reminders might clarify things a bit). Under the rubric of integrity rules, specifically general integrity rules (so to speak), the entity integrity rule is very simple. It says that primary keys cannot contain null (missing) data. I take it that the reason for this rule should be obvious. But, to review, every table must have a primary key, which uniquely identifies rows in the table. And, again, the entity integrity rule forbids nulls in primary key columns. And, a database designed according to the relational model will be efficient, predictable, well performing, self-documenting and easy to modify. Where does this leave us? I guess that I can construe this debate here, as having to do with which guidelines it is important to follow, if you do break the rules. In that case, it is taken to be irrelevant, that if you take the time to design your databases properly, you'll be rewarded with a solid application foundation on which you can build the rest of your application.

I'm focusing narrowly on the question, how it is, that if, by definition, a relational database must contain normalized tables, and to be properly normalized, a table must contain a primary key, and on all this everybody agrees, then what sort of great primary-key debate have you gotten into--what is left to discuss? What exactly are these very strong opinions about?

I had some fun about the proliferation of 'xxx-key' terms like 'surrogate keys'. That one specifically, though, by the way, does have an accepted technical meaning (this is just for review, I assume agreement here). A surrogate key is a primary key created from a meaningless value. It's specifically not created from naturally occurring data, what, again, is used as a table's primary key, is a meaningless value--that is referred to as a surrogate key.

The real point that I want to make about all these different kinds of keys that are mostly the same kind, or whatever, is let's go back to the definition of a primary key. Of course, we all agree, that it uniquely identifies each record within a table. However, I find it relevant here to point out, that this is only half the story. What is the main purpose of a primary key? Perhaps there is also no controversy on this point (although I think a reminder may turn out to be useful). The main purpose of a primary key is to relate records to additional data stored in other tables. Several other points are entailed. This is the explanation, for example, for why the primary key must remain stable--that is, you can't change the primary-key field(s). The primary key value, I'm belaboring the point, can't be changed. And, must exist when the record is created. And, must uniquely identify each record. And can't be null. This is also why it must be compact and contain the fewest possible attributes, for that matter.
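
To put the last two paragraphs in concrete terms, a sketch (identity-column syntax varies by product, and the table names are illustrative):

 CREATE TABLE customer (
   customer_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate: a meaningless, stable, compact value
   name        VARCHAR(60) NOT NULL
 );
 CREATE TABLE customer_order (
   order_id    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   customer_id INTEGER NOT NULL REFERENCES customer,              -- the key's main purpose: relating records across tables
   order_date  DATE    NOT NULL
 );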

Now, I've used the word 'must', there. Must remain stable, perhaps? mostofthetime? No, must. Okay, I see, then. MUST, is that supposed to mean that I can't break the rule (watch me)? No, must doesn't mean that you can't break the rule. It just means that your application won't adhere to the database model if you choose to break these established rules. Okay, then, if I am saying that relational database theory requires a particular condition, then should I expect the relational database system to enforce it as I'm developing my application? No. I expect to enforce it myself.

Now, some of the debate above, concerns what if an uneducated user updates a primary-key value? Well, you can, but should not, because a primary-key value shouldn't be subject to data entry errors. Because, changing the value violates a rule. But, this doesn't seem like such a big problem? But remember, you're not supposed to change the primary-key value.

I can anticipate, that somebody will point out that I'm talking about primary keys, not 'meta-alt-ctrl-shift-escape-keys'. We all agree about primary keys, that's trivial--what about the real-world vicissitudes of life? My reply to this, is that when things slip out of control, of course they fall apart. Beyond recommending that you select a value that you can control and maintain, I can only say, what you cannot control and maintain, will fall apart. There is no suspense about the outcome, really?

A key, a key, as in, if you're going to use the word key, then without even adding any modifiers here, what are we referring to? A key is immune to changes in business. A key depends on only one field. Unique (will always be unique). Stable. Compact. What is an 'unstable key'? A cry for help. A grave misunderstanding. A joke, perhaps? ru kidding me?

Upon review, I think I'm taking the single-most important issue to be good design. I do indeed suspect, that I'll be accused of missing the point of this debate. However, if the foundation is weak, so is the building. I'm more interested in how to avoid future problems and subsequent (and perhaps convoluted, like this debate) repairs. Your whole debate is about how to salvage the situation, when a simple design choice, one of the easiest ways to provide a strong, stable, yet flexible foundation, has been rejected.

I'm tempted to get into an even more fundamental question, what is a database? If we're going to build one that can be used effectively, then what is a simple definition? Do we all know/agree?

I'll try: a database is a collection of information that is organized so that it can easily be accessed, managed, and updated.

And, while we're reflecting on 'what is a database', the relational data model, is one kind--one that has a whole series of rules governing keys. Above, I defined a 'surrogate key'. Now, I would like to add, and here again I'm expecting agreement, that a 'technical key' and an 'artificial key' are two other terms for the same thing as a 'surrogate key'. This, once again, is a key for which the possible values have no obvious meaning (to the user or the data).

I think I will add, that just as a table has a name--you know, like 'part', 'customer', 'invoice', also, a database has a name. And, if it is I hope clear where I am going with this, a table is in a database. And, a key is in a table. I just don't know what is the point, if choosing a primary key is one of the most important steps in good database design, if in fact, good database design starts with the right primary key, then who cares about how to handle situations where you have not chosen an appropriate primary key? Be a loser, and loser that you are, you will be surprised, when you lose. It's not that simple? A key is a table column. That's table as in 'database table'. The key is what ensures row-level accessibility. You're trying to get by w/out one? That's what they did in Ancient Rome, also they used XVIII to write 18. You're welcome to do things the way penguins do them when they gather about the south pole, and sing songs while they work. Seen Zoolander? The files are in the computer. The FILES are IN the COMPUTER. Oh...the files are IIN the computer? Duh. Lesson two, never null. No primary key value can be null, nor can you do anything to render the primary key null. This is an inviolate rule of the relational model as supported by ANSI, of relational database management system (RDBMS) design, and of what database engine are you using? It flags as NOT NULL all columns that make up the pkey, when you designate a primary key, is what I'm guessing--what database engine are you using?

Let me put it this way, what does 'null' mean? Null is an unknown condition. And, what is unknown, if the primary key is null? How to do lookups and comparisons is what is unknown.

If you somehow have managed to fail to ensure that the primary key is never null, then I guess there is no point in adding that it should be brief, should be a simple data type, should be a nonidentifying value, because you lose. Look at the scorecard--what happened? You're still stuck on lesson 2. NEVER NULL!

BTW, I see that I've drifted into brusquely confronting some interlocutor--I really am not addressing anyone in particular, just trying to get the issue in view, just thinking out loud..peace.

{You seem to focus on the purpose of the key above whereas I'm more focused on what's likely to happen under each scenario and what the costs/benefits of each are as developers and users come, use a table, and perhaps then leave. My mental model is similar to gaming theory trees in that I try to find the branch with the least average cost: a "probability tree". This not only includes the probability of the branch, but also the weight of the potential "damage done" (score or anti-score in gaming theory). Think of it as running a bunch of simulations of the reaction(s) to the table in various office instances. Adding a column that may be mistaken for a stable key is just too big of a risk unless we know a working temporary key is going to be a common need and individuals cannot realistically make and fiddle with their own copy and cannot use the RDBMS's internal row key. In other words, it's situational. For example, suppose the intended use of the data is for aggregate statistics. We may join on aggregated categories, such as zip-code, product code, etc, and never need to join on (cross-reference) an individual row. We could do our job just fine without an artificial key. "Tables should have X because they just should out of the laws of table-ness" is not good enough in this case because adding it creates extra and unnecessary risk.} --top
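
A rough sketch of that aggregate-statistics case, with hypothetical extract tables; both sides are rolled up to zip-code before the join, so no row-level key is ever needed or referenced:

 SELECT s.zip_code, s.sale_total, c.complaint_count
 FROM  (SELECT zip_code, SUM(amount) AS sale_total
        FROM   sales_extract GROUP BY zip_code) AS s
 JOIN  (SELECT zip_code, COUNT(*) AS complaint_count
        FROM   complaint_extract GROUP BY zip_code) AS c
   ON  c.zip_code = s.zip_code;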

Now, I had typed a lot, I'd be very impressed if you had actually gone through it closely, it's a lot to ask. But if I get through with only one point--I'm thinking that the 'xxx-key' terms reproducing, like a gaming theory tree, if you will, has got to stop, for reasons which satisfy me. What I mean is, note, that I think you're using the term 'artificial key', in the Pickwickian sense--I'll quote myself: 'Now, I would like to add, and here again I'm expecting agreement, that a 'technical key' and an 'artificial key' are two other terms for the same thing as a 'surrogate key'. This, once again, is a key for which the possible values have no obvious meaning (to the user or the data).'

Can we use the term 'artificial key' correctly? Always, the 'artificial key' is the primary key. The surrogate key. The, if you like variety, the technical key. I'm being pedantic, I'm correcting you--don't get me wrong, I don't think I'm so smart, you'll have your turn..

Also, I actually had a point about what's likely to happen under each scenario: 'We all agree about primary keys, that's trivial--what about the real-world vicissitudes of life? My reply to this, is that when things slip out of control, of course they fall apart. Beyond recommending that you select a value that you can control and maintain, I can only say, what you cannot control and maintain, will fall apart. There is no suspense about the outcome, really?' And you have in your reply something about '(A)dding a column that may be mistaken for a stable key..' You describe this as a risk, but I see no risk here, I guarantee what will happen if you break it--you bought it. Alternatively, add a column that ends in '_ID', perhaps. And, you know, make it the first column. And, make it the primary key.

And, uh, also, this is 'key':

     -->make it the primary key<-- 

Now, I take it that there is no 'mistaking' this for a stable key--it IS a stable key, it's the primary key.

And, you offer a scenario as an example: 'we know a working temporary key is going to be a common need and individuals cannot realistically make and fiddle with their own copy and cannot use the RDBMS's internal row key. In other words, it's situational. For example, suppose the intended use of the data is for aggregate statistics. We may join on aggregated categories, such as zip-code, product code, etc, and never need to join on (cross-reference) an individual row. We could do our job just fine without an artificial key.'

Okay, see how I fail to comprehend the difference between a key, and a key. An artificial key is a primary key. What is a working temporary key? Is it a key, the primary key? Maybe let's review this 'common need', for what, exactly? I think, that if we're talking about users, who don't need to know the primary key--then I have no problem with this concept. I'll see you and raise you, however. I define these users who don't need to know the primary key, as users who don't need to know the primary key exists. They don't see, they don't use, they don't know about, the primary key. They use an application, and the application, ah, getme? the Application maintains the relationships behind the scenes.

What, I think, is the difference here, is I don't want to allow as thinkable, logically, any concept of keys that are not stable. Why wouldn't they be stable, because of users? Users don't know that keys exist, they don't see them, don't use them. Keys are not subject to input errors, are never null, they provide the pointer to the data. If there is no pointer to the data, this fact will mean, that there is no data being pointed to--or as I put it before, 'Null is an unknown condition. And, what is unknown, if the primary key is null? How to do lookups and comparisons is what is unknown.'

Now, specifically to your example, you're wondering about what if the job we are doing is we are joining on aggregated categories, and using the data for aggregate statistics. Okay, now, my concept of data is relatively simple compared to the heathens who still worship at the altar of the flat file. My schemas are BCNF. Or better, pal, or better. I've accepted Codd’s rules and I lead a purposeful life, so how is it, supposedly, that after just joins, and a view or two, my sql is supposedly corrupted? All logic begins with assumptions, and I'll make some here. What, are you an Econometrician? A Psychometrician? I assume that you have no such requirements.

If you do, well, research in the math stat world continues to find new algorithms and implementations, but these disciplines are designed around the cursed flat file. I guess that you are dumping some rows from a database, importing them into a session with some statistical package, generating some output, massaging this for use by a client program, putting the output in some place, rinsing, repeating...Stat algorithms tend strongly to find their expression in linear algebra, which is to say matrices. Knock yourself out. Explore the world of inferential statistics. There is a limit to my competence, and there's more to data than the RM (Relational Model).

I sympathise with your attempt to engage in what is a mere piece of a wider, ongoing discussion. From what you've written, I think you may be missing some of the context. I'll try to give it to you, in a nutshell:

My correspondent and I (that being me, DaveVoorhis, and the individual known on this wiki as "Top") are both involved in developing alternatives to SQL. Mine is the RelProject. Top's is SmeQl.

My RelProject is a faithful implementation of the database language called TutorialDee, which adheres strictly to the RelationalModel. No relation can ever have duplicate rows. That is required by the very definition of a "relation".

Top's SmeQl is inspired by the relational model, but like SQL it allows duplicate rows in tables or "bags".

Like many DBMSs, the RelProject maintains its own data storage but it can also link to time-varying external data sources. Sometimes, those external data sources -- for example, logs produced by hardware devices -- will have duplicate rows. Currently, the RelProject requires that the database developer either identify a key in the source data (if there is one), or pick an option that causes a key to be generated (if there are or might be duplicate rows.) However, a generated key will only last the lifetime of a given RelProject query. It is not a permanent key, nor could it be, because if there's no way to uniquely identify rows in the source data there's certainly no way to permanently associate particular rows with a key. I called such a key a "generated key" because it's generated for the data rather than being found in the data. Top has called it an "artificial key" or a "fake key", which is fine -- I know what he means.

Top argues that such a "generated key" is more harmful than allowing duplicate rows in the DBMS.

I argue that allowing duplicate rows in the DBMS is more harmful than a "generated key".

That, ultimately, is what this page is about.

As for your question, "I've accepted Codd’s rules and I lead a purposeful life, so how is it, supposedly, that after just joins and a view or two, my SQL is corrupted?" Simple answer: if you allow duplicate rows in your source data, then it isn't your SQL that will be corrupted, but your results. You will get (possibly unexpected) incorrect results. That is a substantial point in my argument that duplicate rows should never be allowed in a DBMS. I regard that to be a critical issue, a veritable showstopper that trumps any and all arguments to allow duplicate rows. Top disagrees.


Office Politics

There is also a "political component" to this kind of scenario. If we add a "working key" to the dataset, then whatever negative event happens with that key will fairly likely reflect back on us, the key creator. We'd co-own the problem. Whether that's "fair" may be irrelevant. Those judging us likely won't give us a full hearing and/or understand the intricacies of the issue. Typical boss: "It's not your job to augment the data; just pass it on as is." I've worked for a lot of clueless and "unavailable" managers over the years. (I know what some of you are thinking: "That's the only kind of manager who would hire an asshole like you, Top.")

If, however, a data user fouls up their own project by adding their own key or mishandles what you call "duplicates" due to lack of a key and lack of care or skill, then most of the blame will fall on them. It's their mistake, not yours. (An exception may be if they ask for a key and you reject them, but that's not the scenario, as stated.)

Thus, for the sake of argument, even if the net problem-risk magnitude caused by having the added key versus not having it were the same (which I don't agree to), the "blame flame" will be less focused on us, the data provider, if we don't add a key. If I add extra info, or what can be perceived as extra info, then I will partly inherit any fall-out from it in a political sense. This may be a somewhat selfish way to view it, but it's human nature to CYA. You may argue it's more moral to be altruistic and accept the risk and blame of such, but that won't scale to every individual. Most people go for CYA and will reduce their risk at the expense of others in such situations. It's not being dishonest; it's just not volunteering to go the extra mile to accept the risk of more political blame. -t

Having spent a goodly portion of my professional career being responsible for enterprise data and reporting, I have to say I'd far rather accept the blame and repercussions of introducing a key, than accept the blame and repercussions -- even indirectly -- for inadvertently producing an erroneous report due to unrecognised duplicate rows.

What I categorically and absolutely will not do is allow duplicate rows/tuples in relations. The RelProject is not, and never will be, BagAtational. However, should it be necessary to import, link to, or generate external bags, there are mechanisms to do so, as I've mentioned elsewhere. Thus, correctness is the default position and is explicitly promoted. However, when absolutely necessary, there are ways to accommodate duplicate rows that do not require some degraded corruption of the RelationalModel. If you feel they are not sufficiently convenient for you -- and that, ultimately, is what this seems to be about: convenience -- you are free to not use the RelProject and can choose (or develop) something else.

That sounds like a PersonalChoiceElevatedToMoralImperative. Again, I'm not condoning CYA-centric behavior, only saying that in practice most will probably follow it in my experience.

Rel is a true relational database management system, which means it adheres to the RelationalModel. There is no more PersonalChoiceElevatedToMoralImperative here than if I said I'd made a floating conveyance and you complained it was PersonalChoiceElevatedToMoralImperative because I didn't include wheels. It's a boat, Top, a boat! There are no wheels. Likewise, it's a true relational database management system. There are no bags!

Using Rel is a choice. (And Rel can be changed.)

Huh?


Thanks for the effort to clarify the nature of the debate and put things in a larger context. So, you're both involved in developing alternatives to SQL. I have some homework to do at this point--I'm considering the BagNeedScenarios, for starters. However, I'll barrel ahead, as I think I've taken sides already. I continue to cling to the principle that relational databases by design are supposed to eliminate redundancy (if it's okay to use 'redundancy' as a jargon term there--the whole idea of redundancy concerns what you should do in order to maintain data integrity--and what should you do? Store information in one place, right?). I want to label this principle as not in dispute? I take it that DaveVoorhis agrees. We all agree that there are many production databases out there with redundant data, that duplicate records are alive and well, and it seems like we (including Top) sort of agree that this is unfortunate? An important point.

This judgement affects what we do, when we find duplicate data. We remove and delete that duplicate data. Unless, I suppose, we can't come up with a method for accomplishing that.

I figure that if there's a DBA in the midst, then the DBA will be hard at work, constantly, finding duplicate records in tables, deleting/eliminating duplicate rows, in fact employing many options and ways of doing that. Let me slow down to consider this activity, which, I take it, we must assume is always going on in a maintained database. The DBA comes along, looks at, say, a 'customers_bak' table, and cleans duplicates. Or, simple example, there is a table 'duplicaterecordtable' and it contains this:

 duplicaterecordtable
 Col1 Col2
   1    1
   1    1
   1    1

And the DBA changes this to this:

 duplicaterecordtable
 Col1 Col2
   1    1

And then, the DBA moves on to do the same thing to other tables--it's what he does, it's all he does. Another example, a users table, containing this:

 users
 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley

And I leave as an exercise, how that table looks when the DBA is finished w/fixing it (I take this up, show one way to fix it below).

Now, DaveVoorhis explains that in his RelProject, well, here is a quote:

Sometimes, those external data sources -- for example, logs produced by hardware devices -- will have duplicate rows. Currently, the RelProject requires that the database developer either identify a key in the source data (if there is one), or pick an option that causes a key to be generated (if there are or might be duplicate rows.) However, a generated key will only last the lifetime of a given RelProject query. It is not a permanent key, nor could it be, because if there's no way to uniquely identify rows in the source data there's certainly no way to permanently associate particular rows with a key. I called such a key a "generated key" because it's generated for the data rather than being found in the data. Top has called it an "artificial key" or a "fake key", which is fine -- I know what he means.

Top argues that such a "generated key" is more harmful than allowing duplicate rows in the DBMS.

I argue that allowing duplicate rows in the DBMS is more harmful than a "generated key".

Now, to this I say that adding an identity column (RelProject requires this) would just be a step, part of a method for deleting duplicates. We're adding an identity column to the table in order to distinguish all rows from each other, and then it is easy to delete duplicates by using the Id column.

Take my users table:

 users
 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley

And, run this query: ALTER TABLE users ADD Id int IDENTITY(1,1)

And then you have this:

 users
 FirstName LastName Id
    Elvis  Presley  1
    Elvis  Presley  2
    Elvis  Presley  3
    Elvis  Presley  4

And then run this query: DELETE FROM users WHERE Id IN (2,3,4)

I wouldn't claim that query to be ideal; there must be a better way to do it automatically, but that might depend on which actual implementation we're using, and this is just hypothetical. So anyway, then run this query:

ALTER TABLE users DROP COLUMN Id

And I'm still inclined to leave it as an exercise, how that table looks when the DBA is finished w/fixing it.
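
As an aside, here is one way to do it "automatically", as wished for above -- a sketch only, assuming the Id column added by the ALTER TABLE step and SQL Server-style syntax to match the IDENTITY example; it keeps the lowest Id in each (FirstName, LastName) group and deletes the rest:

 -- Sketch: retain one row per (FirstName, LastName) group (the lowest Id)
 -- and delete the other rows. Assumes the Id column added above.
 DELETE FROM users
 WHERE Id NOT IN (
     SELECT MIN(Id)
     FROM users
     GROUP BY FirstName, LastName
 );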

But, in this great debate, of which here is the executive summary:

Top argues that such a "generated key" is more harmful than allowing duplicate rows in the DBMS.

I argue that allowing duplicate rows in the DBMS is more harmful than a "generated key".

I'm afraid I can't accept these as opposite positions. You ('I argue') are actually allowing duplicate rows in the DBMS--and if you're going to do that, then I'm inclined to the view that it's more harmful to delude yourself about the fact--just say that you are allowing duplicate rows. Indexing duplicate rows does not make them not duplicates, does not, as you (seem to?) claim, adhere strictly to the RelationalModel. I'm quoting you here: 'No relation can ever have duplicate rows.'

Is there some ambiguity about what this means, what counts as a duplicate row? Not, apparently, when you say this: '..a generated key..is not a permanent key, nor could it be, because if there's no way to uniquely identify rows in the source data there's certainly no way to permanently associate particular rows with a key.'

But then, what do I make of this?: '..requiring the `generated key` prevents us from accidentally multiplying garbage when performing joins or transitive closures.'

Okay, how exactly does requiring the 'generated key' prevent us from accidentally multiplying garbage, etc.? How is a generated key, to use your jargon, a panacea? Am I, for example, mistaken about what will happen if you join such a table against itself? I'll detect duplicate rows.

[No, you won't. Literally, the rows aren't duplicates if they have different keys. Consider the source data:]

 users
    First  Last
    Elvis  Presley
    Elvis  Presley

[Allowing duplicates, if you join users with users on Last, you'll get Elvis squared:]

 users
  l.First  Last    r.First
    Elvis  Presley Elvis
    Elvis  Presley Elvis
    Elvis  Presley Elvis
    Elvis  Presley Elvis

[But if you introduce a volatile ID column:]
    First  Last     VID
    Elvis  Presley  1
    Elvis  Presley  2

[And do the same:]
    l.First l.VID Last     r.VID r.First
    Elvis   1     Presley  1     Elvis
    Elvis   1     Presley  2     Elvis
    Elvis   2     Presley  1     Elvis
    Elvis   2     Presley  2     Elvis

[Look! No duplicates. Further, you now have sufficient information to make useful judgements about how this relates to the original data source - e.g., in this case we might filter out all cases where the left VID is greater than or equal to the right VID, which gives us a nice collision graph on last names:]

  SELECT * from (the last table) where l.VID < r.VID
    l.First l.VID Last     r.VID r.First
    Elvis   1     Presley  2     Elvis

[This is quite useful for aggregation tasks and various advanced computations. Of course, if we judged that the duplicates aren't meaningful, we could simply eliminate them by making the whole row a key (which is always true in RelationalModel). The reason for adding a volatile ID is that we do have some reason to believe the duplicates are meaningful at their source. E.g. TopMind has advocated using rows for logging or items on a receipt.] {I don't think I have advocated this. See above. -t}
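
To make the bracketed join example concrete, here is a rough SQL rendering -- a sketch only, not from the original discussion, assuming SQL Server-style syntax (ROW_NUMBER, temp tables) and the First/Last columns shown above:

 -- Materialise a volatile ID per row (so both sides of the self-join see
 -- the same numbering), then join on Last and keep each collision pair
 -- once, as described above.
 SELECT First, Last,
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VID
 INTO #numbered
 FROM users;

 SELECT l.First AS l_First, l.VID AS l_VID,
        l.Last  AS Last,
        r.VID   AS r_VID, r.First AS r_First
 FROM #numbered AS l
 JOIN #numbered AS r ON l.Last = r.Last
 WHERE l.VID < r.VID;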

Perhaps the judgement is made that no need has arisen to eliminate these duplicates, but that puts you in the position of saying something like this:

I argue that allowing duplicate rows in the DBMS is more harmful than allowing duplicate rows in the DBMS.

Or perhaps:

I argue that allowing duplicate rows in the DBMS is perfectly okay, as long as you don't realize that they're still duplicates, and call them something else.

I've taken a look at http://dbappbuilder.sourceforge.net/Rel.php. What I'm looking for is my definition of a key as the element that forges a relationship between two tables. Your 'generated key' does not do this, a point which you actually allow. Must we call this a key, therefore? I categorize this as a mistake--I'm being pedantic?

Is this a controversial point, that just because your 'generated key' has a unique value for each record, that does not mean that there are no duplicate records..? Right? Just because my table has a field that you figure acts as a unique key, of sorts--it's a particular kind of key, not the usual kind, but it is unique for each record--does that mean that my SQL isn't corrupt? What happens when you go to create relationships? What if you have a table listing customers and a table listing orders? Of course, the orders table does not have any reusable customer information in it. And then, you go and create a report that tracks which order goes with which customer. What happens, in this case? Repeating the word 'key' 'key' 'key' over and over again like a magic incantation, 'generated key' being a rather Orwellian concept, like true freedom: 'the generated key is unique'! It's essential that the key be unique! The GenErAtEd key is unique! I'm good! Satisfied the gods of etc., as Top would say. I don't think that's gonna make it rain. When your generated keys are unique, are you also using foreign keys? They'll be unique too. So that's good. Everything is unique. Except for the records. Oops. The keys are meaningless numbers, right? It's the records that are supposed to be unique.

In conclusion, to end where I began, I continue to cling to the principle that relational databases by design are supposed to eliminate redundancy (if it's okay to use 'redundancy' as a jargon term there--the whole idea of redundancy concerns what you should do in order to maintain data integrity--and what should you do? Store information in one place, right?).

Have I got anything, here?

A little, perhaps.

What is specifically excluded by the RelationalModel is the following:

 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
Inside a relational database, such as the RelProject, it simply can't exist. It must, by definition, either be:

 FirstName LastName
    Elvis  Presley
Or, it can be:

 FirstName LastName Id
    Elvis  Presley  1
    Elvis  Presley  2
    Elvis  Presley  3
    Elvis  Presley  4
If we project away the Id, i.e., we remove it, the RelationalModel specifies that the result must be:

 FirstName LastName
    Elvis  Presley
There is no ambiguity about duplicate rows. A duplicate row is precisely that. In an implementation of the RelationalModel, it does not exist. Ever.

Enforcing the rule that this...

 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
...cannot exist results in certain guarantees about the correctness of query results. In database systems that permit duplicate rows, those guarantees cannot be made. In particular, I've given an example of SQL queries, above, where duplicate rows may inadvertently result in erroneous query results. That cannot occur (at least, not without deliberately engineering it to be the case) in a relational database.
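
For illustration (this is not the "subtle and insidious" example referenced elsewhere on this page, and the orders table is hypothetical), here is how duplicate rows can silently inflate an aggregate under SQL's bag semantics:

 -- Hypothetical: users contains the four duplicated 'Elvis Presley' rows
 -- above; orders contains a single 10.00 order keyed (loosely) by LastName.
 SELECT SUM(o.Amount) AS total
 FROM orders o
 JOIN users  u ON u.LastName = o.LastName;
 -- The single order matches each of the four duplicate rows, so the query
 -- reports 40.00 instead of 10.00.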

Note that a generated key, such as in the following:

 FirstName LastName GeneratedID
    Elvis  Presley  1
    Elvis  Presley  2
    Elvis  Presley  3
    Elvis  Presley  4
returns stable GeneratedID values for the lifetime of a given query in the RelProject, such that if the above is retrieved as a RelVar, e.g.:

 VAR Customers EXTERNAL RELATION /* external retrieval spec goes here */ GENERATED KEY {GeneratedID};
Then the following will be true:

 (Customers JOIN Customers) = Customers
However, if we do the following:

 VAR CustomersTemp REAL RELATION {FirstName CHAR, LastName CHAR, GeneratedID INTEGER} KEY {GeneratedID};
 INSERT CustomersTemp Customers;
The result of the following will be true:

 IS_EMPTY(Customers JOIN CustomersTemp)


Okay, then what if I add something like this--don't be distracted by the familiar and various mechanisms for enforcing entity integrity: indexes, UNIQUE constraints, PRIMARY KEY constraints, triggers. Suppose that the PK values are duplicate, yet the row as a whole is not. This is obviously bad. Example: a table with Social Security Number as the primary key, and two different people (or rows) with the same number, each having unique attributes. Obviously bad. Whatever caused this, it was a malfunction. But, at least, in this case, valid unique data was placed in the row--and that's good.

It is valid data. That it does not reflect the real world, or maybe that it does, is conceptually immaterial. It's a problem for a DBA to sort out.

Next, consider what you would do if you find a few sets of duplicate PK values, and you verify that the entire row is a duplicate. In this case, what does RelProject do? Add a column? To give the row uniqueness? That doesn't give the row uniqueness; it just masks the problem..?

If the database designer has not specified a generated key, it throws an exception upon encountering an entire duplicate row.

I'm saying that there are four possibilities:
 1. duplicate keys, duplicate rows
 2. duplicate keys, unique rows
 3. unique keys, duplicate rows
 4. unique keys, unique rows

1., 2., and 3. are malfunctions..? But you want to allow 3. in preference to allowing 1. Must you allow either? Or, if you do, is this still TutorialDee?

1 cannot exist. 3 should perhaps be re-stated as "only the key is unique," but that's perfectly acceptable, reasonable, and often occurs.

I mean, relations never contain duplicate tuples.

I take it as a consequence of definition?

I've introduced a bit of jargon--what are tuples? I'm trying to remember my TutorialDee, where you have tuple types, which are like row types. Loose analogy.

I haven't given precise definitions for the fundamental concepts tuple and relation. There is a lot that I haven't discussed. However, if we agree that the relational model prohibits duplicates--if it's agreed by all that duplicate PKs are a violation of entity integrity, and should be disallowed in a relational system--there is another point, which I own as controversial? Namely, that as tables should never contain non-unique primary keys, so also, and this isn't redundant to say, tables should never contain duplicate rows.

Yes?

I've not followed anything after "I mean, relations never contain duplicate tuples." The words make sense, but the meaning eludes me. Perhaps this will help: For the sake of the discussions here, "tuple" = "row". However, table != RelVar. A table, as in SQL, can contain duplicate rows. A relation, which is the value associated with a RelVar, can never contain duplicate rows aka tuples.


Well, you are finding duplicates, and you are deciding what to do when they are found. I'd suggest deleting them. You've, it seems to me, fixated on the notion that you have an alternative way to fix the issue--you don't have to remove the duplicates. Instead, you can insert more corrupt data, because two wrongs make a right. You may not recognize your own reasoning in this..? ;-)

You've stated baldly that '"only the key is unique," but that's perfectly acceptable, reasonable, and often occurs.' Okay, I think we understand each other perfectly well, and disagree. I'm optimistic about convincing you on this point. We're getting somewhere. I don't take 'only the key is unique' to be perfectly acceptable, reasonable, and something that often occurs. Probably, I need to clarify that I'm reasoning about surrogate keys: only the surrogate key is unique? That's not acceptable; that's a duplicate row. Agreed?

Ah, you didn't state you were referring only to surrogate keys. However, it may happen (though more rarely) that a set of surrogate keys are unique but all other columns are the same. Unconditionally removing what appear to be redundant entries potentially throws away information. It might be that the duplicates actually have meaning. For example, imagine the following access log entries from Rel's link to a (hypothetical but realistic) external system, such as a router or other hardware gadget:

 Logins:
   UserID   IP           GeneratedID
   Dave     192.168.2.1            1
   Dave     192.168.2.1            2
   Dave     192.168.2.1            3
It might mean I really did log in three times, and I might want to know that I logged in not once, but three times. On the other hand, if I know the entries are redundant and I want them removed, I can simply project away the GeneratedID via the following:
 Logins {UserID, IP}
Which results in:
 UserID   IP
 Dave     192.168.2.1
Either way, I can choose how I want to deal with the data, and I won't get -- as I can with SQL -- an unpleasant surprise in the form of unexpected duplicate data causing incorrect query results.
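
For comparison, a rough SQL analogue of those two choices -- a sketch assuming a Logins table with the columns shown above:

 -- Treat the entries as redundant: collapse them, like projecting away
 -- the GeneratedID.
 SELECT DISTINCT UserID, IP FROM Logins;

 -- Treat the entries as meaningful: keep the fact that there were three.
 SELECT UserID, IP, COUNT(*) AS login_count
 FROM Logins
 GROUP BY UserID, IP;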

No, you are trading one problem for another.

What problem am I trading for another?

Size and processing. A generated key takes space and time. It may not matter in a smaller app or system, but it could in a bigger one. Having the design option of skipping auto-keys to reduce machine resource usage is a good thing. Let's say such a table is consolidated each week into another table with week, userID, IP, and count. But otherwise we don't want machine resources spent on something that does not improve the actual info in the database. An auto-key gives us nothing to serve our needs (granularity by week).
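
A sketch of that weekly consolidation, with hypothetical table and column names (raw_logins, weekly_logins, week_start) that are not from the discussion:

 -- Consolidate per week without needing an auto-key on the raw rows.
 INSERT INTO weekly_logins (week_start, UserID, IP, login_count)
 SELECT week_start, UserID, IP, COUNT(*)
 FROM raw_logins
 GROUP BY week_start, UserID, IP;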

That may have been an issue in the early days of data processing -- late 1950s to early 1960s -- but incrementing an integer is negligible processing now. The space requirements are also negligible -- query results don't normally store every row; typically one row (the current row) is in memory at a time unless the RecordSet can be navigated in reverse. Even then, whilst maintaining an integer certainly takes space, any RecordSet implementation that runs the risk of running out of memory is going to support using disk to obtain extra space, so the space consumed by a generated key is no more significant than the space required by any other column. Such negligible overhead is worth it, in order to categorically avoid -- as I noted above -- an unpleasant surprise in the form of unexpected duplicate data causing incorrect query results.

It can still be a big cost for an app approaching Facebook-sized.

Apps approaching "Facebook-sized" are a vanishingly small percentage of database-driven applications. FaceBook and its ilk are already well outside the bounds of conventional RelationalModel-inspired (typically SQL-based) database technology.

         Contributor group 1:
         "Problem type: package leakage"
         Contributor group 2: 
         "Package condition: leaking"
         Contributor group 3: (free-form)
         "The box was found to be leaking."
And what's an example of an "unexpected duplicate"?

See above starting with "Much more subtle and insidious are queries like ..."

But how would such a thing get into a simple IP tracker? Walk us through a scenario.

The simple IP tracker is merely an illustrative example. The same principle applies to more complex scenarios. It is not scenario-specific.

If you "don't use it wrong" you won't get the wrong answer. It's a trade-off between higher machine efficiency and human error risk. While I might personally optimize it against human error, other managers or colleagues may choose different. Maybe it's considered low-importance or "bonus" info such that management doesn't want to sacrifice speed in order to make it "ideal". If they are paying for it, it's their trade-off choice.

Precisely the point of my "much more subtle and insidious" example is that "don't use it wrong" isn't obvious. Clearly, there are times when achieving the utmost of machine efficiency is of paramount importance, but for the majority of tasks for which SQL is used, the overhead of maintaining duplicate-free data is negligible. The overhead of dealing with "subtle and insidious" problems is not.

But again, that tradeoff decision is up to the owner of the machines. Lecture them all you want, but it's their stuff and they can do whatever the hell they want to with it. If your system removes choices for them, they may toss your recommendation for a system that gives them choices THEY WANT. That's capitalism for good or bad. Time-Machine yourself to the Soviets if you want to be the Minister of Uniqueness Compliance.

The "owner of the machines" might insist on all manner of things -- some reasonable and some foolish -- so such requests are irrelevant here. The "owner of the machines" might also insist on using 1960s era FORTRAN instead of modern SQL because that's what he learned in university, or that all data must only be stored in text files so that it's always human-readable. (These are not made-up -- I've had clients request them.) Why give some boss's demand to store duplicate rows more weight than some boss's demand to use 1960s era FORTRAN or text files?

That seems to be an argument-by-extreme-case. I don't see our situation as being nearly as extreme. And an owner CAN use FORTRAN. Nothing is stopping him/her (although he's likely to go out of business eventually). I've heard of anti-RDBMS C shops that did everything in a file-centric way. Such didn't outright kill the companies because they probably did OTHER things right to make up for the stupid parts, which is typical: you only have to do FEWER stupid things than your competition, not be perfect in all categories. A basketball team may suck at defense but have killer offense, for example, such that they survive in the league, perhaps even better than average, like the Steve Nash era Suns. -t

Yes, exactly -- your argument is an argument-by-extreme-case. Why would someone argue in favour of storing duplicate rows in a database? That's every bit as extreme as demanding FORTRAN instead of SQL, or text files instead of an RDBMS.

Big free web-service companies like FaceBook and Google have shown they accept the fact that they may lose or corrupt the data of, say, one out of every 5,000 users to keep their service fast and cheap. It's not economical for them to make that 1/5000 happy by making their service slow or expensive or ad-heavy for the other 4,999. That trade-off is a business decision. Our job as techies is only to describe the trade-offs accurately, not make the final choice. If we were good at that, we wouldn't be middle-class cubicle techies.

So you're saying it's acceptable for a typical company to accept the fact that its DBMS may inadvertently allow the developer to generate incorrect results, that might otherwise have been trivially preventable, just to maintain the possibility of storing duplicate rows for the one out of 5,000 users that think they need them?

I'm not convinced it's always "trivial". And the owner gets to declare what's "acceptable". It's their machine and their data. I'm not God nor do I pretend to be. How about YOU go spank them for me.

Of course it's trivial. Problems caused by duplicate rows are trivially eliminated in a system that doesn't allow duplicate rows.

Again, I'm not sure what owners' demands have to do with anything here. This is a technical subject, not a political one. If you wish to discuss the politics of ProjectManagement, this is not the page for it.

It's both. Machines and logic have no goals; they are just bumps on logs. Humans supply goals. Technology often imposes tradeoffs that force the human goal maker to balance conflicting goals (such as speed, money, and short prep time).


Ok, I take it, at this point, that the agenda is to understand duplicate data. The first thing to understand, is that you might not have the necessary permissions to edit or delete records in the database, it might be 'read-only', it might be that a large number of users connect to it. At the very least, if you're going to delete records, you're going to want to be able to back up, first--so as to be able to recover deleted records. So, on that basis, I'm inclined to allow for the idea of 'generated keys'. And, generated keys do not prevent the process of finding records that contain whole or partial matching values. We've gone back and forth a bit, over my broader definition of what counts as duplicate records, that one wants to delete, that need removing..You seem resigned, that one cannot follow a set of specific rules or procedures to find and delete duplicate records accurately, but, nevertheless, you tout that using your system ('generated keys'), after you confirm that a table contains duplicate records, you can delete the unwanted data.

[Eliminating duplicates is trivial. Use `SELECT DISTINCT * from table`. We don't need to "confirm that a table contains duplicate records". We can just blindly apply this transformation, and all duplicates will be eliminated. If that was the only goal, we'd be done. But it isn't the only goal. We also want to preserve relevant information, and duplicates may be relevant information. (Given SQL's many follies, so could be position in the table!) A simple generated key can ensure we don't lose this sort of potentially relevant information accidentally. Of course, it would be better if the data source was a relation to start with, in which case we'd have sets - no ordering, no duplicates.]

Indeed. And the records in question might not be in anything recognisable as a "database", either. They might be log entries in a hardware device, for example, and therefore not capable of being deleted or recovered. There might be duplicates that have meaning -- like my three logins example above -- or the duplicates might be genuinely redundant, but this is something the machine can't know. Therefore, one cannot follow rules or procedures to delete them. It's up to a human to decide whether apparently duplicate data is meaningful and should be preserved, or meaningless and should be eliminated.


See also BagVersusSetControversyRoadmap


MarchTwelve


CategoryDiscussion

