Bag Set Impedance Mismatch Discussion

Long, threaded material moved from BagSetImpedanceMismatch regarding how to treat "bags" received from systems outside of our control: give them an artificial key, or leave them as bags (outside of specialized temporary work sets).


Note that one may not be able to re-associate the same record with the same artificial key on a different day. In other words, if we obtain any updates in the future, any key assigned on "your side" should be considered temporary (unless a fancy value-based guessing system is put into place).

For example, if you receive a monthly copy of an Employee table, but the employee number/ID is omitted for some reason (such as legal reasons), and you assign a "working key" of 123 for a given month to a given record, there would be no easy way to assign 123 again to the same record in next month's copy. Perhaps you could match on name, but names change (marriage) and are not unique. At best one could make a guess-A-tron, but it won't be 100% reliable. And perhaps name is omitted (hidden from you) also if it's not needed for your particular task. This is similar to Scenario #3 in BagNeedScenarios, but we are on the receiving end of the data in this scenario.

[If you don't need to be able to connect the "same" employee across months, then there is no need to ensure that same working key is assigned each time. If you do need to be able to connect them, then the "guess-A-tron" is required regardless of which way this issue is decided.]

For clarification, the Guess-A-Tron was mentioned only as an option for consideration so that the impact of the scenario on option choices is understood and considered. But, it's not a "given" of the scenario. Nor is the need to re-match keys across cycles. Nor is the need to even assign a temporary key. The only given is that the data comes over without any known unique key and that it's used for some unstated purpose on "your side". If your recommended activity/handling varies depending on how often the data comes and what is done with it, then please state the assumed usage or period. We can consider specific cases as well as general ones.

[First of all, I didn't treat the need to re-match keys as a given (nor the need for a "guess-A-tron"). Note the use of "if" in the above. Secondly, there was a stated purpose for the artificial keys. It was "significantly improved optimisation, conceptual simplicity and guaranteed avoidance of accidental duplicates". You have to go back a couple of pages to find it, but that happens sometimes when we split off pages.]

It was just clarification. And "stated purpose" of a proposal versus the givens of the scenario are two different things. As far as "improved optimisation, conceptual simplicity and guaranteed avoidance of accidental duplicates", I disagree with those as stated, at least as a net benefit. But I already described why elsewhere. They are all situational and involve trade-offs. The purpose of these topics is to describe and weigh the trade-offs. To state them as open-and-shut in a summary area is misleading. -t


"Bag" Not Quite the Right Term

"Bag" is a semi-misnomer in practice because many if not most RDBMS have a unique internal row identification system, and often expose the row-ID to the "query user" via a special column name or function. The real issue is the "permanency" of such. In some cases it may only be unique for a given query result. If we re-query the same data at a later time, we may get a different ID for the same row (if same-ness can be determined if we had a God-view of the system). In others it's unique across a table, but new records may re-use ID's from prior deleted rows. Thus, referencing such ID's as foreign keys could present some surprises. Further, certain forms of cross-server load distribution may render the internal ID unreliable in various ways. Perhaps the term "key-unstable" can be used instead of "bag".

Adding surrogate keys to a "bag" query result set (such as to force rows to have a unique key) may also produce unreliable keys such that the same record may have a different ID upon next query. For example, a given stored procedure or database view may not have a unique key and/or may not supply a unique key for various reasons (security, efficiency, politics, app vendor rules, bad design, etc.). Thus, "bags" may not come from just key-less tables, but also from transformation services that don't "send" a primary or unique key. In medium and large companies, one often will not have the option of having such redesigned to emit a primary or unique key. One is a data user, not a data designer, for many existing systems. There is no known generic way to "auto-guess" a primary key that is stable over time for such situations.

Since most result sets are returned in a linear fashion, we can assign unique row numbers by simply using the "row count" as the primary key. However, these numbers are not stable across queries or over time. For example, if record #2 is deleted overnight, then #2 will be a different record the next day. Similarly, if we change the query criteria, the numbers will fall differently. We could study the data and perhaps come up with a fairly reliable guessing mechanism to determine a unique key in many cases, but still may never have a 100% guarantee. -top
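
A rough sketch of that "row count as working key" idea, assuming an engine with window functions and CREATE TABLE AS; the table and column names are hypothetical:

 -- Materialize a local working copy with a generated row number as its key.
 -- row_num is meaningful only for this copy; re-running the extract (or
 -- changing its criteria) may number the "same" rows differently.
 CREATE TABLE work_copy AS
 SELECT ROW_NUMBER() OVER (ORDER BY event_time, device_id) AS row_num,
        src.*
 FROM   received_events AS src;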

In such cases, a generated key does not need to be stable over time. For queries on data without a guarantee of uniqueness, a generated key only needs to be stable over the lifetime of a given query. Systems that need to work with possibly-duplicate or "keyless" data are designed to be aware of this.

How are they "aware" of it? Please clarify.

Relational systems that need to work with possibly-duplicate or "keyless" data do so by generating a key that does not need to be preserved beyond the lifetime of any given query that uses the data. Of course, a trivial objection is that if the keys are transient, you can't -- in the long term -- unambiguously identify any specific row in the source data. That's true, but in the long term you couldn't have identified any specific row anyway, because the data has duplicate rows.

That's why I suggest it may be better to live with "bags" in heterogeneous systems: there's no transient key to be mistaken for a real key, reducing the chance of somebody creating bad links. And it means fewer columns, and thus less clutter. Keep in mind this is a different issue than a "set-only" top-to-bottom system where primary keys are always enforced. This would be in a mixed environment that uses existing RDBMS conventions.

So-called "clutter" and some glancing possibility of "bad links" are negligible issues compared to the potential for incorrect query results and undetected duplication in SQL systems. That's why use of primary keys and including "SELECT DISTINCT" in every SELECT query are widely regarded as "best practice" in the SQL world. Furthermore, there are no advantages to "bags" that are worth mentioning. Their use, in any context, should be actively deprecated. To "live with 'bags'" in some fashion may currently be inevitable, but must be discouraged from becoming perpetual. Use of "bags" should be afforded all the disdain we give careless indentation in structured programs, lack of well-formedness in XML, bad spelling in general, or an ill-timed fart.

Again, I'm not addressing a new system here in which "uniqueness" can be controlled from the beginning. I'm mostly referring to working with existing systems which may not contain or provide unique keys in some cases for multiple reasons, often beyond ones control. And I do believe a "fake" key is potentially worse than no key. We'll probably just have to AgreeToDisagree and LetTheReaderDecide. My assessment of the error-probability trade-offs for query users is simply different from yours. The user mistakes caused by mistaking a temporary or query-local key for a real key or a reliable key are at least as likely as those caused by working with non-keyed result sets, in my professional assessment. If you know you are in a fog, you tend to be more cautious than if you don't know you are in a fog, yet obstacles are still hidden. You claim to be removing "undetectable duplication", but you are not. Further, SELECT DISTINCT can be a performance buster, and may still return wrong results. To use our common event log scenario, two identical events may happen at the same time, within the clock's or field's rounding error range. SELECT DISTINCT would incorrectly toss one. Formalizing a guess does not make it a non-guess.
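
To sketch that event-log point (the table and columns are hypothetical, and a dialect that accepts multi-row VALUES is assumed):

 CREATE TABLE EventLog (event_time TIMESTAMP, device_id INT, event_code CHAR(4));
 INSERT INTO EventLog VALUES
   ('2007-03-01 10:02:24', 7, 'TEMP'),  -- two genuine events land in the
   ('2007-03-01 10:02:24', 7, 'TEMP');  -- same rounded time-slot

 SELECT DISTINCT event_time, device_id, event_code FROM EventLog;  -- one row: a real event is tossed
 SELECT COUNT(*) FROM EventLog;  -- 2: the bag preserves both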

Mistaking a "fake" key for a real key is relatively harmless. It clearly doesn't appear in the source data, and at worst JOINs with nothing. My reference to "undetectable duplication" refers to the original data sets. In the absence of any key, how do you know whether two identical rows are two genuine entries, or one entry duplicated? I'm afraid I don't follow your references to fogs and guesses. Sorry.

What do you mean "doesn't appear in the data source"? Transmission error checking? See next paragraph. And it can join with stuff if one starts inadvertently referencing the pseudo-key in other tables.

What I mean by 'doesn't appear in the data source' is that the generated key doesn't appear in the data source. If the same generated key appears in multiple places, then there's nothing wrong with referencing it in other tables.

That mistake is less likely to happen if there is no key to reference (although it may depend on specific requirements, per below). Sure, "Just remember and document" can be used to avoid problems, but similar suggestions can also apply to the counter down-sides. Thus, it's not relatively harmless, as you claim.

Re: "In the absence of any key, how do you know whether two identical rows are two genuine entries, or one entry duplicated?" In bag-friendly systems/tools that preserve non-keyed records, duplicates will normally be preserved. If there is a byte burp during transmission, one wouldn't know either from pseudo-keys because they are often added at the receiving side. (Pseudo-keys shouldn't be a substitute for transmission error detection anyhow.) If one starts writing to the data set (which is a full or partial copy of the master "bag" source), sure it may create problems, but so can pseudo-keys. One would have to study the specific domain requirements to assess which is the most error-prone. If it serves a specific and known domain purpose, I have no problem with the assignment of pseudo-keys, but I won't add them out of some purity itch alone because they may be mistaken for real keys.

Let's take a little scenario to dissect further. Suppose I grab a portion of an event log from another system managed by another department as part of a monthly routine. The view I'm assigned to use emits no primary key. Now suppose I add a pseudo-key to my sub-copy because I read somewhere that it was a best practice. A colleague working on a different project is asked to research something using the log copy I made. They send off some emails that use the pseudo-key to identify the records that may be of importance to a customer/user that had a question/issue.

On the drive in to work the next morning I realize that I did the query wrong, forgetting the XYZ criteria to filter out a product type that was recently transferred to a different jurisdiction. I rerun the copy query, re-add the pseudo-key, and replace the "bad" file or table with the fixed set. I may not know that my colleague had used the pseudo-key in his email message. If there was no pseudo-key, the colleague may have used a different means to identify the record(s), such as a full row listing. I mention to him that I redid the batch, but it may not "click" for him that such an action may change the pseudo-key. Because of this, my colleague used and sent out a bad identifier. -t

If your colleague can disambiguate records with a full row listing, then the full row is the key and no generated key is needed. If there are duplicate rows, presumably your colleague will use (in your words) "a different means to identify the record(s)". In other words, your colleague will use a generated key. If you have duplicate records and need to identify specific records, there is no escaping it: You need a generated key.

Nope: there can be two identical records. Event logs just record events, not caring whether repeat events fall into the same time-slot. (Or perhaps query users are not given more detailed info that would distinguish them.) The row listing is for humans to find them again (using queries or QueryByExample-like tools), not necessarily automated processes.

Even if you use relative references on a row listing, like "the third 28 degree temperature reading from warehouse B at 10:02:24" or "the first record where Bill logs into the lab computer" or "the seventeenth row from the top of the print-out on page eight", those are all generated keys. They uniquely identify a record, row, or tuple for the lifetime of a query. In this case, the "query" might be executed entirely by human beings, but it's still a query that uses a (human-) generated key.

Who said they need to give ordinal descriptions/positions? Note that they could use a GROUP BY and a "repeat_count" column so that the records are unique by all columns, but that's extra processing, and/or an index on the concatenation of every column takes up as much space as the data itself, if not more (depending on the engine architecture). And this is assuming they have the space, permission, and resources to make and re-work their local copy(s), and that they need to dig in the data often enough to justify the extra space/steps. (The resource issues are somewhat similar to scenario #4 in BagNeedScenarios.) It's extra time and space just to satisfy the Purity Gods. I'm sorry, but it looks like anal-retentiveness to me. Keep in mind if designing the system from the beginning, I'd request an auto-number or something be put on events. However, we are talking about working with existing data and systems here over which we may have limited control.
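
A rough sketch of that GROUP BY / repeat_count workaround, again with the hypothetical EventLog, just to show the extra step involved:

 -- Collapse exact duplicates into one row per distinct value combination,
 -- carrying the multiplicity along; the result is a set keyed by all columns.
 CREATE TABLE event_log_rollup AS
 SELECT event_time, device_id, event_code, COUNT(*) AS repeat_count
 FROM   EventLog
 GROUP  BY event_time, device_id, event_code;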

Huh? I'm simply pointing out that as humans we already generate keys to identify duplicate rows in a "row listing". You wrote, "The row listing is for humans to find [records] again ..." and I responded by showing you how humans "find [records] again". Having relational systems do essentially the same thing -- generate keys to uniquely identify otherwise-duplicate rows -- is not to "satisfy the Purity Gods", but to gain the well-established benefits of relational purity: Improved automated optimisation and reduction of inadvertently conflating intentional duplication with accidental duplication.

Humans can deal with ambiguity in a domain-acceptable fashion, such as footnoting caveats on reports and emails. Computers cannot. And I don't know why you keep bringing up optimization; it's a non-issue here for reasons already given. If anything, some of your work-arounds are extra steps that are anti-optimization.

I don't know what "footnoting caveats on reports and emails" has to do with this, and I don't know why you keep dismissing optimization and the possibility of inadvertently conflating intentional duplication with accidental duplication.

In a footnote, one can say, "There were multiple records with identical values. However, we counted such as a single record for this particular report because we cannot verify whether it's actually multiple events or a communications glitch." There may be different reports that count them all. For example, if you are trying to gauge system and staff work-loads, you may want to count the duplicate events; but if you are trying to gauge actual domain events, you want to roll up duplicates and count them as a single event because past studies showed that say 60% of duplicates were due to communication glitches rather than actual events. Counting things different ways for different purposes is common, not just for bags, I would note. And ultimately it's the managers' or owners' decision, not technical staff's anyhow.
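
For example, against the same hypothetical EventLog, the two report styles differ only in whether identical rows are rolled up:

 -- Workload view: every received record counts, duplicates included.
 SELECT COUNT(*) AS records_processed FROM EventLog;

 -- Domain-event view: identical rows are counted once, with the caveat
 -- footnoted on the report as described above.
 SELECT COUNT(*) AS distinct_events
 FROM  (SELECT DISTINCT event_time, device_id, event_code FROM EventLog) AS t;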

There is nothing in a pure relational system that precludes aggregation and counting. What is expressly forbidden in a pure relational system is retaining duplicate rows as indistinguishable duplicates, or generating indistinguishable duplicate rows. This has no impact whatsoever on managers' or owners' decisions.

And optimization is a non-issue because internally most RDBMS have a unique row identifier that can be used for optimization. Sometimes it can be used and sometimes not, depending on join patterns etc. But re-working data to supply a surrogate key also costs resources, such that the net benefits need to be weighed on a case-by-case basis.

Supplying a generated key (it isn't surrogate, because if there is no key, it can't be a surrogate for something that isn't there) costs negligible resources. Maintaining duplicates anywhere within a relational system -- which makes it a non-relational system by definition -- opens up the potential both to incorporate errors and potentially generate them. That is unacceptable.

Untrue. But to explain why would repeat existing material.

I'll agree that IF we can control the entire chain of data from cradle to grave such that everything has nice keys (with unicorns, daisies, and balloons), then yes sets are more efficient than bags. However, in this particular case we are considering living with others' "dirty" data. -t

In living with "dirty" data, it is unacceptable to let the "dirty" part -- duplicates -- into the system. Otherwise, the problems are potentially propagated.

Creating fake keys has its own downsides. If you need temporary keys for a specific process, that's fine. It's just often too risky to "keep" them longer than the duration or scope of the specific process.

The downsides of retaining generated keys are straightforward management issues, and, of course, there is no impetus to retain generated keys any longer than is necessary. However, the downsides of maintaining duplicate rows inside a "bag"-based, relational-inspired system are often subtle, sometimes complex, and frequently difficult to detect. Can you guarantee that you won't inadvertently introduce erroneous duplicates in JOINs or summarisation operations when you can't control whether source data has duplicate rows or not? In a pure relational system, you can implicitly offer such guarantees of correctness. In a non-relational system -- e.g., SQL that allows duplicate rows -- you can only approach such levels of guarantee, and never quite meet them, by carefully auditing every relevant query and trying -- to the best of your ability -- to make sure they will generate correct results whether the source data has duplicate rows or not.

I've been around and in databases for more than two decades, and I don't see the level of problems you claim, at least not enough to counter the downsides of fake (unstable) keys. You disagree, with counter experience. So be it. We all have different experiences about how people will react to technologies and related ambiguities. I suspect you are more influenced by conceptual purity than by the WetWare and behavioral patterns of those human technicians using data and/or tools with such issues. As a compromise, I suggest query tools/languages that discourage the use of bags, such as requiring explicit keywords etc. to produce a bag, but I don't want tools/languages that outright forbid them, at least not when interacting with existing data and products that don't provide "clean" keys or meta-data on the keys.

Actually, it is precisely my awareness of "behavioral patterns of those human technicians using data and/or tools with such issues" that motivates me to promote systems that inherently avoid the sort of errors that humans are prone to make. I favour purity not for its own sake, but because purity -- in this case, at least -- makes it less likely for "those human technicians" to make difficult-to-detect but potentially dire mistakes.

An unstable key can likewise be "difficult to detect". You generally can't tell by looking at it. I realize there are trade-offs being made here about which risk to accept over another, but adding unstable keys is NOT the clear winner. Query developers with reasonable experience generally know to be cautious when given bags to work with, but the same is not true of unstable keys. Not having info about the primary key or knowing there is no primary key is information in itself. If I don't know the key, then I know that I don't know the key. (Did I say that right? Please excuse my Rumsfeldness.) However, I cannot tell that "stability" is missing from a key just by looking at the schema and the data (unless I happen to catch it in the act of changing). In the bag case I know what I don't know. But in the unstable key case, I can't readily tell whether it's stable, and it's not a common enough practice to make asking that question a standard behavior. And even if I did ask, there may not be an answer available. In the bag case, I have the piece of information that I don't have knowledge of the key (if it even exists), but in the unstable key case I don't know that I'm missing information.
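
About the only way to "catch it in the act" is to compare dated snapshots of the same extract on the suspect key. A rough sketch, with hypothetical snapshot tables and an illustrative comparison column (it's only a heuristic, since the non-key data may also have changed legitimately):

 -- Same key value, apparently different record contents between two runs:
 -- a hint that the key was re-dealt rather than the data legitimately updated.
 SELECT a.maybeKey
 FROM   snapshot_may  AS a
 JOIN   snapshot_june AS b ON b.maybeKey = a.maybeKey
 WHERE  a.customer_name <> b.customer_name;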

A generated key is no more "difficult to detect" or "missing information" than page and line numbers on a row listing. That's because page and line numbers are a generated key. Identifying generated keys, or dealing with inadvertent use of some generated key as if it was permanent, is trivially and obviously an order of magnitude easier to identify and less harmful than inadvertently creating erroneous duplicate rows by JOINing tables that have unexpected duplicate rows in one or more of the source tables.

Sorry, but I don't see how it is "trivially and obviously an order of magnitude easier". Please explain. Incorrect links are at least as evil as an incorrect count. And page and line numbers have been problematic in my experience. For example, items get deleted from spreadsheet task lists and some managers don't like numerical gaps. Thus, it's renumbered, and somebody often has an outdated copy at a meeting and they get confused until people figure out what happened and then everybody has a good chuckle. Similar with page numbers. Experienced managers (burned in the past) will say something like, "Make sure you have the latest copy dated such and such." Fortunately that happens on a small scale such that it's fairly easy to get back on track after a stumble. But with thousands+ of records it may not be.

There's an old saying:

  "To err is human. To really foul things up you need a computer."
You answered the issue yourself: "[T]hat happens on a small scale such that it's fairly easy to get back on track after a stumble." Misinterpreting a generated key might occasionally result in some light meeting-room embarrassment, but it doesn't result in overtly incorrect reporting. Furthermore, it's a human problem, not a technical one. (Perhaps a better choice of aphorism would have been, "To err is computational, but to really foul things up you need a human.") Next time, configure the report to exclude the generated key; the totals will be correct whether the generated key is printed on it or not.

      PRINCIPLE RK:
      If you have something that looks like a key, 
      there is a risk others will mistake it for a 
      real (stable) key and inadvertently start using 
      it as a foreign key in other tables or data sets.

Much more subtle and insidious are queries like:
 SELECT customerID, SUM(amount) AS balance FROM CustomerTransactions GROUP BY customerID
Seems innocuous, doesn't it? But what if CustomerTransactions is:
 CREATE VIEW CustomerTransactions AS 
  SELECT Customers.customerID, amount FROM Transactions, Customers 
  WHERE Transactions.customerID = Customers.customerID
It too seems innocuous. However, if duplicate rows are permitted and Customers inadvertently contains a duplicate row, 'balance' will be incorrect in the first query's result. If you don't immediately see why, then you see my point. (Try it to see why it's incorrect.) Of course, this is a trivial example. Actual occurrences in the field tend to involve complex chains of VIEWs, many JOINs, a few UNION ALLs and so on. It is precisely these kinds of errors -- which I argue are often overlooked because they're subtle -- that relational purity will help eliminate.
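
For anyone who does want to "try it", a minimal sketch with hypothetical data (assumes a dialect that accepts multi-row VALUES):

 CREATE TABLE Customers    (customerID INT, name VARCHAR(40));
 CREATE TABLE Transactions (customerID INT, amount DECIMAL(10,2));
 INSERT INTO Customers    VALUES (1, 'Acme'), (1, 'Acme');  -- accidental duplicate row
 INSERT INTO Transactions VALUES (1, 100.00), (1, 50.00);

 -- The view joins each transaction to BOTH copies of customer 1, so the
 -- first query reports balance = 300.00 instead of the correct 150.00.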

You are switching to new examples. If I had full control of the system of your example, I would indeed avoid a view that emitted duplicate keys (although there may be exceptions to the rule). This bypasses the issue that arises when one doesn't have control over the entire system from cradle to grave. Let's finish our existing scenarios before introducing more.

Actually, these queries represent precisely the scenario we've been discussing, and are a clear example of what I've been arguing all along: that allowing duplicate rows permits subtle and difficult-to-find errors. Imagine that Customers is linked to an external data source over which we have no control, and that the source can occasionally deliver duplicate rows.

Under those conditions, a system that permits duplicate rows will generate incorrect results from the first query whenever a duplicate appears in Customers. In a pure relational DBMS, by contrast, the linkage to the external data source over which we have no control forces the database developer to consider the possibility of duplicate rows and deal with them appropriately at the boundary. Furthermore, the second query in a pure relational system (i.e., SELECT DISTINCT by default) will eliminate duplicates, which means all transactions for the same customer and the same amount will be amalgamated. Therefore, the second query would have been properly designed in the first place under a pure relational system, and Customers would be linked to its external data source in a manner that prevents duplicates. Thus, under a pure relational system, results would be consistently correct whether or not there are duplicates in the source data for Customers.
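
A minimal sketch of what linking Customers to its source "in a manner that prevents duplicates" might look like, assuming a hypothetical external_customer_feed:

 CREATE TABLE Customers (customerID INT PRIMARY KEY, name VARCHAR(40));

 INSERT INTO Customers (customerID, name)
 SELECT DISTINCT customerID, name
 FROM   external_customer_feed;
 -- A repeated customerID with conflicting names now fails the key constraint
 -- loudly at load time instead of silently doubling balances downstream.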

It still appears you are introducing a new scenario with different issues and contexts. I will not entertain it until the original scenario is fully addressed.

Are we not discussing "working with existing systems which may not contain or provide unique keys in some cases for multiple reasons, often beyond ones control"? Those are your words from above, and are precisely the scenario I've illustrated with my two queries, above.

I'm still not sure what your point is. I have insufficient info about the domain environment to plot and present a course of action. In the case of accounting, if we have reason to believe that the IDs are unstable (a customer ID may not be the same domain customer upon a future query), we could refuse to do business with them because the risk of misplaced money may be higher than the owners want to deal with. Or we could build all kinds of versioning and/or tracking layers for CYA purposes to produce a reliable "artificial key"[1]. However, this ramps up the complexity of the system and costs.

The risks and costs that our managers/customers may be willing to deal with for accounting versus an event log (original scenario) may be very different. This is a similar issue to PageAnchor: scenario_four_alpha in BagNeedScenarios. The decisions about the efforts and costs of preventative features versus the risks are NOT up to the technician in non-trivial matters. Using a DB engine that "must have" a key (stable or temporary) doesn't change this equation. Artificial keys are indeed possible to make. However, that doesn't necessarily solve the issues surrounding them. Making a database's rules happy doesn't necessarily make the customer happy nor prevent all human error. You can't paper over inherent domain ambiguity simply by throwing GateKeeper rules at the DB. That's a naive stance.

Introducing artificial keys creates risks that don't exist without the artificial key, and they are often not "trivial", as you claim (Principle RK). I'm not saying that lack of a key has zero risks, only that it is a trade-off. -t

My point, as I stated above, is that (I quote myself)"[i]dentifying generated keys, or dealing with inadvertent use of some generated key as if it was permanent, is trivially and obviously an order of magnitude easier to identify and less harmful than inadvertently creating erroneous duplicate rows by JOINing tables that have unexpected duplicate rows in one or more of the source tables."

I have shown that allowing duplicates into the DBMS encourages the possibility of generating subtle errors in query results, and I have shown (indirectly) that even naively attempting to put DISTINCT in all SELECT queries does not necessarily produce correct results. I have also shown that the queries involved in promulgating such errors appear, superficially, to be correct. I maintain, therefore, that allowing duplicate rows invites errors that are subtle and difficult to detect and correct. However, mis-using a generated key is invariably straightforward to detect and correct. Furthermore, it is no different to misusing any column as a key when it shouldn't be used as a key, which makes it orthogonal to the issue of allowing duplicate rows or not.

I don't understand how you are ascertaining "trivial" and "straightforward to detect and correct". Further, it may be the case that subtle errors also produce subtle problems while blatant errors cause huge problems. A nuke going off in your city is certainly an "easy-to-detect" problem, but the cleanup can be a bear (or turn you into a bear). You seem to be arguing that a nuke is less problematic than mercury leaking into the water supply because the nuke is quicker to be noticed. And, the magnitude of link problems may vary widely per circumstance. There may not even be a way to fix bad links if not enough info was kept to reconstruct them.

Re: "it is no different to misusing any [other] column as a key" - It's the probability of making that mistake that's the sets it apart. For example, if you run a GROUP BY with an occurrence count, it's unlikely that any other column will closely resemble a key, and even if it did, most of the time the column name will make it clear it's a domain field. Thus, the chance of a given column numerically resembling a primary key (pattern) AND having a key-like column name at the same time is relatively rare compared to an artificial key. Examining the artificial key by name and data likely will offer little or no clues that it's unstable.

   /* see if candidate column resembles a key */
   SELECT maybeKey, COUNT(*) AS occurs
   FROM targetTable
   GROUP BY maybeKey
   HAVING COUNT(*) <> 1
   ORDER BY maybeKey
   /* may need tweaks for null detection */
What do you think a generated key is going to be named? PERMANENT_PRIMARY_KEY_USE_ME_USE_ME_USE_ME? I seriously doubt even the most absent-minded, incautious, neophyte database developer will misuse a generated key more than once, but inadvertent duplicate rows from a scenario like the one I've shown above -- which is purely the result of allowing duplicate rows into the DBMS -- can and do catch out seasoned professionals. Because some naive beginner might expose a generated key on a report, and a manager might refer to it, is no reason to continue allowing -- and I shall use your metaphor here -- mercury into the database drinking water in the form of duplicate rows. I see no "nuke going off" here either way, but given a choice between conditions that permit seasoned professionals to make mistakes vs conditions that permit beginners to make mistakes, I'll accept the latter every time.

You seem to believe a developer brought in to use database data is going to know it's an artificial/temporary key. I thought my prior statements made it pretty darn clear that often one doesn't know (and often can't know in their allocated time). Being "sloppy with keys" is not the primary scenario I'm describing (though it certainly does happen). As a contractor at more than a dozen organizations, I'd say roughly only a quarter give the developer (me) a reasonably-documented schema description (DataDictionary) for any given table. And it's about half for "permanent" places I've worked. Many tables have been around for 3 or more generations of DBA's etc. and documents get lost, or the DBA/architect is "too busy" to lend knowledge. You haven't described how they would know it's not a stable key.

As far as the column naming, quite often it's terse, such as "ID", "Record_Num", "Row", "Line_Num", "Tracking_ID", "Reference_Num", etc.

We've been going around and around on this with too little new progress on each round. Let's just say our experience differs as far as how likely certain mistakes are to occur and LetTheReaderDecide. I'm just calling it as I observe and remember it. For whatever reason, I saw/remember different behavioral and/or political patterns than you have. Perhaps you have better people skills and know how to coax documentation from DBA's etc. Still, that's a situational WetWare issue. Let's call it a year. --top


I've struggled a bit, to understand the nature of this long, threaded debate. Supposedly, it concerns how to treat "bags" received from systems outside of our control: give them an artificial key, or leave them as bags (outside of specialized temporary work sets). Apparently, the idea is that we're working w/"bag" query result sets, defined as result sets where rows do not have a unique key.

And, there's some kind of disagreement going on here, there are two sides to this debate. It is, perhaps, agreed that: '..there are no advantages to "bags" that are worth mentioning. Their use, in any context, should be actively deprecated. To "live with 'bags'" in some fashion may currently be inevitable, but must be discouraged from becoming perpetual. Use of "bags" should be afforded all the disdain we give..'

However, what about this point?: 'For example, a given stored procedure or database view may not have a unique key and/or may not supply a unique key for various reasons (security, efficiency, politics, app vendor rules, bad design, etc.). Thus, "bags" may not come from just key-less tables, but also from transformation services that don't "send" a primary or unique key. In medium and large companies, one often will not have the option of having such redesigned to emit a primary or unique key.'

In which case, the important point is asserted to be this: 'And I do believe a "fake" key is potentially worse than no key.' And, ok, maybe so, but to the degree that I understand this hypothetical scenario, I see the relevance of a larger point, that most developers aren’t SQL experts, and that most of the SQL that gets used is inefficient, hard to maintain, and sometimes just plain wrong. In my experience, chances are good that an application's database layer contains problems, and I can even say something about what kind. And so, although I've tried to parse through this page, I don't think, in the end, that it 'gets interesting'. Most interesting to me, in the end, is the multiplication of so many terms here, 'unreliable key', 'surrogate key', 'artificial key', 'transient key', 'temporary or query-local key', 'generated key', 'real key' ('or a reliable key'), '"fake" key', 'unstable key', 'pseudo-key'. That's not offered as a necessarily complete list (I'm inclined to count 'artificial/temporary key' separately?). And then, we get statements like this: 'Supplying a generated key (it isn't surrogate, because if there is no key..)' Good luck, man.

There are probably only two people in the world who care enough about this issue to debate it in such depth, and they did, all over this page. Or, I should say, we did. As one of the two original participants, I have to say you've given a pretty fair summary. It's unfortunate my opponent used so many different modifiers before the word "key" (I kept it mainly confined to "generated key" or "key") but I know what he meant in every case. I wouldn't expect anyone else to know, however, unless they'd spent as much time as we have -- now measurable in years -- debating the minutiae of designing and implementing alternatives to SQL.

The debate can be summed up as follows: I argue that duplicate rows should never be allowed inside a relational DBMS, and that generating a key for otherwise keyless source data is the lesser harm. Top argues that there are cases where duplicate rows ("bags") should be tolerated, and that a generated key is the greater harm.

{I should have established a common working-set of terms for such at the beginning. My apologies. I would note that it's hard to tell which topics will keep expanding and which stop growing early. My time-machine has a Microsoft glitch. Generally most refer to a single kind of key: an artificial key that is "unstable" in that the same record may have a different key on a different day. "Artificial key" is not sufficient because auto-numbers are also artificial, but are usually "stable". --top}

{I disagree with those two position summaries. This particular page is mostly about dealing with data given to us from an "outside" source that lacks a known key. And even "within", there are compromises that can be made. Further, a distinction needs to be made between existing RDBMS engines and/or app system implementations that currently use bags in the established code base, and "new" engines or RDBMS. For example, I would expect a new query language designed to replace or supplement SQL for established systems to be able to communicate with bag-centric nodes and data sources and/or API's that may supply a "stable" key. I would consider making a classification system, but scenarios are better due to all the possible combinations in my opinion. --top}

Curious. I've been arguing all along that duplicate rows should never be allowed inside a relational DBMS, and have been debating with you under the (mis?)perception that you've been arguing that there are cases where duplicate rows should be allowed inside a "relational" DBMS. Whether data sources are inside or outside, or DBMSs are new or old, etc., is -- IMHO -- rather immaterial in light of such fundamental constraints or lack thereof. The initial context was certainly about dealing with external data sources, but it only highlighted the issue re duplicates being allowed or not. If we allow external data sources to be introduced unconditionally, then we must allow duplicate rows. If we must disallow duplicate rows, then a key may have to be generated by the interface between our DBMS and the external source. The fundamental issue, therefore, is about allowing or disallowing duplicate rows.

I guess I misunderstood what you meant by "inside". I'm not even sure that "inside" and RDBMS even needs to be stated in your two scenarios. Would your recommendation for non-RDBMS usage change, and why?

The issue in this topic is how to handle data that comes to us with no known stable key, with the assumption we cannot change that fact due to organizational or other issues. That supplied data lacks a known stable key is a "given" for the sake of this topic. I added more details to the intro at the top to clarify. --top

.....

Well, then, what I make of this is that it might clear things up a bit to hammer on the distinction between a database, and not-a-database. There are perhaps borderline cases, but we're more interested in large-volume, high-transaction databases anyways, aren't we? I feel underemployed, worrying about what happens when people haven't embraced the advantages of using a real database--who are these people? Let them learn their lesson. There didn't used to be any computers, hey use a typewriter I care. So. I have truly missed the point, eh? I see that the page ends with this: 'The issue in this topic is how to handle data that comes to us with no known stable key, with the assumption we cannot change that fact due to organizational or other issues. That supplied data lacks a known stable key is a "given" for the sake of this topic. I added more details to the intro at the top to clarify.'

Answer, how to handle this data, is put it in a database. Example, you sign up at a health club, they put your info in the database. They don't wonder what the primary key is in the DMV's database, although your info may also be there. If I understand top's position correctly, and he's saying put data in a database and give it a key, then I agree. This is kind of like rebasing--the data may be in some other database, but we treat that as pure irrelevant speculation, it doesn't matter where the data came from. Envision yourself as a DBA for a company. Your company has other companies as customers. Each customer has several contacts. Each contact has several addresses and phone numbers. Each phone number has a call log. Each customer places many orders. Each order includes products. Some products have a service log. Whatever whatever..

Whatever the proposed architecture for this, rule 1 is every table has a primary key.

I note how much we're probably all agreed on (but a few reminders might clarify things a bit). Under the rubric of integrity rules, specifically general integrity rules (so to speak), the entity integrity rule is very simple. It says that primary keys cannot contain null (missing) data. I take it that the reason for this rule should be obvious. But, to review, every table must have a primary key, which uniquely identifies rows in the table. And, again, the entity integrity rule forbids nulls in primary key columns. And, a database designed according to the relational model will be efficient, predictable, well performing, self-documenting and easy to modify. Where does this leave us? I guess that I can construe this debate here, as having to do with which guidelines it is important to follow, if you do break the rules. In that case, it is taken to be irrelevant, that if you take the time to design your databases properly, you'll be rewarded with a solid application foundation on which you can build the rest of your application.

I'm focusing narrowly on the question, how it is, that if, by definition, a relational database must contain normalized tables, and to be properly normalized, a table must contain a primary key, and on all this everybody agrees, then what sort of great primary-key debate have you gotten into--what is left to discuss? What exactly are these very strong opinions about?

I had some fun about the proliferation of 'xxx-key' terms like 'surrogate keys'. That one specifically, though, by the way, does have an accepted technical meaning (this is just for review, I assume agreement here). A surrogate key is a primary key created from a meaningless value. It's specifically not created from naturally occurring data, what, again, is used as a table's primary key, is a meaningless value--that is referred to as a surrogate key.

The real point that I want to make about all these different kinds of keys that are mostly the same kind, or whatever, is let's go back to the definition of a primary key. Of course, we all agree, that it uniquely identifies each record within a table. However, I find it relevant here to point out, that this is only half the story. What is the main purpose of a primary key? Perhaps there is also no controversy on this point (although I think a reminder may turn out to be useful). The main purpose of a primary key is to relate records to additional data stored in other tables. Several other points are entailed. This is the explanation, for example, for why the primary key must remain stable--that is, you can't change the primary-key field(s). The primary key value, I'm belaboring the point, can't be changed. And, must exist when the record is created. And, must uniquely identify each record. And can't be null. This is also why it must be compact and contain the fewest possible attributes, for that matter.
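
To put the last two paragraphs in concrete terms, a sketch (identity-column syntax varies by product, and the table names are illustrative):

 CREATE TABLE customer (
   customer_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate: a meaningless, stable, compact value
   name        VARCHAR(60) NOT NULL
 );
 CREATE TABLE customer_order (
   order_id    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   customer_id INTEGER NOT NULL REFERENCES customer,              -- the key's main purpose: relating records across tables
   order_date  DATE    NOT NULL
 );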

Now, I've used the word 'must', there. Must remain stable, perhaps? mostofthetime? No, must. Okay, I see, then. MUST, is that supposed to mean that I can't break the rule (watch me)? No, must doesn't mean that you can't break the rule. It just means that your application won't adhere to the database model if you choose to break these established rules. Okay, then, if I am saying that relational database theory requires a particular condition, then should I expect the relational database system to enforce it as I'm developing my application? No. I expect to enforce it myself.

Now, some of the debate above, concerns what if an uneducated user updates a primary-key value? Well, you can, but should not, because a primary-key value shouldn't be subject to data entry errors. Because, changing the value violates a rule. But, this doesn't seem like such a big problem? But remember, you're not supposed to change the primary-key value.

I can anticipate, that somebody will point out that I'm talking about primary keys, not 'meta-alt-ctrl-shift-escape-keys'. We all agree about primary keys, that's trivial--what about the real-world vicissitudes of life? My reply to this, is that when things slip out of control, of course they fall apart. Beyond recommending that you select a value that you can control and maintain, I can only say, what you cannot control and maintain, will fall apart. There is no suspense about the outcome, really?

A key, a key, as in, if you're going to use the word key, then without even adding any modifiers here, what are we referring to? A key is immune to changes in business. A key depends on only one field. Unique (will always be unique). Stable. Compact. What is an 'unstable key'? A cry for help. A grave misunderstanding. A joke, perhaps? ru kidding me?

Upon review, I think I'm taking the single-most important issue to be good design. I do indeed suspect, that I'll be accused of missing the point of this debate. However, if the foundation is weak, so is the building. I'm more interested in how to avoid future problems and subsequent (and perhaps convoluted, like this debate) repairs. Your whole debate is about how to salvage the situation, when a simple design choice, one of the easiest ways to provide a strong, stable, yet flexible foundation, has been rejected.

I'm tempted to get into an even more fundamental question, what is a database? If we're going to build one that can be used effectively, then what is a simple definition? Do we all know/agree?

I'll try: a database is a collection of information that is organized so that it can easily be accessed, managed, and updated.

And, while we're reflecting on 'what is a database', the relational data model, is one kind--one that has a whole series of rules governing keys. Above, I defined a 'surrogate key'. Now, I would like to add, and here again I'm expecting agreement, that a 'technical key' and an 'artificial key' are two other terms for the same thing as a 'surrogate key'. This, once again, is a key for which the possible values have no obvious meaning (to the user or the data).

I think I will add, that just as a table has a name--you know, like 'part', 'customer', 'invoice', also, a database has a name. And, if it is I hope clear where I am going with this, a table is in a database. And, a key is in a table. I just don't know what is the point, if choosing a primary key is one of the most important steps in good database design, if in fact, good database design starts with the right primary key, then who cares about how to handle situations where you have not chosen an appropriate primary key? Be a loser, and loser that you are, you will be surprised, when you lose. It's not that simple? A key is a table column. That's table as in 'database table'. The key is what ensures row-level accessibility. You're trying to get by w/out one? That's what they did in Ancient Rome, also they used XVIII to write 18. You're welcome to do things the way penguins do them when they gather about the south pole, and sing songs while they work. Seen Zoolander? The files are in the computer. The FILES are IN the COMPUTER. Oh...the files are IIN the computer? Duh. Lesson two, never null. No primary key value can be null, nor can you do anything to render the primary key null. This is an inviolate rule of the relational model as supported by ANSI, of relational database management system (RDBMS) design, and of what database engine are you using? It flags as NOT NULL all columns that make up the pkey, when you designate a primary key, is what I'm guessing--what database engine are you using?

Let me put it this way, what does 'null' mean? Null is an unknown condition. And, what is unknown, if the primary key is null? How to do lookups and comparisons is what is unknown.

If you somehow have managed to fail to ensure that the primary key is never null, then I guess there is no point in adding that it should be brief, should be a simple data type, should be a nonidentifying value, because you lose. Look at the scorecard--what happened? You're still stuck on lesson 2. NEVER NULL!

BTW, I see that I've drifted into brusquely confronting some interlocutor--I really am not addressing anyone in particular, just trying to get the issue in view, just thinking out loud..peace.

{You seem to focus on the purpose of the key above whereas I'm more focused on what's likely to happen under each scenario and what the costs/benefits of each are as developers and users come, use a table, and perhaps then leave. My mental model is similar to gaming theory trees in that I try to find the branch with the least average cost: a "probability tree". This not only includes the probability of the branch, but also the weight of the potential "damage done" (score or anti-score in gaming theory). Think of it as running a bunch of simulations of the reaction(s) to the table in various office instances. Adding a column that may be mistaken for a stable key is just too big of a risk unless we know a working temporary key is going to be a common need and individuals cannot realistically make and fiddle with their own copy and cannot use the RDBMS's internal row key. In other words, it's situational. For example, suppose the intended use of the data is for aggregate statistics. We may join on aggregated categories, such as zip-code, product code, etc, and never need to join on (cross-reference) an individual row. We could do our job just fine without an artificial key. "Tables should have X because they just should out of the laws of table-ness" is not good enough in this case because adding it creates extra and unnecessary risk.} --top
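
A rough sketch of that aggregate-statistics case, with hypothetical extract tables; both sides are rolled up to zip-code before the join, so no row-level key is ever needed or referenced:

 SELECT s.zip_code, s.sale_total, c.complaint_count
 FROM  (SELECT zip_code, SUM(amount) AS sale_total
        FROM   sales_extract GROUP BY zip_code) AS s
 JOIN  (SELECT zip_code, COUNT(*) AS complaint_count
        FROM   complaint_extract GROUP BY zip_code) AS c
   ON  c.zip_code = s.zip_code;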

Now, I had typed a lot, I'd be very impressed if you had actually gone through it closely, it's a lot to ask. But if I get through with only one point--I'm thinking that the 'xxx-key' terms reproducing, like a gaming theory tree, if you will, has got to stop, for reasons which satisfy me. What I mean is, note, that I think you're using the term 'artificial key', in the Pickwickian sense--I'll quote myself: 'Now, I would like to add, and here again I'm expecting agreement, that a 'technical key' and an 'artificial key' are two other terms for the same thing as a 'surrogate key'. This, once again, is a key for which the possible values have no obvious meaning (to the user or the data).'

Can we use the term 'artificial key' correctly? Always, the 'artificial key' is the primary key. The surrogate key. The, if you like variety, the technical key. I'm being pedantic, I'm correcting you--don't get me wrong, I don't think I'm so smart, you'll have your turn..

Also, I actually had a point about what's likely to happen under each scenario: 'We all agree about primary keys, that's trivial--what about the real-world vicissitudes of life? My reply to this, is that when things slip out of control, of course they fall apart. Beyond recommending that you select a value that you can control and maintain, I can only say, what you cannot control and maintain, will fall apart. There is no suspense about the outcome, really?' And you have in your reply something about '(A)dding a column that may be mistaken for a stable key..' You describe this as a risk, but I see no risk here, I guarantee what will happen if you break it--you bought it. Alternatively, add a column that ends in '_ID', perhaps. And, you know, make it the first column. And, make it the primary key.

And, uh, also, this is 'key':

     -->make it the primary key<-- 

Now, I take it that there is no 'mistaking' this for a stable key--it IS a stable key, it's the primary key.

And, you offer a scenario as an example: 'we know a working temporary key is going to be a common need and individuals cannot realistically make and fiddle with their own copy and cannot use the RDBMS's internal row key. In other words, it's situational. For example, suppose the intended use of the data is for aggregate statistics. We may join on aggregated categories, such as zip-code, product code, etc, and never need to join on (cross-reference) an individual row. We could do our job just fine without an artificial key.'

Okay, see how I fail to comprehend the difference between a key, and a key. An artificial key is a primary key. What is a working temporary key? Is it a key, the primary key? Maybe let's review this 'common need', for what, exactly? I think, that if we're talking about users, who don't need to know the primary key--then I have no problem with this concept. I'll see you and raise you, however. I define these users who don't need to know the primary key, as users who don't need to know the primary key exists. They don't see, they don't use, they don't know about, the primary key. They use an application, and the application, ah, getme? the Application maintains the relationships behind the scenes.

What, I think, is the difference here, is I don't want to allow as thinkable, logically, any concept of keys that are not stable. Why wouldn't they be stable, because of users? Users don't know that keys exist, they don't see them, don't use them. Keys are not subject to input errors, are never null, they provide the pointer to the data. If there is no pointer to the data, this fact will mean, that there is no data being pointed to--or as I put it before, 'Null is an unknown condition. And, what is unknown, if the primary key is null? How to do lookups and comparisons is what is unknown.'

Now, specifically to your example, you're wondering about what if the job we are doing is we are joining on aggregated categories, and using the data for aggregate statistics. Okay, now, my concept of data is relatively simple compared to the heathens who still worship at the altar of the flat file. My schemas are BCNF. Or better, pal, or better. I've accepted Codd’s rules and I lead a purposeful life, so how is it, supposedly, that after just joins, and a view or two, my sql is supposedly corrupted? All logic begins with assumptions, and I'll make some here. What, are you an Econometrician? A Psychometrician? I assume that you have no such requirements.

If you do, well, research in the math stat world continues to find new algorithms and implementations, but these disciplines are designed around the cursed flat file. I guess that you are dumping some rows from a database, importing them into a session with some statistical package, generating some output, massaging this for use by a client program, putting the output in some place, rinsing, repeating...Stat algorithms tend strongly to find their expression in linear algebra, which is to say matrices. Knock yourself out. Explore the world of inferential statistics. There is a limit to my competence, and there's more to data than the RM (Relational Model).

I sympathise with your attempt to engage in what is a mere piece of a wider, ongoing discussion. From what you've written, I think you may be missing some of the context. I'll try to give it to you, in a nutshell:

My correspondent and I (that being me, DaveVoorhis, and the individual known on this wiki as "Top") are both involved in developing alternatives to SQL. Mine is the RelProject. Top's is SmeQl.

My RelProject is a faithful implementation of the database language called TutorialDee, which adheres strictly to the RelationalModel. No relation can ever have duplicate rows. That is required by the very definition of a "relation".

Top's SmeQl is inspired by the relational model, but like SQL it allows duplicate rows in tables or "bags".

Like many DBMSs, the RelProject maintains its own data storage but it can also link to time-varying external data sources. Sometimes, those external data sources -- for example, logs produced by hardware devices -- will have duplicate rows. Currently, the RelProject requires that the database developer either identify a key in the source data (if there is one), or pick an option that causes a key to be generated (if there are or might be duplicate rows.) However, a generated key will only last the lifetime of a given RelProject query. It is not a permanent key, nor could it be, because if there's no way to uniquely identify rows in the source data there's certainly no way to permanently associate particular rows with a key. I called such a key a "generated key" because it's generated for the data rather than being found in the data. Top has called it an "artificial key" or a "fake key", which is fine -- I know what he means.

Top argues that such a "generated key" is more harmful than allowing duplicate rows in the DBMS.

I argue that allowing duplicate rows in the DBMS is more harmful than a "generated key".

That, ultimately, is what this page is about.

As for your question, "I've accepted Codd’s rules and I lead a purposeful life, so how is it, supposedly, that after just joins and a view or two, my SQL is corrupted?" Simple answer: if you allow duplicate rows in your source data, then it isn't your SQL that will be corrupted, but your results. You will get (possibly unexpected) incorrect results. That is a substantial point in my argument that duplicate rows should never be allowed in a DBMS. I regard that to be a critical issue, a veritable showstopper that trumps any and all arguments to allow duplicate rows. Top disagrees.


Office Politics

There is also a "political component" to this kind of scenario. If we add a "working key" to the dataset, then whatever negative event happens with that key will fairly likely reflect back on us, the key creator. We'd co-own the problem. Whether that's "fair" may be irrelevant. Those judging us likely won't give us a full hearing and/or understand the intricacies of the issue. Typical boss: "It's not your job to augment the data; just pass it on as is." I've worked for a lot of clueless and "unavailable" managers over the years. (I know what some of you are thinking: "That's the only kind of manager who would hire an asshole like you, Top.")

If, however, a data user fouls up their own project by adding their own key or mishandles what you call "duplicates" due to lack of a key and lack of care or skill, then most of the blame will fall on them. It's their mistake, not yours. (An exception may be if they ask for a key and you reject them, but that's not the scenario, as stated.)

Thus, for the sake of argument, even if the net problem-risk magnitude caused by having the added key versus not having it were the same (which I don't agree to), the "blame flame" will be less focused on us, the data provider, if we don't add a key. If I add extra info, or what can be perceived as extra info, then I will partly inherit any fall-out from it in a political sense. This may be a somewhat selfish way to view it, but it's human nature to CYA. You may argue it's more moral to be altruistic and accept the risk and blame of such, but that won't scale to every individual. Most people go for CYA and will reduce their risk at the expense of others in such situations. It's not being dishonest; it's just not volunteering to go the extra mile to accept the risk of more political blame. -t

Having spent a goodly portion of my professional career being responsible for enterprise data and reporting, I have to say I'd far rather accept the blame and repercussions of introducing a key, than accept the blame and repercussions -- even indirectly -- for inadvertently producing an erroneous report due to unrecognised duplicate rows.

What I categorically and absolutely will not do is allow duplicate rows/tuples in relations. The RelProject is not, and never will be, BagAtational. However, should it be necessary to import, link to, or generate external bags, there are mechanisms to do so, as I've mentioned elsewhere. Thus, correctness is the default position and is explicitly promoted. However, when absolutely necessary, there are ways to accommodate duplicate rows that do not require some degraded corruption of the RelationalModel. If you feel they are not sufficiently convenient for you -- and that, ultimately, is what this seems to be about: convenience -- you are free to not use the RelProject and can choose (or develop) something else.

That sounds like a PersonalChoiceElevatedToMoralImperative. Again, I'm not condoning CYA-centric behavior, only saying that in practice most will probably follow it in my experience.

Rel is a true relational database management system, which means it adheres to the RelationalModel. There is no more PersonalChoiceElevatedToMoralImperative here than if I said I'd made a floating conveyance and you complained it was PersonalChoiceElevatedToMoralImperative because I didn't include wheels. It's a boat, Top, a boat! There are no wheels. Likewise, it's a true relational database management system. There are no bags!

Using Rel is a choice. (And Rel can be changed.)

Huh?


Thanks for the effort to clarify the nature of the debate and put things in a larger context. So, you're both involved in developing alternatives to SQL. I have some homework to do at this point--I'm considering the BagNeedScenarios, for starters. However, I'll barrel ahead, as I think I've taken sides already. I continue to cling to the principle that relational databases by design are supposed to eliminate redundancy (if it's okay to use 'redundancy' as a jargon term there--the whole idea of redundancy concerns what you should do in order to maintain data integrity--and what should you do? Store information in one place, right?). I want to label this principle as not in dispute? I take it that DaveVoorhis agrees. We all agree that there are many production databases out there with redundant data, that duplicate records are alive and well, and it seems like we (including Top) sort of agree that this is unfortunate? An important point.

This judgement affects what we do, when we find duplicate data. We remove and delete that duplicate data. Unless, I suppose, we can't come up with a method for accomplishing that.

I figure that if there's a DBA in the midst, then the DBA will be hard at work, constantly, finding duplicate records in tables, deleting/eliminating duplicate rows, in fact employing many options and ways of doing that. Let me slow down to consider this activity, which, I take it, we must assume is always going on in a maintained database. The DBA comes along, looks at, say, a 'customers_bak' table, and cleans duplicates. Or, simple example, there is a table 'duplicaterecordtable' and it contains this:

 duplicaterecordtable
 Col1 Col2
   1    1
   1    1
   1    1

And the DBA changes this to this:

 duplicaterecordtable
 Col1 Col2
   1    1

And then, the DBA moves on to do the same thing to other tables--it's what he does, it's all he does. Another example, a users table, containing this:

 users
 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley

And I leave as an exercise, how that table looks when the DBA is finished w/fixing it (I take this up, show one way to fix it below).

Now, DaveVoorhis explains that in his RelProject, well, here is a quote:

Sometimes, those external data sources -- for example, logs produced by hardware devices -- will have duplicate rows. Currently, the RelProject requires that the database developer either identify a key in the source data (if there is one), or pick an option that causes a key to be generated (if there are or might be duplicate rows.) However, a generated key will only last the lifetime of a given RelProject query. It is not a permanent key, nor could it be, because if there's no way to uniquely identify rows in the source data there's certainly no way to permanently associate particular rows with a key. I called such a key a "generated key" because it's generated for the data rather than being found in the data. Top has called it an "artificial key" or a "fake key", which is fine -- I know what he means.

Top argues that such a "generated key" is more harmful than allowing duplicate rows in the DBMS.

I argue that allowing duplicate rows in the DBMS is more harmful than a "generated key".

Now, to this I say that adding an identity column (RelProject requires this) would just be a step, part of a method for deleting duplicates. We're adding an identity column to the table in order to distinguish all rows from each other, and then it is easy to delete duplicates by using the Id column.

Take my users table:

 users
 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley

And, run this query: ALTER TABLE users ADD Id int IDENTITY(1,1)

And then you have this:

 users
 FirstName LastName Id
    Elvis  Presley  1
    Elvis  Presley  2
    Elvis  Presley  3
    Elvis  Presley  4

And then run this query: DELETE FROM users WHERE Id IN (2,3,4)

I wouldn't claim that query to be ideal; there must be a better way to do it automatically, but that might depend on which actual implementation we're using, and this is just hypothetical. So anyway, then run this query:

ALTER TABLE users DROP COLUMN Id

And I'm still inclined to leave it as an exercise, how that table looks when the DBA is finished w/fixing it.
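
As an aside, here is one way to do it "automatically", as wished for above -- a sketch only, assuming the Id column added by the ALTER TABLE step and SQL Server-style syntax to match the IDENTITY example; it keeps the lowest Id in each (FirstName, LastName) group and deletes the rest:

 -- Sketch: retain one row per (FirstName, LastName) group (the lowest Id)
 -- and delete the other rows. Assumes the Id column added above.
 DELETE FROM users
 WHERE Id NOT IN (
     SELECT MIN(Id)
     FROM users
     GROUP BY FirstName, LastName
 );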

But, in this great debate, of which here is the executive summary:

Top argues that such a "generated key" is more harmful than allowing duplicate rows in the DBMS.

I argue that allowing duplicate rows in the DBMS is more harmful than a "generated key".

I'm afraid I can't accept these as opposite positions. You ('I argue') are actually allowing duplicate rows in the DBMS--and if you're going to do that, then I'm inclined to the view that it's more harmful to delude yourself about the fact--just say that you are allowing duplicate rows. Indexing duplicate rows does not make them not duplicates, does not, as you (seem to?) claim, adhere strictly to the RelationalModel. I'm quoting you here: 'No relation can ever have duplicate rows.'

Is there some ambiguity about what this means, what counts as a duplicate row? Not, apparently, when you say this: '..a generated key..is not a permanent key, nor could it be, because if there's no way to uniquely identify rows in the source data there's certainly no way to permanently associate particular rows with a key.'

But then, what do I make of this?: '..requiring the `generated key` prevents us from accidentally multiplying garbage when performing joins or transitive closures.'

Okay, how exactly does requiring the 'generated key' prevent us from accidentally multiplying garbage, etc.? How is a generated key, to use your jargon, a panacea? Am I, for example, mistaken about what will happen if you join such a table against itself? I'll detect duplicate rows.

[No, you won't. Literally, the rows aren't duplicates if they have different keys. Consider the source data:]

 users
    First  Last
    Elvis  Presley
    Elvis  Presley

[Allowing duplicates, if you join users with users on Last, you'll get Elvis squared:]

 users
  l.First  Last    r.First
    Elvis  Presley Elvis
    Elvis  Presley Elvis
    Elvis  Presley Elvis
    Elvis  Presley Elvis

[But if you introduce a volatile ID column:]
    First  Last     VID
    Elvis  Presley  1
    Elvis  Presley  2

[And do the same:]
    l.First l.VID Last     r.VID r.First
    Elvis   1     Presley  1     Elvis
    Elvis   1     Presley  2     Elvis
    Elvis   2     Presley  1     Elvis
    Elvis   2     Presley  2     Elvis

[Look! No duplicates. Further, you now have sufficient information to make useful judgements about how this relates to the original data source - e.g., in this case we might filter out all cases where the left VID is greater than or equal to the right VID, which gives us a nice collision graph on last names:]

  SELECT * from (the last table) where l.VID < r.VID
    l.First l.VID Last     r.VID r.First
    Elvis   1     Presley  2     Elvis

[This is quite useful for aggregation tasks and various advanced computations. Of course, if we judged that the duplicates aren't meaningful, we could simply eliminate them by making the whole row a key (which is always true in RelationalModel). The reason for adding a volatile ID is that we do have some reason to believe the duplicates are meaningful at their source. E.g. TopMind has advocated using rows for logging or items on a receipt.] {I don't think I have advocated this. See above. -t}
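
To make the bracketed join example concrete, here is a rough SQL rendering -- a sketch only, not from the original discussion, assuming SQL Server-style syntax (ROW_NUMBER, temp tables) and the First/Last columns shown above:

 -- Materialise a volatile ID per row (so both sides of the self-join see
 -- the same numbering), then join on Last and keep each collision pair
 -- once, as described above.
 SELECT First, Last,
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VID
 INTO #numbered
 FROM users;

 SELECT l.First AS l_First, l.VID AS l_VID,
        l.Last  AS Last,
        r.VID   AS r_VID, r.First AS r_First
 FROM #numbered AS l
 JOIN #numbered AS r ON l.Last = r.Last
 WHERE l.VID < r.VID;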

Perhaps the judgement is made that no need has arisen to eliminate these duplicates, but that puts you in the position of saying something like this:

I argue that allowing duplicate rows in the DBMS is more harmful than allowing duplicate rows in the DBMS.

Or perhaps:

I argue that allowing duplicate rows in the DBMS is perfectly okay, as long as you don't realize that they're still duplicates, and call them something else.

I've taken a look at http://dbappbuilder.sourceforge.net/Rel.php. What I'm looking for is my definition of a key as the element that forges a relationship between two tables. Your 'generated key' does not do this, a point which you actually allow. Must we call this a key, therefore? I categorize this as a mistake--I'm being pedantic?

Is this a controversial point, that just because your 'generated key' has a unique value for each record, that does not mean that there are no duplicate records..? Right? Just because my table has a field that you figure acts as a unique key, of sorts--it's a particular kind of key, not the usual kind, but it is unique for each record--does that mean that my SQL isn't corrupt? What happens when you go to create relationships? What if you have a table listing customers and a table listing orders? Of course, the orders table does not have any reusable customer information in it. And then, you go and create a report that tracks which order goes with which customer. What happens, in this case? Repeating the word 'key' 'key' 'key' over and over again like a magic incantation, 'generated key' being a rather Orwellian concept, like true freedom: 'the generated key is unique'! It's essential that the key be unique! The GenErAtEd key is unique! I'm good! Satisfied the gods of etc., as Top would say. I don't think that's gonna make it rain. When your generated keys are unique, are you also using foreign keys? They'll be unique too. So that's good. Everything is unique. Except for the records. Oops. The keys are meaningless numbers, right? It's the records that are supposed to be unique.

In conclusion, to end where I began, I continue to cling to the principle that relational databases by design are supposed to eliminate redundancy (if it's okay to use 'redundancy' as a jargon term there--the whole idea of redundancy concerns what you should do in order to maintain data integrity--and what should you do? Store information in one place, right?).

Have I got anything, here?

A little, perhaps.

What is specifically excluded by the RelationalModel is the following:

 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
Inside a relational database, such as the RelProject, it simply can't exist. It must, by definition, either be:

 FirstName LastName
    Elvis  Presley
Or, it can be:

 FirstName LastName Id
    Elvis  Presley  1
    Elvis  Presley  2
    Elvis  Presley  3
    Elvis  Presley  4
If we project away the Id, i.e., we remove it, the RelationalModel specifies that the result must be:

 FirstName LastName
    Elvis  Presley
There is no ambiguity about duplicate rows. A duplicate row is precisely that. In an implementation of the RelationalModel, it does not exist. Ever.

Enforcing the rule that this...

 FirstName LastName
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
    Elvis  Presley
...cannot exist results in certain guarantees about the correctness of query results. In database systems that permit duplicate rows, those guarantees cannot be made. In particular, I've given an example of SQL queries, above, where duplicate rows may inadvertently result in erroneous query results. That cannot occur (at least, not without deliberately engineering it to be the case) in a relational database.
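
For illustration (this is not the "subtle and insidious" example referenced elsewhere on this page, and the orders table is hypothetical), here is how duplicate rows can silently inflate an aggregate under SQL's bag semantics:

 -- Hypothetical: users contains the four duplicated 'Elvis Presley' rows
 -- above; orders contains a single 10.00 order keyed (loosely) by LastName.
 SELECT SUM(o.Amount) AS total
 FROM orders o
 JOIN users  u ON u.LastName = o.LastName;
 -- The single order matches each of the four duplicate rows, so the query
 -- reports 40.00 instead of 10.00.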

Note that a generated key, such as in the following:

 FirstName LastName GeneratedID
    Elvis  Presley  1
    Elvis  Presley  2
    Elvis  Presley  3
    Elvis  Presley  4
returns stable GeneratedID values for the lifetime of a given query in the RelProject, such that if the above is retrieved as a RelVar, e.g.:

 VAR Customers EXTERNAL RELATION /* external retrieval spec goes here */ GENERATED KEY {GeneratedID};
Then the following will be true:

 (Customers JOIN Customers) = Customers
However, if we do the following:

 VAR CustomersTemp REAL RELATION {FirstName CHAR, LastName CHAR, GeneratedID INTEGER} KEY {GeneratedID};
 INSERT CustomersTemp Customers;
The result of the following will be true:

 IS_EMPTY(Customers JOIN CustomersTemp)


Okay, then what if I add something like this--don't be distracted by the familiar and various mechanisms for enforcing entity integrity: indexes, UNIQUE constraints, PRIMARY KEY constraints, triggers. Suppose that the PK values are duplicate, yet the row as a whole is not. This is obviously bad. Example: a table with Social Security Number as the primary key, and two different people (or rows) with the same number, each having unique attributes. Obviously bad. Whatever caused this, it was a malfunction. But, at least, in this case, valid unique data was placed in the row--and that's good.

It is valid data. That it does not reflect the real world, or maybe that it does, is conceptually immaterial. It's a problem for a DBA to sort out.

Next, consider what you would do if you find a few sets of duplicate PK values, and you verify that the entire row is a duplicate. In this case, what does RelProject do? Add a column? To give the row uniqueness? That doesn't give the row uniqueness; it just masks the problem..?

If the database designer has not specified a generated key, it throws an exception upon encountering an entire duplicate row.

I'm saying that there are four possibilities:
 1. duplicate keys, duplicate rows
 2. duplicate keys, unique rows
 3. unique keys, duplicate rows
 4. unique keys, unique rows

1., 2., and 3. are malfunctions..? But you want to allow 3. in preference to allowing 1. Must you allow either? Or, if you do, is this still TutorialDee?

1 cannot exist. 3 should perhaps be re-stated as "only the key is unique," but that's perfectly acceptable, reasonable, and often occurs.

I mean, relations never contain duplicate tuples.

I take it as a consequence of definition?

I've introduced a bit of jargon--what are tuples? I'm trying to remember my TutorialDee, where you have tuple types, which are like row types. Loose analogy.

I haven't given precise definitions for the fundamental concepts tuple and relation. There is a lot that I haven't discussed. However, if we agree that the relational model prohibits duplicates--if it's agreed by all that duplicate PKs are a violation of entity integrity, and should be disallowed in a relational system--there is another point, which I own as controversial? Namely, that as tables should never contain non-unique primary keys, so also, and this isn't redundant to say, tables should never contain duplicate rows.

Yes?

I've not followed anything after "I mean, relations never contain duplicate tuples." The words make sense, but the meaning eludes me. Perhaps this will help: For the sake of the discussions here, "tuple" = "row". However, table != RelVar. A table, as in SQL, can contain duplicate rows. A relation, which is the value associated with a RelVar, can never contain duplicate rows aka tuples.


Well, you are finding duplicates, and you are deciding what to do when they are found. I'd suggest deleting them. You've, it seems to me, fixated on the notion that you have an alternative way to fix the issue--you don't have to remove the duplicates. Instead, you can insert more corrupt data, because two wrongs make a right. You may not recognize your own reasoning in this..? ;-)

You've stated baldly that '"only the key is unique," but that's perfectly acceptable, reasonable, and often occurs.' Okay, I think we understand each other perfectly well, and disagree. I'm optimistic about convincing you on this point. We're getting somewhere. I don't take 'only the key is unique' to be perfectly acceptable, reasonable, and something that often occurs. Probably, I need to clarify that I'm reasoning about surrogate keys: only the surrogate key is unique? That's not acceptable; that's a duplicate row. Agreed?

Ah, you didn't state you were referring only to surrogate keys. However, it may happen (though more rarely) that a set of surrogate keys are unique but all other columns are the same. Unconditionally removing what appear to be redundant entries potentially throws away information. It might be that the duplicates actually have meaning. For example, imagine the following access log entries from Rel's link to a (hypothetical but realistic) external system, such as a router or other hardware gadget:

 Logins:
   UserID   IP           GeneratedID
   Dave     192.168.2.1            1
   Dave     192.168.2.1            2
   Dave     192.168.2.1            3
It might mean I really did log in three times, and I might want to know that I logged in not once, but three times. On the other hand, if I know the entries are redundant and I want them removed, I can simply project away the GeneratedID via the following:
 Logins {UserID, IP}
Which results in:
 UserID   IP
 Dave     192.168.2.1
Either way, I can choose how I want to deal with the data, and I won't get -- as I can with SQL -- an unpleasant surprise in the form of unexpected duplicate data causing incorrect query results.
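
For comparison, a rough SQL analogue of those two choices -- a sketch assuming a Logins table with the columns shown above:

 -- Treat the entries as redundant: collapse them, like projecting away
 -- the GeneratedID.
 SELECT DISTINCT UserID, IP FROM Logins;

 -- Treat the entries as meaningful: keep the fact that there were three.
 SELECT UserID, IP, COUNT(*) AS login_count
 FROM Logins
 GROUP BY UserID, IP;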

No, you are trading one problem for another.

What problem am I trading for another?

Size and processing. A generated key takes space and time. It may not matter in a smaller app or system, but it could in a bigger one. Having the design option of skipping auto-keys to reduce machine resource usage is a good thing. Let's say such a table is consolidated each week into another table with week, userID, IP, and count. But otherwise we don't want machine resources spent on something that does not improve the actual info in the database. An auto-key gives us nothing to serve our needs (granularity by week).
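
A sketch of that weekly consolidation, with hypothetical table and column names (raw_logins, weekly_logins, week_start) that are not from the discussion:

 -- Consolidate per week without needing an auto-key on the raw rows.
 INSERT INTO weekly_logins (week_start, UserID, IP, login_count)
 SELECT week_start, UserID, IP, COUNT(*)
 FROM raw_logins
 GROUP BY week_start, UserID, IP;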

That may have been an issue in the early days of data processing -- late 1950s to early 1960s -- but incrementing an integer is negligible processing now. The space requirements are also negligible -- query results don't normally store every row; typically one row (the current row) is in memory at a time unless the RecordSet can be navigated in reverse. Even then, whilst maintaining an integer certainly takes space, any RecordSet implementation that runs the risk of running out of memory is going to support using disk to obtain extra space, so the space consumed by a generated key is no more significant than the space required by any other column. Such negligible overhead is worth it, in order to categorically avoid -- as I noted above -- an unpleasant surprise in the form of unexpected duplicate data causing incorrect query results.

It can still be a big cost for an app approaching Facebook-sized.

Apps approaching "Facebook-sized" are a vanishingly small percentage of database-driven applications. FaceBook and its ilk are already well outside the bounds of conventional RelationalModel-inspired (typically SQL-based) database technology.

         Contributor group 1:
         "Problem type: package leakage"
         Contributor group 2: 
         "Package condition: leaking"
         Contributor group 3: (free-form)
         "The box was found to be leaking."
And what's an example of an "unexpected duplicate"?

See above starting with "Much more subtle and insidious are queries like ..."

But how would such a thing get into a simple IP tracker? Walk us through a scenario.

The simple IP tracker is merely an illustrative example. The same principle applies to more complex scenarios. It is not scenario-specific.

If you "don't use it wrong" you won't get the wrong answer. It's a trade-off between higher machine efficiency and human error risk. While I might personally optimize it against human error, other managers or colleagues may choose different. Maybe it's considered low-importance or "bonus" info such that management doesn't want to sacrifice speed in order to make it "ideal". If they are paying for it, it's their trade-off choice.

Precisely the point of my "much more subtle and insidious" example is that "don't use it wrong" isn't obvious. Clearly, there are times when achieving the utmost of machine efficiency is of paramount importance, but for the majority of tasks for which SQL is used, the overhead of maintaining duplicate-free data is negligible. The overhead of dealing with "subtle and insidious" problems is not.

But again, that tradeoff decision is up to the owner of the machines. Lecture them all you want, but it's their stuff and they can do whatever the hell they want to with it. If your system removes choices for them, they may toss your recommendation for a system that gives them choices THEY WANT. That's capitalism for good or bad. Time-Machine yourself to the Soviets if you want to be the Minister of Uniqueness Compliance.

The "owner of the machines" might insist on all manner of things -- some reasonable and some foolish -- so such requests are irrelevant here. The "owner of the machines" might also insist on using 1960s era FORTRAN instead of modern SQL because that's what he learned in university, or that all data must only be stored in text files so that it's always human-readable. (These are not made-up -- I've had clients request them.) Why give some boss's demand to store duplicate rows more weight than some boss's demand to use 1960s era FORTRAN or text files?

That seems to be an argument-by-extreme-case. I don't see our situation as being nearly as extreme. And an owner CAN use FORTRAN. Nothing is stopping him/her (although he's likely to go out of business eventually). I've heard of anti-RDBMS C shops that did everything in a file-centric way. Such didn't outright kill the companies because they probably did OTHER things right to make up for the stupid parts, which is typical: you only have to do FEWER stupid things than your competition, not be perfect in all categories. A basketball team may suck at defense but have killer offense, for example, such that they survive in the league, perhaps even better than average, like the Steve Nash era Suns. -t

Yes, exactly -- your argument is an argument-by-extreme-case. Why would someone argue in favour of storing duplicate rows in a database? That's every bit as extreme as demanding FORTRAN instead of SQL, or text files instead of an RDBMS.

Big free web-service companies like FaceBook and Google have shown they accept the fact that they may lose or corrupt the data of, say, one out of every 5,000 users to keep their service fast and cheap. It's not economical for them to make that 1/5000 happy by making their service slow or expensive or ad-heavy for the other 4,999. That trade-off is a business decision. Our job as techies is only to describe the trade-offs accurately, not make the final choice. If we were good at that, we wouldn't be middle-class cubicle techies.

So you're saying it's acceptable for a typical company to accept the fact that its DBMS may inadvertently allow the developer to generate incorrect results, that might otherwise have been trivially preventable, just to maintain the possibility of storing duplicate rows for the one out of 5,000 users that think they need them?

I'm not convinced it's always "trivial". And the owner gets to declare what's "acceptable". It's their machine and their data. I'm not God nor do I pretend to be. How about YOU go spank them for me.

Of course it's trivial. Problems caused by duplicate rows are trivially eliminated in a system that doesn't allow duplicate rows.

Again, I'm not sure what owners' demands have to do with anything here. This is a technical subject, not a political one. If you wish to discuss the politics of ProjectManagement, this is not the page for it.

It's both. Machines and logic have no goals; they are just bumps on logs. Humans supply goals. Technology often imposes tradeoffs that force the human goal maker to balance conflicting goals (such as speed, money, and short prep time).


Ok, I take it, at this point, that the agenda is to understand duplicate data. The first thing to understand, is that you might not have the necessary permissions to edit or delete records in the database, it might be 'read-only', it might be that a large number of users connect to it. At the very least, if you're going to delete records, you're going to want to be able to back up, first--so as to be able to recover deleted records. So, on that basis, I'm inclined to allow for the idea of 'generated keys'. And, generated keys do not prevent the process of finding records that contain whole or partial matching values. We've gone back and forth a bit, over my broader definition of what counts as duplicate records, that one wants to delete, that need removing..You seem resigned, that one cannot follow a set of specific rules or procedures to find and delete duplicate records accurately, but, nevertheless, you tout that using your system ('generated keys'), after you confirm that a table contains duplicate records, you can delete the unwanted data.

[Eliminating duplicates is trivial. Use `SELECT DISTINCT * from table`. We don't need to "confirm that a table contains duplicate records". We can just blindly apply this transformation, and all duplicates will be eliminated. If that was the only goal, we'd be done. But it isn't the only goal. We also want to preserve relevant information, and duplicates may be relevant information. (Given SQL's many follies, so could be position in the table!) A simple generated key can ensure we don't lose this sort of potentially relevant information accidentally. Of course, it would be better if the data source was a relation to start with, in which case we'd have sets - no ordering, no duplicates.]

Indeed. And the records in question might not be in anything recognisable as a "database", either. They might be log entries in a hardware device, for example, and therefore not capable of being deleted or recovered. There might be duplicates that have meaning -- like my three logins example above -- or the duplicates might be genuinely redundant, but this is something the machine can't know. Therefore, one cannot follow rules or procedures to delete them. It's up to a human to decide whether apparently duplicate data is meaningful and should be preserved, or meaningless and should be eliminated.


See also BagVersusSetControversyRoadmap


MarchTwelve


CategoryDiscussion

