Auto Keys Versus Domain Keys

Summary of the key arguments on each side of the debate:

Domain Keys:

Auto Keys:


Automatic keys typically come from a misunderstanding of relational principles. Normally a relational schema should not have a lot of auto keys, they are typically not needed.

The folklore including here some OO and O/R literature is confusing several distinct issue. So let's first separate them.


First issue:

Identify and enforce all the domain keys, that is all the valid business rules that dictate that a combination of attributes (columns) should be unique (or identifying) over an entity (table).

First issue is that you have to have candidate keys. An essential part of database systems is enforcing integrity constraints on data -- if you don't enforce constraint, you might as well use ThePrevayler. So regardless whether you created an autonumber CUSTOMER_ID, CUSTOMER_SSN has to be unique. In SQL, that would be a "CONSTRAINT ssn_is_unique UNIQUE (customer_ssn)". That should take care of the vs part. There's no adversity between domain keys and auto keys, all are keys, and relational model holds that all keys are equally important.

Therefore one of the important task of the schema designer is to identify all uniqueness constraints for the domain, and declare them as such. This is where a typical mistake happens with the OO minded designers: they use auto keys in place of domain keys. If you use auto keys, make sure you always use them in addition to domain keys, and make sure you identified all the candidate keys.

I don't think it is limited to the "OO minded". I don't feel much for OO, but still think that auto-numbers are often more change-friendly than domain keys, which are too subject to the whims and politics of the business world in my experience.

You may not feel "OO minded" but you are more table-minded than relational minded :)


Second issue : Should we add auto-numbers (aka surrogate) keys to tables, and use them as foreign key material?

When and based on what criteria and considerations, is it recommendable to add a surrogate key, typically in the form of auto-number (aka IDENTITY, SERIAL, and other database specific terms) to a table?

Before we jump into answering this question, let's reiterate JoeCelko's empirical classification of keys. Although such a classification is not agreed upon, and is entirely outside the realm of database theory, it is a great help for practitioners.

1) natural key = the nature of the data gives it to you. It can be easy to verify or very hard and trusted to various degrees. The simplest example is the pair of longitude and latitude to identify a geographic location (point on the map). Also for relationships, typically a subset of the columns gives the natural key: PAIRINGS(HOME_TEAM,VISITORS_TEAM,DATE) the whole set of columns is a natural key -- they are derived from the nature of the data : relationship.

2) surrogate key = the system (SQL engine) generates this and the users never see it, it remains internal to the information system. Look at Teradata hashing for example. Another frequent example are _OID columns added by object designers to all the tables.

3) artificial key = A man-made code, like an ISBN for books or VIN for a car or SSN for people, etc. While these may resemble the surrogate keys in the sense that they are typically system generated (although a person responsible for assigning codes may still occur in some domains), the significant difference is that the users see it, they are part of the domain (business) vocabulary.

I am bothered by this classification because whether the user sees it or not is somewhat orthogonal to the source of the key. Sometimes users slowly end up using it over time because it simplifies certain communications. If some users see it but not all, then is it a "surrogificial key"? Often PowerUsers welcome such keys, while newbies get confused by them.

Whether or not a user sees the key is of considerable consequence. An 'artificial key' always requires a centralized authority to avoid conflict, whereas a 'surrogate key' is local and therefore does not need any sort of central 'authority' to maintain consistency. This is, essentially, one of the fundamental issues of naming things. Admittedly one may start with a local identifier and make it global by prepending another global id to it ('1' => 'c2.com/item?id=1'), but even this requires a 'central authority' to distribute the domain name, and another 'central authority' to decide what '/item' means.

True surrogate keys could be used to constrain queries, but would never be returned as part of a query. If the ID is never communicated outside the local system, you can guarantee that nothing ever depends on the ID remaining constant, following the fundamental principle of encapsulation: that which can be hidden from everyone else can easily be changed (or destroyed) without consulting anyone else. Considering how important the distinction between these 'surrogate' and 'artificial' keys is in communications between systems, I believe the distinct classifications are very well justified.

I generally considering hiding a key to be an implementation detail. There is nothing in relational theory preventing such, and is possible within the security frameworks of existing RDBMS that have column-based security. Just about any type of key available in any paradigm can be emulated.

Relational doesn't distinguish between types of keys (natural, surrogate, artificial). And the decision regarding which to use should, I think, be one of design based on the domain. Calling such a choice an 'implementation detail' seems to belittle it, as though no real decision should be made on it until one is forced to do so during implementation. Of course, it is nice to have an RDBMS that is capable of enforcing such a design-decision via security-constraints.


Some diverging opinions:

Using natural keys in any project, is asking for trouble. Natural keys should have constraints put on them, sure, but only generated keys should be used to set up relations between tables, to do otherwise is begging for trouble.

May I know whose expert opinion is this, and have more arguments to back it up? I mean that "keys suck and they're a pain in the ass", is hardly an argument worth replying to.

Maybe this is an artifact of poor database implementations of the relational model, but in the real world, we don't get to work with the pure relational model, we have to work with those poor implementations of it, which all suck at using natural keys. Natural keys are always a pain in the ass, continually have exceptions to the rules, and generally make poor choices for identifying rows. System generated keys remove all these problems, and give the programmer something solid to work with that won't change when a business rule changes.

Overgeneralizing without basis. Take these tables:

 Project_Assignment ( Project_Name, Employee_No, Start_Date, End_Date, ... )

Pairing ( Home_team, Visitors_Team, Date)

The natural keys being: (Project_Name, Employee_No), respectively (Home_Team, Visitors_Team, Date). Why would you ever create a project_assignment_id, or Pairing_ID?

OK, going with your example. Say we set the key to Project_Name and Employee_No, now what happens when the project in the project table gets renamed due to office politics? Every key referencing the project name must be updated, by using a natural key, we've made the schema less flexible.

And if the same employee gets assigned to the project twice? You need to add Start_Date to the key.

That's a workaround for a problem I don't have with surrogate keys.

Also, creating foreign keys to tables who use multicolumn natural keys is a nightmare because the referencing table must now have all those columns too, this make creating foreign key's to those tables impractical.

Were the key ProjectID and EmployeeID, nothing would have to be updated at all when project name or employee number changes, a better outcome. It's a practical matter, using natural key's causes too many programming problems.

What problems do surrogate keys create?

I would suggest the following schema, from your example, this will have none of those problems and be much easier to work with programmatically.

 Project (ProjectID, Name)
 Employee (EmployeeID, Name, EmployeeNumber?)
 ProjectAssignment? (ProjectID, EmployeeID, StartDate?, EndDate?, ... )
 Pairing (HomeTeamID, VisitorsTeamID, Date)
 Team(TeamID, Name)

I know it goes against relational theory to have identity outside of values, but it's a practical requirement. We need a way to identify and link data to other data, outside of their values, this could go back to the whole ObjectIdentity argument, but most of us find that relying only on the data for keys isn't usable in the real world, it simply leads to too many problems with cascading updates, key updates, exceptions to the key rules, FK problems with multicolumn keys, etc.. sorry, but ObjectIdentity seems necessary as it helps us avoid all these issues. Maybe we need to separate the physical model from the logical model. I consider the Generated ID to be a necessity of the physical model, not the logical model.

So the first rule a thumb, until I get enough time to construct the whole argument is: you very seldom need IDs for tables that reflect a "relationship" in a E/R schema. Caveat emptor; it's got nothing to do with relational theory (there's no such thing as "relationship" in relational) but with practicality. -- CostinCozianu

Re: The natural keys being: (Project_Name, Employee_No)

Project name is subject to change. If the names changes, then it will produce a ripple effect where all the foreign keys have to be changed. Plus, compound keys are a violation of OnceAndOnlyOnce because we have to keep duplicating columns to refer to other tables.

Regardless your dislike of Project_Name, which in some businesses (what a coincidence, my current company) are set in stone, imagine you'd have Project_No instead. The key would still be (Project_No, Employee_No, and there would be hardly a rational justification for creating a Pairing_ID. Compound keys do not violate once and only once, any more than single column keys do.

Okay, perhaps it was an overgeneralization on my part. If the compound keys are referenced often or are long, then consider using a generated key. Regardless, I don't think using project title, or any titles, as a key is a good idea in most cases.

If you're going with natural keys all the way, why do you even have Employee_No? Shouldn't that be replaced with a natural key on (Employee_Name, Entry_Date)? If you have Employee_No because that's a customer requirement, as a dedicated natural keys proponent you should educate your client that unique id numbers are a terrible idea and that (Employee_Name, Entry_Date) should be used on all office forms instead of Employee_No. That conversation probably won't go well, and that should serve as proof that surrogate keys are a fine idea.

This "proof" misses the point: The definition of a surrogate key is that it "never leaves the information system" (cf. above). If Employee_No was such a surrogate key, you couldn't have any discussion with the client about it appearing on office forms - it just couldn't. The fact that the customer has an opinion about Employee_No implies that it's a plain normal business column. You can reason whether or not to use this column as part of the primary key of that table. But even if you did you cannot argue that it's a surrogate key.


Re: So regardless whether you created an autonumber CUSTOMER_ID, CUSTOMER_SSN has to be unique.

Not necessarily. Foreign visitors may not have an SSN. Perhaps we can put a constraint such that non-empty SSNs are unique, but that is leaning toward validation rather than "hard" column constraints. "If, ands, and buts" are a suggestion that it is a validation issue. Auto-generated numbers are less likely to be bitten by politics and policy changes.

And many government agencies do not do business with people without either SSN or "Tax Payer ID" (an identifier having the same format as an SSN), that's their business rules.

PageAnchor: License_place

I once worked on a project that involved processing vehicle license plate information. I always assumed that license plate numbers were unique. However, I learned that this was not the case due to some mix-ups in assigning numbers.

May we say that the duplicate license plate should not have occurred, had some software engineers did their schema design correctly (i.e. enforcing business rules)? Having this situation, is it OK not to fix the problem, and just have incorrect information of in the database. Not fixing it will only compound the problems we'll have later, because people do not identify license plates by their ObjectIdentity in the database, but by their number. So the wrong entity can be identified if duplicates are allowed, creating further data integrity problems.

The point is it does occur, and the database needs to support it, even if it is incorrect information. The real world is full of incorrect information and exceptions to the rules, that's why a generated key is needed, to support any possible variation of data, even when it violates business rules.

The database doesn't need to support it willy-nilly. There are smarter ways of dealing with these kind of situations rather than letting the error propagate and creating even more data integrity problems (in this case may be as grave as maybe legal troubles) down the road. Just for example: you can have a separate table with entities with erroneous information that needs to be cleared. There are other ways as well, but sweeping the dirt under the rug with an autonumber, is among the worst option.

What do you mean by "be cleared"? Hunt down cars issued plates 15 years ago and make them all get new plates? Either path is ugly and none is clearly superior at this point. In practice such messes from the past do happen and we have to make a judgement.

Let them come to you. They will have to renew their license (if they want to use the car legally that is). At that point you can correct the error. But if you just let duplicate license numbers in the system as if it's business as usual, you can end up creating troubles (not the least of which legal troubles) for one or both of the owners. Imagine somebody reports to the police of a problem with the car with that license plate. Or that one car gets involved in an accident, and the car insurance will be raised for the other car. Anyways, a smart information systems should identify data integrity problems, deal with them, and correct them as soon as possible. Dealing with invalid data as if nothing happens is not a good option, a principle otherwise known for decades as GIGO: garbage in, garbage out.

Nobody proposed continuing with the duplication. It was an accident discovered after-the-fact, probably using pre-relational mainframe software. In practice other information is often used to help identify vehicles, such as make and model of the car, car color, VIN number, drivers license number (paper card), etc. They have to do this anyhow because people tend to mistype license numbers.

Also the discussion is invalid if the project sponsors are willing to issue duplicate license numbers. We assume that the non-duplication was a business rules approved by the project sponsor, otherwise, of course we shouldn't create rules. But if the project sponsors says that license numbers should be unique, a different business rules have to be identified also from the business user, as to how the system should handle erroneous license numbers.

My organization was merely a user of the data. We had no control over the duplication mistake. We were paid to use it as-is. At best we could make recommendations, but with no guarantee that they would have been considered. Although I was too low-level to hear it, the conversion probably resembled:

Us: We've found 5,000 duplicate license numbers. This gums up our database keys.

Them: Is there a technical work-around?

Us: Well, yes, using an internal key, but it complicates the design. Can you re-issue new plates to the duplicate consumers?

Them: It would cost us about $500,000 to replace all those duplicates, plus ill-will from the consumers and hunting down the hold-outs. Thus, we'll price it at about a million dollars. Now, will this work-around result in a million dollars of extra cost?

Us: No, we couldn't say that. It would be roughly a third of that.

Them: Then please implement the work-around (internal key).


Some people believe that record or object keys should be system-generated (auto) integers, while others believe that there are natural "domain keys" such as social security numbers, person names, or a combinations that should be used.

It is no longer a matter of belief, it is a matter of scientific knowledge :)

You have to say more than that... back that statement up or I'll delete it.

If you are patient. This problem is well documented in FundamentalsOfObjectOrientedDatabases, in ChrisDates writings in a variety of other sources. I am translating it here in common language for the practical software engineer.

It should be noted that even the experts don't agree on the ObjectIdentity issue. ChrisDate's word is in no way the final word on the issue. The object world requires ObjectIdentity, the relational world denies its relevance, it's an unsettled issue thus far.


I lean toward generated keys. Social security numbers (from U.S. government) sometimes change and foreign visitors may not have one. Or a company may go international. Names change due to marriages and being named after celebrities or dignitaries who fall into disrepute. It is safer to have unique numbers that are "blind and dumb" to external information. Project titles and product names also often change.

Plus, it is often easier to say on the phone, "send me the info for project 63723" rather than "send me the info for the second project that Jack started on December 3rd of last year".

Imagine Bill Gates saying "run me a status report on project 63723", versus project "Longhorn". Are you sure Longhorn is not 63724?


The reason DB-generated ID numbers are useful is because they're so useless for everything else. Anything that might be important in any other context is possibly subject to change in the future. Project Numbers change if your control-freak CTO decides to create a new Project Number System. People get new SSNs if they're victims of identity theft. But autokeys are so completely boring that nobody would ever try to change them. You just accept what the database gives you and get on with your life. -- francis


Imagine Bill Gates saying "run me a status report on project 63723", versus project "Longhorn". Are you sure Longhorn is not 63724?

{If you don't know the number, then look it up, or let the other person look it up. Nobody said the primary key is the only way to find information. If there is uncertainty, then perhaps try QueryByExample. A unique number is often just more compact and convenient if you know it. In practice if you are dictating it over the phone, you should also include a description of the item to use for verification.}

That's an issue of navigation. As long as you can navigate to the correct record using attributes in the record then you can get to the primary key. Navigation is easier then changing the name longhorn everywhere that it is used. And project names definitely change. Also, comparisons on the numeric key are faster and more storage efficient. You also don't have to worry about normalizing string keys for locale or case.

You are of course 100% correct, generated numeric keys are the best, but they violate the relational theory, no matter how bad the relational theory works in the real world due to database engine inefficiencies.

Certainly doesn't match my reading of relational theory.

Well, if some people have a problem (ObjectRelationalPsychologicalMismatch) reading relational theory, try reading some OO theory: FundamentalsOfObjectOrientedDatabases. Both theories agree on this issue. Some pretty practical folks like JoeCelko, who make a living as consultants from fixing database messes, and are nowhere near the relational theory point of view, agree also on this issue. -- CostinCozianu

Is the "characteristic of the entity" constraint that is bothering you? Can't i say what is a characteristic? Can't i say that a number is such a characteristic?

How about we look at specific scenarios of things going wrong under both approaches.


JoeCelko (a person who wrote a few books, sit on the SQl committee, has tons of practical experiences consulting on database implementation and is pretty much at odds with the relational theory establishment):

http://groups.google.com/groups?selm=7e67a7b3.0106191214.3b99d84%40posting.google.com&output=gplain

An interesting post. Its tone is dismissive which is not warming. There were no slam dunk reasons against identity keys. If you want the order of a select to make sense then use an order by clause. Saying an artificial key is ok because you can use a check digit seems way outside the relational model and not a valid take. The other issues were minor and don't outweigh the pros.

Plus, many of the complaints are vendor-specific issues.


Propagating Changes to Other Systems

It might be possible, even easy in some cases, to change keys based on titles and names when you have full control over the database, but often a database is part of a larger network of other databases and repositories. It will be much harder to fix changed keys in those systems outside of our control. We can't keep "recalling" data we send around.


Out-of-Sync Problems?

If we do have primary domain keys, then there is a possibility that they need to be changed. Couldn't this cause "out-of-sync" problems with processing? For example, suppose we have a web page of project names to be selected. It is there in the listing, but if a key changes between the time the web page was displayed and the time that the user clicks on an item, then something that is there one second will be gone the next from the user's perspective. True, this happens anyhow if say a project is deleted, but in this case it is just being renamed, not deleted. One could argue that this is usually a minor problem, but one that must be addressed anyhow. For example, suppose we send a list of product information out that uses product description as a key, another company provides sales statistics for each product, and then sends it back to us. What if we change the product description in between such a cycle?

Would you try something better than a "product description"? Is a product code or a parts number good enough? There are several issues that need to be addressed when choosing good identifiers, especially in distributed systems. Auto-number keys will not meet all criteria, for example they will not have good validation rules, and will not be easily recognizable and friendly to human beings. Do you want your users to fill in a number instead of a state code in an address? How about sending emails based on autonumber. An email address is a good example of an identifier in a distributed system, that nobody in his right mind (other than the defunct Compuserve system) would replace it with a number (auto-number or otherwise).

There is a list under AutoKeysVersusDomainKeysDiscussion that perhaps should contain these examples. I would note that email addressed *do* have problems when people leave or change duties, but they are usually not primary keys anyhow. Usually they are attributes of other entities, such as "employee" or "role".

So, by the reasoning above, we should no longer put 5 or 9 digit postal codes on mail we send.

No, by the reasoning above postal codes are an acceptable identifier, that is part of the business domain, and people got used to it. See surrogate versus artificial keys. Still, the mail can arrive at destination without zip code, and whenever I want to put the address of my business, I have to check my business card for the zip code. The same does not happen for the web URL or my business email. URLs and emails are better identifiers than zip codes.

Phone numbers are too difficult to deal with, I should just type in the name and perhaps city of the person I wish to call and the phone can connect based on that.

Let's see. Run a quick list of your friends, and try to count for how many of them you'll know their phone numbers by heart, versus for how many you'll know their emails.

When I want to set up an auto-payment, I shouldn't have to deal with an account number, just my name, the bank name and the branch name. You say auto-number keys will not meet all criteria, but the only one you cite, validation rules is easily handled via check digit(s) to catch typos -- this is done with credit card numbers, for instance) What other criteria is not met?

The discussion got confused as the brave defenders of autonumbers everywhere are now split into "show the numbers" versus "hide the numbers" camps.

By the way, US two-letter state codes are already an encoding of state names, many of which can be listed in several ways or misspelled. (Do you know what the official state name of Rhode Island is?)

So US two-letter state codes are mnemonics (artificial keys) but not surrogate keys such as auto-numbers, and justly so. Would anyone consider an State_OID ? Who knows, maybe California will want to trade names with Texas, and we have to let the system have the flexibility to accommodate that change.

Email addresses (the part beyond the @) are mapped to numeric ip addresses.

Exactly, and the funny thing is that the actual IP in the MX record is more likely to change than the domain name. Even when "rational.com" got swallowed by "ibm.com", you cans still send emails at rational.com, and probably they will keep that for a while. If you spread identifiers such as the IP of your mail server, you'll end up nowhere.

The conclusion is that while numeric identifiers have their use for things like ZIP codes, telephone and bank accounts, they are often times not as good and applicable. If one of their requirements is to be hard to guess then yes 16 digit numbers for credit cards are better than email-like identifiers. But if the identifier's quality is to be easy to remember, I have doubts you get better identifiers than email addresses or domain names.


User Visibility

[Users should never see the autonumber keys; they exist to help the system and the programmer, not the user. The user will see only the natural domain keys. The whole issue is that domain keys change too much or have too many exceptions to be reliable. Auto number keys don't; they are solid and reliable and allow the system to suffer through domain rule changes without problems.]

I disagree. It is often convenient to have the identifier displayed somewhere (or at least user-obtainable) so if there is a problem with a given record it is far easier to find the source record.

[If you display it, it becomes domain data, and loses its benefits. As long as no one sees it, they won't ask you to change it, or make it mean more than it is. It's supposed to be a hidden implementation detail, not part of the domain. If they see it, they'll start wanting to start the numbers at a certain number.. or sequence them or a zillion other silly things managers come up with.]

I have never encountered this problem for new applications or entities. If they ask for extra information, just tell them you will make a new field/column for that new info item. Another approach is to tell them that if the numbers are not automatically generated, then a person needs to be assigned the duty of managing the codes. Rather then risk personnel task assignment battles with other departments, they will stop asking.

[I've had these problems many times. Some manager sees a number, decides he wants them to be sequential at a certain level, like project or something and just screws everything up. I'll add a new column for those kinds of things, but then you have to redo the display and possibly rework other affected things. Better to just not show your generated keys at all, they aren't for users, they're for the programmer.]

{Just give it a weird geeky name so that nobody cares. The "Flux Capacitor Number", for example. Maybe put it on the "about" menu or something. If the manager asks what it is, just say it is to assist with phone support, which is the truth.}

What one sees as flexibility another may see as data integrity problems. The fact that users can change everything but the surrogate key, may be a flexibility that comes at a price. I wouldn't want my account number at the bank be changed at the whim of whatever call center operator, or bank manager, just because the software had a hidden OID, and "look, it's so easy, just turn this control into an edit box and you're set". Anyway, that kind of flexibility is provided by ON UPDATE CASCADE, but careful DBAs will not provide ON UPDATE CASCADE on every identifier, especially because they entrusted with data integrity.


Arguments Summary

These arguments are a caricature, they need to be further refined. For example, would you ever use a DATE_ID to autonumber dates? Well, I didn't think so, but the above summary doesn't give much insight.

I have actually considered such for a generic "period" table in which the time span of a period is user-defined. However, that is a specific scenario, not something addressed in a summary anyhow unless a new more general principle can be derived from it.


SocialSecurityNumbers? are not strictly unique. The Social Security Administration (SSA) reuses the numbers when people die, etc. According to legend, some companies with very large customer bases and history have encountered legitimate duplicates. Also, SocialSecurityNumbers? are information protected by privacy laws and usage laws. It is actually illegal for MOST organizations to require or even ask for someone's SSN. Essentially, you can only use an SSN legally if your usage will eventually provide the SSN as identification to a government agency who is specifically empowered to use it (e.g., the SSA or the IRS).

I revealed this fact to my current employer when they hired me to rearchitect their primary business system. They were using SSNs as primary keys to track students. They had already bastardized their system because many students would not provide their SSNs (as is their right) or because they could not obtain the information in a timely manner (at row creation). They had duplicate SSNs because they often had duplicate student records, to which they later added the SSN when they obtained it. The duplicates were not cost-effective to avoid. Often they could not obtain the "unique" SSN/EIN because we never had contact with someone who had the information (e.g., the clerk at our supplier did not know her company's EIN). When I arrived, my employer had "enhanced" their system to assign SSNs in such circumstances, plus they had established procedures to manipulate the SSN to handle duplicates, etc. The result is that most of the SSNs are untrustworthy. I designed the replacement system to treat SSNs as they truly are: optional data that can not be guaranteed unique, and therefore not a candidate key. In the midst of our development, our parent company issued a policy demanding the removal of SSNs as required information from our systems. Since I had already done so, compliance was easy. With the previous system, compliance would have been cost-prohibitive.

Other data that appears to be a unique domain identifier (license plates, etc.) are not actually true candidate keys. Before a column (or set of columns) can be a candidate key in a relational context, you must be able to reasonably guarantee that the values are always non-null and unique. You only have that guarantee when you assign the values yourself, or the values are suitably defined as such by a trustworthy source. SSNs do not meet these criteria. License plate numbers typically do not meet these criteria. I wonder if VINs do. Latitude and longitude do (by their definitions).

I think that SeparationOfConcerns applies here. Primary keys serve a purpose (mandatory uniqueness, simple relationships) beyond that served by a typical domain identifier. Attempting to combine the concerns is risky, complicated, and unnecessary. Using a surrogate key separates the concerns, eliminates the risk, and is the SimplestThingThatCouldPossiblyWork. For me, it is typically a no-brainer, a pattern, that almost always applies and almost never creates problems. With a surrogate key, I can guarantee the constraints that allow me to build simply. In the spirit of SeparationOfConcerns, I almost never allow a user to see my surrogate keys. They are implementation artifacts and are strictly NOT subject to business "enhancements".

In my employer's new system, a surrogate primary key is a standard convention for every business domain table. Therefore, the code to access those tables and to manage their relationships is very simple and is implemented OnceAndOnlyOnce.

RobWilliams


Rob, I agree that SeparationOfConcerns is an important issue. But I think we need to identify the concerns (problems to be addressed first). An important concern is that content in table rows has to be uniquely identified, by part of its content, and this is driven by the business rules of the domain. Therefore your tables should have a candidate key with real content, regardless of whether you use surrogates or not ? If you agree to this point, you may notice that, by definition, surrogate keys are useless in addressing this problem whereas a key based on real content solves it.

Therefore one concern is addressed by always using keys that have domain driven content (information that you display and is used by the users, even if artificially generated -- as is the case of order numbers, account numbers, etc). Indeed I agree that SSN has legal barriers that prevents most businesses from using it (although banks do have to use it, by law). It was a theoretical example. If anything, I'd expect the systems of Social Security Administration (the institution who manages these numbers) to use them as keys, (or at least as the most important part of the key). I'd also expect that the systems of DMV (the institutions who issue drivers licenses) would use the license number as key.

The next question to ask is what do you use to relate information in different tables. If this issue was orthogonal to the first issue, I'd say go ahead use whatever is needed. To begin with, this second problem to solve is at a different (technical, implementation) level. Whether you use a natural key or whether you use surrogate key, has nothing to do with business rules of the domain, but with how you optimize your database implementation for particular data access and update patterns. Using surrogate keys as a mean of implementing foreign keys, is not always the best option, but it has to be taken on a case by case basis. We can discuss the implementation implications of the choice of foreign key material separately.

CostinCozianu


I agree that candidate keys are still very important. Indeed, I would say that the logical versus physical distinction is crucial. My surrogate keys are almost always a strictly physical, implementation artifact. My logical schema is dominated by candidate keys and proper normalization. The logical schema should be able to stand alone. However, surrogate keys are sometimes necessary to complete the logical schema.

More importantly, and relevantly, surrogate keys are an important technique for completing and simplifying the physical schema. Being able to always count on a single-column surrogate primary key makes my database isolation layer very easy. My key generation, navigation, and validation are simple to implement OnceAndOnlyOnce. Using natural candidate keys often means that key management becomes specialized to each table (each table is an exception).

Always using a surrogate key for primary and foreign keys can be storage inefficient, but storage is cheap. Carrying around the extra keys and processing them is also inefficient, but memory and processor cycles are also cheap. The ease of development and maintenance available by simplifying on the convention of surrogate keys dramatically reduces human labor, which is the part that is not cheap.

RobWilliams


Using Technical / Generated Keys can result in unnecessarily complicated code and inefficient applications. For example:

Country has internationally agreed short codes (ISO 3166-1997) that can be used to represent countries. These short codes are universally accepted as abbreviations for the relevant countries.

So we can have a table with Country_Code (as the key) and Country_Name (as the full name). Whenever I use Country_Code as a foreign key in a related entity it is immediately obvious from the content of the field as to its meaning. Only if I wanted to bother with explicitly referring to the Country_Name would we need to bother with additional reads of the databases.

Using a technical or generated key would result in a table with Country_Key, Country_Code, Country_Name. Then when ever I use Country_Key as a foreign key in another table I would have to do a second read of the database to find out which country I was referring to.

What a waste of effort.

I think what is important to remember is that neither opinion is wholly correct and that sometimes generated keys (for example Purchase_Order_Number, Vehicle_Identification_Number) are valid and in other cases (eg Date (why would I have a technical key for a date - when it has to be unique (at least in this universe)), or Zip_code, Postal_code) a more natural key would be better.

Like all DatabaseDesign issues it is a MatterOfOpinion which is preferable and what may be appropriate in one circumstance may be inappropriate in others.

Thank you very much for citing ISO country codes. This is what our financial instruments are referencing in a country field. But now we have a new requirement that super-national institutions like world-bank has to be supported as separate entities. So much for "natural" DomainKey?.

I suppose you mean those 2-character codes. You could perhaps add your own codes for such entities, such as "B1", "B2", "B3" for each bank. None of the ISO ones currently use digits, so you are fairly safe from potential future overlaps. But, it is slightly kludgy.


I agree. You may note that I adopted the convention of surrogate keys for every business domain table. This is as opposed to using surrogate keys for code (reference) tables. In general, I prefer to treat codes as you have suggested.

However, codes can also become a problem for many of the same reasons. They may not be guaranteed unique and they may be optional. They may be storage inefficient. They may contain embedded information that violates normalization rules. There may not be a standard encoding for a particular kind of information, or it may not be known.

Among the most common problems with codes is the simple fact that they may not be "...universally accepted as abbreviations..." Just because there is a standard does not mean that there is universal acceptance or practice. As a matter of fact, there is almost nothing that is universally accepted and practiced in the realm of information and encoding. Country codes are a great example. There are numerous standards for the same (postal) and for varying (TLDs, character encodings, telephone prefixes, etc.) purposes, and there are numerous occasions where an organization may have to translate their codes so that they can interoperate with other organizations.

For example, our new system has to interoperate with our parent company's Oracle Financials system. We have an translation layer between us that converts our country codes (based on a recent standard such as you proposed, but I do not remember which) to the codes expected by Oracle Financials. There are cases where there is no obvious translation, and we have to fudge it. And this is just within our own corporate hierarchy!

Postal codes are also a problem as a "natural" primary key. If you try to use the postal code as the primary key to relate a shipping address to the city and state, you may eventually become quite frustrated by the fact that the mapping is not unique. A postal code can span cities, and using the wrong city can slow down or interrupt delivery. And that is just within the US. It is a whole different ball game in other countries, and many countries do not have postal codes at all.

One of the main differences that seems to motivate this debate is the simple progression from ProgrammingInTheSmall? to ProgrammingInTheLarge?. With small systems such as a typical web site, assumptions like using a particular country code standard can be reasonably made. However, as one's experience grows along with the scope and complexity of the systems one builds, such assumptions get trashed. When ProgrammingInTheLarge?, most such assumptions are trashed and one has to adjust accordingly. The scope of the system forces one to understand and address a much wider scope of standards, purposes, issues, and exceptions. Surrogate auto keys are one of the patterns that have been identified as a result. When ProgrammingInTheSmall?, such patterns are overkill and unjustified.

So, our potentially useful purpose here is to identify the likely points of transition where such patterns become justified, and to identify the many such patterns and antipatterns that arise.

RobWilliams


Here is an example problem where I can't see a good solution using Domain Keys. The purpose is to store the results of many different benchmarks against a configuration of a system, across many different system configurations. You could use two tables:

System Config(configID, Mfr, Model, RAM installed, CPUs installed, CPU model, CPU speed, OS version)

Benchmark Result(configID, metric name, metric result, test date)

The system configuration is inherently combinatoric and so I don't see a single key or small set of keys. Maybe it would be better to use a simple single table?

PeteProkopowicz

Now there you'd have the "natural" key (Mfr, Model, Ram_installed,CPUs installed, CPU speed, OS version). Yes, it will have some performance problems but in principle it can be done. In the worst cases the natural key would be the whole table. The one case current databases cannot handle (because they do not support relation attributes, is when the "thing" to be stored is a set whose cardinality may vary. For example if the configuration was not defined by a fixed number of attributes but defined by a dynamic set of name/value pairs :

 System_Config(config_id, date_created, ...)
 Config_Details ( config_id, param_name, param_value)
 Benchmark_Result(config_id, metric_name, metric_result, test_date)
So this is a case (representing sets) where the way to go is with an ID (as per the classification above), but optimally this piece of information should be exposed tot the user. The users can then provide meaningful names (strings ) for the ID, that will help them referencing the information. -- CostinCozianu


Claim:

Nonsense. There are practical benefits to using domain keys in certain situation, please read this page carefully and learn from it. There are also drawbacks to using surrogates.

[The only "pro" argument I see on this page is that it's more pure, which is not practical. The only thing presented as a benefit which is practical is enforcement of business constraints, which is not exclusive to domain keys (you can enforce constraints on a value or set of values without using it as the only key), and which is many real-world circumstances can actually be a drawback. Examples were given, and "Well, you should have better systems then" is not an answer.]

An extremely common business rule, especially when creating new systems that will manage existing data, is that the system should not allow X, but it must tolerate it is it occurs anyway, because there's existing data that does that, even though it's wrong. Worse, the second part of that rule will often fail to surface until you're well into testing with your carefully designed system, when you actually start working with the full set of legacy data.


Testimonial:

I was recently called in to help another team improve the performance of their application. Early on this team had, for the best of reasons, decided to use only natural domain keys. Because of the nature of the work being done, most of their tables ended up with a variable length text field key that averaged around 150 characters. They had been testing with empty databases throughout development, and now that the QA department was running multiple days of data at a time (about 50,000 records a day) the RDBMS's CPU usage was growing exponentially.

We kept the variable length text fields around for ad hoc query purposes, but used 64 bit auto generated identity keys for all transaction processing. RDBMS CPU returned to levels comparable to an empty database and remained flat for large data sets.

Moral:

Premature optimization is the root of all evil, but sometimes you have to face the fact that natural domain entities can make lousy keys.

I once had to fiercely argue to prevent a similar kind of compounding. It did not help my career, but it was the right decision. (Damn, I sound like W during the debates.) -- top


This page cites SSNs, car license plates etc. as classical examples of "domain keys". Does nobody else but me see that they're actually auto keys in the government's database? This might make a good argument in favor of auto keys, but I'm not quite sure. -- VladimirSlepnev?

They may be auto keys in the government's database, but for any other database they're domain keys, because they're dictated by the outside reality that you're modelling. There should be no confusion on this. A database is a model of something of interest, in the 'real' world (for some application-specific definition of the word 'real'). A domain key comes from the thing being modelled, and is not an arbitrary feature of the database schema. -- DanMuller

There's no confusion, but I'm afraid I haven't made myself quite clear. The government is doing exactly the same thing that we want to do - it's trying to model the "real world" with a big database. And the government's experience shows us that auto keys are somehow needed to model the world. The government can't "just use domain keys for everything" - so why do we think we can? -- VladimirSlepnev?

Oh, I see what you're saying. But even in the government's database, they're not auto keys or surrogate keys in the sense of "a key added for the convenience of the database design". They are part of what's being modelled -- even if no database were involved, one purpose of a motor vehicle department is to generate unique labels for cars. (In a sense, it's their job to generate domain keys for other applications!) So the fact that they generate these things doesn't say anything about auto keys in general. -- DanMuller


TentativeSummary:

Domain keys have the benefit of containing information besides being a key. Auto keys always require an additional read in the database to obtain such. However, auto keys have the benefit of being guaranteed unique, which is the most important feature of any key. Therefore, auto keys are mostly preferred unless you can be sure that your domain key is, has been and always will be, unique. There remains some discussion over when / which domain keys are allowable.

In general, the drawbacks of domain keys outweigh their benefits, so most of the time auto keys are used. -- AalbertTorsius


I thought we'd got beyond Platonist pretensions in the new Age. Oh well.

Neither is best, ladies.

Domain keys are fabulous IF they can be kept unique. The reason being that you can avoid 'hidden' duplicates. Auto-keys do not explicitly map uniquely to the domain, just to the database. So I can enter myself fifteen times, each with a unique key, but it would be wrong. Domain keys, at least theoretically, can be verified against some external physical party.

Autokeys can be short, but to make them correct outside the database you usually need some extra logic.

Furthermore the key may need to be unique across multiple databases, and auto-keys are only local.

Then there's the 'changing primary' problem. Well it isn't that bad, honestly (delete the old, insert the new, update relationships), unless you have relationships from beyond the database, then it is disastrous.

Then there's historical versioning etc etc. i.e. there is not one answer. Key policies need to assessed individually and expressed accordingly. Anything else will eventually return to bite you.

I suspect there are some righteous fence-sitters above, but perhaps they were drowned out in the noise.

-- RichardHenderson

Don't confuse auto-keys with establishing the uniqueness of a natural key. People who use auto-keys, still put unique constraints on the natural keys when possible or necessary to enforce data integrity. Domain keys have no advantage over auto-keys in the data integrity arena. Nor are auto-keys necessarily local, GUIDS are popular for use when you need database spanning auto-keys. -- RamonLeon

UUID keys solve most problems of autokeys _and_ domainkeys at slight performance penalty compared to integer auto-generated keys. Especially in distributed scenarios.

This AnonymousCoward's understanding is that the sole intention of a unique primary key in a real-world RDBMS is to allow the database engine to uniquely identify any tuple under all circumstances for the purpose of relations between tables. To that end, the RDBMS does not care about the contents of the data, other than that supporting indices would be nice. Hence an auto-increment is GoodEnough, and doesn't suffer from being assigned to more than one item, or from tuples having more than one value. UUIDs are also worth considering if sequentially-numbered records poses a security risk (ID values requestable and passed in the clear) or some other practical problem.

This other AnonymousCoward agrees. I started my career heavily biased towards "user keys". Experience (that thing you receive just too late to benefit from it) taught me otherwise. Rules of thumb:

1. If you don't control the domain of the candidate key, consider it volatile (SSN, ISBN, country code, TLD, cheque (US:check) number, etc). Exception: if changing the domain of the key would require change to a fundamental physical constant (the charge on an electron, Avogadro's number, the value of PI, etc), the domain key will provide sufficient uniqueness. UUIDs/GUIDs may be OK; time will tell. 2. Use a domain key for "hinge" entities (those which are not on the "many" end of any one-to-many referential relationship) if rule 1 does not apply. 3. Distinguish between identification of, and user access to, a tuple. In the '80s it was common to have one or more "sort key" columns for user access. In the "Bill Gates" example, project code 'Longhorn' would appear in a sort key column ("PROJECT_CODE_NAME", maybe), possibly with a uniqueness constraint. Changing "Longhorn" to "Vista" would not then require updating several million project timesheet entries. (In some of the systems I worked on, the "customer number" and "product code" were "sort keys", each with a uniqueness constraint.) 4. Lemma A for rule 1: Uniqueness Counterexamples. In the project and project-assignment example above, the proposed domain key (project id + employee id would have failed many times for me. It is not unknown for one employee to be assigned to a project multiple times; for example, during project definition, acceptance testing, and training; in the intervals, the employee works on other projects. If the data model cannot cope with corner cases it is not of much value. In general, if a "candidate key" is not precluded by the laws of physics from having duplicate values, eventually it will have duplicate values. 5. Lemma B for rule 1: Partitioning counterexamples - discussed above with SSNs. [In set theory, a partition is a set of subsets, none of which have elements in common, which collectively cover the entire set.] Is it possible our business could want to do business with someone who does not have a SSN/other (tuple of) attribute(s) that we think is a candidate key? If so, our candidate key does not provide a partitioning projection.

Not a rule, but experience: ON UPDATE CASCADE is NOT your friend if several million rows will be updated. -- AnonymousCoward2.


Using the project name example, there is another problem. Even if you rename all references in your organization's system when titles change, what about external systems or documents? Suppose I make a list in a spreadsheet, take it home to fiddle with it on the weekend, and then come back Monday to key in miscellaneous edits. How will I find the original if it now has a different key, the title? I may be able to do QueryByExample based on values, but that's neither convenient nor guaranteed. A simple, stable, "dumb" key makes this kind of thing smoother. One could also extrapolate to customers and biz-to-biz transactions.


Licence plates are even worse than some people have suggested as candidate keys, because they do not map uniquely or permanently to a conceptual car - cars can get new licence numbers, because the owner has obtained a vanity tag, or moved it between jurisdictions, or allowed its registration to lapse, or a whole host of other reasons, and plates may be issued to different cars at different times even when there's no unintentional duplication (especially vanity plates). There's also the problem of tracking identity between rebuilds - in some places, the VIN plate is the defining part of the car which has continuity, but I'm sure that isn't universal because few things are in law. Then you get tricky things like trade or dealer tags, trailer/sidecar licences (which in some places are separate to the licence for the primary vehicle, rather than caring a copy of its licence plate), and obsolete licences on very old vehicles, as well as the special military and government fleets, and oddities like exempt vehicles, or foreign cars.

For one particular usage I was involved with, it wouldn't matter if the license plate could be re-assigned down the road as long as a snapshot of the active plates for any given point in time was pretty close to unique. The accidental overlap via bureaucratic mixup ruined this approach, creating lots of extra work.


Automatic keys are unable to enforce certain constraints that can be enforced easily with domain keys. An example was a new sub-system that was developed to implement a web application to allow managers of land to report annually on the land managed. The database of which this sub-system formed a part had a convention that an automatic primary key existed for all tables. Also the web front-end developers wanted a single-field primary key for all tables – I am not sure if this was a strict requirement of their tools or just made their use easier. Each manager completes a return each year and each return consists of many data entities such as the parcels of land managed, financial statement/s, the assets, loans, funding, leases, insurance, staffing, etc. Let us consider only the land and the assets on that land. We have a table Reported_Land of which the land_id and the return_id form the composite domain primary key - a parcel of managed land will occur in a return once per year. Similarly a table Reported_Assets had a composite domain primary key(asset_id,return_id). Now a parcel of land can have many assets and one asset can reside on more than one parcel of land eg a tennis court that straddles two adjacent parcels of land. Thus an intersection table Reported_Land_Asset is required to model the many-many relationship. Using an automatic primary key for Reported_Land (RL_ID) and one for Reported_Asset (RA_ID) will mean these columns must also placed in Reported_Land_Asset as foreign keys to implement the intersection table. This provides the constraint that a Reported_Land_Asset must relate an existent Reported_Land to an existent Reported_Asset but it cannot enforce the stronger business constraint that the Reported_Land and the Reported_Asset thus related must also only occur in the same return. IE it should not be possible to relate a Reported_Land row to a Reported_Asset row that belongs to a return of different Land Manager or of a different year. By using the composite domain primary keys (land_id,return_id) for Reported_Land and (asset_id,return_id) for Reported_Asset allows Reported_Land_Asset to have the composite domain primary-key (return_id, land_id, asset_id) with return_id used in both the two foreign keys (land_id,return_id) and (asset_id,return_id). This enforces the constraint that a Reported_Land row can only be related to a Reported_Asset row that belongs to the same Return.

Note that the natural composite primary keys were not implemented in this subsystem due to the conventions of the larger system and the desires of the front-end developers. The constraint had to be implemented in the front-end web application. Needless to say a bug started creating invalid relations which were not flagged as errors by the database. This was fixed. However because the constraint is not in the database a future change to the web front-end may recreate a bug. It means a separate integrity script must be run regularly to check for this. -- Bob Simeon

Enforcing constraints and setting up primary keys are not necessarily related. One can have a "secondary" column(s) that are designated and enforced to be unique without being primary keys. True, it's more processing overhead, but that's one of the trade-offs to weigh in designs. Further, you may have just seen the down-side of artificial keys because that's what was used in that case. Maybe if your system had used "natural" domain keys you would have seen or had other risks/problems. I've personally seen a government organization bungle "natural" or domain keys such that an artificial key was eventually necessary. It's nearly impossible to "take back" erroneous natural keys spread into the wild. One has a bit more control and flexibility over internal keys.

In fact I'm dealing with domain key issues now. They had a "document number" of their choosing, something like "AK-2014-003": a category abbreviation, year of release, and a sequence number. It looked like a good domain key at first because they've been using it for a while manually, but now they want to shuffle their numbering around. Neither key approach is perfect, it's the matter of estimating the lessor of two evils. -t

I agree with the benefits of automatic keys and of course you can have a domain key (single or composite) with a unique constraint as a candidate key in the same table as the automatic key. However in this case the problem was not a uniqueness problem but a foreign key problem in the intersection table (Reported_Land_Asset) which needed to have one column used in both of 2 composite foreign keys to enforce the constraint that the 2 parent tables (Reported_Land and Reported_Asset) each shared the same grandparent (return_id). To enforce this constraint, and because Oracle requires a foreign key to reference the primary key of the referenced table, the 2 parent tables must have a composite primary key and cannot have a single column automatic key. I suppose each component of the composite key can be automatic keys so this example rather shows that some constraints cannot be enforced by single column automatic keys (the normal case). --Bob Simeon

{I'm a bit surprised to hear this. The last time I used Oracle, you could have a foreign key reference a unique key that wasn't the primary key. According to their online documentation, that should still be the case.}

Yes it looks like I was mistaken here. This was some years ago when I last worked with Oracle and either my memory is mistaken or I was under a misapprehension at the time. Nevertheless single column automatic keys are not able to enforce the constraint exemplified and if foreign keys are referencing a candidate key in the parent table the use of an automatic key in that table as a primary key is redundant with no benefit. I was just trying to point out an example where domain keys can enforce constraints that can not be enforced by single column automatic keys alone. -- Bob Simeon


More at AutoKeysVersusDomainKeysDiscussion


RefactorMe: This page is TooLargeToGrasp. A wonder, that anybody is bold enough to read and add anything. A SurfaceRefactoring should be possible.


See RelationalHasNoObjectIdentity, ObjectIdentity, PortablePrimaryKeyGeneration


repeated interest in AugustZeroFive

CategoryPattern, CategoryBusinessDomain, CategoryRelationalDatabase


EditText of this page (last edited February 16, 2014) or FindPage with title or text search