Select Distinct Isa Code Smell

Complete summary at top of AlwaysUseSelectDistinct; short summary: It's a GoodThing if 100% of SQL queries use SELECT DISTINCT. In an environment where most of them do not, any isolated uses of DISTINCT are a smell because they may flag places where a bug arose, and was dealt with in a hurried ad-hoc way by randomly adding DISTINCT and hoping for the best. Some people feel that the need for DISTINCT may legitimately arise often enough to make this guideline unuseful. Perceptions may depend in part on the how the DBMS is being used in a project.

Sometimes one is stuck with a bad schema design because of poor decisions made by people long gone and the cost to clean it up would be great. Under such circumstances, DISTINCT may be the lesser of two evils.


Bad summary. The side that argued for the code smell merely asserted that SELECT DISTINCT may hide bugs in some environments. No examples were presented, on the contrary, a couple of examples were produced to demonstrate that SELECT DISTINCT does not hide bugs. Let's see such an example (that wasn't produced at the time when this discussion was "hot") - if any, and let's assert how common it is and how likely to create problems in real life.

And I resent the idea that projects should have project wide standards forcing things upon the developers that should be left at the developer discretion (anybody here heard of projects where Stored Procedures were mandated ? ). Especially when such "standards" go against the best engineering and scientific knowledge available at the time, and are merely a reflection of the whims of somebody who got the "senior" title. Never using SELECT DISTINCT smells like such a "standard".

[There are legitimate uses for DISTINCT, so tagging it as a "code smell" is simply noise. Rather than rewording, I think the guideline suggested by this page should be dropped. There is no substitute for understanding basic relational theory and its mapping to SQL when writing queries - and that's not a tall order, because the basics of relational theory should not be difficult to grasp for a software developer. SQL is the harder of the two to grok.

That said, here's some data that I recently got from my project at work. It may be too specific to our environment; you'll have to judge that for yourself.

We have a system that automatically generates SQL from relational expressions. (See NotesOnaCeePlusPlusRdbmsApi.) At the moment, the underlying DBMS is Microsoft Jet. (To preempt any derogatory comments: We're hoping to move to a real DBMS soon.) The SQL engine is not too smart; it's often worthwhile to avoid the use of DISTINCT for queries that return large amounts of data. What's worse is there are bugs or limitations in the engine, which are triggered in part by the presence of the DISTINCT qualifier.

I just finished a round of improvements to the SQL generator. The generator had been conservative about adding DISTINCT to queries, doing so often when not needed. One of our tests suites exercises a large number of queries that are used mostly to load choice lists in the UI and to load business objects (this is somewhat of a hybrid system, object-based but evolving towards being more directly relational). After my changes, which improved the analysis of functional dependencies, only about 75 out of about 1800 queries generated during the tests use the DISTINCT qualifier. (I'll check the numbers later this week; this is from memory.)

Now, this indicates that only less than one in twenty queries require DISTINCT; but that's a significant number still, and if you were reviewing a body of queries like this following the guideline that DISTINCT is bad, you'd waste a lot of time.

This is a unit test program exercising predefined queries that support a bookkeeping application. Also, in the course of the tests, variations on some queries are generated repeatedly. OTOH, these predefined queries do form the basis of much our database access. It's hard to say if these numbers are representative of anything typical, but I found them nonetheless interesting. -- DanMuller]


The "DISTINCT" modifier on the SQL "SELECT" statement tells the RelationalDatabase to eliminate duplicate entries from the result set.

In practice, this means that the server will sort the result set to find and discard the duplicate entries. (But if you really need the result sorted, you should specify that in an "ORDER BY" clause.)

The Smell:

Also, in practice, when you see code that uses "SELECT DISTINCT" - especially if it's used often - you should question the code and its writer, to see if the programmer has the foggiest idea what they're doing.

{Or the data designer is/was foggy. One cannot always assume good data and good data design when they walk into a given shop.}

[...unless the project standard is to use "SELECT DISTINCT" in all queries. In that case "SELECT DISTINCT" is not a code smell. "SELECT DISTINCT" is only a "CodeSmell" when general usage in the project is "SELECT <nothing> <columns>" ...and you suddenly find a couple of "SELECT DISTINCT"-s thrown in for no obvious reason.]

People with only moderate knowledge of SQL join syntax often write queries that have an inadequate number of conditions in the WHERE clause to properly join the tables.

The most obvious symptom of this, is that they get an excessive number of result rows from their queries. The quickest and most obvious "solution" that occurs to them is to use "DISTINCT" to "get rid of all those extra and redundant rows."

What they fail to realize is that their query not only returns more results than they want, but it's also very likely to return incorrect results, because they haven't joined the tables properly.

Example of an improper join:

Join Employee to Department table based on Employee.DepartmentNumber? = Department.DepartmentNumber?

Let's say that the PrimaryKey to the Department table is ( DivisionNumber?, DepartmentNumber? ). DepartmentNumbers? aren't unique - just mostly unique, there "aren't that many" divisions that happened to have conflicting department numbers.

So for most employee lists, you'll get the right department information for each employee. But the query is not correct: For employees in some departments, you'll get both correct and incorrect department information.


Actually SELECT DISTINCT is the only reasonable practice, and the only way you can relate SQL databases to their relational foundation. Under the relational model as laid out by DrCodd there are no duplicate tuples. A relation is by definition a subset, and sets have no duplicates.

It's odd that you call upon theory to defend practice. Perhaps you should review the DifferenceBetweenTheoryAndPractice.

All the other practical considerations why DISTINCT should not be used are flawed, for one more reason you need to eliminate duplicates see DatabaseIsRepresenterOfFacts.

For people who are not able to properly write join, eliminating SELECT DISTINCT won't do any good, but you can code inspect their code and force them into using the JOIN keyword. and if you have a relation with Oracle, please tell them to implement the ... standard (apparently they did it in 9i)

The reality is that there will never be a sorting and eliminating of duplicates unless you're using a very old database, that doesn't have a decent query optimizer, or if you have written a bad query. But eliminating the DISTINCT keyword will not do you any good. Actually you should consider using it voluntarily in every query that you write. -- CostinCozianu

Are you trying to say that in a properly normalized database (ThirdNormalForm) all queries will, by definition, return DISTINCT result sets? If so then I agree. The code smell is that DISTINCT clauses often indicate that the database has not been properly normalized. For people who cannot properly write JOIN statements eliminating DISTINCT clauses will show the de-normalized portions of their database and allow them to normalize those parts. Including DISTINCT in every query would not be a good idea in my opinion because it will hide possible errors from you. No programmer is infallible and we need all the safety nets we can get. On the other hand, certain parts of the database may be de-normalized by design (for better or for worse) and may require DISTINCT clauses in order for the queries on those parts to function correctly. -- IainLowe

Iain, first of all, removing distinct is not going to show you anything at all, maybe in some rare cases, but in general a badly written SELECT involving joins will show it's disastrous effects with or without DISTINCT. The usual way you can achieve a well designed schema is by having a skillful designer on the job. A denormalized schema is probably not going to show up in unit tests, but it will be going to come back and haunt you in production. There are plenty of documented cases in database literature. I can hardly see how removing DISTINCT is a safety net.

And the decisive argument is that you can have a supernormalized, well designed schema and proper JOIN clauses, and still get duplicate rows as query results, because this is the default in the SQL standard. More important than the database point of view is the application writer point of view, after all SELECT(s) don't affect the database they just return result sets. You don't want to use your application to test the validity of schema design, and you don't want to manipulate duplicated tuples in the client. -- Costin

It's not that removing DISTINCT is a safety net. It's that by-default across-the-board leaving DISTINCT in, is an ANTI-safety net.

The question is if you want to have a safety net (which you don't have always , but SELECT ALL might arguably help you in this regard, and only occasionally) against catastrophic errors in the SQL code (like failing to specify a JOIN correctly), or if you don't want to risk exposing yourself to more subtle logical errors that stem from the behavior of SELECT ALL.

In any way, SELECT DISTINCT should not be consider a code smell, because if used correctly gives correct results, while SELECT ALL even if used correctly may give incorrect results because that's the way SELECT ALL is. If you think that SELECT DISTINCT is a code smell, then you have to think that any operation at all in relational algebra is smelling, and frankly, you have to give up on relational theory as a logical support for developing with relational databases, and create your own theory with bags of rows, or, even better, bags of persistent objects (the later is quite fashionable these days). -- CostinCozianu


RE: "The reality is that there will never be [...] eliminating of duplicates unless [DB is outdated] or if you have written a bad query." [emphasis added]

Yes: That's exactly the point.

I've seen quite a few cases where programmers didn't quite know how to join primary keys to foreign keys properly. (Especially in cases with composite keys.) And so their queries returned invalid results. The results were obviously incorrect because they contained "so many duplicate rows," in spite of a reasonably normalized database design, but the developers usually elected to "fix the problem" by just dropping "DISTINCT" into the query. This would hide the problem - most of the time - but not fix it: The joins were still bad.

I agree with you that in almost all rational queries I can think of against a reasonably well designed database schema, "SELECT" vs "SELECT DISTINCT" would not cause any change at all in the result sets.

But I'm talking about people, not theory. People make mistakes. -- JeffGrigg

And one of the greatest mistake of many people these days is to think that theory is impractical, or worse, that it goes against practice. Maybe I should start a pattern TheoryAlwaysWins?. I'll come back with a practical data example to show you that SELECT DISTINCT does not hide anything to a responsible developer who does test his application. -- Costin

I don't think TheoryAlwaysWins?. Theory doesn't include the human aspect of the problem. Perhaps SELECT DISTINCT is theoretically no different than SELECT in a perfect database, and theoretically more appropriate in those cases. But people aren't perfect, so we AcknowledgeFallibility and use SELECT instead of SELECT DISTINCT to ExposeErrors. And when we see other people using SELECT DISTINCT, we wonder (since most people don't use it), "What did they do wrong that required them to use SELECT DISTINCT?" That question is our nose twitching, which is why SELECT DISTINCT is a CodeSmell. Remember, a CodeSmell isn't always bad: it's a warning sign, something to question and look at more closely. -- JimLittle

Except that SELECT without DISTINCT is a CodeSmell, it is worse it is flawed unless you are not lucky to have a uniqueness condition. SELECT DISTINCT is always required to correctly manipulate relational data and you should always ask, what do I do wrong if I use SELECT without DISTINCT for example:

SELECT Employee_Age FROM Employees WHERE Dept_No= 6

What do you do wrong here? In case of projections where a candidate key from primary key from the table is not included SELECT ALL is always wrong !!! CodeSmell or no CodeSmell, see DuplicatesAreBad, AlwaysUseSelectDistinct.

You expose no errors, by not using DISTINCT, instead you open an ugly CanOfWorms, you hardly expose any errors but you create the potential for a other errors to come in. -- Costin


Actually, I avoid putting DISTINCT into my queries unless I really need it. SQL is not really set-theoretic, as evidenced by the aggregation operators. Also, most databases will incur the overhead of a sort/unique operation if you include the DISTINCT operator. There's no real way for the database to easily tell if the result is going to be made up of distinct rows or not

As for your Employee_Age query above, slapping DISTINCT on it returns you a useless set. I expect that that query will return as many rows as there are employees in dept. 6. But the query itself is rather useless. More useful:

SELECT Employee_Age, count(Employee_Age) FROM Employees WHERE Dept_No = 6 GROUP BY Employee_Age

Now draw your lovely bar chart. Notice how no DISTINCT clause is needed here. There are only a few cases where DISTINCT is needed. Putting it in to avoid duplicates is only pasting over a flaw in your restrictions. (If the above query returned duplicate rows, it would be most strange.)

-- TaralDragon

Of course SELECT ... GROUP BY removes all the duplicates so DISTINCT is not needed and if you still have duplicates the only solution is to ask the database vendor for a refund. I made that distinction in AlwaysUseSelectDistinct.

In general a SELECT will produce duplicates unless explicitly instructed not to do so, and SELECT DISTINCT is the SQL equivalent of the projection operation in relational algebra, while SELECT ALL may lead to incorrect results in subtle ways, because duplicates can occur.


Now CostinCozianu...

Suppose the Employee table has a PrimaryKey of employee_id.

What's wrong with saying "select first_name, last_name, age, dept_code from Employee where employee_id = 6"?

Why should I be REQUIRED to say "select DISTINCT first_name, last_name, age, dept_code from Employee where employee_id = 6"?

The only case where they could possibly produce different results is if something horribly wrong happened to the database schema. And if things were that bad, then probably none of my queries make sense any more.

SQL is supposed to be a DeclarativeLanguage. Why should be required to add another clause my query in the 99.999% of cases where it couldn't possibly change the result - unless my query was fundamentally wrong in the first place? -- JeffGrigg

Well, Jeff, because you have to manually check that you included an uniqueness condition. Just inspecting your test results will not help. Either way you have to think what you are doing because SQL is not worry free. The goal in using SELECT DISTINCT is to program against a relational database instead of just a SQL database. See UsingSql where I try to put a collection of patterns needed to make a SQL database work as a relational database.

The fundamental problem is that you don't have any good formal model underneath your usual SQL databases and there have been numerous examples of how SQL can return incorrect results because it doesn't follow the relational model.

Just think that by default SELECT would have been SELECT DISTINCT, and then think the other way around: what reasons do I have to use SELECT ALL? You'll see that there is NO good reason to use SELECT ALL, so between SELECT ALL and SELECT DISTINCT what do you choose? Remember that just SELECT is only an abbreviation for SELECT ALL.

-- CostinCozianu


Now getting back to the original assertion, that SELECT DISTINCT hides errors in queries.

Example of an improper join:

Join Employee to Department table based on Employee.Department_Number = Department.Department_Number?. Let's say that the PrimaryKey to the Department table is ( Division_Number, Department_Number ). Department_Numbers aren't unique - just mostly unique, there "aren't that many" divisions that happened to have conflicting department numbers.

So for most employee lists, you'll get the right department information for each employee. But the query is not correct: For employees in some departments, you'll get both correct and incorrect department information.

What was the core of the argument that leaving SELECT ALL would lead to obviously flawed result during testing and debugging (unit testing if you want), while Introducing SELECT DISTINCT will hide the errors by giving falsely correct results (or at least results that look OK at a superficial testing) because the removal of duplicates. This is not true in the case of JOIN queries. Let's take the given example and fill it with some data.

 CREATE TABLE depts (group_id integer NOT NULL, dept_id integer NOT NULL, dept_name varchar(15), PRIMARY KEY (group_id, dept_id))

CREATE TABLE emps (group_id integer NOT NULL, dept_id integer NOT NULL, emp_id integer NOT NULL, PRIMARY KEY (emp_id), CONSTRAINT emp_fk1 FOREIGN KEY (group_id, dept_id) REFERENCES depts(group_id, dept_id))

And let's have this data
 deptsemps
 group_id  dept_iddept_name  group_id  dept_idemp_id
 1  1x111
 1  2y112
 1  3z213
 2  1x224
 2  2y225
 2  3u226
 3  1v317
 3  2x

Then we run the incorrect join query with the SELECT DISTINCT to retrieve the emp_id and dept_name for each employee: SELECT DISTINCT emp_id, dept_name FROM emps, depts WHERE emps.dept_id= depts.dept_id.

This gives us 14 rows, while SELECT ALL gives us 21 rows. Even with 14 rows we should see that the results are incorrect; for example, we shall see among them:

 1x
 1v
which already smells because an employee cannot work in two departments. It is even better because the results are sorted. The problem with incomplete join conditions is the existence of a cartesian product, which will always show as incorrect results. And SELECT DISTINCT is not able to hide the existence of a cartesian product.

Let's consider now another class of errors, incorrect join clauses:

SELECT DISTINCT emp_id, dept_name FROM emps, depts WHERE emps.emp_id= depts.group_id and emps.dept_id= depts.dept_id

Results:

 1x
 2x
 3v

Here the results are the same with or without DISTINCT, and they look OK by coincidence, the only reason to be alert is that we should have had 7 employees in the result, and if we put some more rows in the depts table we might get correct results just by coincidence.

My conclusion is that one cannot rely on catastrophic effects in test runs to decide whether or not the queries embedded in an application are correct. And sacrificing the usage of SELECT DISTINCT will not tell you anything relevant. Keeping SELECT ALL because of the expectation that errors will be magically revealed is not quite right. -- CostinCozianu

You are inverting the argument, no-one is saying SELECT ALL magically reveals anything, it just doesn't magically hide anything. Duplicates in a result set, however they are generated, imply there is a problem in a query. Using SELECT DISTINCT makes evidence of that problem go away. If the DISTINCT keyword caused an error to be thrown on duplicates, then I might have some sympathy for your position; but that is not what it does.

If DISTINCT always fixed the data then one might use it as a default, but it doesn't. This is better seen in value-oriented data such as the shopping receipt example where rows only aggregate by coincidence of date and price. DISTINCT breaks this query such that it doesn't return the correct number of rows. This sort of 'problem' crops up regularly in real-time systems, where a key includes a timestamp of insufficient resolution to guarantee disambiguated rows. This relates to DISTINCT not acting as a uniqueness constraint, but as a uniqueness production mechanism. -- RichardHenderson.

Richard, so you say SELECT ALL is good, it doesn't hide anything. But SELECT ALL doesn't mean anything also, it is flawed, and it is outside the relational model, it cannot even be properly unit tested, although my opinion is that you can't easily apply unit tests to SQL anyway.


In a very general sense, I'd say that SelectDistinctIsaCodeSmell because it usually indicates that the programmer is using SQL without fully understanding it. Yes, there are cases in which SELECT DISTINCT is logically valid. But to use them in the way Costin seems to advocate would seem to lead to programmers thinking "This SELECT ALL query should only return one row, but it's returning more than one, and rather than retool my query to make it correct I'll just use SELECT DISTINCT instead." This seems like a shortcut around genuinely understanding how relational databases work; a ProgrammerSmell?, maybe? -- FrancisHwang

Francis what you just describe cannot really happen, if the query is supposed to return one row returns a bunch of rows you have two possibilities: (1) all of them are duplicates and then the programmer needs to introduce SELECT DISTINCT. (2) not all of them are the same row so even if the programmer puts SELECT DISTINCT he will still have at least 2 rows, so his error is not hidden by SELECT DISTINCT. See below for the full discussion.


CostinCozianu's argument seems to be that since relational databases and SQL claim to be based on formal set theory, then we should ensure that all our queries are formed as "set operations" so that we receive the benefits of performing operations that are firmly based in well defined formal mathematical reasoning.

JeffGrigg's argument is that in most projects, doing "select <nothing> <set of columns>" is the norm: When one sees a couple "select distinct" queries among all the "select <nothing>" queries, with no obvious business reason why duplicates should be an issue, then this is most likely a sign that the programmer who wrote the queries threw in the "distinct" keyword to work around a problem that they didn't really understand.

Unfortunately, SQL databases claim they are based on relational theory ( and consequently on the mathematical set theory and mathematical logic) only when this fits their purpose (i.e. in marketing brochures and in pissing matches with OO databases). On the contrary, when they are criticized on this ground they pretend that they use some kind of extended relational theory (replacing sets with bags to allow for duplicates) which in fact doesn't exist. See DuplicatesAreBad


When I see a bunch of "select distinct" statements, it raises a red flag. I ask the programmer(s) why they put "distinct" in.

Typically, they say "because that query was returning too many rows, so I had to put that in to fix it." That response tells me that they don't know what they're talking about, so the query is probably incorrect.

If they say "at that point, we need to know the set of unique product types so that we can allocate the right buckets for the following process/report," then they probably know what they're talking about, so the query is probably OK. (I might ask why they don't care about the count(), min() or max() of each product type, but it's OK if they don't need that information for the given process.)

If they say "it's our project standard to put 'select distinct' in every query rather than defaulting to 'select all', so that our queries are assured to conform with mathematical set theory," I'd ask them about performance, and I'd think that they are being academic, but I'd have no particular reason to think that they were doing anything "wrong". -- JeffGrigg


Now I suddenly have three replies and all of them repeat the same flawed assumption:

Duplicates in a result set, however they are generated, imply there is a problem in a query. -- RH

Typically they say "because that query was returning too many rows, so I had to put that in to fix it." -- JeffGrigg

[the faulty programmer says] "This SELECT ALL query should only return one row, but it's returning more than one, and rather than retool my query to make it correct I'll just use SELECT DISTINCT instead." -- FrancisHwang

What's more interesting is that all of you are in the comfy position of code inspectors looking down at some poor guy who is just doing the right thing :)

Now this assumption is wrong. More precisely if duplicates are generated in a result set they are a problem indeed, but the problem was generated by using SELECT ALL instead of SELECT DISTINCT or in some cases SELECT ... GROUP BY ...

So let's try a mathematical approach: We have a query that is faulty, it contains SELECT ALL, and the smelly thing about the query is that it generates some duplicates. Further more we'll assume that the query will return the right number of the right columns in the right positions. If the programmer is not able to realize that the result set doesn't have the shape that is expected SELECT DISTINCT wouldn't help him and I'm afraid nothing at all less than changing jobs would.

Let's start with the one table case. We'll also assume that every table has a primary key (or otherwise fire your DBA). So how may a query over one table which has a primary key generate duplicates, considering that it also returns the right columns in a result set?

The answer is only one: it is a projection. It doesn't select all columns or more exactly there's no candidate key included in the columns being selected. That being the case, it will generate some duplicates. What is the solution? Simply to replace SELECT ALL with SELECT DISTINCT. If the programmer was badly advised to put SELECT ALL he might have not caught the error in the first place, because always the development and QA databases will not have enough relevant data, and SELECT ALL might just pass the QA and make it into production. So SELECT ALL is a code smell while SELECT DISTINCT was the right solution. Furthermore, by using SELECT ALL the programmer risks leaving an ugly bug in the software.

Let's see the case when the select involves n tables. If from each of the tables a candidate key is selected entirely than already the SELECT (either ALL or DISTINCT) cannot contain duplicates. A cartesian product on sets will generate another set (no duplicates allowed). The JOIN that can be interpreted by applying the join condition to the cartesian product and even more can't contain duplicates.

Again the only way we can obtain duplicates from a join involving n tables is through a projection (selecting fewer columns) that has at least one table who doesn't contribute with a candidate key to the final set of columns being SELECTED. Duplicates cannot appear from a cartesian product, cannot appear from restriction (WHERE clause), can only appear from projection. And I don't mean projection as in relational model where projection does not contain duplicates by definition, but projection as in SQL databases SELECT ALL.

And how do we get rid of the duplicates generated by a SQL projection? Simply replacing SELECT ALL with SELECT DISTINCT. Sometimes SELECT DISTINCT is no longer necessary because of the GROUP BY clause, which also eliminates duplicates.

-- CostinCozianu

huh? No, that is a circular argument as to why SELECT DISTINCT is good..
Actually, it is reductio ad absurdum argument that SELECT ALL is flawed, and one should always use SELECT DISTINCT, see AlwaysUseSelectDistinct.


I've met a number of developers who fail to specify values for all the columns of a primary key, select only attribute (non-primary key columns) and still expect to receive only one result row back.

Their SQL is wrong.

The result they get back will be wrong in some cases, regardless of their use of "SELECT <default=ALL>", "SELECT ALL" or "SELECT DISTINCT".

I generally recommend sending them to a good SQL class - CBT if that's all the project manager feels he can afford. Unfortunately, the political process in many companies denies these developers the education they so clearly need.

-- JeffGrigg

Can you provide a single example with a case where the results come out right and the developer has unjustifiably used SELECT DISTINCT?

I have to disagree here. It is too easy to blame the company politics. The developers have no excuse for being incompetent at this level. Has anybody started a page yet on ContinuousLearning??

Example:

On the previous project I worked on (in 2000), there was an Employee table with a primary key of (division_code char(4) not null, employee_number char(5)).

The company, through various manual processes "tried" to ensure that employee_number was unique across the company, but failed. In fact, you could only count on employee number being unique within each division, as each division assigned employee numbers itself, and they didn't always coordinate properly with each other. So, there were still some duplicate employee numbers (corporate-wide), and because of the low turnover of their employees, this situation was not going to change any time soon. But the primary key given above is valid, because employee numbers were always unique within any given division.

But, I was constantly finding programmers writing "select first_name, last_name, <and other attributes> from Employee where employee_number = '<value>'". I'd say "put division_code in there too." They'd say "but employee number is unique!" I'd say "no it isn't - not across the whole company. This is a consolidated (whole company) database. Look at the primary key; you have to provide values for all the columns of the primary key if you expect to get the data for only one employee." They'd say "but it works just fine with just employee_number!" I'd say "yes, most of the time, but in some cases it will fail. (Our customers want our programs to work all the time.)"

Personally, I was very unhappy with the job performance of these individuals. But I wasn't the project manager.

P.S. If you want a "SELECT DISTINCT" error, try "select DISTINCT first_name, last_name from Employee where employee_number = '<value>'". When the given employee_number happens to be unique (>95% of cases in this example), the result is correct. If they happen to have tested with an employee number that was not unique, but both employees had the same name (it happens more often than you might think), then the result is still correct. But if employee number is not unique and names are different, there's no telling what the program might do: The database will return two rows for a query that the programmer "knows" can't possibly return more than one row.

(Anticipating the next objection...) It turns out that in this project we had only one development database for a dozen developers, so we usually copied divisions, changing only the division number, to make test data. So, in development, most employee numbers had half a dozen to a dozen copies of the same record, usually differing only by division_code. So... Doing a select assuming (incorrectly) that employee_number was unique in the Employee table would typically return up to a dozen "duplicate" result rows, all exactly the same (assuming that department_code did not appear in their query at all, which was typical). An ill-informed coder might say "oh, I've got lots of unwanted duplicate result rows. I'll just drop the 'distinct' keyword in there, and it'll work fine." Well yes, sort of, for most cases, but not really.


Ok, so I was wrong in my demonstration, I didn't consider that the data itself my contain some peculiar duplicates. It turns out that SELECT DISTINCT did hide an error, although I would say that it wasn't the SELECT DISTINCT all by itself, but helped by the way data in the development database was setup (copying the same data from one department to another led to the fact that the elimination of department_code in the query was hidden by the usage of SELECT DISTINCT).

I still think I was right in asserting that join problems are not hidden by SELECT DISTINCT.

The next question to ask is whether this case that JeffGrigg presented here is enough to justify the assertion that SELECT DISTINCT is a code smell. The case exposed might have had a lot more trouble than SELECT DISTINCT, hidden by the particular development data.

So SELECT DISTINCT in some cases might lead to false positive test results. SELECT ALL in other cases might lead to duplicates thrown to the user, which is bad per se, see DuplicatesAreBad, so one can argue that also SELECT ALL is a CodeSmell. What then is one to do? Maybe SQL is a CodeSmell? Some authorities in the database field think just that while others contend that SQL is the best practical compromise, and anyway it's too late for a change.

The only thing I have clear for now is only when the two are the same: when SELECT ALL will not return any duplicates it will behave the same as SELECT DISTINCT (less the fact that SELECT DISTINCT might have a different order in case that ORDER BY is not present; anyway no order of is guaranteed in the absence of ORDER BY). For example in a query by primary key SELECT ALL is the same as SELECT DISTINCT and one might choose to use SELECT ALL on the grounds that it will reveal coding errors while testing.

However the primary key case is the simplest one where we can have a guarantee that no duplicates should be returned. Other is when the SELECT is against one table and the primary key is included in the selected columns. Again we might have a join and for each table in the join the whole primary key is being selected. In other situations the primary key may be substituted with a candidate key and the no-duplicates condition still holds even for SELECT ALL. Another situation is when a column from the primary key is missing but is bound to a value in the WHERE clause, for example:

 SELECT ALL employee_number, employee_name FROM employees WHERE dept_no='3'

will still return no duplicates.

In all of the above cases (and maybe others) SELECT ALL will perform identically with SELECT DISTINCT, and it might expose some extra coding errors during development and QA.

But the problem still remains that we have to double check (inspect the code) that we are in a case where no duplicates will be thrown to the user or to other part of the system.

So why not code inspect SELECT DISTINCT then? -- CostinCozianu


[[Refactoring note: Somebody felt the urge to "refactor" and draw the conclusions. I invite whoever felt that urge, to justify his position of authority with real arguments.]]

In practice, when you see code that uses "SELECT DISTINCT" - especially if it's used often - you should question the code and its writer, to see if the programmer has the foggiest idea what they're doing.

For correct code, the default behaviour of "SELECT" is good enough in 99% of cases. All too often, "DISTINCT" gets added, because that makes the code work for the test cases, even though the code _is_ buggy and will fail in the real world.

On the subject of CodeSmells, were the default "SELECT DISTINCT" then one should be suspicious of "SELECT ALL".

The point is that the default should work. The fact that it doesn't is a smell.

As to whether the writers of the SQL spec are brain dead morons in choosing "ALL" as the default instead of "DISTINCT", I refer you to SelectDistinctIsaCodeSmellDiscussion?.

In practice, when I see somebody drawing such hasty conclusions, I feel the urge to ask the guy if he read and understood AnIntroductionToDatabaseSystems. Also I'd like to ask him if he is aware that Dr. Codd himself as well as many others think that the SQL standard and SQL databases are quite a massacration [?] of the relational model, and among many other reasons for their view of SQL, SELECT ALL and the bag of rows model is the worst invention. So much for the argument from the authority of SQL committee. Why the hell do you believe that SELECT ALL is a good thing? -- CostinCozianu


[The proposed guideline is clearly a poor one for typical database use. SELECT DISTINCT is a sometimes-necessary evil in SQL. If you find that you rarely use it, it just means that you are rarely doing projections - any projection that excludes the keys of the FROM expression will produce duplicates in SQL. A query as simple as "find all department numbers in use by employees" is, given typical example tables, a projection over the department numbers in the employee table, generates duplicates. (SELECT DISTINCT DeptNo? FROM Employees) And I have yet to see a legitimate need for getting duplicates back from a query; when someone thinks they need duplicates, they're usually doing some post-processing on the data that could be done more efficiently for them by formulating the query correctly in the first place.

Conversely, because most SQL optimizers aren't all that clever when it comes to determining the need for duplicate culling (it's a hard problem, but fairly well understood AFAIK), it's best to leave DISTINCT out when you know it's not needed. The only way to be confident of that choice is to understand the joins well. And the best way to do that is to know the relational algebra well. Think in terms of relational algebra, then formulate your queries in SQL - you'll find that you will avoid SQL's pitfalls, reason around its shortcomings, and generally produce better queries.

The arguments for the page's title all seem to revolve around the premise that there's something a priori wrong, or at least likely wrong, with an SQL query that returns duplicates. But because of the way that SQL is designed, that's simply not so. It's a well-known aspect of SQL that is widely considered a flaw. Since duplicates are almost never wanted in real applications, SELECT DISTINCT is not a smell at all, but merely an occasional or even common necessity. The frequency depends on the types of queries characteristic of your application. Someone whose main experience of SQL is for loading "business objects" might well think that SELECT DISTINCT is a rare and noteworthy thing, but someone writing queries for reports would see it as daily bread.

-- DanMuller]


My experience is that every query I have seen with distinct has had errors. In the real world the problems is two fold: one - you have people of all descriptions writing queries and creating reports and two - people use existing queries as templates for new ones.

A real world example - I had a system that used 5 db's from three different vendors (all in sybase) over 1200 total tables and the data coming from 17 different sources inside and outside the company. Every time I saw a query that had distinct, I worked with the people using the query and invariably we found that they were missing joins or other where clauses.

How did I help them find the errors? By exposing all the fields in all the tables in the from clause with something like select t1.*, t2.*, t3.*, t4.* and working back towards their limited and thus "dupe apparent" select list. (This also provided a method for them to be able to use on their own in the future when debugging queries with undesired/unexpected dupes, they simply have the select with all fields from all tables in their from clause commented out below their real select clause and they can easily toggle back and forth between the select lists when they see dupes to figure out why.) Every time we found that there were problems with their assumptions about the uniqueness of various join criteria (or other where clause filtering criteria - oops, I forgot to specify "SHORTSELL" transactions only, thanks; oops, I thought that table only had one record per stock symbol, I guess I have to add a join to this type table, too) as well as uncovering rare and/or complex data quality and business rule issues. And I do mean complex, things that took weeks or months to figure out how/why they happened. This is the advantage of showing the uncensored field lists to people, even if they don't know enough SQL to fix the query, they can easily see - geez, I am get 2 rows from t9 for every t1 and t2 record and a variable number of t7 records for every t1 and t2 record.

If there are dupes and people understand why they are there great, but if they don't then it means they don't understand their query, joins, filtering criteria and data or there are problems with the data. Plus by _not_ using distinct, if problems arise in the future with the data or field/table/relational structure that lead to undesired dupes then it will be glaringly obvious. So it is not the case that SELECT ALL would be as suspect if it were not the default.


Comment found elsewhere on the web:

"The use of SELECT DISTINCT does not necessarily indicate poor normalization. If I want to know which of the 50 states I have customers in,"

  select distinct state from customer
"...works fine. Typically, I would use GROUP BY (obviously if want to know how many customers in each state) but SQL Server usually generates the same query plan."

Several other people in that same discussion mentioned their experience that DBs, schema, queries, etc, suffer bit rot over time, and small shops often don't have time to revisit and renormalize to really do things "right", so they start using "select distinct" as a time saver, not because they think it's absolutely correct.

Which I suppose is to say they think it's sometimes pragmatically unavoidable, and may or may not reflect a CodeSmell depending on circumstances.


Another example: Suppose I have a table with studentname, gradyear, and degree. Students can graduate more than once with different degrees, so the only candidatekey is (studentname, gradyear). Therefore, if I understand correctly, the table is 3-NF. If I want to select all degrees granted in a span of years, then isn't a select distinct the logical choice?


StructuredQueryLanguage


EditText of this page (last edited July 31, 2007) or FindPage with title or text search