It's common knowledge now that DatabaseIsRepresenterOfFacts. Even if you don't consider this approach when designing your database schema (be it relational, OO, or whatever), it is always good to make a consistency check at the end. Look at your database through the eyes of DatabaseIsRepresenterOfFacts, and this will show whether you have a good schema or a bad one. A good schema is one that enforces that the facts stored in the database are always consistent, have a clear meaning to the end user, and are identifiable by the end user.
From this perspective you will always know that duplicates are bad.
A duplicate row says that you have stated the same proposition twice, the same basic fact twice. You have gained nothing; the truth value hasn't changed. You have only gained the chance to introduce inconsistencies between the two statements (for example, update one but forget to update the other).
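For illustration, a minimal sketch of that update anomaly. The customer_address table and its columns are made up for this example; any keyless table will do.

 -- A hypothetical table with no key declared, so the same fact can be stored twice:
 CREATE TABLE customer_address (
     customer_name VARCHAR(50),
     city          VARCHAR(50)
 );
 INSERT INTO customer_address VALUES ('Smith', 'Boston');
 INSERT INTO customer_address VALUES ('Smith', 'Boston');  -- the same proposition, asserted twice
 -- Application code that updates rows one at a time (say, a positioned update
 -- through a cursor) can change one copy and miss the other, leaving the table
 -- asserting both ('Smith', 'Denver') and ('Smith', 'Boston').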
Sometimes a duplicate row means that you intended two different propositions but stated them incompletely, which really makes them only one proposition; therefore you have a logical error.
This usually happens when you fail to enforce primary keys. Let's say the row (Ordered_Product='X', Quantity=4) appears together with a duplicate of it. This might look OK in an OLAP scenario where you want to study how frequently a product is ordered and the distribution of ordered quantities. Even for an OLAP scenario it is bad, but for transaction processing it's almost a disaster. What happened here is that you wanted to say that there were two distinct orders in which product 'X' was ordered in a quantity of 4. You should have stated two distinct propositions by including the ORDER_ID; by eliminating the ORDER_ID you made them logically one proposition, and you are in contradiction with the DatabaseIsRepresenterOfFacts principle.
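A sketch of the two schemas, with hypothetical table and column names:

 -- Without a key, two orders for the same product and quantity collapse
 -- into an indistinguishable duplicate:
 CREATE TABLE order_lines_bad (
     ordered_product VARCHAR(20),
     quantity        INT
 );
 -- Including the identifying attribute and enforcing it as part of the
 -- primary key keeps the two propositions distinct:
 CREATE TABLE order_lines (
     order_id        INT,
     ordered_product VARCHAR(20),
     quantity        INT,
     PRIMARY KEY (order_id, ordered_product)
 );
 INSERT INTO order_lines VALUES (1001, 'X', 4);
 INSERT INTO order_lines VALUES (1002, 'X', 4);  -- a different order, hence a different fact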
There are several aspects to DuplicatesAreBad, and for some people duplicates are not that bad:
If I select the first names of employees in my department, the result set may contain duplicates. This would be a correct result if JeffDillon, JeffGrigg and JeffHunt all work in my department. It may not be a terribly useful result for processing, because it does not include the primary keys, but it is a correct result.
Jeff, you can use GROUP BY and COUNT(*) so that the result will include the tuple (Jeff, 3), which means that the first name Jeff belongs to 3 different persons in the department. A "bag" containing {Jeff, Jeff, Jeff} can mean a dozen very different things.
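For example, assuming a hypothetical employees table with first_name and department columns:

 SELECT first_name, COUNT(*) AS persons
   FROM employees
  WHERE department = 'Engineering'  -- "my" department, for the sake of the example
  GROUP BY first_name;
 -- yields ('Jeff', 3): the name Jeff belongs to 3 distinct employees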
Consider, for example, that you might be the only one with the first name Jeff, and you have a working relationship with 3 clients from partner company X. Now a program queries for the first names of employees that deal with people from company X, but the sloppy programmer forgets to use SELECT DISTINCT. Then you get as the final result the same "bag" as before, {Jeff,Jeff,Jeff}. What does it mean now? -Costin
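A sketch of that sloppy query, assuming hypothetical employees and client_contacts tables joined on employee_id:

 SELECT e.first_name
   FROM employees e
   JOIN client_contacts c ON c.employee_id = e.employee_id
  WHERE c.company = 'X';
 -- one employee named Jeff with three contacts at X yields {Jeff, Jeff, Jeff}:
 -- the same bag as before, with a completely different meaning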
Now when SELECT DISTINCT removes duplicates, it means projection: it answers the question of what the distinct values of particular attributes in that relation are. If you want quantitative information related to each value, you use SELECT <attribute_list>, COUNT(*) FROM relation GROUP BY <attribute_list>. Even better, if you want to remove the influence of other joins on the quantification of an attribute value, you can use COUNT(DISTINCT primary_key) instead of COUNT(*), so it will disregard duplicates generated by the join and count the occurrences of a value in the original table. --CostinCozianu
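Using the same hypothetical tables as above:

 SELECT e.first_name, COUNT(DISTINCT e.employee_id) AS employees
   FROM employees e
   JOIN client_contacts c ON c.employee_id = e.employee_id
  WHERE c.company = 'X'
  GROUP BY e.first_name;
 -- yields ('Jeff', 1) even though the join produced three rows;
 -- COUNT(*) would have reported ('Jeff', 3), counting join rows, not employees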
Bags, unlike relations as sets, cannot be the basis for DatabaseIsRepresenterOfFacts, and there is absolutely no notion of correctness that can be tested against bags. The problem is not to justify on a case-by-case basis that duplicates are OK, but to have one consistent model that applies everywhere with the same semantics, and consequently allows you to enjoy all the goodies of a declarative programming style. Using duplicates you cannot build a consistent model that applies everywhere, because duplicates mean different things in different contexts, as in the example above. And that is only one small part of the problem.
Yep, advice so good it bears repeating. DuplicatesAreBad DuplicatesAreBad DuplicatesAreBad.
See OnceAndOnlyOnce.