Complexity Of Outputting Duplicate Tuples In Tutorial Dee

From a discussion on BagAtational...

Let's define a complexity metric that will be used to judge a new feature. Our metric will be based on assigning a count of 1 to each new issue a user needs to learn in order to know the feature. We'll total up the counts to determine the feature's overall complexity.

Call the metric C, such that C(0) means using the feature raises no issues and therefore requires no new learning. C(1) means 1 issue, C(3) means 3 issues, and so on. C(n) is lower complexity than C(n+1).

The new feature, in this case, will be providing a means to emit duplicate rows/tuples in TutorialDee and Rel (see RelProject).

In TutorialDee, the projection or "give me just these columns in the output" operator is represented by {a, b, ... z} where a, b through z are column aka attribute names. See QueryLanguageComparison for some examples. Projection eliminates duplicates.

To make projection emit duplicates, we'll add a hypothetical optional keyword called "DUPLICATES". A projection of {a, b} will remove duplicates but {DUPLICATES a, b} will emit them.

The keyword itself is given a count of 1.

Using a DUPLICATES projection within a subquery is invalid, because relations are, by definition, free of duplicates. So, {DUPLICATES a, b} in a subquery should throw an error.

That's another count of 1.

Using {DUPLICATES a, b} where a and/or b is a key is meaningless, and should at least throw a warning.

Another count of 1, that.

So, modifying projection to support duplicates has a total complexity count of 3. It is complexity C(3).

Now, let's assume we wish to emit duplicates to a file to meet a user's requirements. Rel can use FOR and WRITELN to emit tuples to some external file like this:

  FOR myRelvar {a, b} ORDER(ASC a);
     WRITELN a || " " || b;
  END FOR;
FOR iterates over the tuples in the relation returned by the expression 'myRelvar {a, b}', sorted in ascending order on attribute 'a'. FOR will not iterate over duplicate tuples because the projection {a, b} returns no duplicate tuples.

WRITELN emits the values of a and b for each tuple. I've omitted the file handling bits for the sake of simplicity. It would be the same regardless how we emit duplicates, anyway.

Now let's compare how we output duplicates.

Assume we're using the following RelVar:

  VAR myRelvar REAL RELATION {k CHAR, a CHAR, b CHAR} KEY {k};
Assume we wish to emit all values of a and b including duplicates, but not the key k.

One way is this:

  FOR myRelvar {DUPLICATES a, b} ORDER(ASC a);
    WRITELN a || " " || b;
  END FOR;
We're using our modified projection operator, which has a complexity of C(3).

Another way is this:

  FOR myRelvar {k, a, b} ORDER(ASC a);
    WRITELN a || " " || b;
  END FOR;
No special mechanisms are required. We merely do not include k in the WRITELN statement, which is precisely the same WRITELN statement we'd use in the first example!

Total complexity of the second method is, therefore, C(0).

C(0) < C(3). The second approach is less complex than the first.

Therefore, modifying projection to emit duplicates is more complex than not modifying projection to emit duplicates.


There are several flaws in your counting, but I'll focus on the biggest first. Your FOR-ENDFOR should be dinged for many points because it's not declarative; some shops may disable imperative statements because of security risks, optimization risks, and so forth. It's extra complexity. Other query languages can do it within the *query* language (declarative) and don't require imperative statements. Your argument could be reduced to, "Since this ugly thing already exists in the language, we might as well use it to do more ugly things."

And WRITELINE doesn't output to standard tables, but a string. If you output to standard tables, such as ODBC results, then usually there's a way to export to CSV, XML, spreadsheets, etc. Custom strings don't do that unless you manually add delimiters etc. (including dealing with delimiter and quote escaping). It's outside of both "table land" and declarative land, and is therefore butt-ugly.

--top

I think you're not seeing the forest for the trees. Use of FOR is not the issue here. The FOR construct was merely selected as an example, one that implies arbitrary manipulation of "duplicate" tuples, including emitting duplicates. As I've written elsewhere, TutorialDee, and hence Rel, is intended to be a general-purpose imperative programming language with built-in database capabilities. It is more than just a query language. Imperative capabilities are not deprecated, but promoted along with functional aspects like explicit RelationalAlgebra operators. In that respect, it's like ExBase. I could equally have chosen to illustrate this with Rel's EXPORT statement, but as it's still a work in progress (whereas FOR is complete) I chose to demonstrate something that interested readers can execute for themselves.

Shops that disable imperative statements? I presume that means no PL/SQL, no Transact-SQL, and so on? I've not seen it, but I can't say I'm surprised. Such shops will not be happy with TutorialDee or Rel. Fair enough.

Outputting to standard tables, by the way, is done by linking to the external DBMS tables or JDBC datasources as RelVars and simply using them in the same manner as native RelVars. INSERT works on them.

BTW, I believe my counting is accurate. The total of three ones is... Three.

Not needing a stupid loop is worth 20 points. You are pushing the issue onto the app language side, not getting rid of it. That's NOT where I'd prefer it be. It's like saying, "My leaky septic tank killing the grass is not a plumbing issue, it's a gardening issue: get a grass species that is shit-tolerant."

Nice rant. Read my response above, again. Pay particular attention to the part where I "could equally have chosen to illustrate this with Rel's EXPORT statement". No loops. The loop was deliberately chosen to illustrate arbitrary processing of duplicates. In other words, you can do anything you like with duplicates in a pure relational environment.

For one, it creates a violation of OnceAndOnlyOnce because one has to duplicate the SELECT list in the formatting side (minus the keys). That's more typing, more code, and more places to make mistakes. With "traditional" query languages, one does not have to write loops and formatting statements to remove the key(s). If that doesn't personally bother you, fine. I just don't want it as part of my tools. I believe most real-world developers will agree with me. The 10% of the population that is anal or purity-obsessed can use yours.

EXPORT, for example, doesn't "duplicate the SELECT list". WRITELN is an arbitrary statement. It could be any code at all including something to export data. That was the point. It was meant to be illustrative rather than definitive. But, as usual, you're merely being pointlessly argumentative.

By the way, I've already got real-world developers using Rel. How many real-world developers are using SmeQL? Why don't you implement it and then we'll see which is more popular. I'd welcome the competition and the opportunity to kick your sorry butt, market share-wise of course.

Let's compare the total code change and effort required to remove the key and create a CSV file using a "bag" query language:

Before:

 SELECT a, b, k FROM myTable
Select "run" icon in Toad, right click on result window, and choose Save As, select CSV

After:

 SELECT a, b FROM myTable
Select "run" icon in Toad, right click on result window, and choose Save As, select CSV (same) [1].

--top

You're kidding, right? Ok, what language statement implements the export to CSV? Are you expecting me to incorporate Toad into my automated toolchain and run it on my unattended headless UNIX box?

Oh, and here's Rel's work-in-progress EXPORT statement:

No duplicates:

 EXPORT myRelvar {k, a, b} ORDER(ASC a) FORMAT CSV TO "myfile.csv"
Duplicates:
 EXPORT myRelvar {k, a, b} ORDER(ASC a) EXCLUDE k FORMAT CSV TO "myfile.csv"
No Toad required.

So you add a new command and EXCLUDE keyword to "solve" it. That can't look well on the complexity scale. (It looks more and more like ExBase's ugly sister, too.)

And what command, with keywords or not, do you use to create a CSV file? Without Toad.

This is ridiculous, of course. You can imagine anything you like and paste it into one of your SmeQL pages. I present a working language that people are downloading and using. What have you created except hot air? You are a festering blight on the IT landscape, and a vast sinkhole into which productivity pours. The sad thing is that it happens voluntarily. You have a unique talent for making it appear, briefly, like your ignorance might be addressable with education. I don't blame you, of course, because you're deluded enough to think you're actually doing something useful. For me, however, I could have written some code -- perhaps fulfilling one of my appreciative users' requests -- in the time it took to write this. I should know better. You, sir, are a troll, but it is I who suck for allowing myself to once again be trolled.

Red Herring. A poorly-designed language is still poorly designed even if it runs. I suggest you get the detectable kinks out BEFORE coding. You didn't do your homework. Sure, I could add linguistic bells and whistles to SmeQl for the hell of it, but if I can keep most of it simple, I will. Otherwise, you are just reinventing the mistakes of SQL, but with a purist twist. -t

I presume SmeQL will export to CSV files using Toad too, yes? Troll.

I would note that an EXPORT operation can be added without adding new syntax, just a library entry. This is because it doesn't rely on hard-wired keywords. I think it's a good thing.

But that's all in theory, because you haven't implemented anything. Isn't it you who always advocates "race the damn car" or some such? When's the SmeQL car going to be more than a teenage sketch on the back of your schoolbooks?

That's legitimate criticism, but it doesn't change any flaws in T.D./Rel. We BOTH have areas to work on, okay?

What you consider flaws might actually be features.

But, admittedly, vice versa (in all possible permutations)...


(On a side note, in this day and age, hard-wired key-word-heavy languages are kind of a smell. Why not use some kind of UniversalStatement (including XML or EssExpressions) so that the syntax tree itself doesn't start to look like SQL's and COBOL's. I've considered a UniversalStatement for SmeQl, but so far it doesn't need enough optional keywords to justify going away from positional parameters. But since positional is a subset of U.S., it can be added later without breaking existing code.)

Because LISP, the UniversalLanguage, has been done.

I personally find the UniversalStatement more friendly than nested lists. And it's closer to Algol-style, meaning more familiar to users.

You've used Algol, have you? Troll.

That's what they generally call languages that look like C, Pascal, Ada, etc. They can trace their style back to Algol more or less.

No, they generally call them "imperative" or "C-style", etc., as appropriate. The term you're looking for is "Algol-derived".

Ada is not C-style, and Assembler is also "imperative". "Algol-influenced" may be the best match so far.

Note that I wrote "as appropriate", you argumentative git. "Algol-derived" is a recognised term in the literature. It means what you're trying to say.

I don't understand why you asked, "You've used Algol[?]". You could have just said, "Perhaps you meant 'Algol-derived', which is usually how it's worded in the literature."

Ah, you are weak in the sarcastic spaces.

So that's what it was.

I would have thought the "Troll" at the end would have conveyed what pure sarcasm -- which is, of course, transparent and sometimes hard to see -- did not.

Well, I mistook the "troll" statements as blatant insults. Silly me.

Oh. I thought they were fine examples of accuracy in product labelling.

Is that sarcasm, or an insult?

It's a statement of truth.

When the "projection" filter is applied.

You've been called a troll how many times? Me, never.

That's because I critique existing stuff people care dearly about. Instead, you make obtuse God-tool crap nobody gives a fudge about. "Troll" has no objective meaning anyhow. Emotional anger and vagueness seem related.

I think you're called a troll because you're a troll. You strike up arguments for no other reason than you like to strike up arguments. There is no intellectual, noble, altruistic, or even justifiable reason for your so-called "critique". You incite anger because it amuses you, and that makes you a troll.

Your assessment of my neural activity patterns is wrong. You are not a mind reader and don't have a window seat to their firing patterns, so stop pretending you do. And there *is* intellectual justification: to make a better query language by dissecting the options. You are likely just sour-graping me for pointing out the silliness and ugliness of TD/Rel caused by obsessive-compulsive tendencies. That's my assessment of your neuron patterns.

Your points might have validity if anyone other than you had raised the same criticisms about TD/Rel. That would indicate there are general problems that are generally recognised. But no, the criticisms you've raised are unique to you. Still if you were a recognised expert in the field, noted for your valuable insights into database language design, your unique views might have validity. But no, you're recognised only as an argumentative, uneducated troll.

If it walks like troll and quacks like a troll...

[You give TopMind too much credit. Trolls are often quite clever... and, importantly, they generally know they are trolling. TopMind isn't a troll. He's a crank... and not one especially competent in offering a facsimile of reasoning. He resorts to sophistry, fallacy, and hand-waving because he honestly believes them to be proper counters to your arguments. He doesn't understand actual reasoning (he lacks the competency to recognize it or follow it, and he no doubt projects his own incompetence with reasoning on the value of the whole discipline). He rarely evaluates or introspects his own beliefs, nor does he seek to resolve apparent inconsistencies between them (well, apparent to us, anyway). He isn't stupid, but it takes far more intelligence than TopMind possesses to overcome the sheer level of arrogance it takes to disregard self-education and the skill of correct reasoning.]

Massive projection, number 2. YOU are the one (or two) who cannot produce clear-cut open evidence. For example, you insist Nulls are objectively and scientifically evil (rather than say a personal preference dislike). YET, you cannot produce science-quality evidence that they are objectively bad, just professor-on-crack-sounding mumbling. YOU are the problem. At least I don't PRETEND my personal preferences are universal mathematical truths. -t

[You have been provided mathematical proof that NULLs lead to various anomalies.]

[You have been provided evidence of these anomalies happening 'in the wild', as one might say. That doesn't make them evil, but it does make them harmful. The problem is on your end, TopMind, because whenever you are provided these proofs, you successfully demonstrate to us your obscene levels of incompetence, and your unwillingness to actually attack the argument - instead you resort to fallacy and sophistry and hand-waving attempts to disregard the proofs. The problem is on your end, TopMind, because you honestly believe that you are the judge for whether you've won or lost an argument.]

"Disregard proofs", pfffft. You really think your personal judgment is iron-clad math-proof kind of stuff. Self-delusion. Next you'll "prove" that purple is the prettiest color. Dolt. Pseudorigor.

[Where, precisely, have I taken personal judgement and called it a proof?]

You are full of it. Usually one finds there's trade-offs involved, not a free lunch, and you handwave away the downsides without any rigor. Imperative loops and WRITE statements? Puke. How 1960's. Oracle and IBM were smart to ignore reality-detached fools like you. Are they also "trolls"?

Dude. You have so gotta work on the comprehension skills. Note, above, where I wrote "the FOR construct was merely selected as an example" and "it was meant to be illustrative rather than definitive" and "TutorialDee, and hence Rel, is intended to be a general-purpose imperative programming language with built-in database capabilities." As such, continuing to harp on about "loops and WRITE statements" merely demonstrates your inability to read, recall, absorb, and understand. It makes you look like an idiot.

[The fact TopMind's irrelevant utterances of blind incomprehension are additionally inconsistent with his arguments in other topics - i.e. that ProceduralProgramming goes well with TableOrientedProgramming, and that ExBase is all that - makes him look even more like an idiot.]

It appears that for Top, maintaining an argument ad infinitum is infinitely more important than maintaining a coherent and consistent argument. I actually thought the FOR loop would appeal to him, given his past arguments. Silly me.

Projection again. You are the one (two?) who started repeatedly calling me a "troll". If you wanted to stick with technical discussion, well, most normal people wouldn't do such a thing. In fact, I'd call that "trolling".

[Very few of your comments in this entire page have any relevant technical content whatsoever. You have been extremely disrespectful. I do not believe you are a troll, but you're certainly an arrogant asshole.]

Very few of your comments in this entire page have any relevant technical content whatsoever. Your complexity "count" conveniently disregards factors you don't want to hear about. You have been extremely disrespectful. I do not believe you are a troll, but you're certainly an arrogant asshole. -t

[Oh? Which factors did I disregard that I don't want to hear about. Be precise.]


Summary Comments

I believe this debate comes down to whether mixed bags+sets or just sets are the best structural fit for business applications. I lean toward mixed because I believe the real world is "too messy" to fit closely to a set-only tool-set, as described above. One is modeling business and political rules created by irrational or poorly-informed managers and users, not PhD's, and bags can better fit this environment. Sets-only are just too idealistic. -t

As has been explained repeatedly, in a true relational system (typically a DBMS) it is trivial to (a) eliminate duplicates on import/link and (b) trivial to produce duplicates on output for those exceedingly rare cases where users require duplicates.

In other words, bags are only (slightly) relevant at the boundaries of the relational system. Within the relational system, bags are not needed at all. Supporting them only adds complexity and potentially diminishes optimisation. Therefore, there is no reason to support bags outside of import/link (i.e., input) and output, as described above.

I have, by the way, never seen a business case for maintaining duplicate rows in a table in a database. I have seen it done, obviously, but never with sound justification. In most cases, it's been done purely out of carelessness or ignorance.

The scenarios in BagNeedScenarios are not that rare in my experience, but we can LetTheReaderDecide those frequency issues rather than continue bickering over that point. As far as optimization, it would only be an issue if there was a bag in a resource-critical portion of a given query. Again, I'm for having choice. -t

Do you expect users to know what the "resource-critical portion of a given query" happens to be?

As I mentioned before somewhere, a poorly-skilled query writer will often make many other mistakes. Duplicate-management being just one. Plus, if the option of bags is removed for them, they may make other messes to compensate, ie, force a bag. WorseIsBetter often just works better in the real world. Idealism often proves uneconomical. -t

Why would anyone "force a bag"? Anyway, at worst that would mean a table with an auto-generated primary key. Your comment that "idealism often proves uneconomical" is based on what?

Auto-generated keys can be unwanted overhead, and may not solve the "bag problem" anyhow, simply putting lipstick on a pig. They may remove "technical duplicates" but not necessarily "logical" or "domain" duplicates. As far as the economics of idealism, I'll refer you to WorseIsBetter.

What is a "logical" or "domain" duplicate? I've never seen that term in the literature.

[The longer you discuss a topic with Top, the more terms he makes up.]

[This is a really bizarre discussion. In a decade of working on accounting systems, I have also never seen a legitimate use of duplicate rows within a system of queries, and none of Top's "reasons" for offering it as an "option" make the least bit of sense to me. When duplicates occur in systems of SQL queries, it's almost always a mistake, or, when deliberate, leads to later mistakes when the output of such a query is used in another context. The cases where an artificial key has been needed in an intermediate query result have been very, very few. It makes tremendous sense to be able to generate non-relational outputs with duplicate rows, and almost none to incorporate them within a relational system. -- DanMuller ]

Accounting is probably a domain where one is least likely to encounter such a need because it's unforgiving and requires careful schema design and query writing. I gave some non-accounting scenarios in BagNeedScenarios. If by chance you disagree that those are likely, just say so and we'll move on. No reason to make it a personal conspiracy to mislead. LetTheReaderDecide if a given scenario is likely in their own domain/experience. Removing the unique customer identifier for privacy reasons was something I did fairly often in one shop, by the way (BagNeedScenarios scenario 3).

Also note that a "forced" temporary artificial key can create the risk that somebody will use it as an (assumed) permanent identifier. -t

[Labeling my field of experience as unrepresentative is a completely evasive cop-out, and completely untrue. Applications where errors are tolerable in the manipulation of data that's worth storing in a database are certainly in the minority.]

[I skimmed most of the BagNeedScenarios page, and others have already critiqued the applicability of the scenarios to your argument. They either describe scenarios at the boundaries of a relation system (and thus don't indicate anything close to a need for discarding relational algebra), or they are the result of inadequate analysis of the domain, or they accommodate mistakes made in legacy or external systems that can be relatively easily fixed or addressed at the system boundaries. No show-stoppers in there at all.]

[If removing a customer identifier results in duplicates, replace it with a count and eliminate the duplicates. Even in a report (which is outside the boundaries of the relational system), a count is more informative to a human reader than having to visually identify duplicates, and is obviously more useful to other software. I go so far as to argue against duplicates even in user interfaces and reports in almost every case where someone suggests them, because the plain fact is that something distinguishes duplicates, even for human readers - if nothing else, there's implicit information in a position in a list.]

That's already addressed in the write-up. If the contract specifies columns A, B, and C, you don't give them D out of some purity religion. I've been chewed out multiple times for giving extra info that I felt would be useful to the customers.

[Stop ignoring half of what everyone says. This statement ignores the fact that you can give someone output with duplicates at the boundaries of the system. This adds nothing but bulk.]

Again, you guys seem detached from typical workings and dealings of the business world, thinking you will be allowed to play Techy-God and/or be given infinite resources to fix all the clunky systems in a typical business network. I used to be kind of like that out of school, wanting everybody to make cleaner data with more validation and more research into hiccups and oddities from other department systems, etc. After a while I learned to hard way to shut up and work within what's given me.

[More typical Top crap. If circumstances require you to deal with a bad situation, you don't turn around and argue with the rest of the world that the bad situation is right. And you definitely don't propagate the problems into new systems if it's easily within your power to avoid that.]

[So many people with more experience than you have argued with you on this wiki over the years that nobody wants to hear your "detached from reality" spiel anymore; it's simply ludicrous and has nothing to do with anything but stroking your own ego. The fact that you have so much experience and have taken mostly wrong lessons from it does you no credit.]

AllRoadsLeadToBeeMinus is generally true. You may make your little spot in the world A+, if you can pull it off, but rarely does one have the power to force that on everybody else. I used to be a contractor, seeing a lot of different systems in different kinds of companies, including huge ones that everybody's heard of, and was given lots of clunky data to work with as-is. If this differs from your observation, well, so be it. I'm reporting it as I observed it. -t

[And that makes not one whit of difference as to how you approach working with such data within a system that you are writing. More distractions.]

Scenarios 2, 3, and 4 do affect one's own tasks.


Re: "Toad" query tool:

"... Dark, damp, and squatty."

Stop it. You're making me aroused.

Please don't post any images of your partners.

Coward.

In this case, yes. Bok bok bok.


[1] I'd like to clear up the Toad example. Often query-related tools will take the result sets and produce CSV etc. as needed. The SQL just produces the "result set" (a table) and the tool does the rest. It's not meant to promote or assume Toad itself, but rather uses Toad as a stand-in for a typical "query tool" or TableBrowser-like tool. Ideally the tool wouldn't care if the result set was generated by a SELECT statement or the proposed EXPORT statement. -t

No one objected to TOAD per se. It's your use of a manual tool in the midst of an otherwise-automatable processing chain that seems particularly nonsensical.

I'm not sure what you mean. Toad is only an example. Note that most RDBMS have "export" related commands, but they run at the server and are thus usually of little help to the query user on the client side. Sometimes a shared file area is setup on the server, but this is a shop-specific decision.


See BagAtational


CategoryQueryLanguage


AprilTen and OctoberTen


EditText of this page (last edited October 28, 2010) or FindPage with title or text search