Sql Flaws

The StructuredQueryLanguage (SQL) is a common relational query language, but it has various alleged shortcomings. This topic focuses on shortcomings or alternatives that are related to the language itself rather than possible flaws in RelationalDatabase theory or RelationalLanguages in general.

See RelationalLanguage for a list of proposed and actual alternatives.

Links

http://www.mcjones.org/System_R/bs12.html (Business System 12 - BS12)

http://geocities.com/tablizer/sqlcrit.htm (SQL gripes)

http://web.onetel.com/~hughdarwen/TheThirdManifesto/HAVING-A-Blunderful-Time.html (Historical perspective on SQL language design - by HughDarwen)

Some of the complaints against SQL include:

Nesting makes it harder to factor out duplication as opposed to (optional) named units
- The relatively new WITH clause can help in this area
Group By inconsistent and/or confusing
Implementations are inconsistent between vendors
Standard lacks tree-based operations. What tree-based operations would be compatible with a relational model? Transitive closure? The point is that everyone comes up with various ways to hack up trees in SQL (e.g discussed in "SQL for Smarties", Joe Celko), but the hacks tend to be awkward and slow, so some people think that some model of trees, any model of trees [or graphs, or multigraphs], should be native to SQL to avoid the need for hacks. See http://docs.alphora.com/D4LGExplode.html for an implementation of the EXPLODE operator proposed by Chris Date for hierarchical queries. The SQL2003 standard introduced XML, which provides XPATH for walking trees. Microsoft SQL Server 2005 introduced Common Table Expressions, which provide recursion to T-SQL code, allowing tree and graph traversal. {Move to RelationalAndTrees}
Inconsistent syntax (see below)
Large, complex syntax instead of using combinations of simpler syntax units. Think of comparing the COBOL parse tree (or generators) to C's or Lisp's parse tree. Moved discussion below.
NullsInSql are a flawed concept. Especially for strings.
Defaults to generating duplicates, which can cause confusion and erroneous results.
- I'm not sure I'd call this a flaw. Duplicate-removal can be computationally expensive. It could be called a "practical trade-off".
The unqualified UNION and SELECT are inconsistent in how they handle duplicates. UNION removes duplicates while SELECT does not. This can lead to confusion and errors.
- A fix could be to have "UNION" mean UNION ALL, with something like UNION UNIQUE for the other alternative. Or, to require an explicit two-word choice among "UNION ALL" and "UNION UNIQUE". Or, rid a "unique" version, and require the usage of GROUP BY instead, which already exists.
Encourages long, hard-to-digest run-on "sentences"
Not a DBA-extendable language (see BusinessSystemTwelve)
Joins could be simplified if using "standard" links (WhereAndAnd) {Natural joins? Some RDBMS support them and some don't.} - The existing implementations are flawed. They should depend on a relationship dictionary and not on naming conventions.
The formal specification is too damned huge
Many SQL-based systems don't allow user-defined views (DatabaseFlaws?)
Column aliases are sometimes not recognized across different clauses. For example, GROUP BY may not recognize an alias created in the SELECT clause.
Updating one table based on values from another table is messy, especially if multiple columns need updating. (Perhaps that is bad normalization, but in practice sometimes it is necessary.)
SQL's syntax is prone to security problems: SqlStringsAndSecurity
Most of the time the Group By clause duplicates information in the Select clause, but no implementations do so. (Sybase in particular does something nasty if you leave data out of the group by clause - they add outer joins!)
SQL is not turing complete. {I'm not sure I agree that a relational query language should be}
Column name collisions in joins between tables when it is not true that EachColumnNameBeUniqueAmongTheTablespace.
Inconsistent in how Boolean's are dealt with. In my opinion, functions should be used instead of IF statements. And, they should be based on integers where zero is False and non-zero is True. It's easier to do aggregate Boolean operations this way in my experience than with dedicated Boolean types. For example, counting the existence of multiple conditions with a single statement. Using a WHERE clause for this can only filter one condition set per statement (except in simple circumstances).
It is not obvious what needs to be done to achieve AccessPathIndependence (use Views to hide the access path as described in ConceptualQueryExampleOfAdvantages)so most people end up writing queries that fail if multiplicity changes, making the queries as fragile as if the database were based on the NetworkModel?
It does not offer a way to derive tables from views using NormalizationBySynthesis, forcing the developers do proper normalization (when it could automatized) and creating the impression that designing a database requires a BigDesignUpFront effort (when in fact it could adapt to the needs of new modules as they appear).
It's too easy to accidentally form a CartesianJoin by omitting a key match-up or two. I've seen bad things happen due to such. Perhaps require the explicit keyword "Cartesian" if such a thing is actually intended. If you inadvertently omit a table join, you'd get a message similar to "A Cartesian join is being requested, yet the key-word CARTESIAN was not included."
A simpler way to indicate what default value is used for "missing" values from outer joins. Null is not always the ideal.
Not defined and/or implementation consistently across vendors:
- Date (and date/time) syntax and rules
- Conditional functions/keywords
- Capitalization handling for strings
- Boolean types
- Nulls versus zero-length strings
- Return info. I realize that not all engines may provide such services, but if they do, it would be nice to have a standard way to package such return info, perhaps as a table.
  - Number of records impacted
  - Time to process query
  - ID (key) of the new record if auto-numbered key
  - Information that the current database returns. This is a way to indicate what the engine returns and doesn't return.
- (EditHint: Some of above overlaps with material above it. Probably needs a re-factoring.)
[more coming...]

(comments about a MartinFowler article moved to DomainLogicAndSqlArticle)

Contrast of "reference-based" and "nesting-based":

t1 = foo(x,y)
t2 = bar(t1, a)

t1 = foo x y
t2 = bar t1 a

Instead of (or an alternative to):

bar(foo(x,y), a)

bar (foo x y) a

I wonder if SQL's "Group By" could be disposed of if referencing were permitted. Group-By smells to me.

Think of "GROUP BY" as "SUMMARIZE PER" - it's used to build summaries. There's really nothing wrong with it, and it has no obvious analog to anything object-related. In fact, it's a (relatively) easy way of expressing something that requires explicit iteration in procedural/OO languages. SQL's specification of "GROUP BY" and related aggregate operators has some flaws, especially with respect to handling of empty sets and NULLs, but the underlying concept is sound.

-- DanMuller

It has been a source of confusion and inconsistency among vendors in my experience. I will try to present an example the next time I encounter a funny one again. I am not suggesting getting rid of a summarizing operation, but simply simplifying. For example, we would no longer need a HAVING clause, because it would just be a WHERE clause in the second statement. HAVING is just a "lost" WHERE clause. Perhaps I should have said that HAVING smells instead of GROUP BY.

X = group(....)// produces "summary" view 
Y = filter( X, ....)  // similar to "WHERE"

Remember that this is (or should be seen as) a functional or logical statement, not sequential commands.

Another example is a "double summary". For example, when we calculate GPA for multiple students, we may want an equal weighting rather than just average all classes. Otherwise, those students with more classes will get more weight. Thus, we want something like:

T1 = group(table=Grades, groupOn=studentID, avg(grade) as GPA)  // average by student
T2 = group(table=T1, groupOn=*, avg(GPA))  // average again

I believe the SQL solution would either be ugly (unintuitive, such as correlated subqueries), require a view, or inconsistent across vendors. I agree that having user-defined or virtual views could produce an SQL solution that is closer to this.

  SELECT AVG(grade) AS GPA INTO $T1 FROM Grades GROUP BY studentID
  SELECT AVG(GPA) FROM $T1 GROUP BY 1

(Did I do that right?)

Yes, that looks right. I wouldn't argue that SQL's syntax is awkward. Your syntax looks reasonable. SQL suffers from being a hybrid of relational calculus and relational algebra. Some time after SQL was initially standardized, I believe, it was shown that the two notations are equally powerful, although certain queries are more easily written in one or the other. Personally, I prefer algebraic notation, like your examples. -- DanMuller

No it doesn't:

 SELECT AVG(GPA) FROM (
SELECT AVG(grade) AS GPA FROM Grades GROUP BY studentID
 )

Well, yes, you basically just substituted the temp table reference with an entire select clause. But, for more complex SQL this does not always work well. For one, somethings we want to refere to the same clause on different nodes of the parse tree (see duplication finder example below). You can't do that very well with nesting, and the fix is often not obvious or takes a lot of code rework. References make it easier to focus on one part at a time, and allow more OnceAndOnlyOnce when nesting does not work. Nesting also tends to force the sandwich bread of a given statement too far apart for larger statements (ThickBreadSmell). Sometimes I will edit a copy of the statement and remove the middle so that I can print out a copy with the bread pairs brought back together.

This substitution so fundamental to relational theory that it has a name: "Relational closure". It may be appropriate to read introduction to databases before ranting why inner view usage is so inconvenient.

It does not take a thick book to know it is annoying.

How about using a view to accomplish this? I'd say a view qualifies as a reusable 'chunk'.

  CREATE VIEW gpa (student_id, gpa)
  AS
  SELECT student_id, avg(grade) FROM grades GROUP BY student_id;

  -- Now you can do
  SELECT avg(gpa) FROM gpa;

Application developers usually are not given permission to create views. Plus, existing RDBMS views tend to be global. We need ad-hoc temporary local views so that the DBA is not bothered and other parts of an app are not impacted. Something as simple as this may work:

 SELECT AVG(grade) AS GPA INTO $mytemp FROM Grades GROUP BY studentID
 SELECT AVG(GPA) FROM $mytemp

The dollar sign tells the parser that the table is only a virtual or "reference" table. The INTO clause itself already exists in many dialects. It just creates actual tables and so is not very practical.

There is a simple solution for the fact that application developers don't have permission to create views, write code that writes SQL, like this:

  my $average_grade_sql = qq{SELECT AVG(grade) AS GPA FROM Grades GROUP BY studentID};
  my $average_gpa_sql = qq{SELECT AVG(GPA) FROM ($average_grade_sql)};

If you want your SQL formatted nicely, something like this works:

  my $average_grade_sql = qq{
    SELECT AVG(grade) as GPA
    FROM Grades
    GROUP BY studentID
  };

  my $average_gpa_sql = qq{
    SELECT AVG(GPA)
    FROM (} . indent($average_grade_sql) . qq{)
  };

  # Helper function
  sub indent {
    my ($text, $indent) = @_;
    $indent ||= 6;
    $text =~ s/^/" "x$indent/egm;
    return $text;
  }

Using this technique I have been able to make understandable queries that nest several levels deep. The final output is impossible to follow, but in the code I have a series of simple queries, each of which does something straightforward.

PostgreSql's "with" clause does a form of named references:

http://www.postgresql.org/docs/8.4/interactive/queries-with.html

This doesn't look any different to me to nesting - it's just casting SQL into a more 'procedural' structure (do X, put result in Y, process Y). In fact it looks pretty much like what some SQL interpreters actually do for some queries. However, it can depend on the specific operation in question, which means knowing the maximum amount of information. One main advantage of SQL is that it abstracts the implementation of set-based operations from developers - i.e. implementation of Hash based joins when relevant, rather than a navigational 'loop' style.

The reference-based approach is semantically equivalent. It's more a syntactic and/or code-management issue than it is an execution issue. We don't have to dictate that one chunk is actually executed before another or even how to execute it. If the optimizer wants to perform the second part first, it can. One approach is to think of the interpreter as physically inserting the reference portion back inside, similar to C macros. Further, there are cases where we want to reference the same chunk more than once. You usually cannot do this with the nested approach without duplicating the leafs, hence a violation of OnceAndOnlyOnce.

Equally the SQL interpreter can rewrite a query for efficiency. Splitting the query into controlled steps removes the ability to do this. I guess if the syntax above had something to indicate that 'the following is all one transaction'.

Some RDBMS do actually allow a syntax more like this - which might be a clearer way of representing nested queries as there is a top-to-bottom flow (SELECT AVG(grade) AS GPA IFROM Grades GROUP BY studentID) AS mytemp IN SELECT AVG(GPA) FROM mytemp

The analytic functions in Oracle also remove a lot of these issues - i.e. you can do the above query without using GROUP BY, or the percentage of the students average mark against the group average, running totals etc, without using GROUP BY. I presume other vendors offer similar extensions. Unfortunately this is both non-standard and the syntax more complex than even group by, that more often I see people recreating the same functionality using procedural extensions.

While on the topic of SQL annoyances, why is the syntax for INSERT significantly different from that for UPDATE?

One thing about writing xfer routines is that SQL lacks a hybrid statement that says something like, "if this record is already there, then updated it, else insert a new one". Instead you have to check for each record. It needs an "UPSERT" operation.

MySql has a REPLACE statement that effectively does this (actually it deletes the keyed row if it already exists, and always inserts a new row), but it's not standard. I'd imagine many other SQL DMBSs have this or similar hacks.

Note that according to the documentation, REPLACE always does a delete first. It would make more sense to check the key, and then do an update if it already exists, else an insert under the hood. I wonder what the reasoning was for going with deletion. Perhaps it would impact "unmentioned" columns, but the "update" approach above should at least be an option for efficiency sake. For example, "REPLACE ROWCLEAR ... " for the deletion approach.

(Oracle offers MERGE)

I'd like a VERSION statement and an ARCHIVE statement. VERSION would be like UPDATE, but would retain a copy of the data as it was before: equivalent to a SELECT followed by an INSERT (and, if on a separate table, a DELETE). ARCHIVE would be like DELETE, but would retain a copy of the deleted row(s): equivalent to an INSERT and a DELETE. Then, guess what? You could have an UNDO or REVERT! Any offers?

Rather than make it a new command, why not have it be an option, like UPDATE......WITH ARCHIVE or something. I suppose that is a rather specialized need, though, to put into a formal language. However, if the language was more functional, like BusinessSystemTwelve, then adding your own operation/function could be less disruptive.

Is it a specialized need? I always look two or three times at an UPDATE or a DELETE to see whether there is clearly never going to be a(n operational business) need to revert to or view the pre-UPDATE version. In fact, all the DB2 systems I've ever worked on have preferred, for business data, SELECT+INSERT to UPDATE, and most preferred UPDATE or INSERT+DELETE to DELETE. Thinking about this from an OO perspective, if a row in a table is considered to be object and the row-level SQL operations are its class's methods, the VERSION (and REVERT) and ARCHIVE (and UNDELETE) 'methods' deserve to be defined to the class.

(Agreed : It would save on the wheel reinvention of auditing systems that goes on, plus would allow RDBMS vendors to offer very efficient auditing - I've seen very bad implementations of audit systems. I would say, however, that it needs to be a property of the table as it will have performance and storage implications. Some RDBMS do already offer versioning and ability to select data as of point in time).

Personally I don't even like the concept of a separate language for database queries. I really don't like embedding one language within another. For programming I'd prefer to see interfaces for coupling a language's natural datastructures to the database (such as perhaps Perl's "tie" function?) instead of embedded queries. This is especially annoying in awkward, static languages like Java. -- Setok

Perhaps your paragraph belongs in ObjectRelationalPsychologicalMismatch, or CodeAvoidance, because it is not specifically about SQL. I personally like relational techniques and don't like using arrays, bags, etc. for anything but the most trivial collections. Perhaps try a dynamic language rather than Java. Dynamic languages get along better with database query languages IMO. Perhaps embedded languages are just a cost of sharing (SharingDataIsImportant).

I do use a dynamic language as much as possible (the ToolCommandLanguage), but I still prefer the idea of having program structures such as objects and procedures tied to the database instead of a separate query language. I'm not sure of the feasibility of this idea, but I would be interested in options for that. -- Setok

Several languages offer "Embedded SQL". In my experience it doesn't make life any easier. It can't conceal the fact that the database is an external component, nor should it. -- EricHodges

I think the problem may be that it is generally hard to represent declarative concepts in imperative languages, especially when it contains things like logical expressions. -- top

Embedding SQL still leaves you with two sets of conflicting syntax to deal with. What I think Setok wants is to have the query operations expressed in the native language rather than in embedded strings in another language.

As an aside, SQL isn't really a declarative language in the declarative-vs-imperative sense. SQL operations apply to larger and more complex primitive objects than is typical, but they're still sequential operations on those objects rather than assertions about them. -- PdS

 ----

My biggest problem with SQL is that it encourages statements like this:

select xx.xx_rec_id promise_unit_ID, xx.xx_volume aggregate_volume, xx_category category, xx_difference difference, xx_initial_liability_amt initial_liability, xx_out_of_pocket out_of_pocket, xx_holdings_Secured_Flag holdings_UNSECURED_FLAG, xx.xx_normal_liability_chart normal_liability_chart, xx.xx_facility_liability_chart FACILITY_liability_chart, yy.yy_rec_id target_REC_ID, yy.yy_prediction_meter chart_prediction_meter, yy.yy_unique_instance_liability_chart unique_instance_liability_chart, yy.yy_target_liability_chart target_liability_chart, xx.xx_indicator_id, xx.xx_unused_src_calc UNUSED_count_SOURCE, yy.yy_target_nm target_identifier, yy.yy_target_id target_ID, xx.xx_facility_id mqxy_promise_unit_ID,xx.xx_amt mqxy_unit_present_count, xx.xx_units mqxy_unit_present_units, xx.xx_standard_amt mqxy_standard_promise_count, xx.xx_units mqxy_standard_unit_units, to_CHAR(trunc(xx_during_contract_DT)) during_contract_DATE, to_CHAR(trunc(xx_termination_DT)) termination_DATE,to_CHAR(trunc(xx.xx_expiration_DT)) expiration_DATE, nvl(agreement_aggregate.agreement_aggregate,0) agreement_aggregate, nvl(lc_aggregate.lc_aggregate,0) LC_aggregate, r.r_xoynx_RTG xoynxS_RATING, r.r_grrrr_RTG grrrr_RATING from (select lyy.xx_rec_id, sum(lyy.l_count) agreement_aggregate from target yy, promise_agreement zz, promise_unit aa, agreement_unit lo where zz.yy_rec_id = yy.yy_rec_id and xx.ca_rec_id = zz.ca_rec_id and lo.xx_rec_id = xx.xx_rec_id group by yy.yy_rec_id,lo.xx_rec_id) agreement_aggregate, (select lc.xx_rec_id, sum(lc.lc_amt) lc_aggregate from target yy, promise_agreement zz, promise_unit aa, lc_unit lc where zz.yy_rec_id = yy.yy_rec_id and xx.ca_rec_id = zz.ca_rec_id and lc.xx_rec_id = xx.xx_rec_id group by yy.yy_rec_id, lc.xx_rec_id) lc_aggregate, target yy, promise_agreement zz, rating r, promise_unit cs where zz.yy_rec_id = yy.yy_rec_id and r.yy_rec_id(+) = yy.yy_rec_id AND xx.ca_rec_id = zz.ca_rec_id AND lc_aggregate.xx_rec_id(+) = xx.xx_rec_id AND agreement_aggregate.xx_rec_id(+) = xx.xx_rec_id and xx.xx_indicator_id = 1 and zz.ca_indicator_id = 1 and (r.r_indicator_id = 1 or r.r_indicator_id is null) order by target_identifier

This isn't something I made up to bad mouth SQL. I'm not that mean or creative. This is an actual SQL statement encountered in the wild.

"Wild" indeed. For one it lacks formatting. Almost any language without formatting (such as indentation) is hard to read. Second, I try to use asterisks to avoid having to mention each and every field, using "as" clauses for a selected few that need another name (this may result in some column data duplication). I do agree that the nestedness "run-on" shown is perhaps one of the legitimate flaws already mentioned above.

Re: "I try to use asterisks to avoid having to mention each and every field" - umm, I think you forgot to mention that you were joking, right? The way you wrote it, it almost sounds like you meant this seriously!

OK, just in case there are any SQL newbies reading this, do NOT do this (e.g., do NOT use 'SELECT * . . .'). Your queries are likely to BREAK as soon as some (future/other) programmer or DBA adds a new column to the table. Very painful to debug...

(well the queries won't break - the issue is the lack of integration with calling languages - i.e. that they don't support run-type binding of the result set - i.e. extend the class to include the new attribute)

Pardon my French, but bullshit. The SELECT * feature is a necessary tool of the language given its limitations. With some sensible conventions, you can avoid breaking changes. Simply apply two rules: (1) always add new data to the end of the table (never insert new columns in the middle) and (2) always test each value before fetching it from your dataset in your business code. Returning results that are too long is harmless, and too short is too if the last results in the set are treated as "optional" - this provides both forward and backwards compatibility. If your tables map directly to your business objects, SELECT * (or select foo.*) is perfectly reasonable. The only one to avoid (if you follow rules 1 and 2) is "SELECT foo.*, bar.*) where bar.*'s start column will be unknown.

Either way, this is all a workaround to a crufty failure of SQL - the inability to construct queries out of reusable parts. A query is AllOnePiece unless you generate it in your application (outside of the DB) and StoredProcedure nuts find that unacceptable. If you could define the "SELECT ( )" part of your file and some of the JOINS and WHERES of statements in other files, you wouldn't need this mess. Yes, I know you can do that with subqueries and temporary tables - but that means you have to sacrifice the efficiency of the DB. If I have 9 different procedures that fetch a list of LineItems? from table, the SELECT() part of all of them will be nearly identical - but unless I construct from outside SQL, coding OnceAndOnlyOnce is impossible. Following the above rules means that SELECT * will be automatically maintained, while SELECT(blah blah blah) will require 9 procedures to be updated. -- MartinZarate

N.B.: Your code should not be accessing the result columns positionally anyway. Use the names and you'll avoid a lot of these issues. The other downside of using *, however, is that it makes it harder to find all references to a given column in your SQL. Still, I do end up using * from time to time, since RulesAreMadeToBeBroken.
Asterisk does not necessarily imply "positionally". It depends on the API's. The good ones should allow reference-by-name. But I'll take a moment to "plug" SMEQL (TopsQueryLanguage) here because it allows one to pick columns using tables (virtual or real). Thus, one can use the regular query idioms to manage similar column lists. -t

How can you avoid making breaking changes if you have columns with the same name? No, you can't always just say "then make sure no columns have the same name", because you often want to join to a table twice. If any one of those tables is actually a multi-table view, SELECT * can never be guaranteed to not break, and the whole point of views is that they are supposed to be indistinguishable from tables.

There are downsides to both approaches. One needs to weigh the trade-offs per query, shop conventions, app language, etc. -t

[Using meaningful aliases for your tables (as opposed to xx, yy and zz) would also go a long way towards enhancing readability and comprehension.]

Maybe it is 3D coordinates :-)

Try adding meaningful aliases then. I don't think it will go very far towards enhancing readability and comprehension. I think that statement is just too long for an average human brain to manipulate with confidence.

Well, SQL is a programming language, and some programs are longer than others. Granted, it's a declarative language, and generally of a fairly functional language paradigm, but it's a programming language nonetheless. This means (at least) two things:

1- SQL can be long if the task it is doing is complicated, and 2- Like in other languages, style counts. In the above example, it counts a great deal.

In short, if you don't approach it with the same care as you approach your non-SQL code, don't be surprised if it gets screwed up.

If we try refactoring this SQL, what do we get?

The following refactorings were applied:

1. Capitalize SQL keywords. 2. Put separate constraints and column names on separate lines. 3. Indent equally detailed expressions equally. 4. Use a monospace font. 5. Right align equally important SQL keywords equally.

SELECTxx.xx_rec_idpromise_unit_ID, xx.xx_volumeaggregate_volume, xx_categorycategory, xx_differencedifference, xx_initial_liability_amtinitial_liability, xx_out_of_pocketout_of_pocket, xx_holdings_Secured_Flagholdings_UNSECURED_FLAG, xx.xx_normal_liability_chartnormal_liability_chart, xx.xx_facility_liability_chartFACILITY_liability_chart, yy.yy_rec_itarget_REC_ID, yy.yy_prediction_meterchart_prediction_meter, yy.yy_unique_instance_liability_chartunique_instance_liability_chart, yy.yy_target_liability_charttarget_liability_chart, xx.xx_indicator_id, xx.xx_unused_src_calcUNUSED_count_SOURCE, yy.yy_target_nmtarget_identifier, yy.yy_target_idtarget_ID, xx.xx_facility_idmqxy_promise_unit_ID, xx.xx_amtmqxy_unit_present_count, xx.xx_unitsmqxy_unit_present_units, xx.xx_standard_amtmqxy_standard_promise_count, xx.xx_unitsmqxy_standard_unit_units, to_CHAR(trunc(xx_during_contract_DT))during_contract_DATE, to_CHAR(trunc(xx_termination_DT))termination_DATE, to_CHAR(trunc(xx.xx_expiration_DT))expiration_DATE, nvl(agreement_aggregate.agreement_aggregate,0)agreement_aggregate, nvl(lc_aggregate.lc_aggregate,0)LC_aggregate, r.r_xoynx_RTGxoynxS_RATING, r.r_grrrr_RTGgrrrr_RATING FROM(SELECT lyy.xx_rec_id, SUM(lyy.l_count)agreement_aggregate FROMtargetyy, promise_agreementzz, promise_unitaa, agreement_unitlo WHERE zz.yy_rec_id = yy.yy_rec_id AND xx.ca_rec_id = zz.ca_rec_id AND lo.xx_rec_id = xx.xx_rec_id GROUP BYyy.yy_rec_id, lo.xx_rec_id ) agreement_aggregate, (SELECT lc.xx_rec_id, SUM(lc.lc_amt)lc_aggregate FROM targetyy, promise_agreementzz, promise_unitaa, lc_unitlc WHERE zz.yy_rec_id = yy.yy_rec_id AND xx.ca_rec_id = zz.ca_rec_id AND lc.xx_rec_id = xx.xx_rec_id GROUP BY yy.yy_rec_id, lc.xx_rec_id ) lc_aggregate, targetyy, promise_agreementzz, ratingr, promise_unitcs WHERE zz.yy_rec_id = yy.yy_rec_id AND r.yy_rec_id(+) = yy.yy_rec_id AND xx.ca_rec_id = zz.ca_rec_id AND lc_aggregate.xx_rec_id(+) = xx.xx_rec_id AND agreement_aggegate.xx_rec_id(+)= xx.xx_rec_id AND xx.xx_indicator_id = 1 AND zz.ca_indicator_id = 1 AND (r.r_indicator_id = 1 OR r.r_indicator_id IS NULL) ORDER BY target_identifier

FixMe: Is this statement valid? The "xx" and "lyy" tables (or table aliases) do not seem to be FROMmed (or defined) anywhere.

Also, some columns seem to be badly named:

 xx.xx_indicator_id seems to be a boolean, not a key.
 zz.ca_indicator_id (aka promise_agreement.ca_indicator_id) 
seems to be a boolean, not a key.
 r.r_indicator_id(aka rating.r_indicator_id)
seems to be a boolean, not a key.'

The formatting helps, but there is still the intrinsic problem that that is a huge chunk of code, and your only prayer of getting a handle on it is for the context to give you enough hints. In any other programming language, you could take each of the sub-SELECTs out, declare them in advance, and give them meaningful names, you could probably give meaningful names to various combinations of things in the WHERE clause, etc. SQL is more of a write-only language then Perl is.

Agreed. Ideally it should be broken into chunks, with each chunk documented. It suffers ThickBreadSmell. The SQL standard does not allow this.

(Still needs to go to the RDBMS as this big chunk. Developers rarely understand how to efficiently implement a query - think of a query engine as a repository of algorithms. I'd also like to see the functionality of the above written in 'any other programming language'. But I would agree, ability to declare and re-use elements of a query without defining them as server side views would go a long way. Something like a block structure that sits around the SQL to allow you to declare sub-elements up front, but unlike existing procedural extensions (sub-queries defined as cursors) would prevent the developer from choosing the implementation).

It could still be a syntactical division, not necessarily an implementational division. It would be almost as if the temporary/local views were like macros that were substituted into the final query before being (internally) submitted. But if that is too tough, then implementational division would still be nice. Hopefully a standard would allow both, and the decision is merely a DB engine implementation decision so that smaller-footprint engines could take the easy way out if they want, but use the same language convention that an efficiently-parsing/partitioning/factoring BigIron engine would use. Syntactical abstraction/partitioning is not necessarily the same thing as implementation abstraction/partitioning.

Re: Large, complex syntax instead of using combinations of simpler syntax units. Think of comparing the COBOL parse tree (or generators) to C's or Lisp's parse tree.

Aesthetically I agree, but on the other hand, COBOL-like syntax, such as SQL suffers, tends to have a lot of grammatical redundancy that makes it much easier to do really good error messages (in theory and often in practice), compared with the more obscure error messages that can be generated for C syntax errors. This is one of the very few good arguments in support of the otherwise nonsensical "let's program in English" notion.

I find this a little hard to believe. I never considered SQL error messages in existing products particularly noteworthy. Maybe a specific example would help.

Note that I said "COBOL-like" and "in theory and often in practice", meaning I'm not talking about having surveyed existing SQL products, I'm talking about what is possible. I have done extensive work on adding error recovery to YACC grammars, especially automatic error recovery (as opposed to adding zillions of hand-crafted error recovery rules), and I'm just saying that this is both the conclusion of standard theory in the field, and of my experience: redundancy in a grammar assists syntactic error recovery.

In C, by contrast, relatively speaking there is little redundancy, and that limits the quality of automatic error recovery and of error messages.

Dig up the paper on Graham-Rhodes error recovery and its practical application to the eyacc variation of yacc that was created to parse Berkeley Pascal; either it or the eyacc paper says exactly what I am saying.

I sympathize that existing parsers found in the wild may not come anywhere close to what they could do in theory. To get good results I had to modify YACC; very very few people ever do that.

-- DougMerritt

Well, it seems very counter-intuitive, but to some extent I guess I will take your word for it. You've studied the issue more than I have. However, in practice it appears not to be a strong enough factor to bring back COBOL-like languages. I have never heard anyone try to sell COBOL or SQL by saying the error messages are significantly better. Maybe because the time saved by implementing a simple grammar is spent on other areas of debugging and error message systems to compensate.

Probably true. And I personally dislike English-like syntax.

I've always wondered why SQL's problems with nulls couldn't be managed by extending the languages that call it (C++, Java, etc.) to handle nulls in a SQL fashion natively anyway.

There are some wiki topics on this somewhere around here.

There are ambiguities with NullsInSql; see http://www.sqlite.org/nulls.html for a walk-through of how various databases handle NULLs. And the issue that I see with NULLs is mostly that it is not well-defined, and seems to invalidate a bunch of the relational mathematics.

I suspect that there is something amiss with SQL or the way we use it. When I have to design a database which will be implemented in SQL, the essence of design is the same as for any other technology. The essence of design is the understanding of the application. I find with SQL that when I attempt to prototype a design I need to divert my attention to the practicalities of fitting my understanding of the application to SQL's requirements. Often the inconvenience of this diversion stops me attempting a different idea on the prototype. A language intended to be the database language should encourage experimentation in design. SQL did achieve this early - it was often a better standard of database management than many othe approaches at the time. But it never progressed. Except for PostGres?, I am told. -- PeterLynch

Are you saying that your table schema designs are being influenced by the abilities and limits of SQL? I believe it possible, but would like to see specific cases.

Yes -

An example of design with SQL in mind constraining design.

Say I have an entity called Processes, which contains metadata about each named process in the system. Each instance of Processes contains the location, the calling arguments, the returned argument etc. At some later stage I realize that I can determine the called processes of each process, so I add them to the database definition - now the Processes entity has a new attribute, Called.Processes.

I am confused here. What does the dot mean in this case?

It is just a delimiter. Just a part of the name. I am used to dot being a part of a variable name, and I use it on Wiki, even though it can cause your question, because it renders well and does not get confused with WikiWords - (CalledProcesses? Called_Processes Called.Processes)

In SQL I need to create a table which looks after this one-to-many relationship. Called_Processes would be an appropriate name.

In a MultiValue'd database, I would add the Called.Processes attribute to the Processes entity, just as in SQL, but then the database would be ready. I would not need to create the one-to-many mechanism.

It is not clear to me if its SQL or RelationalTheory? itself that is the issue that's bothering you. Jaming repeated values/records "inside" a record creates its own problems and difficulties, such as messy queries when you don't wish to use the parent relationship.

I believe I understand your attitude to three or more dimensional database rows. I have reservations about them too, but I have used them successfully for many years. Unfortunately, if you are a member of the huge majority who have not used MultiValued databases, I cannot explain the concept to you in less than a months working with you, or a book. At least the MultiValue approach allows the many-to-many relationship to be trivial.

So SQL requires a change to the structure of the database in a very common situation - one-to-many being added or one-to-one changing to one-to-many.

Now I do not mind changing the database definition to reflect any desired change in the RealWorld, but changing the structure because the element I want is plural - well that is too much. It affects my design at every turn. I must do BigDesignUpFront to ensure that I have considered enough of the actual implementation to be sure that I have determined the plurality of most that matters.

Not to mention many-to-many - What incredible contortions do you need to go through with SQL when you then decide that you want to have a "Calling.Processes" attribute in the Processes entity? In a MultiValue'd database, all that is required is an index on the Called.Processes attribute. Each process is the Calling.Process of its Called.Processes.

I think the above example is pretty normal but it then again, Im a bit divided as to whether it really should be SQLs fault. In almost all application implmentations, changing the meaning of some attribute from singular to plural is a pretty big deal to begin with. Suddenly your assignment code has to iterate over arrays or you have to get your functino to detect whether a single value or array was sent. SQL is a kind of "assemly language" for compiling data definitions into low-level form, so it does make sense that you have to talk about your definitions as tables. But the query side shouldn't really be as complex as the definition side implies. Perhaps there should really be a shorthand for doing these grouping queries? Actually, there are: Postgresql allows you to use arrays as groups and mysql has a group_concat which emits a comma-separated string.

Ok, I have a problem with this example.

The relational position is very simple. You declare the relation between a called process and it's calling process by giving every process a "caller" attribute. This has the handy side effect of constraining the relationship to be one-to-many; many called processes may have the same caller, but no called process may have more than one caller.

Now getting the facts you want from the facts in the database is simple; I'm sure you yourself can formulate the appropriate relational or SQL queries for "give me all the processes that a given process called," or give me the caller of a given process," or whatever.

Let's move on to the issue about changing the structure of the database. That's being gone at backwards, IMHO; if you need to change the structure of your database and it's hard, the solution is not to throw out the structure but to make it easy to change. To do otherwise is akin to not doing unit testing at all because the existing code is unsuited to it, rather than changing the code to be more suited to unit tests.

As for the "incredible contortions" you mention, it seems to be you going through them, not the relational folks. You have to go and add indexes to get the calling processes information from the called processes information; the relational developer simply declares the relationship between the two and the DBMS figures out for him how to answer whatever question he should propose. (Frequently optimization still has to be done by hand, which is unfortunate, but you don't need to do anything just to get correct answers, as opposed to your system which appears to provide no answer at all to a question it's not been previously prepared for.)

-- CurtSampson

I would like to see specific UseCases

 ...
 WHERE len(trim(nullFix(codeA,' '))) < 3 OR len(trim(nullFix(codeB,' '))) < 3 
    OR len(trim(nullFix(codeC,' '))) < 3 OR len(trim(nullFix(codeD,' '))) < 3 ...

Here, the same expression has to be performed on multiple columns. It is difficult or impossible to factor such expression to a single place to keep with OnceAndOnlyOnce. One could perhaps argue that the schema builder should have made such similar columns be rows instead, but that is not always the case. Besides, sometimes we are stuck with bad schemas and have to live with them.

PageAnchor: "dup_finder"

Here is an example that tends to represent the kind of duplication (OnceAndOnlyOnce violations) I often encounter in SQL.

Sample Table "codeTable"

locat code descript ----- ---- -------- 10 AA Foo Bar 20 AA Foo Baar 30 AA Foo Bar 10 BB Glab Zab 20 BB Glab Zab ....

select * from codeTable where locat not in (30, 50) and code not in ( select code from (select code, descript // gets unique code-and-descript combos from codeTable where locat not in (30, 50) group by code, descript ) group by code having count(*) > 1 // (error? should be "="?) ) order by code, locat

Here we have a table of codes in which we want to find and study typos in the descriptions that are supposed to repeat for each location. (Perhaps the repetition is bad normalization, but sometimes one has to deal with such data from clunky old systems.) For example, the second row in the sample data has the typo "Baar".

In this case we want to ignore codes from location 30 and 50 because we know they are not being used right now and thus we don't care to inspect them. To do it properly, we have to apply the filter in two different places. I see this kind of thing in a good many queries. There may be ways around such, but they are not obvious and not general-purpose solutions to such.

Contrast this with a TopsQueryLanguage kind of solution:

  ct = filter(codeTable, locat not in (30, 50))  
  counts = group(ct, compareConcat(code, descript), c('code, count() cnt'))  
  f = filter(counts, cnt > 1)
  j = join(ct, f, a.code = b.code)
  sort(j, c('code, locat'))

Here, the location filter only needs to be mentioned once.

Footnote: I introduced an undocumented function "compareConcat" in this example because TQL only compares on expressions rather than combos of columns and expressions. Perhaps TQL should use a TqlColumnTable instead of expressions for grouping. Something to ponder. (Change suggestion since noted)

SQL dialects that allow virtual tables in queries can essentially do the same thing, I would note. Instead of using "codeTable" in the above SQL example, one would first define a virtual table with the given locations filtered out already and then only reference the virtual table.

One added bonus of TQL here is that we can study the intermediate steps just by commenting out some of the lines:

  ct = filter(codeTable, locat not in (30, 50)) 
  counts = group(ct, compareConcat(code, descript), c('code, count() cnt'))  
  // f = filter(counts, cnt > 1)
  // j = join(ct, f, a.code = b.code)
  // sort(j, c('code, locat'))

It is harder to do that with the nested approach because one has to comment out both the top and bottom of the sandwich layers.

Another OnceAndOnlyOnce problem is long-winded nested functions used in multiple places. It would be nice if a query language offered at least syntax substitution macros.

  select
     rpad(nvl(ltrim(foo),' '), 30),
     rpad(nvl(ltrim(bar),' '), 10),
     rpad(nvl(ltrim(foobar),' '), 18),
     rpad(nvl(ltrim(fooadf),' '), 5),
     rpad(nvl(ltrim(fooadfasd),' '), 25),
     ....

It would be nice to define a virtual function:

  deffunc mypad(x, howlong)
    return rpad(nvl(ltrim(x),' '), howlong)
  enddef

  select 
     mypad(foo, 30),
     mypad(bar, 10),
     mypad(foobar, 18),
     ....

Readability goes way up.

I can understand not wanting to risk run-away execution by allowing TuringComplete algorithms. But, a compromise is to allow a function to be defined in terms of combinations of existing functions. Loops and perhaps recursion would not be permitted.

Or you can allow loops and recursion but apply other resource limits, such as application steps. There are many ways to prevent run-away execution without neutering the language. Even if a language is TuringComplete, most of its actual uses will terminate reasonably. A person who favors 'dynamic' languages (dynamic typing or type-light) would be consistent if they also favored catching run-away problems at run-away time rather than catching them at language-design time (which is even before compile-time).

Oh bleep, not another "safety" fight.

I just find it curious how extremely inconsistent your views are. Your views on TuringCompleteness in query languages and figuring out which operations should be "permitted" are squarely in the "safety" camp. You traitor. =)

It has more to do with partitioning responsibility than "protecting" stuff.

Your words: "not wanting to risk run-away execution". That's about "protecting" stuff. You even suggest that "loops and perhaps recursion would not be permitted" in order to avoid this risk. Seriously, you may as well have been jumping around without a hard-hat and singing "We can dance if we want to..."

As far as "partitioning responsibility" goes, you're now in conflict with other of your beliefs - especially those regarding leaky abstractions, polymorphic dispatch vs. if-thens, helpers vs. wrappers, advantages of exposing the runtime engine, and table-oriented programming. "Partitioning responsibility" is one way of saying "language and tool-enforced modularity and encapsulation". If you honestly give much weight to "partitioning responsibility", you really haven't shown it in your words above or anywhere else.

We are wondering off topic. I don't want to flood this topic with a long battle.

Beating a dead horse here: standard SQL is TuringComplete and has been since last century; implementation vendors have just been slow in adopting the relevant parts of the specification (SQL:2011 features T121-132). Segue into the next topic...

Personally, I'm not sure I want it TC. It may encourage too much using-the-wrong-tool-for-the-job.

There is that risk, but the benefits probably outweigh the downsides. There are iterative processes that if they are defined inside a database (as a StoredProcedure, usually) and execute inside the DBMS, gain the following benefits:

They run significantly more efficiently as iterative processes inside the database than as declarative processes inside the database (though, arguably, this could be addressed by a better declarative query language);
They require dramatically less network bandwidth by running them inside the database than exporting the data and running them outside the database;
They can be shared by multiple languages and environments if they're inside the database;
They are not dependent on a client-side language and/or infrastructure;
They can execute inside a SQL transaction.

Please try it on a different nation or planet first before taking it mainstream.

Too late, by almost 25 years. See http://www.orafaq.com/wiki/PL/SQL and http://www.postgresql.org/docs/9.1/static/xplang.html and so on.

The above snippet highlights the core failure of SQL - no interest in reuse of code at all. Every bit of reusability has been grafted on after-the-fact in an ad-hoc non-portable manner. Anything that doesn't fit perfectly 100% into the relational paradigm has to be coded in by the developer, and then cannot be reused in any sensible way. There is no library to provide common data structures like trees, linked lists, and versioned data, no way to define new grouping functions (Oracle has an agonizing approach to that involving implementing all of the methods needed to support it) and a host of other painful problems. Solutions are either one-time hacks added by the RDMS vendor (like Oracle's Connect By clause) or left to be "conventions" for users to quibble over and re-implement endlessly. SQL is worse than JavaLanguage for OnceAndOnlyOnce. Only the specacular usefulness of the underlying RelationalDatabase has made us all tolerate this painful language.

This is not a fault of relational, but SQL, or implimentations of SQL in particular. The proposed TopsQueryLanguage provides a syntax that allows easy site-specific extensions by adding new "functions", as long as those "functions" follow certain rules.

Here is another example:

  replace(replace(replace(B.TELEPHONE_NUMBER,'(',``),')',``),'-',``) AS PHONE
  // I had to use some back-ticks because regular quotes confuse wiki

Is much less readable than something like:

 // strip out phone number punctuation
 $tmp = B.TELEPHONE_NUMBER;     
 $tmp = replace($tmp, '(', '');
 $tmp = replace($tmp, ')', '');
 $tmp = replace($tmp, '-', '');
 PHONE = $tmp;

When you deal with queries all day, nested crap like the above bites deeply into productivity.

This format is easier to read than the original SQL:

 // strip out phone number punctuation
 REPLACE(
 REPLACE(
 REPLACE(B.telephone_number,
         '(',''),
         ')',''),
         '-','') AS phone

It is an improvement over the original, bit I still find it confusing to figure out what starts where and why that way, especially for more complicated expressions. It has the ThickBreadSmell. It seems something only a Lisp fan would love :-)

For nested selects and operators the following format is even cleaner:

 REPLACE
 (  REPLACE
    (  REPLACE
       (  B.telephone_number,
          '(',''
       ),
       ')',''
     ),
     '-',''
 ) AS phone

It is still code bloat, but it at least makes nesting readable. Another solution could be to use a stored SQL function, even if it is referred to from only one place of code.

Not Equal and Nulls.

If one is using an outer join or if there are nulls in the data, then one often has to do:

  X <> 'IGNORE-ME' OR X IS NULL

The vast majority of the time when one is using not-equal, one also wants the nulls taken into account. Thus, to have to include the Is Null clause and remember to include the clause is a common annoyance and mistake pattern.

It probably stems from not-equal being equivalent to "is-less-than or is-greater-than", both of which undstandabilty need to ignore nulls. However, perhaps this equivalence should be questioned. Or the whole idea of Null strings being dropped, which would solve roughly half the problem.

But the presence of a NULL means the value is not known - at least, not in this table; if you don't know the value, what makes you so sure it's not equal?

Experience. I've rarely needed the null rows excluded, and it's easy to explicitly exclude for the rare exceptions. It simply violates common sense.

The science of SQL includes the term "query". Oddly enough,a query can also update data, insert new data, remove data, create an entity, import or export data to other data management languages etc. In other words, the IT industry misuses the word in one place, and compensates for the misuse by using it in a "technical sense" at the very base of SQL's language definition. This is one of the better examples of the IT industry abusing the English language - I reckon it could be a sub-conscious urge to obfuscate to keep programming for programmers. -- PeterLynch

"The science of SQL"???

Overloading "query" probably owes much to early attempts to market technical database products to non-technical managers and users. E.g., "Need an answer to a question? It's easy -- just enter a query! Need to update the database? It's easy -- just enter a query! Need a new table...?" Etc. Everything-is-a-query makes the product seem simple and easy.

I agree there is a terminology problem, but I'm at a lost to present an alternative. "Command"? "Statement"? -t

CategorySqlProgramming CategoryQueryLanguage