Nulls And Relational Model

ChrisDate argues at length against ThreeValuedLogic in the context of database systems in the book AnIntroductionToDatabaseSystems (and in many of his other writings). Much of his objection stems from the presence of nulls in many database products (which force ThreeValuedLogic to be used everywhere - a bit bizarre, considering that RelationalDatabases make the ClosedWorldAssumption typically).

That doesn't make sense. "ClosedWorldAssumption" appears to mean "not three valued". But SQL RDBMS databases (by far the majority in today's world) certainly do use NULL. So you seem to be saying that it's bizarre that RelationalDatabases use NULLs considering they don't use NULLs. Please clarify.

[What I meant was -- NULLs introduce "unknown" into a ClosedWorld?, which without NULLs would have no need for ThreeValuedLogic. Date argues that this introduces quite a few wrenches in the works of relational theory. (Date also argues that SQL RDBMS's are poor implementations of the theory - he is a committed theorist who decries much of common practice).

Of course, the problem may be the ClosedWorldAssumption. It assumes that everything is known a priori; if something isn't explicitly or deductively true, then it's false. However, unknowns are present in the RealWorld, and something that needs to be dealt with.

UNKNOWNs and other NULLs don't actually violate the ClosedWorldAssumption. They assert that you KnowWhatYouDontKnow? - a state of ConsciousIncompetence?, knowledge that a particular question simply cannot be answered. That fact just happens to be part of the ClosedWorld?. It's part of what is known a priori.

Date's solution - refactoring any attributes that might be NULL into their own relation (table) - and including a tuple (record) into the table if the relationship is defined; excluding it otherwise - seems attractive, as it avoids NULL. On the other hand, it has its own particular flaw - getting back to the ClosedWorldAssumption. The absence of a particular tuple doesn't mean "unknown", it means "false".

Tables are not Relations but rather representations for them. That means there is an interpretive layer. Tables are allowed, for example, to represent everything that is not in a Relation - everything in the table is 'False', and everything else is 'True'. In more general use, you're even free to choose whether the absence of a tuple represents 'false' or 'unknown' so long as you have a valid means of communicating that it is Known what is Unknown (which requires three-valued logic).

The problem with NULLs (one of them) is that they force a "conversion to unknown" early in the process, and thus lose information. Date gives several examples where this can lead to incorrect query results. -- ScottJohnson]

Worse, his solution doesn't actually fix what he claims to be the real problem -- the avoidance of ThreeValuedLogic. Does Alice with her salary of 42k have a greater income than Bob, whose salary is Unknown? The answer is still Unknown. Splitting the tables doesn't help. Nothing was actually gained by doing so, except the conversion of physical nulls (represented as Nulls in a relation) into semantic nulls (represented as Nulls via a separate table). What is lost in the conversion is simplicity in the relational model of the data, and an increase in the number tables, by what was essentially the process of forcing certain indexes onto the data.

I'm familiar with Date's arguments. What I mean is that your phrasing seems self-contradictory. It's not that it's bizarre for an SQL RDBMS to be ClosedWorldAssumption and to have NULLs, it's that it is impossible. Date is saying it shouldn't have NULLs, in which case it would be closed world.

But since it does have NULLs, it is not closed world.

Of course, it isn't a traditional open world, either. It's a world of "assume any proposition is false, unless explicitly declared to be true or unknown". How the difference in default assumptions affects the underlying theory, I do not know.

I think this is just an issue of phrasing; I don't think I'm arguing about the gist of your point.

I don't think so, either. But spending the day hashing out such nuances can be so much fun. :)

But getting to that: Date is a bright guy, and certainly knows the subject backwards and forwards, and has many enlightening things to say, but I don't think that he is quite the last word on ThreeValuedLogic outside of RDBMSes, and even in the world of SQL there is room for some debate once one has grasped his point. (Most of the problems stem from people unaware of his arguments or alternatives.)

Outside of databases, multi-valued logic has a number of different interesting branches in pure math, some of which have practical applications.

-- DougMerritt

Agreed. -- ScottJohnson

Date's solution should be rejected right away, on the simple ground that it's utterly inelegant and raises the level of complexity to ridiculous levels. Here are the final queries reconstituting a table that had in the beginning two nullable columns (admittedly null has in each column two different interpretation. From http://www.hughdarwen.freeola.com/TheThirdManifesto.web/Missing-info-without-nulls.pdf

 WITH (EXTEND JOB_UNK ADD ?Job unknown? AS Job_info) AS T1,
      (EXTEND UNEMPLOYED ADD ?Unemployed? AS Job_info) AS T2,
      (DOES_JOB RENAME (Job AS Job_info)) AS T3,
      (EXTEND SALARY_UNK ADD ?Salary unknown? AS Sal_info) AS T4,
      (EXTEND UNSALARIED ADD ?Unsalaried? AS Sal_info) AS T5,
      (EXTEND EARNS ADD CHAR(Salary) AS Sal_info) AS T6,
      (T6 { ALL BUT Salary }) AS T7,
      (UNION ( T1, T2, T3 )) AS T8,
      (UNION ( T4, T5, T7 )) AS T9,
 (JOIN ( CALLED, T8, T9 )) AS PERS_INFO :
 PERS_INFO

Of course, the query above is a monstrosity. The elementary solution is to have relational algebra over a decent type system, where one would declare:

  type salary_info= Unsalaried
                    |Unknown
                    | Salary of number
  function to_string Unsalaried= 'Unsalaried'
                     | Unknown= 'Unknown'
                     | Salary of x = string(x)

And that's it, assign the corresponding types to nullable columns create a to_string, operator for each of these types and then the above monster query becomes:

  PERS_INFO {id, name, to_string(job), to_string(salary) }

Basically, what it boils down to is that they want to supplant the primitiveness of their type systems by artificially using relations. Given a decent type systems, there's hardly any need for three valued logic. The fact that something is unknown can be encoded explicitly using an extended data type. -- CostinCozianu

It should be pointed out that Date argues (rather forcefully) for a decent type system. You and he agree (and I too, though my experience in this particular subject is limited) that the typing facilities provided by SQL are embarrassing on many levels. OTOH, Date seems to think that encoding domain-specific NULLs with specific semantics isn't very good either; I don't agree with him here.

However, I would suggest one ReFactoring, using the "type union" operator (union in the VariantRecord? sense, not in the set-theory sense, in that a domain union could represent exactly one object of exactly one of the specified types. I cannot remember the CategoryTheory name for this operator). Create a Salary type (domain), which encodes a (known) salary. Then, where the salary might not be known or applicable; define the attribute to be of type "type_union(Salary,Unknown,Unsalaried)".

Part of the problem with NULL (both in SQL and in many non-relational languages such as JavaLanguage) is that you can never exclude it from the possibilities - any variable at any time might be NULL. NiceLanguage is one attempt to fix this issue in Java - references in Nice, unless declared that NULL is a possibility, must always be bound to a valid object. CeePlusPlus references are useful for the same reason.

-- ScottJohnson

"Part of the problem with NULL (...) is that you can never exclude it from the possibilities - any variable at any time might be NULL." -- IIRC, you can mark a field as being 'required'. That eliminates the possibility of NULL.

I fully embrace the idea of typed unions as discussed by Costin, and share your disgust with the limted typing facilities provided by SQL. I just feel it's necessary to add that, with NULL and SQL's limited typing, the data type already is:

  type salary_info= Null
                  | Just a Number
  function to_string :: salary_info -> String = \x ->
             x ? Just n  => to_string(n)
               | Null    => "null"                 -- or some other string, often empty

This is, of course, limited to the binary (eqv. to the Maybe type); SQL simply doesn't provide better options.

I'd also note neither our solutions nor even Darwen's solve Darwen's or Date's pet peeve with ThreeValuedLogic, nor even fix the ClosedWorldAssumption, when NULL is used to represent unique unknown values. Using your salary_info:

  type salary_info= Unsalaried
                  | Unknown
                  | Salary of number
  type logic_primitive= True | False | Unknown
  function operator ">" :: salary_info -> salary_info -> logic_primitive =
             -- (a naive implementation; violates ClosedWorldAssumption)
             \sl sr -> (sl,sr) ? (Salary of n, Salary of m) => (n > m)
                               | (Salary of n, Unsalaried)  => True
                               | (Unsalaried, Salary of m)  => False
                               | (Unsalaried, Unsalaried)   => False
                               | (Unknown, _)               => Unknown  
                               | (_,Unknown)                => Unknown

If you don't see it, the problem is that if sl and sr are the same 'Unknown', then sl > sr is provably False.

One correct approach:

 type salary_info = Unsalaried
                  | Unknown UniqueId?
                  | Salary of number
 type logic_primitive= True | False | Unknown
 function operator ">" :: salary_info -> salary_info -> logic_primitive =
             -- '#' in this syntax is a guard
             \sl sr -> (sl,sr) ? (Salary of n, Salary of m)                => (n > m)
                               | (Salary of n, Unsalaried)                 => True
                               | (Unsalaried, Salary of m)                 => False
                               | (Unsalaried, Unsalaried)                  => False
                               | (Unknown idl, Unknown idr) # (idl == idr) => False  
                               | (Unknown, _)                              => Unknown
                               | (_,Unknown)                               => Unknown

Another correct approach:

  type salary_info = Unsalaried
                   | (implicitly unique) Unknown
                   | Salary of number
  type logic_primitive= True | False | Unknown
  function operator ">" :: salary_info -> salary_info -> logic_primitive =
             \sl sr -> (sl,sr) ? _ # implicit_eq sl sr      => False
                               | (Salary of n, Salary of m) => (n > m)
                               | (Salary of n, Unsalaried)  => True
                               | (Unsalaried, Salary of m)  => False
                               | (Unsalaried, Unsalaried)   => False
                               | (Unknown, _)               => Unknown  
                               | (_,Unknown)                => Unknown

... where 'implicit_eq' and 'implicitly unique' are key words to the language utilized when translating this declarative language into a computation. It turns uniqueness into an implementation detail. (One could, for example, use only pass-by-reference and test address sl == address sr... or one could literally create unique ids, as the other correct approach indicates).

More correct approaches are possible. Moving operators up one or more levels to records, relations, or even operation over other relational operations, all provide for possible correct approaches.

Use of NULL as representing implicitly unique Unknowns is a valid approach, and does not inherently harm a RelationalModel. Implicitly unique values still cause a snag for constructing 'new' relations (e.g. adding a new entry with this or that NULL from another entry), but the RelationalModel (being mathematical) doesn't require that ability; any defined relation is fully described by the relational operators that created it.

(Of course, the fact that SQL fails to implement the correct relational operators to handle this particular use of NULL does violate the ClosedWorldAssumption and relational equivalence for SQL. It's important to recognize the correct reasons for this, though, so the correct parts are fixed.)

[Moved from NullConsideredHarmful]

ChrisDate has rallied against Nulls in database languages. He suggests they create a ThreeValuedLogic, which is far more complicated to grok than 2-value Boolean (True and False). I tend to agree. Nulls cause more problems than they solve.

What really tics me off is nulls in string concatenation. For example:

  x = a || b || c

where "||" means concatenate. In some systems if a or b or c is Null, then the whole fricken result is Null. This is unacceptable and ripe for bugs and confusing to newbies IMO. The Null becomes like a poison-pill where the presence of just one ruins the whole expression. Sort of how mad-cow disease works. One ends up having to rap each item in a null-fixer function. Concatenation should treat nulls like a blank of zero length. Similar things can happen in a math expression, but the result is not as error-prone as concatenation in my experience.

I'm curious as to how much that opinion is based on unexpected results. Of course the "proper" (that is to say SQL92 standard) way to concatenate strings where you don't want propagation of NULL is with the COALESCE() function, e.g.,

x = COALESCE(a, ") || COALESCE(b, ") || COALESCE(c, ");

in other words, there are functions that are built to handle the realities of 3-value logic in SQL. Of course if you use Oracle 8, wherein

x = a || b || c;

produces the same results as above (i.e., Oracle 8 is not standard compliant), you might be a bit startled when you start using a SQL92 compliant db and get a NULL result. I however, chalk that up to learning. If one accepts NULL as a necessary part of robust data sets that attempt to model real-world systems (see below on that) then one needs to articulate the logic of NULL values and have ways of dealing with that logic consistently. To simply decree "no NULLs" and try to avoid the issue then one assumes either (a) universal default values (this approach, by the way, doesn't work) or (b) different rules for propagation of NULL in expressions. The latter approach, (b), can be made to work but it amounts to arguing with certain details of the standard and instead needing other to-be-standardized ways of dealing with NULL values: i.e., check each and every value when one cares about whether there's a NULL. It amounts to a preference of default behavior that one would rather see. The formal logic remains unchanged.

Repetition is a Smell

Re: Of course the "proper" (that is to say SQL92 standard) way to concatenate strings where you don't want propagation of NULL is with the COALESCE() function

Yes, but one often has to keep using it all over the place. Something that is repeated often is probably a design smell, or at least a yellow alert. I would much rather have concatenation that treated Null strings as a zero-length string rather than a poison pill.

Regarding databases, I think Nulls are acceptable in numbers or dates, but not strings. (Whether nulls even there mess up the "purity" of relational, as ChrisDate contends, I can't really say.) A zero length string can represent an "empty" string. We don't need nulls for that.

But what about the case in which a zero-length string is a valid value? I run into that conundrum all the time.

But why would you need a null?

Say I'm writing an application that periodically polls devices via SNMP and stores the result in a database. Some of the data are strings, and it's valid for them to have any conceivable value, including an empty string. However, sometimes the device flakes out and doesn't return a value at all. One method would be to introduce another column that stores whether or not the device really returned anything; this complicates things because it requires a read to determine if a read is applicable.

(Most database API's return the quantity of result records. You should check this number, not the actual values. Further, if you issue a query and the system cannot complete the query, it should be an exception (error condition), not a value.)

Note that some databases treat nulls and zero-length strings as the same thing. I say toss, or ignore, that convention.

Why? If we accept nulls for numbers and dates for perfectly valid reasons, why would we exclude them from strings? Why wouldn't we want to be able to distinguish between zero length strings and the absence of value?

Partly because strings are more flexible than numbers and dates. You have more choices in which to indicate different kinds of information. You can have a string value of "null" (literal), for example. I have not encountered a situation where null strings would be more helpful than problematic.

A string can represent a value *plus* any possible string (including itself; QuineProgram). For example, I can write software such that strings that begin with a space character " " ("are escaped with a space character") represent the literal string after the space character, while the strings that begin with "null" and "http://" and "wiki:" are interpreted as something other than a literal string.

But if the word "null" means a string has no value, then we can never store the word "null".

Only if the software is incorrectly written. If I want to store the literal string "null", " null", or " null", the software can internally represent it as " null", " null", or " null", respectively. Of course, the software has to strip out the escape character from the internal representation when it needs to display that string. (From the outside, it should be impossible to discover what the escape character is - common escape characters are " ", "\", ">", "=" - or in other words, the software should be "transparent", allowing all possible strings to be stored without any weird restrictions.)

Unfortunately, lots of software is incorrectly written. For example, lines beginning with "From " mean something special to some email software, and cannot be transmitted in the body of the email. Also, have you heard the story about the car with the license plate "NONE" (http://www.snopes.com/autos/law/noplate.asp)? (EditHint: Is there a WikiPage somewhere about "software should be transparent", with tips like

"Use

  printf("%s", data)

to print strings received from outside source, because carefully-crafted strings can crash

  printf(data)

." ?)

MySQL can give entirely different results when a field is NULL than when it contains an empty string. In fact, the documentation suggests using a zero-length CHAR field as a true boolean, because it saves space. The programmer may choose whether a NULL value means something significant, or whether it is allowed at all. In one instance, I had a table where NULL meant that the value couldn't be used (even though empty values were allowed), but I had to change the table after switching to a system that simply didn't initialize NULL fields when selecting records.

Something to consider: NULL is overloaded; it means multiple things. One of the meanings is not just "unknown", but rather BottomType, the "universal subtype", commonly used in type theory to signify a divergent computation, a fatal error.

See WhatIsNull for a listing of some of the many meanings of "null"/"nil"/etc.

There are theoretical arguments for why any function that is passed an indicator of fatal error should also return fatal error, which is why f(NULL) => NULL.

Of course, that only applies to strict functions. When using short-circuiting logic, true || Bot should be true; false || Bot should be Bot. Likewise for the and operator

It's actually a different line of argument, whether f(UNKNOWN) should equal UNKNOWN or not.

Depends on the function. If the function in question is f(x) = 3, then f(UNKNOWN) should probably be equal to 3.

It depends not just on the function, but on the whole system. Different theoreticians differ in their opinion on the example you gave. Still, my point is literally what I said: this is a different line of argument than the arguments for and against the treatment of divergence, even though the arguments can sometimes overlap. Some would say that your example should yield 3 but if passed bottom, should yield bottom... because they use different arguments for the two cases.

On the other hand, there are different theoretical arguments for "total" functions, which can accept parameters of any type at all. Under some (but not all) definitions/systems, this means the ability to have functions such that f(NULL) != NULL.

It's a rather complex set of issues, and gets very far from day to day SQL programming hassles.

-- DougMerritt

Indeed

-- ScottJohnson

Moved from RelationalLanguage

About dynamic typing: You must remember that types in the relational model serve a key purpose. A type is an integrity constraint, allowing a user to define a range of values that should be in an attribute of a tuple. Second, types enforce closure. A valid relation is a closed set. SQL tables, given all of SQL's features, aren't relations, since they allow NULLs: for example with integers, a NULL is not an integer; allowing NULLs breaks closure.) Denormalizing, in effect, breaks a relation's closure, and it has serious practical consequences in terms of integrity and the like - the concept of types are critical to the relational model.

Regarding: "SQL tables, given all of SQL's features, aren't relations, since they allow NULLs." -- I'd think that isn't a problem if you conceptualize the type as 'Maybe Integer' rather than 'Integer'. It reads 'Integer' in SQL, a 'Maybe Integer' in your mind and your code, and communication is complete; the language has done its job.

That's a common misconception. NULLs are not values. The result of an equality comparison between two NULLs is not true, for instance. NULLs introduce a three-valued logic, which has a number of consequences that complicate SQL. Many people find the complications worth dealing with, but it's important to at least understand that they exist. -- DanMuller

This is no misconception. NULLs are definitely values. NULL is an immutable concept that is stored within cells, communicated within messages, etc. It fits the very nature of what values are, just as much as concepts like 'one', 'two', and 'three'. The problem is pretending that NULL isn't a value simply because it isn't of the nominal domain associated with a particular field in an SQL relation. A quirk SQL and many RDBMSs happen to have is that the nominal domain is not the actual domain. I.e. if the nominal domain is 'Integer', but it isn't explicitly marked as being 'required', then its actual domain is 'Maybe Integer' because it might be nothing; it might be NULL. This would probably be less a conceptual stumbling block for users if the type had been clearly marked 'Maybe Integer' instead of just 'Integer', but it's something one can learn easily enough - if it reads Integer, and it isn't constrained as being required, you should read 'Maybe Integer' and write your code thusly.
Nulls do not break closure of operators over the actual domain -- Maybe Type. They do cause problems when pretending the domain is the nominal domain (Type). I.e. it's foolish to try multiplying two Maybe Integers as though they were Integers. However, you (quite legally) can trivially define all operations involving a Nothing to return a Nothing - it's one means of providing closure over the Maybe Domain. The programmers don't have a choice of that default behavior in SQL, but that's a very reasonable default.
I agree that NULLs complicate code. Their proper handling requires an extra case statement. However, where NULLS are unnecessary, they should be excluded as a possible value in that given field in the database. Where they are necessary, NULLs complicate things no more than necessary. Trying to eliminate necessary complexity is generally futile. The best one can ever do is improve clarity and reduce confusion (switching nominally to 'Maybe Integer' would probably help).
Wrgt the equality comparison between two NULLs, that depends on the conceptual nature of the particular NULLs involved. 'NULL' is heavily overloaded... probably too heavily. Equivalent(UNKNOWN,<anything>) is UNKNOWN, and requires ternary logic to process, but Equivalent(VOID,VOID) is TRUE, and Equivalent(ERROR A, ERROR B) might be TRUE, FALSE, or possibly UNKNOWN depending on the contents of A and B. Ideally, the programmer would have access to the necessary constructors to design whatever equivalency operator he or she needs for the query (e.g. If isnull(X) ...). If the '==' operator happens to treat NULL always as UNKNOWN, then so be it - the users of the SQL must simply keep that in mind while at the same time remembering that they have a Maybe Type. If the meaning of NULL can be changed, or if multiple breeds of NULL are included (e.g. VOID, UNKNOWN, EMPTY, ERROR), then the number of cases to handle increase, but the default operators can become far more precise.

NULL does not fit the nature of all values, hence NULL should not be available to all types. My thermometer cannot display NULL degrees, nor can my motorcycle go NULL miles per hour. In the context of a Temperature or MilesPerHour type, a NULL is not an appropriate value. A NULL could, however, be an appropriate value for a type that represents the measurement (or lack thereof) of these value types, but it's semantically inadequate in real-world terms. It would be more appropriate to define (for example) NotMeasuredYet and/or MeasurementMissing and/or MeasurementLost (or as many as needed) values for TemperatureMeasured or MilesPerHourMeasured (or MaybeInteger, if you like) types, along with appropriate type-specific - not universal - operators so that the logic remains binary-valued. Additional universal NULLs (of which VOID, UNKNOWN, EMPTY, etc. are examples) merely increase complexity by turning binary or ternary logic into n-valued logic, and introduce further universal values which may not belong to certain domains. TedCodd suggested the use of similar values in some of his later writings on the RelationalModel, but this approach has largely been deprecated. As DanMuller pointed out, use of NULL as a universal value for all types introduces three-valued logic, with attendant complexity. Arguably, this is unnecessary. See HughDarwen's "How to Handle Missing Information Without Using Null" (http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf) for one theoretical possibility. There are others. -- DaveVoorhis

"NULL does not fit the nature of all values, hence NULL should not be available to all types." -- You are correct, and I fully agree. Indeed, NULL is generally not representing a value of the domain represented by the data, but is instead a value of the data modeled in its relation. There are exceptions - if your thermometer is of the LCD type that sometimes fails to actually show a value, for example, then that'd be part of the domain value rather than of the database value. Got lazy and forgot to record: Nothing. If you looked, but it was blank: Just Nothing (or Error, if you wish). Record of Temperature: Just Just <temperature>. These are fundamentally different. ... Of course, in many cases it's better to represent no-data by simply not storing it. In most cases, you can get away with this.
"Additional universal NULLs (of which your VOID, UNKNOWN, EMPTY, etc. are examples) merely increase complexity by turning binary or ternary logic into n-valued logic, and introduce further universal values which may not belong to certain domains." -- Incorrect. Alternative versions of NULL do not increase the logical complexity of operations except where they increase the necessary set of logical primitives. Only UNKNOWN, in the list given, does this.
"As DanMuller pointed out, use of NULL as a universal value for all types introduces three-valued logic, with attendant complexity." -- incorrect. Use of any sort of value (including the 'Maybe' sorts) only introduces a three-valued logic when it causes logical operations over values (such as equality) to be unable to return or decide upon Boolean values. For Maybe types, if NULL is treated as VOID, this is not a problem; VOID == VOID, and nothing else is equal to VOID. If NULL is treated as UNKNOWN, then you are generally stuck with ternary logics when reasoning about your data. For certain ComplexValue?s (e.g. functions), ternary logic would be introduced anyway because such things as 'a == b' can be undecidable. Further, nitpicking just a bit, neither Dan nor I said anything about NULL being introduced as a universal value for all types, at least within my recollection. NULL is just a value of the Maybe Type, which happens to be the type of value that is stored and reasoned about in the presence of knowledge with holes. If NULL were considered a value of all types, then Maybe Integer could include the value 'Just NULL' - this would be a BadThing since NULL lacks the properties of Integer).
"See Darwen's How to Handle Missing Information Without Using Null" -- Unfortunately, the approaches advocated are neither the SimplestThings nor the most elegant across all datasets and applications. The most natural and elegant ways to deal with missing data is to either record it as missing or not record it at all. These are fundamental. Anything else is a hack. Darwen's approach, through plentiful effort and additions of constraints, does manage to eliminate the physical Nulls in the relations... but does so by creating semantic Nulls through special tables for 'unknown' and 'unsalaried' peoples. It's semantically equivalent; the missing data was properly recorded as missing, but it was far from the SimplestThing (there are now three tables to look at to discover a person's salary, for example), and gained nothing in terms of the processing one would need to do when it comes time to check the related elements in the DataBase... or at least nothing that a good indexing mechanism wouldn't have solved. They didn't even solve what they propose to be the real problem: three-valued logic is STILL necessary when one begins comparing UNKNOWN salaries to those of specified and unsalaried persons. That's critical; it's inherent in the fact that UNKNOWN and three-value-logic is a necessary complexity in any system with incomplete information. Use of the physical NULL in particular might violate Date's and Darwen's sensibilities; however, that is a problem that they've created for themselves, in their own minds... not one that naturally exists in the RelationalModel or Relations. The RelationalModel was created for modelling Data, and a record of 'No Data' is still data.

I'm somewhat baffled by the response. I was referring explicitly to what you quoted, which was about SQL. In SQL, NULLs are definitely not conceived of as values. This is spelled out in the standard (although it contradicts itself on the matter elsewhere). As you elliptically imply, to treat NULL as a value requires reinterpreting what SQL's equality operator means, and leaves you with none predefined or definable within SQL - and that's just the beginning of the complications that would arise from the reinterpretation. Also, I quite disagree that "null" is an immutable concept independent of context. NULL in SQL can't possibly mean the same thing as null in Java or NIL in Lisp, for instance, since SQL has no notion of objects (nor does the relational model). Discussions of different ways of dealing with the unknown concept (and other possible states) are legion, and you can find many on this wiki, but they're all about changes or alternatives to SQL, and thus not directly relevant to what you quoted. -- DanM

I responded to: "SQL tables, given all of SQL's features, aren't relations, since they allow NULLs." This is incorrect because allowing NULLs does not violate the fact that they are relations. The properly inferred type of those data domains where NULLs are allowed is the Maybe type, or some equivalent, and NULLs will always be values of that type. Attempting to redefine things in some screwed up manner so that "NULLs are values, but actually aren't" won't change this; it just makes fools of the standards writers because they can't recognize (or refuse to recognize) the type they're actually using.
Yes, I read most of this the first time, but it's still incorrect in the context of SQL. What value is not equal to itself? NULL = NULL is not true in SQL. Considering the SQL NULL as a value requires you to also consider SQL a language in which equality testing requires a bizarre contortion ((x IS NULL AND y IS NULL) OR (x == y)). I suppose you can do it if you must, but SQL is already tortured enough to start with that it seems an unhappy approach. -- DanM
- In the most general sense, it is undecidable when two different expressions of a single value are equivalent. The TypeClass? that offers the equivalence operator is NOT guaranteed. However, it's reasonable to note that any one NULL has the same expression as another NULL.
- Values express concepts. It is UNKNOWN whether UNKNOWN_VALUE <a> does or does not equal UNKNOWN_VALUE <b> where <a> != <b>. When NULL is being overloaded to conceptually represent unique UNKNOWN_VALUEs (so every different NULL in the entire database is conceptually a different UNIQUE_VALUE), then it is UNKNOWN whether one NULL in the database equals another NULL in the database. That's pretty simple. With this interpretation, the onus for correctness falls on SQL to alter its equality operator and other relational operators to meet this interpretation. Due to this, if you wish two different NULLs to compare as equal, then that bizarre contortion is a necessary complexity. The main problem with SQL is it has poor, tortuous syntax for specifying such cases... not the fact that Nulls happen to be representing the fact that you don't know some other fact. (One should note that comparison of UNKNOWNs is hardly the only place where MultiValuedLogic is introduced; things like '50.0 plus or minus 0.2' are also values, and questions like '50.0 plus or minus 0.2' == '49.99 plus or minus 0.2' can be very questionable, and should probably be answered with a confidence rather than a boolean.)
- Properly, UNKNOWN_VALUE <a> == UNKNOWN_VALUE <a> should be TRUE forall <a>. The main error SQL has is not in representing unique unknowns in its tables (as that's perfectly legal), but in that it doesn't track the origin of those NULLs for relational operations. At operation time, it doesn't know whether it's comparing one NULL in the database to another NULL in the database, or to the same NULL. In its failure to handle this complexity, SQL ends up introducing what are conceptually 'new' NULLs in the act of comparing a particular NULL in the database to itself, and thus violates the ClosedWorldAssumption in its relational operators. The fault here, however, is provably in those operators. The existence of NULL in those tables, even conceptually unique NULLs, violates none of the requirements for those tables to represent a subset of the cartesian product of domains.
- SQL has many flaws. I'd never argue that point with you. SQL fails to meet the definition of a truly 'relational' DML/DDL for a number of reasons (including allowance of repeated elements in 'sets', the required existence of pure presentation-layer concepts like group-by and reverse order that make no sense as operations on true relations, and the naive and sometimes incorrect handling of the UNKNOWN_VALUEs (represented as NULL) by relational operators). More generic flaws also include its tortuous and non-ergonomic syntax that grows nearly illegible for any large program (caused largely by its lack of good functional and/or macro facilities), its disgusting lack of value-description and value-representation-description capabilities for a system designed specifically to store values (which requires both describing them and representing them digitally), and its inflexibility in warping what I call the 'DataSpace' to capture desired sets of datapoints based on arbitrary criteria and distance functions (something better than just numerical inequality and string equality is needed - regular expressions, partial value matches, matches based on arbitrary criteria like 'prime number', etc.). But, despite all its flaws, conspicuously absent among any valid list of reasons as to why SQL is not relational is: "SQL tables aren't relations because they allow NULLs".
- I'll also note that the context of SQL doesn't matter to this point; the notion that relational tables don't represent relations because they allow NULLs is false in any context, no matter what NULL happens to be representing. The definition of a relation is both very simple and sufficiently broad that relations can legally accept NULLs, UNKNOWNs, and any other representable concept as perfectly valid values in their domains. And those values can further represent whatever concepts one might wish to represent (but, as mentioned above, that puts the onus on the operators because the actual value in the table is independent of the conceptual value). The fact that SQL NULLs are by default reinterpreted to the concept of unique UNKNOWNs doesn't the nature of the tables, though it does reinterpret the type from Maybe Type to MaybeKnown? Type. This sort of reinterpretation causes no problems with the underlying model because the DomainType? can be any value type in the RelationalModel. (It is not logically inconsistent with the RelationalModel to use the UniversalType? for every domain, and any other specification of a DomainTypes? is simply a constraints. It is not logically possible for a reinterpretation of a type to compromise the RelationalModel.)
The RelationalModel is about modelling data. Data comes from real-world sources. The writers of SQL did properly recognize that real-world data can be corrupt, incomplete, etc. and so NULL is reasonably allowed to represent unknown or void data (though those really should be different). Thus, NULLs are part of the type of data, if not of the underlying domain. Thus treating NULL as a value (of the data) does not actually increase the complexity of SQL at all, because you're already doing. You've got to deal with this issue anyway. If a field isn't listed as 'required', then you're stuck with dealing with the possibility of it being NULL. And while you cannot change what SQL is doing, you can gain some clarity of mind by recognizing the fact that NULL values are simply part of the Maybe type that is being used to represent data. That's what I said initially, in addition to noting that NULLs don't (and can't) actually violate the concept of Relations because Maybe Types are perfectly valid as data domains within relations. (And that's true regardless of whether you're talking about SQL in particular, or any other DML.)
"Also, I quite disagree that "null" is an immutable concept independent of context." -- Lol. I'd love to see you try to change it. All concepts, values, messages, etc. are immutable; all you can ever change is state and interpretation. When you change interpretation, you're simply saying that some concept is a stand-in for some other concept. This is, indeed, common for NULL. However, by changing your interpretation, you didn't change that underlying concept. Implying you did would be akin to saying: "I started using 1 to represent True, and 0 to represent False. Therefore I mutated the concept 1 to True and the concept 0 to False." It's a flawed statement. Concepts are incorporeal; they don't "exist" in any physical sense. If you can mutate them in any real sense, then you have far more power than me ;). However, that comes down to a proof by definition. I agree that, when you add an interpretive layer, you may begin interpreting NULL as whatever you wish (and NULL is heavily overloaded for such tasks). The only significant comment I'd add to that is: of course you can! When you add an interpretive layer, you can interpret any whimsical concept as any other whimsical concept. The value forty-two could represent the string: "What is Six times Nine?", etc. You should stick close to the original concept for communication clarity, of course. NULL is sticking close to its origin in its use as a term to represent a value of little importance outside the fact that it isn't some other value... a 'nothing'. Its conceptual semantics are naturally good for terminals and unknowns.
Whatever. I'm not interested in word games. The term "NULL" in SQL obviously means something different than the term "null" in some other contexts. Trying to contort the various uses of the term to fit an underlying universal "null" concept can be informative sometimes, and misleading or obfuscating in others. I consider this attempt to fall in the "obfuscating" category. -- DanM
- I've not played any word games, and my original statements attempt only to clarify. If you wish to stick to your own muddied conceptions of language and communication, then so be it; it isn't important to me that you in particular have a clear understanding of value, concept, state, representation, etc. However, you should note that I've already specified MANY times that the interpretation of the value NULL is highly overloaded, that it represents different things, etc. These are important phrases with very real meanings to someone who studies and understands language and computation, even if you tend to overlook them because you are unconsciously unaware of their importance.
- What the SQL standard says about NULL doesn't change the actual nature of how NULL is utilized in SQL. NULL in SQL is a value (an immutable, representable concept) that is represented digitally and stored in the table (as NULL) that, itself, represents unique UNKNOWN_VALUEs, which again are values. Because NULLs in tables are intended to represent unique UNKNOWN_VALUESs, SQL's equality operator and other relational operators should have been designed to properly handle them as such. (They weren't.) That doesn't change the fact that the NULLs (which are actually in the relation) and the unique UNKNOWN_VALUEs (which are conceptually represented by the NULLs) don't cause problems for relations.
- Use of NULL for unknowns is generally considered reasonable because the conceptual semantics of the word 'NULL' are close enough to the conceptual semantics associated with UNKNOWN_VALUE (in the sense that you can reason about neither). Even better would be to use of NULL to represent NULL (void, mu, nothing), in which case NULL == NULL is perfectly TRUE, and to then use UNKNOWN_VALUE <id> to represent UNKNOWN_VALUE <id> - these represented values would be even closer to their actual meanings, using fewer layers of translation between representation and value. However, I've never attempted to insist that the best choice of meaning for NULL is what SQL actually means by NULL. That would require more respect for SQL than I possess, plus a dash of wild delusion.

Thus SQL isn't relational. It violates the relational model by allowing users to break closure, besides its many other problems. Tutorial D, for example, is a relational language. It understands the concept of types (or domains) and ensures that relations are closed sets.

-- BrianOlsen?

May I ask what a non-closed set looks like? Would that be like ambiguities as to which rows are included?

A non-closed set would be one marked incomplete -- if something isn't marked as in the set, then you don't know whether it supposed to be in that set or not, unless it is explicitly marked as not being in that set. It can't be done in binary (i.e. a single table) because there are three conditions. For open sets, you have: it's in the set, it's not in the set, and you simply don't know. As such, you need to somehow represent these three conditions. Wrgt 'ambiguity': Open sets can be represented in a ClosedWorld?, in which case there are no ambiguities; either you know it's in the set, you know it's not in the set, or you know that you don't know whether it is in the set. And this will never change, because your ClosedWorld? will never grow any larger. Of course, true ambiguities are readily introduced even under a ClosedWorldAssumption by simply describing the undecidable... e.g. "the subset of the powerset of programs that contains only sets of equivalent programs". However, when open sets are represented in an open world (see OpenWorldAssumption) then there is always the possibility that information will be added to your knowledge base and let you 'know' something you didn't previously know. This requires either the traditional redefining of 'unknown' to be a state-based logical primitive or the use of NonMonotonicLogic, because you may no longer represent 'Unknown(X)' as a permanent 'fact' of your world.