Nulls And Relational Model

ChrisDate argues at length against ThreeValuedLogic in the context of database systems in the book AnIntroductionToDatabaseSystems (and in many of his other writings). Much of his objection stems from the presence of nulls in many database products (which force ThreeValuedLogic to be used everywhere - a bit bizarre, considering that RelationalDatabases make the ClosedWorldAssumption typically).

That doesn't make sense. "ClosedWorldAssumption" appears to mean "not three valued". But SQL RDBMS databases (by far the majority in today's world) certainly do use NULL. So you seem to be saying that it's bizarre that RelationalDatabases use NULLs considering they don't use NULLs. Please clarify.

[What I meant was -- NULLs introduce "unknown" into a ClosedWorld?, which without NULLs would have no need for ThreeValuedLogic. Date argues that this introduces quite a few wrenches in the works of relational theory. (Date also argues that SQL RDBMS's are poor implementations of the theory - he is a committed theorist who decries much of common practice).

Of course, the problem may be the ClosedWorldAssumption. It assumes that everything is known a priori; if something isn't explicitly or deductively true, then it's false. However, unknowns are present in the RealWorld, and something that needs to be dealt with.

Date's solution - refactoring any attributes that might be NULL into their own relation (table) - and including a tuple (record) into the table if the relationship is defined; excluding it otherwise - seems attractive, as it avoids NULL. On the other hand, it has its own particular flaw - getting back to the ClosedWorldAssumption. The absence of a particular tuple doesn't mean "unknown", it means "false". The problem with NULLs (one of them) is that they force a "conversion to unknown" early in the process, and thus lose information. Date gives several examples where this can lead to incorrect query results. -- ScottJohnson]

I'm familiar with Date's arguments. What I mean is that your phrasing seems self-contradictory. It's not that it's bizarre for an SQL RDBMS to be ClosedWorldAssumption and to have NULLs, it's that it is impossible. Date is saying it shouldn't have NULLs, in which case it would be closed world.

But since it does have NULLs, it is not closed world.

Of course, it isn't a traditional open world, either. It's a world of "assume any proposition is false, unless explicitly declared to be true or unknown". How the difference in default assumptions affects the underlying theory, I do not know.

I think this is just an issue of phrasing; I don't think I'm arguing about the gist of your point.

I don't think so, either. But spending the day hashing out such nuances can be so much fun. :)

But getting to that: Date is a bright guy, and certainly knows the subject backwards and forwards, and has many enlightening things to say, but I don't think that he is quite the last word on ThreeValuedLogic outside of RDBMSes, and even in the world of SQL there is room for some debate once one has grasped his point. (Most of the problems stem from people unaware of his arguments or alternatives.)

Outside of databases, multi-valued logic has a number of different interesting branches in pure math, some of which have practical applications.

-- DougMerritt

Agreed. -- ScottJohnson


Date's solution should be rejected right away, on the simple ground that it's utterly inelegant and raises the level of complexity to ridiculous levels. Here are the final queries reconstituting a table that had in the beginning two nullable columns (admittedly null has in each column two different interpretation. From http://www.hughdarwen.freeola.com/TheThirdManifesto.web/Missing-info-without-nulls.pdf

 WITH (EXTEND JOB_UNK ADD ?Job unknown? AS Job_info) AS T1,
      (EXTEND UNEMPLOYED ADD ?Unemployed? AS Job_info) AS T2,
      (DOES_JOB RENAME (Job AS Job_info)) AS T3,
      (EXTEND SALARY_UNK ADD ?Salary unknown? AS Sal_info) AS T4,
      (EXTEND UNSALARIED ADD ?Unsalaried? AS Sal_info) AS T5,
      (EXTEND EARNS ADD CHAR(Salary) AS Sal_info) AS T6,
      (T6 { ALL BUT Salary }) AS T7,
      (UNION ( T1, T2, T3 )) AS T8,
      (UNION ( T4, T5, T7 )) AS T9,
 (JOIN ( CALLED, T8, T9 )) AS PERS_INFO :
 PERS_INFO
Of course, the query above is a monstrosity. The elementary solution is to have relational algebra over a decent type system, where one would declare:
  type salary_info= Unsalaried
                    |Unknown
                    | Salary of number
  function to_string Unsalaried= 'Unsalaried'
                     | Unknown= 'Unknown'
                     | Salary of x = string(x)
And that's it, assign the corresponding types to nullable columns create a to_string, operator for each of these types and then the above monster query becomes:
  PERS_INFO {id, name, to_string(job), to_string(salary) }
Basically, what it boils down to is that they want to supplant the primitiveness of their type systems by artificially using relations. Given a decent type systems, there's hardly any need for three valued logic. The fact that something is unknown can be encoded explicitly using an extended data type. -- CostinCozianu

It should be pointed out that Date argues (rather forcefully) for a decent type system. You and he agree (and I too, though my experience in this particular subject is limited) that the typing facilities provided by SQL are embarrassing on many levels. OTOH, Date seems to think that encoding domain-specific NULLs with specific semantics isn't very good either; I don't agree with him here.

However, I would suggest one ReFactoring, using the "type union" operator (union in the VariantRecord? sense, not in the set-theory sense, in that a domain union could represent exactly one object of exactly one of the specified types. I cannot remember the CategoryTheory name for this operator). Create a Salary type (domain), which encodes a (known) salary. Then, where the salary might not be known or applicable; define the attribute to be of type "type_union(Salary,Unknown,Unsalaried)".

Part of the problem with NULL (both in SQL and in many non-relational languages such as JavaLanguage) is that you can never exclude it from the possibilities - any variable at any time might be NULL. NiceLanguage is one attempt to fix this issue in Java - references in Nice, unless declared that NULL is a possibility, must always be bound to a valid object. CeePlusPlus references are useful for the same reason.

-- ScottJohnson

"Part of the problem with NULL (...) is that you can never exclude it from the possibilities - any variable at any time might be NULL." -- IIRC, you can mark a field as being 'required'. That eliminates the possibility of NULL.

I fully embrace the idea of typed unions as discussed by Costin, and share your disgust with the limted typing facilities provided by SQL. I just feel it's necessary to add that, with NULL and SQL's limited typing, the data type already is:

  type salary_info= Null
                  | Just a Number
  function to_string :: salary_info -> String = \x ->
             x ? Just n  => to_string(n)
               | Null    => "null"                 -- or some other string, often empty
This is, of course, limited to the binary (eqv. to the Maybe type); SQL simply doesn't provide better options.

I'd also note neither our solutions nor even Darwen's solve Darwen's or Date's pet peeve with ThreeValuedLogic, nor even fix the ClosedWorldAssumption, when NULL is used to represent unique unknown values. Using your salary_info:

  type salary_info= Unsalaried
                  | Unknown
                  | Salary of number
  type logic_primitive= True | False | Unknown
  function operator ">" :: salary_info -> salary_info -> logic_primitive =
             -- (a naive implementation; violates ClosedWorldAssumption)
             \sl sr -> (sl,sr) ? (Salary of n, Salary of m) => (n > m)
                               | (Salary of n, Unsalaried)  => True
                               | (Unsalaried, Salary of m)  => False
                               | (Unsalaried, Unsalaried)   => False
                               | (Unknown, _)               => Unknown  
                               | (_,Unknown)                => Unknown
If you don't see it, the problem is that if sl and sr are the same 'Unknown', then sl > sr is provably False.

One correct approach:

 type salary_info = Unsalaried
                  | Unknown UniqueId?
                  | Salary of number
 type logic_primitive= True | False | Unknown
 function operator ">" :: salary_info -> salary_info -> logic_primitive =
             -- '#' in this syntax is a guard
             \sl sr -> (sl,sr) ? (Salary of n, Salary of m)                => (n > m)
                               | (Salary of n, Unsalaried)                 => True
                               | (Unsalaried, Salary of m)                 => False
                               | (Unsalaried, Unsalaried)                  => False
                               | (Unknown idl, Unknown idr) # (idl == idr) => False  
                               | (Unknown, _)                              => Unknown
                               | (_,Unknown)                               => Unknown

Another correct approach:
  type salary_info = Unsalaried
                   | (implicitly unique) Unknown
                   | Salary of number
  type logic_primitive= True | False | Unknown
  function operator ">" :: salary_info -> salary_info -> logic_primitive =
             \sl sr -> (sl,sr) ? _ # implicit_eq sl sr      => False
                               | (Salary of n, Salary of m) => (n > m)
                               | (Salary of n, Unsalaried)  => True
                               | (Unsalaried, Salary of m)  => False
                               | (Unsalaried, Unsalaried)   => False
                               | (Unknown, _)               => Unknown  
                               | (_,Unknown)                => Unknown
... where 'implicit_eq' and 'implicitly unique' are key words to the language utilized when translating this declarative language into a computation. It turns uniqueness into an implementation detail. (One could, for example, use only pass-by-reference and test address sl == address sr... or one could literally create unique ids, as the other correct approach indicates).

More correct approaches are possible. Moving operators up one or more levels to records, relations, or even operation over other relational operations, all provide for possible correct approaches.

Use of NULL as representing implicitly unique Unknowns is a valid approach, and does not inherently harm a RelationalModel. Implicitly unique values still cause a snag for constructing 'new' relations (e.g. adding a new entry with this or that NULL from another entry), but the RelationalModel (being mathematical) doesn't require that ability; any defined relation is fully described by the relational operators that created it.

(Of course, the fact that SQL fails to implement the correct relational operators to handle this particular use of NULL does violate the ClosedWorldAssumption and relational equivalence for SQL. It's important to recognize the correct reasons for this, though, so the correct parts are fixed.)


[Moved from NullConsideredHarmful]


ChrisDate has rallied against Nulls in database languages. He suggests they create a ThreeValuedLogic, which is far more complicated to grok than 2-value Boolean (True and False). I tend to agree. Nulls cause more problems than they solve.

What really tics me off is nulls in string concatenation. For example:

  x = a || b || c
where "||" means concatenate. In some systems if a or b or c is Null, then the whole fricken result is Null. This is unacceptable and ripe for bugs and confusing to newbies IMO. The Null becomes like a poison-pill where the presence of just one ruins the whole expression. Sort of how mad-cow disease works. One ends up having to rap each item in a null-fixer function. Concatenation should treat nulls like a blank of zero length. Similar things can happen in a math expression, but the result is not as error-prone as concatenation in my experience.

I'm curious as to how much that opinion is based on unexpected results. Of course the "proper" (that is to say SQL92 standard) way to concatenate strings where you don't want propagation of NULL is with the COALESCE() function, e.g.,

x = COALESCE(a, ") || COALESCE(b, ") || COALESCE(c, ");
in other words, there are functions that are built to handle the realities of 3-value logic in SQL. Of course if you use Oracle 8, wherein
x = a || b || c;
produces the same results as above (i.e., Oracle 8 is not standard compliant), you might be a bit startled when you start using a SQL92 compliant db and get a NULL result. I however, chalk that up to learning. If one accepts NULL as a necessary part of robust data sets that attempt to model real-world systems (see below on that) then one needs to articulate the logic of NULL values and have ways of dealing with that logic consistently. To simply decree "no NULLs" and try to avoid the issue then one assumes either (a) universal default values (this approach, by the way, doesn't work) or (b) different rules for propagation of NULL in expressions. The latter approach, (b), can be made to work but it amounts to arguing with certain details of the standard and instead needing other to-be-standardized ways of dealing with NULL values: i.e., check each and every value when one cares about whether there's a NULL. It amounts to a preference of default behavior that one would rather see. The formal logic remains unchanged.


Repetition is a Smell

Re: Of course the "proper" (that is to say SQL92 standard) way to concatenate strings where you don't want propagation of NULL is with the COALESCE() function

Yes, but one often has to keep using it all over the place. Something that is repeated often is probably a design smell, or at least a yellow alert. I would much rather have concatenation that treated Null strings as a zero-length string rather than a poison pill.


Regarding databases, I think Nulls are acceptable in numbers or dates, but not strings. (Whether nulls even there mess up the "purity" of relational, as ChrisDate contends, I can't really say.) A zero length string can represent an "empty" string. We don't need nulls for that.

But what about the case in which a zero-length string is a valid value? I run into that conundrum all the time.

But why would you need a null?

Say I'm writing an application that periodically polls devices via SNMP and stores the result in a database. Some of the data are strings, and it's valid for them to have any conceivable value, including an empty string. However, sometimes the device flakes out and doesn't return a value at all. One method would be to introduce another column that stores whether or not the device really returned anything; this complicates things because it requires a read to determine if a read is applicable.

(Most database API's return the quantity of result records. You should check this number, not the actual values. Further, if you issue a query and the system cannot complete the query, it should be an exception (error condition), not a value.)

Note that some databases treat nulls and zero-length strings as the same thing. I say toss, or ignore, that convention.

Why? If we accept nulls for numbers and dates for perfectly valid reasons, why would we exclude them from strings? Why wouldn't we want to be able to distinguish between zero length strings and the absence of value?

Partly because strings are more flexible than numbers and dates. You have more choices in which to indicate different kinds of information. You can have a string value of "null" (literal), for example. I have not encountered a situation where null strings would be more helpful than problematic.

A string can represent a value *plus* any possible string (including itself; QuineProgram). For example, I can write software such that strings that begin with a space character " " ("are escaped with a space character") represent the literal string after the space character, while the strings that begin with "null" and "http://" and "wiki:" are interpreted as something other than a literal string.

But if the word "null" means a string has no value, then we can never store the word "null".

Only if the software is incorrectly written. If I want to store the literal string "null", " null", or " null", the software can internally represent it as " null", " null", or " null", respectively. Of course, the software has to strip out the escape character from the internal representation when it needs to display that string. (From the outside, it should be impossible to discover what the escape character is - common escape characters are " ", "\", ">", "=" - or in other words, the software should be "transparent", allowing all possible strings to be stored without any weird restrictions.)

Unfortunately, lots of software is incorrectly written. For example, lines beginning with "From " mean something special to some email software, and cannot be transmitted in the body of the email. Also, have you heard the story about the car with the license plate "NONE" (http://www.snopes.com/autos/law/noplate.asp)? (EditHint: Is there a WikiPage somewhere about "software should be transparent", with tips like

"Use

  printf("%s", data)
to print strings received from outside source, because carefully-crafted strings can crash
  printf(data)
." ?)

MySQL can give entirely different results when a field is NULL than when it contains an empty string. In fact, the documentation suggests using a zero-length CHAR field as a true boolean, because it saves space. The programmer may choose whether a NULL value means something significant, or whether it is allowed at all. In one instance, I had a table where NULL meant that the value couldn't be used (even though empty values were allowed), but I had to change the table after switching to a system that simply didn't initialize NULL fields when selecting records.


Something to consider: NULL is overloaded; it means multiple things. One of the meanings is not just "unknown", but rather BottomType, the "universal subtype", commonly used in type theory to signify a divergent computation, a fatal error.

See WhatIsNull for a listing of some of the many meanings of "null"/"nil"/etc.

There are theoretical arguments for why any function that is passed an indicator of fatal error should also return fatal error, which is why f(NULL) => NULL.

Of course, that only applies to strict functions. When using short-circuiting logic, true || Bot should be true; false || Bot should be Bot. Likewise for the and operator

It's actually a different line of argument, whether f(UNKNOWN) should equal UNKNOWN or not.

Depends on the function. If the function in question is f(x) = 3, then f(UNKNOWN) should probably be equal to 3.

On the other hand, there are different theoretical arguments for "total" functions, which can accept parameters of any type at all. Under some (but not all) definitions/systems, this means the ability to have functions such that f(NULL) != NULL.

It's a rather complex set of issues, and gets very far from day to day SQL programming hassles.

-- DougMerritt

Indeed

-- ScottJohnson


Moved from RelationalLanguage

About dynamic typing: You must remember that types in the relational model serve a key purpose. A type is an integrity constraint, allowing a user to define a range of values that should be in an attribute of a tuple. Second, types enforce closure. A valid relation is a closed set. SQL tables, given all of SQL's features, aren't relations, since they allow NULLs: for example with integers, a NULL is not an integer; allowing NULLs breaks closure.) Denormalizing, in effect, breaks a relation's closure, and it has serious practical consequences in terms of integrity and the like - the concept of types are critical to the relational model.

Regarding: "SQL tables, given all of SQL's features, aren't relations, since they allow NULLs." -- I'd think that isn't a problem if you conceptualize the type as 'Maybe Integer' rather than 'Integer'. It reads 'Integer' in SQL, a 'Maybe Integer' in your mind and your code, and communication is complete; the language has done its job.

That's a common misconception. NULLs are not values. The result of an equality comparison between two NULLs is not true, for instance. NULLs introduce a three-valued logic, which has a number of consequences that complicate SQL. Many people find the complications worth dealing with, but it's important to at least understand that they exist. -- DanMuller

NULL does not fit the nature of all values, hence NULL should not be available to all types. My thermometer cannot display NULL degrees, nor can my motorcycle go NULL miles per hour. In the context of a Temperature or MilesPerHour type, a NULL is not an appropriate value. A NULL could, however, be an appropriate value for a type that represents the measurement (or lack thereof) of these value types, but it's semantically inadequate in real-world terms. It would be more appropriate to define (for example) NotMeasuredYet and/or MeasurementMissing and/or MeasurementLost (or as many as needed) values for TemperatureMeasured or MilesPerHourMeasured (or MaybeInteger, if you like) types, along with appropriate type-specific - not universal - operators so that the logic remains binary-valued. Additional universal NULLs (of which VOID, UNKNOWN, EMPTY, etc. are examples) merely increase complexity by turning binary or ternary logic into n-valued logic, and introduce further universal values which may not belong to certain domains. TedCodd suggested the use of similar values in some of his later writings on the RelationalModel, but this approach has largely been deprecated. As DanMuller pointed out, use of NULL as a universal value for all types introduces three-valued logic, with attendant complexity. Arguably, this is unnecessary. See HughDarwen's "How to Handle Missing Information Without Using Null" (http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf) for one theoretical possibility. There are others. -- DaveVoorhis

I'm somewhat baffled by the response. I was referring explicitly to what you quoted, which was about SQL. In SQL, NULLs are definitely not conceived of as values. This is spelled out in the standard (although it contradicts itself on the matter elsewhere). As you elliptically imply, to treat NULL as a value requires reinterpreting what SQL's equality operator means, and leaves you with none predefined or definable within SQL - and that's just the beginning of the complications that would arise from the reinterpretation. Also, I quite disagree that "null" is an immutable concept independent of context. NULL in SQL can't possibly mean the same thing as null in Java or NIL in Lisp, for instance, since SQL has no notion of objects (nor does the relational model). Discussions of different ways of dealing with the unknown concept (and other possible states) are legion, and you can find many on this wiki, but they're all about changes or alternatives to SQL, and thus not directly relevant to what you quoted. -- DanM Thus SQL isn't relational. It violates the relational model by allowing users to break closure, besides its many other problems. Tutorial D, for example, is a relational language. It understands the concept of types (or domains) and ensures that relations are closed sets.

-- BrianOlsen?

May I ask what a non-closed set looks like? Would that be like ambiguities as to which rows are included?


See also SqlFlaws, NullVersusNone


OctoberZeroSix

CategoryRelationalDatabase CategoryNull


EditText of this page (last edited July 10, 2010) or FindPage with title or text search