The "thin table" approach is one proposed solution to "ridding nulls" from databases (FearOfAddingTables). However, conceptual nulls don't really disappear because in most cases, the user interface follows the "wide table" approach.
For example, suppose we had the following form:
Person ID: [_________] Name: [______________] Year of Birth: [_____] (optional) <Save>The traditional approach would save a Null to the database. Whether the database actually uses a place-holder for Null or not is an implementation decision. It does not have to physically store a null marker and we shall not assume here that it does. The actual underlying table contents may be more or less the same with either approach. In practice, most vendors opt to use an actual null marker.
The thin-table approach may have saving code like this:
if (! blank(birthYear)) { sql("insert into birthYearTable ($personID, $birthYear)"); } // [simplified for illustration purposes]But null-ness of a form still exists in "user space". If we deviate from the user viewpoint, then the code to use avoidance techniques is a bit awkward. For example, an update may resemble:
if (! blank(birthYear)) { sql("update birthYearTable set birthYear=$birthYear where personID=$personID"); } else { // clear out any existing birth-year sql("delete from birthYearTable where personID=$personID"); }The code appears to be simpler and more efficient if we just stick with the traditional approach because it maps to the user interface better: a blank/empty "spot" becomes a null. The "avoidance technique" is awkward and requires at least two queries instead of one (and more if we have more null-able columns). We have to go out of our way to step around nulls in the thin-table approach. I suppose we could automate the avoidance approach, but its basically just reinventing the wide-table approach using a custom API. It's reinventing the wheel for dogmatic purposes.
Null can be viewed as merely a linguistic shortcut. The DB is not obligated to store physical nulls and could use the thin-table approach internally without changing the external behavior. Those who claim that the thin-table approach "gets rid of nulls" while traditional approaches don't are over-focusing on common implementations. SQL does not dictate internal implementation.
--top
Practical implementations of DBMS's will generally store physical nulls for space and speed optimizations when working with 'wide' tables. However, it isn't the physical null (or lack thereof) that matters to thin-table proponents, who would be perfectly okay with the DBMS implementing a wide-table under the hood (and representing NULLs) for speed and optimization purposes. After all, the SQL does not dictate internal implementation. Rather, it is getting rid of the logical null that provides some value because it avoids the issue of null-equality. When you join over a NULL field, the result is unknown except when you join a null with itself (same NULL source). Dealing with the 'unknown' is problematic logically (you really need "maybe this row exists" features), and dealing with that known exception is problematic technically (because you need to track where each NULL came from). Thin tables avoid both of these problems. When it comes to outer-join queries, it is true that you'll end up with NULLs in the resulting view, but this is considered acceptable as it is part of the definition for outer joins... and if you want updateable views, you really need to track where everything came from anyway.
However, I think it unfair of you to keep characterizing the thin-table approach as if the main advantage it claimed was getting rid of logical nulls. Those are one (alleged) advantage of thin tables among many, and isn't even one of the greater advantages. Thin tables offer benefits more in the direction of eliminating unnecessary coupling of data and data management (allow its storage, discovery, and manipulation to be more easily divided among services), supporting YagNi (just add a table when you need it... no chance of breaking existing code that couldn't possibly be using a table you just added), consistency of interface for one->one and one->many properties, flexibility of schema (e.g. easier to change from one->one to one->many), supporting metadata (temporal properties (when it became true, when it stopped being true, when it was entered into the database, when it was removed from the database), secrecy levels, who entered it, source, etc.) on a per-fact basis, supporting a more logic-language style DML, etc. I'm not going to name or describe them all here (there was plenty of that on the other pages) but, as a thin table proponent, I can say for certain that being rid of logical nulls is not a primary motivating factor for my interest. And getting rid of physical nulls isn't even on the agenda... heck, I almost consider a denormalization optimization required to make a thin-table implementation fast enough to be practical outside an academic environment.
I disagree with the non-null related claims for thin tables. However, we already have ThreadMesses elsewhere to discuss those other issues. Nulls are mostly a tool that allows us to keep the internal view closer to the user's usual view (wide tables), for doing otherwise adds a translation cost in terms of code volume and complexity. The further you deviate from the user's view, the more effort/complexity/repetition to translate back and forth. It's that sample (almost). I'm not giving up simplicity for somebody's purity obsession. --top
Its a matter of the purpose. Sharing is a different goal than application-specific usage. The exchange system is an exchange system, not a domain database. Plus, thin tabling does not buy much worth while both in terms of both practicality and purity. I know you will disagree. Let's not muddy this topic outside of the null issue please. Create ThinTableDebateContinued? or the like if you wish, please. --top
Of course I disagree, though I'll note that 'worthwhile' depends on one's goals - I'm a distributed systems guy, and my goal is almost always to have dozens of decentralized programmers and users able to work with a project at once with minimal central authority... sharing is implied, and temporal/versioned/secure/user-tracking databases are very much worthwhile, as are the YagNi benefits. As far as your excuse "it's a matter of purpose..." goes, I seriously doubt you'd agree with what using that excuse directly implies. It implies that, if the purpose of the database is sharing rather than application-specific usage, you'd be all for aiming at the 'cleaner model' and not letting "presentation issues become hard-coded or arbitrary design issues unless necessary". Essentially, you're saying your argument about "closer to the user's usual view" doesn't apply if sharing is involved. If you agree with that, then okay. If not, then you probably need to spend some effort identifying your real reasons for your beliefs (be they rational or not) and get out of your current habit of latching onto whatever sounds good at the moment.
Even if I was guilty of all the dastardly deeds you accuse me of, talking like that does no good. Even from your perspective, if you get into the habit of talking/writing like that it will spill over into your work, and give you the reputation of being yet another sour-mouth aspberger IT guy.
"Dastardly Deeds"? That would be giving far too much credit to what is more likely a malignant ignorance of self. As far as your other points, I'll admit it: I'd make a terrible IT guy. That's why I'm a programmer, not a database technician. Dealing with ignorance and irrationality irritates me... especially when it is among people from whom I should be able to expect competence. You irritate me a great deal for this very reason - almost everything you believe seems (to me) based in ignorance and irrationality, starting with EverythingIsRelative (a straightforward self-contradiction). You've done much to reinforce this impression over time.
That's because you are a poor debater so far: you spend more text and energy to insult instead of clarify and provide specific examples.
Perhaps I should have qualified "sharing". In the other topic, I was concerned with different app languages, OS, databases, and architectures sharing the same info. However, you seem to be talking about an environment that assumes mostly the same tools and languages. That is a different animal. You are really asking for an-app-specific distributed database of sorts, not a paradigm/language-neutral exchange system/format.
I can't agree with your impression of my intent. I'm not assuming the same tools and languages, but I am assuming "sharing" means (at the very least) among different applications and services - i.e. the opposite of "app-specific", using the same data via different utilities, etc. According to one of those rare statements of yours with which I agree, a database whose data is shared among many utilities should not hard-code presentation issues for any particular application. And, yet, that is exactly what you propose above.
In the specific case of nulls, I don't see thin tables providing anything that wider tables do in general. If I weigh the two options, and one of the factors is about fitting the user's view, I see little to counteract that in thin tables, so side with the user-side view to avoid the translation tax. It's all logical and rational weighing of the trade-offs. Put the factors on the scale and see what way it tips. Of course you will disagree with me about thin tables in general (even outside of the null issue) and we are back to square one.
Would you agree with this: "All else being equal, fitting the user view is better than not fitting the user view because it reduces translation/conversion". Yes or no?
--top
Sure, I'd agree with that last statement. Or, more accurately, I'd say that having fewer layers of indirection/translation/conversion is generally a good thing. There is no mechanism to measure "fitting the user's view", but we can measure "fitting the final display" based on how much data collection and processing must be performed.
Better than null is being specific and factual, which NULL is not, I'm afraid.
Person ID: 881 Name: Johnny Muddy Year of Birth: EXPLICITLY NOT FILLED IN DUE TO PRIVACY Person ID: 782 Name: Bill Smith Year of Birth: 1983 Person ID: 23523 Name: Joe Smoker Year of Birth: COULD NOT READ HANDWRITING Person ID: 125 Name: Baby Janet Year of Birth: NOT BORN YETThe problem with null is you don't know the facts. The above solution gives you exact facts. This is why Codd was proposing several nulls instead of one null. I think proposing several nulls is bad and that one needs to make his own specific factual error information, avoiding null all together. If the form field was not filled in, then say so. Don't give useless information like "NULL".
How to implement this? Using an enumeration or a table that references the above "nulls". The above "nulls" are not nulls at all - they are facts.
The above solution does not disturb the data and fragment it into thin tables as darwen/date suggest... but the problem is the above idea does not have any implementation. It can cause issues too such as "SELECT * FROM table WHERE YearOfBirth? < 20". Does "not born yet" mean the person is less than 20? All nulls should be ignored in what cases? But these are not nulls - these are facts... so "not born yet" would mean the person could be less than 20.. but implementing this special logic into a database is tough.
"Could not read handwriting" gives us a clue so we can call the person on the phone and ask them their year of birth and correct this missing data. Where as a NULL gives us no information about what we need to do about the data - the null could mean that the person did not want the year of birth filled in, or it could mean the person forgot to fill it in. Null offers no facts. "Explicitly not filled in" means the person did not fill in the data because they did not want the year of birth shown, it was private to them. Once again these are all facts.
The problem is the complexity of describing data. Not everything is just a number from 0-10 or a year from 1900-2100.
Dynamic/flag-free typing can easily solve this: "Year" could store numbers as well as text, perhaps validated against a ConstantTable. Done! If we can't do that for some reason, then simply have a separate "reasonForMissingBirthYear" column. I don't think we need to invent more column conventions/types to solve such. --top
Validated against a constant table Top? Do you know what you are doing? You are being a hypocrite and supporting the thin table model. I don't know how many times I have to repeat this: your constant table is just a thin table. Now it is of course a different thin table than what Darwen/Date suggest but you have to stop attacking and inflaming "thin tables" since you yourself recommend them. And you don't even realize your hypocrisy. So it is kind of a violent agreement but not a full agreement.
BickerFlag: I never claimed that ALL tables should be wide. Ever. Thus, there is nothing to be "hypocritical" against. I committed no e-sin and deserve an apology. You are again fast and loose with your accusations. Is your code the same way? Perhaps "mixed-width" table is a better description if what I prefer if that makes you happier.
Let's just say for this context that we are comparing dividing to remove nulls against not dividing. Hopefully this will avoid a wider definitional debate about table widths. --top
As for inventing a new type for it.. a constant table and a type overlap.. since validating it against a constant table would be a form of a type system. Your validation means "what type of error is this?". How do you find out what type of error it is? You look it up in the constant table. Actually the constant table is NOT constant.. because it can change. The errors can be expanded. But I guess it is constant for a longer period of time than say regular data.
Everything (or at least most things) *can* be viewed as types if one wants to. I don't dispute that. But in this context, we are not assuming DB-level support for that kind of thing. It is an app-level "type system" if you will. --top
I really like the example above. It showns an approach that embraces the user and his needs and problems and doesn't force a model on him ("enter a date, only date and nothing but a date"). The problems are of course how to deal with these cases. We could imagine lots of further cases - some of them parameterized further (like: only year given). And each case might need to be handled differently with respect to querries like the SELECT above. In the end these data fields can only be dealt with appropriately with complex sub relations and probably modeled by complex classes in the application. But then you quickly loose the advantages of the relational model in the database: There are no longer simple tables and simple relations if even a birth date is represented by lots of tables. Then you could as well use an object store - it is as accessible. No. The strength of the database approach lies in the reduction of the data to simple types. Otherwise we need a layer in between the application and the elementary relational core representation. And this layer doesn't exist and will not exist shortly. -- GunnarZarncke
I agree that LifeIsaBigMessyGraph and modeling all the potential nuances can be daunting. But what would the UI look like for this kind of thing anyhow? What could complicate this even more is if only parts of the birth-date are known, such as the month but not the day, or day and month but not year. A criminal tracking system may have such a requirement because spies may see b-day parties through the window, etc but not be sure if the party is actually on the day or the nearest weekend. We couldn't use a traditional date for that. Textbook models often assume clean data and well-established facts. In reality, that's often not the case. In practice, it may not make sense to codify all known reasons for partial information and merely include a Note box to put that kind of information. Trying to codify missing-reason is likely PrematureClassification? in my opinion.
I think it fine to track reasons for entry failure... such things would be interesting to datamine later for statistical process control and such. But putting them in place of the normal data seems a semantic failure of the relation. Saying why you failed to get data probably belongs in another table entirely. As a potential example:
TABLE entry_error_log ------------------------- table String // names a table column String // names a column key String // {name=value,name=value} for easy parsing time_error Datetime // time of actual entry failure (if known) time_record Datetime Autotime // time error was recorded attention Bool // true if this should receive human attention description String // string describing why the data could not be recordedI've encountered a few cases where I wish I could have switched the DB over to "soft integrity" so that it would at least run without stops for a short period of time, but problems could be tracked for later analysis. Sometimes LimpVersusDie is more practical, just as it is in the animal kingdom. --top