Database Is Representer Of Everything

In other pages, it is argued whether the DatabaseIsRepresenterOfFacts or the DatabaseIsRepresenterOfEntities. These arguments are theoretical in nature (I should let it be known that in an ideal world, I strongly lean toward the former). However, switching from theory to practice - in modern RDBMS systems (SQL-based ones), the database is the representer of everything: Relations are used to represent the following:

Facts (relationships): Things like "the boss of X is Y" and "X works for department Z".
Entities: The actors and objects (both terms used in the general English sense, not in the CS sense) of interest to the enterprise; things like Employees and Accounts and such.
Events: Purchase orders and the like. Events represent activities, either pending or closed. Purists occasionally like to point out that closed events (those which have reached a conclusion) need not be included in the database, as they aren't necessary to discover the current state of the enterprise.
Other historical data: Employees who have been fired and such. May still be of active interest; or may be present simply due to data-retention policies.
Relational schemata: Most RDBMS's use relations to describe the schemata of other relations (much as many OO languages treat classes as FirstClass objects, without imposing any stratification to separate the domain data from the metadata.
Other RDBMS metadata: Access control tables, etc.

Whether or not this state of affairs is good or not; I won't argue (here). But it's useful to how databases are used in practice, especially when practice diverges from how they ought to be used, according to theory.

Ahem. Everything? Where in the database do you find profit = revenue - costs? -- mt
- {In the Bernie Madoff simulation.}
I'm not sure that's the strongest NitPick, since there are several ways such could be put in the database, regardless of whether it's common to do so.

And does the guy who posted the above really think he said something different then DatabaseIsRepresenterOfFacts ?

The above only presents a classification/distinction on the kinds of facts to be stored in the database. So, it is not an alternative to DatabaseIsRepresenterOfFacts, but at best a refinement. If this refinement represents a progress or is an unnecessary distinction that complicates things to no good avail is a more elaborate discussion. We know with certainty that the refinement is both a progress in some areas (the process of database design) and becomes regressive if stretched too far (one will certainly be in danger of missing the forest for the trees).

To begin with, think "data-base": a collection of data, plural from the latin datum, which means a "given", a proposition that is given to be true, a fact. So "Entities", fine: we all recognize entities around us and databases store facts about entities, also they stores facts about relationship between entities. "Events": facts about state of the world or state-transitions of the world at a given moment. "Metadata": facts about data. "Historical data": well historical facts: "at the closing bell of 2005-01-01, SUNW stock was 4$" it's still a fact even as it has the time qualifier attach to it, of course it's different in content from (SUNW stock is 4$) - currently is implied, but it is a fact nevertheless. -- CostinCozianu

(For the record, I did not create this topic. --top)

The contrasting of facts/entities/everything as the scope of a database seems to me a quibbling over alternative points of view of the same thing. The fact-based approach seems the most primitive, able to represent just about anything you want, and is more amenable to formal and semi-formal verification and manipulation by virtue of its simplicity. Everything else can just be seen as a higher-level interpretation of what the facts describe - or conversely, a higher-level abstraction which can be decomposed into a (fact-representing) relational schema. Entity-relationships diagrams have long co-existed with relational schema as a way of describing the model implemented by a database, and people rarely (until relatively recently) confused them with the schema itself.

Higher-level abstractions typically simplify some things, but make other things more difficult or impossible (without resorting to the underlying lower-level concepts). It doesn't surprise me, for instance, that the object database approach has difficulty coming up with good general query mechanisms - it's the price paid for using a higher-level abstraction. Whether the price is worthwhile or not depends on your application and the benefits you get in return.

I look forward to seeing if anyone can make a good case for an entity-based approach that replaces, rather than augments, a fact-based approach. I'm sceptical that this will happen.

-- DanMuller

Depends on what you mean as to whether that's even logically possible, never mind feasible.

The strong version of the "of Facts" approach holds that it doesn't make sense to list a fact more than once, and therefore there should not be duplicate rows. This view doesn't need further argument, it works. The actual SQL approach of allowing multiple rows is obviously not a replacement, it's an augmentation. So much for that angle.

The weak version of "Facts" is that one can conceptually consider each row to be a predicate; the presence of a row {A, B, C, D} means e.g. (A and B and C and D) = TRUE. It's hard to see how it would be possible for an alternate approach to replace that mental model. Relational DBs are, after all, based on sets and logic; how can you replace that and still be relational?

So I'm waiting to hear an interpretation in which it would even be possible.

This is not a change in my position; I may have been unclear before, but I never considered any possibility than [??] conceivably augmentation, never of replacement (at least, not within the world of relational). -- DougMerritt

OK, thanks for the clarification.

Regarding duplicates and identity: Allowing duplicates just requires an implicit notion of row identity, in order to anything at all interesting with the duplicates. Make that identity explicit, and you're back to unique rows and keys. I've never understood the arguments for duplicates, nor the arguments for introducing a notion of implicit row identity (which appear both in conversations regarding duplicates, and in those regarding the storing of objects). Common implementation features such as autokeys, cascaded updates, and cascaded deletions make it quite easy to deal with explicit row identity.

"It's hard to see how it would be possible for an alternate approach to replace that mental model." A higher level of abstraction built on top of the relational model could, in a sense, effectively replace the relational model as a mental model, just as high-level languages have effectively replaced assembly language from most programmers' point of view. (Although that's not the sense of 'replacing' that I had in mind above.) Hopefully that won't happen yet, since no higher-level model has yet emerged that actually warrants that analogy. But maybe the discussion you've started is a step towards finding such - who knows?

-- DanMuller

Re: some future higher level model: yes, exactly, that would be very interesting!

Re: implicit/explicit row identity: I'm a bit confused as to your point. The physical side of things uses something like row IDs in order to get all of its work done, not just to deal with duplicates. They shouldn't literally be made explicit in the logical model (although some vendors, like Oracle, support it). When you say "explicit", do you mean a surrogate key? That's a fully logical model value, not a physical model value, so it's quite a different issue. Cascading operations can have quite a severe performance impact, nor are they necessarily tied to all uses of surrogate keys. What was the question associated with all that? -- DougMerritt

Looking back, I'm not sure what, exactly, evoked my comments about keys and identity. I think it was a reaction to your comments about duplicates above (being an 'augmentation' in SQL, which is questionable - it might actually be a 'breakage', since sets with duplicates are not sets), and to some of the verbiage at the beginning of DatabaseIsRepresenterOfEntities on the topic. Sorry for the apparent non sequitur.

By 'explicit', I mean any key, which in turn implies a surrogate key if none other is apparent in the domain being modelled. The fact that internal row IDs exist in implementations is immaterial. All of my comments on relational databases throughout C2 refer to logical aspects, except where I've explicitly noted otherwise. The performance impact of cascading operations is an implementation issue, and need not be concern - after all, if you're using a surrogate key to provide identity for rows that would be otherwise duplicates, it's unlikely that you'll ever have a need to update that key (so cascaded updates are not a significant concern), and cascaded deletions are needed when they're needed, regardless of the cost and regardless of whether the feature is provided by the DBMS or the application (and they're likely to be faster in the former case).

Ah. Well, that's largely true of EmployeeNumber? - although I've seen them reassigned! Small-dicked new CEO with ego issues: "I demand to be employee #1!" But ok, rare operation. The question is simply whether it is rare or not. In the case of e.g. malleable trees, with surrogate key functioning to label tree node, one can't blithely assume no significant costs and ignore things as "implementation issues". Things cease to be a matter of mere petty concern about implementation when the cost rises past a pain threshold. -- DMe
True, however it's easy to conceive of implementations that store a key value and its foreign references exactly once, making such an update trivial. Since performance issues are extremely subject to implementation details, they're best left out of such discussions unless a logical view of things clearly and unavoidably forces one to deal with expensive algorithms. I'm assuming here that the discussion revolves around what could be, rather than what is available from existing DBMS implementations. Please correct me if this is the wrong scope. -- DanM
Not at all, but it's great to clarify, since it's easy for the reader to get mixed up as to whether the context is "in the ideal" versus "feasible today". E.g. earlier I made a comment on Top's query language, and even though that's his future ideal, not a real world system today, I felt I should explicitly clarify, since someone might have been thinking in terms of translating Top's syntax into SQL, and what I was suggesting in the ideal isn't feasible in today's SQL systems (that all tables should really be views except for the admin who creates the underlying base tables, so that he could fiddle with implementation without impacting existing queries).
The Representer topic includes both today's practice, where "of Fact" wins, and hypothetical future things that build on top of it, so confusion can indeed occur. In most of today's systems, trees are not well supported and can be (will be, in the general case) unacceptably slow. I've commented in the past that this isn't out of necessity, but due to happenstance; mostly, vendors haven't cared yet, aside from cashing in on fads and offering separate implementations of XML databases...sigh. -- DMe

If you agree that DatabaseIsRepresenterOfFacts is a rigorous, solid, and sufficient basis for the relational model, and that DatabaseIsRepresenterOfEntities alludes to a possible higher-level abstraction built thereon, perhaps the latter should actually be named FactsAsRepresentationOfEntities?. This would more clearly distinguish the discussion from current lamentable trends (at least in industry literature, perhaps to a lesser extent in academia) to 'rethink' database theory (from scratch) without building on the integrity of the relational model.

Hmm, interesting idea. The page clearly needs work...I'm not sure, maybe it would be better as a second page after (well, or even before) the original is fixed to better frame the issue. -- DougMerritt

-- DanMuller

It appears that the Database represents nouns, but not verbs. It may represent everything but not the operations permitted on those things. It can represent constraints (forbidden operations) but not desired ones. It therefore encourages a code/data duality. -- AnonymousDonor

Insofar as you can come up with propositions to describe concepts of interest - nouns, verbs, invoices, rodents, you can use a relational database to represent instances of those concepts. There is, for instance, nothing in relational theory that prevents you from storing facts of the form "add_integers has parameters A and B, and is implemented by the code 'return A + B'", which could be said to represent functions.

But note that the relational theory addresses only these facts, these relationships among data. To say that it encourages code/data duality is like saying "C++ variable declarations can't represent functions, so they encourage code/data duality". C++ variable declarations are one part of a whole, and similarly, relational database theory by itself makes no attempt to address everything needed to describe executable programs. (This is why I find language-specific relational database APIs to be an interesting topic.)

I hope this makes it clear why it's wrong to say that a relational database "represents nouns, but not verbs". A particular relational schema might or might not represent either.

-- DanMuller

I completely agree, but of course it's unfortunate that such things usually must be implemented in ways unknown to the language and DB, e.g. strings or blobs. It's much nicer when the underlying system knows what you're talking about.

This is orthogonal to the relational model, of course. A query language + RDBMS that supported FirstClass functions as supported types would still be just as nicely relational.

And similar comments apply to all the systems that don't support any kind of user defined type; it's not just about verbs.

And of course, none of this has to do with Representer of Facts/Everything/Entities (which is implied by Dan's comments, I'm just making it explicit). -- DougMerritt

I see no reason why DB's cannot store functions, methods, etc. I used ControlTables fairly often in my ExBase days. It could be a lot easier to add meta data to them so that things can be more easily found based on different criteria. I would hate to think that file systems are the pinnacle of organization. The only reason file systems persist is because thinking in terms of sets and views instead of trees has a longer learning curve. --top

AugustZeroFive