Does Relational Require Types

Under RelationalLanguage, RelationalModel, and others, some suggest that relational requires StaticTyping. I have not seen anything that suggests it is required. Relational does not really deal with "sub-column" issues, other than requiring operations be well-defined and consistent. I personally would like to see "dynamic relational" explored (see MultiParadigmDatabase).

This is a good question. The RelationalModel as it is usually described includes domains (types) as a central notion, but they seem redundant vis-a-vis attribute constraints on RelationalVariables in the database. (A possible indication of this is that DateAndDarwen's work, which emphasizes ManifestTyping, also downplays attribute constraints to the point of a passing mention.) So far, I can't convince myself that approaching RelationalAlgebra from a "typeless" or dynamically typed point of view does it any particular violence. Since explicit typing does offer implementation advantages, this raises the question of whether requiring attribute types in the RelationalModel is an example of confusing implementation issues with logical model requirements.

Whether "official" relational requires types or not may be separated from the question of whether dynamic relational (or whatever we call the new baby) is possible and useful. Generally, it appears that relational and types are independent. For example, if one does a join, the expression that matches rows between two tables only has to come up with a Boolean answer. How it gets that Boolean answer is and should be outside of what relational defines as long as it follows the rules. Example:

  Select * from a, b where fooble(....)

We don't know what operation fooble does. We only hope that it returns True or False for each combination of the CartesianJoin (which is the widest possible join). As long as fooble follows the rules, it should be valid under relational algebra. The rules being:

It returns a Boolean result (however Boolean is defined in the engine)
It is consistent. That is, we get the same answer each time we run it.

If these are not the minimum requirements for a relational-usable operation, then we need to settle that first.

SqLite appears to be the only database that uses manifest typing instead of column typing. Column typing could be seen as a narrow interpretation of an attribute constraint. Date never spilled a lot of ink over it, because relations are themselves supposed to be something a higher-order type, and those are more or less disjoint. Typing is more of an implementation detail that helps the RDBMS optimize storage and comparisons.

Does it in any way resemble the to-the-wind dynamic approach proposed in MultiParadigmDatabase?

[[Typing is NOT just an "implementation detail". Typing keeps the database sterile and ensures you don't mix up a string with an integer. Typing is for humans, not for computers. The fact that it optimizes in some cases is a nice bonus. Usually elegant solutions are fast enough, while inelegant solutions tend to be slow.]]

Validation can also "ensure you don't mix up a string with an integer". And being sterile in a non-sterile world can potentially bog you down such that your competitor is getting things done while you are having piety contests.

[[Piety contests, or science contests? You can validate at run time, but with programs you can statically validate at compile time or interpret time before your program goes live. How is this bad? With regard to databases, making sure that all cells in a database column will ONLY contain true or false, is an extremely helpful tool. What if someone inserts a "ture" misspelled "true" string into the boolean column? You can automatically ensure it will only contain true or false using a type system. You can manually validate it but that is going a step backward, and forcing you to do all the validation yourself! Types keep the database sterile. Humans make mistakes, we need all the help we can get. With computer programs, a compiler validates at compile time. Playing word games and redefining types as just some form of validation, or just side flags, or just pylons, isn't going to help the situation at all. A compiler validates your program as much as it can using many techniques. It seems this has become just a semantic war. Why would you want to spend time on validation, isn't that just a piety contest? why bother validating at run time or at compile time, if it is such a waste of time? Forget it all, just remove all sterility - don't wash your hands or your programs.]]

Something about this discussion is confused. Would you be happy if the only type offered was BLOB? No? Then types of some sort are important to you. In a similar vein, less-than/greater-than etc have importantly different meanings for alphabetic strings than they do for dates. If you want dates to be supported appropriately, you want types.

One does not need supported types to support comparisons. It just requires operations/functions that are specific to "types". In other words, no polymorphic overloading of operators. However, there are plenty of existing topics that debate strong/weak/none typing. I don't think we need to ignite those HolyWars here. Opinions on types are like a$$holes: everybody has one. -- top

    ...WHERE dateCompare(xDate,'>',yDate)

The static vs dynamic typing issue is secondary, but real world relational database systems could not be implemented efficiently if all types were dynamic, so it's a very important pragmatic issue. This is not a small issue, even though in the abstract it could be viewed as "six of one, half dozen of the other". RDBMS efficiency on common operations is critical in any number of ways to vast numbers of people and companies, and the speed gain from having some built-in static types beyond just BLOB is not a matter of 5% or even 20%, it's quite possibly a million-fold in some common cases. (Physical-layer handling of BLOBs lacking any static meta-information usable for optimization is extremely, extremely, inefficient, by comparison.)

You are probably right that there will be some performance penalty for dynamic or non-typing. But just like dynamic languages, you do it because it lowers the cost of developer time, not because it is faster, especially in domains where nimbleness is survival. CPU power is less and less of the bottleneck over time. Things like bus and network throughput tend to be the DB bottleneck. However, indexes may be an exception (see below). -- top

All of that really should be obvious. The apparently less-obvious issue is that user-defined types should also be heavily supported, but in general are not, and DateAndDarwen most certainly do explain why that's an issue, so again, something is confused on this page.

If DBs supported user-defined functions, they wouldn't really need user-defined types if we go the dynamic route. A string is interpreted however the UDF wants them interpreted.

That's the thing that has always confused me about Top's position on types: in terms of programming languages in general, ok, it's a deep and confusing topic, so, whatever, people have different views - but surely Top has grokked the DateAndDarwen position about types in regards to RDBMSes, yes? No?

Being a non-type fan, I haven't really given it much thought. The typing system (or lack of) should be orthogonal to the query language in my opinion. The query language only needs to know how Booleans are defined/recognized (and perhaps equality?). Outside of that, I don't see why the query languague should care about the underlying typing system of the "expression engine". For example, somebody should in theory be able to plug an imaginary-number "expression engine" into a RDBMS as long as Boolean results are somehow defined. (SQL may have basterdized such separation.)

It is roughly analogous to a "generic sorting engine", where the engine only needs to define an interface to a HigherOrderFunction[1] that returns less-than, equal, or greater-than for any comparison. How a given HOF goes about it, or what it operates on is not the concern of the sorting engine. (It is a common example found in FP and closure articles and books.)

If somebody makes a case that they should be integrated instead, I may pay more attention to type issues in relational. However, perhaps indexing can be made more efficient if it knows about a particular "type". But indexing is an implementation detail. Thus, it would not be relational algebra that is being improved by static typing, but the indexing engine. -- top

[1] The HOF reference is not meant to be an endorsement by me of HOF for all programming languages. Although, it may indeed be handy for building a meta RDBMS kit. There are also ways to obtain semi-generic sorting without HOFs.

Orthogonal system features become CrossCuttingConcerns in implementation. I agree that, to the degree possible with respect to indexing and cluster-based queries, the type system (including the potential for user-defined types) should be an 'orthogonal' feature to the relational operators. However, what you suggest with regards to date - e.g. weakening the type system on the relational side such that everything is a string or BLOB - is a violation of orthogonality, as it requires programmers to write functions and store values using the limited set of types supported by the RDBMS rather than by an 'orthogonal' set of types.

How so? DynamicRelational allows optional dynamism. That is not a bad thing, but I don't want to get into another dynamic-versus-heavy-typing debate here. We have enough of those. -t
Upon review, I agree the above was poorly phrased. What I meant to say is that it is impossible to implement typing as 'orthogonal' to relational by your approach of DynamicRelational and user-defined functions. When it comes time to design and implement an RDBMS, it is necessary to choose and implement the 'orthogonal' type system along with storage, indexing, and everything else.
For hardware-related efficiency reasons, perhaps it's best that the type system be "compiled in". But, this does not mean that relational "requires" a type system in the theoretical sense.

That said, I don't think this 'ideal' can be maintained when it comes to optimizing queries with indices, especially not when dealing with region-based and cluster-based queries (likeness, distance, partial ordering, etc.). Any practical RDBMS system is going to need some advanced knowledge of the types it is indexing, or at least of the likely queries, in order to group records for rapid access based on features or properties other than equality.

Agreed. Logical flexibility is often at odds with "physical" flexibility, AKA "performance". This is largely why the first generation of RDBMS hard-wired in the type and operation/function system. But thinking about logical separation can still result in a kind of "kit" where special domains can build domain-specific RDBMS without starting from scratch. -t

Separation of Relational Engine from Domain Math

The issue of separation of relational from the "domain expression engine" recently came up in a debate where the somebody claimed that finding the nearest N points to a given point in a geographic map system would have to traverse every point under relational, and thus be O(n). A solution is to move such operations to the domain language instead of let the query language directly do it. Example based on basterdized SQL:

   ...WHERE pointID IN nearest(myTargetID, 10)  // nearest 10 points

This way, one does not have to entirely abandon relational to get decent performance for certain kinds of operations. A domain-specific operation/function, "nearest", is introduced into the expression language here. Standard SQL may not allow such, but SQL is not the pinnacle of relational and seems to dismiss the concept of separation of domain math from relational math.

I have not seen much info about how and whether the "expression engine" can or should be considered orthogonal to the "relational engine", as long as a minimum set of requirements is met. It seems the relational engine only needs a way to know how to interpret expressions into Boolean values. Thus, any math that can produce or be interpreted as Boolean can be "inserted" into the relational domain. Beyond that, the sky is the limit. What the "cell types" are is also up to the domain.

Does anyone know of DrCodd's or ChrisDate's work on such separation?

-- top

I believe that, rather than injecting a few specialized operators like 'nearest', we should seek to identify the higher order functionality necessary to achieve that sort of function. 'nearest', for example, fundamentally involves viewing tuples as existing in a 'space' defined by some sort of distance or strength function, so perhaps a distance function should be part of the operator. Further, such arbitrary limits as '10' might be better (or 'more consistent with other relational operators') if lifted into a higher layer (i.e. ask for the nearest, then limit the overall query result to returning 10 items). In this sense we might naturally have a non-boolean 'strength' for each entry in a query result... an idea that has promise generalize readily into similar operations, and might take over for 'likeness' matches in the normal relational queries. (related: DataSpace)

Anyhow, before deciding that operations like 'nearest' belong in 'domain maths' it is probably worth looking to see if there is something more fundamental, higher order, that is going on with such operators. After all, perhaps we can find just a small fixed count of higher-order relational operators that are general enough for every region, clustering, cliques, orderings and partial orderings, and the various graph-based operation over relations that we might dream up. This isn't impossible. After all, we already have FoldFunction over inductive datatypes to prove that much. And it is well worth gaining these operators we can think of today at the higher order, since it is easier to use (and often easier to implement) higher order operators than it is to go in and tweak relational engines to support a new specialized operator.

One point is indexing for such operations. Whereas indexes for other relational operations seem to be based on individual tuples, I suspect we'd want some ability to also declare for each table some indexes regarding the common 'Distance' or reachability functions used for those tuples. Having these spaces named would allow their use in the query (nearest(point,myTable,myTable:fnDist)) to both save on typing and help ensure the indexing optimizations are discovered if available.

(An example of orthogonality of domain math and relational math based on a UseNet discussion)

Anyhow, here is an example of what I meant by orthogonal to types. Let's take a sample SQL statement:

 SELECT *
 FROM foo
 WHERE columnA + columnB > columnC

Let's use a functional syntax to represent the same thing:

 SELECT *
 FROM foo
 WHERE greaterThan(add(columnA,columnB),columnC)

Now every typical expression can be represented with nothing but column references, operations (functions) on columns, and/or other such operations. For good measure we should probably throw in constants also. The usual infix notation (like "A + B") we are familiar with is just syntactical sugar to help our eyes.

Conceptually, relational expressions should not care what these operators are as long as the final expression returns something that can be interpreted as a Boolean value in order to filter rows. We assume regular Euclidian math and Boolean algebra because that is what we are familiar with. However, I see nothing in relational algebra that excludes the idea of "custom maths" or custom types for specific domains or uses.

Somebody could make up a weird math with odd operators (using our functional syntax for familarity):

 SELECT *
 FROM foo
 WHERE zlog(drek(columnA,columnB),fleek(columnC))

As long as "zlog" returns something that can be interpreted as a Boolean expression, it can be part of relational math. The "expression engine" is generally separate from the relational engine and/or relational math. Relational math operations generally fit the form of:

  table = relationalOperator(table, expression, ...[more tables or expressions]..)

The relational operators are "fixed" (at least the base ones), but the expressions are open-ended as long as they fit the criteria already described. Relational imposes rules at the table level, but not at the cell level. Cell-level operations can do/use whatever math model they want as long as they recognize columns in some way. They are only constrained when they are "fed" to relational math. In other words, the intersection between the "table math" (relational) and the domain or custom math (expression system) can be kept fairly small to give the expression math a lot of room.

Another way to say this is that relational math and the domain/custom math can be treated as orthogonal worlds and only have to agree on a minimal set of protocols to talk to each other. All the domain math needs is adaptor operators that can convert domain math results into Boolean so that the relational world can act on it, and a way to ask the relational engine for column content. Other than being a "keeper", the relational engine does not care what is in the column cell.

In the example above, "zlog" is such an adaptor. The "fleek" operator (toward the right of example) may or may not be usable to the relational engine. We don't know without looking at its type or domain indicator.

Aggregation may require yet more rules. I'll have to ponder this one.

As far as the SELECT statement, it just calls the domain-specific functions and returns whatever the result is. Example:

  SELECT zlog(columnC)
  FROM ...

Here, function "zlog" simply returns its value and the SQL command merely passes it along as-is.

So far, it appears these are the only requirements for the "domain math" to hook into relational math:

At least some operators that return Boolean-interpretable values (called "adaptors" here)
Decomposable into operators and operands
Some of the operands in some operators accept/recognize table columns. Otherwise, we would have a WalledGarden.

Note that relational math does not need to operate directly on cell values (columns). In some cases it may provide a nice shortcut, such as if the cell values contain boolean-able values, but that is merely a bonus shortcut.

-- top

"What's more (in case you might be wondering what Codd meant by the term declared relation), references[3] and[4] both make it clear that a declared relation is a named relation ("time-varying," of course) that (a)is explicitly declared as such to the system,(b)is described in the system catalog, (c)can be updated (so the declared name denotes different relations — that is, different relation values — at different times), and (d)can be referenced in queries (and constraints, presumably).That looks like a relvar to me, and an explicit one to boot." -- PDF File From Hugh Darwen

"From: Fabian Pascal: It is not the applications, but the DBMS that enforces integrity constraints, that is the whole point of database management. Before we had databases/DBMSs we had only applications and files and then applications enforced integrity constraints. Database management was invented to eliminate all the problems this created." ..."The DBMS enforces integrity constraints for all applications." --http://www.dbdebunk.com/page/page/859852.htm

[p.s. sorry to make direct links to the dbdebunk website (as maybe Fabian doesn't like it), but how the heak is anyone supposed to reference his website sensibly if he doesn't want me deep linking? It is impossible to sensibly conform to his request of just linking to his home page, IMO. I'd even volunteer to maintain his website for him and secure/improve it if he doesn't like me doing this deep linking.. as websites should be designed in a way where you can send people to pages! In addition I'd even help dbdebunk make their site more navigate-able if they wanted.]

"D shall permit compile-time type checking."..."By this prescription, we mean that insofar as feasible--it shall be possible to check at compilation time that no type error can occur at run time. This requirement does not preclude the possibility of "compile and go" or interpretive implementations. " --From TheThirdManifesto

In other words, striving for idealism and perfection is what we should aim for in a type system and DBMS - but compromises can be made where they must be made.

A domain is a named set of values. Such values, which shall be of arbitrary complexity, shall be manipulable solely by means of the operators defined for the domain(s) in question (see RM Prescription 3 and OO Proscription 3)---i.e., domain values shall be encapsulated (except as noted under RM Prescription 4). For each domain, a notation shall be available for the explicit specification (or "construction") of an arbitrary value from that domain." ... "We treat the terms domain and data type (type for short) as synonymous and interchangeable. The term object class is also sometimes used with the same meaning, but we do not use this latter term. --From TheThirdManifesto

Above sheds some light on domains and types and the importance of them.

From DbDebunk website: "In Parts 1 and 2 of this article, I showed among other things that each relvar has a relvar predicate and the overall database has a database predicate. Now, I hope it's obvious that the predicates in question are all ones that are "understood by the DBMS": They're stated formally (they're part of the database definition, in fact), and of course they're enforced by the DBMS, too. For such reasons, it's convenient to refer to the predicates in question, on occasion, as internal predicates specifically--principally because relvars and databases also have external predicates, which I want to discuss next." --http://www.dbdebunk.com/page/page/622397.htm

From C. Date, in a DbDebunk article: "Ideally, therefore, the DBMS would know and understand the external predicate for every relvar, so that it could deal correctly with all possible attempts to update the database. Unfortunately, however, we've already seen that this goal is unachievable; that is, the DBMS cannot know any given external predicate. But it does know a good approximation: It knows the corresponding internal predicate--and that's what it will enforce. (Thus, the pragmatic "criterion for acceptability of updates," as opposed to the ideal one, is the internal predicate, not the external one.)" --http://www.dbdebunk.com/page/page/622397.htm

In other words.. type enforcement (named sets, domains, et. al) is extremely important with regards to database integrity.. and all the above quotes should shed a lot of light, hopefully clarifying much confusion on this subject.

{Regardless of what we call it, it seems there may be some value, practical or experimental, in separating them. Why should relational be defined in terms of domain-specific or math-specific operations? The "language" of the predicate system could be independent (beyond minimum requirements needed to use relational). Any notation or math that can return a Boolean result, provide equality operations, and can have discrete elements definable as columns could participate in most of not all basic relational operations as documented by C. J. Date. This could open up relational to more domains, and separate concerns.}

I'm not clear what you're suggesting here. Are you suggesting separating type systems from the relational model? If so, in effect that is already the case. In DrCodd's original description, each attribute has a type (aka domain), but beyond that the model itself need not refer to types. The RelationalModel will, in theory, support any valid type system that allows the full RelationalAlgebra to be computable. There may be some confusion because DateAndDarwen advocate a particular type system (DateAndDarwensTypeSystem) they believe particularly suited to the relational model. In TheThirdManifesto, discussion of the RelationalModel and DateAndDarwensTypeSystem are sufficiently coupled to perhaps cause the mistaken impression that DateAndDarwensTypeSystem is part of the RelationalModel. It isn't, nor is any other particular type system. There is some debate over whether to employ StaticTyping or DynamicTyping, and whether WeakTyping is acceptable or not, and so forth, but these are part of the broader debates about approaches to types in programming. Types are essentially orthogonal to the RelationalModel.

It is to note FabianPascal addresses it from a practical standpoint. i.e. without a type system, what is a database? SQLite? In other words - how would you ever implement a reasonable relational model without a type system? What values can exist in a relational model without them being some type of thing? Blobs of text, or blobs of clearly defined things like integers and strings? How can an integer value be inside a relational database without the relational model supporting that something is an integer? It just exists as a blob of SQLite text? This is a debatable practical vs model issue. Model cars still have tires and paint and doors and steering wheels, although they could get away without any of that and just be four wheels on a block of wood. What is a model?

I didn't write that the RelationalModel should be typeless. In fact, I wrote that "each attribute has a type." Note that the RelationalModel is not a database. However, many database systems are implemented using the RelationalModel (or something roughly derived from it, in the case of SqlLanguage), which often leads to a mental conflation of the two. Whilst you cannot implement the RelationalModel without a type system, you can implement the RelationalModel with a variety of different type systems. Which of these are suitable, or not suitable, for a particular database is a different debate. It's worth noting that the applications to which SQLite is usually applied are quite different (and should be!) from those to which Oracle is usually applied.

DynamicRelational discusses a weak-typed approach. Basically it follows conventions of weak-typed languages: everything can be viewed more or less as a string, and operations parse the strings as types if and as needed. (This does not necessarily mean that an integer has to be stored as a string but more so that explicit columns types are not required, at least not as a default.) It may have value in agile or RAD development.
Everything can be viewed more or less as a string. LOL, what an overly simplistic view of computer science. Really, a JPG image or BMP image is just a string and is not much different than say a BOOLEAN value which can only be true or false? A count column that can only be a number, is just another string which can hold a BMP image or JPG image even though it is only restricted to numbers? What nonsense! A database must have a proper type system or it is just a gibberish storage system.
You don't need "internal" types for that. You seem naive about how to use databases. One could create a schema such as:

images (table) ------- Name // Similar to file-name Format_Type // bmp, jpg, gif, etc. Bytes

So on paper - with a relational model - how does this model put integers into it's model without saying that something is an integer? Maybe it could support a TypeModel without being a specific Type System. For example, types can exist without being a specific implementation of that type system.

The RelationalModel can, to put it simply and in general, support any type system that exposes an operation for determining the equality of attribute values of equivalent type. Other operations on types may indeed lie outside the RelationalModel's view of the type system. Practical implementations usually require (at minimum) an operation for performing an ordinal comparison on attribute values of equivalent type, but may also support attribute values with no exposed operations (i.e., unknown type) as long as these attributes do not participate in certain roles in the RelationalAlgebra, for example, they are not the common attribute(s) in a JOIN.

By the way, I presume by your mention of "on paper" that you are suggesting a non-database implementation of the RelationalModel is a purely theoretical construct. Not so. The RelationalModel can be implemented as a useful system for manipulating and deriving facts -- or, more informally, for managing collections of attributes -- without necessarily having any of the characteristics usually ascribed to a database system, such as persistence, ACID compliance, etc.

By on paper, I think what was meant was something similar to math. How can you teach math, without introducing a type system? You can do math on paper so let's use it as an example. In math we have types of numbers, which are extremely useful and neccessary for humanity.

http://www.purplemath.com/modules/numtypes.htm

Some people, oddly, think types are not that useful in programs. EducationHasFailedUs. One can beat around the bush and pretend types aren't really needed to do math, but really they are. Since the relational model is mathematical, and since math contains types of numbers, it follows logically that without types, we are doomed and screwed. Possibly types need to be introduced in math class before the student reaches computer science and even begins on the relational model. In fact most math classes do introduce children to number types, but it seems people forget the usefulness of number types or something, and I think EducationHasFailedUs when I see pages like DoesRelationalRequireTypes and ThereAreNoTypes, along with statements such as "types are just side flags" and other such nonsense on this wiki. DoesMathRequireTypes ? Maybe math doesn't require types if you prefer to DodgeTheIssue and play word games, and make things more murky (MuddyTheWater?).

{Yes, let's not DodgeTheIssue, and instead create an iron-clad definition of types before bitching about type definitions you don't personally like. TopsTypeDeterminatorChallenge.}

"See DoesRelationalRequireTypes" was inserted, seemingly randomly, on a number of pages related to the RelationalModel where it had no relevance at all. I have removed it from those pages. -- DaveVoorhis

CategoryRelationalDatabase CategoryLanguageTyping