Database Type

Languages have type systems... arrays, integers, strings, etc.

But we don't have a type that we can declare which is a Database.... The database schema is duplicated across several programs and we end up in a horrible mess.

Eh? The Database Schema, including all constraints on uniqueness and foreign keys, is the Database Type. And if you don't think arrays, integers, and strings aren't duplicated across several programs, you'd be wrong. I do agree that we're missing common language support for Database Schema/Types, and I agree that this problem should be rectified. Top, DV, myself, and a great many others have given it more than a little thought, and RubyOnRails comes directly out of an implementation of the idea. The main problem has been that most languages lack a very flexible type-system.
- Agree, the problem with RubyOnRails which I do use.. is that it is kind of a hack... generating a lot of inefficient and bloated SQL which is hard to tweak and tune, since its kind of hidden magical DoubleDipping that is done in the back ground (i.e. object relational mapping).
- As for Top and related relational talk (including dbdebunk.com) yes I have read many of these pages. Let's link some of the content to this page and start categorizing this pattern.

People abuse lists, hashes, associate arrays, pascal records, C structs etc. as their databases. People map their databases to their hashes and lists... and even objects. Yes people even abuse OOP and map objects to their database... which is duplication.

It would be nice if the database was an actual type that we could declare and use in our program. Or at least it would be nice if the database schema were maintained in one single place instead of all this mapping and DoubleDipping. Mapping duplicates and violates OnceAndOnlyOnce. Automatic mapping is better than manual mapping at times, but it would be better to reduce the mapping to nothing... and just declare the schema and maintain the schema as if it were a native type (with the server being able to modify the schema through TCP/IP in one place too).

Ah, behold the almighty call for CodeGeneration + Compile-time Resource Acquisition. Ooh, and wouldn't it be wonderful if the languages were extensible enough that you could add this feature without breaking or reinventing the compiler? And why should you maintain the schema in your code directly if you want it to also be used by external databases, or for communication between programs? Just have an SQL file outside filled with 'CREATE TABLE' commands,import it, then generate code (or even a whole database-object!). That's the way to do it!

The problems are: database lifetime.. who creates the database? The program or a person somewhere else? How do all the programs get updated with the same database schema? (global schema would be good...)

Here's the concept in code:

unit schema; // Everyone who has access to the database using SQL also modifies this schema // this schema can be controlled remotely... not just in a text editor locally // This schema is global in that two programs do not redeclare it over and over var d: databaseSchema ( customers: id: int; name: string otherTable: foo: int; bar: int );

end.

program accessTheDb; uses schema, // we declared the schema once.. othermodule; begin // now we access a customer print(customers.name where customers.id = 123); end.

Every time the database schema is modifed on the SQL server (or some form of database server), the schema in our database type module (unit schema) is updated. Our type may/could even have more detailed information such as all the foreign keys, indexes, etc. We need to access schemas more robustly than our current solutions that are really just error prone maps.

The above schema could be the database schema itself on the server... and our programs query for this unit when we use it. The real problem with today's technology is the database has a separate life than the program... so we create maps to the database and duplicate our schemas in several programs.

The problem may be impossible to solve perfectly (we can dream, and make compromises) and the above may be an idealistic pipe dream or reinventing of existing mapping techniques...

It may also be that a unified DatabaseLanguage? built into the program must be used... which means every programming language will have to use some standard built-in database language that isn't so possible to implement in each and every language natively. i.e. reinventing SQL.

This page is here to brainstorm ways of how to get the database into our programs? Currently, the database is too far away from us and too hard to use. When I say too far away from us... I mean it is like pulling teeth just trying to get databases connected to our programs. Whereas if I want to create an Array or a Hashlist it is really easy. That's because arrays and hashlists are types... hence the page title of DatabaseType.

Toward what you're suggesting, DateAndDarwen's abstract "D" language specification in TheThirdManifesto is not just about creating replacements for SQL, but about defining the desired characteristics of general-purpose programming languages that incorporate persistent relations (via relation-valued variables, aka relvars) as first class objects. There's Microsoft's LINQ, which is like Embedded SQL done better. There is also ExtendedSetTheory, which implies unifying containers, collections, sets, persistent constructs such as files, and even external resources under one set-oriented algebra. Implementations based on these ideas, in one way or another, can easily overcome simple database integration problems within a given language, but not a broader issue that is certainly not unique to database application development: Any form of distributed application development or maintenance becomes problematic when changes to, or definitions of, a centralised, structurally-mutable service need to be deployed to multiple clients or peers.

Digtal Mars D language may conflict with the name.. ;-) A lot of the talk on dbdebunk.com is idealistic but we also have to think.. what is a compromise that we can get working on today? They seem to have all the answers but no actual solutions. I agree with a lot of Top's and Dates and dbdebunk's articles and writings. LINQ is something that peaked my interest but I see it as too Microsoft-proprietary-ish, and even a few hypocrisy issues.. (Anders H. stated that a language should not get complex and combine too many paradigms and etc. and then went right on to claim that now C# is containing more functional programming, in a video I saw on Chanel9. Have to find the link and insert it here.).
I believe DateAndDarwen's use of "D" takes precedence, unless Digital Mars' offering was around before (at least) 1995. Also, DateAndDarwen's "D" is neither a language nor a specification for a particular language. It is a detailed description of the characteristics that any RelationalLanguage should have. To illustrate its application, a sample language specification -- for a language called TutorialDee -- is provided in TheThirdManifesto. You might like TutorialDee -- its syntax is closer to PascalLanguage than CeeLanguage.
- {There have been more than a dozen languages named 'D'. Precedence is irrelevant; this is a popularity contest.}
A lot of the awkwardness of accessing DBMSes from a client language goes away when (a) the language offers dynamic constructs that facilitate accessing arbitrary ResultSets, rather than making it painful like C/C++/Java/C#/etc.; and (b) you don't take the foolish approach to database application development that treats the DBMS as a mere persistence layer for "domain objects". If you've ever created a Customers table in the database, along with an analogous Customers object or structure in the application, then you've probably produced the latter problem. SQL DBMSes make poor object persistence layers, but excellent fact processors. If you treat the application as being an equal collaboration between a well-defined database on the server side, and fact collectors/presentors on the client side -- instead of treating the client as the whole application with a (somewhat-deprecated) DBMS-based persistence mechanism -- much of the DatabaseImpedanceMismatch disappears. RubyOnRails tries to reduce the database to a persistence layer, and this is one of its flaws; a particularly awkward one in enterprise settings, where the database will typically predate and long outlive the applications that access it. -- DV
- {That's a useful summary of RubyOnRails. I now know that I don't really need to look into it.}
Note: to use Ruby or RubyOnRails one does not have to use the mapping (ActiveRecord) though.. one can still run regular queries. What I found when working with it was that I eventually looked at my console window with a bunch of SQL bloat thrown on my screen trying to figure out what it was doing. I.e. eventually one converts to SQL. If there is ever a bug in the persistence framework, eventually one will have to debug the SQL - and if one uses only the persistence framework having little knowledge about SQL, it becomes a large problem when tuning/debugging the database (and heck maintaining it externally without Ruby on hand) is required. There is this nice switch to turn on that allows the SQL to be debugged - but it spits it all out in a diarrhea storm on the console window and the eyes get sore fairly quick.
If one continues to use the persistence framework without knowing the actual underlying SQL, then one will not be able to ever maintain/update/administer his database without having ruby on hand. This is not so good for large projects which may have a dedicated database engineer working on mostly the database - although most of the projects I work on I am the database engineer and the HTML hacker, and the programmer.. but as projects grow larger I find dedicating the resources to more people means not everyone should have to know ruby just to administer and query the database. One cannot cut and paste some ruby code into his SQL Query Browser program or his PHPmyAdmin program or whatever tool he uses. More ranting on the issue is seen at http://z505.com/cgi-bin/qkcont/qkcont.cgi?p=ActiveRecord-a-Toy-For-Children

This problem is, of course, faced when developing database applications, but also exists in any environment based on client/server architectures such as CORBA, J2EE or WebServices. It centers around the need to maintain and/or deploy live applications from multiple code bases. In short, there typically must exist separate sets of client code and server code, which must be maintained and kept in agreement with each other over the entire application lifecycle. This is awkward. I believe a step in the right direction would be full-featured general-purpose programming languages designed for a distributed environment that:

Shelter the developer from any notion of code locality (i.e., what resides on a server vs. a client),
Dynamically, automatically, and invisibly deploy initial installations, subsequent component changes and/or schema & data updates to distributed nodes (servers, peers, workstations, etc.) under the control of an optimiser,
Contain constructs to facilitate access to (inevitable, and changing) external resources, whether distributed or not,
Seamlessly incorporate persistence, and some theoretically-sound data model such as the RelationalModel, and
Use a single, integrated, OnceAndOnlyOnce-facilitating source-code base to define everything from database schemas to user-interfaces, but without compromising flexibility, power, or developer ease in any particular area. Who on the development team gets to define these should be a matter of policy and security, not language capability or the intended purpose of any particular node.

We should extend Codd's 11th Rule for Relational Databases ("The distribution of portions of the database to various locations should be invisible to users of the database.") to distributed systems in general: The distribution of portions of the applications to various locations should be invisible to users and developers of the application. In short, developing and deploying distributed applications needs to be indistinguishable from developing and deploying non-distributed apps. Except where proven impossible or impractical, of course...

I suspect the third bullet point, above, will be the most difficult one to get right.

-- DaveVoorhis

I just realized recently that a RelVar is actually an instance of a DatabaseType (or, maybe more specifically, a Table Type)... for example, all variables are instances of a type (for those who think in types). So in fact a relvar is really an instance of some sort of grander type.. or RelType?. The specifications of the RelType? or DatabaseType are the schema, the constraints, etc. The types that determine attribute restraints (column restraints) are just like how a struct or record has sub fields with more type specifications per each struct or record field.

No, a relvar is an instance of a variable. A variable may possess a type which constrains the values it may contain, but a variable is not a type.

{Well it is a 'rel'var, not just any old 'var'. That implies it is named at the top level of a relational database, and that it names a mutable relation (if not mutable, not really a 'variable'). I would note that relation does not equal table... except in practice. Tables can't be infinite. Relations - subsets of cartesian products of potentially infinite domains - certainly can be.}

It is, in fact, "any old 'var'" that happens to contain a value which is a relation, and which is often but not necessarily persistent, as a relvar (short for "relation-valued variable" or "relation variable") may be temporary or local. A relvar is not necessarily at the top level of a relational database (i.e., it may be a local relvar), though the top level of a relational database consists of persistent relvars. Variables are named, relations are not. See RdbRelVar. By the way, the cardinality of a relation held by a relvar is immaterial to the definition of "relvar".

{A 'var' that necessarily identifies a relation isn't just "any old 'var'" - the proof being trivial, that I can point at a 'var' at random (e.g. one identifying an 'int') and it isn't a relvar. As far as the 'naming' goes, from my viewpoint variables never 'contain' values (values being immaterial and entirely un-containable). Variables do identify (or 'point at') and thereby give name to values - generally a temporally limited name based upon the lifetime of an object containing a representation of said value and an associated decodec. And I do know that the cardinality of a relation held by a relvar is immaterial to the definition of "relvar". That's why it would be a 'wrong answer' to say that a relvar has some sort of 'table type', because tables necessarily have finite cardinality, they being defined in terms of specifying exactly what they contain. A relvar identifies a relation, which could potentially have infinite cardinality, which is sufficient to say that it is not a 'table type'. (The statement to which you're responding was targeted originally at someone else rather than you, of course.)}

Re: "The main problem has been that most languages lack a very flexible type-system" [emph. added] -- It's one of the reasons why "tag free" languages seem more suited to databases in my opinion: you don't have to think that much about mapping the language's types to the database's types except when you do transformations. I'd estimate that typically 80% of the "cells" retrieved are simply passed along as-is to I/O or to back to the same or other tables. If the API's allow them to go out the way they came in, then we don't have to think about types much: we just marshal them around based on marshaling business logic and don't mess with most of them on a value level. (The custom biz apps I see tend to be marshal-heavy and transform/compute-light.) We don't really need a MirrorModel of the schema, which is arguably a violation of OnceAndOnlyOnce. One could argue this is fragile per schema change, but even in a type-heavy compiled language we'd probably have an app crash anyhow if the schema changes. Typically a DBA adds columns rather than changes their type or removes them. Typically both types of apps would be "blind" to column additions. Changes in the semantics or usage of columns is fairly common, but I'm not sure heavy-typing on the app side will detect significantly more of these than a tag-free approach. Usually these kinds of errors are detected when the app expects something a certain way but doesn't get it, and either pukes or produces garbage. --top

What are "tag free" languages, and how are they more suited to databases?

I've seen shops design databases under a policy that only the 'text' data type is allowed. That delivers a certain flexibility until you need to share the database among many users and applications and expect to maintain database integrity. Implementing constraints and database-side processing is onerous and fraught with peril when the only type is a character string, and it results in poor performance when the database grows.

I'm not sure allowing a column of type "database" is a good idea; it can create a nested mess. A reference to a table, I'm okay with (link). A column or "cell" with a database "inside" strikes me as playing with fire. It's putting fire in the match instead of using matches to create (external) fires. I'd like to see demonstrations of the benefits first. It risks creating a HierarchicalDatabase when we want set-oriented relational. Relationships are typically treated on an as-needed basis in relational. Nesting tends to force a hierarchical relationship that is more difficult to de-nest our view of when we don't want a nested view.

A column of type "database" is no different from a column of any other type. If we allow a column of type "image", or type "Excel file", or type "integer", why not type "database"? You don't have to use it if you don't need it, but if you do need it, what's the alternative?

In relational the default view is more or less "flat" and we add projections to give us specific data-structure-like views, not the other way around. Starting from a "flat" base is a better starting point in my opinion, because you will often have to make two transformations if you don't want a tree view: first flatten your view, and then re-project it into your desired structure. Think of the hub of a wheel where the "flat" view is the center. If you allow or encourage starting at the "tree" spoke, then to get a different view, you'll often have to first drive to the hub, and then drive to the desired spoke. Most transformations will be a shorter drive if you start at the hub. --top

In the RelationalModel there is no such thing as a "default view" nor is there any notion of "flat" or non-"flat". The RelationalModel consists of relations and a RelationalAlgebra for manipulating them. The types of the attributes of the relations can be any type, which implies that they can be relations.