I created this page to describe some things about a database API I've been developing at work for some time now, to answer the curiosity of another wikizen. Unfortunately, I'm not at liberty to share the code, but I've occasionally thought about starting an Open Source version.
The API is based on Tutorial D, described in TheThirdManifesto. My main goals are:
The basic approach is to supply a core system, the primary purpose of which is to allow the construction of data structures that represent types (domains), individual database values, tuples, and query expressions. A separate subsystem is responsible for translating the expression trees into DBMS queries. (Via SQL, although there's nothing inherent to the first system that mandates this.) Finally, the schema designers provide a layer on top of all this which defines types, base relations, predefined views, and data constraints that make up a specific logical schema.
The read-only part of this system is largely working as I write this (Jan 2003), although a version of our software using it hasn't been released yet.
To give a flavor of what C++ application code ends up looking like, here's an example off the top of my head showing the definition and processing of a query. This would make more sense if I provided information on the example schema, but use your imagination:
void PrintEmployeeProjectManagers?(CDatabase& db) { CTupleList tl(db["Employees"] .Join("Projects") .Join(db["Employees"] .Rename("ID", "ManagerID") .Rename("Name", "ManagerName?")) .DropAllBut?(attr("Name") | attr("ManagerName?"))); for (CTupleList::iterator i = tl.begin(); i != tl.end(); ++i) std::cout << (*i)["Name"] << " " << (*i)["ManagerName?"] << std::endl; }The joins are, implicitly, inner joins. Outer joins are also supported, but since the system does not support null attribute states, the syntax for outer joins is a little more complicated and requires the expression to supply default values.
What about just supplying blanks for strings and 0's for numbers? Such defaults could be in a system table so that things like default dates can also be supplied. It could be a cluttery mess to supply defaults to each column.
The hairiest part of the current implementation is, by far, the SQL generator. Jet's SQL dialect has many odd limitations and quirks that must be worked around in various ways. It is certainly possible at the moment to write queries that will produce an exception instead of an SQL string, but my users haven't discovered any of those recently. :-)
The SQL generator is basically a Visitor class that walks the expression tree and builds a different expression tree using data structures that are tied more closely to Jet SQL's structure. This is then finally turned into an SQL string when the query is executed to produce a tuple list.
The operators that are currently available include:
The type system is problematic and has just undergone a lot of revision. Tutorial D's types have a lot of restrictions that C++ types don't, so there's not a straightforward mapping between an arbitrary C++ class and a database type.
The write path will be very interesting when I get around to working on it. It should be possible to make just about any view updatable, provided that there are default values defined for attributes not exposed by the view. Since the system will have an algebraic expression tree that defines each view in terms of base tables, I'll be able to implement some of the suggestions from the relational theory literature on handling updates through joins, unions, etc., breaking each write down into a series of updates on base tables.
-- DanMuller
Some questions and comments:
My C++ is rusty. How do you propose things like WHERE clause Boolean expressions be defined? Also, is it possible perhaps to make a more language-neutral API?
To be practical, these would have to be translatable into SQL for use on existing machines. If it is superior to SQL, that might be difficult.
Further, I cannot find a non-fee version of the Tutorial D language beyond some cryptic syntax charts. Do you have any leads?
I am glad to see somebody working on such things, though. SQL needs an overhaul. It is the COBOL of relational query languages IMO. Keep up the good work!
Thanks for the encouragement!
Boolean expressions based on references to attributes can be represented in the syntax trees. The Restrict operation (which I forgot in my first draft of this page) is used to apply a boolean expression to a relation.
API's to make syntax trees? That does not sound very pleasant. I think I would prefer just passing on a Boolean expression to the database engine (or syntax builder) as a string.
Regarding Tutorial D, a copy of an LALR(1) grammar for it is available on the authors' web site, http://www.thethirdmanifesto.com. Or you could look for an inexpensive used copy of TheThirdManifesto. It's not easy reading, but if you're seriously interested in database theory, I think it's rewarding. Stay away from the first edition; the multiple inheritance scheme is much revised and better thought out in the second.
(Discussion regarding Chris Date moved to ChrisDate.)
A similar API is available in HaskellLanguage: HaskellDb. A CeePlusPlus API could have similar design.
2004-10-13 DanMuller: Work has continued, sporadically, on this system. A lot of my time, unfortunately, is spent working around limitations of the Jet DB engine currently underlying the project - the generated SQL has uncovered some truly bizarre bugs in the SQL interpreter.
Brief mention is now made of this project on http://www.thethirdmanifesto.com/. Too bad this can't be open-sourced.
Check out http://duro.sourceforge.net/, which must come close.
One area where I may eventually end up diverging quite a bit from Tutorial D is in the type system for scalar types. I'm no longer convinced that the strict set-based type system is worth the hassle. It may be perfectly adequate to have a looser system based on convertibility of types, together with type constraints on attributes in database relvars. In other words, shift the emphasis from domains to attribute constraints (where D&D tend to go the other way, barely mentioning attribute constraints in passing.) This direction in thinking stems in part from some exchanges with Costin Cozineau a few months ago here on C2. So far, we've been working entirely with primitive types. There hasn't been a strong push for representing application-specific types directly in the database schema, which is in itself rather interesting.
This is interesting. Any possibility of more detail?
Sure, but I've not had a chance to flesh the idea out further, so the details are just from my mental musings.
When I started working on this API (which we call CsiDb?), I was a firmly entrenched StrongTyping advocate. If I'd had my druthers, the query expressions built with this API would have inferred the specific types of attributes and expressions at compile time. There were a number of forces working against this, however:
Type checking really needs to happen only in these places:
The main disadvantage, of course, is the lack of early detection of type errors. However, I've already largely sacrificed this to satisfy the application programmers.
Working with a dynamic API in a language like C++ is inconvenient because of the compile/link/test cycle. Type errors and join errors (usually, unwanted joins on like-named attributes) are not uncommon while developing a complex query expression. What is sorely wanted is some sort of interactive tool for testing new expressions - perhaps something as simple as taking a user-supplied expression string, writing it to a C++ source file, compiling it, and dynamically loading and testing it - a sort of poor man's REPL.
The problems with Date and Darwen's type system have primarily been due to the fact that it's hard to implement efficiently in C++. Not nearly impossible, just hard. And my work to date is indicating that it's unnecessary, if one is willing to sacrifice compile-time type checking.
-- DanMuller
Someone added a reference here to RollYourOwnDatabase. Please note that this page describes an API, not a complete database system. -- DanMuller
Is the interface generally intended to go on top of existing RDBMS? If so, how are C++ expressions converted into SQL?
Yes. Not that I wouldn't find it interesting to write my own database system from scratch, but why would I? (And more importantly, why would my employer pay me to?)
As described very briefly above, the C++ expressions build data structures which are later traversed and converted to different data structures that closely mirror the structure of SQL statements, and then those are converted to actual SQL. If I were using a more capable DBMS that had a good optimizer, I could probably skip the intermediate step. As it is, I have to worry a bit about the form of the SQL, for reasons of performance. In its current incarnation, CsiDb? does a bit of optimizing, for example discarding unnecessary join terms.
I have since moved on to a project built in C# (as of the latter half of 2005). We have re-implemented a small subset of this library there, and are considering the possibility of expanding it. (One alternative is to limp along with a small subset until MicrosoftLinq becomes available, then switch to that.) I'm finding that it's not possible to make the API fit as seamlessly into the language because of C# limitations -- e.g. the logical OR and AND operators can't be overloaded. The result looks somewhat more like the awkward examples shown on ExpressionApiComplaints. -- DanMuller
See also ExpressionApiComplaints, RollYourOwnDatabase, TopsQueryLanguage, MinimalTable, TrueRelationalToPseudoRelationalMapper, TutorialDee