Extended Set Theory Storage Model

Relational math can describe the logic of the data it provides to you, but it isn't strong enough to describe the storage space upon which it is implemented. ExtendedSetTheory (XST) can model the user data, but it can also model the storage, all the way down to the bits.

Why is it a bug? Specifically, why would we want to restrict the modelling of database representation issues to have to use same theory that is used to model the abstract database structure? Especially when there is an existing theory of abstraction functions and representation invariants that is perfectly adequate?

It's a bug because we are not able at all to model using the same approach all the way down if we want to. No one says we would have to. The problem is that now, if we want to, we cannot. We are forced now to change models even if we don't want to.

It seems kind of obvious to me that a unified theory that can cover everything would be the preferred approach, just as we seek in physics and other sciences. I guess it doesn't seem that way to you.

--rj

The reason it doesn't seem that way to me is that modelling data representation in terms of abstraction functions works for any paradigm, not just relational. I.e. we already have a "unified theory". -- DavidSarahHopwood

[XST would appear to offer an alternate "unified theory". Is it a problem to have more than one such complete paradigm? -- DougMerritt]

[...] ExtendedSetTheory can model the user data, but it can also model the storage, all the way down to the bits.

I wonder why one would want to do this. You use the relational model to model the logic level. Than you use whatever you want -- hierarchies, networks, files, any kind of graph or list -- to map that to implementation. A RDBMS should have all this implementation functionality, but very separate from the logical model; and users should never deal with the physical layer or its mapping to the logical schema, only DBAs and the such should care about this.

Here's a simple example. (I don't see a good way to unthread all this, but hope it's still worth getting down.)

Imagine some simple relational table:

 | Name             | Age |
 | Bill Jones       |  36 |
 | Cassandra Walker |  29 |

This might well be implemented in some flat way on the disk: say 30 bytes of alpha for the Name, one byte for the short Age. Now suppose we want all the people whose age is less than thirty.

One way to do this is with some code that would look like this:

  Relation young = new Relation();
  foreach ( Person p in People ) 
    if (p.Age < 30)
      young.Add(p);

This would be nearly efficient, except that it has to create and destroy all these Person records, and then flatten them back out into bytes and put them on the disk.

Another way might be something like this:

  inByte= 0;
  outByte = 0;
  while (inByte<= People.Size)
    int age = People.File[inByte+30];
    if ( age < 30)
      Movebytes(People.File[inByte], Young.File[outByte], People.Recordlength);
      outByte += People.Recordlength;
    inByte += People.Recordlength;

In words, we just map to the disk, compare the byte, and if it's less than 30, we move one record from input to output.

This would be tons more efficient than the first example. The problem that an RDBMS has is that it has to figure out all those addresses and file-to-memory mappings, and finally pop down into a routine that basically does what the one just above does: whip through all the data comparing and moving bytes.

It turns out that this is quite doable, of course. Oracle proves that. But it isn't easy.

I don't understand why it isn't easy. Translating from the first example to code based on byte offsets is just what any compiler for a language with structures would do. Then memory-mapped files or similar can be used to support persistence. All well-understood technology, not requiring any new theory.

Woah, there, cowboy. "Require" would mean "isn't possible any other way", which very obviously is not the point here. The author of this page is describing what he considers to be a superior approach, not a "required" approach. As to "easy", words fail me...have you tried to do this? There is no known approach in the world that makes this stuff "easy" to implement. Maybe you mean "easy to think about", but that is rather different. Making stuff work well in implementation is usually much harder than simply finding a conceptual solution.
- I did mean "easy to think about". Yes, it's difficult to implement -- whether or not you use ExtendedSetTheory or conventional set theory. But it's not excessively difficult; in particular, it's no more difficult than writing a compiler for a programming language with structures, and the part of an OS that supports memory-mapped files. It can't be, because it's essentially the same two problems. People do this all the time without finding it useful to reconstruct set theory. Note that there are other ways of modelling representation issues (using abstraction functions) that most designers are likely to be more familiar with; it's not a matter of using ExtendedSetTheory or just "winging it".
  - Ummm...doesn't that all boil down to "there are more common/well-known ways of doing it"? I don't know about you, but I personally am always on the lookout for superior approaches, and in a sense, the less well known they are, the more interesting.
  - As for compilers, databases are a much more restricted problem, semantically, and it may be that this technique applies well only to DBs...or that, if it were better known that it worked well for DBs, maybe it would be extended to compiler implementation and become a new standard technology there.
  - The issue of OS memory-mapped files is on the other hand trivial compared with what DBs need to do, so that's an uninteresting comparison.
  - I don't have a good enough feel for the application of EST to DBs to know for sure, but I find the claims made on this page intriguing, and to me it seems a cop-out to dismiss the subject out of hand. -- dm
    - I'm not dismissing it out of hand, but I haven't seen anything yet in the papers about XST that are on-line, that I don't already know how to do with about the same degree of difficulty in other ways. (And the paper at http://xsp.xegesis.org/Xsp-uxr.pdf was positively counterproductive.)
    - Well, yes. I suppose if it's a question about optimizing your use of your time, there is not yet a clearly compelling argument that you should spend your time on this; it would be a risk with an unclear payoff.
Well-understood technology -- hmm...the history of the technology of Oracle is extremely interesting in that regard -- you've seen that, right?
But yes, it's been implemented well many times by now, so the people who put in the elbow grease understand how to do it; it's no longer rocket science. But this page is describing an approach that perhaps does not require it to be some architect's third system, which is otherwise required. If true, that's a major advance, yet one that apparently has been largely overlooked due to large company dominance -- marketing vs technology. -- DougMerritt (signed due to strong opinions ;-)
- If that thing exists at all. All sources of information presented so far, show just an apparently superficial notation artifice with an extra label attached to the fields of a tuple. From there to the wondrous claims of XST on this page, is an extremely long way to go, and in absence of scientific literature references and also in absence of conclusive implementation, there's a certain of hype attached to it. It may be understandable that some technologies fail for marketing-related reasons, but also there are hardly any references in database research literature. There are a few other claims to fame with regards to the problem of a good mathematical model for the mapping between logical schema and the physical layout, but the jury is out for the time being.

One of the putative advantages of ExtendedSetTheory is that since the math can map all the way from an abstract relation, through the relation's representation on disk (however complicated), down to bytes in memory, the whole operation above can be done by mathematical expressions, rather than by "flimsy, ill-defined operations in code".

People who have built an RDBMS in the ad-hoc way, or who don't have to build one at all, might say "OK, so what, I can still do it." And they're right. But I have some input for you.

Back in the day, teams of mine built three RDBMS products, two of which were ad hoc, and one of which was built on what we understood of XST. The latter was very neat. What it did was take the relational expression desired by the user (SQL-equivalent) and do mathematical substitutions on that expression until it came down to a series of operations to be done on storage (disk mapped to memory). Then it executed those expressions, producing the result.

It was really neat. You could implement any relational operation you wanted "just" by expressing the mathematics that defined it. It was very hard, but in the end we who had built DB both ways really felt that it was better in the XST style.

But that was long ago, and in another company, and besides, the product's dead.

That's very interesting. Is there somewhere I could learn more about that approach?

Unfortunately, there is not much really good to read. Some stuff by the originator is at http://www.xegesis.org/xsp/ . Those few who have built things based on the theory have this in common: we all got something good, and we are to this day not sure that what we did was what Dave was talking about. He's a hard man to understand ...

Well, could you give any examples you recall of how the substitutions worked in what you did?

I'll try. I don't know if they'll be evocative of how cool it is when it works.

Suppose we have a query like SELECT NAME, AGE, ADDRESS FROM PEOPLE WHERE AGE = 21.

Now the extended set theoretic model for PEOPLE means that its records look like this:

  PEOPLE ::=
  {
    { Jones^NAME, 36^AGE, 1234 Main St^ADDRESS }
    { Smith^NAME, 24^AGE, 221 Baker St^ADDRESS }
    ...
  }

And the query would like to use an input set like this:

  QUERY ::=
  {
    { 21^AGE }
  }

And the result set we want is exactly

  PEOPLE | QUERY

where | is the XST operator "Restrict", defined as

  A | B = { a^i such that a element-i A and exists b element B and a intersect b = b }

Now at this level of "abstraction", the operation is defined in terms of arbitrary elements and exponents (which are here modeling field names, although they don't have to).

But the same operation can be used, directly, to refer, for example to memory.

Suppose also that we have the data in a flat table, with 2 bytes for age, 30 for name, 50 for address.

We can imagine that we keep symbol data in a standard format like this:

PEOPLE_SYMBOLS FieldName Offset Length Age 0 2 Name 2 30 Address 32 50

The standard format, of course, can be represented in its own format:

SYMBOLS FieldName Offset Length FieldName 0 16 Offset 16 2 Length 18 2

And the data for our table looks like this (with appropriate blanks to line it up):

PEOPLE_DATA 36 Jones 1234 Main St 21 Smith 221 Baker St

QUERY_DATA 21

The operation that produces the bytes that make up our result table is

  PEOPLE_DATA | QUERY_DATA

Using the same definition of restrict as we started with. The only difference is that we are now interpreting the table as something like

  {
    { 3^0 6^1 J^2 o^3 n^4 e^5 s^6 ... }
    { 2^0 1^1 S^2 m^3 i^4 t^5 h^6 ... }
  }

and the query (of course) is { { 2^0 1^1 }}

So the result of all this is that we can parse the SQL in the "usual" way, but we only have to determine that the result we want is the original restrict.

Then we do various math to map the byte addresses around. This math effects the same table lookups one would do using ad hoc code, but, for example, it tends to use the same operations yet again.

For example, if we want to know the byte offset and length for the Name field, because we want to print it or something, it turns out that the information in question is in the record

  PEOPLE_SYMBOLS | { { Name^FieldName} }

How can I get to that operation? One way, certainly, is with the necessary ad hoc code to rip apart the symbol table, figure out a restricting record, and plug the stuff in. And the ad hoc code isn't incredibly hard to write.

However it is rather easier to write in set theory, with the right set of operations.

It turns out, in the simple case, that the system can completely bootstrap all the information it needs from a few simple facts: the location and format of a table saying where all the tables are, and the format of the SYMBOLS table. (We can get the location from the tables table.)

Most everything else then turns out to be expressions in math, very little procedural code.

When we use SQL or some other relational language, we find ourselves thinking in aggregates -- whole sets of reocrds -- most of the time.

When we implement a relational system, we usually have to think rather procedurally, or at best in terms of objects, to do the implementation. Using XST, more of our thinking and programming can be in terms of aggregates.

So it turns out to be very powerful in the hands of someone who knows how to use it.

Is it worth bootstrapping up to know how to use it? I'd not make that claim today, when there are, as far as I know, no available XST systems to work with. Things might have been different, but that's the way things are.

--RonJeffries

Cool. Your example makes it seem easy, so I'm probably missing a lot of the nuances. I'll have to think upon this. Thanks.

Hi all,

I have gotten an interest in D.L. Childs' works when I read about them in Ron's articles in 2005. I have recently come to the conclusion that XST, XSN and XSP go much further than simply saying "Well, my table is a set, and I'll be selecting all entries that have this field equal to this.". This does nothing more than SQL, and I'd say it actually does less.

Childs talks about three different things:

XST, eXtended Set Theory, which is a foundation that gives mathematical soundness to the concept of a "tuple".
- Which is nice and all, but set theory already provided mathematical soundness to the concept of a "tuple". If that is all XST provides, then it's not worth much.
- The main benefit of XST is that it doesn't accidentally express ordering properties. Also, Childs is discussing the relational tuple, not the mathematical tuple. A relational tuple is a mathematical record (named elements). This can still be modeled in classical set theory: an extended set is simply a set of pairs. But, since every set in XST is a set of pairs, the algebras can take greater advantage of that.
  - {I'd argue -- with more agreement than disagreement -- that the main benefit of XST is not that it "doesn't accidentally express ordering properties", but that it explicitly expresses ordering properties. The RelationalModel, for example, says nothing about how ordering is implemented. XST makes ordering an explicit part of the relevant XST operators. Furthermore, the notation used to indicate this ordering ("scopes") map neatly to real-world computational constructs like RAM, disk blocks, nodes in a network, attributes of a tuple, pixels on a screen, etc. This means mathematically-proven equivalences between expressions consisting of XST operators can be used to implement automated optimisation that applies equivalently whether referencing extended sets that represent RAM, disk blocks, nodes in a network, attributes of a tuple, pixels on a screen, or all of these at once. This implies that the same automated optimisation mechanisms can be applied to the lowest level of a system, the highest level of a system, and the mappings between those levels.}
- That actually makes my point. Since you can model relational tuples in set theory, and XST will be inconsistent if set theory is inconsistent; there is absolutely no benefit to XST as for as soundness is concerned. Your point about accidentally expressing unwanted properties is certainly a valid benefit; but it's its own advantage, not one tied to soundness.
- I disagree. The issue of soundness arises due to the composition and comprehension rules that XST provides to make working with extended sets relatively convenient. There is more to extended set theory than extended sets, including the primitives that make them convenient for data processing. What XST achieves is a sound, convenient, well founded, composable, one-size-fits-all aggregation type for structured information that avoids accidental expression of ordering. Fitting all these properties together is not trivial, thus only by fitting them together does XST become interesting. Thus, soundness is not a separable issues.
- Finally, what sort of advantages do you envision the algebras taking advantage of? I can see how it might make some definitions simpler (I could also see it making some more complicated), but is there anything else?
- An improved mathematical foundation tends to give you easier proofs for optimization techniques, shallower abstractions, and often a cleaner syntax.
- (BTW, the database theory books that I have on hand define a relational tuple as a mathematical one. If you don't mind, I would be interested in seeing what you are using for the relational tuple).
- Typically, a mathematical tuple is defined as an ordered pair, triple, quadruple, etc. The ordering exists by definition of tuple. IIRC, DrCodd followed this tradition in one or two of his earliest papers, but quickly decided to attribute names to the elements of the tuple to keep it user-friendly, more commutative, and simplify definition of joins. While these definitions are still mathematical in nature, they are considerably more difficult to model in set theory (the model for ordered tuples does not apply). Feel free to peruse the Wikipedia article on the subject (http://en.wikipedia.org/wiki/Tuple#Relational_model).
- Thanks. I thought it something like that, but I wanted to make sure.
XSN, eXtended Set Notation, which gives a way of conveying these mathematical objects into readable characters.
XSP, eXtended Set Processing, which is (hold tight) a model, based on XST, to describe processes that transform sets. This is based on XST transforms, and can be used to describe how XST tranforms (even simple scope transforms) can be implemented on systems.

Buried into one of the papers, there is a particular sentence I want stress. I will mention it further down.

First, Childs mentions that Codd's works on relational data models required strong Data Independence between application, model and storage, so that the performance of the storage would be determined by the storage model only, and not by the way the queries were written. Yet, in nowadays RDBMS, you have to pay attention to the way queries are written, and nobody laughs.

Basically, the application should query the model, and the model should query the storage. The applicative queries to the model should only determine what data you want. Not how you get them from storage. XSN Restrict would be one way. SQL without hints would be another.

About storage performance, one of the keypoints of applying XSP is at I/O level. The sentence that is buried down hild's papers is that: column storage is more efficient than row storage. Part of what I realized is this:

Suppose you could store a table with two columns "Name" and "FirstName?" that have 5 characters each. (for clarity)

You could represent each of your row data as a tuple of tuples like this:

 <
  <D, o, e, ., ., B, i, l, l, y>,
  <D, o, e, ., ., J, o, h, n, .>,
  <B, o, b, ., ., B, i, l, l, y>,
  <P, o, w, e, r, M, a, x, ., .>,
  <G, r, a, n, t, H, u, g, h, .>,
  <D, o, w, ., ., J, o, n, e, s>
 >

And store them in a flat file as "row storage" :

<D, o, e, ., ., B, i, l, l, y, D, o, e, ., ., J, o, h, n, .,B, o, b, ., ., B, i, l, l, y,P, o, w, e, r, M, a, x, ., .,G, r, a, n, t, H, u, g, h, ., D, o, w, ., ., J, o, n, e, s>

Suppose, now, that each one I/O gives you 4 (arbitrary but) consecutive bytes from your storage. Your query is "the first name of all names that start with D and o". Your I/Os would give you :

 <D, o, e, .>
 <B, i, l, l> (selecting the first first name)
 <y, D, o, e> (end of selecting the first first name, plus enough info on the second name)
 <J, o, h, n>
 <., B, o, b>
 <P, o, w, e>
 <G, r, a, n>
 <D, o, w, .>
 <J, o, n, e>
 <s>

For a total of 10 I/Os. Which is already improved with regards to the dumb conditioning, because I took advantage of conditioning every I/O.

Now, imagine that I store characters per "column".

 <
  <D,D,B,P,G,D>,
  <o,o,o,o,r,o>,
  <e,e,b,w,a,w>,
  <.,.,.,e,n,.>,
  <.,.,.,r,t,.>,
   ...
 >

Let's admit that I use one file for each column position. I'd be using:

2 I/Os to read the values of ALL first characters
2 I/Os to read the values of ALL second characters

At this point, I have tested all "rows" to my condition. I know exactly which data to fetch.

I'd be using 8 more I/Os to fetch all relevant data. That totals to 12 I/O, which kinda defeats the demonstration.

But wait! Look at tuples number 2, 4 and 5!

I can express them with a default in the data definition!

 2 -> 'o', unless otherwise stated. + { r^5 }
 4 -> '.' unless otherwise stated + { e^4, n^5 }
 5 -> '.' unless otherwise stated + { r^4, t^5 }

Admittedly, I can save 1 I/O on the condition on the second character. That totals to 11 I/O.

Wait, there is more! What if I used a row storage for the FirstName? column only?

(I am tired of typing, so please imagine)

I would have to fetch rows 1, 4 and 5. 4 and 5 being consecutive, 5 I/Os are enough. That totals to 5+3 = 8 I/Os with such a structure.

That does not look like a impressive improvement. Yet, this is because my I/O buffer is only 4-byte wide. If you admit that at least one fetched byte is useful per I/O, it floors the useful data ratio to 25% in any configuration, even row storage. Actual buffers are bigger than this (like 4KB) with rows have an average of 100 bytes each. If you condition on 10-byte names, 90% of the data you fetched was wasted if the row does not match. You can only condition 40 rows per I/O. With a column storage, you can condition 4096 rows in one shot.

XSP allows you to express such transformations from (e.g.) "SQL tables" to storage and has mathematical soundness to prove the equivalences. (and content preservation) The performance would then be dependent on the way the table-storage equivalence is implemented, and not on the query stating "HAVING" or "WHERE (SELECT .. WHERE)".

Other things can be imagined:

You could store the same dataset on two disks, one with column storage, one with row storage. And you coud use the column storage to condition, and while the row numbers matching the queries are arriving through the first pipe, you could use the second one to fetch the relevant rows. (And since both storage store the same data set - or equivalently so - if one storage burns, you can reconstruct it from the other. That would not be enough, but you get the point. Storage also means "memory".)
You could imagine filling your disk buffers with ~100% useful data, if only they would accept sets as queries. The data access bottleneck would then be systems processing, and not I/O anymore. And we programmers know how to cope with such.

Laurent LA RIZZA