Extended Set Theory

Unfortunately, the following doesn't quite capture the performance modeling potential that extended set processing (XSP) may provide. While the papers at http://xsp.xegesis.org might imply that XSP may only be particularly suited to database and data processing applications, XSP is not limited to these -- any more than the RelationalModel is limited to just DBMSes. XSP can provide a unifying mathematical structure over all known data structures and data representations, such as files, records, queues, stacks, XML files, SQL tables, B-trees, and so on. Since all of these can be represented as extended sets, they can all be manipulated and transformed with extended set operations -- just as relations can be manipulated and transformed with RelationalAlgebra.

In short: In the same way that relational algebra provides operations on relations, XSP can define operations on anything that can be represented as an extended set -- which is no less than any and all data representations, whether low-level (physical), high-level (conceptual), or the mappings between these.

-- DaveVoorhis (with input from D L Childs)

A proposed extension to classical SetTheory proposed by DaveChilds, in which sets can be augmented by what are known as scopes (a term which has little to do with scoping in ProgrammingLanguage theory). Scopes allow the imposition of ordering and/or the attachment of labels to sets; what the RelationalModel achieves by the convention of limiting its classical sets (relations) to contain only "tuples" (unordered sets of name/value pairs, essentially).

Unfortunately, online references are scarce - however it looks interesting.

One online reference is at http://xsp.xegesis.org/Xsp-uxr.pdf ; unfortunately the paper does reek a bit of marketing-speak. (If you ignore that, you will find some technical rigor...)

Quoting from this paper:

Consider the following two tables T1 and T2.

 A  | B  | C
  T1 =  ----+----+----
 a  | b  | c
 x  | y  | z

 C  | B  | A
  T2 =  ----+----+----
 z  | y  | x
 c  | b  | a

While they are obviously different to an observer, they have exactly the same domain names and row values, and are therefore considered, under the conventions of the RDM, to represent the same "relation". The intelligent human can easily understand the extra-mathematical convention (which, to my knowledge, cannot be defined in the mathematics underlying the RDM) that the ordering of the rows and columns is insignificant. Capturing the same notions at a formal level is considerably more challenging.

Your knowledge of RDM is faulty: no extra-mathematical convention is required. Difference in ordering of values in rows/columns are an artifact of particular tabular representations you have chosen. There is not ordering of rows/columns in RDM. Access to a specific tuple (row) is by key(s) and then to a specific attribute value (row-column intersection) is by attribute name. SQL-based DBMSes are not a counterexample, they are known to deviate from RDM significantly.

-- MaD70 2008/05/02

Specificaly C. J. Date, in Database in Depth - O'Reilly, gives this definition of tuple:

"Definition: Let T1, T2, . . . ,Tn (n >= 0) be type names, not necessarily all distinct. Associate with each Ti a distinct attribute name, Ai; each of the n attribute-name:type-name combinations that results is an attribute. Associate with each attribute a value vi of type Ti; each of the n attribute:value combinations that results is a component. The set of all n components thus defined, t say, is a tuple value (or just a tuple for short) over the attributes A1, A2,..., An. The value n is the degree of t; a tuple of degree one is unary, a tuple of degree two is binary, a tuple of degree three is ternary,..., and more generally a tuple of degree n is n-ary. The set of all n attributes is the heading of t."

-- MaD70 2008/06/23

Huh? Formalizing this is trivial. The simplest way is to model a row as a function from field names to field values, and a table as a set or multiset of rows; then equivalent tables have identical models. (Purists might want to specify this axiomatically rather than by assuming a specific model, but there is no difficulty in doing that either.)

Added later after reading "Feasibility of a set-theoretic data structure: a general structure based on a reconstituted definition of relation": this is in fact how Childs does model it in that paper. So I don't see what is "considerably more challenging". Oh well.

[...] Classical set theory deals with a membership concept in which something is either "there" or "not there". Extended set theory adds an additional condition to the membership concept in which something is either "there under the specific condition" or "not there under the specific condition". [...] As we will see shortly, it is this additional membership dimension that allows XST to model structures and operations that actually exist in a computer, but that cannot be modeled [in] CST.

"Cannot be modelled in classical set theory"?!

[Yeah, yeah, yeah, if you work at it, you can build up all of math from set theory; that's not the point, and the paper you're criticizing isn't a pure math paper.]

It didn't say, "cannot easily be modelled...". Besides, this was just the final straw in a paper that is full of hype.

[I took the point to be that it takes meta-notation that is traditionally outside the realm of set theory, and extends set theory to include a certain kind of label as part of the extended theory.]

[One could doubtless do the same thing modelled in plain old set theory, but sometimes it is simpler to make an addition to the underlying theory.]

A fair point, but as far as I can tell, the theory seems to be a kind of modal logic. Why does the paper not say so, rather than implying it is something new and magical?

[Anyway, whatever flaws this particular paper has do not apply to the original theory, as far as I know (which isn't saying much). The question is whether this approach is helpful; I'd just ignore the overarching claims.]

OK, the theory probably deserves a second chance. Are there any other on-line papers describing it?

Answering my own question: the reference and abstract for the main paper "Extended Set Theory" is at http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Childs77.html. Other papers by Childs are referenced at http://xsp.xegesis.org/Papers.htm.

http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=umr;cc=umr;sid=e2f01a05d983efad565cb19b5468870c;rgn=full%20text;idno=BAC0294.0001.001;view=pdf;seq=00000001 also seems to be relevant (full text of "Feasibility of a set-theoretic data structure: a general structure based on a reconstituted definition of relation").

URL above gives an error (2008/05/02). That paper is available at: http://hdl.handle.net/2027.42/4164 Moreover there is Description of a set-theoretic data structure, also by David L. Childs, at: http://hdl.handle.net/2027.42/4163

-- MaD70 2008/05/02

From TheThirdManifesto

I worked for many years on extended set-theoretic databases, which go beyond Codd and Date in providing an even more complete and mathematical set of database principles.

I wonder what is an extended set? Sounds like snake oil for me. A different kind of set most probably is something else. For example, SQL tables aren't sets, but bags, because they have duplicates.

Relational math has an inherent difficulty, according to Dave Childs, whereby all the operations one might do on relations do not always return relations.

That's OK, because the operations one might do on multirelations always return multirelations. (A multirelation is just a multiset, a.k.a. bag, of tuples. See http://erw.dsi.unimi.it/ERW/x407.html and http://vigna.dsi.unimi.it/ftp/papers/Multirelations.pdf) Furthermore these generalized operations coincide with DrCodd's original definitions when they are applied to sets.

Queries have to be "conformable". Dave thought it would be nice if database was based on a kind of mathematics that had closure. He observed that when one considers what one "would" get back from a relational query that isn't conformable, the thing appears still to be a set, even though it isn't a relation.

Similarly, relational theory, at that time at least, outlawed things like nested relations. Again, Dave thought that set theory would be better, because it allows nested sets.

Yes, outlawing nested multirelations is an unnecessary restriction.

But there is a problem using classical set theory in this way: N-tuples are defined in set theory, typically via something like <a, b> == { a, {a, b} }. The practical result of this would be that if you got the latter set as a result, poof it would be an n-tuple.

I'm not sure why this should be considered a serious problem. Most mathematicians view constructions like { a, {a, b} } simply as an example of how tuples could be defined in terms of sets, and treat tuples as distinct entities in practice. If you want more formality than that, then there are easier approaches than having to reconstruct set theory; for example, reserve a distinct symbol <pair> that is not otherwise used, and define <a, b> == { <pair>, a, {a, b} }.

Or, to model arbitrary records you could define <field1:a, field2:b> == { <record>, {<field1>, a}, {<field2>, b} }, where <record> and the field names are distinct from any other values. Constructing any arbitrary container out of sets (or indeed, any other reasonable container) is not a problem.
Another alternative is to use the definition of a relation from CategoryTheory, as described in http://vigna.dsi.unimi.it/ftp/papers/Multirelations.pdf.

So Dave devised a new theoretical definition of set theory that keeps n-tuples separate from nested sets. It has been looked at by better math heads than mine, including W V O Quine, and found to be solid.

The practical aspect of it all is that it offers an alternative way of implementing databases, with interesting potential advantages:

Operations would be closed. Every operation on two (or more) sets would return a set as its result.
More computer data structures could be directly defined. More on that in the response to the question below.

Relational math can describe the logic of the data it provides to you, but it isn't strong enough to describe the storage space upon which it is implemented. XST can model the user data, but it can also model the storage, all the way down to the bits.

I wonder why one would want to do this. You use the relational model to model the logic level. Than you use whatever you want - hierarchies, networks, files, any kind of graph or list - to map that to implementation. A RDBMS should have all this implementation functionality, but very separate from the logical model; and users should never deal with the physical layer or its mapping to the logical schema, only DBAs and the such should care about this.

You want to do this because there are richer information structures available to you this way. Consider a string of bits vs. an int of bits. Their semantics are defined by specific logical permutations of their bits. Different logical permutations. How many such permutational semantics do we have access to? Well, seven or eight fundamental types, most of which amount to one of [string, integer, float, pointer]. And there are a few weird ones - GeneralizedBalancedTernary and so on. But what else is possible? Who knows? No one has explored this - or at least no one who has published anything about it. -- Pete
- This has been considered, which is why various types are in use. Also, machine instruction formats employ a variety of bit patterns.
- More information please. If the few fundamental types we use are, after consideration, the only fundamental types in the universe that possess utility, then please point out someone who has demonstrated this. If, on the other hand, you know of someone who has adequately described the entire universe of such permuatonal arithmetics, please point them out instead. As for the alphabet of one or another machine instruction set, unless there are permutational arithmetics backing those alphabets I fail to grasp your point.
- [You're mistaking his point. He is contradicting your statement that "No one has explored this - or at least no one who has published anything about it", both of which are false, and you pretty much already contradicted yourself when you said "there are a few weird ones". Yes, and in fact, there are a lot of "weird" ones. Guess how many different kinds of ternary codes have been explored? Guess how many kinds of GreyCode?? The entire set of what is possible for a binary operator on two N bit inputs yielding an N bit output is obvious: the set of all such possible mappings (which is rather large for N=32). Are any of those mappings provably useless in all possible applications? Of course not. One can always work backwards to contrive a use for something.]
- [But how many of the possible operations are definitely useful? Well, the ones where people have published showing that they are useful, obviously. Which of those uses are common enough to put into an instruction set? Look at all of the instruction sets ever developed and you will see various opinions on the topic, with some consensus and some idiosyncracies. Multiply-and-accumulate is considered critical for DSP processors, but is rare in general purpose CPUs, so it is a question of optimization for intended use.]
- [What does this have to do with relational databases? It's getting pretty far afield, but I would claim that it would be obviously useful to be able to introduce new operators and types rather than being stuck with just the builtins that e.g. SQL offers. And that of course is quite an old opinion in some quarters by now, not something unique to the topic of this page.]

Yes, you do that. And that's the bug. A relational system cannot be implemented in relational terms: the logic making up relational database systems has to be implemented in ordinary software, and is therefore not logically complete in the way that the relational operations seem to be.

Why is it a bug? Specifically, why would we want to restrict the modelling of database representation issues to have to use same theory that is used to model the abstract database structure? Especially when there is an existing theory of abstraction functions and representation invariants that is perfectly adequate? See ExtendedSetTheoryStorageModel.

It would be better, one argues, if your theory could work all the way down to the bits, so that instead of writing more or less ad hoc code to implement your database, you could implement mathematical expressions of more and more detail until voila! you were done. In relational, this is really not possible, because relational just won't describe itself all the way down. In ExtendedSetTheory (XST), it turns out, you can do this. So, in this sense, it's "better".

Along the way, XST also models things that relational cannot. For example, I might readily build an object Person that contained a collection Children, each of whom was another Person. Relational would have us implement this with a relation of Person, and some kind of key field, maybe Parent, so that we could join an individual Person back into the database to find her Children. Relational thinkers start to think that this is "natural" but in fact it is not. Hierarchic or object thinkers might well prefer the hierarchic nested structure, taking all the Persons back to some primordial set of First Persons.

Neither is "right". What is the case is that XST can model the Person relation if you want to, but it can model the hierarchic one as well. And it can do more: XST can directly model the storage underlying either of these models.

Why would you want to do that? you ask. Well, let's think about that on a separate page ExtendedSetTheoryStorageModel.

I suspect that there really is a conceptual barrier between object thinking and database thinking.

Obviously, because there is no underlying theory behind object-orientation. This is one of the strongest points of Date, Darwen and Pascal. There are no precise definitions, and when there are they are unnecessary. For example, what's an object, a variable or a value? And a class, what's it besides a type?

[[The ActorsModel, which is equivalent to concurrent object-oriented computation, has considerable theory underlying it. In this model, an object reference is an actor name, the state of an object is an actor and its set of acquaintances, and a class is an actor that constructs new actors. Alternatively, objects can be modelled as closures in the LambdaCalculus; in that case a message send is a closure application, an object reference is a closure value, the state of an object is an environment, and classes are higher order functions.]]

Well, I don't know the theoretical answer to that but I have been in my day expert in both relational and in OO, and they are both powerful ways of organizing my thoughts. Yet O-R Mapping is weak, and leads us to wonderful solutions with inherent problems built in.

For example, I'm here right now because there's a discussion going on the Refactoring group about refactoring SQL. It comes down to this: you can't refactor SQL because everything you do of that kind wrecks all those cool optimized queries.

It would be better, one argues, if there were a more smooth transition between one's [object] program and one's database, instead of a barrier. XST offered that possibility. There have been a few implementations of XST and they all turned out to be quite marvelous. None of them made it in the marketplace. That might be a condemnation of the notion, but I don't think so. I've seen too many marvelous things not make it in the marketplace to believe that failure or success in the market implies much about technical or theoretical quality.

-- RonJeffries

The relevant point seems really to be about avoiding semantic noise. For example, representing a set as a list introduces 'semantic noise' because the interpreter/compiler doesn't know that the list is to be interpreted as a set. Use of an AbstractDataType can help, but an ADT doesn't work so well in 'open' systems (i.e. communication with an external database) and even in a closed system we'd be depending far too heavily on the mythical SufficientlySmartCompiler. My impression is that extended sets hit a sweet spot between classical sets recursive record-based data structures. That is, extended sets unify structural properties from both of these domains without considerable semantic noise. This is useful if you wish to avoid complexity from supporting two classes of operators (i.e. some operators for functional data, other operators for sets). I'm considering extended sets on this basis: my earlier design used recursive record-based data structures, which was suitable for MessagePassing, but I've changed basis to ReactiveDemandProgramming which pervasively uses sets, and I'm finding quite awkward the need for separate operators for sets vs. other data structures.

That said, XSP seems to represent a particular technology rather than a well-defined closed calculus for operating upon extended sets. I would really like to see an effective representation for processing extended sets - preferably something defined reflexively (i.e. such that queries and processing on extended sets are themselves extended sets). I would like to see reasoning about termination, determinism, and optimizations. I'm not seeing this sort of information on the XSP site, or if it exists it is buried beneath a ton of marketing and buzzwords.

Being able to abstract all (or at least many) computational objects as extended sets indeed provides a degree of semantic simplification. This is analogous to the manner that, e.g., the RelationalModel simplifies storage by representing all storage mechanisms as RelVars. However, I think extended set theory's main strength is that it provides an opportunity to use a common, relatively simple optimisation scheme from the lowest, most concrete levels of a system to the highest, most abstract levels. I don't know to what degree either of these ideas have been implemented; it's something I intend to ask D L Childs.

XSP is not intended to be a particular technology. I can't comment on the writing style used on the XSP web site -- other than to note that Childs is a mathematician, not an English writer, and the site appears to be mainly intended to promote the idea to potential implementers. And by "potential implementers", I mean investors with the money and inclination to fund speculative commercial software development; the site is largely not aimed at computer scientists or programmers. I too would like to see an effective representation for extended sets, plus reasoning about termination, determinism and optimisations. I suspect that's left as an exercise for the reader, but I will ask Childs if he's written anything on these.

It seems our areas of interest have some overlap. Given I'm working on a PhD in this area, I'm not sure whether to feel threatened or organise a conference. :-) I suppose the appropriate "academic" thing is to do the latter.

-- DaveVoorhis

SetTheory, CategoryMath, CategoryTheory (heh, heh...), AlgebraixData (does this implementation mean this is no longer just a theory?) No, there are a number of implementations; of classical set "theory", too.