Are Tables A General-Purpose Data Structure?
See MinimalTable for definitions of "table".
Claims to investigate:
- Tables are more general-purpose than mixed structures (bags, stacks, hashes, etc.). Tables can represent anything that bags, stacks, hashes, or other data structures can.
- Relational and normalization rules limit variability in table designs, making them quicker to grok due to consistency. Fewer combinations and techniques to learn.
- Tables are easier to scale, both in size and complexity, without having to overhaul access code, because they are close to a one-size-fits-all structure.
(Please place rebuttals below, not in the list.)
(Request is rejected. Put a summary of rebuttals here or you force a biased discussion.)
- Well, then how about a summary of replies? Details don't belong in the intro, in my opinion.
This focuses on the concept of tables rather than BigIronDatabases in particular. For the purpose of discussion, exclude speed issues arising from a hardcoded implementation: notably, given appropriate constraints on the table (which is part of relational theory), one can have table implementations which are exactly as efficient as any given data structure for the operations that that structure would support natively, while still allowing the flexibility one would expect from a table.
[original, causing confusion] This focuses on the concept of tables rather than BigIronDatabases in particular. For the purpose of discussion, exclude speed issues.
Why? Speed is frequently important - even if we limit ourselves to BigOh issues; and tables have speed issues even if a NimbleDatabase (or none at all, just raw tables) is used.
Only in very limited circumstances. If you can identify a problem situation WRT speed, I would be glad to take a look. Maybe we can do a wild-west speed shootout.
The 3 claims above are mostly true. See the discussion below and at other places. It is *exactly* performance that can make one prefer a non-table structure, notably time complexity. The time complexity of accessing a table element by key is O(log n); a hash table element, O(1); the first element of a linked list, by reference, negligible; an array element by index, negligible. However, in non-critical applications, O(log n) is fine, and therefore there is no reason to use non-table structures (other than programming habits). -- MariusAmadoAlves
O(1)? Not sure about that. Why don't RDBMS vendors build keys with hashes then? -- AnonymousDonor
Many do, though they use those hashes for binary trees (hence the O(log N)). It's certainly possible to index with a HashTable. Most uses for an RDBMS involve large data sets, though, where the storage overhead of a hashtable is prohibitive. This is really a case of a big-dataset tool being adapted for a small-dataset problem, not an intrinsic problem of tables. A lightweight in-memory database {NimbleDatabase} could easily use hashtables for indices, and then would have the same speed as a hashtable.
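For illustration, a minimal sketch (my own, not from this discussion; MiniTable and its method names are invented) of how a lightweight in-memory table could keep a hashtable index on a key column, giving O(1) lookup at the cost of extra storage:

# Minimal sketch: an in-memory "table" with a hash index on one column.
# Names (MiniTable, insert, find_by_key) are hypothetical, not a real library.
class MiniTable:
    def __init__(self, columns, indexed_column):
        self.columns = columns
        self.rows = []                      # the table proper: a list of row dicts
        self.indexed_column = indexed_column
        self.index = {}                     # hash index: key value -> row

    def insert(self, **values):
        row = {col: values.get(col) for col in self.columns}
        self.rows.append(row)
        self.index[row[self.indexed_column]] = row   # O(1) index maintenance

    def find_by_key(self, key):
        return self.index.get(key)          # O(1) lookup, like a hashtable

people = MiniTable(columns=["id", "name"], indexed_column="id")
people.insert(id=42, name="Alice")
print(people.find_by_key(42))               # {'id': 42, 'name': 'Alice'}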
There are cases where tables have intrinsically worse performance than dedicated structures, though. Consider inserting an element at the head of a list. This is O(1) for any reasonable LinkedList implementation. It will generally be O(n) for a table, because it requires adjusting the sequence numbers for all following records.
{{It requires logically adjusting the sequence numbers, but it does not require actually iterating through all the sequence numbers and changing each one. The arguments in favour of tables are based on using a uniform table interface, not a uniform implementation.}}
These sequence numbers are part of the visible interface, so they can't be optimized away.
I am not sure what tablized implementation of a linked list you are envisioning.
You may be able to get amortized O(1) updates if the actual sequence numbers are not visible, only ordered.
Leave a gap between sequence numbers, and then double the gap with each renumbering, like how vectors resize themselves. The constant of proportionality will be significantly worse than with a normal LinkedList implementation. And you lose random-access indexing of elements by position. But that's the same trade-off you face between arrays (O(N) insert at beginning, O(1) random access) and linked lists (O(1) insert at beginning, O(N) random access).
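A rough sketch of that gap-numbering idea (my own illustration in Python; the class and method names are made up): renumbering happens only occasionally, and the gap doubles each time it does.

# Sketch of gap-based sequence numbers for "insert at front" in a table.
# Rows are kept in a dict keyed by sequence number; names are hypothetical.
class GapSequencedTable:
    def __init__(self, gap=16):
        self.gap = gap
        self.rows = {}          # sequence number -> value

    def insert_at_front(self, value):
        first = min(self.rows) if self.rows else self.gap
        new_seq = first - 1     # squeeze in front of the current first row
        if new_seq in self.rows or new_seq <= 0:
            self._renumber()    # occasional O(n) renumbering, amortized away
            first = min(self.rows) if self.rows else self.gap
            new_seq = first - self.gap // 2
        self.rows[new_seq] = value

    def _renumber(self):
        self.gap *= 2           # double the gap each time, like vector resizing
        ordered = [self.rows[s] for s in sorted(self.rows)]
        self.rows = {(i + 1) * self.gap: v for i, v in enumerate(ordered)}

    def in_order(self):
        return [self.rows[s] for s in sorted(self.rows)]

t = GapSequencedTable()
for x in "abc":
    t.insert_at_front(x)
print(t.in_order())             # ['c', 'b', 'a']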
(EditHint: The list discussion is already mentioned below and should probably be consolidated together. But I've already refactored this page once and it just ended up getting ThreadMessed again.) -- JonathanTang
-
- I can't see why a table couldn't implement a linked list in the same complexity as a 'real' linked list. Why wouldn't you just maintain a 'head' element which contains no real data, only a link to the first element? You could then do any addition anywhere in the list by appending a new element to the table and updating the two elements (if it's two way) that refer to it. O(1) if you already have the node you want to work with. Requiring that we can replace the head is a needlessly complicating assumption. -- WilliamUnderwood
-
- I forgot about that possibility (that's what I get for commenting first and then reading the rest of the page). Yes, it's absolutely possible for a table to mimic the list exactly. You're likely facing higher constant factors, but that's an implementation detail, not something inherent to the data structure. The analysis I gave above is really for an array, and an overly complicated one at that. Although implementing an ArrayList this way at least achieves the NoPointersIdeal mentioned below. -- JonathanTang
-
- Following a "pointer" in a table means doing a lookup on some key and is O(log n), not O(1). In general, a table directly implements a map or a set efficiently. Representing any other data structure is only possible by abusing keys as pointers and consequently incurs an additional logarithmic cost.
-
- Hardly. Consider a 2 column table with the constraint that the first column is an integer value smaller than 4 billion. This can trivially be implemented as an array. Any relational operation that maps to the operations one generally does with arrays is done in exactly the same complexity as an array. Stating it again to make sure there's no confusion: the algorithmic complexity is a function of the constraints and the implementation of a table. The fact that we have a set with tuples in it has no bearing on it. -- cwillu
-
- Well, no. To be trivially implemented, the integers have to be densely allocated and some sort of garbage collection is needed. This is not even transparent to the user. The result is an array, and therefore should not be called a "table" any more.
-
- No, only the implementation is an array. The app writer may never know which implementation is used under the hood, and it may change dynamically by the underlying engine if say the size or usage pattern changes. RDBMS often use B-trees, for example, but that does not mean that one is directly "using" B-trees. One could also swap an array implementation with a linked list and the app writer may never know the difference (other than maybe analyzing performance profiles). "Table" does not define/dictate an underlying implementation.
Good point. Tables are an interface, not an "implementation". For example, the MS-SQL product has something called a "clustered table", which uses an internal arrangement that speeds up primary key-related operations at the expense of other keys or searches. Such a table does not behave any differently outside of performance.
{Some of the above can perhaps be moved to AreRdbmsSlow.}
I will agree that linked lists are probably one of the "least natural" structures to emulate with tables. Emulation is doable, but not very pleasant. But not easily handling the entire set of structures is not a reason to abandon the idea. -- top
Claim 1: Tables can represent anything that bags, stacks, hashes, or other DataStructures can.
- A map is a table with two columns where one is the primary key.
- A bag is a table with two columns, one being randomly [auto-increment?] generated or otherwise not exposed.
- This depends on whether tables allow duplicate rows. If yes, then they already are bags. {That is why the auto-gen column; it "absorbs" the uniqueness qualification of relational without affecting the data being considered.}
- This is much like the "A set is a table with one column". In practice though, it's much easier to make a bag out of a set than it is to make a set out of a bag.
- Another variation is one unique column for the data and another as the count of occurrences of that item.
- A set is a table with one column (which would be unique under relational rules)
- A stack is a table with data and timestamp (or auto-increment) columns, where the query selects for the last (highest) element added (see the sketch after this list).
- A queue is a table with data and timestamp columns, where the query selects for the first (lowest) element added.
- A linked list is a table with a successor or predecessor field.
- A sequence is a table with an ordering field. Reals can be used instead of ints so that renumbering is not required as often.
- A tree is a table with a parent field. See TreeInSql.
- A graph is a set of nodes and a mapping from node->node. See GraphAlgorithmsWithTables.
- A 2D array can be represented by 2 index columns (usually integer) and one data column. The primary key would be the combination of both index columns. It is essentially a "sparse array". A trigger can produce a non-existing record upon update.
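As a concrete (if simplified) illustration of the stack and queue items above, here is a sketch using an in-memory SQLite table; the table, column, and function names are invented for illustration and not meant as a production schema:

# Sketch: a stack and a queue emulated with one SQLite table and an
# auto-increment column. Names are made up for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (seq INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT)")

def push(value):
    db.execute("INSERT INTO items (data) VALUES (?)", (value,))

def pop():
    # stack view: take the row with the highest sequence number
    row = db.execute("SELECT seq, data FROM items ORDER BY seq DESC LIMIT 1").fetchone()
    if row:
        db.execute("DELETE FROM items WHERE seq = ?", (row[0],))
        return row[1]

def dequeue():
    # queue view: take the row with the lowest sequence number instead
    row = db.execute("SELECT seq, data FROM items ORDER BY seq ASC LIMIT 1").fetchone()
    if row:
        db.execute("DELETE FROM items WHERE seq = ?", (row[0],))
        return row[1]

push("a"); push("b"); push("c")
print(pop())        # 'c'  (last in, first out)
print(dequeue())    # 'a'  (first in, first out)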
Con:
- The above claim is a victim of the TuringTrap fallacy. The ability to 'represent' other structures is hardly unique to tables. One can also represent all the above structures using: untyped lambdas (Church numbers, booleans, lists, etc.), cons cells (Lisp / Scheme), object graphs, records, etc.
- Then "general purpose" has no real meaning.
- Sure it does. But "general purpose" requires more than the mere ability to 'represent' other structures. If you can make positive claims about the performance, composability, portability, safety, and simplicity of these representations, then you can call it "general purpose".
- But the weights will be different for different people. For example, I'd rank portability much higher than "type safety".
- Sure, weights may be different. Weights will vary from one purpose to another. If you hold 'portability' higher than 'safety' no matter what your potential task is, that mostly marks you as a fool. Would you consider portability more important than safety even when writing software for a pacemaker? To be "general purpose", a structure must be sufficient for a very broad range of 'purposes'. This usually requires achieving both: (a) a balance of properties, and (b) ability for the programmer to influence properties for a given task.
- Performance. The above representation of TreeInSql, for example, cannot support logarithmic-update-cost persistent tree structures. That hurts, algorithmically.
- "Doing everything well" is not a requirement to qualify. Every approach has up's and down's.
- Having ups and downs is okay. Having huge, glaring holes in efficacy is not okay. "Doing everything GoodEnough" is the basic requirement to qualify as general purpose.
- "Gaping holes" only for the way YOU do things.
- I'm quite certain that I'm not the only person who does things the way I do them. After all, I learn many techniques from literature that other people write.
- Weak support for algebraic 'variant' data types (typically tagged unions). You really need multiple tables (a record of tables - or a database state) to represent complex structures like tagged union algebraic types.
- That is a linguistic/style preference, an implementation choice. It's like saying that JavaScript is "bad" because it's not Java when you would rather do Java than JavaScript.
- What, are you going to claim that something that is not tabular, such as your DynamicRelational or MultiParadigmDatabase, should be considered here? I'd rather you stick with the well accepted defs. If you want to discuss some cousin data-structure, use another topic statement. Unless you can hammer down what you're talking about enough for both sides of the argument to grasp it, you're just HandWaving like usual about fuzziness and relativism. We hardly need more of that crock on this wiki.
- I'm not sure what you are talking about. Please clarify.
- I'll use small words: stick to the MinimalTable definition specified at the top of the page or admit that you're a HandWaving nutcase.
- The "style preference" is nestedness versus referencing. Relational's style tends to be "referency" instead of nestedness.
- No, TopMind, that is TopMind's and SQL's style. Relational doesn't care. Other relational languages, such as PrologLanguage and MercuryLanguage, use nested structures to great effect.
- But that is not "relational". They may be relational-like or have relational influences, but that's not the same as being relational. If you mean the "cells" can contain nestable types because relational doesn't limit what types go in there, I may agree, but that's a different animal.
- Claim 1 undermines claim 2 and claim 3. Use of tables to represent complex structures is not easy to understand (for example, it results in very complex queries and DML for simple tree operations (equality tests, for example) and does not scale (the 'performance' mentioned above is one issue, but composition - i.e. stacks of trees, graphs whose nodes are labeled by stacks of trees, etc. - is a bigger problem).
While this is true, it is likely true for many of the containers mentioned. LispLanguage shows you can model everything with lists (with pairs, actually). Tables are hardly special in this regard.
True, but I find Lisp structures harder to view and lack inter-developer consistency, and don't scale to "large-scale data banks" very gracefully.
[I suspect that the 'representation' of trees/graphs/stacks/lists/etc. using tables would prove equally delinquent of inter-developer consistency and as bad or worse for scaling.]
Exactly - Lisp cons cells are universal, but they're rather low level. Also, they don't support random access structures well (most Lisp dialects do provide random access arrays, but in addition to cons cells, not defined in terms of them).
But, they are a similar kind of contender to tables in that you don't have a bunch of different "kinds" of things, only lists with the ability to nest.
Claim 2: Tables are more consistent and easier to learn than dedicated data structures.
Pro:
- Tables map effectively to logical domain relations (predicates), especially in the higher normal forms (6NF).
- Relational rules and normalization rules constrain creativity in structure designs, making them more consistent. A maintenance programmer can make certain assumptions about the data, and does not have to learn the APIs of multiple collection types.
- Relational tables can represent complex data relations without nesting structures.
- Declarative and consistent 'API' in the form of a schema. Does better than most ADTs when it comes to describing uniqueness constraints and relationships between relation variables. (DependentTyping may support this, but isn't readily available.)
- Cascading deletes. NuffSaid.
Con:
- Representing complex DomainValues by use of toplevel (non-nested) tables forces programmers to face some very complex issues for simple equality tests, especially if performance (shared structure) is desired - which introduces need for explicit garbage collection.
- RelationalBreaksEncapsulation. Use of tables to represent behavior and variations on it (DataAndCodeAreTheSameThing) tends to contradict modularity and increase coupling to implementation details. One can dodge this problem by putting behavior into an 'opaque' structure (like a String, ScriptingLanguageAgnosticSystem or AbstractConstructor + MementoPattern) then restoring the behavioral descriptions in another (non-tabular) structure. But that dodge just proves the point: tables work very poorly here. Attempting to use tables to represent, say, a queue of abstract FunctorObjects will be far from easy to comprehend.
Claim 3: Tables scale more easily.
Pro:
- Adding a column to a table can be done without touching existing code (see the sketch after this list). Adding an extra key to a map or set requires switching data structures or nesting one structure inside another (DiscontinuitySpike).
- Virtual tables (such as RDBMS "views") can insulate applications from changes in the large-scale table structures.
- Example: GraphAlgorithmsWithTables where the map of lists could not handle the addition of weighted links. It would require far more rework to add that to the navigational version than the table version.
- Note: I don't claim that tables are without DiscontinuitySpikes, only that there are fewer of them. No technology is completely free of them.
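A small sketch of the "add a column without touching existing code" claim, using an in-memory SQLite table with invented names; the existing access code is left alone while the schema grows:

# Sketch: existing access code keeps working after a column is added.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("INSERT INTO orders (id, total) VALUES (1, 9.99)")

def report():
    # "existing code": selects only the columns it knows about
    return db.execute("SELECT id, total FROM orders").fetchall()

print(report())                                              # works before the change
db.execute("ALTER TABLE orders ADD COLUMN discount REAL")    # schema change
db.execute("UPDATE orders SET discount = 0.10 WHERE id = 1")
print(report())                                              # still works, untouched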
Con:
- Tables used to represent complex structures (like those mentioned in claim 1) tend to compose very poorly. There is very little code reuse in moving from a 'tree of strings' to, say, 'a tree of stacks'. This might be overcome if tables, queries, and their types are made FirstClass (akin to lists and list-builder notation in Haskell). In that sense, if tables are allowed to have columns of tables (or records thereof, for representing variants) then composition can be recovered.
- Poor encapsulation and modularity for composition of behavior. Modularity is required for many kinds of scalability. Heavy reliance on 'applications' knowing how to interpret the behavior, values, and structure of opaque behavior descriptions (like scripts).
- Simple operations (like equality tests or comparison operations) for complex structures represented in tables (e.g. to compare two stacks or linked lists stored in tables) are challenging to write, which hurts extension. Auto-generated 'keys' tend to forbid simple table-to-table comparisons.
- Tables can be monstrously inefficient for applications where their power isn't needed. As a simple example; a linked-list can be traversed much faster than a table (even if we assume that the table is in local memory, and there is no transactions layer getting in the way).
Regarding performance: As computers get cheaper and smaller, the idea of a NimbleDatabase in a wrist-watch will be less and less far-fetched. After all, some want to put Java engines in wrist-watches, and table engines are hardly more bloated. Also, you can't play both sides of the road: if it is a small structure, then the overhead of the table engine is probably not going to matter most of the time anyhow. Efficiency generally does not matter for small sizes. And if it grows big or needs concurrency, an RDBMS scales more easily there than linked lists do.
Nobody is objecting to NimbleDatabase, not even in the wrist-watch. It's the more extreme "we should use tables for everything!" idea that raises objections.
Tasks where dedicated structures are harder to use than tables:
These are things that equivalent dedicated structures seem to have the most DiscontinuitySpike problems with in my experience, as they require more coding than a table would to add them:
- Adding a new column (especially going from 1 or 2 to 3+)
- Adding a new "column" is an operation that doesn't make sense for many data structures. For one, it presumes that the elements being stored are records - what often are mistakenly called "tuples" by RelationalWeenies :) - when in many cases they aren't. At any rate, a table is little more than a set of such records, with a possibly stronger "equivalence" test available.
- Whatever we call them, having to add another "attribute" in a given "node" is a common frustration I have found with dedicated structures.
- Is that a fault of your language (which may not be the 'pinnacle' of procedural or whatever)? Going from a set/list/stack/etc. of non-records to a set/list/stack/etc. of records can be somewhat of a challenge because it forces all the code touching the set/list/stack to change accordingly. More flexible languages can often work around this using polymorphism. Strict adherence to a polymorphic principle (like OO - always using a method to access an attribute - or alternatively using BoostFusion and template-driven compile-time polymorphism) can pretty much solve that problem.
- Sorting
- Disagreed; there are many efficient sort algorithms for arrays, linked lists, etc.
- Yes, but it is often extra effort to integrate the algorithm if not already included, and it is bad factoring to have a different algorithm for each and every dedicated structure. Again, it is a combinatorial mess. Tables give you that out-of-the-box. Plus, you usually don't have to alter the original structure positions.
- It is not a "combinatorial mess". At worst, you write one or two sort algorithms per generic structure (i.e. stable-sort and performance sort). A 'combinatorial mess' would exist only if for some strange reason you needed to adapt every sort to every data-structure - a ridiculous claim. Further, most programming systems also "give you [sorts] out-of-the-box". And if you're willing to take the same performance hit you ALWAYS take for tables, you simply copy the structure then sort.
- Sorting by a different field than original requirements
- Disagreed again; sorting algorithms will happily let you apply your own ordering predicate function (the StandardTemplateLibrary does this easily). Of course, you are again making the table-centric assumption that the things being contained with the data structure are records.
- See first item WRT "record-centric".
- Looking up by more than one key or argument
- Adding concurrency to
- Depends on what you mean. If you mean "it's easy to make tables concurrent by switching to a better RDBMS", OK. OTOH, if you have a "raw" table not part of some database - adding concurrency to that is at least as difficult as arranging concurrent access for any other data structure.
- Please clarify. If you mean across different DB vendors, the issues are no more difficult than with dedicated structures. Thus, the worse case is still a wash. (The "average" case is simply some minor query language changes.)
- You seem to be assuming your 'general purpose tables' are already in an external DBMS and accessed via a protocol and standard language. But the discussion of whether tables are general purpose structures would apply equally well to considering whether they be provided instead of say, arrays, in the CeeProgrammingLanguage?. Adding concurrency to table-primitives is not going to be easier than adding them to array primitives. The greater complexity of table structure and index management has a high probability of making it more difficult.
- Browsing
- OK; this is another form of AdHocQueries.
- In part, but it is also about having existing table and query-result browsers (usually interfacing via SQL).
- Existing table and query-result browsers by no means can be expected to work or even be able to access the FirstClass tables - especially the temporary tables specific to the application instance.
- Quick data dumps or echoes for debugging
- Tables are easy to represent on a piece of paper, but so are many other collection types.
- But each one has to reimplement it, and you may not have that built-in for some.
- Huh? If your language has support for dumping tables to text, it's just as reasonable to expect similar support in a language that uses lists (like Lisp or Scheme). If your language lacks such support for dumping table text, then you'd need to reimplement it per table just as assuredly as you'd need to do for linked lists. If you are assuming that all tables are stored in some external DBMS, then you've already lost your argument that tables are general-purpose because you've constrained their usage.
- Persistence
- Again, you are equating "tables" with RDBMS.
- The main point is that you (or API builder) only has to implement it once, not for each and every dedicated structure.
- Your main point is pointless. This returns to claim 1. You could just as easily say the same for lists: "I only need to implement persistence once for lists, then represent all other structures as lists." You could just as easily say the same for lambdas: "I only need to implement persistence once for lambdas, then represent all other structures as lambdas." You could just as easily say the same for heap: "I only need to implement persistent virtual memory once (TransparentPersistence), then use memory for all other structures." It is pointless to claim an advantage for tables that may also be claimed by every alleged competitor of tables. What sort of fucked up logic causes you to repeatedly make such "points"?
Real-world example of Collections vs Tables
I wrote a Python program, and after reading about TOP, noticed that several of my classes had the sole purpose of emulating either a record or a complete table, including multiple keys and ad-hoc (tabular flat-file) persistence which you get for free with RDBMS tables.
- I had a class whose attributes were a list of (A,B) pairs, a list of C, and nested dictionary D such that D[B][A] contains a list index which has to be consistently updated. Instead I could have used a table with columns (A,B,C), and retrieved records using row number or key (A,B).
- I had another class which simply stored a number of attributes (a,b,c,d,e), and instances were retrieved from a persistent dictionary (shelf) keyed by attribute 'a' and needed instances are pre-retrieved in a Python dict (to avoid having to do retrieval in the middle of other I/O operations). These could have been records in a table with key of 'a'.
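To make the second bullet concrete, here is a rough sketch (with invented attribute and table names, assuming an in-memory SQLite table keyed on 'a') of what such a class could shrink to:

# Sketch: a record-like class with attributes (a, b, c, d, e), keyed by 'a',
# stored as rows in an in-memory table instead of a shelf plus a dict of instances.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE things (a TEXT PRIMARY KEY, b TEXT, c INTEGER, d REAL, e TEXT)")

def store(a, b, c, d, e):
    db.execute("INSERT OR REPLACE INTO things VALUES (?, ?, ?, ?, ?)", (a, b, c, d, e))

def fetch(a):
    # replaces the shelf lookup plus pre-retrieval into a Python dict
    return db.execute("SELECT * FROM things WHERE a = ?", (a,)).fetchone()

store("k1", "blue", 3, 1.5, "note")
print(fetch("k1"))   # ('k1', 'blue', 3, 1.5, 'note')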
There are two issues. Firstly, a procedural/relational approach to accessing the table (e.g. via SQL or an HDF library) would maybe save 50% of the lines of code over each partially-table-emulating class, which only takes 100-150 lines of Python.
Second, my program does millions of small operations on data before returning to the user. With RDBMS tables, an SQL query to retrieve a single item is at least several times slower than a native dictionary lookup, and would dominate the running time.
- Do you have sample data? I'd like to test this in FoxPro (without using SQL).
For True Speed, I could have used C++ STL collections, at the expense of a lot of programming time when the algorithms are 'dynamic' in that they do lots of creating and passing around of objects, making it hell to do C memory management or C++ Resource-Acquisition-Is-Initialization.
- So: Where do I find a "fast as native dict/list/set" implementation of an in-memory relational table? (I don't care if the persistence is slow.)
-- Jettlogic
Discussion & some other thoughts
Surely all data structures are (mathematical) relations between data items. You can represent relations in many ways, tables, structs-and-pointers, s-expressions, etc. Which representation is best is an engineering trade-off, not an ideological absolute.
The flip side is saying that nuts and bolts and standardized 5-volt circuit board parts are "not an ideological absolute". Yet, we find standardization and consistency useful in nuts, bolts, circuit board standards, etc. Tables make a very good standardized "structure" component in a similar way. True, for special purposes you will always need custom-made thingies. But, I think we are at the point in software engineering where we can farm many things off to nearly off-the-shelf standards. Perhaps some might find that boring, but sometimes progress does that. I think smaller dedicated structures made sense when hardware was expensive, but now the slightly higher overhead of table engines is small compared to labor costs. (Although special niches will always be hardware-bound.) -- top
[It should be noted that 5-volt digital logic is rather quickly being replaced with lower-voltage logic; 3.3 and 1.8 volts are now very common. This trend has been going on for several years now]
[At any rate, the expense of having multiple voltage levels (5,3.3,1.8) on a single board, along with the need for logic which works with multiple voltages and/or converts between them, is far greater than having different data structures (relational tables, lists, trees, arrays, etc.) in a software system.]
Perhaps, but why mix if you don't have to? BTW, I think a simple non-nested dictionary array is the upper limit. Beyond that, tables are preferred in my view.
Different containers serve different requirements. HorsesForCourses and all that. Having tables and linked-lists in the same application doesn't hurt anything; the two need not come in conflict. I see no compelling reason to use tables for everything. (When there is a good reason to use a table, then I use one. When a linked list is more appropriate, I use that. This ain't rocket science...)
I have not seen enough "table failures" to agree with that. If you have the table engine in there anyhow, then the additional cost of another table is trivial. Plus it scales in complexity better. Better future-proofing.
Much of this page seems to be predicated on the idea that tables == database; and table-behavior == SQL. This misconception is apparent on both sides of the debate. Tables don't do concurrency, sorting, persistence, etc. RDBMS's do these things. It's true that these things can easily be added, but that just extends the basic error of mixing abstraction with implementation.
But upgrading from mere tables to a RDBMS is minor compared to the alternatives. It is a few tweaks here and there, not a complete switch to a different kind of thing or different kind of API. It is more incremental in nature compared to the alternatives.
Stack Example
Someone brought up the fact that tables aren't stacks. True, but if you're thinking in an abstract domain then you don't need stacks. What you want is to get the most recent data added to a collection, a concept that can be implemented in many ways. The simplest would be a column that represents the order in which the rows were added into the table (e.g. an auto-incrementing field). The big weakness of tables is that they force you to think about these fields early; the great strength of tables is that they force you to think about these fields early.
Most programmers like to assume that things like references and collections-classes "just work". This is good for programming, but can lead to premature fixing of the underlying metaphors of the design. If you think "I need a stack here" then you miss the fact that you are simply defining an ordering-relation for extracting data from a container. Ignoring details like this can be good, but hiding them may also prevent you from exploiting it.
Take the example of an audit trail. If you want an audit trail, then you don't want to lose data from the table. You could do this by having an audit object separate from your stack object (with reference-semantics this is not hard); or you could think in terms of tables, and redefine the ordering-relation for stack-usage to find the most-recent "active" row, and remove by marking it as inactive. A "stack" view thinks inactive rows are deleted, whereas the "audit" view sees active vs. inactive rows.
Interesting analogy, but I am not sure how the last paragraph points out "premature fixing of metaphors" described in the second paragraph, at least not "bad" fixing. It is simply a decision whether to use relational as the underlying philosophy or the "old style" approach.
The table is more flexible. With the stack, you either have to add a secondary new structure to get the audit trail, or abandon a pure stack altogether. With the table, you just make some adjustments, such as an added "is_active" flag column, but are still using the same table. It is back to the advantage of has-a over is-a.
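A sketch of that adjustment (hypothetical schema and function names, SQLite syntax): pop marks the row inactive, the stack view ignores inactive rows, and the audit view sees every row ever pushed.

# Sketch: stack-with-audit-trail as one table plus an is_active flag.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE stack (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    data TEXT,
    is_active INTEGER DEFAULT 1)""")

def push(value):
    db.execute("INSERT INTO stack (data) VALUES (?)", (value,))

def pop():
    # the "stack" view: the newest still-active row; popping just deactivates it
    row = db.execute("SELECT seq, data FROM stack WHERE is_active = 1 "
                     "ORDER BY seq DESC LIMIT 1").fetchone()
    if row:
        db.execute("UPDATE stack SET is_active = 0 WHERE seq = ?", (row[0],))
        return row[1]

def audit_trail():
    # the "audit" view: every row ever pushed, active or not
    return db.execute("SELECT seq, data, is_active FROM stack ORDER BY seq").fetchall()

push("a"); push("b")
print(pop())           # 'b'
print(audit_trail())   # [(1, 'a', 1), (2, 'b', 0)]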
The definition of "General Purpose" does not mean "Universal". It is an antonym for "Special Purpose". For example, an X86 processor is general purpose, and an NVIDIA or ATI graphics GPU is special purpose (specialized for graphics). A Pentium, being general purpose, can do the computations that the GPU does, but cannot do them with anything like the same performance. The same is true of a turing-complete formalism based on tables. Tables excel at computations that are "Single Instruction, Multiple Data" in nature (the form of parallelism where you want to apply a computation to a large set of data, where the computations on individual items do not interact), permitting powerful computations to be expressed succinctly. They are also very good for expressing tedious control systems: a natural representation of a state machine is a structure known as a "state transition *table*". The only reason that it's easy to lose sight of the power of tables is that many people's introduction is via the brain-dead language known as SQL. -- DaveWhipp
Hmmm. That is an interesting way to look at relational tables. However, things like Join operations and aggregation require inter-row interaction. Are there any candidate "general purpose" non-SIMD structures/paradigms?
You might want to read some of the old (early '90s) books on the ShlaerMellorMethod. This method was popular in telecoms and control applications. You modeled your application using tables and relations that have associated state machines to describe (not prescribe) their behavior over time (in response to events). By the mid nineties it was losing in the methodology wars, but had developed the formalism to include an action language rooted in SetTheory. Much of the work has been carried over to define executable profiles for UML modeling (ExecutableUml).
Part of what bothers me about dedicated structures is their IS-A nature when a HAS-A nature would be more flexible interface-wise it seems to me. Why is an "ordered list" not a list with sorting set to "on"? Or why do we have to go through gyrations to get a 3-column structure out of a map (with two columns)? The number of columns and the number of sortable and/or unique columns should be features (attributes) that are switched on or off as needed. Dedicated structures seem to be a sloppy attempt to take combinations of all these orthogonal or semi-orthogonal features and turn them into a taxonomy out of historical habit.
Perhaps if the interfaces were not dependent on the "type" of dedicated structure, it would not be much of an issue. However, it is in most APIs. You cannot change a list into a map without having to overhaul existing interfaces that used the list. Tables appear better at this because adding a new column, a new key, and a new sorting column generally do not require changing existing code. If one sets out to make the interface handle such changes incrementally, I believe the result will resemble tables. [Note: I may have repeated this issue elsewhere; it sounds familiar, but I forgot where.]
The answer to your question is "efficiency and simplicity". We don't need one super mega list with every possible feature embedded in it waiting to be switched on when the data is just a list. Sometimes a list is a list. When it is, just use a list. If that changes and it needs to be an ordered list, then use an ordered list.
By your logic, all we need is one big program with every feature in it waiting to be switched on. (Which is one way of thinking about a TuringMachine. Our job is to figure out how to set the switches to get the desired behavior.)
- Exactly. Each such switch is called a "bit", and it happens to take a lot of them to create a program. ;-)
No, you are exaggerating. However, that is an interesting way to view things. Note that we don't have to necessarily include features not actually used. Anyhow, think of it as a thought experiment and focus on the interface, not on bytes consumed.
If all you mean is that ordered list and list use the same interface, we already do that. We also don't include features not actually used.
Well, some of the typical set of dedicated structures indeed do or can share some parts of their interfaces; however, each cannot be replaced by the other without changing existing code in many cases, especially when using combinations. Besides, why have a different name for each combination? Many are just names given to each iteration in a loop that would resemble something like:
...for columns = 1 to 2
.....for unique keys = 0 to 1
........for sortKeys = 0 to 1
(Dots to prevent TabMunging)
It is illogical to me to create a taxonomy of combinations in place of describing the 3 or so factors that define it. It's a form of AttributesInNameSmell. Plus, tables are not limited to just 1 or 2 columns, keys, sorts, etc. The sky is the limit. The small numbers in the traditional data structures are arbitrary maximums. I see nothing magic in them. I don't get it. You keep talking about taking up less RAM and stuff, but have no interface-focused justification that I can see. On a small local scale they are possibly faster and take up less RAM; I won't challenge that claim much (yet). But is that it? I don't see many change-friendly properties from a software management standpoint.
Why is an "ordered list" not a list with sorting set to "on"? '
- Actually, I think I worded that wrong. It should read something like, "Why is a list not a set with ordering set to "on" instead of a different 'thing' altogether?"
How does one maintain order in the list? The problem with an ordered list in a database table is that to reorder the list, one must update a sequence number in each table row. Give me a linked list any day, where I can reorder items by only modifying the neighbors.
Isn't that called a "linked list"? One can implement such in tables by having previous-ID and next-ID foreign keys. I have not found a need to use such very often, mostly because it is tough to hook/define into a UI. Thus, they perhaps are best used for internal structures. Another approach is to use a sequence number (float or double) that takes the average when you insert something between two nodes. When the decimals come close to the limit of floats or doubles, then resequence the whole list. Triggers can assist with this.
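A toy sketch of that floating-point sequence-number trick (invented names; in a real system the logic would live in triggers or a wrapper layer):

# Sketch: ordering rows with float sequence numbers; inserting between two
# rows takes the average of their numbers, so neighbours need not be renumbered.
# (When the gaps eventually get too small, resequence the whole list, as described
# above - that rare cleanup step is left out of this sketch.)
class OrderedRows:
    def __init__(self):
        self.rows = {}                       # sequence number (float) -> value

    def append(self, value):
        seq = max(self.rows, default=0.0) + 1.0
        self.rows[seq] = value
        return seq

    def insert_between(self, left_seq, right_seq, value):
        seq = (left_seq + right_seq) / 2.0   # no neighbour rows are touched
        self.rows[seq] = value
        return seq

    def in_order(self):
        return [self.rows[s] for s in sorted(self.rows)]

rows = OrderedRows()
a = rows.append("a")
c = rows.append("c")
rows.insert_between(a, c, "b")
print(rows.in_order())    # ['a', 'b', 'c']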
Please note that ordering is only significant to a user; one certainly would not bother about reordering lists unless the user wanted to be able to specify the order of items in a list. There are times when the user will need to specify order based on some criteria not contained in the data, so the order must be user-specified. [In what way? What if you want to perform a non-commutative operation on the items in the ordered list, in order? -- AdamBerger]
A bunch of the above seems predicated on the desire for UniformityUberAlles. The line of argument goes like this:
- Tables, as well as other data structures, are equivalent (you can implement one in terms of another).
- Tables are better at X than other structures (for some X, in this case X is AdHocQueries)
- Simplicity of design suggests we have only one data structure (from which all others may be derived); think of all the poor dumb programmers who can't handle tables in the same code as sets or arrays
- Therefore, we should abandon everything but tables.
This argument is fallacious in a couple of ways.
- It can be made for most of the container types on the list! I could argue that lists are the data structure of choice, and give examples of things that lists do better than tables. (Of course, I'd be ignoring the shortcomings of lists - they are fast for arbitrary traversal but suck for random-access, for instance - but tables have similar shortcomings).
- Lists don't scale well, and lack consistency because there are too many different ways to arrive at the same solution for anything beyond a one-dimensional list.
- You're missing the point; I'm not arguing that we should replace everything with lists (even the smuggest SmugLispWeenie wouldn't argue that). I'm pointing out that a similar argument could be constructed for lists; it's just as fallacious. The argument I'm advancing is there is no compelling reason to suggest that we should abandon these "simple" containers and use tables for everything. It would be an AbstractionInversion of the highest order to do so.
- (Maybe you just haven't met one smug enough?)
- As I pointed out above, simple lists and simple dictionaries are not the problem. It is when such structures grow in complexity and become nested (lists of lists, etc.). In other words, we can have a toolkit with lists, dictionaries (maps), and tables.
- Throw in a couple more primitive containers; and I'm with you. Again, nobody is arguing against tables; for many things they are the best tool for the job. Just not everything.
- It would be interesting to see your list of primitives and the justification for each one.
- Consistency is not evil. It is one of the reasons we went away from GoTo's. Lists of pointers are the Goto's of data structures.
- Consistency just for consistency's sake, while not necessarily "evil" (an overused word, for sure), is definitely an AntiPattern. See GoldenHammer, et al. Your reference to "goto" is a complete and utter NonSequitur.
- Tables are perhaps a SilverHammer?, as close as one can get to a GoldenHammer these days.
- Would that be a claw hammer, a sledgehammer, a ball-peen hammer, or a rubber mallet? Even in something as specific as hammers, one finds many types. Tables are certainly useful I'll agree - and it's a pity that they don't appear more often, especially outside the context of a RDBMS (there are many many uses of tables that do not need transaction semantics nor AtomicConsistentIsolatedDurable semantics).
- The "simplicity" argument is unconvincing. As mentioned above; there is a tremendous cost involved in mixing 5V and 3.3V logic on a board - an example from ElectricalEngineering? that top mentioned. However, that doesn't apply at all here; a good case would be that trying to shoehorn a LinkedList into a table is more complicated than having a dedicated LinkedList structure. (One should note that if one is to access a table as a LinkedList; one would be interfacing with a LinkedList abstraction anyway. If I'm writing LinkedList code, I want LinkedList operations - insert at head, insert at tail, car, cdr, etc. I don't want to write a relational query (in SQL or TutorialDee or anything else) to get the first element - to do that would be tremendously more complicated.
- Re: "I don't want to write a relational query to get first element." If you need that often, then just write a getFirst() function or stored procedure to wrap the query.
Regarding your linked list argument, if you implement a linked list directly, you end up creating a structure resembling:
struct linkedListNode {
previous: ^linkedListNode // pointer to another node
next: ^linkedListNode
dataPortion...
}
A direct table version is pretty much the same thing. If you want dedicated functions/stored-procedures/triggers such as car, cdr, etc., then go ahead and make them.
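To make that concrete, here is a toy sketch (my own, with invented names; a Python dict stands in for the table) of the tablized node structure, with car/cdr-style accessors wrapped around it:

# Sketch: a singly linked list stored as rows (node_id -> {next_id, data}) in a
# dict standing in for a table, plus car/cdr-style helper functions.
nodes = {}            # the "table": node_id -> {'next_id': ..., 'data': ...}
head_id = None        # a separate "head" reference, as discussed above
next_node_id = 1

def cons(value):
    # push a new node onto the front of the list: one insert, one head update
    global head_id, next_node_id
    node_id = next_node_id
    next_node_id += 1
    nodes[node_id] = {"next_id": head_id, "data": value}
    head_id = node_id
    return node_id

def car():
    return nodes[head_id]["data"] if head_id is not None else None

def cdr(node_id):
    return nodes[node_id]["next_id"]

cons("c"); cons("b"); cons("a")
print(car())                          # 'a'
print(nodes[cdr(head_id)]["data"])    # 'b'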
[So what you're saying is that if one implements a linked list "directly", one must define a node structure, and write whatever functions (car, cdr, etc.) one wants to use to access the list. While if one implements a linked list using tables, one must define a node table, and then implement "functions/stored-procedures/triggers" to access the list. And the difference is what exactly?]
[EditHint: move discussion related to linked lists to one section.]
To me, the greatest pro of TOP is one actually not discussed here yet: the promise to fulfill the NoPointersIdeal. I'm currently researching this (however as a sideline to my main research on adaptive hypertext). Eventually I'll have more information.
-- MariusAmadoAlves
In what way does TOP get rid of pointers? Even topmind himself said, above, that to implement things like linked lists, stacks, and queues using tables, you need a column containing sequence numbers. In what way do those numbers differ from pointers?
In that they are *keys*. [And so the difference is? Whether you call them pointers, references, keys, indexes, or Spot the Wonder Dog, it doesn't change the fact that you have to do the same operations on them when you insert or delete items in the collection as you would if you were doing the same operation using pointers in C. Tables provide no additional abstraction.]
Tables *do* provide additional abstraction: arbitrary views. The set of possible keys is open. The set of pointers is closed. Consider the dataset of pairs (letter, number)
{(a, 1), (a, 2), (b, 1), (b, 2)}
Put this in an array or linked list, and you cannot declaratively express the subset of pairs with letter 'a'. Put it in a table, and that comes naturally.
[You can't?
filter (\x -> fst x == 'a') input
-- JonathanTang]
I would note that your example does not use a column of sequence numbers. It is not a list, stack, or queue. How does this relate to my questions about pointers?
[But now you've changed the problem. We were discussing using tables to emulate stacks, queues, and lists. Those things have standard interfaces. If you allow arbitrary views, you are no longer modeling stacks, queues, or lists. The fact that those structures don't provide arbitrary views is irrelevant; that's not what they are designed for.]
- Limiting the operations is generally considered more of a security issue than it is a "kind" of structure. In a "big-iron" RDBMS we can use stored procedures for Push and Pop but otherwise prevent direct access to the table(s). The problem with a built-in limit of operators is that we cannot later expand our operators without copying to a different "kind" of structure. It is perfectly common for somebody to ask, "Hmmm. I want to see what is inside that stack". One can pop the entire stack and push it back in if they really want to see what is inside and do something else with it. Thus, we are not preventing much, only making it more difficult.
- Perhaps I should have worded my statement that tables provide no additional abstraction more precisely. Using columns of sequence numbers to enable tables to be accessed as if they were stacks, lists, and queues does not fulfil the NoPointersIdeal, because those sequence numbers are, themselves, pointers.
I don't think the concept of pointers can be completely wrung out of software. Even a US Zip-code is a pointer in some ways. But we can strive to reduce them and reduce the problems caused by them. Also note that a table that emulates a linked list cannot be hacked to gain direct memory access in the way live RAM addresses can, because it is not referencing actual memory. And, it can be used by multiple languages at the same time. By the way, why is a sequence number used to create a stack a "pointer"? -- top
Pointers != "live RAM addresses".
But are all sequence numbers "pointers"?
If they are used to reference elements, they are pointers. Many languages use pointers that aren't physical addresses.
[I prefer the usage where "pointer" is used only for RAM/core addresses, and "reference" is used for everything else, because it's handy to distinguish the two things. E.g. a symbolic link in the filesystem is a reference, but I dislike calling it a pointer. That terminology distinction certainly isn't universal, but I think it's very common.]
The idea of relational is that "references" are there if you want to use them, but they are not the only means to produce views of things. They are simply attributes that describe nouns like any other attribute.
No one disputed that.
My idea is that the references should be made in terms of the abstraction. C pointers are in line with the general C abstraction, because that is the von Neumann machine. That is, the semantic universe of C is the machine model. This is the greatness of C, and what made it snowball.
But we want higher-level models now. That's the whole force for OOP, for example.
By "in terms with the abstraction" I mean, for example, that a vector element should only be referenced by its index, a tree node by its path, a table element by its key. Never by a pointer, which is breaking the abstraction, spiking through it right down to the machine model.
Here by "abstraction" I'm thinking mostly *containers*, but other concepts fit as well.
"pointer"."reference"
.
.
.position
addressaccess.key
(C)(Ada).iterator(TOP)
........................................
machine. abstraction
model.model
--
MariusAmadoAlves
Exactly; so "pointers" are more concrete while "references" are more abstract.
[Hey, nothing like an explanation of relational theory from a guy who knows one data structure. C'mon top, tell us some more about how we're all wrong and you're right.]
The arguments are given above (consistency, scaling, viewability, etc.) If you disagree with them, then I don't know what else to say. They seem to be strong points to me. Maybe it is all subjective preferences. Perhaps if I did not "grow up" on nimble table-based systems such as Xbase (even though the language itself needed work) I would not have this view. Let's hope we all get a chance to use our pet techniques and not be kept from that goal by The-One-Way zealots. Let's hope we can live in a world where FP fans can use FP, where Java zealots cannot keep Smalltalk from Smalltalk fans, etc. I don't suggest that TOP be forced down everybody's throat, anymore than OO should be forced down everybody's throat. -- top [Maybe move this to the TopMind Q&A section in a few months.]
I don't think any of us (at least not myself) want to force OO down anyone's throat. I, of course, cannot take responsibility for tool vendors' hype, but that's marketing and not science.
Something about this discussion has been bothering me for some time now, and I think I see what it is. I think that there is a MetaLevel conflict in TOP's comparison between trees, lists, collections, etc., on the one hand, and tables on the other. While it is hard to say that one type of data structure is 'higher level' than another, there is surely a difference between a structure which is defined in terms of its implementation, such as a linked list or an array, and one which is defined in terms of its purpose, such as a collection. There is often some confusion even with a given data structure type; for example, a stack can refer to any last-in-first-out ordered collection, but it most often refers to the specific implementation via an array and a stack pointer, especially when implemented in hardware.
Whether "data structures" are implementations or interfaces is unsettled debate.
Relational tables are a very high level abstraction, with a multi-level structure and complex intrinsic behavior. They can be implemented in a variety of manners, and in fact a given table may in its lifetime be 'transferred' from one implementation form to another (from a disk-based b-tree to a multidimensional array in memory, to give one common example) while retaining its essential 'table-ness'. It seems to me that comparing a database to a tree or list is like comparing an apple to a carbon atom; the apple may consist in part of many carbon atoms (among others), but the properties of them are radically different.
Even the most MinimalTable is vastly more complex than a LinkedList containing the same data, even if the actual in-memory structure of the data is the same, because the table also has indices, relational links, and so forth. Substituting a table for a lighter-weight structure is not only an AbstractionInversion, but also overkill.
Why is it overkill? See PageAnchor: "virtual_tool" under AbstractionInversion. It is only "overkill" if the drawbacks outweigh the benefits. Are the drawbacks only performance? Maybe we are being "wasteful" on the low-end, but it does not matter. At that scale CPU power is cheap. Perhaps this is kind of like the depression-era (1930's) generation versus those born in the 50's and 60's. The depression-era person may take worn shoes to a shoe-smith to be repaired, while the 60's person will just throw them out and buy another. When it comes to CPU usage, I am of the latter generation. More on this below.
Even in terms of uniformity it fails, because you are treating the table as if it were a list, or a tree, or what have you, rather than using the structures and features inherent in the table itself - in effect, it would be turning TOP on its head, using tables to implement an ad-hoc AbstractDataType hierarchy.
I don't see why it fails "uniformity". Please clarify. Using a table as a stack is not going to have a lot of variation from developer-to-developer.
Conversely, it is entirely possible to implement a MinimalTable abstraction using, say, linked lists, and then using those to construct a NimbleDatabase - in effect, creating a DBMS library. Indeed, this is precisely what would be necessary to implement a TOP system, at least underneath.
I suspect my perspective on this is somewhat different from that of many programmers, because of my background in Scheme; in the LispFamily languages, it is not at all unusual to create a table structure from lists, and then abstract it so that it is effectively treated as a unitary database from the perspective of the ClientProgrammer. While this is done in more conventional languages as well, it is not as common; more importantly, tables constructed this way are easily displayed as tables without having to write a formatted print function, and can be stored in the same EssExpression syntax as the program. A simple example of this can be found in Section 3.3.3 of StructureAndInterpretationOfComputerPrograms. -- JayOsako
So it comes down to EssExpressions versus TOP as the EverythingIsa finalists, eh? My biggest problem with EssExpressions is that they don't narrow down the possible variations enough (lack standards/consistency) and don't scale. They suffer from the typical complaints of "navigational" structures (NavigationalDatabase). For example, look at the different ways to implement many-to-many relationships using EssExpressions. Tables can do the low-end, medium-end, and the high-end with little or no changes. EssExpressions can do the low-end great, the medium okay, but suck at the high-end. I would graph it as follows:
EssExpressions:
Low: ***************
Med: ***************
Hi_: *****
Tables:
Low: *******
Med: *****************
Hi_: *****************
The average for TOP is higher than es-exp's in my opinion, so they are the better general-purpose tool, especially if complexity of a project grows over time.
I rarely see project complexity shrink over time, thus it seems more logical to emphasize the higher end when making a judgment rather than the lower end. If you see dark clouds on the horizon, you should pick the bigger umbrella even though the current rain may be light. -- top
That is just the opposite of what I meant, really. The last example was just that - an example of one way to implement a table in a given language. The points I'm making is that a table is not a simple data structure, but a high level abstraction which can be implemented in several different ways, and that turning around and implementing lower-level structures using tables is counter-productive. -- JayOsako
Why is it counter-productive? I agree with using lists and maps when appropriate for small local usage if the language readily provides them, but nesting/combining them is past the threshold in my opinion. For example, I sometimes use maps in place of named parameters when languages do not support named parameters. -- top
By the time you are scaling to the point where it would make sense to use a table in lieu of a list, array, or collection, you would be better off redesigning the data structure as a full-blown table (or some equally sophisticated data management method), rather than using the table to emulate the simpler data structure. Mind you, I suspect that most people would put that point a lot farther up the scale than you would.
In the end, I am not really disagreeing with your basic arguments - that tables can be used as general-purpose data structures (at least in principle), that too much emphasis is placed on OOP in areas where it is poorly suited, and that relational tools (among others) are underutilized in general - but I think that you are undercutting your own argument with your stridency. You'd do a lot better if you demonstrated your point rather than arguing it, especially considering how outnumbered you are. -- JayOsako
Okay, examples:
- In GraphAlgorithmsWithTables it was shown that tables were better able to weather the addition of weights to the graph (at least as far as representation).
- Under "Code Overhaul Example" in DedicatedStructuresVersusRdbms where it was later decided to preserve order. (This one actually happened to me.)
I would like to say that:
- Querying, Joining is not the function of the table itself. It's the function of DB engine. If DB store data using nested list, there is no reason we cannot use SQL to query and join it.
- Any container (one that can contain more than one item) can emulate any other kind of container, and thus is capable of being a general-purpose structure. All the features mentioned above that make tables look so efficient are features of the DBMS; vendors work their asses off to tune their DB engines. The table itself is just as dumb as an array, tree, or nested list.
- When you say table, lots of features (normalization, joining) are there because you have multiple tables in the DB. So don't assume that a nested list must have one root; we could have multiple nested lists also.
How is the "table engine" going to know if a given EssExpression fits enough table rules to be usable? I can envision a table engine that internally uses es-exp's or one that exports tables to them, but something that can be altered alternately between the table engine and direct Lisp commands may turn into a headache. Does the table engine validate it each time before using it as a table? Or just raise an error along the way as it encounters them? That is going to put more burden on the table engine as far as complexity and speed. And, it is possible it may result in major corruption if the misinterpretation results in scrambled eggs. It may be unrealistic to expect the table engine to detect all possible corruptions of es-exp's (corruptions being non-table-shaped or table-interpretable structures). For example, there may be something that is table-able, but not in the developer expected. The table engine may just "go with it", and create a growing problem. -- top
How is the "table engine" going to know if a given EssExpression fits enough table rules to be usable?
Because it was constrained to be so. How does the "table engine" store a row in a current DB implementation - as an array of bytes? How does the "table engine" make sure an array of bytes fits enough table rules to be usable? It's the same thing. Data is, after all, data; it cannot validate its own consistency. Can your table detect whether the data in each column is of the correct type? No, it cannot. What detects the inconsistency is your DB engine. The same would go if we had a nested-list DB engine.
- The user does not touch the internal representation, but they can tinker with es-exp's.
- The user does not touch the internal representation of the DB because they don't want to. If we provide a DB interface to SEXPs and they tinker with them, it's because they want to. A user can tinker with an EssExpression that will later be fed to the SEXP DB engine, BUT the engine will ignore the malformed ones. How do you think an XML-based DB deals with malformed XML input?
- OR, the user does not touch the internal representation of the DB because they can't access it. If I tell you today that Oracle uses SEXPs for its internal DB format, will you suddenly start tinkering with your Oracle?
- If they won't need or want to touch it, then why bother making a Lisp-implemented table system? It would be better reuse to grab an existing, well-tested one and use it or write adaptors. I suppose if there is already a good table system written in Lisp, then it might make sense to use it, but I don't think that approach will scale to the 100 or so languages in common use: EachLanguageReinventsWheel?. (MinimalTable systems can be written using maps of maps in other languages, so EssExpressions are not the only way, BTW.) Plus, you may lose inter-language live data-sharing ability. And if it has a Lisp-specific query language, you cannot switch to, say, Oracle as easily if you later need to upgrade your power. Actually, I don't see that big of a difference either way, but I think sharability and scalability slightly trump the convenience gained by it being "native". (BTW, if I were going to roll my own, I might use a query language besides SQL and not bother with explicit types.) And es-expr's are ugly. Perhaps if one made a UniversalStatement-based Lisp, it may be more palatable to more people.
but something that can be altered alternately between the table engine and direct Lisp commands may turn into a headache. It may be unrealistic to expect the table engine to detect all possible corruptions of es-exp's
I'm not talking about something tightly integrated with Lisp here. In DynamicStringVersusFunctional? you gave a sample putting TCL code in a DB. How is that not "unrealistic to expect the table engine to detect all possible corruptions of tcl-exp's"?
- The DB is just storing the code snippet, not interpreting it.
- The same: the SEXP-based table engine is just storing SEXP snippets, not interpreting them.
I am not talking about LISP at all; I'm talking about how it's pointless to crown any structure as the general-purpose data structure, since every data structure can simulate every other data structure - be it a nested list simulating a table or a table simulating a nested list.
What's the difference between
VALUES(1, "john", "20040101")
and
'(1 "john" "20040101")
And, please, don't assume that since it's possible to use nested lists, everything must be nested and messy. Is it not possible to create a messy database schema?
- Yes, but it usually violates normalization rules when it is messy. There is no consensus way to determine if nested lists are messy. The techniques for nested list normalization are not well developed. And, there are too many different ways to represent the same thing with nested lists. Consistency is better if it has little or no extra cost.
There seems to be a major miscommunication here, but I have not figured it out yet. Maybe I'll rest it a while and then try again some other day.
The major miscommunication seems to be that your definition of table includes the DB engine in it. A DB is a table WITH a required set of operations. Given any data structure and that same set of required operations, anything will be as useful as a table. That's why I don't see any point in pushing the table as the general-purpose structure, because anything is.
I am confused by "as useful as table". Do you mean representationally equivalent, or as convenient?
As convenient.
I am still not following your argument. Perhaps we should consider it with both the operations included in "table" and without the operations.
Let's consider the difference between a table and a nested list, assuming that BOTH have the same set of operations.
-SQL--  CREATE TABLE employees (id integer, name string);
-SEXP-- (setf employees (create-table '((id integer) (name string)))) ; internally implemented using a nested list
-SQL--  INSERT INTO employees VALUES (1, 'john')
-SEXP-- (insert-into employees '(1 "john"))
-SQL--  DELETE FROM employees WHERE id = 1
-SEXP-- (delete-from employees :where ((emp) (= (id emp) 1)))
-SQL--  SELECT * FROM employees AS emp WHERE emp.id = 1
-SEXP-- (select-from employees :result :all :where ((emp) (= (id emp) 1)))
-SQL--  UPDATE employees SET ...whatever SQL allows... WHERE ...
-SEXP-- (update-into employees ((emp) (...whatever Lisp can do...)) :where ...)
I don't see a big difference or convenience advantage between the two.
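For what it's worth, the table half of that comparison runs nearly verbatim against an in-memory engine. A minimal sketch, using Python's sqlite3 as a stand-in NimbleDatabase (purely illustrative; the table and columns follow the example above):

 # Minimal sketch: the same operations against an in-memory table,
 # with Python's sqlite3 standing in for a NimbleDatabase.
 import sqlite3

 db = sqlite3.connect(":memory:")
 db.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
 db.execute("INSERT INTO employees VALUES (?, ?)", (1, "john"))
 print(db.execute("SELECT * FROM employees WHERE id = ?", (1,)).fetchall())  # [(1, 'john')]
 db.execute("UPDATE employees SET name = ? WHERE id = ?", ("johnny", 1))
 db.execute("DELETE FROM employees WHERE id = ?", (1,))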
Reply above.
Why are people in this forum mentally crippling the concept of a table by comparing it to our primitive state of computer science? Modern DBMSes/SQL/OOP are hardly the pinnacle of technology, each with severe limitations, verbosity and redundancies.
This cannot be fully implemented without a revolutionary new paradigm, one that merges a sophisticated DB, a replacement for SQL, an unbloated language (100% of modern languages are redundant), and an IDE. A RelationalLanguage (better than SQL) should be part of the language. Tables should be part of the language.
Tables are conceptual, whether they are 'persisted' in a DB or in memory as a structure or cached, that is simply a 'persistence' attribute of a table. This could be optimized by the compiler or runtime engine. An array/collection/list/tree/dataset/dictionary/stack/hash/map/set/XML/etc. is just a table with differing properties. All should share the same set-like manipulations (add, remove, sort, save, load, copy, find, filter, query, iterate over, read, write, etc.). It is RIDICULOUS that YourFavoriteLanguage has redundant separate structures for every one of these and redundant operations for each of these often with different names.
Tables are the GrandUnificationTheory? of Computer Science. The problem is our sorry state of tools (like SQL and OOP and .NET/Java and IDEs). I find it amusing when some .NET idiot shows me his classes that read a table (from the DB) into a table (class), translate it to a table (DataSet), then serialize it (save table) into an XML file (a persisted table), with a million identical methods at each stage. Hell, modern languages spend 30-40% of their effort translating/formatting/manipulating identical structures with identical methods.
-- AnonymousDonor
You might be right. Please join the <blatant self-promotion>RelProject and maybe help solve this problem.</blatant self-promotion> -- DaveVoorhis
The PetStore animal menu has a possible example of this. It seems natural to put the description and link info into a map (dictionary) array, especially if we are going to use them for more than one loop or purpose. However, maps don't contain inherent order. If one started out with a table, they could add a sequencing column as needed without overhauling what already exists. -- top
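A minimal sketch of that point (the data and column names are hypothetical): a MinimalTable here is just a list of column-named rows, and the later ordering requirement is met by adding a column rather than switching structures:

 # Hypothetical menu data as a MinimalTable: a list of column-named rows.
 menu = [
     {"descr": "Dogs", "link": "dogs.html"},
     {"descr": "Cats", "link": "cats.html"},
 ]

 # Later requirement: explicit ordering.  Add a 'seq' column and sort on it;
 # existing code that only reads 'descr' and 'link' is untouched.
 for i, row in enumerate(menu):
     row["seq"] = i * 10          # leave gaps for future insertions
 menu.sort(key=lambda row: row["seq"])

 for row in menu:
     print(row["descr"], row["link"])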
Real-world example of Collections vs Tables
I wrote a python program, and after reading about TOP, noticed that several of my classes had the sole purpose of emulating either a record or a complete table, including multiple keys and ad-hoc (tabular flat-file) persistence which you get for free with RDBMS tables.
- I had a class whose attributes were a list of (A,B) pairs, a list of C, and nested dictionary D such that D[B][A] contains a list index which has to be consistently updated. Instead, I could have used a table with columns (A,B,C), and retrieved records using row number or key (A,B).
- I had another class which simply stored a number of attributes (a,b,c,d,e), and instances were retrieved from a persistent dictionary (shelf) keyed by attribute 'a' and needed instances are pre-retrieved in a Python dict (to avoid having to do retrieval in the middle of other I/O operations). These could have been records in a table with key of 'a'.
There are two issues. First, a procedural/relational approach to accessing the table (e.g. via SQL or an HDF library) would maybe save 50% of the lines of code over each partially-table-emulating class, which only takes 100-150 lines of Python.
Second, my program does millions of small operations on data before returning to the user. With RDBMS tables, an SQL query to retrieve a single item is at least several times slower than a native dictionary lookup, and would dominate the running time.
For True Speed, I could have used C++ STL collections, at the expense of a lot of programming time when the algorithms are 'dynamic' in that they do lots of creating and passing around of objects, making it hell to do C memory management or C++ Resource-Acquisition-Is-Initialization.
- SO* Where do I find a "fast as native dict/list/set" implementation of an in-memory relational table? (I don't care if the persistence is slow.)
-- Jettlogic
Generally, if you are processing a lot, then a DB engine will perform well, and if you are processing only a little bit, then it does not really matter in many cases because it finishes quickly either way. Thus, I am wondering what kind of environment you have such that you need to shave speed or resources. Part of the bulk is the SQL language and parser. A simpler or more regular query language may be warranted if you want to shave time.
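One possible option (a hedged sketch, not an endorsement from this thread): an in-memory SQLite table with a keyed column gives indexed lookups without a server round-trip, though each statement still pays query-parsing overhead that a native dict lookup does not. The attribute names follow the (a,b,c,d,e) description above:

 # Hedged sketch: in-memory relational table keyed on attribute 'a'.
 import sqlite3

 db = sqlite3.connect(":memory:")
 db.execute("CREATE TABLE things (a TEXT PRIMARY KEY, b TEXT, c TEXT, d TEXT, e TEXT)")
 db.executemany("INSERT INTO things VALUES (?, ?, ?, ?, ?)",
                [("k%d" % i, "b", "c", "d", "e") for i in range(10000)])
 print(db.execute("SELECT * FROM things WHERE a = ?", ("k42",)).fetchone())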
---
There is this visual interactive programming language whose name I just cannot recall that recently introduced tables as a means to visualize conditions. Can somebody help and add the link? I think that would be a strong answer for the question AreTablesGeneralPurposeStructures.
Re: "Tables are more general-purpose than mixed structures (bags, stacks, hashes, etc.). Tables can represent anything that bags, stacks, hashes, or other data structures can."
I'd challenge this claim with regards to variants (be they values or objects). Tables are, as traditionally defined, limited to tuples of a single structure. A bag, stack, hash, set, etc. can contain variants because their nature does not impose upon the structure of the values or objects they contain.
- Tuples are indeed defined to have a single structure (i.e., have an invariant type), but there is nothing in the RelationalModel (for example) that precludes defining a tuple type to be, say, TUPLE {x VARIANT} or TUPLE {x Object} or whatever. Therefore, with respect to containing variants, a table or relation is not different from a bag, stack, hash, set, etc., except that the variant must be defined as a named attribute of a tuple. A tuple is invariant, but a tuple's attributes need not be. -- DaveVoorhis
- True, you could make tables work with variant attributes. It just hasn't happened in practice. And for a reason: variant attributes require more complex indexing and a far more complex query and manipulation language (turing-complete, to deal with fixpoint recursion types). RelationalModel is incompatible with Tuple x Object, though, since objects are not immutable values while tuples ARE immutable values. You may, otoh, have Tuple x Object Reference. -- AnonymousDonor
- A minor quibble: It hasn't happened in (most?) SQL DBMSes, but it's certainly been implemented in in-house relational collection classes and the like. I've implemented several of these, and so have others. Off hand, I'm not aware of any open source implementations, but they surely exist. Indexing is generally facilitated by constraining the tuple attribute types to an interface or base class that supports an ordinal comparison operation, and the TuringComplete query and manipulation language is typically the implementation language itself. As for Tuple x Object vs Tuple x Object Reference, this too can be constrained (or limited by policy, or just developer caution) to references or immutable ValueObjects. -- DV
It depends on whether you see the "nodes" as being columns or rows. And theoretically, there is DynamicRelational, with flexible columns. Current implementations of RDBMS are often optimized for speed, not structure flexibility. But it does not have to be this way.
Considering a column as a 'node'? I've never done that before. That would make a row into a set of values associated with the given node which, unfortunately, would not be distinguishable from one another (unless there is some other row identifier, like an ordinal position, or at least one column noted as 'header'). Even so, I have trouble seeing how this would handle the issue with variants, especially as it extends into states of virtual objects and such. (...) After thinking about it a bit, I'm quite sure you'd gain nothing by considering columns as nodes... you'll certainly need some sort of header column, at which point you're no better off than you were with a header 'row' at the top of each column in the more traditional column definition... not in any computational sense, at least. Language-wise you'll have an easier time adding features to nodes and essentially calculating header-names if there is more than one header column, but you'll likely have a much more difficult time adding entities and making one entity reference another.
That aside, consider a traditional variant value type: 'Tree A B = Node:(A,Tree A,Tree A) | Leaf:(B)'. The variant is Node vs. Leaf; A and B are immaterial, though they might be primitives, tuples or records, coinductives, or variants themselves. Now, it is very easy to create lists of trees, bags of trees, stacks of trees, etc. You could have a map of trees, but a tree has no inherent name, so you'd need to use an extrinsic one (chosen by the agent working with the tree). To make a table of trees, though, I think you'd need at least two tables - one table representing the collection of trees with identifiers into a table carrying information about each tree (which may also handle the recursion). You've already gone from one collection to two, so one table was unable (via this approach) to store a collection of trees. That means one table was not very general purpose, IMO.
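A minimal sketch of the two-table encoding being described (table and column names are illustrative, not prescribed; SQLite via Python is used only for concreteness):

 # One table for the collection, one for every node; the 'tag' column plays
 # the role of the variant (node vs. leaf).
 import sqlite3

 db = sqlite3.connect(":memory:")
 db.executescript("""
 CREATE TABLE trees (tree_id INTEGER PRIMARY KEY, root_node INTEGER);
 CREATE TABLE nodes (
     node_id     INTEGER PRIMARY KEY,
     tag         TEXT CHECK (tag IN ('node', 'leaf')),
     a_value     TEXT,        -- used when tag = 'node'
     b_value     TEXT,        -- used when tag = 'leaf'
     left_child  INTEGER,     -- child node references, NULL for leaves
     right_child INTEGER
 );
 """)
 # The single value node:("x", leaf:("l"), leaf:("r")) already costs three
 # node rows plus a collection row, spread over two tables.
 db.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)", [
     (1, "leaf", None, "l", None, None),
     (2, "leaf", None, "r", None, None),
     (3, "node", "x", None, 1, 2),
 ])
 db.execute("INSERT INTO trees VALUES (1, 3)")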
DynamicRelational is remarkably more flexible in practice, but not computationally so (since you can have a single table with all the columns that your domain for DynamicRelational might ever use). You really need tables where elements can be tables (fixpoint) to have true general purpose structure.
Also, traditional structures like maps hardwire two columns (attributes) into their design/definition. When you want 3 or more, you face a DiscontinuitySpike that you don't face with tables.
With this I agree. My statement only regarded variant types; it was not a general objection to tables.
How about we explore something specific, a realistic UseCase. The generalizations here are not dissectable as-is.
Use case: You wish to represent an unordered collection of tree-values (that are conceptually immutable, so as to simplify the problem), rejecting duplicates. An individual tree value can represent a computation, a search, a piece of knowledge, a grammatical rule, an unlambda function... quite a few things, really. For a number of these, you'll want the ability to find particular values by features they possess (aka structural patterns). A tree value is of type: Tree N L = node:(N,Tree N L,Tree N L)|leaf:(L). Trees may be arbitrarily complex within this typing constraint. Assume N and L are, for simplicity, strings. They do not name the nodes or leaves.
Use case: A bit easier (due to less component variation): a collection of digraphs. Every digraph is described as (N,E), where N and E are, respectively, a set of nodes and a set of edges. Each node in the digraph has features N', and each edge directionally connects one node to another and has features E'. Assume N' and E' are strings, but are not names. Given a node, you must be able to rapidly find all connected edges. Given an edge, you must be able to find its nodes. Nodes probably will need to be named within each graph so the edges can be described, but the names should not be considered intrinsic to the nodes or the graph... and should not conflict between graphs (so if nodes must be held in a separate table, they'll need to be named by both graph and node name). A 'set' of digraphs would be considerably more difficult due to the NP problem of identifying isomorphic graphs, and so is left out here.
- This appears to be dictating the implementation up-front, such as the kind of structures to use, and thus is not a true use-case.
- There is no specification of implementation above, only of requirements that exist based on the domain: what data must be represented, and which access features that data must possess. (No manipulation features were specified.) You don't need to use (N,E) to represent digraphs; that is, however, how they are described. Any design that meets these requirements is legal.
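One possible design meeting those requirements (a hedged sketch; table, column, and index names are illustrative): nodes are keyed by (graph, node) so names never clash across graphs, and edges are indexed in both directions so node-to-edges and edge-to-nodes lookups are cheap:

 import sqlite3

 db = sqlite3.connect(":memory:")
 db.executescript("""
 CREATE TABLE nodes (
     graph    TEXT,
     node     TEXT,
     features TEXT,            -- N'
     PRIMARY KEY (graph, node)
 );
 CREATE TABLE edges (
     graph     TEXT,
     from_node TEXT,
     to_node   TEXT,
     features  TEXT            -- E'
 );
 CREATE INDEX edges_from ON edges (graph, from_node);
 CREATE INDEX edges_to   ON edges (graph, to_node);
 """)
 db.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                [("g1", "a", "red"), ("g1", "b", "blue")])
 db.execute("INSERT INTO edges VALUES ('g1', 'a', 'b', 'heavy')")

 # All edges touching node ('g1', 'a'), in either direction:
 print(db.execute("SELECT * FROM edges WHERE graph = 'g1' AND 'a' IN (from_node, to_node)").fetchall())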
Use case: You wish to represent a collection of beliefs. Each belief is a datum - a proposition held to have a certain truth value in a set of worlds (or, alternatively, a proposition held to have a certain function of world->truth value), where time is one descriptor of world (so the world of the distant past is different than the world of the present or future). In addition to being a datum, a belief is held without absolute confidence; it is relatively easy to represent confidence in a database as a simple realnum ranging from 0.0 to 1.0; it doesn't even require much precision... 8 bits is plenty since quantifying confidence is such a fuzzy and imprecise thing anyway. For now, we'll also simplify worlds and simply give each one a string name (can come back to this if you pass the first challenge).

Propositions, however, will be more difficult. A proposition is an abstract logic function over a world... e.g. P(x) is true or false depending on what 'x' happens to be in the world. A proposition can be a predicate (... make it a simple string for now ...) over some (possibly empty) ordered set of identifiers (which are world-dependent variables, and may also be strings for now), which for the moment we'll simplify to at-most-two identifiers - that is: Truth:(PredicateName?,Identifier1,Identifier2) (with some null ID). And:(Proposition,Proposition), Or:(Proposition,Proposition), Not:(Proposition), ExOr?:(Proposition,Proposition) (to allow OnceAndOnlyOnce), Impossible:(Proposition) ('I believe X is impossible', which tends towards the inference 'I believe Not:X'), Possible:(Proposition) (I believe X is possible; does not infer 'I believe X'), Probable:(Proposition,Probability) (I believe that X has Y% chance of being true), Contingent:(Proposition,Proposition) (I believe that if X is true, then Y is true).

You have inference rules, too (e.g. 'Not:(Possible:(Not:(X))) => X, with same confidence'), but those needn't be represented in this collection, so don't worry about them overly much (they could be, though... it's perfectly reasonable to have one believe in a certain set of inference rules. However, that adds pattern-matching stuff and meta-level worlds that make the problem remarkably more difficult). However, to use those inference rules efficiently, you need to be able to quickly find propositions that share common substructures in common worlds in order to find: contradictions, beliefs applying to the same targets (identifiers), beliefs applying in the same worlds, beliefs using the same predicates, replications (where two beliefs support one another), abstraction points, and more. My expectations that a typical table's indexing can provide more than a sputtering bit of help on the searches, or that even several would work to well represent the collection (even as well as a straightforward 'set of value objects' or 'list of value objects'), are extremely dismal. The structure needed here needs an enormous variety of indices if it is to meet time-constraint requirements of a modern inference engine.
Is your argument related to "variants", or performance?
- A collection provides access and manipulation, and the characteristics (conceptual and computational) of access and manipulation are the sole determinants as to whether the collection is appropriate for the task. You can't ever, ever, EVER dismiss performance as relevant. The issue here in particular is the representation of variants (a challenge) while under the constraint of maintaining certain (algorithmic) performance characteristics (a far greater challenge). Failure to meet both requirements is a failure for the use of tables in this domain.
- But I would like to separate the issues for discussion. If slowness is still an issue, then we can look deeper for ways to improve it. Otherwise, you might be tossing out something that makes the developer's job easier for the sake of performance, and 20 years from now it may not matter if tables are slower, just like GUIs versus command lines.
- You do know the difference, do you not, between algorithmic performance and speed? I speak only of the former. Faster processors don't fix algorithmic performance issues.
- You haven't shown it broken. If you can prove that ANY AND ALL implementations of relational would be slow for your scenario, then you may have something. Showing Oracle having a bad big-O does not mean a custom-tuned engine could not be built for a particular need; it just means that Oracle's implementation does not scale. Like I keep saying, if you can build a custom-tuned navigational thingy for a particular app, then why can't the other side build a custom relational engine? (And a custom relational language? And if its custom enough, it won't even be relational! All I need to do is sit down and re-implement tables or databases from scratch for each domain, eh? And then they're general enough. Feh.) We are getting away from the topic of "general purpose" anyhow. (Some have argued that it is easier to build a custom navigational DB than a relational one. To some extent I agree, but there are downsides, such as the loss of general-purpose queries. I'm sure the Cyc people like to take inventory of their rules and study patterns in them.)
- I needn't show the representation of variants broken for all implementations of relational (or at least for all existing access and manipulation languages) for all possible approaches to representation, though doing so would be quite generous of me (and probably quite incomprehensible to you). I only need to show it broken for all implementations of relational for the representation approaches you provide, which is considerably easier. Anyhow, if I need to specialize every database and query engine to the particular domain, it's reasonable to argue that tables themselves certainly aren't more general than, say, arrays wrapped in objects... for which I could do the exact same sort of tweaking and specialization on access.
- If you dismiss performance issues and language access issues, your claim should be: "Lists are the only structure you'll ever need! You can do any operation on a list, relational and otherwise! Sure, you need to search from the beginning of the list each time, but performance is irrelevant! Oh, and Lists can arbitrarily support lists, tuples, relations, whole databases, whatever! And if you want, you can write a relational language to access lists! Even better, the implementation and access languages in which you write and access your list is likely turing-complete, so you'll be able to tweak it however you need, rather than having to build a new language from scratch!" Etc. I don't believe this a valid approach. If you wish to prove that a structure will support a domain, you need to prove that it both can represent that which the domain requires and that it can access and manipulate this representation in the manner the domain requires.
Use case: Interactive scene-graph - a collection of object-descriptors to be drawn and animated over a sequence of frames with constraints on ordering and occlusion, plus a feedback requirement: ready access to which objects might be the target of a user's attention when the user points at something on the screen. The collection itself needs only to provide the indexing to make this possible.
I would like to see some sample data. Remember, anything that can be represented as a graph of object pointers can also be represented relationally.
- "relationally" as in more than one table? (oh! I can do with 10 tables that which you can do with one list and a memory address space. Whoopee!). If a table is general purpose, then do it with at most two tables. Anyhow, the relevant aspects of a scene-graph are that you have many geometric shapes, each of which have position and orientation, that must be displayed. The shapes also have animation information, though that can be handled later if necessary. For an 'example' scene, look around your room or office and consider how you'd represent every single object in it as having a position, orientation, and image in a 'scene' that may, itself, be viewed from many positions (e.g. floating camera). Representing the data is part of the problem, with the other part being that tables need to represent it -without- reaching algorithmic complexities above those required by a realtime domain.
- Ten tables may still be better than navigational RAM-pointer hell; (pointers to ten tables each with pointers into the other nine and themselves... I'm not sure you've gained much over one big table called "memory" with pointers into the one big table called "memory". Can you explain to me the difference?) the GO TO of the 2000's. Anyhow, the devil is in the details, and I don't have enough details to formulate a design. Also note that pointers to RAM objects often make it difficult to share the framework among different languages. The obsession with RAM pointers is one reason why GUI kits tend to be language-centric, making standardization and reuse difficult.
Although I am no Cyc expert, initially, it seems one may be able to represent the above with just one table:
(Technically, Cyc isn't involved at all. I hate it when people talking about belief databases start thinking "It's Cyc! It's Cyc!" or start using the word 'Cyc' just to make themselves look a bit more knowledgeable on the subject. This is just a collection of beliefs and an inference engine. Cyc is a -particular- collection of beliefs and inference engine.)
table: beliefs
---------
beliefID
operator // ex: and, or, probable, etc
weighting // 0.0 to 1.0
param1
param2
param3
param4
And I have no problem using generic strings to store parameters which may be numbers in some cases. I prefer using dynamic or type-free languages anyhow such that it does not matter to my app code. If we want to validate numericness for some combinations, we can put in triggers like this:
TRIGGER ON CHANGE TO TABLE BELIEFS(newrec, oldrec)
IF newrec.operator = 'FOO' AND NOT IS_NUMERIC_PARSE(newrec.param2)
RAISE ERROR '2nd parameter must be numeric for op FOO'
END IF
END TRIGGER
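The same rule can also be stated declaratively. A hedged sketch in SQLite syntax, driven from Python (the GLOB pattern is only a rough numericness test, and the trigger name is made up for illustration):

 import sqlite3

 db = sqlite3.connect(":memory:")
 db.executescript("""
 CREATE TABLE beliefs (
     beliefID  INTEGER PRIMARY KEY,
     operator  TEXT,
     weighting REAL CHECK (weighting BETWEEN 0.0 AND 1.0),
     param1 TEXT, param2 TEXT, param3 TEXT, param4 TEXT
 );
 CREATE TRIGGER beliefs_foo_check BEFORE INSERT ON beliefs
 WHEN NEW.operator = 'FOO' AND NEW.param2 GLOB '*[^0-9.]*'
 BEGIN
     SELECT RAISE(ABORT, '2nd parameter must be numeric for op FOO');
 END;
 """)
 try:
     db.execute("INSERT INTO beliefs VALUES (1, 'FOO', 0.9, 'x', 'abc', NULL, NULL)")
 except sqlite3.IntegrityError as err:
     print(err)    # 2nd parameter must be numeric for op FOO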
Where does beliefID come from?
Let's assume it is an auto-number.
and, more importantly, how do we know which propositions are in our collection of beliefs,
Don't the operators determine that? I am hesitant to make a hard-split between them in order to serve future flexibility.
- No, the operators do not determine that. It's one of the larger failings of the representation you selected. Consider trying to represent the belief: "Not:(Tables are General)". Your current representation would look a lot like:
(ID,Op,Wt,Param1,Param2,Param3,Param4)
{ ...
(12345,"Truth",100%,"General","Tables",-,-),
(12346,"Not",100%,"12345",-,-,-)
...
}
- I don't see anything wrong with it. And, I cannot play "guess the requirements" all year. Either lay out all the details and rules at once, or agree to skip this scenario. It is unrealistic to ask somebody to build a Cyc-like engine using half-ass requirements. Also, I think we should move this to BeliefDatabaseExample because this page is getting TooBigToEdit.
- Your representation fails to distinguish proposition from belief. You might be able to argue that: "well, all you need to do is see if a belief is part of some other belief", but doing so doesn't fix the problem (because sometimes you DO believe a proposition that is part of another belief). Further, performing any such check costs O(N) unless you maintain your own index.
- If that is the *only* classification you want (which to me seems limiting), then simply add an "is_belief" flag column and put an index on it. Or make them separate tables. However, it seems this would limit the flexibility of the system, for it is "hard partitioning". Flexibility versus performance is always a design issue no matter what you do.
- Putting a flag there would do the job of distinguishing the two (though it complicates all the queries you might perform as required for the domain). When the world becomes more than a string, perhaps you can add another flag?
- That is, the belief you're trying to represent is "I believe that tables are not general" (represented as beliefID 12346). To do so, you needed to precede this by representing the belief: "I believe that tables are general" (beliefID 12345) as a referent for 12346. So you've just represented an ambivalence (I believe tables are general and I believe that tables are not general). Also, semantically it seems that this table is representing something else entirely (that you want to just ignore): that 12345 is a belief, so 12346 is really a meta-belief (I believe that I don't believe that tables are general), which really does directly contradict 12345. However, that can be ignored... it's just confusing. The real problem is that you needed to represent as a belief that which was intended to just be part of another belief.
- Are these classifications one-per-item, or can a given item have multiple classifications? These are the kinds of details/reqs I don't have. I would be inclined to assume they may be multiple and use a many-to-many table to classify the items. This keeps them flexible for future needs. One-to-one and tree classifications often hit walls when things scale. I would figure that one would want to test arbitrary sets of beliefs against others. For example, OO beliefs against relational-weenie beliefs, or Bob's beliefs against Susan's, or Bob's 1997 beliefs compared to Bob's 2006 beliefs. Thus, categories and ownership etc. of the items should be kept flexible, and a many-to-many table would be in order, plus a query language to dynamically calculate such comparisons as needed. You seem to be hard-wiring your classification assumptions up-front, a common OOP design error IMO.
- You have exactly the requirements. You might not understand them, but you have them.
- It is sloppy back-of-napkin stuff. The problem is your rushjob scribbles, not me. I don't have to be insulted like this!
- You ask about "item classifications". I asked for a collection of beliefs. What classification do you think that is? Each thing in the collection is an effin' belief. You might second-guess the requirements, but they're spelled out quite well: what I demanded (table representing collection of beliefs), what a belief is (a unity of datum and confidence), what a datum is (a unity of world and proposition), what a proposition is (a variant, described in detail above), what a world is (string - name of world, at least until later), and what a confidence is (a percentage). This is a very, very well-defined system. If you want to go more general, yeah, you're free to tag each belief with -who- believes it using either an identifier on each belief (which is bad practice) or (better) a many-to-many table (Bob believes A,B,D with confidences a,b,d; Sue believes B,C,D with confidences b2,c2,d2). I don't care if you go more general... hell, it's wonderful! so long as domain requirements are met for accessing beliefs. But don't give me any crap about not knowing what the requirements are because you keep thinking, "hey, I can add features X,Y,Z that go above and beyond the listed requirements." And NO, you DON'T have to be insulted. Whether you perceive such things as scribbles and insult really is up to you. You can ask for clarification where you are confused and you shouldn't feel stupid because of it - confusion in communication is fundamental to ambiguity in both the English Language and in interpreted context.
- Reply moved to BeliefDatabaseExample, "section B".
- You are to represent ONE collection of beliefs (does "a collection of beliefs" mean something different where you're from?), and each belief is represented with multiple parts including a datum (which is, itself, formed of a proposition and a world) and a confidence level. Further, you have requirements on access.
- I don't mind if you must go to multiple tables. Doing so DOES support that 'Tables are not General Purpose Structures' (even if Databases might be). However, in this case you'll either need to go to multiple tables (I'd suggest: Beliefs & Propositions at minimum) or you'll need to have variant-typed tuple attributes and highly specialized indices.
- I don't have enough requirements details to determine how many tables are needed. And, relational does not rule out "specialized indexes". You keep assuming that relational dictates implementation. It does not. It is a protocol, an interface that does not care whether you use pointer lists or rodents on treadmills to implement.
- Relational constrains both approach and implementation, like it or not. If I use a bunch of navigation-based queries, even if they -look- like relational queries, I no longer am using a relational approach. Instead, I'm using a procedural approach that (even if ideally indexed) probably costs considerably more than straightforward procedural. Relational is allowed to have specialized indices, of course, but if you specialize your indices, you no longer have a general-purpose structure. That's pretty basic. Specialized-purpose is the exact opposite of general-purpose.
and which are simply propositions that are part of other beliefs?
Add a many-to-many table
table: belief_xref
--------------
beliefRef1
beliefRef2
relationStrength
- This table is to represent what?
And why do propositions that are not directly associated with beliefs have confidence values? Propositions don't have confidence values... only beliefs do.
If they are not used, then ignore them.
- I can't tell without massive processing which 'beliefs' are to be ignored. I can't just "ignore them". You show me how to just "ignore them", eh?
Do plain old propositions have 'beliefIDs'?
Perhaps I should have called them "suppositions" instead of beliefs. I don't yet see a reason to make two different tables, but am not hard against it.
What would be the algorithmic cost of querying something simple like, say, all the beliefs with predicates referencing a particular identifier? (I think it will be higher than O(N^2) due to reducing from sub-propositions back up to the belief). Oh, and you lost the world (that one is an easy fix). You've a ways to go yet, and I think you'll quickly violate several "philosophies of relational" that come from Codd (in particular, normalization forms). Your use of triggers to maintain consistency as part of the database is admirable, but I don't recall it being a general property of tables or even of relational databases. I imagine, though, you could say it in a declarative manner.
I will not comment on performance issues, other than saying that if you can roll your own Cyc engine, then one could also roll their own relational Cyc engine and tune it to need. (That depends on what you mean by "relational Cyc engine". I think you'll need to actually explore the approach for such a query if you're to understand the difficulties involved in it.)
As far as triggers, perhaps constraints can do the same thing. The difference between them is mostly a language implementation issue. And I don't see it violating common normalization because there is no duplication from not using columns. (Maybe in fixed-record implementations there is such, but I am not assuming we're using such.)
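To make the earlier query question concrete, here is a hedged sketch of "find every row that directly or through nested propositions references the identifier 'Tables'" against the single-table encoding proposed above, using SQLite's recursive queries from Python. The recursion climbs from Truth rows up through the rows that reference them; that repeated self-join is the algorithmic cost being debated:

 import sqlite3

 db = sqlite3.connect(":memory:")
 db.execute("""CREATE TABLE beliefs (
     beliefID INTEGER PRIMARY KEY, operator TEXT, weighting REAL,
     param1 TEXT, param2 TEXT, param3 TEXT, param4 TEXT)""")
 db.executemany("INSERT INTO beliefs VALUES (?, ?, ?, ?, ?, ?, ?)", [
     (12345, "Truth", 1.0, "General", "Tables", None, None),
     (12346, "Not",   1.0, "12345",   None,     None, None),
 ])

 rows = db.execute("""
 WITH RECURSIVE mentions(beliefID) AS (
     SELECT beliefID FROM beliefs
      WHERE operator = 'Truth'
        AND 'Tables' IN (param1, param2, param3, param4)
     UNION
     SELECT b.beliefID
       FROM beliefs b JOIN mentions m
         ON CAST(m.beliefID AS TEXT) IN (b.param1, b.param2, b.param3, b.param4)
 )
 SELECT * FROM beliefs WHERE beliefID IN (SELECT beliefID FROM mentions)
 """).fetchall()
 print(rows)   # both 12345 and 12346 reference 'Tables', one nesting level apart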
. . . .
I was tempted, but won't include: use-case: text file - an ordered collection of characters over which you perform pattern-searches over subranges. It isn't relevant to the point on variant types, but it does represent another area where tables aren't at all useful (you can represent any list or array as a table, but try putting a subrange pattern-search into any modern query language and your head will be dating your desk in no time).
Relational does not define implementation. (With this I'd argue (speaking as a language designer): Language properties ALWAYS constrain implementation. If a language is to provide relational operators, that property constrains your implementation. Constraint is a valid approach to definition (anything that meets this constraint is relational...), though it isn't specification. So it is only accurate to say that Relational doesn't specify implementation.) A DB engine can be tuned or even custom-built for a particular usage pattern without losing its relational nature (although maybe making some sticky or ugly performance trade-offs). You can argue that Oracle or MySql would be too slow for such, but that does not mean it cannot be done in the absolute. Perhaps what is needed is a toolkit to build such contraptions. That way one can get dedicated performance without losing most of the flexibility of relational, accepting merely slower performance for areas outside of the target usage patterns.
- I'm not against providing extra tools or tweaks for your Tables. For example, indexes over spatial identifiers (e.g. binary space partitioning) is a fine path to tackle one significant fraction of the scene-graph problem in a manner geared for performance. However, you're not allowed any changes that make them into something that falls outside the traditional definition of 'table'. The biggest constraint you're dealing with is the fact that tables only support tuples of simple data, which generally must be of one type for a particular column (though you're free to type everything as a string). Whatever types you allow, they must be supported by your query and manipulation language. What you don't get to do (arbitrarily, at least) is start allowing table-elements to be tables or complex data... if you wish that, you'll need to upgrade your languages to work with these, and chances are the work with these won't, at all, look relational. I could claim: "Tables are good for anything! Look! I put a list as the first and only element in my table, and it, therefore, has all the properties of a list!"... Wonderful.
- Relational does not rule out custom data-types. See DoesRelationalRequireTypes. However, I believe the utility of such is often exaggerated by the requestors, who are just used to type-heavy programming out of habit or personal preference.
- All data representations are, by nature, "custom". All value-types are, by nature, intrinsic to the values. Attributes have types whether the gods of relational want them or not, all bureaucratic concepts of 'require' aside. The problem isn't providing or requiring custom types, but rather requiring or providing particular categories of custom value-types and actually supporting them. The better questions are: (a) whether Relational provides TypeSafety (hugely different from having types), (b) whether the types a query and manipulation language supports are sufficient for the domain to which the language is applied (implementation aside), and (c) whether the types a query and manipulation language supports constrain all implementations in such a manner as to weaken relevant access and manipulation characteristics and make the language largely useless to many domains.
- TypeSafety is a personal preference. Type-free or type-light is Turing equivalent to TS and is an implementation detail. Why are you introducing yet more issues anyhow when we haven't finished the current set?
- TypeSafety isn't a "personal preference"; it's a property of language and implementation. However, whether you like it is personal. And the closest you can get to type-free is only-one-type (e.g. "everything is a string" or "everything is a lambda"). Seeing as strings and lambdas are types, though, you can't escape them entirely. As to the latter question: If you wish to bring variant-typed-values into the language, that imposes a great number of constraints and requirements on the implementation and language both. Many RelationalWeenies seem to think that simply pointing out that you can add custom data-types (e.g. variants) makes problems go away. It won't. Attempting to do so will, more often than not, introduce more problems than it solves, require massive modifications to the language (often to the point where 'relational' operators are a tiny, corner-case aspect), neatly sidestep many benefits from modern approaches to indexing (to the point the tables with complicated values are little better than lists, if that), and generally cause havoc... even given care. The point is still true, though - you can create custom types for tuple-attributes. Just don't pretend that makes problems go away.
- As a note, tables would probably work much better as general structures IF they support other tables or relations as elements (the fixpoint of that concept). The relational query language would need to be a great deal more complex, though, and allow for such things as fixpoint inner or outer joins. It's doable, but it gets complicated right quick.
- Nested tables goes against the philosophy of relational, which basically assumes that nested-ness is merely one viewpoint among many for any given datum. However, one can achieve basically the same thing by referencing a table or row by name/ID without selling out to the hard-nesting gods.
- Where did you get the idea that "nested tables", by which I assume you mean relation-valued attributes, "goes against the philosophy of relational"? Current literature on the RelationalModel allows relation-valued and tuple-valued attributes and provides RelationalAlgebra operators for manipulating them. -- DV
- Which literature are you referring to? Regardless, it does not fit my view of relational, because one is supposed to keep a relative view of data, and nesting hard-wires a nested view. I've yet to see a good reason to use it. [please don't change my quotes to bold unless I put them bold to begin with.]
- Literature-wise, start with TheThirdManifesto, which is the definitive document on the RelationalModel. Indeed, I am not aware of any need for relation-valued attributes at the level of defining a schema, as I believe any relation-valued attribute in a given relation-valued variable can be replaced with an equivalent separate relation-valued variable, but they can simplify certain queries - in particular, those involving aggregate calculations. -- DV
- Do you have a quotation from it?
- Sure. "... attributes of relations can be of any type whatsoever ... It follows that such attributes can be of relation types in particular, since relation types are certainly types; in other words, attributes can be relation valued, meaning we can have relations with attributes whose values are relations in turn." Chapter 2, page 23, "Databases, Types, and the Relational Model 3rd Ed." C.J. Date, Hugh Darwen 2007.
- Allowing and recommending are two different things.
- Of course. The facility is provided for completeness. Whether it is appropriate for a given set of circumstances is another matter. As noted above, relation-valued attributes or "nested tables" are not required in a schema, but their use in partial results can simplify certain aggregate queries. -- DV
- You can't reference another table and still be in the same table. If you must use two tables to represent common things, then tables are definitely NOT general purpose structures. Databases might be, but not tables. As such, I still maintain that: tables would probably work much better as general structures IF they support other tables or relations as elements. That aside, nesting doesn't violate the 'philosophy of relational' nearly so much as it violates the practice. A relation is just a set, after all, and sets are values.
- For many of the problems above, what are really needed are general-purpose containers (e.g. lists or maps or arrays) united with specialized indexes that must be maintained (continuously, but possibly lazily) and that allow virtual or variant-structured values and index on properties that are difficult to acquire by looking at any single attribute (thus entirely defeating the table's basic indexing mechanisms). If tables (especially "a" table) can't effectively represent these values (especially if "a" list of conceptually immutable virtual value-objects does fine) then tables are proven to have a weaker level of 'GeneralPurpose?' than does the list. Similarly, if your query and manipulation language of choice can't effectively specify which extra specialized indexes are needed (which constrains the implementation to being able to provide indices based on some specification), then it's fair to argue that the table manipulation language isn't general enough for the domain. 'General Purpose' doesn't require it applies to every domain, of course... but if you find that tables are bad for domains requiring specialized indices over variant objects, or domains requiring pattern matching over subranges of objects, then you can add a relevant caveat or rule-of-thumb involving tables... e.g.: Tables aren't so great with variants. Tables aren't so great in domains requiring subrange pattern-matching. (What is a subrange, anyway, when dealing with tuples?)
- You have yet to establish this. I suspect you are just trying to find a database-like mechanism that mirrors your favorite programming language so that you don't have to change your thought process. This is not necessarily a "bad" goal, but it may not be universally beneficial.
- I have yet to establish my truth for my statements here because I'm not arguing them... just stating them as fact. There's a big difference. I've already explored these cases, often by struggling to implement them. You have yet to do so. You asked for use-cases to explore, and you have a few. Explore them. Ignore what I've learned by experience if you feel it won't help.
- You have to show it, not say, "I had difficulty putting this into tables, trust me". Anecdotal evidence is not good science.
- :Rolls eyes: I've provided use-cases. Those DO show it, as well as any example can. You explore them and gain your own wisdom.
- No they don't. Your scenarios and metrics are vague. Here is the kind of thing you need to do (at least):
- Foo-oriented programming reduces the volume of code changed by 40% over procedural
- Here is the procedural code sample: xxxxxxxxxxxxxxxx
- Here is the foo-oriented code sample: xxxxxxxxx
- Here is how I counted volume of code (...) which shows a 42% reduction in lines of code and a 38% reduction in tokens.
- This argument isn't about code-reduction (which is enormously dependent on approach and implementation language); nor is it about procedural vs. relational. It's only about whether a table and its access and manipulation language, as a collection, is general enough to support various domain requirements. The only metric is how close it reaches: fails, partial, complete success. Further, given any particular sample, you'll latch onto it, provide a million hypothetical "what ifs", and tell me that I need to prove something about all possible samples. I believe that would be counterproductive. You, as the defender of the table, are free to offer a sample to defend it... whereas I, as the offender, would be expected to provide samples that intentionally undermine your position. Anything I provide, even if it is near optimal, would be attacked simply because I provided it.
In debates like this, one needs to be very very specific. Vague notions usually don't translate well across brains.
- Being specific doesn't translate well across brains, either... not unless both parties are of enough expertise to understand what is being discussed. Right now I'm of the opinion that, even if I did write a proof about use of variants and indices on variants for tables, it'd be about 15 pages long and nigh incomprehensible to you. It wouldn't be productive.
- I can understand quite a bit if it is well written with examples. Anyhow, if you don't want to do the necessary footwork, then I guess we are done. Anyhow, why focus on me? If you can find an *objective* flaw in relational, you could make a name for yourself.
- The basic approach to representing fixpoint variants in tables is usually handled by creating many tables with near navigational-style queries and sub-queries. It gets fugly fast. I've seen bad: like you provided with the 'belief' table, where the meaning of an attribute, and whether another 'join' should be performed, depends on a whole array of attributes (flags, operator-names, etc.). And I've seen worse, where the meaning of a particular row isn't knowable except by knowing -how- you can navigate to it (as you did prior to including the is_belief flag, you couldn't tell which was a proposition vs. belief except by figuring out how you could navigate to it). You close your eyes to the problem, but ultimately what you're doing is using a few relational operators to perform 'goto' operations on data, with a ton of checks (on flags) between gotos to determine where you should goto next. Your relational is no longer meaningfully relational... rather, you have a procedural solution disguised as relational. You've seen it before. You've represented trees in relational and had to figure out how many times to either inner-join with self or how many steps to navigate even when figuring out the extent of the tree, before processing anything meaningful from it. You've ignored the algorithmic performance cost of these operations, pretending that table-indexing fixes a problem of this sort (which is occasionally true if the indexes are specialized over joins, or if joins are cached, but even one self-join can start pushing algorithmic costs up massively). However, feel free to close your eyes to the problem. After all - these problems are subjective. After all, it isn't a problem if you don't run smack into it, which means it isn't a problem if you know how to fumble your way around it (like dressing up procedural as relational) and can therefore pretend it doesn't exist.
For one, tree operations can be added to relational operators so that more can be done declaratively; Oracle includes some. Second, trees are overused in my opinion. General-purpose does not necessarily mean 100% everywhere for everything anyhow. A swiss army knife may not have a flashlight, while Bob's knife does, but that does not disqualify it. And you have not shown an objective performance problem. You provide no clear structures and no clear sample data or queries that are failing them. And for that matter, matching every possible navigational structure in performance is not necessarily required to be called "general purpose". A swiss army knife may not be the optimum wine-bottle opener, but that does not disqualify it. You lack precision and specificity in your complaints still. Further, if tables do degenerate into navigational-like messes in worst cases (not shown yet), that does not disqualify them by itself. You are beating up imaginary strawmen - not just strawmen, but imaginary ones. Your problems with relational are not established, and the reasons why such problems, if true, would disqualify "general purpose" have not been established either. You have to pass through 2 gates, not just one. -- top
I only claim that tables aren't so great with variants (especially the fixpoint-recursive sort). Unfortunately, variants are everywhere, like it or not - almost every domain has them.
- Sorry, I'm not clear on something here: According to the RelationalModel, although a tuple type must be invariant, its attributes may be of any type. This implies that they may be variant, and indeed the TutorialDee specification proposes a type system that permits variants through inheritance. Therefore, I don't see how a set or list containing variants - which I assume is quite acceptable - would be any different, in terms of holding variant values, from a relation or table that contains tuples that contain variant attributes. Of course, to make it relational we must be able to (at minimum) test variant attribute values for equality so that relational operators like RESTRICT and JOIN can be performed, but the notion of "equality" can be generalized to mean anything we like, as long as it can be applied to all the variant values of a given attribute that may be encountered in a given query. If we also provide a comparison mechanism to obtain the ordinality of a given attribute's values, then indexing is possible. I'm not clear, therefore, why you deprecate relations and tables - in terms of holding variants - while sets, lists, stacks, queues, etc., are presumably okay. Why is one kind of collection okay for variants, but another is not? When I raised this a slightly different way above, you merely indicated that (you believed) it hasn't been done in practice, not that it doesn't work well. Is there something I've missed? I can appreciate your points about the difficulties of using tables or relations to represent beliefs or scene graphs - which is in part why we have logic engines and rendering engines in addition to DBMSes, instead of just DBMSes. To a slightly lesser degree I can appreciate the difficulty with trees and graphs, though appropriate use of transitive closure may address that. I can also appreciate that there's little point in "table-ising" a stack when a Stack is what you need. However, using variant tuple attributes in tables or relations strikes me as exhibiting no more problems (and no less) than using the same variants in stacks or lists. -- DaveVoorhis
- The RelationalModel doesn't forbid the use of complex types. Among the types you can conceivably use for attribute values are recursive, fixpoint variant types (like Leaf vs Node in a tree), whole collections (relations, arrays of characters called 'strings', arrays of complex types and other arrays), coinductive types (tuples, records, infinite streams of Fibonacci numbers), graphs (not just representations thereof), functions (Type->Type), procedures (Agent->Context->!(Return Type, if any)), and even process-states (which include sets of continuations and open communications ports). The problem, as might become apparent by examining exactly the sorts of types I just listed, is that handling these things is extremely complex. It's doable, but it isn't easy. Unfortunately, even considering carrying complex-typed values requires serious review of things you've taken for granted all along, even as part of Relational. Among the things you need to carefully consider: equality. Checking whether functions, infinite streams and coinductives, procedures, and processes are equal is undecidable, so simply by allowing these types you've made it impossible to perform tests for equality (at least on columns that carry these). Besides breaking the RelationalModel (since 'sets' cannot be computed if it is undecidable whether two values are different) this neatly shatters almost all forms of indexing that currently exist. Even if you ignore undecidable cases, determining whether two graphs are equal (aka 'isomorphic') is NP. In general, you'll need to toss equality except for certain value-types, and you'll need to figure out how to index on something else entirely. E.g. for graphs, you can index on their vertex count, edge-count, completeness ratio, subgraph-patterns, Hamiltonian nature, etc. etc. etc. For infinite streams, you can index on their first N values... and, perhaps (if you can examine their structure) on their overall patterns. In general, the point of this paragraph is that: you need to be careful that you don't break things simply by adding types; every part of any type-system affects every other part. If you do break things, you also need to concern yourself with the how to fix them.
- I'm aware of the problems of computing equality on certain recursive types, graphs, etc. They can't easily be indexed, stored in a tree, or hashed, either. However, this does not preclude storing values of such types in a relation, any more than it precludes storing them in a list. There are still numerous cases where it is reasonable to house values of such types in a relation, and I have done so. For example, a DBMS I developed stores operators (aka stored procedures) in a relation-valued variable. Obviously, it is not feasible to index on the operator itself. (Quick! Write a method to compare two functions for equality!) Fortunately, there is no need to do so, as the operators are functionally dependent on a trivially indexable attribute. This does not break the relational model because the primary key is a related attribute, not the operator itself. Of course, certain caution must be observed (or a constraint imposed, or an exception thrown) so as not to project out the operator on its own, as the inability to compute equality would potentially result in a non-relation. Similarly, we can't RESTRICT on the operator, or JOIN on it. But the same problems would hold true if we tried to shove the operator into any garden-variety Set. Thus, as I noted before, the problems of storing certain types in a relational collection are no better or worse than those of a variety of other collection types, but it may be appropriate to use relations for any number of reasons. -- DV
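For what it's worth, here is a minimal Haskell sketch of the arrangement just described (the names are illustrative, not DV's actual DBMS): values whose equality is undecidable - here, functions - stored in a collection keyed on a trivially comparable attribute, so no equality test on the operators themselves is ever needed.

 import qualified Data.Map as Map

 type OpName   = String
 type Operator = [Integer] -> Integer    -- no Eq instance, and none is needed

 -- The "relation" of operators, keyed on the operator's name.
 opCatalog :: Map.Map OpName Operator
 opCatalog = Map.fromList
   [ ("sum",     sum)
   , ("product", product)
   ]

 -- Lookup restricts on the key column; two Operators are never compared.
 apply :: OpName -> [Integer] -> Maybe Integer
 apply opName args = fmap ($ args) (Map.lookup opName opCatalog)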
- The use of variant-typed attributes in Relational, in particular the recursive but finite sort (type is fixpoint, not value), is more readily doable. So long as all other types have equality defined, these have equality defined, so you don't accidentally break a ton of things. The problems with these arise more in practice than in theory largely because there is no clear mechanism by which they can or should be indexed or processed. Indeed, the only obvious means of indexing is by equality, and by doing so you've certainly gained nothing over the standard 'set'. In practice, you've gained nothing at all by using the table over the list or set... not without specializing it by some means. Unfortunately, there is also (in practice) no obvious way to specialize the tables... not using the languages available to us through the DBMS. But, in practice, there are obvious (albeit often horridly tedious and error-prone) ways to index lists or arrays of the variants (the language providing the list has such facilities, e.g. wrapping a list such that all commands to it filter through an object that tracks all its iterator-indices). Language addendums to process the variants are needed, too... e.g., one variant value might carry foreign-keys to a huge number of tables (including itself), where -which- other table is referenced varies based on the variant; practical use in a database will require rapid association between these sub-components and the associated tables without getting caught up on the specifics of computation over variants. I don't believe this impossible; I think that if someone was willing to push hard enough, it could happen in practice for the common DBMS, too. However, consider what top and others seem to consider 'state-of-the-art' when representing trees and common variants: interning their nodes into the table, creating a million artificial names (one for each node), etc. The approach described here hasn't happened. It won't happen unless implementors choose to recognize the problems with the current approach... the unnecessary hassle it creates, and the fundamental wrongness of value-representations that depend upon naming external pieces. It won't happen unless DBMS implementors both learn what types are and choose conceptual correctness and ease-of-use over benchmarks. I'd certainly hope AdaptiveCollection takes the theoretical high-ground here, rather than forcing users to represent trees in some totally obtuse, unnatural, inverted manner as is common for tables.
- With this, I largely agree. -- DV
- You said that: "To a slightly lesser degree I can appreciate the difficulty with trees and graphs, though appropriate use of transitive closure may address that." Transitive closure (aka recursive-self-join) provides a convenient means to navigate through all the nodes until you've constructed complete trees (which any sane representation would have as complete in the first place). Unfortunately, transitive closure doesn't fix the real problem, which is access and indexing for the domain. It's more correct to say that the need for transitive closure indicates something else is broken. DatabaseIsRepresenterOfFacts; a row should be a fact independently of how one might navigate to it. Joins are mechanisms of associating one fact with another... and shouldn't need to be used as means of getting to the fact in the first place.
- Though this is not my area, I believe some work on transitive closure is intended not to construct a complete tree, but to provide a means of navigating the structure. Despite being somewhat of a RelationalWeenie, I am the first to admit that trees and other graph structures are poorly handled by the relational model. However, at least for data management purposes (which is quite distinct from programming purposes, but that's another topic), I'm not aware of an alternative model that doesn't devolve into navigational, i.e., network or hierarchical, structures. These (at least for data management purposes) have been soundly rejected, and with good reason. -- DV
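Roughly speaking, a transitive closure over an edge table is the fixpoint of repeated self-joins, which is approximately what a recursive query facility (e.g. SQL's WITH RECURSIVE, or Oracle's hierarchical operators) computes for you. A minimal Haskell sketch, with an invented (parent, child) edge representation:

 import qualified Data.Set as Set

 type Edge = (Int, Int)   -- (parent node id, child node id) rows of a hypothetical edge table

 -- Repeatedly self-join the accumulated pairs against the base edges
 -- until no new ancestor/descendant pair appears.
 transitiveClosure :: [Edge] -> Set.Set Edge
 transitiveClosure edges = go (Set.fromList edges)
   where
     go acc =
       let new  = Set.fromList [ (a, c) | (a, b)  <- Set.toList acc
                                        , (b', c) <- edges
                                        , b == b' ]
           acc' = Set.union acc new
       in if acc' == acc then acc else go acc'

For example, transitiveClosure [(1,2),(2,3)] yields {(1,2),(2,3),(1,3)}: every ancestor/descendant pair, which is the "navigate to the whole tree" step the surrounding text is debating.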
I am not sure what you are calling a "variant". Do you have a formal definition we can use?
- Yes. Variant in this context is short for variant-typed. For values, which are immutable, the type is not temporally varying, but, rather, varying in structure and legal interpretation. You have seen variant-typed structures before... some examples are the C/C++ union, the boost variant (from which I draw the name), Haskell Datatypes using the '|' vertical bar (Node N T T | Leaf L), EssExpressions for which different symbols prepend different structures, and XML schemas for which there are one-of-many component switches. More conceptually, they also exist wherever you'd perform a 'switch' or a bunch of 'if-elseif-else' statements always testing the exact same pattern (dependently typed variants). CategoryTheory-wise, variants are the dual of the mathematical tuple: a tuple is the product of at least two types (objects in a category), whereas a variant is their coproduct. A tuple is one-of-each, while a variant is one-of-many. A<-P->B (from the product P you can project out A or B) vs A->C<-B (from either A or B you can construct the coproduct C). CategoryTheory possesses the strongest formalization via use of the coproduct, but most people have an easier time understanding it as the C/C++ union.
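Since Haskell datatypes keep coming up, a minimal sketch of the kind of variants being discussed (type and constructor names invented for illustration): the '|' separates the alternatives, and pattern matching plays the role of the 'switch' mentioned above.

 data Shape = Circle Double            -- radius
            | Rect Double Double       -- width, height

 data Tree a = Leaf a                  -- the recursive (fixpoint) variant:
             | Node (Tree a) (Tree a)  -- a Tree is one-of {Leaf, Node}

 area :: Shape -> Double
 area (Circle r) = pi * r * r          -- each alternative is handled differently,
 area (Rect w h) = w * h               -- i.e. the 'switch' pattern described above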
- Object-inheritance-graphs also form variants because you can produce the parent-type (the variant) with any one-of-many derived types, but these are only variant in practice if you actually try to figure out which derived type something comes from via explicit or implicit use of a downcast operator. (Implicit use would include virtual functions.) The difficulty of representing variant-typed objects in the form of highly virtualized object trees is part (but not the whole) of the cause for what many people call the ObjectRelationalImpedanceMismatch.
- Let me reword the question: How does one tell if "variants" are being used in a relational structure? Your definition seems based on types, but multiple types are not necessary for structures or programming. How about this: if there is one column that describes or scopes the meaning of other column(s).
- Your overall summary: "one column describes or scopes the meaning of other columns" is essentially correct, though the "one column" part might be a little too specific. ("At least one part scopes meaning of other parts" might be the generalization you're looking for.) The relational experience of representing a variant will almost always be that different columns are interpreted (and programmatically handled) as having -different meaning- (or relevance) based on values found in other columns (flags, strings, etc.). Your insistence that multiple types are not necessary is still misplaced (the types necessarily exist the moment it becomes necessary that you handle the values differently... but you are no type theorist, I assume, so I'll let it go).
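To make the "one column scopes the others" pattern concrete, a hypothetical Haskell sketch (field names invented for illustration) of how a variant commonly gets flattened into a table row, next to the sum type the row actually encodes:

 -- The flattened-row view: a tag column scoping the meaning of the others.
 data PaymentRow = PaymentRow
   { payKind :: String         -- "CARD" or "CHECK": scopes the columns below
   , cardNo  :: Maybe String   -- meaningful only when payKind == "CARD"
   , checkNo :: Maybe Int      -- meaningful only when payKind == "CHECK"
   }

 -- The variant the row represents, stated directly.
 data Payment = Card String | Check Int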
- Nobody has yet presented a compact and unambiguous definition of "type". (Some argue such is not needed, but I find that a cop-out.) It is an overloaded term that can be bent to mean just about anything, such that it means almost nothing. It has similar "name space" problems to "object" in that regard.
- There are many compact and unambiguous definitions of 'type'... just no one definition upon which everyone agrees ^_^. Similarly, there is no compact, unambiguous, and unassailable definition of 'object' or 'home' or 'family' or 'identity' or 'self' or 'value' or 'meaning'. Primitive concepts, in general, lack definition except through their use. I can offer you a mechanism for recognizing variation in types that is consistent with -every- type theory: the types of two things (values, objects, whatever) are different when it is wrong to use one thing the same as you use the other. In computation, this translates directly to (and from) TypeSafety being a guarantee that a program performs no 'wrong' (aka 'undefined') operations. While most often one refers to computation types in terms of structural error (e.g., attempting to access the '.whatever' property of a value that lacks any such property), it can also refer to semantic error (e.g., adding a distance to a volume), or even emotive wrongness (it is just 'wrong' to wear that shirt with those shoes! - if so, then there must be two types of clothing, such as 'dance' and 'grunge', such that it's wrong for one type to go with the other). In computation, of course, emotive error isn't recognized as such. Anyhow, despite the lack of crisp definition, 'type' certainly means something, and it's definitely something different than 'value', 'object', 'home', 'identity', and 'self'.
- Informally, a "type" is a description of a collection of values and the operations that may be performed on them. For example, in a typical programming language, the 32bit "integer" type is the set of all 32 bit values, plus the arithmetic operators, conversion operations, and so forth that may be applied to that set of values. Very informally, a "type system" describes how values and operations may be organized to form types. -- DV
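As a small illustration of that informal definition (the types here are invented for the example, not drawn from the discussion): the value set and the permitted operations together are what distinguish one type from another, which is also why "adding a distance to a volume" above counts as a type error even though the underlying representation is the same.

 newtype Metres = Metres Double   -- same underlying values (Double),
 newtype Litres = Litres Double   -- but different permitted operations

 addDistance :: Metres -> Metres -> Metres
 addDistance (Metres a) (Metres b) = Metres (a + b)
 -- addDistance (Metres 1) (Litres 2) is rejected by the compiler:
 -- same bits, different type, different permitted operations.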
- Well everyone has a "notion" of types, but those don't help much for deciding borderline or controversial situations.
- I think it's more accurate to say that there are varying views on which type systems are good, bad, or indifferent. The "notion" itself is clear. -- DV
- I have to disagree. We've had squabbles about this before. Nobody could clearly define "types" in just a few paragraphs. They were either long or fuzzy.
- Lol. A 'type' is whatever an actor distinguishes as requiring different action. That's short and clear enough to someone who knows what an actor is and the significance (and utter simplicity) of "distinguish". It's in explaining it to non-theorists like yourself, who don't understand such things (or just wish to 'squabble', as you say) that the explanation becomes long. A long, fuzzy explanation, though, doesn't change the definition... and even a -wrong- definition can be compact and unambiguous. (e.g. a type is a rabid squirrel - compact, unambiguous, and wrong).
- Not understanding a concept and not being familiar with academic terminology are possibly two different things. Anyhow, I bet at least one of the words in the definition leaves a lot of wiggle-room WRT interpretation. I'd bet money on it if my wife let me. One problem here is "action". Declarative data sets may have no inherent action; but clearly they can have types. That is, some indicator that tells what "type" something is.
- "I bet at least one of the words in the definition leaves a lot of wiggle-room WRT interpretation"... and with that, you're right where I said you'd be: just wishing to 'squabble' ^_^. Playing devil's advocate is fun, but it does get old fast when you latch onto petty things. Anyhow, what you call "declarative data sets" are just transistor-charges on a block of silicon or chicken scratches on a page until an agent starts interpreting them, are they not? You don't even get to "characters" or "bits" until you start distinguishing one charge or chicken-scratch as relevantly different from another. You seem to think that data can be represented meaningfully when separated from the agents that interpret it, but you've never given the concept any serious thought. While types in programs are measured by what the interpreter or compiler must meaningfully distinguish, types in declarative data sets result from how the programs that process the data choose to meaningfully distinguish. Data that is just a length of 'raw bytes' (single-typed) to one agent might be a sequence of unicode characters (in utf-8) to another agent and might be a rich, recursive XML structure to yet another agent. Returning to the 'variant' typed data, agents processing the meaning represented by variant-typed data values must (to avoid being in error) interpret different variants differently - in particular, to be correct, they must interpret the values as defined by yet another agent: the one that attached semantic meaning to certain representations.
- I find the "will execute" criterion problematic and arbitrary. (Funny. I never listed a "will execute" criterion.) A given data set may never even get around to being run. Or, one may not know if it will. (Technically, a piece of code that never runs exhibits TypeSafety by definition. Proving that it never runs is the problem.) Consider a temp or contractor programmer preparing data sets for other companies. During the preparation, he/she may not know by whom or how the stuff will be processed. Our job is to describe the data with sufficient detail. We are describers, not executors. (You're speaking from the standpoint of the language designer - the agent that decides how to give form to meaning. The form you choose determines what an agent must distinguish to properly comprehend your meaning - to correctly interpret your language. Nonetheless, some agents can process the data you represented as raw bytes without ever looking at your meaning... and, to them, the type of the relevant chunk is still 'byte', and your intent to the contrary doesn't stop it.) And there are gray areas around one variable possibly affecting another. For example, a car license plate number and the state (CA, TX, etc.) of the plate. And, maybe, the year issued. Is "state" a type? Is the year issued a type? (If the agent must take different action based on different states, then yes.) Both of these may affect the encoding of license plate numbers. Whether the "user" of the data will parse plates to that level of detail, say for forensic reasons, we may not know. Will it be used for basic reports, or by FBI forensic experts? Is it only a "type" if the interpretation varies based on other items? Your definition gets us stuck in Heisenberg/quantum-like conundrums where the observer/user/process can change the nature of a given data set from non-type to type by merely looking at it. (That is correct. Whether a license plate is just a bunch of pixels in a JPG, or a bunch of characters arranged at (x,y) coordinates that must be dropped into a database, or a collection of particular strings described by meaning rather than position, with a date that must be parsed out of a couple of letters and numbers (e.g. record {state:String,number:String,expiration:Date}), is really up to the agent. If the agent must act differently after stopping cars with local-state vs out-of-state plates, the plate itself indicates a variant. The line between type and value can be shifted as needed.) Rarely is anything 100% independent. Dependency is continuous and relative. Data-mining experts may find relationships that you had no fricken idea even existed. (Which isn't an issue. Pattern-recognizer systems, as used for data mining, essentially have just one type: pattern. They identify and categorize patterns. There is only one type of pattern (for them) until such time as they choose to act differently for different patterns... e.g. if they decide to give extra processing to what they've decided (or been told) to classify as a 'number pattern', then there are at least two types of patterns.)
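A small sketch of the point made in the parenthetical replies - the same plate data at three levels of interpretation, depending on what the consuming agent must distinguish (names illustrative; the year field stands in for the expiration date to keep the example dependency-free):

 import Data.Word (Word8)

 type RawBytes  = [Word8]      -- agent 1: just a blob of bytes
 type PlateText = String       -- agent 2: a string of characters
 data Plate = Plate            -- agent 3: a structured record to act on
   { plateState  :: String     -- "CA", "TX", ...
   , plateNumber :: String
   , yearIssued  :: Int
   }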
- You made it sound almost formulaic to determine "type". Now you seem to be retreating into dissecting human neurons and intentions instead of studying things. (One can formulaically describe 'types' expressed in any given language, for it is the same set of types that an interpreter of that language must distinguish in order to properly interpret expressions of that language. However, language-types are just one type of types, with unique qualities in the form of externally determined meaning or specification and the fact that all language-expressions are values. Another type of types, for example, is object classification, where no external authority exists that can tell you the 'one true way' to view sensory input as an object. Any definition for 'type' must be general to all types of types or you risk confusion when applying the word across domains. One definition that meets this requirement is: A 'type' is whatever an actor distinguishes as requiring different action.) (Oh, and never mistake 'actor' or 'agent' for 'human', though humans are actors and can be agents.)
- "A 'type' is whatever an actor distinguishes as requiring different action." That's an unusual definition, which smacks of actor theory, but from what little I've read of it (and that's very little) I recall no mention of types. Do you have a reference? I'm not saying it's wrong, and intuitively it seems correct, but I'd like to know the origin of that definition. -- DV
- The wording is mine, and is a result of my own study on the subject. Of course, I swallowed a bunch of type-theory books whole before I came up with it.
Tree operations might help; I'm not familiar with them, and would need to know what conceptual and computational access they provide. You lie about providing "no clear structures": several use-cases (tree, digraph, belief) - (Now you are going into your rude mode again. I honestly find your descriptions vague and poorly structured, and I swear to every known deity and supernatural force that this is how I perceive them. I am not making that up. I think you need a lesson in technical documentation, but this is not the place. Above you admitted you were not familiar with Oracle's tree operations, yet claim that your tree example is slam-dunk evidence. This appears to be a contradiction. (If I were rude like you, I might be tempted to call it a "lie".) See also: RelationalAndTrees. (I'll check it out.)) - ...provide structures as clear as you'd see listed as Datatypes in Haskell! I can't respect your statement that these aren't "clear" enough. Admittedly, scene-graphs aren't so clear, but they aren't in practice either... all you really ever know is that there are rather arbitrary geometric shapes (which is why I made a challenge of it, since lists or arrays of virtual objects, plus a bp-index, works fairly well). As far as sample-data and queries that fail: failures due to variants are failures of algorithmic complexity. It's difficult to show with just a little sample-data; rather, I need to ask you to pretend that there are a few million records, that some of the variants are thousands deep (recursively), and then demand you find certain properties. You and I probably think on different levels, anyway... sample data is how I perform tests, but not how I think about or understand structure. If you wish to run the use-cases, create your own sample data.
I cannot evaluate the practicality of your test scenarios if I am not familiar with the domain. I could not say usage pattern X of your belief gizmo is more likely than pattern Y, etc. Of course, you could probably find scenarios where a navigational structure is faster. That is not in dispute. "General-purpose" does NOT mean "always best for all operations".
- To me, "general-purpose" means: usable across a wide variety of programming domains. I agree that tables are general-purpose in that sense. However, among the domains they are not general to are those involving variants, which is also a wide variety of programming domains. You react as though I've been saying: "Tables and relational are bad! Look at this one thing they can't do!" I'm not. I'm saying: "Tables and relational are bad at this one thing they can't do!" That might sound similar, but the meaning is not at all the same. Now, I do have a few more 'one things' (e.g. ordinal subrange pattern matching and, thus, parsing support is another 'one thing'), but that can be handled later. Sometimes, relational and tables (or whole databases, really, since all too often one table needs to become two or three) are the right tool for the job... but I'd never agree that they are as general a structure as the humble list.
I'm sure you can construct a generator of trees or beliefs or even some simple scenes (boxes and spheres and cones, perhaps). If you feel utterly confident in your ivory tower, then don't bother testing your precious tables against any of the gauntlets I provided. My understanding is from experience, use-cases provided from sandboxes I've played in before. It's been a LONG time (years) since I actually did all that, but I still trust the experience and my recollection of the representations I used, queries I built, and calculations I performed at the time. As far as 'objective performance problems', I can only offer you algorithmic ones under optimal conditions based on any particular choice of representation. After all, even if I did generate sample-data and run benchmarks, you'd stand like a righteous bastard and shout out: "Lo, and is that particular relational engine optimized and tweaked and twiddled to your domain? For if not, it is invalid!
Again, how come only YOU are allowed to tweak for a domain? And again, not always matching navigational performance does not disqualify something. General-purpose ANYTHING is rarely optimum for everything. - Hmmm? What gave you the impression that only I am allowed to tweak for the domain. I have the feeling you've written this based on a false impression. I say only that if you -specialize- an implementation to the domain you no longer have -general-purpose-, period. Enabling new features that apply across many domains is quite legal... which would include such things as indexing on "less-than" and "greater-than" as well as "equal-to".
And Lo, is that sample-data realistic? For if not, it is invalid!"
Is this not an important issue? Like I said, of course there will be situations that favor one approach over the other. - It IS an important issue. However, my overall point is that you'll just be looking for any excuse to entirely dismiss my efforts. Indeed, the fact that you just said: "Is this not an important issue" implies that you were inclined to do so.
To me, this is proven.
If it's only proven in your head, it is not usable to those on the outside. Maybe you are just not a good relational designer. It is hard to tell without specifics. (In my head, eh? Other people have experienced this and reported it as part of ObjectRelationalImpedanceMismatch. It seems Oracle users experienced it to the point that they've added special operators to deal with the problem. However, I don't know how to better prove it to you except by allowing you to experience the same problems I encountered.)
I'd love to see a solid disproof... something I could use in the future when I encounter these problems again. But I've never seen it. Tables are good for many things (e.g. anything one could put in a CSV), but horrible for collections of recursive variants. Databases can do collections of recursive variants, but even they can't do it well. To you, my statements are simply a challenge... or, as you call it, "a bunch of imaginary strawmen" defending a castle in the sky. Bah. For better or worse, you're biased, mister TableOrientedProgramming. With a name like that, who wouldn't be? Come if you wish, build yourself a gauntlet in my sandbox and test whether your faith in your beloved tables is well placed. I'll gladly help if you're confused, where what is clear-as-day to me (who knows the listed domains) seems fuzzy and obtuse to you (if only because you lack experience in them).
I will agree there may be classes of problems where tables don't shine so well, and perhaps your Cyc-like scenarios are examples of such. But that does not necessarily make tables "non-general-purpose". (Just means they aren't general to domains using fixpoint variants, does it not?) Like I said above, there may be times where a knife with a flashlight built-in is better than a Swiss army knife without a flashlight (they usually don't have one). Yet, most would still call it a "general-purpose" tool. As far as the classes of problems where tables and relational may not do so well, perhaps. Maybe ways can be devised such that extra relational operators can be added to deal with them. But maybe nobody does it because it is not a common enough niche. We may be able to tweak/customize relational rather than toss it altogether for some navigational contraption for such problems. You make it sound all-or-nothing, like "I tried Oracle and it didn't do well, so I am gonna invent the Adaptive-Diaper-Doo-Dad and ignore relational 100%."
Eh? My Adaptive-Diaper-Doo-Dad will have relational built right into it. Relational is still a valuable tool, and it's good for at least half (primitives, semantic-typed, product-typed) of the whole set of commonly useful basic value-type categories (primitive, semantic-typed, product-typed, coproduct (or variant), structure-abstracted collections and sets). It needs to be paired with something to make the whole good for the rest... just trying to figure out what that something is. I think the three big things tables don't handle well are variants (esp. the fixpoint recursive sort), cross-component ordinal subrange pattern matching (e.g. if you were using a whole table to represent a list or string or tree, and needed to find a pattern that is spread across many related components), and cross-component group pattern-matching on the whole collection (e.g. matching against a set).
....
Hmmm... RelationalAndTrees seems to attempt to handle one special case of tree structures that exists solely within one table. It solves only a small part of the problem: the fight with relational to get access to the whole tree within a collection. It will handle some variant types, but it is rather easy to simply build a slightly more complex variant type just to trip it up. Fortunately, most variant types used in practice -are- simple enough to be handled by this. Unfortunately, all it can do is the easiest part: locating one tree for examination. It is telling that you fight with relational just to accomplish this part of the problem that is easiest when working with a more traditional list of trees.
I was rather hoping for something more impressive.
Perhaps it was not clear from the parameter description, but joins can be done first to bring in multiple tables.
One can be 'evil' and make variants that would reference four alternative tables based on column, each of which have variants that reference one of four tables... in one huge recursive mess. That'd make transitive closure rather difficult, I'd imagine. Which joins do you do first? (of four possible tables... do you do all four?) And would it help?
Partial Agreement Regarding Pointer-Hopping
Perhaps we don't really disagree that much after all. I will agree that there may be certain classes of problems where, to use tables, one has to do a large amount of procedural iteration and "pointer hopping" from node to node (record to record), where declarative queries are of less use. In such cases table-centricity may indeed suffer from practical and theoretical (big-O) performance problems compared to more direct address-pointer solutions. This we can both agree on for now. We don't need to argue these any more.
Now, it may be possible to custom-tune a relational engine such that the common operations of the domain can be done declaratively instead, yet still be "close to the metal". But, this perhaps disqualifies it from being "general-purpose". But the alternatives need a lot of custom fiddling also. If you are building custom pointer handlers then you are essentially building a kind of database from scratch. Thus, neither qualifies as "general-purpose" under this view. (Your "adaptive" thingy only exists on paper, so it is hard to say how much setup work it requires. And perhaps a pointer-centric relational kit can also be built.)
But I claim the need for such pointer-hopping is either not that common, or only a fraction of the total data usage pattern for most applications. Even if some of the app requires lots of pointer hopping, that does not mean that all does. Declarative queries may still be needed and useful for other parts of the app, and perhaps a majority. And one may sacrifice pointer-hopping performance in order to get the other RDBMS benefits (ad-hoc declarative querying, out-of-the-box concurrency, integrity management, import-export utils, off-the-shelf report writers, etc.)
Indeed there may be niches where the pointer-hopping-centric performance is the critical path (bottleneck) and the other things mentioned above are not important enough to give up these critical path performance needs. But like the flashlight-knife versus Swiss-army-knife analogy shows, having everything plus the kitchen sink is not necessary to qualify as "general-purpose".
Perhaps you just work more on the pointer-centric problems, and thus judge "general-purpose" from that perspective. You need a flashlight-knife and thus the Swiss-relational-knife lacks a key item for those kinds of projects. A career military professional is going to view "general-purpose pickup truck" differently than a commercial construction manager. Our key difference is perhaps in the perceived frequency of such problems (or the importance of the pointer-hopping portion of the app). Without formal surveys and studies, it is hard to say whose experiences are more representative of the typical or most frequent apps. If we agree that this is the only real point of our differences, then we can AgreeToDisagree on this issue and move on.
In the long-run I would be inclined to suggest the industry try to create relational tools or kits that help address the needs of pointer-hopping-centric apps both by tuning the performance for such activities, and by adding more graph and tree traversal-friendly functions/operations so that larger chunks can be done declaratively. I've yet to see any evidence that such is not possible in the absolute sense. The lack of a push for such to me is evidence for the small size of that niche.
--top
I'd think it's more that there are already other specialized tools and data-management for these (rather broad) 'niches'. I.e. rendering engines and inference engines and functional programming languages, Cyc, OWL, RDF, Haskell, etc. Consider that RDBMS is viewed, at large, as a storage utility (as opposed to a volatile data store or runtime processor) for business domains. It's awesome at that (again, rather broad) 'niche'. People who want tools for other 'niches', however, rarely go first to the RDBMS.... and thus the appearance of demand for such facilities is minimal.
But are there any such tools that you'd consider "general-purpose"? Or do they mostly roll-their-own? Perhaps they avoid RDBMS because they don't have much exposure to them in school or work, not because they won't do the job. Many degree programs skip RDBMS unless you are an IS major instead of CS. CADD may be an exception. Performance tests have shown commercial RDBMS noticeably slower than custom-built CADD thingies. But then again, custom-built will probably always be faster for known usage patterns. But it still may be possible to put a relational interface on top of such a contraption.
- No, except for the GeneralPurposeProgrammingLanguages and perhaps the overall OperatingSystem (which I don't think should be partitioned anyway... LanguageIsAnOs), there are no tools I listed that I'd consider "general-purpose", including the RDBMS. I'd consider, however, such things as relational operators on collections, transparent persistence, cross-cell transactions, and cells with history and rollback to all be general-purpose utilities, and would love to find them in a general-purpose language. Those have value that cross many domains, and their presence would make a language more general (and bigger... but, hey, can't have everything). However, for better or worse, the RDBMS is an engine specialized in its purpose.
- Anyhow, you ought to keep in mind that these people who "don't have much exposure to the RDBMS in school or work" are not getting it -because- it isn't much relevant to their work. Certainly relational operators and tables might be more popular if they were found in a general-purpose language alongside lists and sets, but when people just need collections (quick!) they don't go to the RDBMS. It isn't their general tool. Ever. Unless they are RelationalWeenies, of course, and can write ODBC code with their eyes closed. Often, the RDBMS is simply where the dead objects live, and it's more the "DBMS" than the "R" that they're looking for in that case - the transparent persistence and backup support unavailable from the OS or language.
- I'll consider belief-databases "general-purpose" the moment they can write my essays for me. ^_^
Trees as represented in RDBMS are useful when representing real structure of real entities as facts within a database. DV mentions (well, more like 'alludes to') this above in his description of the real purpose for transitive closure. Where trees within tables becomes utterly evil is when used to represent structure of values that are, themselves, just part of a fact (and are not facts in and of themselves). E.g. for a table of propositions, the table doesn't stand for "this IS a proposition" (because a proposition, being a value, is a proposition whether or not it is represented in a table); it instead stands for "and here's where I store the proposition that is referenced from elsewhere". I hope you can see where I'm coming from on this. Violations of DatabaseIsRepresenterOfFacts might be okay when merely leveraging a table as a structure to store data... but not when moving from table to RDBMS.
Re: "Tables are easier to scale, both in size and complexity, without having to overhaul access code, because they are close to a one-size-fits-all structure."
And they are, proportionally, more difficult to design... especially when it comes time to provide, maintain, and utilize indices (which are the only means by which tables can compete with the more traditional heaps, arrays, maps, hashes, etc.). If they were easy to add, all general-purpose languages would provide implementations.
I found tables so easy to use and change in ExBase that one was tempted to use them for everything. It shows that if you design a language around tables instead of the other way around, they can be quite nimble. (And even ExBase had areas it could improve in.) Tools like MicrosoftAccess also can be nimble, but I find it makes it tougher to port existing work from drag-and-click to scripting when you want finer control. It has a mouse-to-scripting DiscontinuitySpike that traditional ExBase didn't.
I'd love to have good tables sitting in my STL toolkit and made 'nimble' via some sort of template-driven query and manipulation language. Hasn't happened yet. Getting tables INTO the language is the first DiscontinuitySpike you need to deal with... which is, essentially, what the above italicized statements say (or were intended to say... I suppose you can interpret it as saying that tables are difficult to design for the domain, whereas the intent was to say that tables are difficult to design into the language and compiler).
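Not STL, but as a speculative Haskell sketch of what such a 'nimble' in-language table might look like (purely illustrative, not an existing library): rows kept in a map keyed by primary key, with a secondary index maintained alongside so that lookups on the indexed column avoid a full scan.

 import qualified Data.Map as Map

 data Person = Person { personId :: Int, name :: String, city :: String }
   deriving (Show)

 data PersonTable = PersonTable
   { rows      :: Map.Map Int Person     -- primary-key index
   , cityIndex :: Map.Map String [Int]   -- secondary index on the city column
   }

 emptyTable :: PersonTable
 emptyTable = PersonTable Map.empty Map.empty

 -- Inserting a row keeps both indices in sync.
 insert :: Person -> PersonTable -> PersonTable
 insert p t = PersonTable
   { rows      = Map.insert (personId p) p (rows t)
   , cityIndex = Map.insertWith (++) (city p) [personId p] (cityIndex t)
   }

 -- A "query" on the indexed column: no scan of the whole row set.
 byCity :: String -> PersonTable -> [Person]
 byCity c t = [ p | k <- Map.findWithDefault [] c (cityIndex t)
                  , Just p <- [Map.lookup k (rows t)] ]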
Things like maps and garbage-collection were once expensive and exotic, but now are commonplace. The implementation age for nimble tables may yet come.
I'm hoping we'll skip straight to AdaptiveCollection, which generalizes the advantages of tables.
AdaptiveCollection is a goofy mess.
By which metric?
I consider EssExpressions the only comparable competitor. I'm a "conceptual" admirer of LISP, but not a practical admirer. See also: RelationalLispWeenie. -top
See also: GeneralPurposeProgrammingLanguage, MinimalTable
JulyZeroSeven
CategoryDiscussion CategoryCodingIssues CategoryDataStructure