Cross Tool Type And Object Sharing

One of my complaints about types and objects is that they are difficult to share across different tools and languages, at least without some kind of pre-defined conventions or standards. The more type-ish or OOP-ish the construct, the more difficult and involved the sharing becomes.

I'd welcome suggestions for getting around these limits, or perhaps agreement that they are an inherent part of types and objects. Markup languages and delimited tables are usually just easier to share (and partially aided by standards such as ODBC and HTTP).

[Eh, "sharing across different tools and languages" is difficult for ANY system without using pre-defined conventions or standards - you can't even share "plain text" across tools without conventions or standards. Markup languages and delimited tables ARE standards for serialization of structured information. If you like 'standards' for serialization of object-structured information, consider YamlAintMarkupLanguage or JSON as a serialization medium. If your goal is service sharing, such that the object must remain at its home, then you would need something like CORBA just as you currently use ODBC for DBMSs... though HTTP is also well enough suited for the service sharing between object systems (each object = URI; one can get/update/delete/etc. the object).]
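As a concrete illustration of the JSON-as-serialization-medium point (a minimal sketch in Python, using only the standard json module): structure survives the round-trip between tools because both sides agree on the JSON convention, while language-specific notions such as object identity do not.

 import json

 # One tool writes a value using only the shared JSON convention...
 record = {"id": 42, "name": "foo", "tags": ["a", "b"]}
 wire = json.dumps(record)

 # ...and another tool (perhaps in another language) rebuilds an
 # equivalent value. Structure survives; object identity does not.
 copy = json.loads(wire)
 assert copy == record         # equal by value
 assert copy is not record     # but a distinct object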

I don't know that OOP makes this more difficult than anything else. My LISP code won't work in Javascript, COBOL won't share with PHP, Pascal and VB don't see eye-to-eye, and so on. CORBA (particularly for objects) and standardised WebServices are perhaps a step in the right direction, and XML was briefly (and naively) touted as the universal solution. Currently, DBMSes (and their standard interfaces like ODBC) using canonical primitive types do tend to provide a common point of contact -- at least for many enterprise business applications.

In terms of advanced type support in DBMSes, DateAndDarwensTypeSystem implies the potential for being shared across languages -- particularly because of the explicit elimination of any notion of "object identity" (there are only values) -- but this doesn't diminish the inherent complexity of implementing the system in each host language or environment. This is also an issue for CORBA and objects. Of course, once it has been implemented on a given platform, the issue essentially goes away.

The success of CSV, HTML/XML, and HTTP is largely because they are text-based and type-light.

[I think this claim unjustified. You certainly lack evidence to attribute their success to their being 'type-light', and the fact that there are many text-based and type-light structured data formats that haven't achieved 'success' suggests that being text-based and type-light isn't 'largely' a cause for success. As far as what 'type-light' means when working with XML Schema and such (which are pretty heavy on formalized structure and validation), I've no real clue.]

It's an opinion. But let's approach it the other way and look at the top successes of type-heavy and object-oriented sharing techniques and standards. I don't see much success in this area. CORBA is probably the best try, but has been a yawner.

[I dunno... in my opinion mime-types, codecs, XML with their schema, Yaml resolved type-tags, WSDL, etc. all seem to indicate the type-heavy approaches are alive and part of present and future state-of-the-art. Data types are by nature assertions that data has certain structure and properties. Validation of these structures and properties at runtime is essentially TypeChecking. I believe your implied assertion that 'XML' is 'type-light' was already in error. Indeed, I suspect the 'success' of CSV/HTML/XML is in part because they were MORE type-heavy than both unstructured opaque and ad-hoc structured plain text alternatives.]

[Today there is some awkwardness because there is no 'intervening' TypeSystem between tools. Support for well-typed FileSystems is not at all impossible. Today's requirement is that you work with rather opaque strings, forcing every application to share or build a library for serializing to and from structured files, using unreliable dot-extensions (.ext) as type-tags in the filenames; the end result is awkward, inefficient, ad-hoc, and painful... but apparently still better than unstructured, unvalidated, insecure input. The desire for typed filesystems with structured data clearly exists; it is why dot-extensions and XML and such are so successful and pervasive. I fully expect that, at some point in the future, the FileSystem will store structured data - not opaque binary, but rather be reflective and fully apprised of this structure - allowing rapid access to and manipulation of well-typed attributes that can be coerced to strings and back for viewing and hand-editing (as well as supporting transactions and versioning and similar features). A typed FileSystem would in turn pave the way towards supporting typed communications (typed FIFOs, for example), typed process IO, and well-typed workflows (so, when you issue a command-line piped workflow, it can be checked for safety at that time), allowing for far more optimal workflows implemented with far fewer serialization and parsing pains between processes.]

[Over time, cross-tool types will become ever more pervasive and less awkward. Programmers have made clear their desire for types since they started using dot-extensions. Dealing with opaque strings is painful and ad-hoc, as is repeating the IO effort between each application, so programmers want to do much less of it and automate as much as possible, creating standards like XML and YAML to carry their structured data, and dedicated libraries to work with it. They'd prefer to be lazier yet and not need to include libraries just to perform basic typed input and output. They'd prefer some greater optimization for space and speed, especially in pipelined workflows where the output is serialized to string just to be parsed again by the next process. Besides, type-safe command lines would just be darn cool, doubly so if the use of types was integrated into the command predictions and tab-completion (so 'mplayer <TAB>' lists only the files mplayer can accept). The reason it hasn't happened yet has a lot to do with inertia: almost every FileSystem tool would need to be rewritten; language libraries would need updating if they are to treat files as more than BLOBs; programs would need to specify the sort of IO they expect. Also, a TypeSystem would need to be chosen - which is likely to create something of a battle between the FunctionalProgramming guys (including me) and the ObjectOrientedProgramming guys (including the majority of everyone else). Nonetheless, while the effort to make it really work is staggering, it will happen. Someday. But, until then, we'll just limp along with structured text, schemas, mime-types, codecs, and unreliable '.ext' type tags embedded in filenames.]

There's also the type-light/free camp that fights the bloat and over-engineering they see/perceive from the OO/type-heavy camps. We will fight for the PowerOfPlainText and dynamism. (Perhaps such techniques are industry-specific or app-specific such that each industry will settle on the best match. I won't necessarily dispute that, I just don't want another zealot to stomp out my favorites without 100% proof.) -- top

A "zealot to stomp out my favorites ..." Huh? Have you ever had a favourite stamped out?

[Hmmm? Sounds like a fool's battle to me. You may believe you're somehow resisting bloat by fighting for PowerOfPlainText and dynamism. In truth, you are actually fighting to maintain waste and bloat by resisting OnceAndOnlyOnce factoring of structured data management into the space between utilities. The actual result of your favored camp's approach is kludgy, awkward, inefficient, and chaotic. The need for structure is part of the EssentialComplexity of computation, and forcing tools to use SimplySimplistic types (like strings or BLOBs) just forces programmers to 'hide' the structure inside the simpler type. This has the following consequences, all of which can readily be observed in our history and in present day:

[In any case, you're free to 'fight' this if you wish... all you need to do is find solutions to the problems that types solve without using equivalent types. If you don't, your entire camp's resistance is hardly going to be felt relative to the massive inertia from such systems as Linux and Windows today and the GoodEnough patchwork solutions (e.g. XML). We - and I can only guess, but I'd bet that most systems engineers and OS designers ARE in the camp opposite yours - will just keep on pushing in the direction we see the greater advantage... albeit at a sedate pace.]

[Anyhow, you needn't fear; your "favorites" will be there in that brave new world of typed FileSystems. If you desire to zealously stick your structure inside 'plain text' and parse it back out, we won't coerce you. Instead, we'll coerce the type. Personally, I can't imagine why you'd want to (except for those circumstances where you're serializing across the network or to a tool not designed for the new FileSystem), but I wouldn't stop you from doing so.]

You seem to suggest I am for parsing. I am not. Ideally the "atoms" would be clearly delineated such that no parsing would be required. In practice there is no such standard yet. ODBC is probably something closer to what I have in mind, but it lacks some dynamism. Text is merely the low-hanging solution right now. --top

[Databases only help you because they already support the more complex structure. E.g. every relational database supports (at least) a map of sets of maps of strings... and even that 'minimal' relational database rarely avoids the need to perform parsing (e.g. the moment it comes time to add integers together or check to see if one date is larger than another) and occasionally introduces a bunch of extra complexity when you attempt to avoid parsing (example: if you want tree-values or set-values in a column domain, you could represent them by identifier reference to a table... but then equality testing and comparisons of tree-values becomes a royal pain, as does cleanup).]

['Parsing' is gathering structure from a simpler type. If you are to avoid parsing, then the data must be presented to you (when you receive it) as the more complex type. I don't think you can have it both ways: you can't avoid both types and parsing unless your domain is full of very simple problems. Attempting to avoid both parsing and types is a PipeDream. In many domains such SimplySimplistic notions will bite you in the ass and tear a few chunks out. Anyhow, going back to the FileSystem: many people think the FileSystem should be replaced with a DataBase. I don't really disagree. But either way, you'll be: (a) doing parsing (integrating parsers/serializers with every tool) OR (b) doing types (inside those atoms & across communications) OR (c) both OR (d) finding yourself a job that doesn't involve programming. Your choice.]


Sounds like we may be on the verge of LaynesLaw-ing on "types" again.

[Not likely. We aren't engaged in a philosophical battle over definitions. Neither of us believes that being the one to define 'WhatAreTypes' would be significant to the argument. The notion of supporting values of declared structure in filesystems and databases is well distinguished from the notion of supporting nothing more complicated than strings.]

Strawman: "just strings". That is misleading. Validation techniques and other attributes-about-values are perfectly possible, and without adding complicated syntax/requirements, but rather ReUse of the existing schema system for consistency and concept conservation.

[I didn't say "just strings". And I agree, validation techniques can be applied... but there is a critical point to make: you need to validate against something that tells the validator what a 'valid' string looks like. That 'something' has all the necessary properties to be a type-descriptor. The ability to perform validation requires types. And you won't magically escape the need for the normal requirements of types in order to use such techniques... though I don't believe "complicated syntax" is a prerequisite for using types - that's just you being a type pessimist.]
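To make the bracketed claim concrete, a minimal sketch in Python (the descriptor format is invented for illustration): whatever artifact tells the validator what a 'valid' string looks like carries exactly the information of a type-descriptor, and applying it is TypeChecking.

 import re

 # A hypothetical descriptor: it says what a 'valid' value looks like.
 # Structurally, this is a type-descriptor.
 zipcode = {"base": "text", "maxLength": 10, "pattern": r"^\d{5}(-\d{4})?$"}

 def validate(value, descriptor):
     """Check a string against a descriptor; validation is type-checking."""
     if descriptor["base"] != "text":
         return False
     if len(value) > descriptor.get("maxLength", float("inf")):
         return False
     pattern = descriptor.get("pattern")
     return pattern is None or re.match(pattern, value) is not None

 print(validate("55414-2029", zipcode))  # True
 print(validate("not a zip", zipcode))   # False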

Re: "The ability to perform validation requires types." - That's a rather bold claim. Care to justify it in another topic? --top


PageAnchor: Compound_Element_Types

Regarding the complex-number issue, I will agree there is somewhat of a need for "compound element" types/values; and that using ID's to reference and represent such compound values as tables can be a bit obnoxious. Let's see if we can work something out. For the sake of representation, let's represent them with curly braces. Ex: {123, "foo"}. The individual elements (let's call them sub-elements) can then be specified and constrained just like "regular" types/columns. (Later I may work up an example of how these could be specified in a standard way using FlirtDataTextFormat.)

Where the exchange format/system ends its obligation, however, is using expressions and/or a TuringComplete "engine" to define operators or validations. (One can add it as an extension or add-on, but it should not be a core standard.) There should only be basic optional validation such as range, size, required status, and perhaps basic character sets.

--top
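A rough sketch of how such compound elements and their basic optional validations (range, size, required status) might be checked, in Python; the constraint vocabulary and the tuple representation are invented for illustration.

 # A compound element {123, "foo"}: sub-elements constrained like columns.
 complex_spec = [
     {"name": "amount", "base": "integer", "lowRange": 0, "required": True},
     {"name": "label",  "base": "text",    "maxLength": 8},
 ]

 def check(value, spec):
     """Validate a compound value (a tuple) sub-element by sub-element."""
     for item, rule in zip(value, spec):
         if item is None:
             if rule.get("required"):
                 return False
             continue
         if rule["base"] == "integer" and item < rule.get("lowRange", item):
             return False
         if rule["base"] == "text" and len(item) > rule.get("maxLength", len(item)):
             return False
     return True

 print(check((123, "foo"), complex_spec))  # True
 print(check((-5, "foo"), complex_spec))   # False (below lowRange)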

[You've recognized that rejecting structure leads to much kludge and OnceAndOnlyOnce violations, but aiming your solution at such a trivial example as complex numbers has produced a simplistic half-measure. Composite 'flat' types as you propose do extend the set of domains in which a Relational Database can conveniently be applied, but said solution would still end up repeatedly punishing everyone who needs deep-structured values.]

Example?

My Complex number example was intended as a mere starting point, a trivial example that simply illustrated the kind of problems inherent in exposing the internal representation of a type via tables in a schema. What about types represented as a tree, graph, or lattice? What about types that require an internal representation involving varying numbers of elements, varying types of elements, or varying relationships between the elements on a value-by-value basis? What about complex (possibly procedural) constraints? What about distinct types that share a common internal representation, but that are distinguished by their operators and/or constraints?

Data exchange formats are not a problem -- we already have those, e.g., XML, YAML, CSV, etc. and assorted ad-hoc representations. The problem lies in accurately reconstructing type definitions at the various communication end-points of a system. This is not solved by simplistic data exchange formats. As noted above, a type may be more than just the data structure used to represent a given value. This is only solved by creating type definition standards that (a) do not lose any information about the type definition when they are shared; and (b) do not unnecessarily expose users to internal representations that may (quite appropriately) vary depending on their location in a system.

[Good points, as usual. But Top, I'm discovering, is incapable of abstract thought or he'd already know all of this from prior discussion. After all, you had even named types represented as trees in the very same paragraph he was responding to, and yet he can't even think of it; he needs an 'Example'.]

[Top, consider ordered tree-values used as primary keys to a column in that same situation where complex numbers were used. What I want to be able to do is say: "myTable WHERE tree1 = tree2" or "myTable WHERE tree1 = Tree(Node(1, Node(Node(2, 4), 3)))" or "INSERT INTO myTable(tree1,tree2) VALUES (Tree(Node(7,22,Node(11))), Tree(1,Node(2,4),3))". I also want access to component operators: "myTable WHERE contains_pattern(tree1,Node(*,Node(2,*),3))". What I want to be able to do is define trees and their operators OnceAndOnlyOnce so that I don't need to build them into each query. However, in a DBMS not supporting types, my options are limited. One way to obtain what I want is to put every tree-value into a string & use parsing for every operator; this is a solution with plenty of its own problems (esp. when it comes to operators over structure), but it is not the solution that you'd promote. What you promote (based on discussion here and how you've responded to past challenges) is more analogous to breaking down 'Complex' into two columns: essentially creating a separate 'Nodes' table... perhaps:

  TABLE Nodes                                           TABLE Nodes
  --------------------                                  -------------------
  ID      Integer Autonum                               ID        Integer Autonum
  Value1  Integer                                       Type      Char     // 'N' for Node; 'L' for leaf
  Value2  Integer                                OR     IDParent  Integer  // parent node
  Value3  Integer                                       Position  Integer  // ordered position in parent (for ordered trees)
  Value1Type  Char    // e.g. 'N' for node              Value     Integer  // for leaf
  Value2Type  Char    //      'I' for integer           PrimaryKey (ID)
  Value3Type  Char    //      'x' for unused            Unique(IDParent,Position)
  PrimaryKey(ID)

[Each of these solutions has disadvantages. The one on the left requires building the tree procedurally from the bottom up (so you have the Node IDs for Value1, Value2, Value3), while the version on the right requires a top-down approach (so you have the IDParent). The version on the left has a severely limited count of nodes (just three), but potentially allows for sharing structure (e.g. Node(1,2,3) could be shared among many trees). The version on the right will generally require a unique structure for each tree (since the structure must be unique to IDParent). But enough about the trees. In this case, you'd use:

  TABLE myTable
  ---------------
  tree1  Integer
  tree2  Integer
  ForeignKey(tree1 into Nodes)
  ForeignKey(tree2 into Nodes)
  PrimaryKey( ... actual tree value of tree1 ... ?)

[Hmmm... I've already encountered a problem. I can't find a reasonable way to state my PrimaryKey as being unique based on the tree structure rather than on the tree identifier. That's somewhat upsetting, but since I know that Top doesn't give a damn about 'protection' like that, I'll let it slide for now. What matters to me is OnceAndOnlyOnce, SeparateIoFromCalculation, and otherwise avoiding kludge. 'Pet Views' aren't the only thing that matters to me; just as important are difficulty of forming queries to insert, update, delete, and request data based on the tree-value, and the ability to include the 'Tree' type as a foreign key into other tables (should the need arise).]

[Top, take any one of the tasks I wish to perform - e.g. "myTable WHERE tree1 = Tree(Node(1, Node(Node(2, 4), 3)))" - and find a solution that is just as elegant and reusable as this typeful one, except where your solution uses a Nodes table instead of 'Tree' typed cells. You can make your own Nodes table if you wish. Can you do this?]
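To make the contrast concrete, a hedged sketch in Python with sqlite3 (following the right-hand Nodes variant above): with tree-values, equality is a single built-in comparison; with the Nodes-table encoding, every equality test must first reassemble each tree by chasing surrogate IDs.

 import sqlite3

 # With trees as first-class values, equality is one expression:
 t1 = (1, ((2, 4), 3))
 assert t1 == (1, ((2, 4), 3))   # structural equality, for free

 # With the Nodes-table encoding, the tree must be rebuilt from
 # surrogate IDs before it can be compared to anything.
 db = sqlite3.connect(":memory:")
 db.execute("""CREATE TABLE Nodes (
     ID INTEGER PRIMARY KEY, Type CHAR, IDParent INTEGER,
     Position INTEGER, Value INTEGER)""")
 db.executemany("INSERT INTO Nodes VALUES (?,?,?,?,?)", [
     (1, 'N', None, None, None),   # root: Tree(Node(1, Node(Node(2,4), 3)))
     (2, 'L', 1, 0, 1),            # leaf 1
     (3, 'N', 1, 1, None),         # inner node
     (4, 'N', 3, 0, None),         # inner node
     (5, 'L', 4, 0, 2),            # leaf 2
     (6, 'L', 4, 1, 4),            # leaf 4
     (7, 'L', 3, 1, 3),            # leaf 3
 ])

 def load(node_id):
     """Recursively reassemble a tree value from its surrogate IDs."""
     typ, val = db.execute(
         "SELECT Type, Value FROM Nodes WHERE ID=?", (node_id,)).fetchone()
     if typ == 'L':
         return val
     kids = db.execute("SELECT ID FROM Nodes WHERE IDParent=? "
                       "ORDER BY Position", (node_id,)).fetchall()
     return tuple(load(k) for (k,) in kids)

 # 'myTable WHERE tree1 = Tree(...)' now costs a full reassembly per row:
 print(load(1) == t1)   # True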

Are we talking about exchange systems or RDBMS? I have no problems with the idea of adding traversal and graph-node operations to a relational query engine. I've even proposed some SMEQL-like operations of my own.

This section is specifically about representing types in table/relation/relvar-based systems, which certainly applies to DBMSes but could equally apply to other systems as well. The underlying concept -- the undue complexity forced by lack of appropriate user-defined type support -- is applicable to any language. While adding traversal operators and graph-node operations to a relational query engine would certainly improve the query engine, it does not completely address the issues that are introduced by a lack of true type-definition facilities. Merely having (tree?) traversal and graph-node operators does not address the issue of, for example, a type that requires a large number of varying elements that happen not to be a tree or graph. Only true type-definition facilities can eliminate (for example) the complexity of duplicating type operations (such as testing for equality) in every query against a collection of table attributes (or even whole tables) that represent a single type.

[Apparently we wrote responses at the same time again. Yours is as thorough and accurate as any I could offer, and I agree with your statements. Mine, below, is also to Top.]

[Both RDBMS and Exchange, Top. Look carefully at the example: sending 'Tree(Node(1, Node(Node(2,4), 3)))' as part of an 'insert' is exchange. Using it for the query is more relevant to the RDBMS. I don't believe the two can be so easily divided as you seem to believe they can.]

[I don't mind if you add traversal and graph-node operations to a relational query engine (after all, I'm asking for solutions)... BUT I feel like you're just assuming (cue visual of Top waving hands) that such a utility would somehow save the day and make everything better. I don't believe it. My intuition is that actually trying to use these graph-node operations to perform a simple equality operation will still be kludgy, will still violate OnceAndOnlyOnce across queries (repeated syntax), and will probably be difficult to share or optimize, too. Please show me how it would help out, since you believe it would. You like concrete examples, and I offered you concrete example problems. Can you show me how this solution you suggest solves them?]

This appears to be a case where the domain needs a tree-oriented query language to communicate tree updates between applications. It also smells like a "lab example", i.e. somewhat artificial. Yes, I do want an example, meaning something realistic from the real world. If that makes me bad somehow, so be it. I am a bad person who wants an example to test the practicality of this.

[Tree and graph values are useful for representing composite identifiers in almost any system (especially if you want to follow the ZeroOneInfinity rule), feature associations for a fuzzy memory engine, problem-solution pairs for memoizations, pattern-transforms for data-driven optimizers, component-features and functions in an RDBMS-based GUI engine, and so on. If you were to consider whichever forces introduce a 'complex number' problem, you'd quickly see that vectors and matrices and sets are all useful types for comparisons. Trees in particular are just one instance of the more general problem you're seeking to ignore by calling it a "lab example".]

[Why don't you tell me more about how this "tree-oriented query language" of yours will magically solve all the problems? I've given the subject some thought myself, and while I believe traversal operators are a fine idea, I'm still convinced they leave unsolved the problem presented above. Why don't you show me your version of "myTable WHERE tree1 = Tree(Node(1, Node(Node(2, 4), 3)))" using this "tree-oriented query language" of yours?]

In practice, there may be some tree-ness to a given app-to-app transfer, but it may be specific enough that we don't need a general-purpose tree-query-language and can make a little domain-specific sub-language (or sub-convention) to do a somewhat specific update/transfer. For example, updating some branches in file folder directories usually does not need a general-purpose tree query language. A dummy file name generator plus a DOS script generator could do the trick. Tree-oriented query languages have been re-invented multiple times, but have never really caught on because the need is not that common. I've seen a lot of biz apps over the years; for I've been in the industry since the mid 80's, before PC's took over as the domonent biz platform. [This topic is TooBigToEdit, my spailchecker stopped working on it.] --top

[Yes, I know all about how you'd prefer to repeatedly create a complex system of scripts and applications to solve problems that could be solved OnceAndOnlyOnce in the DBMS. Hey, I've got an idea in the same vein: we don't really need a DBMS... let's just use flat files to store the data and DOS batchfiles to update them! Never mind that we'll need to re-invent this solution multiple times. And people like DosMind will be there to fight tooth and nail to prevent more general solutions from entering the field saying things like: "show me an application that needs a generic solution and can't get by with batchfiles and flat files!" and "it is simpler to transport flat files around; we really need to keep exchange systems simple!"]

If you have a general solution to a common problem, then show about 5 to 10 realistic examples that demonstrate it is common, and then present the solution. You are only claiming it, not showing it. Claims are the easy part. As far as showing up DOS-only advocates, I'd have them cross-reference (join) a million records, and their solution would either be too slow or take up lots more code than SQL. I wouldn't need indirect round-about mumbo-jumbo justification, but rather speed or code volume would be there for them to actually witness with their own eyes. (If they question the need to join a million, I'd give them actual scenarios from my time as a marketing research query writer for a cable company with a million+ customers.) If by chance they find a way to get DOS to do such easily and fast, I'll pat them on the back and say, "Well done. If you personally like it, go with it." --top

[And what I'd do is ask them how good their transaction support happens to be. In any case, why don't you handle just the problem in front of you before asking that I cook up 5 to 10 more.]

You haven't presented a realistic non-sys-soft domain context. It's a text-book puzzle at this point.

[I do not need to do so. We have goals that communications and storage media utilities like FileSystems, OperatingSystems, and DatabaseManagementSystems be domain generic while avoiding LanguageIdiomClutter. In this context, we also wish to provide as much CrossToolTypeAndObjectSharing as possible for both manipulation and IO. Given these three goals, a general-purpose feature like a standardized TypeSystem that can be used to obtain the desirable effects in different domains is inherently better than producing a ton of domain-specific solutions that cannot interact (or be shared) because they are modular and unaware of one another. Because of this, I do not need to prove the system better in every domain; I only need to have reason to believe that it is no worse than the existing solutions in most domains while providing useful features in at least one domain. Proving no worse is trivial by simply ensuring the types in use by existing solutions (e.g. Blobs, strings, dates, integers, etc.) are available. Proving the useful features in at least one domain is the reason that the tree-example is provided.]

[In any case, if you're like most people you should be able to learn a lot by actually working through a "text-book puzzle". I'd appreciate it if you spent at least half as much effort mentally applying yourself to the example as you do in seeking excuses to avoid it.]

I'm a practical guy and I don't think this is a fault. Often times "lab-toy" examples exaggerate the need or usefulness of certain techniques by making a series of unrealistic assumptions. They can be fun to play with, but I tend to focus on the practicality of things more as I get older. There is plenty of work to do to discover better practical tools such that I don't need to seek out artificial problems to keep my curiosity satisfied. I've seen volumes of dusty IT research journals at my local university, and was appalled at the money and time being wasted on silly lab toys. (Perhaps 1 in 5,000 will result in the next big breakthru, but I'm not here to play lottery.) Thus, I'd like a realistic industry scenario before I consider this worthy of a practitioner's time. The academic puzzle-lovers can thus take over at this spot. -t

As someone who works in academia, I'll be the first to admit that many published papers are rubbish, and that a significant proportion of "research" (I use the term loosely) is little more than a way to retain employment and/or gain promotion and/or avoid teaching and/or obtain funding for office & lab toys. Yet, without the 5000 efforts, there wouldn't even be one big breakthrough.

By the way, realistic industry scenarios were given above -- e.g., "feature associations for a fuzzy memory engine, problem-solution pairs for memoizations, pattern-transforms for data-driven optimizers, component-features and functions in an RDBMS-based GUI engine [...]" Unfortunately, I think the problem is that these aren't realistic scenarios for the industry you work in (or you don't recognise that they are), so the impact is entirely lost on you. Fortunately, it isn't lost on those of us for whom such scenarios are relevant and cogent.

That being said, let's digress from the topic of "sharing standards" and continue at RelationalTreesAndGraphsDiscussion.


I'll provide the requested five (actually, six) examples, as commonly-used business-oriented types. Imagine two equivalent DBMS systems. System A provides user-defined type definition support. System B does not. Assume that for each of the commonly-used, business-oriented types that I will list below, I've already used System A to define the types for you. Assume that values of each type can be selected via a typical operator-invocation or function-invocation syntax. E.g., to instantiate a value of type Money as $12.34USD, I would use Money(12.34, "USD"). Assume equality and ordinality test operators are provided, so Money(12.12, "USD") = Money(12.12, "USD") returns true, Money(14.13, "USD") = Money(12.12, "USD") returns false, Money(15.01, "USD") > Money(15.00, "USD") returns true, and so on. Issues of currency conversion and the like may be ignored for the sake of this illustration. Here are the types: Money, Date, Time, GeographicLocation, ComplexNumber, and Distance.

Now let's assume the existence of the following RelVar (think "table", if you like) in System A:

 VAR myvar REAL RELATION {attr1 Money, attr2 Date, attr3 Time, attr4 GeographicLocation, attr5 ComplexNumber};
Let's assume we wish to perform the following queries, in TutorialDee syntax:

 INSERT myvar RELATION {TUPLE {attr1 Money(15.23, "CDN"), attr2 Date("12 Jan 2003"), attr3 Time("23:13"), attr4 GeographicLocation(55, 130), attr5 ComplexNumber(3, 2)}};

 myvar WHERE attr1 = Money(12.34, "USD") AND attr2 >= Date("Last Tuesday") AND attr3 = Time("12:22PM") AND Miles(Distance(attr4, GeographicLocation(57, 63))) < 4 AND ComplexNumber(5, 6) = attr5 * ComplexNumber(3, 4)
Assume an operator Distance(x, y) that returns a value of type Distance that represents the distance between two GeographicLocationS x and y, and an operator Miles(d) that converts a Distance d to an integer number of miles. Assume the '*' operator has been appropriately overridden for ComplexNumber.

Now, create equivalent or simpler queries using System B that are as expressive and intuitive as the above. You may presume any representation you like. Pay particular attention to attr2 and attr5...

Once you've done that, we'll proceed with the discussion.
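For a taste of what System B forces, a small sketch in Python with sqlite3 (the column names are invented, and the predicate's constants are adjusted from the query above so the arithmetic stays exact): without a ComplexNumber type, attr5 is smeared across two columns and the complex multiplication must be hand-expanded in every query that needs it.

 import sqlite3

 db = sqlite3.connect(":memory:")
 # System B: no ComplexNumber type, so attr5 becomes two columns.
 db.execute("CREATE TABLE myvar (attr1 REAL, attr5_re INTEGER, attr5_im INTEGER)")
 db.execute("INSERT INTO myvar VALUES (12.34, 3, 1)")

 # The predicate  ComplexNumber(5, 15) = attr5 * ComplexNumber(3, 4)
 # must be expanded by hand via (a+bi)(3+4i) = (3a-4b) + (4a+3b)i,
 # and the expansion is repeated in every query that needs it:
 rows = db.execute("""
     SELECT * FROM myvar
     WHERE 3*attr5_re - 4*attr5_im = 5
       AND 4*attr5_re + 3*attr5_im = 15
 """).fetchall()
 print(rows)   # [(12.34, 3, 1)]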

Are we talking about a data exchange format/standard or a query language? If your needs require a query language instead of mere data exchange, then I am not against user-defined types in a query or DB system (see DoesRelationalRequireTypes). Data exchange and query service are two different issues as far as I can see. You seem to be mixing up the two. If the implication is that query languages can replace data exchange formats; well, that's a different issue that we can address separately. --top

The two are not as separable as you appear to claim. How does the query get to the DBMS in the first place? How do the results get back to the client?

FlirtDataTextFormat with Compound_Element_Types. Done. Now we can go home.

Huh? You wrote, "if you have a general solution to a common problem, then show about 5 to 10 realistic examples..." I have provided five specific examples illustrative of the problem of type sharing (particularly in the context of DBMSes) that has formed the bulk of discussion on this page, and that will illustrate the distinction between user-defined type support in general -- i.e., regardless of context, whether communicated or in a DBMS or language -- and its lack. Do you have a problem with that? Furthermore, take a closer look at my examples -- are mere Compound_Element_Types sufficient?

Where do you see a potential problem spot? We can focus on that first.

Please, please work through the example, otherwise this will most likely turn into another lengthy exercise in futile rhetoric. I think my point will be much more clearly understood by discovering it, rather than having me explain it.

Fine, but I'll take my sweet time.

Good! I hope you'll find it an interesting (and maybe even enlightening) process. I know it was for me -- skeptical as I was at the time, coming from a mid-80's BigIron business system background -- when the necessities of certain projects drove me in the direction of typeful programming.


page anchor: can_of_worms

RE: Ideally, such tables can store arbitrary data types from the language, or at least a significant fraction of them, in order to avoid all the AccidentalDifficulty associated with your choice of either parsing/serialization or value composition/decomposition/collection when interacting with the tables.

If a DB typing system must be able to match the language, then it would have to have a super-set of type system abilities and thus risk turning into a monstrosity.

There are always development risks. It is true that you can err on the side of overgeneralization of the TypeSystem and introduce extra complexities into the language library. But if one errs in the other direction, one risks turning the applications themselves into monstrosities as they perform complex workarounds, and one risks later requiring a major revision to the TypeSystem to support the type. Fortunately, one can design the original type and value system with the possibility of upgrade firmly in mind, and thus design to reduce the cost of upgrade. Also fortunately, the closer one can get to representing the 'ideal' type, the less translation effort is required. So, like I said, the ideal is to support every type in the language, but supporting a significant fraction of them is still better than supporting only a few.

For example, although many languages seem to use trees or DAG's for type references, it's possible that a language may allow cyclical definitions. So what happens when you go to import a cyclical language type into a DB type system that only accepts DAG's?

I'm assuming you're really asking what happens when you suddenly need to support cyclic values when you previously supported only DAG values. There is a difference. A tree is a cyclic type (Tree X = leaf:(X) | node:(Tree X, Tree X)). But a tree doesn't necessarily support a cyclic value. An example of a cyclic value would be a set that contains itself as an element.

In this particular case, you'd need to work around the limitation until the TypeSystem and transport system could be upgraded to your requirements. As noted above, the closer you can get the better. For a workaround, what you'd essentially do is use a few tags to represent the semantics for interpreting the value (e.g. 'fixpoint:(name,value)' to say that name=value in 'value', or 'recursive_bind((name1=value1,name2=value2),value)' for a more letrec-style approach). Then, for every application that uses this type you'd need to integrate an extra library layer that can translate between cyclic-values at your language layer and these 'semantic' values for the transport and storage layer. This translation is AccidentalDifficulty, as is the need to distribute your library and integrate it as an intermediate in your applications.
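A rough sketch of such a workaround layer in Python (the 'fixpoint'/'ref' tags follow the naming above but are otherwise invented): the cyclic value is flattened into tagged, acyclic data for transport, and the receiving library ties the knot again.

 # Transport form: acyclic, tagged. ('fixpoint', name, body) binds a name;
 # ('ref', name) refers back to it.
 wire = ('fixpoint', 'x', ['a', 'b', ('ref', 'x')])

 def decode(value, env=None):
     """Rebuild a (possibly cyclic) value from its tagged transport form."""
     env = env if env is not None else {}
     if isinstance(value, tuple) and value[0] == 'fixpoint':
         _, name, body = value
         cell = []                  # create the value first...
         env[name] = cell
         for item in body:          # ...then fill it, so refs can see it
             cell.append(decode(item, env))
         return cell
     if isinstance(value, tuple) and value[0] == 'ref':
         return env[value[1]]
     return value

 v = decode(wire)
 print(v[2] is v)   # True: the decoded list really contains itself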

These sorts of workarounds are bad for CrossToolTypeAndObjectSharing for a number of reasons:

To avoid these problems a little bit of BigDesignUpFront is appropriate here, just as with language design of other sorts. One should at least support the values one can see being passed around today. But one can still err on the side of not being general enough, so long as one designs for upgrades in the future. When you notice people reinventing things, it is time to seriously consider refactoring it into the official standard. Perhaps ThreeStrikesAndYouRefactor should be applied more globally.

One can design the type system and transport layer for extension. As a simple example for an extensible transport and representation layer, you could reserve all 'semantic value' tags starting with the letter '%' for purposes of this upgrade path. This would allow you to add such things as '%set' and '%lambda' and '%recursive_bind' and so on as the need arises without worries about colliding with someone's homebrew semantic value types. When it comes time to add a new semantic type, one simply finds a way to marshall and unmarshall to this type for each language library. Since it was invented before (as above) you've probably got a darn good idea how to do it for the popular languages. The TypeSystem itself doesn't need values annotated with these special tags (plain old tags work for type-descriptors), but it will need to have any new type-descriptors integrated with the validators.

It is worth noting that YAML already has such an upgrade path. YAML is a fine example of engineering that could be applied well to this purpose.


page_anchor: 90%

But why invent 90% of an app language when a few extra features will make it 100%, and then dump the prior app language (for new projects at least)? It doesn't make sense.

As you aren't a language designer, I'll accept that "it doesn't make sense" to you. But that extra '10%' (a statistic that is more hyperbole than fact) can make enormous differences on how 'sharable' something is between systems.

Sharability in a computation system might be loosely measured in terms of such things as:

Application languages often contain values, types, and objects that involve a great deal of interaction with their environment - things like closures that access mutable state, lambdas with free variables, and pointers. Other values have potential to mutate as you use them, such as stateful objects; these are especially difficult to store or share, as doing so essentially requires either continuous cache maintenance or time-sharing... either of which is difficult to achieve if one wishes to persist the values. Other values imply special dependencies for their use - e.g. scripts require embedded interpreters, filenames require one be on the right filesystem, BLOBs require codecs, anything that requires explicit post-processing (e.g. parsing or translation). These dependencies and interactions can make these values and types far less easy to usefully share.

A LanguageDesigner aiming for CrossToolTypeAndObjectSharing, even if it is for just between tools written in one language, will make a trade: giving up features dependent on the source context or environment and obtaining features that reduce dependencies on the recipient's environment. Sharing lambdas is much better than sharing scripts. Sharing structured values is much better than sharing BLOBs. Etc. A LanguageDesigner aiming for an application language, however, can make some different tradeoffs in order to buy performance or take advantage of known features of the environment. This really isn't a problem for sharing so long as one can automate the translation.

Obtaining sharability may require rejecting that 'extra 10%'. So be it. If one can take a powerful sharing language, intelligently add 10% to it without touching the value or type-descriptor semantics, layer in some convenient programmer syntax, add some standard libraries, and end up with a powerful and high-performance application language, I'd say that's a good thing - a proof, I would argue, of the long-term viability of the sharing language. But I doubt even in that case you'd be able to readily share all objects in this new language, much less the actively running applications. "Why invent 90% of an app language when a few extra features will make it 100%"? Because JustIsaDangerousWord, even when you're talking about 'just' adding 10% more.

I'm not sure how your list quite relates to this topic. Let's approach it from this perspective: given a relatively full-featured app language, which features would you want *removed* to make it satisfactory for sharing?

That is a fair question. Here are some of the top things I would target for removal:

I'd also add an 'any' type, for where you don't care. A lot of languages don't have one.


Here's a draft of increasing "levels" of complexity of information transferred. Note that some are not necessarily pre-requisites of those lower on the list, for it's only an approximate hierarchy.

I'd suggest ending at either compound types or operator definitions. Beyond that requires expression and code execution/evaluation.

--top

[You honestly believe that 'reg-ex' doesn't require some fairly arbitrary evaluation? Based on this list, my suspicion is that you think 'simple' that with which you're familiar and 'complex' that with which you are not. There is no clear technical reasoning behind your hierarchy.]

You can make your own hierarchy or feature list if you want. Implementing type operators is a huge ramp-up in complexity to most rational people.

Really? I find that surprising, to the point that I suspect we're thinking of different things when you mention "type operators". What do you mean?


Here is a schema that can represent types up to "compound types" in the above list. (RunTimeEngineSchema may suggest ways to represent operators and parameters). Text representation (serialization) is not considered here.

 types
 -------
 typeName
 sequence    // integer
 parentType  // either another "typeName" or base type
 notes
 // primary key: typeName + sequence

 typeAttributes
 --------------
 typeRef         // foreign key to "types" table
 sequenceRef     // foreign key to "types" table
 attribName      // examples: maxLength, lowRange
 attribBaseType  // text, number, integer, dateTime
 attribValue

The "typeAttributes" table allows somewhat open-ended attributes. Some may argue that creating tables as needed is the "proper" way to do it, but many shop arrangements don't make that task very easy. Thus, an AttributeTable is assumed instead.

Base types are: text, number, integer, dateTime. The parentType must be either another "typeName" or base type. Circular references are not allowed. If attributes appear multiple times in a tree/DAG path, then the lowest level one takes precedence. Here's an example of a compound type:

 typeName...sequence...parentType
 --------------------------------
 coordinate.....1......number
 coordinate.....2......number

(Dots to prevent TabMunging)

If you are manually assigning sequences, then perhaps increments of 100 makes it easier to insert new ones later. (Remember the BASIC line-number days?)
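For concreteness, here is the schema above rendered as runnable SQL (SQLite via Python), populated with the coordinate example; the 'shortName' type and its maxLength attribute are invented to show how typeAttributes would be used.

 import sqlite3

 db = sqlite3.connect(":memory:")
 db.executescript("""
   CREATE TABLE types (
     typeName   TEXT,
     sequence   INTEGER,
     parentType TEXT,     -- another typeName or a base type
     notes      TEXT,
     PRIMARY KEY (typeName, sequence));
   CREATE TABLE typeAttributes (
     typeRef        TEXT,     -- foreign key to "types"
     sequenceRef    INTEGER,  -- foreign key to "types"
     attribName     TEXT,     -- e.g. maxLength, lowRange
     attribBaseType TEXT,     -- text, number, integer, dateTime
     attribValue    TEXT);
 """)
 # The compound "coordinate" type from the example above:
 db.executemany("INSERT INTO types VALUES (?,?,?,?)", [
     ('coordinate', 100, 'number', None),
     ('coordinate', 200, 'number', None),
 ])
 # An attribute-constrained type: text limited to 16 characters.
 db.execute("INSERT INTO types VALUES ('shortName', 100, 'text', NULL)")
 db.execute("""INSERT INTO typeAttributes
               VALUES ('shortName', 100, 'maxLength', 'integer', '16')""")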

I'm curious how and where the semantics of the type attributes, e.g., maxLength, lowRange, etc., will be defined. Are these intended to be canonical, agreed by the endpoints of the system, or something else? Also, is the above model (and that on RunTimeEngineSchema) intended purely to be illustrative, or to be implemented?

A standard set could be defined, similar to reserved words, but would not prevent custom ones. As far as illustrated versus implemented, I would consider it illustrative at this point. But if you see something that would prevent implementation, please point it out.

For the custom type attributes, how and where would you define their semantics?

What is an example? I stopped short of defining type operators. A "note" column is available for a longer description. I suppose we can add a note column for attributes also.

Let's imagine you've defined an attribute called oddNumbersOnly, intended to limit values to odd integers. If it's strictly an attribute name (or described via human language in a 'note' column), it might be meaningful to a human but not to a machine. Therefore, where would the semantics of this constraint be defined, thus permitting values of this type to automatically be treated differently from ordinary integers? Or, would this not be automatic, the presumption being that communication endpoints must have pre-agreed, already-in-place machinery to recognise and appropriately handle "oddNumbersOnly"?
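As one reading of the "pre-agreed machinery" option, a minimal sketch in Python (the registry and its attribute names are invented): each endpoint ships a map from agreed attribute names to checks, and an attribute the endpoint does not recognise has no machine semantics at all.

 # Pre-agreed machinery: each endpoint maps known attribute names to checks.
 KNOWN_ATTRIBUTES = {
     'maxLength':      lambda v, arg: len(v) <= int(arg),
     'lowRange':       lambda v, arg: v >= float(arg),
     'oddNumbersOnly': lambda v, arg: v % 2 == 1,
 }

 def check_value(value, attributes):
     """Apply every attribute the endpoint knows; flag the ones it doesn't."""
     for name, arg in attributes:
         handler = KNOWN_ATTRIBUTES.get(name)
         if handler is None:
             raise ValueError("no agreed semantics for attribute: " + name)
         if not handler(value, arg):
             return False
     return True

 print(check_value(7, [('oddNumbersOnly', None), ('lowRange', '0')]))  # True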

At a glance, I don't see anything that would prevent implementation of a system that provides a relational reflection of the run-time core of a programming language, with a relational repository for its source code. Indeed, at work I've internally published a similar idea as a possible applied thesis topic for students (no takers, so far) -- with a particular focus on exploring its value in terms of refactoring, maintenance, and run-time manipulation/observation (such as for debugging purposes) of the executing environment. However, I intuitively suspect ExtendedSetTheory may prove more flexible than the RelationalModel while providing the same benefits (if any), but at this point, it's pure (though rather enjoyable) speculation.


page_anchor: top's book

Generally strings are treated as a unit. If and when we want to do "listy" things with them, we can convert them into a more formal data structure. I agree the distinction may be somewhat arbitrary or usage-specific, but trees cross the line in my book. Lists can go either way, but trees are into the structure camp.

also: Relational is against nesting in my book.

[It seems your "book" makes several assumptions with which I disagree:

[Unless you're practicing a religion, you should seriously test your assumptions. We've tested them. We have expressed disagreement and reasoning with many of your stated and unstated assumptions, often with examples, but it seems you just aren't open-minded enough to seriously entertain the possibility your assumptions could be in error.]

[It isn't as though promoting support for rich structure means I am "against relational". There are both technical and semantic points that can rigorously be applied in determining where relations are warranted (regarding such things as mutability, measurements, predicate disjunction, normal forms, nature of data (inferred model vs. sensory), etc.). But your naive "book" doesn't consider these points, and instead favors a one-size-fits-all approach with you declaring "trees are into the structure camp" and that they shouldn't be nested within a cell. Worse, your approach isn't even justified on any technical basis, which leaves it, essentially, unjustified.]

If you want a stated rule, then: "Lists can either be embedded in the atom or broken out as a structure (table) based on usage patterns, but anything more complicated should be a formal structure (table)."

What is your rational basis for defining this rule? Do you really prefer the complexity of schema definitions and obtuse, complicated queries to the inherent simplicity of using complex values in exactly the same manner as a simple, primitive value?

The definition for a structure has to be *somewhere*. I'm just using conservation of conventions/rules to avoid reinventing one from scratch for each "kind" of data structure.

[I think you've got it backwards. Embedding structure back into the schema is what requires users reinvent types (trees, sets, lists, etc.) from scratch for each "kind" of data structure. How many times do you think 'trees in SQL' have been reinvented?]

As far as queries, we are not defining a query language, at least not something I will participate in.

[I agree: we are not defining a query language. However, ignoring query languages as a representative tool with which types and objects are utilized and shared would be an act of rather monumental stupidity. What good is CrossToolTypeAndObjectSharing if it can't work with one of the more common tools for sharing data? So we are using queries as a framework for proof-of-concept for CrossToolTypeAndObjectSharing. If a solution fails for integration with RDBMS and queries, it fails: end of investigation... time to look for a better solution.]


[As to your latter problem: I might suggest Firefox. My spellchecker works just fine. I also suggest forcing yourself to learn to spell correctly every word that you mistype. E.g. domonent -> dominant; just type 'dominant' as fast as possible ten or so times each time you mess it up (including for each mess-up while typing it ten times). Burn correct spelling into muscle memory... it, on average, shouldn't take more than a minute of your lifetime per word. Funny. I'd have thought a person who rejects support from the compiler for catching errors would have a bit more self-discipline and less reliance on a spellchecker.]

I am using Firefox, and it reaches a limit. And I have more interesting skills to work on than spelling. Long topics should be divided for other reasons anyhow. The spellchecker croaking is just a reminder of the size.

[You're probably not working on any skills while arguing with me... or doing anything else productive or useful (like sleeping), even if you should be, so don't try to rationalize your laziness with that excuse.]

And what is your excuse against producing a realistic biz domain example? I suspect it's not that you are lazy, but that you are afraid of open scrutiny.

[Doing so is not logically required of me. That's a very good excuse. If you feel the existence of types inhibit you from doing the biz-domain work you're doing right now, then please provide an example that proves the rule. Otherwise you're just whining because your personal HobbyHorse isn't in the limelight.]

Your "logic" tends to use faulty givens. If you don't want to sell your pet techniques to possibly the largest niche on the planet and cater just to narrow niches, be my guest.

[Of course I'd sell my techniques to the largest 'niche' on the planet. Businesses would (presumably) benefit from the sharing described by the title of this page (along with the reflection and the greater simplicity of configuration management that are implied by it). Businesses gain the potential to access data that was, before, tied up inside system software or unavailable for queries or data-mining due to need for complex joins between non-integrated or specialized DBMS products. There's a lot of icing on the domain-generic cake. But it simply isn't logically relevant to my arguments. Whether such things are proven or not isn't critical to claims of 'better'-ness, which are based on RDBMS being domain generic software and the ability to practically service more domains. While it may be fun to speculate, I consider doing so a distraction from the relevant arguments.]


You say my logic uses "faulty givens". I'll write my givens here. With which of them do you disagree, if any? Please give reasoning.

Your main objection to types seems to be that you want to keep the RDBMS, query language, and FLIRT as trim as possible. I.e. you're focused on the implementation, not on the users of the system. If there is a "faulty given" here, I think it is your belief that the job of the implementor should be to fob off as much 'stink' onto the users as is necessary to make their own jobs easier unless, of course, the domain is CBA, where you make a big, entirely self-centered exception (I didn't say "just strings", I want dates and other CBA types, too!).

I think you are doing the opposite: trying to sneak your pet paradigm or Grand Unified Type Machine into different shops under the guise of a data exchange system.

Indeed, instead of selfishly starting a HolyWar over every RDBMS or data exchange feature that might not directly pertain to my needs, I'm doing the opposite. I believe there is great value in a framework that simultaneously meets domain needs for types and allows these shops to readily automate sharing of data. The ability to share is a straight up feature for data storage and exchange. I'm not being particularly 'sneaky' about it.

I am not sure what you mean by CBA exception. We need *some* kind of base types. Most tools already support strings, numbers, and dates; and that's why I used them as base types. If you are envisioning another set, let's see it. Put your cards on the table if you don't like mine. --top

You could get by with just strings. Or even just finite-width bit arrays. Or even just bits if you fold bit sequences into the schema with (col_x_bit_0, col_x_bit_1, ...). If you're going for type-light, why do you feel you should make CBA types 'special'? Because they were there first?

I'd suggest the following as a starting point: unit, bit, codepoint (aka unicode 'character'), point (point = opaque surrogate identifier), semantically tagged value, tuple, record, unions, collection, set, recursive inductive types, lambdas, recursive coinductive types. And then I'd ensure the language can be extended just in case I missed a few.

It looks like we'll forever disagree about the scope of this "project". You want the whole kitchen sink in it such that it is an application programming language or mega-query-language in itself with a few "sandbox" restrictions. We seem to be going in circles. Hopefully we raised some good questions.

You don't comprehend the actual "scope of this project" because you've been too busy using circular logic to defend your delusion that issues of usage are "far beyond the goal of 'sharing', per title". Here is the circular logic: You believe that use and exchange are orthogonal. You are wrong. We present examples that demonstrate you are wrong. You refuse to seriously examine these examples. Why? Because: You believe that use and exchange are orthogonal. Circle complete. We've run that circle at least a few times already. If you'd like to break out of the circle, you must be open-minded enough to allow a contest of that belief of yours and you must be willing to go to the effort to present evidence that we are incorrect. But either approach requires that you seriously confront the examples we have provided. After all, it is those sorts of examples that convinced us that you are incorrect, and our belief on that issue will not change by you merely waving your hands and insisting otherwise.

As far as your belief about my own goals: I have not suggested a full query language, but I am aware enough to recognize that any solution to this "project" must integrate effectively with filesystems and query languages. After all, CrossToolTypeAndObjectSharing really ain't worth much if databases and filesystems are outside the set of tools with which the types and objects may be shared. And so it is worth considering the "types and objects" in the context of filesystems and databases in addition to operating systems and other tools, many of which have, indeed, been explored above.

If you can pull it off without inventing a programming or query language to do it, I welcome your attempt. If not, and the only known way to achieve such is to invent a programming or query language, then that's that. I was hoping/trying for a declarative solution that didn't require expressions and algorithms; but if that's not do-able or won't satisfy enough domains, then that's that. --top

I do agree that keeping types as declarative as possible is a fine idea.

So far it seems not possible to get types as rich as you want without using non-declarative techniques. Perhaps "declarative" needs to be defined better for our context. For example, a functional language can be TuringComplete but still considered "declarative" by some.

Declarative doesn't mean "not TuringComplete", but I would consider it reasonable to avoid TuringComplete definitions - and we can have much richer types than we currently have without resorting to TuringComplete type definitions. ML and Haskell don't have TuringComplete types (but they are TuringComplete languages).

Because you seem confused, I'll note that we can also have a much richer value system without having TuringComplete values, and you could look at Charity language (http://en.wikipedia.org/wiki/Charity_(programming_language)) for an example. However, I am not so certain that the engineering tradeoff of doing so is worthwhile for CrossToolTypeAndObjectSharing.
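One way to picture a non-TuringComplete type definition (a sketch in Python; the encoding is invented): the definition is inert data that a checker interprets, so nothing in the definition itself can execute or fail to terminate.

 # 'Tree X = leaf:(X) | node:(Tree X, Tree X)' written as inert data.
 TREE_OF_INT = ('union', {'leaf': 'int', 'node': ('tree', 'tree')})

 def well_typed(value):
     """Check a tagged value against the declarative Tree-of-int type."""
     tag, payload = value
     case = TREE_OF_INT[1].get(tag)
     if case == 'int':
         return isinstance(payload, int)
     if case == ('tree', 'tree'):
         return len(payload) == 2 and all(well_typed(t) for t in payload)
     return False

 tree = ('node', (('leaf', 1), ('node', (('leaf', 2), ('leaf', 4)))))
 print(well_typed(tree))   # True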

In any case, independent of the TuringComplete issue, I think that procedural and mechanical (including function-evaluation) definitions of types and constraints (e.g. trigger-based constraints) are a bad idea. I also object to building values piece-by-piece in this mechanical manner, though I don't object to values containing a function.

It is worth noting that my objection to procedural description of values is one of the (great many) reasons I find your 'break structures into schema' approach to be abhorrent: the need for surrogate identifiers to build up a tree (or set, or list, or any 'nested' construct) node-by-node requires a procedural or mechanical approach to value descriptions. As usual, I think this is a case where you say one thing (you want declarative values) but, because you're operating on wrong assumptions, you favor approaches that accomplish the exact opposite of what you believe they'll accomplish.
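To make the contrast concrete, consider this rough sketch (Python, with encodings invented purely for illustration). The same tree is described once as a declarative nested value, then built mechanically, node-by-node, with surrogate identifiers, as destructuring into schema demands:

  # Declarative: the whole tree value in one expression.
  tree = ('+', ('x',), ('*', ('y',), ('z',)))

  # Mechanical: surrogate node ids, built up piece by piece.
  nodes = {}  # node_id -> (label, child_ids)
  def add_node(node_id, label, children=()):
      nodes[node_id] = (label, tuple(children))
  add_node(3, 'y'); add_node(4, 'z')
  add_node(2, '*', (3, 4))
  add_node(1, 'x')
  add_node(0, '+', (1, 2))  # the root only exists after five steps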

I suspect your example is not realistic enough to illustrate that trees are not really trees.

I'm not attempting to illustrate that "trees are not really trees". I suspect that you are still assuming that values should be broken down into structures if they're used for more than equality. I stated above which assumptions of yours I reject, and that is among them. The example, in fact, is meant to challenge your assumptions. Therefore you are missing the point (and, I'll repeat, resorting to circular logic) when you reject the example because of one of your assumptions.

But anyhow, it seems a semi-custom system may be more appropriate to make every domain and personal preference happy. Rather than provide the One Grand Language, perhaps what is needed is kits that allow one to roll-their-own database without having to start from square one.

Ah, everyone will roll-their-own database without using a standard language for rolling it... what a fantastic idea for CrossToolTypeAndObjectSharing. Why, with that idea even databases can't share types or objects. Even better, it will be nearly impossible to write generic tools that can view, manage, and update databases. Oh! Oh! And you DBA guys will get to squash bugs in both the database and the schema! (And you'll get to learn a new, non-standard language each time! How exciting!) Rather than a small group of experts getting the database implemented mostly right with a few learnable quirks, we can let thousands of small groups each get it wrong in their own way! You'll never have a shortage of new and interesting stories for jobs you wish you could automate.

Are you trying to sabotage CrossToolTypeAndObjectSharing? and reliability? and maintainability? and overall simplicity?

Why do you object to a standard language for type and value descriptions? You obviously don't object to standards for communication in general (like SQL, ODBC, Sockets, filesystems, XML, HTTP, TCP/IP, etc.). Do you have any valid technical objections (i.e. excluding any based on unvalidated assumptions), or do you just fear the idea of learning something new in order to repair the messes people make of new features? (It isn't as though they won't make messes of the existing features, but perhaps that is the devil-you-know.)

I'm skeptical that a generic cross-domain query language can be made that doesn't become giant ball of crud that only a committee could love.

I asked about "a standard language for type and value descriptions" which is not the same thing as "a generic cross-domain query language". Producing a system of types and values is a smaller task than producing a query language. Every query language must have a system of types and values (even if it is 'EverythingIsa bit') in addition to a query semantics, and possibly a data-manipulation semantics (update, insert, delete, etc.). For "generic cross-domain query language", I find your skepticism reasonable - it is difficult to figure out who needs what queries (clustering?) and which optimizations. But for "standard language for type and value descriptions" your skepticism is unwarranted: we already know a great deal about what works for types and values across domains. The set of common GeneralPurposeProgrammingLanguage types (records, sets, ordered collections, graphs, etc.) is common for a reason. We even have examples of working structured value systems in the form of JSON and YAML, and for simple type-descriptors with XML Schema.
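As a hedged illustration of the point (standard-library Python, nothing exotic): the common value types already travel intact through a structured value system like JSON, though sets fall outside its native types and need an agreed encoding - exactly the kind of detail a standard for type and value descriptions would pin down.

  import json

  value = {"name": "route-7",                 # record (JSON object)
           "waypoints": [[42.1, -71.3],       # ordered collection
                         [42.2, -71.4]],
           "tags": sorted({"draft", "gps"})}  # a set, encoded as a sorted list
  wire = json.dumps(value)                    # serialize for transport
  assert json.loads(wire) == value            # the structured value survives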

It's tough enough to standardize *within* a domain. SQL, ODBC, FTP, etc. are the lucky few that "clicked" among many dozens if not hundreds of dead or lingering attempts, and they assume rather basic base "types". My suggestion is to walk before we run by focusing on a non-expression-based declarative approach to type sharing. You gotta learn to get to orbit before you land on the moon; you don't try for the moon on the first attempt. When that is perfected and road-tested, then add operator implementations etc. You are risking another never-ending XanaduProject by trying for all-or-nothing. HTML and the Web succeeded where Xanadu failed because it only bit off what it could chew and swallow. --top

I do agree that an intelligently designed evolutionary approach (i.e. one designed to accomplish backwards compatibility while resisting TIMTOWTDI and LanguageIdiomClutter) is a good idea. One doesn't wish to bite off more than one can chew, but one also doesn't want to deprecate features and face internal competition between product versions, or to clutter up the system with features and multi-version maintenance, as tends to happen if you don't deprecate features while adding new ones. When it comes to shared interface standards such as APIs, languages, and CrossToolTypeAndObjectSharing, taking 'big bites', or BigDesignUpFront, is the only practical way to end up with a simple and complete solution. And the less you can get to up front, the more you need to 'big design' support for future versions.

Anyhow, I don't believe the "walk before we run" analogy is particularly apt... not unless, on this evolutionary scale, you rate modern SQL as "sliming along under a heavy burden". One of the major reasons that many "dead or lingering attempts [that] assume rather basic 'types'" exist is that they don't sufficiently advance the state-of-the-art. To succeed, any project must offer enough to offset the perceived pain of change, which is subjectively measured in the perceived (guesstimated) costs of paying for the new technology, learning to use the new technology, and deprecating the existing technology (for an in-depth study, read The Change Function by Pip Coburn). Incremental improvements to non-disposable technologies will usually fail unless they provide backwards compatibility. Contrary to your apparent expectations, those "rather basic 'types'" you mention are an indicator that a system will die or fail... because 'basic types' isn't sufficiently above and beyond the existing systems (XML/YAML/JSON/etc.) to motivate change. In order to succeed, a new technology will need to enter the world-at-large "running" better than the other technologies, and will probably need an evolutionary path leading towards locomotion and unmanned flight.

In your tree example, you had a query that asked for tree equality. Do you expect this operation to be implemented in the sharing system? Or, merely defined as existing?

Answered below.

You asked, "What I want to be able to do is define trees and their operators OnceAndOnlyOnce so that I don't need to build them into each query." If the goal is not related to building a query system, why are you asking me this? --top

I'll try to simplify it: what you seem to be hearing is that I expect you to implement a query system, whereas what I've been actually saying is that I believe (for reasons given) that your destructure-the-values approach sucks for CrossToolTypeAndObjectSharing and almost every other purpose. Further, I'm saying that I am utterly skeptical of your claim that a few new query operators will fix the problems. If you believe a few new operators will help, I want you to show me. And if you can't show me, I want you to EatYourOwnDogFood and see what it tastes like. You keep giving advice on how tree values should be destructured and how doing so provides alleged benefits. Now you try it.

Seriously. Follow your own advice. Destructure a domain-value that happens to be a tree, then actually use it as a domain-value - as an identifier with equality-queries, access queries, joins, inserts, deletes. We're allowing you the latitude to choose your own query operators so long as they are clearly implementable and aren't schema-specific. We just, honestly, don't believe they'll help as much as you seem to believe they will.

And to avoid the impression that I'm claiming the tree-structured domain-values are used for nothing but equality and joins, go ahead and handle a few tree operations (merge, contains-subtree, partial pattern match, views, etc.) that you believe your approach handles more effectively. Demonstrate those benefits you have regularly asserted are there to be found relative to structured domain-values. If your examples correspond to uses of domain values (i.e. excluding mutation because domain-values cannot be mutated) we'll give you our versions of those operations should you challenge us on them.


Values and Operators and Tools and Sharing

In your tree example, you had a query that asked for tree equality. Do you expect this operation to be implemented in the sharing system? Or, merely defined as existing? --top

Value equality would be well defined as part of the standard. If an implementation provides an equality operator to programmers, it would be expected to adhere to the same semantics that everyone else has been told to use as standard for equality.

Whether a given tool implements or provides access to the equality operation (e.g. via a scripting language or query language or library API) would be up to the tool. Some tools, such as socket transport for values, might be unconcerned with equality comparisons. Of course, one would expect most tools to end up using a common shared library that implements these things, and I imagine such a library would be very likely to provide the equality comparison operator even to tools that don't need it (at least prior to dead-code-elimination optimizations).

Similarly, you could expect other primitive operators (e.g. projection on a record or tuple, set unions and intersections, cartesian joins, etc.) to be well defined. It is these sorts of operators that define a value, providing its intrinsic semantics and differentiating it from other values... these operators are part of the EssentialDifficulty of any value system, in the same sense that 'integers' are defined in terms of successor and predecessor and that 'strings' are constructed of ordered sequences of characters. Also, if type-descriptors are part of the standard, one would expect a standard definition for whether a given value is a member of a given type-descriptor (but not all tools would be concerned with types).
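To illustrate (a sketch of my own in Python; not a proposed API): the intrinsic semantics of records and sets amounts to a handful of standard-defined operators that every conforming implementation would be expected to share.

  # Hedged sketch: intrinsic operators over plain structured values.
  def project(record, *fields):   # projection on a record
      return {f: record[f] for f in fields}

  track = {"artist": "Holst", "title": "Mars", "year": 1916}
  assert project(track, "artist") == {"artist": "Holst"}

  # Standard-defined value equality is structural, never identity-based.
  assert {"a": [1, 2]} == {"a": [1, 2]}
  assert ({1, 2} | {2, 3}) == {1, 2, 3}   # set union as the standard defines it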

Tools would be able to provide arbitrary higher operators, such as pattern-matching, definitions for joins, functions, etc. Tools would provide the extrinsic semantics for values - e.g. interpreting a given record as a command or query or identifier or physical address or statement of truth. Not all tools (e.g. transport and storage) are interested in such extrinsic semantics, and only need to concern themselves with intrinsic semantics. If standards are to be placed on these extrinsic semantics, that can be performed with a different set of standards, likely standardized specific to the domain (e.g. physics) or the class of tools (e.g. DBMSs).

Unfortunately for your goal of externalizing as much as possible, having multiple such standards is directly counterproductive to CrossToolTypeAndObjectSharing, because it prevents sharing between tools of different classes and domains, and prevents sharing with cross-domain utilities (like the DBMS or object-editor). Essentially one is 'parsing' or 'interpreting' the value for the given domain. In an extreme case, one could make it so 'EverythingIsa BinaryLargeObject?' with each domain providing its own implementation libraries and semantics, and we'd be even worse off than we are today. Sharing between these tools would require extremely complicated translation efforts. Note that these problems of extrinsic semantics are inherent; they exist whether you are putting everything into BLOBs, into strings (plain text), or destructured into relational schema. Other tools won't usually know how to interpret a given 'node table'. The greater the degree to which semantics are externalized, the more duplication of effort and semantic divergence will occur, and the more expensive sharing becomes.

Thus one must, by nature, strike an engineering balance between simple cross-tool sharing and what you call LanguageIdiomClutter. Engineering tradeoffs are made for anticipated costs and benefits, justified by use of analogy, model, example, and prototype. And before you raise the 'DisciplineEnvy' flag, you'd do well to realize that even professional engineers rarely understand the full ramifications of any given design decision, especially in RDT&E (Research, Development, Test, and Evaluation). In this sense, SoftwareEngineering and language design are no worse off than other RDT&E programs. There are always risks when trying something new. Real engineers recognize this and attempt to control and compensate for risk.

I'm certainly with you on the goal for language minimalism (or 'concept preservation' as you call it), but I do see considerably more need for structured value support than you see. I would not be surprised if this is due to the different domains in which we work. You write business reports, CRUD screens, help companies organize their data, and other CBA apps. I do robotics, operator control units (soft realtime GUI with commands, planning, video media, remote camera manipulation, overview maps, world models integrating sensor payloads from different platforms, etc.), configuration support (languages, reactive programming, scripting languages, domain object models), mission planning (AI and heuristics support, DeltaIsolation, knowledge management), command and control protocols, and distributed systems - and that's just for my job. In my minimal spare time I study languages and I'm implementing one. I'd like to think I know a bit about the real demands for CrossToolTypeAndObjectSharing.

There are some rules of thumb one might follow in order to help achieve this balance between sharability and clutter. One example is ThreeStrikesAndYouRefactor. Essentially, if you see the same basic 'concept' being reinvented in three different domains, it is time to look at finding a way to fold it into the standard. Not that I recommend starting from nothing and refactoring from there: it is easy to paint oneself into a corner when it comes to concerns for backwards compatibility and language minimalism. Languages, APIs, protocols, and other interface standards are notoriously difficult to extend and especially to shrink or refactor unless great care is taken, so some BDUF is warranted even if it is minimally to prepare the extension mechanisms. Comparatively, fixed approaches like "lists are okay, trees are not, and I'm going to completely ignore sets" will not be able to meet real engineering demands.

I know for certain that we systems, math, and language programmers would benefit from easy CrossToolTypeAndObjectSharing of: unit, strings, sequences (lists), sets, unlabeled graphs (with isomorphic equality), structured commands and messages (can be represented as records of these things), and unions or tagged unions.
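One possible encoding, offered only as a hedged sketch (Python, invented names): a tagged union as a (tag, payload) pair, and a structured command message as a record composed of the value types just listed.

  def tagged(tag, payload):       # a tagged union case
      return (tag, payload)

  move_cmd = {"cmd": "move",
              "target": tagged("grid", (12, 7)),       # one union case
              "fallback": tagged("named", "dock-a")}   # another union case

  tag, payload = move_cmd["target"]
  if tag == "grid":               # consumers dispatch on the tag
      x, y = payload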

I'll admit the apparent need for sharing 'lambda' substitution functions and 'unfold' values (which require lambdas) is less than the need for the value structures named above. These evaluative forms are useful for the transport of predicates and guards, transforms, triggers, accessors, infinite concepts like the Fibonacci series or the sequence of twin prime numbers, etc. But they aren't used so often that JSON or YAML bothered supporting them. These are a place where I could just ensure there is a way to add them later without breaking existing tools, then leave it at that. On the other hand, these also aren't complex or difficult to implement. When unburdened by syntax and optimization concerns, implementing lambdas, lambda application, coinductive unfolds, and structural equality can easily be done in fewer than one hundred lines of C code. And transport can look a lot like other tagged records, simply using reserved names.
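To make the 'small to implement' claim concrete, here is a hedged sketch of an 'unfold' in Python rather than C (the shape is the same): a seed plus a step function, used here to name the infinite Fibonacci series.

  from itertools import islice

  def unfold(seed, step):         # coinductive unfold: yields forever
      while True:
          value, seed = step(seed)
          yield value

  # The Fibonacci series as an unfold over an (a, b) seed.
  fib = unfold((0, 1), lambda s: (s[0], (s[1], s[0] + s[1])))
  assert list(islice(fib, 8)) == [0, 1, 1, 2, 3, 5, 8, 13]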

I found my reply repeating things I already said, so I junked it. I suggest we re-focus on a smaller goal here: transferring data between different apps in a "static" fashion, that is, packaged in a file. Let's put aside query-oriented solutions for now. How do we transfer object and type-heavy data?

If you paid attention, you'd note that these questions have already been answered AND that nothing in the above 10 paragraphs involves "query-oriented solutions" making your entire comment here a non-sequitur, and making it quite clear that you didn't bother reading before replying. Anyhow, I went into a thorough answer to your specific question in the section entitled Type Heavy FileSystem. This time, try to avoid ListeningWithYourAnswerRunning.

Stop projecting your poor documentation skills as the sin of ListeningWithYourAnswerRunning on my part. If it smells, tastes, looks, and feels like a query language, then for all intents and purposes it's a query language, even if you call it something else.

I'm not projecting. You really are an awful listener, and you are extremely rude to ask questions, not read the answers, then reply to what you assumed the answer would be. And you just did it again (twice, here and below), so don't pretend this is my fault. Nothing written in the section entitled Values and Operators and Tools and Sharing "smells, tastes, looks, and feels like a query language"; indeed, stuff written in that section applies to query languages, but it also applies to interpreters, filesystems, sockets, fifos, and so on.

"Extremely rude"? You are being a drama queeen again. Extremely rude would be calling you a meandering befuddled run-on idealistic exaggerating detached patronizing blowhard. See the difference?

I do see a difference. Actions speak much louder than words. I consider name-calling to be childish, and a much lesser form of rudeness than such things as intellectual dishonesty, sophistry, hypocrisy, ShiftingTheBurdenOfProof, or asking questions then not listening to the answers. Indeed, the fact that you'd waste forum time and space exclusively to diverging on the off-topic and subjective semantics surrounding "extremely rude" speaks much worse of your behavior than does calling me a 'drama queen'. If you're going to waste a little time in meta-discussion, you could at least spend an equal amount of time on content relevant to CrossToolTypeAndObjectSharing in order to ensure forum progress.

Your sense of "justice" is as convoluted and twisted as your writing style. Good riddance.

I prefer to call it 'developing and maturing', but I can see how it might look 'convoluted and twisted' to someone whose understanding of justice seems to have reached its full development in grade school. Hmm... I guess this presents something of a challenge to you: can you stand by your "good riddance" statement and resist attempting to achieve the last insult? Or will you belie your assertion and reply? Time will tell.


Type Heavy FileSystem [perhaps move to TypeHeavyFileSystem??]

Q: How should we package type-heavy or structured data in a file for both persistence and sharing between processes?

A: We should take the straightforward approach: we should make it so files can package type-heavy or structured data for persistence and sharing between processes. Doing so requires upgrading the FileSystem (one might call the result a Type Heavy FileSystem). We need the FileSystem to support structured values as structured values - both in the sense of persisting structured values and of providing structured-value FIFOs and such.

FileSystems today support structured values as octet streams, which is remarkably inefficient and awful for CrossToolTypeAndObjectSharing. The modern approach to sharing structured data is to serialize YAML or JSON or XML or CSV or whatever into an octet stream, using an interpretive layer of structure (objects, attributes, sequences, etc.) atop an interpretive layer of characters (typically ASCII, UTF-8, or UTF-16) atop the octet stream. Alternatively, we serialize structure directly atop the octet stream (and call it a 'binary' format, like MP3 or H264).

But this modern approach has a significant set of engineering disadvantages - every tool needs its own parser and serializer, there is no indexing into the structure, partial access still requires loading and scanning the whole stream, and communication between local tools pays marshalling and unmarshalling costs - and these prevent it from being how we should do it.

Instead of supporting structured data inside octet streams, even using some sort of 'standard' like YAML or JSON, we would be much better off to have the FileSystem directly support structured data.

By doing so, all of the above problems diminish. The FileSystem can index for efficient access to parts of a structured value. The OperatingSystem and language standard libraries can provide streaming access into this structure such that if you access a really large file only the parts you're looking at are cached into RAM and the whole file needn't be loaded just to access the few pieces you need. (With polymorphism, this can be done while maintaining the same interface as values produced inside the language runtime.) The OperatingSystem can support views and translations via 'open_as_type(filename,type-descriptor-or-name)'. The need for parsers and serializers and gatekeeping and 'views' is centralized OnceAndOnlyOnce into some mix of the OperatingSystem and its libraries. FileSystem FIFOs, Stacks, Pipes, etc. can be made massively more efficient by avoiding the need for marshalling and unmarshalling when communicating between tools on a single machine. New tools and processes become easy, cheap, and safe to write and integrate via pipelines due to the savings in memory and CPU, the savings in programmer effort, the flexibility of views, and the ability to detect workflow errors (maybe even statically). And it becomes easier to create tools that can usefully access, view, and manipulate any file used by any other tool.

And the line between OperatingSystem and Language blurs even further. LanguagesAreOperatingSystems. I consider this a good and natural thing, but I'll note that some people like the hard division between languages and operating systems... probably because that's the devil they know. Or perhaps they don't like it, but simply don't have the imagination or education to envision anything else.

The above benefits aren't guaranteed, of course. One could trivially thumb their nose at the typesystem and start passing data around as type:[(bool,bool,bool,bool,bool,bool,bool,bool)] - a sequence of octets - and be no better or worse off than they were with the existing FileSystem. The idea is to obtain as much optimization and safety, and as many useful features (like automated translations), as possible... balanced against the desire for language minimalism and the need for simple implementation. Striking that balance is an engineering problem.

You may be wondering about transport: how do files get from one implementation of the filesystem to another? But transport of structured data is essentially a solved problem. The only real issue is picking just one or a few of the vast array of solutions (XML, YAML, JSON, EssExpressions, erlang marshalled values, mozart marshalled values, or even a new format just for this). It's a solved problem and uninteresting to people who are properly educated in ComputerScience. I don't dwell upon it, but if I were forced to pick just one I'd favor YAML: it is human-readable, it is extensible, and it already supports structured values and resolved type-tags.

I'd disfavor building my own solution because this is a non-critical issue. If we need to later support more than YAML supports, well, YAML itself is extensible or we could upgrade to '%ANOTHER_OPTION' and dispatch to the correct OperatingSystem plugin.

The FileSystem itself wouldn't store structured data as YAML (unless it wanted to). YAML is just a network serialization format. It could also be used to provide 'plain text' access to values and objects, but doing so would be less efficient than keeping the value in value form (e.g. for streaming). Because YAML is 'just a serialization format' instead of 'the official language', one needn't embrace all of YAML's features or types; one could get by with using and supporting only a subset, so long as one supports every feature one uses.
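For instance (a hedged sketch assuming the third-party PyYAML library), a tool could project a structured value to and from its 'plain text' YAML form at the boundary while keeping the canonical representation structured:

  import yaml  # third-party PyYAML, used only as a serialization medium

  value = {"title": "survey", "revision": 3,
           "sections": ["intro", "methods", "results"]}
  text = yaml.safe_dump(value)          # the 'plain text' transport view
  assert yaml.safe_load(text) == value  # the structured value round-trips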

The choice of YAML is reasonable... but, ultimately, boring. People who focus too much on the syntax of transporting and saving values might make some small, incremental improvements, but they'll not be making any significant improvements to CrossToolTypeAndObjectSharing.

There are major one-time costs regarding a TypeHeavyFileSystem?: implementation, optimization, provision of other features (versioning, distribution, caching, sharing, etc.), and especially its integration with the existing panoply of files and tools. Choosing such a FileSystem would, for an OperatingSystem, be a lot like starting from scratch. Many existing efforts would need significant rewrites to take advantage of the system.

Example APIs:

  Value myFileState = open_file_as( myFileName, aTypeDescriptor ).
  save_file( myFileName, expression producing Value )

  MutableFile myFile = open_file_for_write( myFileName )
  myFile.artist = ... expression producing Value ...
  myFile.element[20] = ... expression producing Value ...

The OperatingSystem would include plugins (selected heuristically) to perform the type conversions automatically (and possibly lazily), and could raise an error if no conversion was available. The 'myFile' object itself would then be accessible in the language... and, again, might be accessed lazily and with indexing (to support very large files) such that 'myFile.artist' can immediately return the artist without loading or searching the file and without concern for the representation of the file. Usefully, the language or libs could also integrate filesystem transactions, versioning, mutable views, language-integrated queries, and such behind the scenes.
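A hedged sketch of how that might feel from the language side (every name here is hypothetical, and a plain dict stands in for the FileSystem's internal index):

  # Hypothetical API: lazy, typed access to a structured file state.
  _filesystem_index = {"album.dat": {"artist": "Holst", "tracks": 7}}

  class LazyFile:
      def __init__(self, filename, type_descriptor):
          self._name = filename          # a real system would open an index,
          self._type = type_descriptor   # not load the whole file
      def __getattr__(self, field):      # fetch just this field on demand
          return _filesystem_index[self._name][field]

  f = LazyFile("album.dat", "AlbumRecord")  # roughly, open_file_as
  assert f.artist == "Holst"                # no parse, no full load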

No, I guess I was not clear. File systems are not static. They are a kind of special-purpose hierarchical database. I meant more like the kind of stuff that XML and CSV would be used for. --top

File states are immutable values. That makes them static. The fundamental answer to "how do we transfer object and type-heavy or structured data between different apps in a 'static' fashion" is to support type-heavy or structured immutable values as file states. Which are static. To support structured values as file states requires modifying the FileSystem. I labeled the result 'Type Heavy FileSystem'. And you'd know all of this if you weren't too busy jumping to conclusions to read what was written.

Or, perhaps, you believe that values and structures can somehow be magically separated from the operations you use to access them? In reality, attempting to divide values from their operators is a logical impossibility, a bit like trying to divide wetness from water. This particular detail was covered above, in the section entitled "Values and Operators and Tools and Sharing", at the part discussing the intrinsic and extrinsic semantics of values.

In modern FileSystems, the intrinsic semantics of every file is octet stream (or binary large object), and so the FileSystem API (the operators) only support access to files as though they were BLOBs. In the proposed Type Heavy FileSystem, the intrinsic semantics of every value can allow a variety of types, such as records, unions, sequences, sets, graphs, matrices, numbers, dates and times, geographic locations, other measurements, and so on (subject to engineering tradeoffs). And so, in such a FileSystem, the API (the operators) would provide a richer set of accessors and manipulators, likely via integration with language-objects.

You can't change the intrinsic nature of files without also changing the FileSystem. That's fundamental.

Basically, with a Type Heavy FileSystem, you obtain the advantages of XML and CSV (and YAML and JSON and other structured data representations)... but you also get a whopping massive bundle of additional advantages for optimization, code complexity, OnceAndOnlyOnce, safety, and so on. These additional advantages are described above (mostly as relative disadvantages of the modern EverythingIsa octet stream approach).

And as a side note, you say that FileSystems are a "hierarchical" database. While common, that is completely optional. There is no problem with flat filesystems without folders or locality of reference... or, at least there's no problem if you have some other means of organization (e.g. a system of tags).


Relating FileSystem to DBMS

The issues in a DBMS are similar... just broader. If this confuses you, think in terms of:

 TABLE FileSystem {
   FILENAME, 
   FILEVALUE, 
   AND MANY COLUMNS, 
   OF SECURITY,
   TIME,
   VERSIONING,
   BRANCHING,
   TRANSACTION,
   MIRRORING AND CACHING,
   AND OTHER FILESYSTEM METADATA,
   MAYBE INCLUDING 'TYPE',
   MAYBE INCLUDING 'NAMESPACE' (or e.g. folder)
 }

Implementing and optimizing this table (and optionally a namespace/folders table) to support block devices and streaming and internal indexing and distribution and transactions and other features is what gets you a filesystem... just as supporting arbitrary schema (plus indexing, streaming/cursors, distribution, transactions, and other features) is what gets you a database management system.

In a regular FileSystem, FILEVALUE is an octet sequence - a BLOB. In a Type Heavy FileSystem, a FILEVALUE is a complex structure with nesting. If you wanted a statically typed filesystem, you could have 'TYPE' be a metadata element or part of the filename. This would not do you much good unless compiled processes provide static info on what types they need for input/output. If you prefer dynamism, you'd simply use the 'any' type. What makes it a 'TypeHeavyFileSystem?' is the formally standardized complex structure found in a 'FILEVALUE', not the use of a manifest 'TYPE'.
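Restating the contrast in miniature (a hedged Python sketch with my own field encodings): the only column whose intrinsic semantics changes is FILEVALUE.

  # Regular FileSystem: the file value is an opaque octet sequence.
  row_classic = {"FILENAME": "song.meta",
                 "FILEVALUE": b"\x00\x17\x42\x00\x00\x1a"}

  # Type Heavy FileSystem: the file value is a structured value.
  row_typed = {"FILENAME": "song.meta",
               "TYPE": "TrackRecord",  # optional manifest type
               "FILEVALUE": {"artist": "Holst",
                             "movements": ["Mars", "Venus", "Jupiter"]}}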


Do you at least agree that introducing direct behavior implementation into such a tool/language greatly increases its complexity? --top

It unquestionably increases the amount of developer effort for any given new implementation of such a language, but this is a trivial observation -- having to implement n + 1 features requires more effort than implementing n features. However, this is a one-time amortisable cost which is increasingly negligible over time.

For a user of such a (presumably well-designed and correctly-implemented) language, the feature is simply there whenever it is needed, at no more than a one-time learning cost.

Therefore, using a language that provides "direct behaviour implementation" will, in fact, demonstrate less complexity than using a language that does not possess it.

It's perhaps worth noting that we can replace "direct behaviour implementation" with any useful language feature <x>, and the above remains true. There is a cost to learning to use feature <x>, of course, but this is again a one-time amortisable cost which is increasingly negligible over time. The cost of not having <x>, on the other hand, will be incurred every time <x> is needed.
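To put illustrative numbers on that (mine, and admittedly made up): if learning feature <x> costs a one-time 10 hours while working around its absence costs 30 minutes per occasion, the break-even point arrives after 20 uses; amortized over the thousands of uses in a career, the learning cost rounds to nothing, while the workaround cost never stops accruing.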

[In the context of whole language design and implementation, by which I mean to include the language standard libraries, introducing direct behavior implementation of certain features (in particular, KeyLanguageFeatures and features that overcome a MissingFeatureSmell) can actually reduce the net complexity of a language. It can do so because the complexity cost of efficiently and correctly implementing the standard library (i.e. paying attention to NonFunctionalRequirements such as optimization and security and robustness and simplicity of interface and the ability to avoid leaking implementation details) may be greater without this feature than is the total cost of integrating the feature with the language then using it to help implement the standard library.]

[So, no, I do not agree that, in general, "introducing direct behavior implementation into such a tool/language greatly increases its complexity" - not even when just considering the implementation of the language itself. This is because, when implementing the standard library, the language implementor is also a language user. And, in general, feature support in the language or its standard library simplifies things for users.]


AugustZeroEight and MarchZeroNine

See Also: PowerOfPlainText, CrossAppLanguageOopIsRough, NaturalEventSyntaxDiscussion


CategoryInfoPackaging, CategoryReuse, CategoryText, CategoryIdealism

