Tree Uber Alles

Ok, this is a bit of a flame, please excuse the emotion.

XML is a manifestation of this evil meme. The idea that if you can just put everything into a tree, you can have markup nirvana. Well, it's just plain wrong. Too many things get lost in translation:

When you take a document, table, etc. and translate its structure into a tree, you are forced to pick one, and only one, dimension on which it will be indexed. You are forced to pick a hierarchy, and not permitted to deviate. I think this is a classic case of PrematureOptimization, writ large.

Consider what happens when you take a database, which has many tuples of values, and try to force it into a tree. You then have to define each and every data value, each and every time. This is a rather obvious case of waste.

Layers are a powerful tool for adding MetaData to a document. The ability to add information without disturbing the source, is a powerful tool. The absence of layering in HTML, for example, leads to a whole set of tools to try (and fail) to make up for it.

Consider a traditional book. You have characters, words, paragraphs, pages and chapters. You could mark each paragraph in XML. However, if you also decide to number the pages to match the print edition, you then are forced to use artificial means to keep together paragraphs that span pages. It gets even worse if you have an outline and section numbers to add.

Thanks for listening, and cleaning it up, if you chose. -- MikeWarot

Aren't a book's index, table of contents, and list of figures what you call layers? They add MetaData information without disturbing the source.

[I think that's his point. XML (and HTML) don't generally add layers without disturbing the source. They mix one and only one set metadata into the data.]

EXACTLY the point I'm trying to make.... you can't add any new information without disturbing the source. Markup (a.k.a. Annotation) isn't really an option, because the tree necessarily subverts all attempts at layering. -- MikeWarot


Stepping away from issues of XML and documents, and looking at it in the abstract, it's important to notice two things: 1. Trees are limited, which sometimes is bad (many things, e.g. graphs, are more flexible/powerful). 2. Trees are manageable, which sometimes is very, very good.

Many algorithms that are hopeless on graphs are efficient on trees. So sometimes the limitations on trees are a good thing. Other times not.

Applying this lightly to XML, you could say that it's obvious from first principles that using trees is going to suffer from inappropriate limitations in some cases -- but they can still be useful. Always the best way to go? Of course not.

But even more importantly, XML is a meta-notation, so that although it, itself, is hierarchical, it can encode absolutely any finite discrete data structure you like. This isn't usually the way it is used, usually the encoded data structure mirrors the surface structure of the XML, but more of that can and should be done.

Wouldn't a better solution be to separate the metadata from the data so that multiple trees (or graphs) could be imposed on the same underlying information?

That's what XSL is about, imposing multiple trees or expressions on the same underlying information.

XSL was created to express how XML data should be presented or styled. It assumes there is one tree structure imposed on the data and applies styles to that structure. Imagine a 2 dimensional array:

  r1c1 r1c2 r1c3
  r2c1 r2c2 r2c3

To represent this in XML we have to assign an order to the dimensions. We can represent it as:

<data>

  <row id=1>
    <column id=1>
      r1c1
    </column>
    <column id=2>
      r1c2
    </column>
    <column id=3>
      r1c3
    </column>
  </row>
  <row id=2>
    <column id=1>
      r2c1
    </column>
    <column id=2>
      r2c2
    </column>
    <column id=3>
      r2c3
    </column>
  </row>
</data>

or as:

<data>

  <column id=1>
    <row id=1>
      r1c1
    </row>
    <row id=2>
      r2c1
    </row>
  </column>
  <column id=2>
    <row id=1>
      r1c2
    </row>
    <row id=2>
      r2c2
    </row>
  </column>
  <column id=3>
    <row id=1>
      r1c3
    </row>
    <row id=2>
      r2c3
    </row>
  </column>
</data>

In either case we've imposed one tree structure on the data and ignored the other. XSL doesn't get around that problem.

Irrelevant. XML is an interchange standard, just as is Unicode; neither are intended to be formats for in-RAM processing of data, although they can be used that way when appropriate (i.e. rarely).

Instead, its serialized representation of the 2D table can (and should) be used by the receiving server to rebuild a literal 2D table in RAM, which then suffers none of the limitations of the XML representation itself.

So although XML is tree structured, it can represent tables with no problem. Similarly, you can encode an order-free set in XML, even though one must choose an arbitrary order for the elements in the XML encoding.

This is a fundamental principle of the proper use of XML. Any use of XML that insists that the XML's structure itself is what is fundamental is simply incorrect use of XML, and a misunderstanding of XML.

But you are quite correct in the sense that this is a very widespread misuse/misunderstanding.

XML is not an interchange standard. It's a markup language.

Why are you assuming those are mutually exclusive? But in any case, let's say it's using incorrect terminology to call it "an interchange standard". That doesn't change the rest of the arguments above.

The arguments above don't change the rant at the top of the page. XML was invented to insert tree structured metadata into text. It forces the metadata author to pick one and only one tree. All other possible trees (or any other structure) are excluded because the "shape" of one set of metadata is imposed on the data itself.

When XML is used to serialize data, you're correct. The receiver can parse it and re-order the data as it wishes. The same is true for comma separated values. The big dream behind XML wasn't to replace CSV, though. It was to add "meaning" to plain text that a computer could "understand". Here's an example of some of this hype: http://members.ozemail.com.au/~visible/papers/xml-metadata.htm


Isn't this what XPointer is for?


See LimitsOfHierarchies

Perhaps this should be split and moved into LimitsOfHierarchies and/or XmlSucks.


CategoryHierarchy


EditText of this page (last edited January 4, 2008) or FindPage with title or text search