Moved from XmlSucks
What is the purpose of this page? To find a better notation (than XML) for encoding the contents and schemata of relational tables?
In theory, any data can be represented as relational data, not just information from an RDBMS. Whether it's "good" in practice probably depends on the nature of the data.
This topic has since branched to:
Re: what are the alternatives to XML for data transfer?
CommaSeparatedValues are nice for relational data. They take up about half the space of XML, or less. What is mostly missing from standardization is a header format for table and column specifications (but see the headlines on TabDelimitedTables). Tentatively, you could use something like:
 @@Table: Customers
 @@Columns: custID, title, locationID, balance, notes
 512,"IBM",36,2406.34,""
 883,"Johson Movers",572,0,"47A compliant"
 Etc....
 @@Table: Invoices
 @@Columns: custRef, invDate, Total, Tax
 423,"12/14/2003",3500,854.17
 822,"03/01/2002",1476.34,322.05
 Etc....
Fancier versions are explored below.
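For concreteness, here is a minimal Python sketch of a reader for the format above; the "@@Table"/"@@Columns" markers and the CSV quoting rules are taken from the tentative example, not from any standard:

 import csv, io

 def parse_tables(text):
     # Parse the tentative @@Table/@@Columns format into
     # {table_name: [row-dict, ...]}. A sketch, not a standard.
     tables, name, cols = {}, None, None
     for line in text.splitlines():
         if line.startswith("@@Table:"):
             name = line.split(":", 1)[1].strip()
             tables[name] = []
         elif line.startswith("@@Columns:"):
             cols = [c.strip() for c in line.split(":", 1)[1].split(",")]
         elif line.strip():
             row = next(csv.reader(io.StringIO(line)))
             tables[name].append(dict(zip(cols, row)))
     return tables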
As a RelationalWeenie, I find XML to be a NavigationalDatabase, or an ASCII dump of one. In other words, the kind of crap that DrCodd sought to rid the world of in the late 60's.
So, you don't think it's worth the effort to express all of your database entries as entities that can be validated through Schema? Okay, then, how else do you validate your data storage and transport?
See comma-delimited example above. You list the column names, and then the data underneath. You generally don't have the repeating groups like you do in XML if you normalize, so validation is simpler. I just find relational structures cleaner. Maybe it is subjective, but it just works for me. DTD's are far more complex than they have to be, and to me that is a sign that a better abstraction is needed.
For column types, perhaps something like this can be used (borrowed from above example):
 @@Table: Invoices
 @@Columns: custRef(int), invDate(date), total(decimal:10.2), tax(decimal:8.2)
 423,"12/14/2003",3500,854.17
 822,"03/01/2002",1476.34,322.05

To indicate that empty (null) values are not allowed, perhaps use a plus sign or something: "custRef(int)+".
Another way to handle the validation is to send a DataDictionary table instead of a header for each table. The DataDictionary table would describe the schema for the data. The DataDictionary itself would not need validation since its format would be standardized. In other words, use a table to describe tables. It could be the first table (data set) in the text stream.
DataDictionary-based Example
 @@Dictionary
 People, Name, varchar, True, 30
 People, ContactRef, Integer, False
 Bills, Amount, Decimal, True, , 2
 Bills, VendorID, Integer, True
 Bills, DueDate, DateYMD, True
 ...etc...
 @@Table: People
 "Fred", 234
 "Bob", 343
 @@Table: Bills
 28.42, 5736, 2003-10-23
 1242.56, 3746, 2003-10-29
 ...etc...
The DataDictionary portion would describe the fields that are to appear in the table data that follows. Its columns are: table, fieldID, field type (ANSI standard?), isRequired (Boolean), max length (if applicable), and decimal count (if applicable). Other optional possibilities could be value ranges or sizes.
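As a rough illustration of how the dictionary could drive validation, here is a hedged Python sketch; the column order (table, fieldID, field type, isRequired, max length, decimal count) follows the description above, and the type names are assumptions taken from the example:

 def validate_row(dict_rows, table, row):
     # dict_rows: parsed @@Dictionary rows, as lists of strings.
     # Checks one data row of `table` against its dictionary entries.
     specs = [d for d in dict_rows if d[0] == table]
     for spec, value in zip(specs, row):
         field, ftype, required = spec[1], spec[2], spec[3]
         if required == "True" and value.strip() == "":
             raise ValueError("%s.%s may not be empty" % (table, field))
         if ftype.lower() == "integer" and value.strip():
             int(value)  # raises ValueError if not an integer
         if len(spec) > 4 and spec[4].strip():
             if len(value) > int(spec[4]):
                 raise ValueError("%s.%s exceeds max length" % (table, field))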
One possible drawback with this approach is that multi-line text could be a bear to represent, especially if intermediate processes break long lines into smaller ones. Perhaps BR tags could be inserted for line breaks, along with escapes for embedded quotes. And/or an explicit "end of record" character could be introduced, such as the dreaded semicolon. That way it would not matter if somebody split a line on a whitespace boundary.
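For example, a minimal sketch of the BR-tag convention suggested above; the choice of a literal "<br>" token is an assumption for illustration, not part of any spec:

 def encode_multiline(s):
     # Replace embedded newlines with a <br> token so that
     # intermediaries splitting long lines cannot corrupt the data.
     return s.replace("\n", "<br>")

 def decode_multiline(s):
     return s.replace("<br>", "\n")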
One could also put the table-header metadata in a separate file, with some form of version ID that the actual data references in one line at the top. Typically, ASCII exports are a separate file per table anyway. The metadata does not change as often as the actual data, so it would not have to be sent every time. It would be analogous to a DTD or XmlSchema.
Requiring separate files as part of the standard could be limiting in some cases, such as TelNet? transfers. I personally would not want that to be a requirement of a standard, but merely an option.
I guess I'm an XmlWeenie?, but this page seems to point out a common mistake among people who don't like XML because it's verbose or "complicated"... they start to create a replacement that's going to make more sense, then they run into all the problems that XML is meant to solve. To wit... you need a "magic number" at the top of the file. You need to indicate what character encoding the file uses. You need to account for extensibility. You need escaping rules. You need to accommodate multi-line data. It would be nice if the file were more self-describing instead of looking like line noise. And so on, until you realize you should have just used XML. -- MattChaput
What is an in-born flaw of the suggested approach such that those things allegedly cannot be added? What is an example of "extensibility" that this cannot handle, or must handle poorly? It is not clear whether you are claiming that XML is more mature, or that it is inherently better regardless of ripening times. And XML *is* verbose compared to this, because it violates OnceAndOnlyOnce by repeating attribute names over and over, and the block enders over and over.
I meant that XML is more mature and featureful. You could add all those features to this hypothetical file format, but is it worth it? When you were done, would it be the "much clearer than XML" ideal you set out for?
So you are saying we should stick with unnormalized technologies due to QwertySyndrome? How about COBOL in there too.
(And the redundancy in XML isn't wholly worthless... it adds some readability, and increases the ease of error checking and recovery. It looks like this file would become unusable if the lines describing the column headings were somehow lost.)
If you want to use redundancy as error correction (for the sake of argument), then why not send the file twice? With the space saved, there is room.
Using SQL Statements
This whole page stinks of NotInventedHere. Why not use SqlLanguage? A message can be thought of as a small, self-contained database, that is, a set of true statements. So data can be represented by SQL INSERT statements. If you need more power, you can also include UPDATE statements, or even a program that creates a database (think of a PL/SQL script creating and populating a DB). Of course, an INSERT-statements-only solution is semantically identical to (comma) delimited text. In such cases, I would prefer delimited text, as it is shorter and also easier on the eyes.
As for "specifying table and column format" - that becomes database schema. SQL DLL (CREATE TABLE customer ...) to the rescue. I see not advantage of proposing some home-made table schema.
Of course SqlLanguage is not a long term solution due to SqlFlaws, so some other language should be used. In fact, RelationalAlternativeToXml might be one of the few areas where an alternative to SqlLanguage could be proposed. But the basic idea that message is a database remains the same.
One may not know ahead of time whether an insert or update is needed until a comparison with the target database is done. Therefore, making all items "insert" statements may be misleading.
In my proposal, the message is the database - an independent, self-contained database. It's not an 'import script' (although it could be, for instance when the schema is designed so that users only record facts/rows and do not change them). You then run SQL queries against this database to the same effect as running XQuery or walking the DOM tree in XML.
Besides, non-SQL systems might want to use the same data, and making them all parse SQL may be asking too much. SQL as it stands is not easy to parse unless you stick to a narrow subset.
Besides, non-XML systems might want to use the same data, and making them all parse XML may be asking too much. XML as it stands is not easy to parse unless you stick to a narrow subset.
Unfortunately, the two statements above are also true. If only they weren't, especially the second one... Besides, only a subset of SQL is usable for messages, especially so if we use comma-delimited text or only INSERTs for the data part. SQL DDL (for schema definition, aka metadata) is IMHO not so complicated.
Minus the extra line-noise of the word "INSERT", using SQL is essentially identical to the comma-delimited format above. There's a comma delimited list of fields, then a comma delimited list of values, then a semi-colon line terminator. The extra SQL keywords aren't important unless you actually are using it as an import script, which I would recommend against ferociously. Moreover, there's no provision made for any sort of meta-data, like version information, data types, or anything else. This would rapidly fall down in complex situations, although it'd be sufficient in a YAGNI sort of way for just tossing data between two otherwise identical databases. XML is crappy for all the reasons stated, but on the other hand libraries for parsing it are readily available and it does address the futureproofing concerns. XmlSucksButYouShouldUseItAnyway?. -- ChrisMellon?
Possible problems with direct SQL are noted above: the insert-versus-update ambiguity, and the difficulty of parsing full SQL.
For those of us who have had to do substantial ETL (Extract/Transform/Load) processing between databases of unlike structure, there is the old tried-and-true "tagged pipe-delimited" format. It's an enhanced version of the raw "unload" output from some SQL engines.
Is this standard documented anywhere public?
It is, in many ways, like the CommaSeparatedValues examples above. It is exquisitely compact and internally checked. Not terribly "readable" by the human modules, but browsers are real easy to write, being simply a bit of formatting hung from the parser needed to load the stuff.
Little niceties like number of rows, type of records, source, target, and so on are dealt with in the header line(s).
Transmission overheads are minimal. Storage requirements are small. Flexibility is broad.
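Since the format apparently isn't publicly documented, the following is a purely hypothetical illustration of the general shape described: header line(s) carrying record type, source, target, and row count; tagged, pipe-delimited data rows; and a trailer for internal checking. None of these tags come from an actual specification:

 H|CUST|src=ERP1|tgt=WAREHOUSE|rows=2
 D|512|IBM|36|2406.34
 D|883|Johson Movers|572|0
 T|rows=2|checksum=1F3A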
We looked at using XML for this job, and found that, aside from making us rewrite all the ETL code (this format predates the popularity of XML) and aside from costing us more than twice the bandwidth, we would gain little in the adoption. We stayed with "tagged pipes" for our transport format.
The next data warehouse project was destined for a "more modernized" customer, who liked the idea of XML as the generic transport layer. Oh Gawd. We spent months just implementing the parser/DOM/object manager for the XML handlers - which became my part of the effort, along with the low-level data acquisition drivers. Three times the manpower, 20% more time to deliver. But it was StateOfTheArt and we SparedNoExpense.
Hey, I'm really interested in implementing this format. I'm currently using Lisp for a prototype (the file-format parser is implemented as a Lisp macro, so the file is parsed using EVAL), but the intention is to hide as much Lisp as possible, so that it can easily be parsed and used from other languages, and users will not be scared off by realizing they are using Lisp!! The problem I'm seeing now is how to handle nested structure. In XML, for example, the <birth-date> tag can be nested in a <person> tag as <birth-date><day/><month/><year/></birth-date>. But in my Lisp prototype format, I used a "string" and had the parser parse it. Now think about other user-defined structures: should I force the user to make a new table to hold that data structure? This kind of structure always has a one-to-one relationship with its ancestor. I know that in a real database I would have to do that, but separating it into a different table will make it harder for humans to edit, methinks. -- PisinBootvong
Do you really need 3 separate columns? Why not just have a date column? Every RDBMS I have ever seen supports a "date" type. In the example, I sort of assumed 3 date "types": dateYMD, dateMDY, and dateDMY to indicate the order of the 3 parts. For month, the month name or number was acceptable. Specific vendor DB's may have more limits, but a protocol that caters to every vendor is not realistic. We would probably have to make such a language TuringComplete for such. Maybe we can define a standard way to report problems with imports rather than strive to include general-purpose validation.
I used the date type as an example of a user-defined data type. See, in XML I could define in the schema a node that contains other nodes. For example, I could define the <address> node as <address><street/><city/><zip-code/><country/></address>, and I could declare in the XML schema that the <person> type contains <home-address> and <work-address> nodes, both of type <address>. That makes parsing easier, and the elements are grouped. How can I do that (specify a user-defined type) in a table? The easiest way I can think of is to have a separate table for each kind of user-defined type, e.g. an address table with street, city, zip-code, and country columns. And it must have another column for a primary key, used to link it with its parent table. Now here's a new problem: not all data we store has a primary key. In XML, a child node is always linked to its parent node because it is nested inside it; normalizing it into tables would require the human editor to manually keep track of the parent's primary key. What is the primary key of a person - his name? Imagine every sub-attribute table of person having to write out a duplicate of the person as a foreign key. That would violate once and only once. Any ideas? Or am I thinking about the problem the wrong way?
The problem with nesting is that in the long run LifeIsaBigMessyGraph. When composition (aggregation?) is no longer sufficient, you need to use references of some kind anyhow. The world is not a tree. Databases such as IMS from the 60's used to assume that aggregation was sufficient, but it turned out to be a wrong assumption. Aggregation may work in the short term for small stuff, but it is not change-friendly. It is yet another example of trees looking simple on paper but flunking in the real world. As far as keys, are you suggesting using text as the "database" itself? Where does the data come from? Regarding address management, see ContactAndAddressModels. What happens to your nesting when two people share the same address, for example? Also, it looks like we may be getting into the messy subject of WhatAreTypes. Relational is better defined than "types" in my opinion.
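For illustration, here is one way the <address> example might be flattened into the tabular notation used earlier, with a surrogate key linking child rows to their parent instead of nesting; the table and column names are invented for the sketch:

 @@Table: People
 @@Columns: personID(int)+, name
 1,"Fred"
 @@Table: Addresses
 @@Columns: personRef(int)+, kind, street, city, zipCode, country
 1,"home","12 Elm St","Springfield","12345","US"
 1,"work","1 Corporate Way","Springfield","12345","US"

Note that references also handle the shared-address case mentioned above: two people simply point at (or are pointed at by) the same address row, which pure nesting cannot express without duplication.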
Ignoring the issues of trees-vs-tables for internal storage (and most of us agree that for the storage of arbitrary business data, relations are a better solution than hierarchical models), and focusing only on the issue of textual representations of datasets of arbitrary shape (and tables more specifically)...
Why can't a table be encoded in XML thusly?
 <table>
  <name>Employee list</name>
  <heading>
   <attr>Name</attr> <attr>JobTitle</attr> <attr>Emp#</attr>
   <attr>Boss</attr> <attr>Salary</attr>
   <primarykey>Emp. #</primarykey>
  </heading>
  <data>
   <row><attr>Blow, Joe</attr><attr>Staff Programmer</attr>
    <attr>452-99-1123</attr><attr>373-73-7373</attr><attr>50,000</attr></row>
   <row><attr>Hornblower, Horatio</attr><attr>Lead Architect</attr>
    <attr>123-45-6789</attr><attr>373-73-7373</attr><attr>60,000</attr></row>
   <row><attr>Overpaid, Oliver</attr><attr>Chief Executive Officer</attr>
    <attr>001-01-0001</attr><attr>NULL</attr><attr>10,000,000</attr></row>
   <row><attr>Pointyhair, William</attr><attr>IT manager</attr>
    <attr>373-73-7373</attr><attr>001-01-0001</attr><attr>120,000</attr></row>
   <row><attr>Public, John. Q.</attr><attr>Tester</attr>
    <attr>999-99-9999</attr><attr>373-73-7373</attr><attr>40,000</attr></row>
  </data>
 </table>

While it, like all XML, is highly compressible :), it doesn't repeat the attribute names for each record. Think of it as an XML-ish version of CSV.
But why? If we are going to use XML, why not something like:
 [data definition stuff]
 <row empID=1234 name="Blow, Joe" ssn="452-99-1123" salary=50000/>
 <row empID=1235 name="Hornblower, Horatio" ssn="123-45-6789" salary=60000/>

Because you don't gain anything by it, except possibly readability. Any error that can be picked up through the duplication of names simply isn't possible if they aren't repeated. Those that can't, cannot be detected in the XML either.
 <row empID=1234 name="Blow, Joe" ssn="452-99-1123" salary=50000/>
 <row empID=1235 name="Hornblower, Horatio" ssn="123-45-6789" salary=60000/>

 @@empID,name,ssn,salary
 1234,"Blow, Joe",452-99-1123,50000
 1235,"Hornblower, Horatio",123-45-6789,60000

I am not sure what you mean by "repeated names". Do you mean matching open-close pairs of tags? Duplication of tags for the sake of error correction/detection is smelly in my opinion. There are other ways to get error correction that don't bloat up the content.
A relational alternative must support arbitrary types, up to and including relations. In other words, it already supports nesting.
(Forgive the static typing slant)
 @@Invoices (custRef(int), invDate(@@Date(month(int), day(int), year(int))))
 423, (12, 14, 2003)
 822, (03, 01, 2002)

Undercooked, but hopefully getting the idea across...
Why must it support "arbitrary types"? I suppose I should point out that I think type theory is weak. ThereAreNoTypes. Besides, most RDBMS don't support arbitrary types. -- top
Because your strings are an implementation of arbitrary types. In other words, the support that most RDBMSes have is limited at best, but anything which supports either (effectively) unbounded nesting or unbounded length supports arbitrary types by way of Turing equivalence.
A table, even if the contents are limited to strings, is a type. It is constrained to exactly the number of columns you specify, with exactly the names you specify. Relational operations are the means in which you map a type to another type, in the context of a third type.
-- cwillu
I am not sure if you are talking about definitions or needs. Note that our purpose here is to supply an alternative to XML, not invent a whole new TuringComplete language. As a side note on "constrained to exactly the number of columns you specify", see MultiParadigmDatabase. -- top
Turing completeness has nothing to do with this. Expressiveness has everything to do with it. And the 'constrained' bit comes directly from relational theory, which is the pillar that any table-oriented approach relies on; otherwise you simply move duplicated code to duplicated tables, and do not solve any of the underlying issues.
A table can indeed have an arbitrary composition of columns, but the fact remains that your code will use a describable subset of them. Even if you're using all of a dynamically generated set of columns, that set is still (or can be) described in a table.
This is all trivial anyway, as the real issues with types aren't that they don't exist, but that a perfect type system cannot exist. Either you err on the side of having types which cannot be demonstrated as consistent in your system but can in another, or you err on the side of having contradictory information. What is interesting is which side one approaches from. You come from the dynamic side, and therefore deal with increasingly slight contradictions when you push the limits of complexity. I often find myself coming from the static side, and deal with increasingly slight restrictions on what will compile. Either approach can be made to work, and easily gets around issues that the other stumbles on.
What I hope is that there can be a somewhat systematic way of embedding one approach inside the other, so that you can make use of whichever technique is most appropriate. I would say that JavaBeans and LispMacros are two examples (of admittedly varying quality) of this.
-- cwillu
Well, how about we look at specific cases instead of getting into definition squabbles. I lean toward validation rules instead of "types" to provide checking, if made available. However, "perfect" validation probably requires a TuringComplete language, and neither XML nor this attempt wishes to go that far. -- top
Well, many people do think that XML wishes to go that far, but this is beside the point; we're in ViolentAgreement. (i.e., ValidationRules? sounding quite reminiscent of PredicateTypes) -- cwillu
One possible stickler is "dynamic relational" as described in MultiParadigmDatabase. Attribute/value pairs may be needed in such a case, similar to the inside of an XML tag.
I came up with a grave-separated format. The grave character (pronounced as in French) is not used by itself as a meaningful character in any lexical system (check out Unicode, if you'd like). Commas do have lexical value, which makes them less suitable for delimiting text.
Combinations of quotes and commas working together are generally sufficient to prevent such problems. The problem with single "special" characters is either they may occur in the data itself, or they are not commonly supported across systems. The "grave" character may work in Unicode transfers, but not with ASCII. Plus, it may not print/display properly when debugging parsing issues. I like to use the pipe "|" when I need simple parsing and it is part of ASCII. However, every now and then the darned thing occurs in the body text. There is no perfect solution. The very thing that makes the grave unique enough to avoid body text conflicts also makes it non-universal.
When you have a free-form text file, where certain characters have special meaning, you cannot... um... escape the need for the EscapePattern.
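For instance, a small Python sketch of backslash-escaping for a pipe-delimited format; the choice of "|" as the delimiter and "\" as the escape character is an assumption for illustration:

 def escape_field(field):
     # Escape the escape character first, then the delimiter.
     return field.replace("\\", "\\\\").replace("|", "\\|")

 def split_fields(line):
     # Split on unescaped pipes, honoring backslash escapes.
     fields, current, escaped = [], "", False
     for ch in line:
         if escaped:
             current += ch
             escaped = False
         elif ch == "\\":
             escaped = True
         elif ch == "|":
             fields.append(current)
             current = ""
         else:
             current += ch
     fields.append(current)
     return fields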
Moving some of the discussion to FusdxDiscussion. Fusdx will be an alternative to CSV/XML.
I disagree with putting non-Fusdx stuff under a Fusdx topic. Also, FusdxDiscussion needs some refactoring and cleaning up; it is hard to tell what belongs to what now. I would suggest putting *only* Fusdx stuff under the Fusdx topic. Use PageAnchors if you need to reference sub-sections. But let's agree on how to refactor it first, before moving anything back here. Perhaps we can have a FusDxFormat? page and a FirstColumnDataFormat? page to keep them separate. -top
I disagree with it being "non-Fusdx stuff". You simply forked Fusdx into your own format and called it your own, which I am glad you did, because Fusdx has slightly different goals (overlapping many, though). From the way it's headed on your page so far, it seems to be version 9.5 being discussed instead of version 1.1. I also wonder if the "Flirt" in "FlirtDataTextFormat" came about from you reading too many of my Magick articles elsewhere, and whether you are flirting with my Fusdx too much... but this is good; borrow and take - it is all open. Remember, when FlirtDataTextFormat becomes popular, Top will no longer remain anonymous though ;-). As for your PageAnchor idea... this is not Standard Pascal, where we shove stuff at the bottom of the program. I prefer modularizing when things get large - except for databases, which are all one huge big global page. I can see where your mind is at, TopMind.
I have no idea what this "Magick" thing is; it does not ring a bell. I simply took the general idea of cell-per-row and did my own take on it, based on my experience with similar stuff from the past. I don't see that as "stealing" from Fusdx, but I don't want to get into a credit fight over it. You can take credit for sparking the idea of FLIRT if you wish, although you should know that cell-per-row is not a new idea. Let's just work together to find the best way to go about it. Or we can fork the idea if we can't agree, and let the reader decide. No hard feelings? As far as grouping with Fusdx, if we go that route, I'd elect "cell-per-row formats".
In fact, I can tell you the evolution of my thinking steps for coming to the approach I did. If you notice, my early attempts at the top of this topic use "@@" symbols as first-character markers for table headers. Then I saw your initial Fusdx example and started thinking about one-cell-per-row variations of my earlier attempts. At first I thought about following fusdx and put the data like this:
 @@record
 fred
 12/29/2007
 6.95

But then I thought, "What happens if the data has "@@" in it? Escape characters?" The multi-cells-per-row formats (near the top) based on CSV didn't have this problem, because the first data character of a line was either a quote or a digit, and could not be punctuation.
After a while I decided that I could avoid any kind of escape characters (except possibly for new-lines) if the first character was *always* an indicator, regardless of line type. This is what I consider the juice of FLIRT. I don't even know if it's original; it just works. Perhaps it comes from my early days of FORTRAN, where the first 6 or so character columns were reserved for non-code stuff. Thus, maybe FORTRAN gets the credit? Or perhaps the old impact-printer formats, where the first column was reserved for a control character such as the "new page" code. (I'm really dating myself.) It's more a fusion of existing ideas than an original invention. But then again, so is the iPhone :-)
--top
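To make the "first character is always an indicator" idea concrete, here is a hypothetical fragment using the record data from the example above; the particular indicator letters (T for table name, C for column name, D for data cell) are invented for illustration, and the actual proposal lives at FlirtDataTextFormat:

 Tcustomers
 CcustName
 CdueDate
 CbalanceDue
 Dfred
 D12/29/2007
 D6.95

Because every line carries exactly one cell (or one piece of metadata) and the first character is never data, no escape mechanism is needed except for embedded new-lines.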
See also: MultiParadigmDatabase, UniversalStatement, EtlIntegrator, FlirtDataTextFormat
CategoryInformation, CategoryText, CategoryXml, CategoryLanguageDesign