Fusdx Discussion

This will be a place to discuss FUSDX. Pages such as XmlSucks, RelationalAlternativeToXml, and similar had some FUSDX content that was moved here.


Line-Per-Cell Variations

FUSDX format is an alternative to XML and CSV for data dumps.

http://z505.com/cgi-bin/qkcont/qkcont.cgi?p=FUSDX-Standard

Sample:

 [column names]
 date
 price
 shortdescription
 longdescription

 [fields]
 12-DEC-2002
 27.93
 red t-shirt
 This red t-shirt is a perfect gift \n\ It also comes with a free tie.
 01-JAN-2003
 15.79
 blue pants
 These pants are for the winter \n\ They also can be cleaned with bleach.

Etc..
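
To make the counting behaviour concrete, here is a minimal reader sketch in Python (the function name, the blank-line handling, and the strictness of the check are illustrative assumptions, not part of any FUSDX standard). It gathers the column names, then slices the remaining lines into records of that width, unescaping \n\ as it goes:

 # Illustrative sketch only, not an official FUSDX parser.
 def read_fusdx(path):
     with open(path) as f:
         lines = [ln.rstrip("\r\n") for ln in f]
     cols_at = lines.index("[column names]")
     fields_at = lines.index("[fields]")
     columns = [ln for ln in lines[cols_at + 1:fields_at] if ln]
     # Skipping blank lines is itself a guess: an empty cell and a
     # stray blank line look identical, as discussed below.
     cells = [ln.replace("\\n\\", "\n") for ln in lines[fields_at + 1:] if ln]
     if len(cells) % len(columns) != 0:
         raise ValueError("cell count is not a multiple of the column count")
     n = len(columns)
     return [dict(zip(columns, cells[i:i + n]))
             for i in range(0, len(cells), n)]

Everything after the length check is pure positional counting, which is what the rest of this page argues about.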


I wonder if being line-per-cell increases the chances of things getting out of sync. With CSV, if one cell is messed up (such as escaped incorrectly), it usually only throws off the remainder of the line, not the whole file from then on. This is because CSV has a cell delimiter and a record delimiter (line feed), but FUSDX appears to have only one for both purposes, meaning that a problem spreads further. A fix may be a standard record header, but this invites new escape problems. I've even considered indenting data but not markers for line-per-cell approaches; the indentation is removed when read in. But this has its own problems. --top

Top, as FUSDX is something I recently invented, I would like any input and feedback before making it a more official standard. CSV files can easily become misaligned when a quotation/comma encloser is mismatched. It is hard to debug a CSV file with your eyes since there are commas, quotes, and nesting everywhere. A FUSDX file can be debugged with the human eye by scrolling through the data top to bottom.. whereas a CSV file requires the human eye to jump around in every direction.

FUSDX is not ideal, however. If there were a problem in a FUSDX file, it would be easier to fix, in my opinion, since you can literally scan all the data from top to bottom. I think escaping carriage returns/line feeds is extremely simple (KISS) and less error-prone than escaping three or five different characters that nest into complex obfuscations.

The problem I am considering is where there is damage, corruption, or bad escaping. If somebody added one line to the middle of a FUSDX file (say, something forgot to escape a line feed), half the data would be lost or bad. This is usually not the case with CSV. FUSDX seems to rely on positional counting, and if anything throws the counting off, then from that point on everything is off to the end of the file. CSV will get back on track upon encountering the next line, losing/corrupting one record at the most. FUSDX needs some kind of "beginning of record" marker rather than relying on counting, in my opinion. It seems that multiple kinds of delimiters (field and record) are necessary to have such a feature. I haven't found a way around it yet. Note that I've seen these kinds of problems many times in CSV where the parsers or generators made escaping problems/differences. --top

I shall start a FUSDX mailing list and official website on Z505. It is an Open Standard available for anyone to make suggestions, comments, and contributions.

Tab-delimited files often get misaligned since tabs are invisible.. making something invisible is a big no-no. Example: lots of source code snippets contain TABS for indentation. If source code is stored in a database (say, a wiki with source snippets, or developer documentation with source snippets), this becomes a nightmare. Tabs in FUSDX files are just tabs. Fewer error points.

Escaping a line feed is a matter of replacing all line feeds.. whereas parsing commas/tabs/colons is much more complex due to nesting and escaping. A carriage return doesn't have those complicated nesting/escaping problems. Even a simple 'Replace All Line Feeds with \n\' algorithm would work and would not be error-prone unless someone started editing the FUSDX file by hand and forgot to insert \n\.

I suppose a text editor could prevent a person from inserting carriage returns.. each time someone hit the enter key it would prompt them ("do you really want to insert a carriage return? this will create a new cell"). This could get annoying; it's just a silly idea. No format is perfect, but formats that are readable always win.. and CSV, in my opinion, is not human-readable. It is close, but no cigar.

One issue with FUSDX would be handling Unix line feeds, Windows line feeds, and old Macintosh line feeds.. but this can be solved the same way CGI/HTTP does it (choose one, and only one; CGI chooses CRLF).

One variation that may eliminate the "lasting counting error" problem is something like this:

 @columns
 $mydate
 $price
 $shortdescription
 $longdescription
 @record
 $12-DEC-2002
 $27.93
 $red t-shirt
 $This red t-shirt is a perfect gift \n\ It also comes with a free tie.  
 @record
 $01-JAN-2003
 $15.79
 $blue pants
 $These pants are for the winter \n\ They also can be cleaned with bleach.
 @record
 $Etc..

The character in the first column tells whether the line is data or a section marker. The marker character also acts as a check on the data. If, say, somebody forgets to escape a line feed, then the next line will likely be missing a marker ("$" or "@") and the parser can display a warning. --top
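
Here is a sketch of how a parser might exploit that first-column check (Python; the group handling and error wording are assumptions layered on the description above, not a defined standard). Any line that starts with neither "@" nor "$" stops the parse at the exact line where the damage is, instead of letting the error cascade to the end of the file:

 # Illustrative sketch of the "@"/"$" marker variation.
 def parse_marked(lines):
     columns, records, current = [], [], None
     for num, ln in enumerate(lines, 1):
         if ln.startswith("@"):
             if current is not None:
                 records.append(current)   # close the previous record
                 current = None
             if ln == "@record":
                 current = []
             elif ln != "@columns":
                 raise ValueError("line %d: unknown group %r" % (num, ln))
         elif ln.startswith("$"):
             cell = ln[1:].replace("\\n\\", "\n")
             (columns if current is None else current).append(cell)
         else:
             # Likely an unescaped line feed on the previous line.
             raise ValueError("line %d: missing '@' or '$' marker" % num)
     if current is not None:
         records.append(current)
     return columns, records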

[Spot on, Top. Your scheme provides facilities for error-detection on a record-by-record basis, as well as clearly distinguishing an empty value ('$' followed by a newline) from a record break, which FUSDX does not. I consider the chance of FUSDX happily absorbing 100 gigabytes or so of records -- assuming a missing column value followed by a series of record breaks mistaken for empty column values and vice versa -- before deciding the column count is erroneous, to be unacceptable. I wouldn't use it. I would use Top's approach.]

(NOTE: Ideas branching from Top's idea here have been expanded within FlirtDataTextFormat.)


Comment from Lars: not necessarily; FUSDX could use an /EMPTY/ directive. Error checking is done by counting the fields.

 12-DEC-2002
 27.93
 /EMPTY/
 This red t-shirt is a perfect gift \n\ It also comes with a free tie.
 01-JAN-2003
 15.79
 blue pants
 These pants are for the winter \n\ They also can be cleaned with bleach.
 06-MAR-2003
 35.90
 blue pants
 /EMPTY/
Please let me know what problems you see with this. Remember, we are trying to KISS (keep it simple) here, and I do not want to reinvent XML or a binary database format.
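
For what it is worth, such a directive would cost only one extra comparison in the cell decoder. A sketch, assuming /EMPTY/ must fill the whole line (note that a cell whose actual text is exactly "/EMPTY/" would then need its own escape, which relates to the objection below):

 def decode_cell(line):
     # A line that is exactly /EMPTY/ stands for an empty value.
     if line == "/EMPTY/":
         return ""
     return line.replace("\\n\\", "\n")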

In my opinion, the dollar sign is simpler. One character is usually easier to parse than many. Plus, your empty marker is an escape sequence, while "$" and "@" are not, since the first character column is *always* a control marker and never content, regardless of the "line type". The less we need to escape, the better, to avoid multiple levels of de-escaping risk. Anyhow, that makes it more similar to my suggested approach; they are converging somewhat. But I would rather use the record separator for other possible purposes, like table separation. And blank lines as separators just kind of make me nervous. I feel more comfortable with explicit (visible) markers. What if a tab or non-printable character ends up there? One could become a mad-hatter trying to debug that, especially newbies who don't understand character sequences (ASCII and Unicode). --top


Now we just gotta agree on how to specify (optional) types and lengths. Let's see. One approach is to extend the "@columns" group:

 @columns
 $mydate,date
 $price,number,12,2
 $shortdescription,char,20
 $longdescription,char,100

Or give each their own group:

 @types
 $date
 $number
 $char
 $char
 @maxlength
 $
 $12
 $20
 $100
 @decimals
 $
 $2
 $
 $
 @record
 $...etc...

Both bother me in different ways. Maybe we could just define a DataDictionary table rather than add new element kinds to our setup. Should we reserve a table name or add an optional "@dictionary_table" or "@dictionary" group?
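
Whichever way that question goes, the comma-extended "@columns" lines are at least mechanical to split. A sketch, assuming the positional order name, type, maxlength, decimals from the first example above (the helper name is made up):

 def parse_column_def(line):
     # "$price,number,12,2" -> name, type, maxlength, decimals;
     # trailing parts are optional.
     parts = line[1:].split(",") + [""] * 3
     name, ctype, maxlength, decimals = parts[:4]
     return {"name": name,
             "type": ctype or None,
             "maxlength": int(maxlength) if maxlength else None,
             "decimals": int(decimals) if decimals else None}

One catch is that a comma in a column name would now need escaping, which is one more escape for the format to carry.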

By the way, multiple tables (including dictionary tables) could be separated as follows:

 @table
 $foo_table
 @columns
 ...
 @record
 ...
 @table
 $bar_table
 ...etc...

So now we have:

 @dictionary
 @columns
 $table
 $name
 $type
 $maxlength
 $decimals
 @record
 $my_products
 $mydate
 $date
 $   // note that we could eliminate this line without harm
 $   // note that we could eliminate this line without harm
 @record 
 $my_products
 $price
 $number
 $12
 $2
 ...etc...
 @table
 $my_products
 @record
 $12-DEC-2002
 $27.93
 $red t-shirt
 $This red t-shirt is a perfect gift \n\ It also comes with a free tie.  
 @record
 $01-JAN-2003
 ...etc...


Comments from Lars:

 $12-DEC-2002
 $27.93
 $red t-shirt
 $don't know if this is part of the above line
 $don't know if this is part of the above line either
 $we have no idea what these lines map to, just like in FUSDX
 $adding a dollar sign is about as useful as adding three carriage returns instead of one
 $what row/cell are we in now? I still have no idea. The parser stops and says "too many fields in row".
 $which is exactly what would happen in FUSDX

Why does a dollar sign magically solve the problem? Nothing stops a human from mucking up the file with this format either.

I was thinking that if the count is ever off, then STOP. Adding corrupted data (as you suggest with a broken CSV file) is dangerous because the CSV file could still throw data into the wrong places. If there is any corrupted data, then STOP immediately and tell the person to fix the FUSDX file. It can pre-count the number of records before inserting them.. i.e., it doesn't insert them first and check for errors after.

Are you suggesting that FUSDX should be designed so that corruption is okay to pass by? I suppose one could have the ability to turn errors off, but I do not like the idea of continuing on and inserting more FUSDX data if the file is corrupted! Once the file is corrupted, it needs to be repaired immediately by the person who owns the file.. it is extremely risky to continue on and assume that a CSV or FUSDX file is correct when even one row is corrupted.

On the other hand, it could be nice in a non-mission-critical situation to pass over corrupted data and assume the rest is okay, but the thought still really scares me...
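
One way to read that "pre-count, then insert" rule as code (a sketch; insert_row is a placeholder for whatever the database layer provides):

 def load_strict(columns, cells, insert_row):
     # Pass 1: validate the total cell count before touching the database.
     if len(cells) % len(columns) != 0:
         raise ValueError("corrupted file: %d cells do not divide into "
                          "%d-column records" % (len(cells), len(columns)))
     # Pass 2: only now insert, record by record.
     n = len(columns)
     for i in range(0, len(cells), n):
         insert_row(dict(zip(columns, cells[i:i + n])))

Nothing is written unless the whole file counts out cleanly, so a corrupted file fails fast instead of half-loading.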




Strengths of FUSDX's simple carriage-return-to-\n\ conversion:

Carriage returns are all converted to \n\. This is foolproof when reading from a database. How could a carriage return ever not get turned into a \n\? The algorithm could not be simpler.. replace all Windows line feeds first, then replace all Unix line feeds, then replace all old Macintosh line feeds. This ensures that CRLF, CR, and LF are all replaced with \n\. The Windows line feed is done first since it is a CR and an LF rammed together, and we don't want to replace it with a Mac and a Unix line feed. Regardless, the algorithm is simple and trivial.
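
As a sketch, the whole conversion is three ordered replace-alls (Python shown here, but any language with a replace-all works the same way):

 def escape_newlines(text):
     # CRLF must be replaced first so a Windows line feed is not
     # mistaken for a Mac line feed followed by a Unix one.
     text = text.replace("\r\n", "\\n\\")   # Windows (CRLF)
     text = text.replace("\n", "\\n\\")     # Unix (LF)
     text = text.replace("\r", "\\n\\")     # old Macintosh (CR)
     return text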

When other formats, such as CSV, are converted from a database to text, you have to escape pipes or quotes or commas, etc. Then the carriage returns get inserted into the CSV file as real ones.. the file appears to have several more rows than the database has, which humans simply cannot read, and the file becomes obfuscated with data bouncing all over the place.

If someone inserts a carriage return:

 12-DEC-2002
 27.93
 red t-shirt
 some text here


This is an error because the specification says there are 7 fields. The parser informs the user that the data is corrupted and the file must be fixed. Making assumptions about the file and continuing on wouldn't be wise.. for data integrity. Once a file is corrupted, IMO this is a run-time error; the parser should not make "assumptions" and continue parsing, as some CSV parsers might. Possibly, as I said, in a non-mission-critical app this could be an option, but I seriously cannot see why we'd want to make assumptions and "forgive" a corrupted file.

I am not following this. How do we know they are not blank values instead of extra carriage returns? Are you assuming the new "/EMPTY/" convention? My comments on that are above.

Further, in tools such as spreadsheets, the user usually wants to attempt to complete the load even if there is a problem. I've seen it many times where a dialog box alerts the user to a problem but still gives them the option of trying to continue. (Being the nearest local office "techie", I'm often the one they come to when there are data-conversion problems.) I cannot scientifically prove it at this point, but I feel that the "dollar" approach is more robust under such circumstances, or at least equal. Yes, we could add extra error-correcting features to a format; but like you said above, we also want K.I.S.S. I believe the dollar approach is a good compromise. --top


Anyone's comments or suggestions for FUSDX format will of course be credited.


[The "dollar" approach description is being moved to FlirtDataTextFormat.]


