Tab Delimited Tables

Software such as SuperCalc?, dBase II, Excel, and many other RegisteredTrademarks? have had the ability to emit TabDelimitedTables for years. Often, these tables have a header which names the field in the column below it. This facilitates data interchange between various pieces of software.

Rod Manis, Evan Schaffer, and Robert Jørgensen have written a book titled Unix Relational Database Management. ISBN 013938622x They describe a suite of tools which they collectively call /rdb

If it uses something other than tab for the field separator, it's more generally a DelimiterSeparatedValues file.

Their tools operate on TabDelimitedTables, in a Unix shell environment. They interoperate with other Unix tools. I found it just as I was learning the Bourne Shell. Their suite operates on tab delimited tables with "headlines" (see below).

The table object has been quite useful to me, and to others I work with - including the computers I interact with. With tables of named information, both the computer and I can do operations on named columns, such as sort or subtotal.

With the promise of XML, humans and machines will be able to refer to the same sorts of information. If a computer were to join StateAbbreviations and AreaCode? on the ST field, it could name the capital of the state with area code 513. Can you?


headlines

Headlines are bits of meta data included within the result of the query.

In /rdb, from RevolutionarySoftware?, a head line consists of a tab delimited list of fields followed by a tab delimited string of hyphens underlining each field.

In /RDB, Hobbes replaces the hyphens with an N for a numeric field, and with M for a field with a month.

The name of the fields in the headlines simplify pipelines, by allowing a field to be referred to by name. Imagine a table called checkbook with a headline of checknumber date payee amount withholding memo

Dataflow for making W2 forms might be about as simple as

  cat checkbook | sorttable payee | subtotal -l payee amount withholding

or simpler, sorttable payee < checkbook | subtotal -l payee amount withholding

The beauty of this suite of tools is that the beginning and end of almost every pipeline are TabDelimitedTables.


Tabs tend to be problematic characters for cross-system data sharing. Sometimes they end up getting converted into spaces for various reasons. Thus, CommaSeparatedValues tends to be more popular because they rely purely on WYSIWYG characters.

Note: If a user puts a comma in a single field, put quotes around that field. If the user wants to put quotes in a single field, escape them with a backslash, or double the quotes character. That's a weird mashup of both DelimiterSeparatedValues and CommaSeparatedValues quoting rules and I haven't seen the program that handles both conventions in the same file.

Precisely because tab is *not* a WYSIWYG character, it's useful as an out-of-band character to indicate "next field", while allowing *all* WYSIWYG characters to be used in any field without any weird escape codes.

That is true, but its downsides override that in my opinion.

The AsciiCode allocates four control characters (28-31) for record structuring:

It is a shame that these aren't used more widely for their intended purpose.

Probably because it defeats the purpose of being "ASCII", which usually implies it is human-viewable in a generic editor. Using those would make it not much different than a "binary format".

Only because the ASCII separator codes never caught on for general use, unlike other standard ASCII control codes used in text files, such as TAB, CR, and LF.


In some of the seminal texts about UnixProgramming? and the MakeProgram, the concept that the tab is difficult to spot among spaces has been mentioned by some as a design flaw [SyntacticallySignificantWhitespaceConsideredHarmful].

Perhaps it is okay for same-program usage, but not for cross-system.


See also: RelationalAlternativeToXml


CategoryBook


EditText of this page (last edited October 10, 2012) or FindPage with title or text search