Power Of Plain Text

Original assertion: storing resources as plain text makes them easier to view, maintain, increases their longevity, and/or makes it easier to share information with multiple languages and systems. Popularized by the UnixWay.

Notice the best sites on the web are pure text/links: Google, Wikis, CraigsList. No intrusive ads, flashy media, slow/clunky formats (PDFs), instantly searchable, etc.
- Disagree, Kijiji is better than CraigsList for many people.
Keep in mind it's not necessarily about user interface. Proposing a character-centric user interface versus a text-centric configuration and inter-process communication method under the hood are perhaps two different topics.

Tentative pattern form

(Except the WikiName of this page isn't a good pattern name. Anyone care to suggest something better?) (I like it very much.)

As this title is also from ThePragmaticProgrammer, I submit the name of the tip, KeepKnowledgeInPlainText?.

Context: a program needs to store resources in persistent form. Common choices for this are plain-text formats and binary formats in the filesystem, or some sort of database.

Problem: the choice of a storage medium and format for persistent resources usually has far-reaching consequences on both the design and the ergonomy of an application. If we have formally expressed requirements that favor one medium or format over all others, we will let those dictate our choice; however there are situations in which the functionality we are implementing demands some form of persistent storage, but no particular choice of medium or format seems to be particularly appropriate.

Forces:

plain-text formats imply some parsing overhead
plain-text formats encourage user experimentation
binary formats are difficult to understand, test, and debug
binary formats usually imply proprietary formats
object or relational databases hinder portability

Consider also:

the volume of information to be stored
whether the resources may be accessed by other applications
whether it is appropriate for the information to be user-accessible
how resistant to user-originated corruption the resource should be

Solution: unless you have definite indications that a different format is more appropriate, consider storing resources as plain text. This is particularly useful when storing configuration information, but will also apply to application "data" in some instances.

Resulting context: during development, it is much easier to see what is happening to your stored resources. Plain-text format can be parsed or deserialized "by eye"; spotting data corruption due to your application is easy. Plain-text resources can be placed under version control, where a quick "diff" will usually narrow down why something broke.

Resulting problems: it might be harder to provide a user interface for modifying resources than if they were stored in some sort of binary format (e.g. serialized object graphs). If you have large volumes of plain-text data, and parsing the entire resource for each use is not appropriate, you will need some scheme for random access. (However, plain-text makes it easy to provide modularity in the form of include/import directives.)

If the application is used internationally, formats for numbers and character encodings may vary. Likewise the different end-of-line conventions used for text files stored on different platforms.

A plain-text format which is trying to do too much will show a tendency to degrade into a complex vocabulary, soon becoming impossible to understand for an average user or developer. Consider AlternateHardAndSoftLayers.

Examples:

Most of what's in /etc under Unix/Linux
Many internet protocols
Netscape preferences file (which looks like JavaScript)
Lots of databases. (See TabDelimitedTables)
Interchange formats, such as CommaSeparatedValues or ExtensibleMarkupLanguage
The .mdl files created by RationalRose
Some wiki-like systems
AlmostFreeText
StructuredText
YamlAintMarkupLanguage

Advocates of plain text are divided into the "DoItRightTheFirstTime" school, which tends to produce verbose, generic, complicated, extensible plaintext formats (e.g. ExtensibleMarkupLanguage), and the "WorseIsBetter" school, which tends to produce custom, neat, concise, simple plaintext formats (e.g. WikiMarkupLanguage).

Discussion

A more comprehensive listing of pros and cons:

Benefits

No special editor is required
Accessible to your standard PowerTools
The formats used can often be inferred by careful inspection.
Relatively robust in the face of data corruption.
Integrates well with VersionControl systems
There are standard libraries to parse XML, JavaScript, and so on.
Different languages/applications can read/write each others files
Every person, pipe, and computer sees the same stream of bytes
Byte order issues are not a problem with AsciiCode or UtfEight encoded Unicode. (UTF-16 encoded Unicode text still suffers from a byte-order problem, though the ByteOrderMark helps)

Disadvantages

Parsing overhead
Random access and update is hard -- normally have to read and rewrite the whole thing
Plain ASCII text is difficult to HyperLink
Formats can become unreadably complex. See the sendmail manual ;-)
Makes it easy for the stupid users to go in an mess up your program's configuration file. (Or is this a benefit? I've known users to edit binary executables with WordStar, and save the results; it wasn't pretty.)
Size (though text files compress well) You still have to uncompress it in order to read it, compression is "binary by nature" after all, so the size issue will come back to haunt you. << Editors can transparently de/compress archived files for editing. And we're talking about config files here. If your config files are large enough to require compression, you don't need a binary file format, you need to Greenspun a programming language to do your config in.
Character set issues (Unicode? {Dos|Mac|Unix} text files? This is not really a character set issue but an issue with choice of line separator. Using a non-Western language?)
Internationalization issues (Does this locale use "." or "," as the decimal point? What if we want to exchange files between locales?)
Security (because security has to be placed on the file instead of on the file's editor, because it's not possible for text editors to grant access to parts of a text file while denying access to other parts, because encryption is binary by nature and harder to use in text files, and because it takes less brains for a user to change something he's not supposed to in a text file)

Practical tools:

ExtensibleMarkupLanguage
LotsofIrritatingSillyParentheses (Lisp's EssExpressions provide a versatile representation for all kinds of data, a bit like a lighter-weight XML. Also note that EssExpressions are perfectly usable outside of Lisp.) I'm no Lisp programmer, can you give an example? -JaimeWong
CurlyBraces?
LittleLanguage (The host of little languages written at AT&T as Unix tools for processing text-based languages. The AwkLanguage being an obvious example.)
Packages in CPAN that read CommaSeparatedValues files.
CSV files can also be edited in any decent spreadsheet program, which is more convenient than a text editor because spreadsheets provide functions for processing table-structured data (cut/paste of rows/columns, fill a row or column, cut/paste blocks of cells etc.)

This is a principle advocated in ThePragmaticProgrammer.

If you like your pretty document formats, and you don't understand the dinosaurs who seem stuck in dusty old plain-ASCII, please read T. V. Raman's article "Welcome To The Universe Of Fancy Colored Paper!" [http://www.cs.cornell.edu/Info/People/raman/publications/colored-paper.html], "an attempt to encourage the use of interchangeable document encodings."

(Or is this more about TheRealStrengthOfXml?)

Good reasons to use Plain Text

May read, debug, administer, and edit using common available tools.
Especially suitable for configurations.
More readily survives across application lifetimes, and more readily to port across systems, so suitable for archival and interchange of human-developed work between humans. (It is useful even for a database-file to have the ability to export to plain-text for archival and interchange.)

Good reasons to use binary codes

Database-files and the like - high-performance indexing, memory management, memory mapping
efficient load times; keeping memory footprints small: read just the part of the file needed, reduced requirement for lexer/parser and associated overhead

Bad reasons to use Plain Text

Then users can edit the files with vi, and then we don't need to provide a structured preference editor. YAGNI.

Bad reasons to use binary codes

Most users overwhelmingly prefer a structured preference editor over having to edit text configuration files with vi. True, but that has nothing to do with whether the preferences are stored in a binary code or a plain text file - many preference editors read and write from text files.

: I don't agree with this. As a consultant in a more technical field of work I find the amount of work I have to perform to do certain text editing things in functional or technical specifications is a lot more in MicrosoftWord than vi. The power of the VimTextEditor is the ability to do certain repetitive tasks with the speed of lightning. Word's mouse approach takes a lot more time.

binary codes are always smaller than plain text Not really. See below.
binary codes are always more efficient than plain text.
binary codes are more secure than plain text (Do you mean "secure" as in "more difficult to accidentally corrupt", or "secure" as in "it keeps our trade secrets out of the wrong hands"? Either way, not really.)

binary codes are always smaller than plain text?

Not really. Far too many people design special, custom binary formats that are larger than compressed text files.

Yes, a .DWG ("efficient binary") file is shorter than a .DXF (more text oriented) file representing the same image. But DXF is best because .DXF.ZIP files are shorter than either one (even shorter than .DWG.ZIP).

Yes, a .pdf ("efficient binary") file is shorter than a .ps (plain text) file representing the same document. But .ps is best because a .ps.gz file is shorter than either one (even shorter than .pdf.gz). See http://www.complang.tuwien.ac.at/~anton/why-not-pdf.html and PdfSucks.

{The above is an unfair characterization. Not all forms of "efficiency" are aimed at spatial optimizations, and not all spatial-optimizations are aimed at on-disk space. Encode time, Decode time, Memory Overhead, Indexing, Partial Edit Performance, etc. can all be forms of 'efficiency' that may reasonably apply. If you look at .ps.gz just in terms of space-on-disk, it might be better. How does it compare regarding other performance features?}

I think this illustrates PrematureOptimization -- the misplaced focus on making individual parts of the binary file small interferes with the real goal of making the entire file small. If you're only allowed to look at one part at a time, the best you can do is a Huffman-style compression. But if you're able to look at even 2 or 3 parts at a time, you can often get much better compression -- LZ, LZW, and similar kinds of compression.

This is worth highlighting; if space is your concern, compression algorithms are going to do a better job of "designing" your binary format then you could ever hope to. My experience is that "compressed text with XML-type formatting" is typically still much smaller then "the plain text alone", and at best, your binary format can only aspire to such 0-byte overhead. Letting a compression algorithm do the work can result in your shiny new "binary" format having negative overhead.

Compressed text is almost never as fast or space efficient as the compressed form of a good binary format. XML is particularly inefficient. Consider the test results using a binary form of XML, versus a compressed text form: http://www.w3.org/TR/2009/WD-exi-evaluation-20090407/ . No one knows how many coal fired power plants have been built to supply the power for the extra processing required by inefficient formats such as XML.

{Are you, by chance, defining "good binary format" as one that is designed for high-performance interchange (as is EXI)? Most "good" binary formats I've written or used tend to be aimed at high-performance linear indexing, memory-mapping, partial writes (i.e. modifying specific values), and sometimes even garbage-collection, allocation, and durable transactions (e.g. binary database files). Many of them will not compress well because they are not designed for that purpose.}

Random access doesn't necessarily require rereading the entire text file. Consider the Adobe Structured PostScript conventions; the preamble contains the fonts and operator definitions, while individual pages are set off by special comments. It's easy to view a particular page in GhostScript.

Consider a well-written LaTex file; commands are defined in the preamble, and it's easy to render a particular chapter or section.

You still have to serially read through the data, parsing it as you go, to find the special comments. On the other hand, random access in a file with fixed-size records can be implemented by seeking by byte index.

Not entirely true, you can also lseek() over the sequential file and read until you see a page-marking comment, then do next lseek() based on that information. This way, you can implement a binary search.

Then use a subdirectory instead of a single file. You can randomly access every file in the subdirectory.

That depends on the file system. Many Unix file systems do a linear search through a directory for each directory look-up. This becomes a problem when you have thousands of files in a directory. This is why databases are implemented as files containing fixed-size records, or bypass the file system and access the disk directly.

-- Then don't have thousands of files in a single directory. Have ten subdirectories with a hundred files in each, or ten subdirectories with ten sub-subdirectories in each and ten files in each sub-subdirectory. Write your program's file search system so that the subdirectories are transparent (no text file when naming another needs to be aware of the subdirectory system in use). On second thought, maybe this is too complicated.

Ummm... It would seem to me that the power of binary files to allow random access comes entirely from forcing fixed length records. The same can be achieved using plain text as well, it's merely a matter of providing padding. Further, for the case of variant-size records, databases don't handle those too well, while plain text can. Above and beyond those arguments though is the fact that if random access is an important requirement, it can be designed into plain text file formats as easily as binary file formats. I could choose to divide up my file into fixed length blocks and create a plain text b-tree representation. In fact, if speed of access was really an issue, I'd do the same thing people do with databases, i.e. have index files optimized for speed of search and access. Indexes, by virtue of being a controlled subset with a specific purpose, can be designed to be plain text with fixed length fields. The only difference would be that current binary index formats probably encode file offsets into the database as a 4-byte binary integer representation, and the proposed alternative would use a 10-character text integer representation. -- AnonymousDonor

random access of text files: email

Some email software keeps data in 2 completely different kinds of files, gaining the benefits of both Plain Text and random-access data.

"mailbox" files are plain text, holding the full contents of many pieces of email.
"index" files corresponding to each "mailbox" file. These contain some summary information about each piece of email -- the From, To, Subject strings, some other bits of information, and the *exact* byte offset of the corresponding email in the "mailbox" file. This makes some common types of searches *much* faster than sequentially scanning through the text mailbox files. I assume these are binary files.

All the real data is in the "mailbox" files. If the program ever suspects that something smells fishy (say, the file modification date of the "mailbox" file is later than the file modification date of the corresponding index file), then the program deletes the index file. If I'm moving/archiving the mailboxes, I can delete the index files.

Whenever the program needs the index file I just deleted, (for example, when I ask it to search for something), the email program regenerates it from the corresponding mailbox file (this takes a tiny bit longer than just searching the mailbox file directly, but then the *next* time I do a search, the index file will make that search run much faster).

EditHint: Surely there's a less verbose way to say this. What was my point again? -- DavidCary

For more details on handling megabytes of email blazingly fast, and the importance of keeping *all* the data in plain-text mbox files (although we might *cache* some of that data in binary summary files), see "mail summary files" by Jamie Zawinski http://www.jwz.org/doc/mailsum.html

If you apply the AlternateHardAndSoftLayers methodology (and you know you should), then a suitable PlainText format might simply be your soft language. Think of Emacs, where the configuration file simply consists of elisp code. This has several additional advantages:

You get procedural abstraction ForFree.
You don't have to hack Lex/Yacc to read in your own format.
If your language supports byte compilation, then you can eliminate the parsing overhead associated with PlainText by compiling to bytecode.

LuaLanguage was also designed with this in mind - the compiler has been optimized for reading large files of textual data. -- ScottVokes

LessIsMore -- MatthewTheobalds

However,

We should also be open to the possibility that the traditional 7-bit AsciiCode might be preventing us from moving forward. For example, I used to use Nisus (a Macintosh word processor that stored text in one fork, formatting commands in the other) as a source code editor, and found the ability to annotate source code with bold, italic, even colors, did wonders for source code. Which got me to imagining other, more automated, uses for the extra bits. For example, imagine a programming language in which recent changes 'automatically' changed color to glow red hot, and 'automatically' "cooled" to black over time.

Also, experience with getting X-windows, Apache, and JServ configuration/properties files to do the right thing make me wonder if such files aren't the problem instead of the solution. I feel far more confident changing real code in a programming language like Java or Perl than I do changing properties files, if only because languages tend to do a better job with error messages than property file readers do. Maybe wider adoption of XML will help, but I tend to doubt it.... Tomcat's XML configuration process is about as fragile as JServ's property files in my experience.

Finally, I find the command completion stuff in the VisualAge for Java editor extremely powerful compared to programming in an ASCII editor like vi or emacs. -- BradCox

I was about to comment that the NeXT/OPENSTEP toolchain compiled .rtf files, then I saw the signature and figured you already knew that :)

I believe Tomcat's configuration fragility is due to poor error detection and handling.

Emacs' command and text completion facilities are powerful, fairly consistent and pervade the UI. Unfortunately they're not obvious when you're accustomed to typing things in full. -- MatthewAstley

Brad, Emacs can already do something a little like your cooling suggestion, but it's a general text feature, called highlight-tail, which colorizes the most recent text, and as time passes, slowly shifts the color towards the regular text color. I got my copy of it from http://groups.google.com/group/gnu.emacs.sources/msg/b50d53424e7ca0cd?output=gplain and it's worked well for me. --maru dubshinki

[Regarding confidence in "getting X-windows, Apache, and JServ files to do the right thing": the solution here (unless it's dead simple of course, maybe even then) is to write config files in a RealProgramingLanguage?. EmacsEditor and AwesomeTheWindowManager do this, for example (with Lisp (EmacsLisp) and LuaLanguage, respectively).]

discussion of changing preferences using "preference editor" vs. "text editor" moved to PreferenceEditorPattern

I'd say that PlainText is even more important for ProtocolDesign? than for storing data.

Mistake #1: Newbies at protocol design frequently jump straight into binary representations thinking that bandwidth must be conserved at all cost, whereas if bandwidth is a concern the protocol will be far better off if it is designed to use plain text but compressed before going on the wire.

Mistake #2: Those who have designed programming language APIs frequently design a protocol the same way, and then send '0' or '1' instead of 'true' and 'false'. Now, a protocol and an API are very similar constructs, but the differences are enough to lead designers easily astray. Protocols must usually be supportable by many platforms.

Reasons for designing protocols in PlainText and using expressive language:

Debugging is easier if I can sniff the wire and read off it.
The protocol is more likely to be interoperable.
Error code space (e.g. HTTP error codes) is usually limited. Error message space is not by nature, yet can be by specification.
Specification is easier to read, criticize and improve.

-- LisaDusseault

This depends on the protocol, and how its used. If you're dealing with something like IP, UDP, TCP, ICMP, DNS, ARP, ... that are frequently manipulated in hardware, then its important to use a good representation from which hardware can easily extract information. When dealing with fixed size frames other details are important. For example, protocols that are framed by IP will place important identification information in the first 64 bits (e.g. port numbers, sequence ID) because ICMP error notifications include these first 64 bits in their packet.

However, in any protocol design, the abstract sequence of events and state machines can, and should, be verified independently of the representation. -- DaveWhipp

You better believe that bandwidth is an extremely important part of communication protocol design and you will never find any ASCII text compression technique that will come within an order of magnitude of the level of efficiency in using binary codes. Aside from dedicated office LANs, most communications media suffer the problem that as the bandwidth needed for a channel increases, the number of channels available must decrease. -- WayneMack

Can we please differentiate between WireProtocols? and ApplicationProtocols?? All of the protocols mentioned by Dave are WireProtocols?. The PowerOfPlainText for application protocols is amply demonstrated by the RequestForComments. Having written a couple of application protocols for toy applications, I can testify to this. To appeal to authority, I recommend people have a look at BXXP, the new meta-protocol(application) designed by MarshallRose?. -- AndraeMuys

you will never find any ASCII text compression technique that will come within an order of magnitude of the level of efficiency in using binary codes.

I found one over at ItsTheLatencyStupid.

In the example given, sending the ASCII text "YesNoYesNo?" takes 102.4ms. Sending a single binary bit (the fastest possible binary code) takes 100 ms.

Yes, bandwidth is important, but latency is even more important.

Consider the WindowsRegistry? for an instance of moving away from KeepKnowledgeInPlainText?, in a misguided attempt to solve a problem with the latter pattern. Worked great, eh?

There is nothing wrong with the idea of the registry - it does what it should (provide fast access to configuration variables). Where Microsoft got it wrong was in keeping the native form of the registry in something other than plain text. Microsoft could have kept the on-disk representation of the registry as plain text, and kept the in-memory representation in whatever form facilitated fast access. This would then require rewriting the registry when it changed.

[NOTE: this would be pretty slow, too. But that's just Our Good Buddy Windows for you, eh?]

It should be noted that Microsoft does provide a means to export and import the registry in a textual form. And there are third-party tools that do this as well. Microsoft also supports updates to the registry that are encoded in a textual form. Here's an example of a ".reg" file's format:

 REGEDIT4

 [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Java VM\Security]
 "EditCustomPermissions"=hex:01,00,00,00

Yup, that's perfectly BassAckwards.

[It would be kind of bad to have the registry floating around as something readable on NT or 2000. Not very secure. Not like the registry was the best idea re: security to begin with.]

But the registry is already readable through either tools like regedit or a simple program using documented API calls. The binary format is only trivially more secure than a text version would be.

Eh? Who said anything about security? Binary provides absolutely no security improvements over PlainText; in fact, the reverse is often the case, as revision control and versioning are much harder problems for binary formats then text. Security for configuration files is a function of a) FileLevelSecurity? in the file system, and b) cryptographic protection.

I think the distinction here is that in NT, the registry file is maintained by the operating system, and security can be set on different parts of it given that non-disaster-recovery access is exclusively through the registry API. Therefore it really doesn't matter what it is behind the scenes... -- MattBehrens

[The problem with the registry is that it's a big black hole just like the windows/System directory. One of the main problems is readability and accessibility. Half of the things in the registry we don't know what are for unless you sit there and study it for long periods of time. Sometimes the registry is used to hide things like security keys because of this fact of it being a black hole. People take advantage of the fact that it is a black hole. It's not a debate of whether or not it is a black hole since people have proven it useful to hide things in. Unfortunately, it's disadvantages are used as an advantage by commercial developers.]

The problems with the Windows Registry are not because of it's in-system representation. If it were all text, it would still be a huge pain. The problems are multiple:

no docstrings or comments. If the name isn't informative, you're SOL.
Guids-as-keys not being represented in a useful way by the registry editor. Nobody needs to see that slooge of hex unless they ask for it.
No useful scripting system.

PowerOfText? wouldn't be necessary if there was a solid universal scripting system. The argument for text is always "I can script it" - well, if I could pop open a console and say "help(myapp)", review the api, and then say "foreach doctype in myapp.settings.doctype doctype.showicon := false" I would be a damn sight happier than I would be with a text editor. Windows tried to do this with the Windows Scripting Host, but we all know what a fiasco that was. I guess I've been spoiled by Python, where the lines between a module, data, program, and documentation are pleasantly fuzzy.

I strongly oppose the use of binary configuration files, even ones (suggested above) created by serializing/picking objects in Python or Perl, or via some gory C++ solution.

Things I love about text files that were mentioned above, such as editing in an editor, I will not rehash, but I didn't see anyone point out that verification of the correctness of the contents of a text file can be done more easily than the verification (visually, in an editor or some other tool) that the contents of the file have not GoneToHellInAHandbasket.

Who wants to tell us campfire stories about RegistryCorruption?, HackingDbfFilesManually? or any other binary recovery nightmares?

I use CommaSeparatedValues files for flat tables which are small enough to fit in memory all at once.

I keep SQL syntax formatted dumps of all my SQL database tables, that can be executed as scripts, to repopulate my databases if they are corrupted.

When using Python, I use pprint (pretty printer) to dump out my data dictionaries, which can be saved to text files, and parsed (with a small overhead), using the eval() builtin, except where security/trust issues prevent me.

When using results of queries on databases locally, I keep the query results locally in CSV or XML format, in memory inside components constructed for the purpose, or on disk. If I am ever worried about system state, I can dump everything to a a temp directory, zip it up, and I have a record of my entire application state, including lookup tables, etc.

Text is wonderful. Long live text. WPostma.

Proposition:

: PlainText : Power Unix User :: MicrosoftExcel : GUI-minded office worker

A plain-text file (CommaSeparatedValues, Space-Separated Variable, or TabDelimitedTables) can be edited using emacs, Notepad, or Excel.

(comments on preference editors moved to PreferenceEditorPattern)

See WorseIsBetter, YamlAintMarkupLanguage, PragmaticProgrammer, UnixWay, TheRealStrengthOfXml

Good points, but with regard to the portability issue of RDBMS, it is often quoted but never in my experience actually an issue. Have any of you ever heard of a non-trivial database being switched from one RDBMS to another? Once a company has invested in an RDBMS, they are exceedingly unlikely to change. -- JamesRadvan?

I've noticed that the iPod stores contact, calendar and note files as plain text. Because the iPod mounts as a drive, I can just "build" the set of text files I want to take with me. I used OutPod? to convert my outlook contacts folder into a set of (plaintext) vCard files, and my calendars into a set of iCal files. (Obviously I deleted the contacts and calendar folders in outlook because of OnceAndOnlyOnce.)

I like this because:

I am checking my iCalendar and vCard files straight into cvs, so I get full history and diff features.
- It's also one less thing to back up, because cvs gets backed up anyway.
- It's accessible everywhere that my cvs account is available.
I can search my appointments and contacts using grep and find. (No need to load a big bloated, monster application.)
I can write simple shell scripts to selectively archive, copy and index the data.
This storage format means I can write simple GUI tools using file associations, or servlets whose model is part of the file system. This is easy to do incrementally, because I don't have to unpick a binary format.

I got tired of all the intricate, proprietary addressbook and calendaring systems out there for small gadgets, so it's really nice that the iPod uses iCal, vCard and plain text notes.

I've been considering switching from CVS to Subversion, and the fact that Subversion does not use plain text is the one thing which I really weigh against the benefits.

I've been using Subversion for about a year now, and haven't lost any data. And you can always get a plaintext dump of your repository.

Notice how almost all program source code is PlainText and usually edited with an unrestricted text editor. (Exception: MicroSoft VisualBasicForApplications presents a modal error box when syntax is violated.) I could imagine a structured editor for C++. Would it be useful? pleasant to work with?

The primary problem with plain text is that there really is no such thing anymore (if there ever was). For starters, there are encoding issues, but let's assume that you use one encoding (e.g. UTF-8) everywhere.

Problems start at a very fundamental level if you want to keep things simple enough for most people reading/editing your files to accurately mentally model the format and not get bitten by stupid things (such as invisible characters being significant). Invisible characters are e.g. white space at the end of a line or white space not recognized as white space by the software in the middle of white space that is.

Unicode 4.0 has 26 white space characters, some of which are produced simply by pressing spacebar in some input systems. (CJK, their characters are of fixed size and take more screen space than Latin characters, so they have their own space character) ALT+space produces \xa0 (non-breakable space) on many systems, and it looks just like a space. Most software doesn't consider it as white space, which results in confusion. When some software does and some doesn't, there's even more confusion (similar to broken HTML "just working" until you try to look at the page using your mobile phone's browser). Not surprisingly, many popular formats don't get white space quite right.

There are many end of line conventions, and XML 1.1's got patched because 1.0 apparently didn't get them right enough. Python has significant white space, and it's commonly accepted that mixing spaces and tabs when indenting is bad due to different tab widths in different editors. Mixing them is allowed by Python and is considered to be a misfeature. Makefiles allow tabs only, which is considered brain-damaged (and annoying) by many. (see TabsVersusSpaces).

When people see white space, they often expect it not to be significant, as it's sometimes impossible to visually distinguish from lack of characters (at end of line or end of file). They also often expect different white space characters to have the same meaning because they are hard to distinguish from each other. Programs, however, have problems in determining what exactly is white space, and of what kind, and it can be expected that many programs' idea of white space is " \t\r\n", and assuming anything else will cause compatibility problems. Even this set of white space characters isn't safe, though. (carriage return and line feed handling is broken in programs on various platforms, make barfs on spaces, YAML parsers barf on tabs)

When people see plain text, they expect different things depending on their experiences. Programs that process plain text expect wildly different things, and as many of them are being patched to work with multiple encodings their behavior becomes even more unpredictable (and dependent on the locale) due to white space, sorting order, etc. sometimes depending on the locale. In contrast, when you have a binary format, nobody usually expects anything, so you get to make its specification clean and concise.

No wonder many projects just use XML. All the more or less arbitrary decisions are already made for you by a more or less arbitrary decision process, but at least the format is in common use and well defined. Yet white space haunts us even here: non-validating parsers can't tell the difference between all significant and insignificant white space, so they just give them all to you. The problem is much simplified, though, and you can always do a validating parse. But is all the complexity of the XML really necessary for most things? And how is e.g. YAML any simpler, the spec is rather long? How do you know any random YAML parser is correct, given that YAML is used much less than XML and the parser implementations are moderately complex stuff?

OK, so that was kinda long :-) But this page depicted too rosy a view of plain text formats so I had to complain a bit. I myself like plain text, but would like people to standardize on something simple, unambiguous and not obviously broken (backwards compatibility be damned, having a zillion different end of line markers is just drain-bamaged (BrainDamage)).

Type-free and to a lesser extent dynamic typing sort of borrows many of the ideas of PowerOfPlainText. It is easier to transfer data between systems if the source and destination are not as picky about types. Fiddling with type conversions between databases often consumes a lot of tedious busy-work time. For example, validation must happen before the load process if you use required types. However, type--free allows one to load stuff first, and then validate it after loading. This can simplify a lot of validation algorithms. (Related: DynamicRelational)

(Somebody has been screwing with my handle, perhaps out of anger for use of the word "anal", so I removed it and the handle to avoid further edit-wars. Next time, please ask.)

What The World Needs, Next:

It seems to me, that Computer Kings need to mark their Carriage Returns on all Computer Keyboards, better, in order to reduce the confusion to Printers, who seem to guess if the typist wants a Single Spaced Carriage Return, or a Double Spaced Carriage Return, which often results in a Random Spaced Carriage Return in most my perfect e-mailed letters.

Actually, some computers THINK that my Carriage Returns are Paragraph Markers, which causes a lot of headaches to many, like me, who type with only two fingers, and who don’t need “Default’s” help, and find it extremely difficult to type an Upper Case Carriage Return in order to get any Carriage to Return, on most Computers.

So, I hope someone creates Dictation Software, soon, which types as fast as me, with no more Carriage Return errors!

James H. Armistead (age 478), or Shakespirit@gmail.com 1200 W. 4th Street, Apt. 203 Centralia, IL. 62801 (618) 292-0129

FacePalm