Comma-delimited text, or comma separated variables/values (CSV), is a popular format for storing tabular record data in TextFile?s. Most spreadsheets and databases support import and export to this format.
Compared to ExtensibleMarkupLanguage, CSV is more compact and simpler to parse.
However, XML can do a few things CSV can't handle:
- There is no way to specify encoding.
- There is no easy (visual) way to specify hierarchy. (Although hierarchy is often not needed -- LimitsOfHierarchies)
- No standard for specifying the schema (but one could include "headlines").
Compared to
TabDelimitedTables, CSV is slightly easier to hand-edit, having a visible field separator, but is harder to mechanically parse and write than a format which uses backslash escapes, and is harder to use with Unix command-line tools.
- I sometimes use pipes or carrots as the delimiter to make parsing easier, but you have to make sure the content won't have those characters. No character is perfect. Even tabs can be valid content in some contexts.
- As discussed below, the problem is less the universal need for escaping when an embedded delimiter occurs, and more the lack of agreement on what the quote rules actually are for CSV files; also the most common double-quoting rules are harder to implement on both the reader and writer ends than backslash-escapes.
CSV has the
PowerOfPlainText compared to binary formats.
Does it really? As evidenced below, the format is so convoluted that editing one in a TextEditor or other non-dedicated tool is inviting disaster. SqLite is far more useful IME.
DelimiterSeparatedValues are equally powerful but allegedly closer to DoTheSimplestThingThatCouldPossiblyWork.
Can't CSV files really be something other than comma separated? From a microsoft web page:
If a user selects English (United States), the decimal symbol is a period (for example, 3.14). If a user selects German (Germany), the decimal symbol is a comma (for example, 3,14). Similarly, the list separator character used in .csv files is a comma (,) in the United States but a semicolon (;) in Germany.
''Someone once wrote:"
- Commas themselves must be escaped when used in field data.
Huh? Yes, if I need to put a string into a CSV file, if a string has one or more commas anywhere in it, the string must be quoted -- but the quotes go once around the entire string. There's no need to escape each individual comma. (In XML, each and every less-than sign in the data must be escaped with <).
- So then any field containing both a comma and a quotation mark must be surrounded by quotation marks, and the quotation marks in the field must be escaped. You can't escape some form of escaping ...
- It's rather easier to construct a correct writer and reader using backslash escapes than with the double quote system. WIth the backslash system every input is defined; with the double quote system how are you supposed to react to When writing you can write one character at a time, rather than having to scan the whole field for anomalies; when reading the state machine for escaping has many fewer states)
From what I've seen, typically CSV strings are encoded using C-style escape characters.
That would actually be a DelimiterSeparatedValues file that uses comma as the delimiter.
CSV strings are also frequently encoded using two quotation marks to represent one inside the string.
Also newlines in a quoted string MAY be sometimes converted to C escapes or left unconverted.
It's also common for quotation marks within quotation marks to be unquoted, this is normally considered a bug but tends not to get fixed.
RFC4180 recommends quoting fields containing double-quote, comma and/or newline, and doubling double-quotes within the fields. According to RFC4180, there's nothing special about backslash.
See also: TabDelimitedTables, DelimiterSeparatedValues, ExtensibleMarkupLanguage, RelationalAlternativeToXml, FlirtDataTextFormat, SqLite
OctoberTwelve