String Without Length

AntiPattern Name: StringWithoutLength

Type: Data structure design

Problem: You want to store strings of characters, of arbitrary length. This is an early problem in compiler design, and in the design of programs in assembly language.

Context: Ubiquitous. This antipattern, most unfortunately, is embedded in C. Many other languages got it right.

Supposed Solution: Terminate the strings with a magic 'terminator character', such as '\0'. When you need to read the string, just iterate until you hit the terminator. Then you can forget about the length!

Resulting Context: You can read strings of any length smoothly! This appears to avoid the need for keeping track of the length (though it turns out that it doesn't).

However, when you try to copy or write strings at runtime, you overflow buffers unless you separately keep track of the string length. You find that you need to iterate through strings to find their length very often once you start doing any serious string manipulation (starting with word wrapping on the screen). Soon you realize you would have been better off carrying a length around with your string everywhere.

(Also, you can't have any strings containing the terminator character, generating occasional issues related to ZeroMeansNull.)

Originally, this was popular because it apparently avoided the problem of storing the length - since one wants arbitrary length, one seems to need a rather complex scheme for storing it (the length field itself may need to be arbitrarily large); it also simplifies some assembly language code. It also makes it easy to bust longer strings into a bunch of smaller strings in place, as strtok does, which has no equally efficient implementation with length-delimited strings.
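
For illustration, here is the usual strtok idiom: it splits in place simply by overwriting each delimiter with a terminator, and each piece is immediately usable because nothing else records where it ends.

 #include <stdio.h>
 #include <string.h>

 int main(void) {
     /* The buffer must be writable: strtok overwrites each ',' with '\0'. */
     char line[] = "alpha,beta,gamma";
     char *tok;
     for (tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ",")) {
         printf("[%s]\n", tok);   /* each token points into the original buffer */
     }
     return 0;
 }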

Originally, all input strings were 80 characters or less. Trailing blanks on a HollerithPunchCard were coded as nul. For keyboard or paper tape, you set up an 81-character buffer, preset to all-nul, and read a maximum of eighty characters to EOL. Keyboarded devices like the ASR35 ran out of line at 80 (sometimes 79) characters, and PaperTape was always punched to behave itself. A nul signalled the end of input. Since this was also the usual way for character-mode devices to signal end-of-input, the buffer and the device were compatible. This doesn't justify making ASCIZ a data type in a higher-level language, but it does explain it--I hope. Just about every low-level program I've written included functions to convert back and forth between ASCIZ and TEXTC. Fortunately, they take the same number of bytes, so it can be done in-place. --mt

Nowadays, it's popular almost entirely because it's been embedded in C - "quoted strings" automatically become these lengthless things.

Applicable Positive Patterns: Instead, strings should be stored with a length automatically attached to them. In Pascal, this was only one byte, which limits strings to a length of 255 characters, not that that seemed to cause any real practical problems (see LeasedString). Long strings could be, and were, represented using ropes or other higher-level data structures, provided you had a version of Pascal with the right extensions.

In properly written C nowadays, 32 bits are generally used for such things, which limits strings to a length of 4 gigabytes - almost certainly enough. But even more could be used if necessary; for instance, there could be a byte specifying how many bytes specify the length, which would allow for lengths of up to 2^(255*8) bytes. Linked lists of string fragments, or "ropes", mentioned below, are another method. Of course, the '\0' terminator is still being attached to strings, even though it's not really needed when string lengths are known...

If you're designing a programming language, store the string length as an integral part of the string. If you're stuck using a language which doesn't, create an encapsulation which does, and use only that. There are dozens or hundreds of different ways to do this, good for different purposes, but they all have in common attention to length. In C++, the string class doesn't have this problem, though there are other StringClassProblems.
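
A minimal sketch of such an encapsulation in C (the type and function names here are invented for illustration, not taken from any particular library):

 #include <stdlib.h>
 #include <string.h>

 /* Illustrative counted-string type: the length travels with the data. */
 typedef struct {
     size_t len;    /* number of bytes in data, always known */
     char  *data;   /* still NUL-terminated, for interop with C library calls */
 } counted_string;

 counted_string cs_from_cstr(const char *s) {
     counted_string cs;
     cs.len  = strlen(s);          /* the last O(n) length scan this string needs */
     cs.data = malloc(cs.len + 1);
     if (cs.data)
         memcpy(cs.data, s, cs.len + 1);
     else
         cs.len = 0;
     return cs;
 }

From then on, every operation that needs the length reads cs.len instead of rescanning the buffer.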

Don't even begin to think that there is one "right" way to implement strings. There are many data structures that might be reasonable depending on the goal. A string is fairly abstract: it's a sequence of characters.

Yes. And every reasonable data structure demands recording the length of the sequence of characters with the sequence.

False. Linked lists are reasonable, depending on goal, and typically do not record length. I'm just saying it depends on application.

(Whether you want the concrete storage length in bytes or the length in characters is another matter; there are cases to be made for both depending on the application.) You need it for almost every operation beyond the simplest read-and-print; it's easy to lose it and a big time-waster to recover it. Terminators are a nice backstop, but no substitute.

First think about how many different ways of implementing sequences there are (linked list versus array, for starters),

Linked lists of characters are rarely reasonable for strings given the size of characters relative to the size of pointers, and this has been true forever.

then think about what "character" means (ascii, utf-8, utf-16, for starters). Then how about which operations you want to be really fast. Doing a lot of editing? Emacs uses a buffer-gap algorithm. Etc etc.

Which keeps track of lengths. There's one rare case where the desired fast operation doesn't demand any knowledge of lengths, and that's a particularly basic form of output which is barely used any more - as mentioned, it makes for slightly smaller and simpler assembly code for that particular problem.

Even if you stick with array representation, you might have multiple kinds of strings needed: one with 64 bit length, in some applications, but then someone will point out that it's only used once in a while, and meanwhile ten million other strings have 8 bytes worth of overhead, and 80 megabytes is too much for some uses of the library. So then you begin to have mixtures of lengths. Then someone points out that there's a clever algorithm that stores the length at the end of the allocated string buffer rather than the beginning (and this is indeed pretty handy for some things).

Which all require storing lengths.

Of course, all of this is hidden behind some abstract facade. But this whole page is about concrete representation. Or perhaps that is the problem.

Yes; if it's hidden behind an abstract facade, no need to worry about it. If, however, you're constructing such a facade, you should store lengths with your strings.

BTW the major problem with C is that many of the standard library functions have badly designed behavior. If they all did proper null termination and were guaranteed not to run off the end of a buffer, then it wouldn't be such a big deal in C.

People would still hand-manipulate them as arrays - which is OK in and of itself - and then get it wrong by not keeping track of the lengths. Furthermore, people would still pass incorrect lengths into the string manipulation functions, and get, if not buffer overflows, garbage results. Or, if the string manipulation functions computed the lengths themselves as needed, they'd do a ridiculous amount of recomputation. (Which is done in some programs which don't bother to cache the lengths.)

I.e. null-terminated strings aren't the inherent evil, it was the bad practice surrounding them that turned them to the Dark Side of the Force.

No, it's null-terminated strings without lengths which are the evil, as in the title of this page: StringWithoutLength.

[Why are they evil, assuming a suitable abstraction that prevents the null-termination invariant from being violated? Because NUL might be excluded? Because strlen() might take O(n) instead of O(1)?]

Given how often you need to do it in most string applications, that could actually be a good reason.

[The whole reason the way strings in C are implemented can be considered "evil" is that C lets you clobber 'em so easily; and in many cases encourages string-clobbering.]

That's a fair statement. Part of this encouragement is because of C's dissociation of strings (and arrays) from their lengths.

[You might consider a not-uncommon data structure known as a "rope". What is it? It's essentially a doubly-linked list of string fragments - not according to this paper on ropes [http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf], though. While traversals from the beginning can take O(n) time, ropes are generally efficient: small "jumps" are done quickly, as are insertions/deletions in the middle.]

Yeah, that's a nice (and natural) data structure. Of course, within each string fragment, you almost certainly want to keep track of the length!
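
A rough sketch of that shape (the names here are invented for illustration), where each fragment in the doubly-linked list carries its own length:

 #include <stddef.h>

 /* One fragment of a rope-like string; illustrative only. */
 struct fragment {
     struct fragment *prev, *next;   /* doubly-linked list of pieces */
     size_t           len;           /* length of this piece */
     char            *data;          /* this piece's characters */
 };

 /* The whole string's length is just the sum over the fragments. */
 size_t rope_length(const struct fragment *f) {
     size_t total = 0;
     for (; f != NULL; f = f->next)
         total += f->len;
     return total;
 }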

If you used strings with lengths, and then built a library on top of them that had the same mis-design as the standard C lib string functions (e.g. ignoring the length sometimes!), you'd have just as many troubles.

But if you built a library based on strings with lengths, you wouldn't ignore the lengths, because they'd be in your face. Strings without lengths just invite trouble. In practice, all competent C programmers seem to carry length variables with their null-terminated strings whenever they're doing anything serious with them. (Arrays without lengths have some of the same problems, but strings are particularly a problem domain which demands the use of lengths constantly.)

What we mostly do with strings is byte-wise processing. How can you beat

    while (*cptr) { process(*cptr++); }

for efficiency? --mt

With the advent of Unicode and other multibyte encodings, where characters do not all have the same width, you can no longer use such simplistic loops.
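
For instance, in UTF-8 a code point occupies one to four bytes, so stepping character by character means decoding the lead byte rather than just advancing one byte at a time. A rough sketch, assuming well-formed input:

 #include <stddef.h>

 /* Bytes in the UTF-8 sequence beginning with lead byte b (sketch: treats
    invalid or continuation bytes as length 1 rather than reporting an error). */
 static int utf8_seq_len(unsigned char b) {
     if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
     if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
     if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
     if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
     return 1;
 }

 /* Count code points rather than bytes. */
 size_t utf8_char_count(const char *s) {
     size_t n = 0;
     while (*s) {
         s += utf8_seq_len((unsigned char)*s);
         n++;
     }
     return n;
 }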

Also, a lot of the string processing I see is not char-by-char, but tokenizing, splitting, and merging. Every null-terminated strcat() starts with a strlen(), unnecessary if the string knows its length.
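
As a sketch of the difference (assuming a destination buffer known to be large enough), an append on counted strings jumps straight to the known end instead of rescanning:

 #include <string.h>

 /* Append src (src_len bytes) onto dst (currently dst_len bytes) and return
    the new length. Sketch only: assumes dst can hold dst_len + src_len + 1 bytes. */
 size_t append_counted(char *dst, size_t dst_len,
                       const char *src, size_t src_len) {
     memcpy(dst + dst_len, src, src_len);   /* no strlen scan needed */
     dst_len += src_len;
     dst[dst_len] = '\0';                   /* keep a terminator for C interop */
     return dst_len;
 }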


The entire argument for strings with lengths seems to be that most applications need to calculate the length more than once. This is the weak point in your argument; if you could provide some solid, non-anecdotal evidence I might be convinced. On the contrary, I can think of many situations in which recording the length of a string is completely unnecessary. For example, imagine your strings are used only as identifiers - in other words, they are never modified, only compared. The null terminator is an example of a sentinel, and using a sentinel instead of relying on a prerecorded length is a convenient way of keeping track of where a string ends. If your strings do need to be measured often, or if you are appending or inserting often, you will need a different data structure, such as a rope [http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf].

[What sort of programming do you do that you only ever use strings as identifiers? I'd say that's a pretty niche market. Look around you - MOST applications do at least moderate string processing. Lots of applications (almost anything user-facing, for example) do LOTS of string processing. And recording length still gets you a bit of a benefit when comparing strings, assuming that you're just testing for equality rather than checking to see how different they are. Which is another mistake (imo) in the C string functions - strcmp() runs relative to the length of the longest string, rather than to the first character where the 2 strings differ. Almost every string operation becomes more performant when you know the string length up front. And that's totally aside from the issue of buffer overflows.

Look at some of the recent spoofing exploits in IE and Mozilla for other problems with terminated strings - inserting a null into a URL will cause some functions (those that expect a C string) to terminate it there, while others (those that operate on counted, usually Unicode, strings) operate on the whole string. Basically, uncounted strings have exactly one advantage, which is a small memory advantage. Counted strings have a whole slew of them - there's a reason every modern language provides counted strings, and why C programmers have been rolling their own for decades.]


Which is another mistake (imo) in the C string functions - strcmp() runs relative to the length of the longest string, rather than to the first character where the 2 strings differ.

Maybe I'm mistaken here, but I think that this is not true. For plain old C character strings it is not necessary to look at the strings after the first difference to judge which is lexicographically smaller.
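
Indeed, a textbook strcmp works roughly like the following sketch, stopping at the first position where the strings differ (or where one of them ends):

 int my_strcmp(const char *a, const char *b) {
     while (*a && *a == *b) {   /* advance only while the characters match */
         a++;
         b++;
     }
     /* The first mismatching (or terminating) pair decides the result. */
     return (unsigned char)*a - (unsigned char)*b;
 }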


Isn't this just the rationale behind Object Oriented Programming?

Yes, C is a procedural language and programmers had to remember the correct library methods and sets of methods needed to manipulate a data structure (or pseudo-structure, depending upon how one prefers to classify C-style strings). Using an Object Oriented approach removes this level of thought. By using a proper string class or template, one does not need to worry about checking string sizes and creating the appropriate buffer.

Concerning performance, however, the null-terminated string is usually the more efficient format. Copying strings is not common (remember, us C programmers like to pass those pointers around!). Comparing strings is much more common, and having to maintain and update a length count is less efficient than simply checking for a null in one string while comparing contents.

Actually, you're wrong about the efficiency of null-terminated strings; they're notoriously inefficient for every purpose except raw output (yes, if you never do anything with your string except print it, it's more efficient). The LeasedString is more efficient for nearly all purposes, including comparison for equality testing. If you have a length recorded with the strings, most string compares will give a doesn't-match result with a single assembly language test, because the lengths will be different; the algorithm for the rest of the test is slightly more efficient than the similar algorithm for null-terminated strings. Comparison for alphabetical sorting in the modern days of Unicode requires keeping track of lengths of both strings. Comparison for numeric sorting ditto. Comparison for codepoint sorting may possibly be benefitted by having a terminator, but you'll find you still need the lengths when the time comes to output your sorted list!
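
A sketch of that equality test on counted strings (the field names here are hypothetical): unequal lengths reject the pair before a single character is examined.

 #include <string.h>

 struct cstr { size_t len; const char *data; };   /* hypothetical counted string */

 int cstr_equal(struct cstr a, struct cstr b) {
     if (a.len != b.len)
         return 0;                                /* one comparison settles it */
     return memcmp(a.data, b.data, a.len) == 0;   /* only now touch the bytes */
 }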

I also quickly wish to note that if you are worrying about performance at this level, you are probably looking at the wrong area for optimization. Look at your memory allocation scheme first!

Use a proper object oriented solution and don't worry about the data format. It doesn't really matter!

Absolutely. This entire problem is one of low-level programming. If you're at a higher level of abstraction, your object oriented solution is certainly keeping track of string lengths for you. If it's not, get a new one. :-)


There are advantages and disadvantages of null-terminated strings.

For example, with null terminated strings, this C code could drop a string from the stack:

 while(*stackptr++){}
You can pass the top string of the stack to a function using:
 functionname(stackptr)
It is also easy to calculate the length of a null terminated string:
 int strlen(char* strptr) {
  char *start = strptr;
  while(*strptr) strptr++;
  return strptr - start;
 }

But, you can't have embedded nulls in strings. Sometimes you would want that, but not always.

One way I used is to store the length at the beginning, and then the characters of the string, and a terminating null at the end (the string might contain embedded nulls as well, or it might not). This way, dropping a string from the stack is a bit more complicated but not much, and it can still be passed to a C function, but you have to add the part used for the length. For example:

 functionname(stackptr+sizeof(int))
Maybe you also have 1 byte for the type of value on the stack (also in my program). So, you would do:
 functionname(stackptr+sizeof(int)+sizeof(char))
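
A sketch of that layout in C (with hypothetical names, and omitting the optional type byte): the length prefix is written first, the characters next, and a terminator last so the character part can still be handed to ordinary C library functions.

 #include <string.h>

 /* Write [int length][characters...]['\0'] at dst and return a pointer to the
    character part, which C string functions can consume directly. */
 char *push_counted_string(char *dst, const char *src, int len) {
     memcpy(dst, &len, sizeof(int));        /* the length prefix   */
     memcpy(dst + sizeof(int), src, len);   /* the characters      */
     dst[sizeof(int) + len] = '\0';         /* trailing terminator */
     return dst + sizeof(int);
 }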

And, in Forth (in the case where each cell on the stack is as wide as each character of the string):
 : STRDROP BEGIN WHILE REPEAT ;

Making strlen fast is a lot harder than you think; you can get 400%+ better performance by using unsigned ints instead of chars to implement the search for the null. In any case, the real issue with null-terminated strings is the painter problem: http://www.joelonsoftware.com/articles/fog0000000319.html Having a length attached to the string avoids the issue. Just consider a typical string manipulation problem:
 char *setFileExtensionInPath(const char *path) {
   static const char s_theFileExt[] = ".png";
   const char *oldFileExt, *temp = NULL;
   char *newString;
   size_t length;

   /* Step 1: Drop the existing extension, if any */
   oldFileExt = strrchr(path, '.');                  /* Runtime performance: O(n) */
   if (oldFileExt) {
     /* Check that it isn't a directory with a dot in the name */
     temp = strrchr(oldFileExt, '/');                /* Runtime performance: O(n) */
   }
   if (oldFileExt && !temp) {
     length = oldFileExt - path;
   } else {
     length = strlen(path);                          /* Runtime performance: O(n) */
   }

   /* Step 2: Allocate a new buffer and transfer the old string */
   newString = malloc(length + sizeof(s_theFileExt));
   if (!newString) return NULL;
   memcpy(newString, path, length);                  /* Runtime performance: O(n) */
   newString[length] = '\0';

   /* Step 3: Append extension */
   return strcat(newString, s_theFileExt);           /* Runtime performance: O(n+m) */
 }
Total performance: O(n+m) or O(4n + m + C). Using known length strings, the performance is improved:
 std::string& setFileExtensionInPath(std::string& output, std::string const& path) {
   static const char s_theFileExt[] = ".png";
   size_t oldExt, dirSep = std::string::npos;

   /* Step 1: Drop the existing extension, if any */
   oldExt = path.find_last_of('.');                   /* Runtime performance: O(n) */
   if (oldExt != std::string::npos) {
     dirSep = path.find_first_of('/', oldExt);        /* Runtime performance: O(n) */
   }
   if (oldExt == std::string::npos || dirSep != std::string::npos) {
     oldExt = path.length();                          /* Runtime performance: O(1) */
   }

   /* Step 2: Allocate a new buffer and transfer */
   output.clear();
   output.reserve(oldExt + sizeof(s_theFileExt) - 1);
   output.append(path, 0, oldExt);                    /* Runtime performance: O(n) */

   /* Step 3: Append the new extension */
   return output.append(s_theFileExt);                /* Runtime performance: O(m) */
 }
Total performance: O(max(n, m)) or O(3n + m + C). Obviously you can NitPick this example as not needing fast performance, but it doesn't take much imagination to apply this to a more complex case. You can also tweak the first example to use memcpy twice instead of strcat, which would even out the performance in this example, but the problem reappears once you have several string manipulator functions like this - you'll eventually end up using either strlen or some of the painter-problem library functions [strcpy, strcat, etc.]. Null-terminated strings are a classic case of a performance/memory trade-off: known-length strings use more memory but are faster because of it, and at the time C was designed, 640KiB of RAM usable by mere mortals had barely been dreamed of, so the trade-off made more sense. The substring-without-copying advantage is also somewhat overplayed; you can do that with length strings too - in fact it's easier in some implementations, since you don't need to insert and remove null terminators if you want to pass a pointer to 5 chars in the middle of your string. The Windows NT kernel uses a UNICODE_STRING object for all its strings that supports this behavior:
 struct UNICODE_STRING {
   unsigned short length, maximum_length;
   wchar_t *buffer;
 };
You can just point it to somewhere in the middle of your string and set length=maximum_length={substring_length}.
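
For example, carving out a substring view under this scheme is only a pointer adjustment. A sketch using the struct above (and glossing over the fact that the real Windows structure measures its lengths in bytes rather than characters):

 /* Make a view of count characters of s starting at offset start, no copying.
    Sketch: assumes start + count stays within s, and uses the struct above. */
 struct UNICODE_STRING substring_view(struct UNICODE_STRING s,
                                      unsigned short start, unsigned short count) {
     struct UNICODE_STRING sub;
     sub.buffer         = s.buffer + start;   /* point into the middle of s */
     sub.length         = count;
     sub.maximum_length = count;
     return sub;
 }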


In response to everyone saying "null-terminated is only more efficient for printing", or "the only advantage is a slight memory saving": This is extremely false. Here's an example where null-terminated is vastly, qualitatively more efficient. The example itself is contrived but if you're smart enough to grok the idea behind it, you'll eventually start wondering how anyone could ever live without it.

The example: Given a string X containing exactly one comma, calculate two strings: the part of X before the comma, and the part of X after the comma. You're allowed to alter X itself.

Null-terminated solution:

 void split_at_comma(char *input, char **output1, char **output2) {
   char *ptr;

   // Find the comma
   for (ptr = input; *ptr != ','; ptr++)
     ;

   // Change the comma to a NUL
   *ptr = '\0';

   // The original input pointer now points to the stuff before the comma
   *output1 = input;

   // As for the stuff after the comma, well, it starts after the comma!
   *output2 = ptr + 1;
 }

If your strings are required to store length, you'll need to allocate at least one whole new object (more likely two if you're working in a "developers are babies" language like Java). And then copy the buffer to the new object(s).


See also:


CategoryAntiPattern CategoryDevelopmentAntiPattern

