Data Format Versioning

This is an old, old problem with a lot of varied solutions we should outline and weigh.

Here's the problem. You build a data structure for communication between two points (either two programs in a messaging system or the same program at two points in time -- e.g. a "save" file format).

Now how do you ensure that as those programs change over time that the messages can still be read? Let's take a worst-case scenario. Say you have a binary format and that in version 1.0 your first field is 32 bits. In version 2.0 that field is now 64 bits. The problem is that version 2.0 programs will try to read not only the first field, but part of the second field...chaos ensues.

So, here are some solutions to this problem that I've seen. They should be patternized by some kind soul with more time to work on this than I -- also many more should be added.

1. Version Numbers -- include a "version number" as the first field of the structure. Programs know immediately what they're dealing with and whether or not they can handle it.

2. "Self-describing" data. If your data is "key-value" or XML then your programs can more easily tolerate certain types of changes to the data structure since position does not matter (as much...)

3. "For expansion" or "unused" fields. In positional (binary) forms you keep some space for future expansion which older programs ignore and newer programs can use.

4. Almost a version number, but not quite. Newer Windows APIs use structures like this...

 struct Foosenblat {
  DWORD cbStructSize;
  int nInteresting;
  char szInteresting[20];
 };

The cbStructSize is supposed to be filled in with sizeof(Foosenblat), thus enabling the callee to differentiate between calls using the old struct Foosenblat, from calls using the new, improved struct FoosenblatEx.

5. ASF uses something like this:

 struct object {
  INT64 size;
  GUID guid;
  BYTE data[size];
 };

The format of the data section depends on the GUID. If you change the format of the data, use a different GUID. The size is included so that clients who don't recognise the GUID can skip past the data.

Contributors: KyleBrown, RogerLipscombe (last edit May 6, 2002)

Having run into this problem (like who hasn't) myself, I found that the first version change is the most fragile. We had to determine what set of conditions was true only in the first version and parse for those conditions first, then parse for the next condition set. By version three we were much smarter about telling the application what version of the data it was dealing with, but we were forced to drag legacy parsing around for another seven versions. Even after we officially obsoleted the first two versions, our paranoia led us to keep the legacy parser as a manual option.

Subsequent products, of course, included versioning data out of the gate.