Locality Of Reference Documentation

I was reading TheSourceCodeIsTheDesign and LightweightDocumentation and thought it was strongly related to my LoRD principle. I wrote it up on the latter page, but decided to move it to its own Wiki page. I've written about LoRD many times on Usenet and mailing lists over the past 3 years or so, but I think this may be one of the few times I've written about it publicly in a more persistent forum.

This is my second attempt at writing this up here on the Wiki. The first attempt solicited much discussion and comments, particularly from MichaelFeathers, but also from KielHodges, BetsyHanesPerry, BillTrost, and RonJeffries. This rewrite is an attempt to incorporate that discussion, as well as a much earlier one over a year or so ago with TimOttinger on the Usenet newsgroup news:comp.lang.c++.moderated.

I am greatly indebted to all of these individuals for helping flush out and clarify important points that were lacking in the previous discussion, or else not clearly communicated. I don't expect this rendition to be perfect, but hopefully it is a vast improvement. Unfortunately, its also much longer (about 3-4 times longer), though still shorter than the original if you include all the clarifying comments that were made.

-- BradAppleton


The LoRD Principle -- Locality breeds Maintainability

initial draft: January 1997

A few years back, I coined a principle that I call LoRD: LocalityOfReferenceDocumentation. It tries to address the problem of keeping code and documentation consistent with one another and up-to-date. The principle may be stated formally as if it were a Newtonian Law of sorts:

The likelihood of keeping all or part of a software artifact consistent with any corresponding text that describes it, is inversely proportional to the square of the cognitive distance between them.

A less verbose, less pompous description would be simply: Out of sight; out of mind!

Therefore, it is desirable for us to try and minimize the cognitive distance between artifacts and their descriptions! This is not without its presumptions and caveats (which are described later).

Ostensibly, LoRD refers to software documentation (all types of documentation, not simply design), but its obviously more general than that. It's not strictly limited to the relationship between docs and code. Locality of reference (without the 'D' :-) can apply equally to the relationship of code with comments (though comments are easily viewed as a form of documentation), use-cases with requirements, requirements with design, use-cases with test-cases, test-cases with test docs, etc. (the list goes on and on). For this discussion however, I'm focusing more on the relationship of source code artifacts to any kind of documentation (formal or informal, end-user or design, etc.).

It should be noted that LoRD does not suggest anything regarding how much documentation should be created; This is still left to your discretion. LoRD suggests only that it is desirable to keep the documentation as close as possible to the code (whatever that means, we'll delve further into that a bit later).

It is also important to notice that LoRD does not refer exclusively, nor even primarily, to design documentation. It refers to all documentation, where possible, including end-user documentation (manuals or user's guides, even if they are only available on-line). Although it is often applied for design documentation, it is easier to keep such design documentation minimal, mostly within the code itself (see below for how this might be done). Many of the more noticeable applications of LoRD are concerned with documentation of things like programmatic interfaces and end-user documentation.

Cognitive Distance

The phrase "out of sight, out of mind" gives a vague indication of what is meant by "cognitive distance" (there is a similar concept in the world of user interface design named "visual distance"). It doesn't refer to a Euclidean notion of distance, but to the amount of conceptual effort required to recognize that the corresponding "item" exists and needs to be changed to stay "in sync." It also spans the effort required to navigate from one to the other in the developer's workspace. The navigation effort may seem more "manual" than conceptual, but it is significant because it relates to the interruption of "flow" of the developers' thoughts between the time they first thought of what they needed to do, and the time and effort expended before they were actually able to begin doing it.

As an example, suppose I edit a source file and I now need to update a corresponding document (or perhaps Ive edited a ".c" source file and need to update the corresponding ".h" header file). If the other file isn't in the same directory as the one I just modified, I'm less likely to remember that I need to modify it. Once I do realize this, suppose it takes 2-3 commands to leave the editor (or current editor buffer) and navigate to the file in the other directory, and then open it up for editing. The effort to enter the commands and perform the navigation may well interrupt my train of thought, and impact my productivity.

Applying LoRD in Theory - Locality Fantasies

How can we keep documentation "cognitively close" to its corresponding source code? How do I keep source declarations and interface specs "in sync" with their corresponding definitions and implementations?

Well - if you have a really spiffy development environment, capable of visually conveying impact or dependency analysis of related items and information, then it would make it very clear to the user that the docs needed to be updated after artifact was modified, or the ".h" file needed to be updated when the ".c" file was modified in a certain way.

What I'd love to see is a true visual programming environment. Something that links the physical design (of files and directories) with the logical design (e.g. methods, classes, and class categories). I envision maybe a tool like Rational Rose where I can double click on an object to see its subparts in more detail. Double click again on a method name and browse/edit the code for that method (or a data-member). And I could do the same for the method docs (or object docs) too with textual notations.

Take it a step further and I can simply "change views" from development to project management and see a diagram of tasks and people and resources and milestones. I might double-click on a completed milestone and perhaps zoom in on the version-label and all the file-revisions that make it up, and from there browse the code; or maybe use a mouse-click to navigate from a revision to the task that changes was made in, and maybe continue on to the corresponding bug-report that prompted the development task. Switch "views" again and I can go from a piece of code to its requirement or test-cases by clicking on some hyperlink. Right now we are being told that XML may well make this all a reality before very long.

At least for the "development view," Smalltalk VisualWorks comes very close to attaining this in many respects, so does Genitor, which is billed as an "Object Construction Environment" for C++, and someday Java (see http://www.genitor.com). I've also seen some amazing things programmed into Emacs to come astoundingly close to such an environment (complete with color-coded pretty-printing, automatic formatting, and syntax directed editing). Of course they still leave some things to be desired. ;-)

If you are using such a development environment, not only does it require less conscious effort to apply LoRD (because the environment supports it in many respects), but it can even require less documentation. By this, I mean not just less "formal" documentation, but less informal docs and comments. This can happen when the code itself is clearly written with useful and readable naming conventions (some of which are mentioned in SimplyUnderstoodCode, TheSourceCodeIsTheDesign, and various ExtremeProgramming practices).

Applying LoRD in Practice - Locality Realities

When lacking such a development environment, however, we typically have to rely upon the proximity of the related artifacts in the organization of the source tree or project folder. There are several approaches that are commonly used:

1. The document is in the code

This is the approach taken by javadoc, and by Perl5 with it's POD mechanism (POD stands for "plain ol documentation"). The document (be it user-docs, interface-docs, design-docs, or any combination thereof) is embedded directly in the source code in the form of structured comments. Then some extraction program or script parses the comments from the code, and assembles them into a document that is auto-generated for formal viewing/reading.

Issues of formatting, indexing, collating, presentation order, and including diagrams and pictures typically need to be addressed here. Some amount of formatting and indexing information may need to be included in the source comments (typically in some kind of markup language). Some kind of template with section ordering and table-of-contents control is often employed to generate the document with the right sections from the right files and directory in the right ordering (note that this presumes there is a known "right" ordering).

2. The document is the code

With respect to design documentation, this is the approach taken by TheSourceCodeIsTheDesign (and probably LightweightDocumentation to some extent). It is also the approach Bertrand Meyer was aiming for in his design of the Eiffel language and the ISE Eiffel programming environment, which contains pretty-printing tools, as well as "short" and 'flat' forms for viewing the code, and its comments (color-coded no less). Eiffel was intended to look like a readable but formal software specification language, so the result may often resemble a formal software specification/design document.

3. The code is in the document

This is DonKnuth's idea of LiterateProgramming. The code is richly interspersed in the document, which appears in presentation order. And a code extraction tool separates the code out into appropriate modules for building (compiling and linking). Much of the same formatting issues arise here as with case #1 above. But now the extraction tools have to process and spit-out the code segments, as well as the documentation. Again, it is presumed there is a single, known presentation order for all the embedded information. Issues of collating the document for presentation are now replaced by (or combined with) issues of collating the code for compilation.

4. The code is the document

This is the same thing as case #2 -- The document is the code.

5. User Guide as Requirements Spec

This doesn't refer explicitly to the code per se, but it still refers to related software artifacts that would otherwise need to stay in sync. If the two related artifacts are treated as a single artifact that meets both needs (not just a single artifact with both artifacts embedded, but separate, but the very same artifact comprises all parts of both and serves as the lone presentation for both), then the distance between the artifact and itself is zero; problem solved (or is it?).

This means information that some might feel inappropriate for a requirements doc needs to be in the user docs, and vice-versa. Personally, I have no problem with this, and have seen it work quite well on numerous occasions. But I recognize that other projects with other constraints (or certain contractual obligations) cant always get away with this. If you can, however, it can work quite nicely!

Leveraging Locality at Multiple Levels

The LoRD principle can be applied at several different levels of scale and abstraction throughout a project:

1. Inter-artifact: the directory-tree level

This applies for portions of documentation or code that, for various reasons, must go in separate files. Here, you can try to place the code and its "docs" in the same directory (or at least in the same subtree of the source tree). Some people put their C/C++ header files (.h files) in the same directory as their .c or .cc files for this very reason: because it makes it easier for them to remember to update one file when the other has changed. It's right there in the same directory when they list its contents or view it as a folder containing file icons.

It also applies for documentation that neither rightfully nor logically belongs in any one file; this is usually the case with things that describe higher-level concepts or strategies that encompass the interactions or relationships between multiple files or components. Ive seen project that put the design docs for a subsystem (the overview part at least) in the same subdirectory with the code for the subsystem itself. The subsystem's code is spread across multiple files, so it may not make sense to document the collaborations of the subsystem's parts across the various subcomponent source files (because the document is describing the workings of the whole - not of its parts).

In a similar manner, I've seen people put a README file in each subdirectory of the source tree, to serve as a kind of documentation for the directory itself, explaining its purpose and organizational principles/criteria.

2. Intra-artifact: the file level

This is the LiterateProgramming and JavaDoc-type stuff that Tim just mentioned. Docs that are applicable to a single artifact (a file) can be embedded in the file itself. Often tools like javadoc, or Perl5 POD introduce the need for some undesirable redundancy or additional references. This is because the particular "extraction tool" has limitations regarding whether or not, and how often, it is capable of treating certain portions of a file as both document and code.

If the tool can't recognize cases where portions of the code should be made part of the documentation, the programmer is forced to repeat certain redundant information in the docs that is readily apparent from the code (e.g. parameter names and types). This is an unfortunate drawback of such approaches when using tools with these limitations. It needs to be carefully balanced against the amount, and formality of documentation that the project requires.

Still, despite this drawback, many projects feel they have great success using these approaches. They may not necessarily use all of the facilities or "document sections" that something like javadoc or POD provides (partly in response to this aforementioned problem), but they manage to find a suitable subset that meets their needs quite nicely.

3. Micro-artifact: the code-block/statement level

This goes a step further than the previously mentioned level. If a comment corresponding to a snippet of code is not very close to the code itself, it is likely to become obsolete (which isn't very helpful, and can even be harmful). The most common occurrence of this is I've seen is function-header comments. The function (or method) header comment is a good place to say certain things about the protocol and/or logic of the function in general, but some folks often include somewhat detailed pseudo-code in the function header as well. It almost always becomes obsolete shortly thereafter because the pseudo-code is all the way at the top of the function, even when code near the end of the function changes. Whether or not the pseudo-code should even be present is a separate matter; for now, lets presume we are talking about comments that arent redundant with the code.

Comments that correspond to specific parts of the function or method (instead of the function as a whole) should immediately precede the corresponding logical block of statements. Similarly, comments at the beginning of a class, or an interface, or even a source file can suffer from the some problem. A certain amount of "high-level" or generally descriptive text that applies to the class or file as a whole is fine. But anything that could easily be separated out and placed closer to the specific declaration, definition, or code-block, should be.

Some of this is what is described in SimplyUnderstoodCode. But other aspects of it are more like the "inter-artifact" or "directory" level discussed above. We have a single file, class, or function/method that encompasses several smaller logical entities: a file or package contains classes and methods or functions; a class or interface contains definitions or declarations of several methods and data members; and a function or method contains multiple logical sections or blocks of code. So in this sense, the larger-grained "entity" can be viewed somewhat like a "directory" of the smaller-grained entities, but at the micro-level (all in the same artifact) rather than at the mini or macro-level (across multiple artifacts).

Maintaining LoRD - The Facts of Refactoring

As the project evolves, code needs to be refactored, for any of a number of reasons: Changes may be made which make it more convenient to commonize code in other files; Things that seemed to belong in the same file may later need to be split apart (sometimes files with high checkout-contention are a good candidate for this); People or roles may change during the lifetime of the project and artifacts are rearranged to make a more "navigable living space" for their maintainer; or things may need to be fused-together or split-apart simply to help decrease the time required to build a large system.

In short, refactoring happens! When it does, the corresponding sections of documentation will have to be appropriately refactored as well. Code and docs that belong together will be easier to keep together during the move if they are already close together to begin with (especially if they are within the same set of lines). But some docs that went into a single artifact will now refer to things in several artifacts, and may itself need to be split out into a separate, "overview-level" artifact of its own. Or some things in an overview-level document now will need to be moved into a single artifact (or to a new directory) because the commonality or layer of abstraction has been factored into its own logical class or subsystem of the code.

Not all these efforts are necessarily in addition to what would need to be done anyway when refactoring, but many of them are. And even though they are done in a way that may still be faithful to the LoRD principle, the extra code-doc-locality maintenance efforts can be non-trivial.

Competing Forces Combatting LoRD

Now - having said all of that, there are certain counterforces which may drive you to put more "distance" between such related artifacts.

1. Separate responsibility for maintaining docs and code

It may be that separate groups of people are responsible for the docs and the code. It may not always be prudent, for example, to have tech writers and coders checking modules in and out in the same directory at the same time. This goes double if the docs and code are in the same file!!! Not only does it create more checkout contention and/or merging, but the writers have an unprecedented opportunity to mess with the code, and the coders with the docs. (or maybe its coders and SQA, testers, configuration managers, ...).

2. Increased parallel changes to the documentation

The likelihood of parallel changes to the docs is increased because it is distributed across many files which are constantly evolving. This not only makes merges of documentation versions more likely, it may be harder to perform your tools for merging and comparing text files arent as nice as those offered by your word-processor for merging and comparing document versions. (OTOH, if your word-processor has no such merging facilities, it may be easier to merge the docs in text format.)

3. Increased Build Times

Checking out a file just to change the docs may cause a lot of unnecessary recompilations when rebuilding the system. And in file-based projects using compiled languages like C or C++, keeping .c and .h files in the same subtrees may cause very long directory search paths for include files and libraries. This may also result in a dramatic slowdown when building the system. Some may try to have it both ways by keeping headers in a single directory with (symbolic) links in the source directories, but (sym)links take up some amount of diskspace and use a slot in the filesystem registry/tables. This can be non-trivial for systems with tens of thousand of files (or more). The headers may also make the directory look more cluttered (though that might be solved with subdirectories).

4. Document Generation Complexity

If the docs are distributed throughout the source-tree, just like the source code, then your Makefiles have the added burden and complexity of having to construct the documentation as well as the programs and libraries. Extracting and collating and ordering all the sections of documentation and generating the correct documents can be just as complex as building the system, if not more so.

How do you know what order to put things in (the order may depend on several variables)? Often, the docs in a file are extracted in order of appearance and the order in which to compile documentation from the files is manually recorded or programmed (into a script, or perhaps a makefile). Sometimes sections of docs in the same file have known identifier tags associated with them (or supply their own tags) and the logic of how to extract them from one or more files in a tag-defined order is non-trivial. Often this can require some kind of precompiled database of extracted docs with known "locations" for each "tag." This doc-database can take up significant space in the source-tree (just like object files and executables do) and will require additional effort to make sure it is updated when the "docs" change, but is recompiled with minimal effort (just like we try to do when using makefiles to build the executable from the sources).

I worked on such a documentation extraction and generation tool once upon a time. It was a third-party tool we were enhancing and bundling into our suite of programming language development tools. It was pretty spiffy indeed, but believe me when I say these issues can be an extremely complex to resolve - especially when the project needs to meet documentation standards like DoD-2167a or Mil-Std-489 [sigh]. Still, I have to admit that, as painful as it was, it was far less painful then keeping such profuse, formal documentation up to date by hand [yikes].

5. Multiple Perspectives and Dimensions of Information

There is not necessarily a single best presentation order for both the code and its related documents. Documents and source-code may be more suitable for viewing for different audiences and people. Furthermore, there are multiple dimensions of information for each of these multiple audiences that needs to be coherently distributed throughout the codebase.

LoRD presumes the most convenient organization of artifacts to use is according to the usage patterns of those who spend the most time living in the codebase: the developers. So the organization of the code comes first; and the organization of the various docs has to be dispersed across it. But the most logical organization for navigating through the code wont necessarily be the best organization for navigating the portions of a particular document.

Concluding Remarks (praise the LoRD ;-)

I would have to say that minimizing the amount of documentation that must be created and maintained in the first place is probably one of the most important aspects of applying LoRD (perhaps even a precursor). This is what concepts like "self-documenting code" and TheSourceCodeIsTheDesign are trying to achieve (though admittedly they do this only for design and interface docs, and not for things like end-user documentation). Note that "minimal documentation" is not the same as "zero documentation". Here, "zero documentation" means not only no design docs, but no user-docs, nor any other kind of docs (not even a single comment anywhere in the code -- since comments are a form of documentation). If there is zero documentation, the distance between artifact and its description is infinity; But if the artifact is also the documentation, then this distance approaches zero.

However, even when you do minimize the amount of documentation you will provide, some modicum of docs must still exist. Certainly end-user documentation is always necessary, even if its all part of the online help system. And some amount of design and interface documentation will always be necessary, even if its predominantly in the form of clearly crafted, readily readable code with a light sprinkling of clarifying comments.

Its certainly desirable to keep code and docs as close as possible. But its also possible to keep them too close, or to try too hard to keep them close when faced with many of the obstacles mentioned above. The strategies for applying LoRD, and the LoRD principle in general, are not a hard and fast rule. As a general rule, the strongest mandate that we can infer from LoRD is:

Code and documentation should be kept as close as feasible, but no closer!

I would say that LoRD is a force, but not a pattern. Certainly some of the ways of applying LoRD mentioned above may be patterns, but they are not without their consequences and tradeoffs. There are other forces and tradeoffs that may work against LoRD, and which have greater or lesser importance depending upon the particular project, and its associated scale, diversity, environment, and users.

Still, its something to keep in mind, and to strive for whenever feasible.

-- BradAppleton


There is truth here! Er, I mean "I HaveThisPattern". If I gave advice (I don't, I sell it), I (who drones on at infinite length) would advise: shorten this to 1/10 this size and lose nothing. It'll be hard, but the result will be pretty fine! Keep at it, Brad! -- RonJeffries


Somewhat related to LocalityOfError?


This sounds like an idea we're playing with at TeamInaBox called SubText. Have a look at http://www.teaminabox.co.uk/downloads/subtext/ but the essence of it is to provide dynamically built documentation to developers as they work. It uses a similar concept to AspectJ to help discover what documentation is relevant to the code being worked on. It looks like you've done a lot more thinking on this :-) (we had this idea a few days ago and threw a prototype together). We would value you ideas and comments. -- ChanningWalton


I would add items 6 and 7 to your list of approaches...

6. The CustomerTests are the documentation.

7. The documentation is interwoven with the CustomerTests.

See AgileRequirementsDocumentation

-- Steve Jorgensen


[Discussion of user docs moved to AssociateUserDocsWithSource]


CategoryDocumentation


EditText of this page (last edited November 4, 2005) or FindPage with title or text search