Code Normalization

There are several similarities when you compare normalization of relational schemas with refactoring of object-oriented code:

Refactoring is the moving of units of functionality from one place to another in your program (quoted from RonJeffries). Normalization is the moving of units of data from one place to another in your relational schema.
Refactoring has as a primary objective, getting each piece of functionality to exist in exactly one place in the software (again, quoted from RonJeffries) (cf. OnceAndOnlyOnce). Normalization has a primary objective, getting each piece of information to exist in exactly one place in the database, i.e. removing redundancy.
Another part of it would seem to be, "each routine does only one thing." (quoted from DaveHarris). (Is this a goal of the process or a "side effect"? See RefactoringAndRewriting for a discussion of this point.) In some higher normal form, each table contains information about only one thing.

I'm curious now - what "normal forms" can be discerned in program code? (I guess spaghetti code (i.e. just GOTOs) is the starting point, isn't it?) Which rules have to be applied to get from one normal form to the next?

These are not for me to answer, but maybe somebody with a little more experience than myself (which is everybody else) has some ideas?!

--FalkBruegmann

It looks a valid analogy to me. In both cases replication would cause problems when the code/data needs to be changed. The hard part is knowing when two things are the same. --DaveHarris and when one thing is different

See also RedundancyIsInertia.

Normalization in a database does reduce duplication (a very fine thing), but it is really focused on increasing the cohesion in entities by ensuring that they are not multiple entities munged together.

In OO, we are trying to maximize the cohesion of methods, and minimize the coupling between classes. This tends to be so even when it's not an overt goal. The rules for normalization seem to almost translate directly to OO interfaces.

1NF: attributes depend on the key
2NF: attributes depend on the whole key
3NF: attributes depend on nothing but the key

Yeah, I used the paraphrase. But anyway...

In OO, a class's statement of responsibility (a 25-word or less statement) is the key to the class. It shouldn't have many 'and's and almost no 'or's (see OneResponsibilityRule). The idea for 25 words or less comes from PeterCoad in his patterns and strategies book.

The attributes and operations of a class should depend on the responsibility, the whole responsibility, and nothing but the responsibility. If this can't be made to work, the class has lower cohesion, and might be a candidate for refactoring.

I've found that having more, smaller, simpler, more tightly-focused classes is a big win. When you have too many, you can still create facades (views in databases) to hide the complexity while maintaining a strong focus in the interface.

I think that code normalization is a very useful and powerful technique, for its being so simple to work with. --TimOttinger

Attributes depend on the key, the whole key and nothing but the key, so help me Codd. (In an American trial, a witness is "sworn in" when they say "I promise to tell the truth, the whole truth and nothing but the truth, so help me God."_ -- TomRossen

Okay, that leads me to:

1NCF: clients are provided with a service
2NCF: clients are provided with the whole service
3NCF: clients are provided with nothing but the service

These rules would apply to interfaces, I think, not to classes -- classes quite often have to provide more than one service. Higher normal forms for code might limit the types of the clients:

4NCF: clients derived from a single type are provided with nothing but the service
5NCF: clients of a single type are provided with nothing but the service

Note that there isn't much (if any) language support for this kind of partitioning so it seems kind of awkward. The Visitor pattern highlights the value of these latter forms, and the need for language support for them I think. --PhilGoodwin

Certainly it applies to interfaces, though in COM, for example, it is violated in order to avoid round-trips to QueryInterface. COM interfaces are big globs of stuff, even the officially blessed ones like OPC.

In recent (24 Aug 99) discussions, we've decided that COM interfaces are best used as Facades, anyway, rather than as proper base classes. You have special considerations, and those are probably best managed outside of the 'proper' classes in your design. -- TimOttinger

I think that it applies to classes, also. A class utilizes its inherited and aggregated features to achieve some specific goal or specific variety of a goal. I think that the OneResponsibilityRule still applies here. You should still be able to write a sentence of 25 words or less, very sparing of conjunctions, which describes the intent of the class. You should also be able to map all of the functions of the class to that statement of responsibility, and prove that all of the data of the class is needed to support its functions.

In this way, I think that there is much the same kind of strength in OO as you find in relational models. --TimOttinger

Another part of it would seem to be, "each routine does only one thing."

Ive been referring to this on Usenet and mailing lists for the past year as:

: One purpose for every method; and one method for every purpose!

As for what the normal forms correspond to for code, I think TimOttinger and PhilGoodwin have it exactly right! Mathematically if one wants to view code refactoring as applying one or more "behavior preserving transformations" (to quote RalphJohnson), then each such transformation can be represented by a matrix. Imagine a matrix or a "table" (for those who prefer relational terminology) in which every method in the system corresponds to the rows on the left side, and every class/interface in the system corresponds to a column across the top:

          | Class1 | Class2 | Class3 | ...
  --------+--------+--------+--------+-------
  method1 |        |        |        | ...
  --------+--------+--------+--------+-------
  method2 |        |        |        | ...
  --------+--------+--------+--------+-------
  method3 |        |        |        | ...
  --------+--------+--------+--------+-------
  ...     |        |        |        | ...

For a given (method,class) pair, the corresponding value in the table can indicate whether that class implements its own version of the method, whether it uses (by inheritance or delegation) some other object's implementation of the method (in which case it somehow indicates the class whose implementation it is using), or whether it is altogether incapable of understanding or handling the method/message request. You can use distinct values for each such case if you like; and depending on which programming language you use, you may or may not need to care about the specifics of the method signatures and possible overloading.

Now then, refactoring consists of rearranging this matrix in "behavior preserving ways." Let's assume that a value of "does not understand" corresponds to zero and that all other values are non-zero. Preserving the behavior can mean that, for a given class (column) all the non-zero values remain non-zero even though their values may change after having been refactored.

Doing such refactorings to eliminate duplication and ensure singleness of purpose ends up looking extremely similar to performing Gaussian reduction on such a matrix, or from treating it as a table in a RDBMS and applying the rules of data normalization to the matrix in order to minimize redundancy. One could probably write a program to facilitate this (maybe even a spreadsheet).

If we could figure out how to accommodate splitting/adding classes and methods into this scheme, I honestly think that someone so inclined could probably pin down the actual mathematical constraints for inheritance and behavior-preserving operations on such a matrix, and formally express refactorings and their results in this manner. In fact if one could take a given matrix and formally express the goal the wish to achieve for a given refactoring, that something very much like Gaussian reduction could be used to determine the precise set of transformations (refactorings) to apply to achieve the result (much the same way it is done in linear algebra to find the inverse of a matrix).

I can manage to do this in my head for extremely small sets of classes and methods (or slightly larger ones if I'm using the DemeterTools? - which basically decide at compile-time where to distribute methods among classes that aren't at the endpoints of an access/query). But for anything of respectable size it becomes too much all too quickly for my tiny brain to keep it all in my head at once. ;-) --BradAppleton

DependencyInversion seems like a good criterion for normalization of code. Most projects are split into two kinds of code: 1) main, or code immediately invoked by main, that depends upon concrete classes. 2) The rest of the system which depends only on the interfaces (abstract classes in C++, interfaces in Java, method names in Smalltalk) that those concrete classes conform to. -- RobertCecilMartin 9/7/99

Here goes the definition of normalization found in the American Heritage Dictionary with some added comments

-To make normal, especially to cause to conform to a standard or norm: normalize a patient's temperature; normalizing relations with a former enemy nation. (normalize some programmers source code to make it conform with good code standards)

-To make (a text or language) regular and consistent, especially with respect to spelling or style. (this is a very importance one, I think it really suits the term CodeNormalization)

-To remove strains and reduce coarse crystalline structures in (metal), especially by heating and cooling. (well any analogy with code reviews ?)

AlexeiGuevara, 01/22/2001

I was drawn to this topic after someone concerned with database design commented that UML class diagrams looked like their data dependency diagrams. They asked whether we apply the rules of normalization to the data in class instance diagrams. After some thought I decided that I probably had been doing that, though without consciously thinking about it.

I am starting to think that as well as in method calls, normalization of data should be part of the design process. This is obviously directly applicable in entity EJBs, where the class data design maps directly to the database design, but it seems to be a general rule.

AnswerMe I have one doubt though. Is this applying relational design in the object world, whereas we should be moving to object databases and use object design in the database world?

Alternatively is normalization equally applicable to object databases, and the difference between an object store and a relational database smaller than sometimes implied?

Somewhere in here, I can apply the principle of subsidiarity to coding. Paraphrased from it's original socio-economic context, it would read - Never do at a higher level routine that which can be done by a lower level routine.

Limiting this issue on object-oriented code seems awkward. You can write denormalized code in every language. This is called CopyAndPasteProgramming. It suffers from update anomalies. Therefore you try the put similarities in separate modules.