Code Generation Isa Design Smell

"It's such a great product," says the PointyHairedBoss, "just look at all the AutomatedCodeGeneration it does for you!"

"Dude!" I say, "you just don't understand: The need to do lots of AutomatedCodeGeneration is a flaw, not a feature."

The input to the code generator is the higher abstraction that is being converted to a lower abstraction: the output code. Why is the input to the code generator not sufficient in itself? Plus, the input and the output are a form of OnceAndOnlyOnce violation, since they have to be kept in sync or duplicate the same info, just in different formats.

That's not true. Maybe we need to clarify the definition of OnceAndOnlyOnce: it applies to human input. A program can contain as much duplication as necessary, as long as all copies can be regenerated without human intervention. Such automated duplication does not violate OnceAndOnlyOnce. Consequently, code generation does not violate OnceAndOnlyOnce.

If the generated code will never be touched by humans, then why not "run" things off the original input instead of the outputted code?
- For all the same reasons that would apply to compiling any code. Performance is the number one reason. Interpreting code can be done, but the overhead (latency, throughput, framerate, CPU, memory costs, etc.) can (depending on the abstraction of the source language and the nature of the application) be prohibitively high. In addition to performance, there is the possibility of a wide variety of code analysis - often to get even more performance, via optimizations, but also for safety. Also, the relative costs of creating a code generator and creating a framework for interpretation of code aren't significantly different (notwithstanding the differences that come from having to maintain and ship the interpreter as part of the final product as opposed to keeping it and its input in-house).
- It is an ugly trade-off that kicks OnceAndOnlyOnce in the nuts, if you ask me. The same bit of info should be in only one place. It is simply cleaner factoring of info. I am not saying there is never justification for duplication, but rather that such duplication is a yellow-alert.
- Despite your apparent desire to believe otherwise, ActiveCodeGeneration doesn't hurt OnceAndOnlyOnce, which is defined in terms of programmer declarations. It's only a problem if you need to muck around with the generated code after generating it (which would violate OnceAndOnlyOnce because it would require mucking around after every time you generate it from source).
- If you want an example, LaTeX is a good one -- granted, it's not code generation per se, but it can (and does) generate HTML (latex2html), PDF (latex2pdf), PostScript and whatnot... Besides, HTML is code in a way, and PostScript is a language (thus, what is generated is code), even if not one meant to be written by humans. Say you made a typo and typed "foa" instead of "foo" - do you go edit the HTML *and* the PDF *and* the PostScript files, or do you edit the single .tex file and regenerate? Code generation here doesn't hurt OnceAndOnlyOnce; it actually helps it (you avoid having to do the same thing in HTML, PDF and/or PostScript)... That said, there are downsides to code generation - namely, you introduce dependencies and the like. For the yacc/lex example that I've seen floating around -- what if the tool is no longer available anywhere (or if it's commercial and the company goes bye-bye)? Sure, lex/yacc do have replacements (flex/bison), but nevertheless, the problem does exist for some other such packages. Another example is the Gold Parser Builder, which only works in Windows - this means that the cycle is boot(windows)->compile(grammar, target)->boot(linux/mac/whatever)->compile(target)->run. Myself, I don't like Yacc/Lex much, mostly because they introduce dependencies and increase complexity (and lengthen the cycle to build-compile[lex]-compile[yacc]-compile[gcc/whatever]-run -- makefiles help, but still); also, they're nowhere near as portable as a simple C compiler. Also, have a look at Haxe and C--, one being a language meant to be generated, another a language meant to be the TARGET of the generation... They both seem to work (though granted, I've yet to see C-- used, ever). So yeah, in my opinion there are two ends to this.
  - That's arguably a "converter", not a generator.

More than that, the primary purpose of code generation is to follow OnceAndOnlyOnce, because most modern systems/languages/libraries do not make it possible to follow it, and so force programmers to violate OnceAndOnlyOnce if code generation is not used.

I am not sure what you mean here.
- If you choose not to generate code, there are at least two alternative options. One, which you suggested above, is to "run" things off the original input instead of the outputted code, via a framework or dedicated library. This wouldn't violate OnceAndOnlyOnce. But another commonly selected option is to hand-implement the code, treating the 'input' as an external design document. This second option is favored by YagNi, performance, and simplicity: writing up a more generic framework is almost invariably more complex. But this second option also violates OnceAndOnlyOnce because the code duplicates the original input. AutomatedCodeGeneration offers the benefits of hand-coded solutions plus support for OnceAndOnlyOnce.

Legitimate Exceptions: Although code generation is often a sign of poor software engineering (human-to-machine and human-to-human interface), there are some limited reasons to use code generation:

Performance. One must hard-wire in certain information to the executable in order to get needed temporal performance, or must avoid the bloat and space cost for the extra interpreter, integration, and loaded data. (OTOH, a staged compiler (CompileTimeResolution, RealMacros) could support the desired performance and hard-wiring, so this also relates to LanguageSmell.)
Using a DesignSmell to mask a LanguageSmell:
- Too difficult to refactor the external specification into a generic framework without leaking implementation details or an obtuse syntax (MissingFeatureSmell)
- You are stuck with a language that lacks sufficient "meta ability" to do its own code-generation without 3rd party utilities (e.g. CompileTimeResolution + FirstClassTypes + PartialEvaluation; RealMacros, Fexprs)
- You are stuck with a verbose language that cannot be made very compact, and so need code generation to help generate the needed bulk.
Legal reasons. There are rarely restrictions on generated code, but there are often restrictions on integrating a framework to "run" the code.
You are not allocated time to design a decent framework to handle lots of nitty-gritty exceptions (EightyTwentyRule variations from simple abstractions) and code generation gets you a quick starting point in code.
- Even if you do have time to design and implement a decent framework, it might be better to apply that time elsewhere. If a code generator is available, YagNi that 'decent framework'. This sort of reasoning tends to apply towards use of Lex and Yacc, among other things. And history proves that creating a 'decent framework' for parsing XML and such is often a pretty significant maintenance undertaking... one that you might not wish to bear.
You desire strong compile-time type checking which data-driven or meta techniques cannot readily provide. (But less applicable to languages with DynamicTyping.)
You need too much custom tweaking that a direct framework alone cannot provide. In other words, the "input" to the code generator is too high-level or simple an abstraction for our needs. Code generation allows us to alter or reduce the abstraction to a form that the original input may not provide. For example, we need similar code that is say 80% similar, but the similar parts are not consistent enough between each variation to justify further factoring. Thus, we generate the code, in which case the generated segments are initially 100% similar (excluding parameters, etc.), but we customize it such that it's 80% similar when done.

Subsumed Reasons:

Using a DesignSmell to serialize object graphs between languages.
- Assume an external specification describes the serialization. From this specification, you could perform CodeGeneration or you could write a library that can 'interpret' the specification. The latter is more flexible and inherently has a smaller tool-set, maintenance, and ConfigurationManagement profile - and is thus more desirable (if feasible). Reasons this may be infeasible are already listed above: performance issues, MissingFeatureSmell.

Code generation smells because ...

"Anything you can do by generating code, I can do by calling data driven subroutines."
- That's not true at all. Code generation, for example, can generate an object model from a database schema, which can be compiled and used; data-driven subroutines can't give me strongly typed objects.
  - Some consider "mirroring the DB schema via objects" to be another kind of smell. See OopNotForDomainModeling. Thus, you may be justifying one smell in order to facilitate another smell.
- It is true that with multi-language or language-neutral systems, static typing is tougher. That just may be the cost of sharing tools. However, I would like to see a specific example before I summarily agree with you.
The most substantial cost of any project (from inception to final retirement of the system) is maintenance. Now why on earth would any rational person want to generate large quantities of largely similar code that would have to be maintained for the life of the system?
- One doesn't. One generates large quantities of similar code that aren't touched by maintainers. Compilers do it all the time. However, one would need to ensure the tool to generate the code remains available throughout the life of the program.
  - Compilers are to make machines happy (fast), not human maintainers. Thus, it's a different goal. But anyhow, how does one know the code will not need to be touched? Everything is subject to potential change, in my observation. I can guess where it's more likely to happen, but rarely can I say with 99%+ assurance that any given line of code will never need to be touched. Change requests can be very unpredictable. "You want it to do what????".
  - If you need to change the code produced by the compiler, you instead (a) change the source-code and re-compile, or (b) modify the compiler source-code/options/heuristics, then re-compile, or (c) both of the above. Since you have options other than modifying the code produced by the compiler, you can avoid ever touching the large quantities of similar code.
  - I'm not sure how this relates to the original point. It was originally an analogy, not a direct problem.
  - Clearly, you didn't grok the original point, which is that generating large quantities of largely similar code DOES NOT imply the need for maintainers to touch the generated code. You asked how one "knows" one doesn't need to touch the generated code, and I answered.

Who says the code is that similar? Generated code shouldn't be touched by hand, ever. If you need to specialize generated classes, inherit from them. This separates hand written code from generated code.

: I wrote a system to generate code from a schema (back before there were tools to do this). Methods on generated classes always returned references to the manually edited subclass. The generator would generate "stub" subclasses if there was no manually edited subclass. We used separate source directories for manual and generated code. At any time, you could blow away the generated code and regenerate it - which is what the ant full build always did. It did not need to be maintained and was not kept in source control. -- PaulMurray
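The arrangement described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original system: the SCHEMA and MANUAL dictionaries and the function names generate_base, generate_stub, and build are all hypothetical, and exec on in-memory strings stands in for writing generated files to a separate directory and compiling them.

```python
# Hypothetical schema: table name -> list of column names (an assumption for illustration).
SCHEMA = {"Customer": ["name", "email"]}

def generate_base(table, columns):
    """Return source for a generated base class; this text is never edited by hand."""
    params = ", ".join(columns)
    lines = [f"class {table}Base:"]
    lines.append(f"    def __init__(self, {params}):")
    for col in columns:
        lines.append(f"        self.{col} = {col}")
    return "\n".join(lines) + "\n"

def generate_stub(table):
    """Return source for an empty subclass, used only when no hand-written one exists."""
    return f"class {table}({table}Base):\n    pass\n"

def build(schema, manual_sources):
    """Regenerate everything from scratch, then layer hand-written subclasses on top."""
    namespace = {}
    for table, columns in schema.items():
        exec(generate_base(table, columns), namespace)
        # Prefer the manually maintained subclass; fall back to a generated stub.
        exec(manual_sources.get(table, generate_stub(table)), namespace)
    return namespace

# Hand-written code lives apart from generated code and survives regeneration.
MANUAL = {
    "Customer": (
        "class Customer(CustomerBase):\n"
        "    def display(self):\n"
        "        return f'{self.name} <{self.email}>'\n"
    )
}

ns = build(SCHEMA, MANUAL)
customer = ns["Customer"]("Ada", "ada@example.org")
print(customer.display())
```

Calling build again after a schema change discards and recreates every base class, while the hand-written subclass text is left untouched; that is the property that lets the generated part stay out of source control.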

Generating significant blocks of code to be customized for each use is BAD.
AutomatedCodeGeneration is always based on the generating engine customizing templates with parameters. In other words, it's a fancy form of CopyAndPasteProgramming.

Not if those templates are highly customizable via the parameters. Code generation is mandatory in a language without macros.

Code generation is good because ...

Generating an empty skeleton class with empty "place holder" methods that you fill in with your application code can be a good thing.
And AutomatedCodeGeneration can be an effective PerformanceOptimization?. That's why we compile programs, generating machine code.
Takes advantage of existing compilers/interpreters, including any optimization or portability features they offer.
Promotes standardization.
Eliminates/reduces human error.
"Anything you can do by calling data driven subroutines, I can do by generating code." [disputed below]

Is this code generation like Lex/Yacc/JavaCC etc? Or like MFC, UML frameworks? For the former, I would not like to call a subroutine to process a grammar every time a program is invoked. -- RobertField

So is using the C++ or Lisp language. You are copying and pasting the code into your computer. It is not your code. Using C++ or Lisp to make programs, is using code that is automated for you. It's not yours. This analogy is not accurate. Using C++ or Lisp, you use and invoke C++/Lisp code, not the generated machine code. When using code generation, like some ORMs, you use the generated code directly, and that in my opinion is where the smell is coming from.

I normally think that CodeGeneration is a sign of an insufficiently expressive language. However, I find myself on a project that has reams of hand-written identical code. Right now, ActiveCodeGeneration is smelling like lilacs. -- AndrewMccormick

Exactly, it's something we have to do to get around the fact that we don't have runtime macros a la Lisp.

If the above statement, "The need to do lots of AutomatedCodeGeneration is a flaw, not a feature," were true, we'd all be writing our own operating systems (think snob or irrational) instead of stealing other people's TEMPLATE. Linux, Windows, X, Unix are all templates that you took, or bought, or got somehow. It was automated for you. The template is the operating system. As for all who use Lisp or C++: those were all created by someone else. They are automating your work. You were born because of automation. Everything is based on automation.

No automation? Why not just stop using the Internet altogether, and send people typewritten notes in the mail? Email is automation. It automates delivery. You do not have to code every time you send email. That is automated for you. Automation is good. Automation can be bad, because of poor design or a general lack of need for automation - but saying that automation is bad in all instances (a generalization) is unfounded. If you use your typewriter, you are still using automation. If you use a pen, you are still using automation.

Yes, there are programs that generate code that are not good. But since everything is based on automation (even text editors and your keyboard hardware), arguing about anything to do with automation, and calling everything to do with automation bad, is unfounded.

When automation is bad: For example, in order to post on this message board, it would take less time to just type out some plain text, hit save, than to get a program that automated the task for you - since your brain is doing the automated work of thinking up a post. But since you are still using a keyboard, to post to the message board you are still using automation.

Even the command line is a form of automation. You do not go inside your computer and directly send electricity into it with your hands from your brain. It is automated through several devices - keyboard, processor, etc. (that you did not build yourself). People who bash automation need to figure out that everything, including life, is based on automation. You wouldn't be using an oven, would you? How about a fire? They are all forms of automation. You don't start a fire without a spark or a match. The spark and the match are a form of automation, since you cannot start a fire with your bare human hands.

The future involves automation. Otherwise, you might as well live in a cave, and then start right over with the rocks, fire, tools, and hunting automation. Oh, but you'd still be using automation. So, your only solution is to do nothing.

Great job tearing into that strawman. And just think: I was in danger of not realizing that an argument against code generation is an argument against all automation, ever. Thanks for clearing that up.

{Perhaps it's a matter of what is being automated. Code generation may simply be an unnecessary step. In other words: yes, it's automation, but maybe automation of an unnecessary step.}

[Comments from WhiteBoxFramework, which originally appeared on RefactoringWithaFramework.]

By "WhiteBoxFramework," do you mean that application programmers can see the framework code, but are not allowed to change it? (...or are you talking about sample or wizard-generated code?) -- JeffGrigg

[...] I definitely do not mean generated code! -- MartinLippert

Any system based on code generation is not a framework. A framework allows one to write one's own code using well-defined interfaces in any manner one wants. Required code generation is a smell. It means they couldn't create something good, so they worked around it.

Yes, I've often said that code generation is used by people who don't know how to call subroutines. I've met many managers and programmers who think code generation is a good thing; the more the merrier. I say that maintenance is the largest long-term cost of any project, making code generation generally a bad thing. -- jtg

MicroSoft's IDE for MFC and ATL "generates code" for you. And most CORBA ORB implementations "generate code" from IDL. Yet I would still consider these frameworks.

But, in these cases, the good sense of "generate code" is that they create a minimal "shell" or "skeleton" within which you put your code. Or, they generate standard proxy code, as an implementation technique, and you'd never change that code; if it needed change, you'd just delete and regenerate it.

-- JeffGrigg

It all depends on what type of code generation is being performed. With ActiveCodeGeneration, this is definitely false, but with PassiveCodeGeneration this is probably true. -- MikeRettig

I strongly agree that the generation of lots of boilerplate code that then needs to be customized is bad, especially when such code needs to be manually re-customized as changes are made to whatever was used as the basis for the original generation. But automated generation of an intermediate form which is then directly processed into the final form is often helpful, especially for tools such as yacc, lex, and RPC/IDL compilers. I'm not sure the terms "active" and "passive" are very descriptive of the distinction; maybe BoilerplateCodeGeneration? (or WizardCodeGeneration?) and IntermediateCodeGeneration? would be better? -- KrisJohnson

I stuck with the only definitions that I've seen in a published work. I stole ActiveCodeGeneration and PassiveCodeGeneration from ThePragmaticProgrammer. I don't think the terms are succinct, but I can't think of anything better. Although, BoilerplateCodeGeneration? (or WizardCodeGeneration?) and IntermediateCodeGeneration? provide specific examples, I don't think the phrases provide a generalized distinction. -- MikeRettig

If you delete all the generated code and regenerate it all from sources every time, then CodeGeneration is just a performance optimization or implementation technique (possibly for portability). In that case, CodeGeneration is not a design smell. (So, the C code generated by lex and yacc is not a design smell.) -- JeffGrigg

Unavoidable code generation is a language smell. One is forced to generate code when the abstraction mechanism of the language (if any) isn't powerful enough to remove duplication. (Compilers do this for assembly language. The calling sequence conventions and idiomatic instruction usage are captured in the compiler's code generator.)

Unavoidable code generation is a system smell too. One is forced to generate configuration scripts when lots of packages need the same information but can't read the same files. (This smell is so pervasive that we just assume computers are supposed to work this way. Can you imagine a world where you just refactor the package to read one file? I can, but not in my lifetime.)

Comments wanted: I have a representation of the project's database schema. Using a home-grown tool, from this representation I generate

Oracle scripts for creating/dropping/backing-up the database tables
VBA scripts for creating equivalent tables in Microsoft Access
ATL-based C++ classes for reading and updating the tables (including COM collection interfaces)
HTML documentation for the tables

Whenever we need to make changes to the schema, I update the source and then re-run the tool to crank out all the new stuff, and then build the software and run our tests.

Now, is this a good thing or a bad thing? (I think it's a good thing, but I'd be glad to hear better ways of addressing this problem.)

-- KrisJohnson
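A rough sketch of the kind of tool described above, assuming a much simpler setup than the real one: the SCHEMA structure and the functions to_ddl and to_html are hypothetical, written in Python for illustration, and only two of the four outputs (a DDL script and HTML documentation) are shown. The point is that a schema change is made once, in SCHEMA, and every output is regenerated from it.

```python
from html import escape

# Hypothetical schema representation: (table, [(column, sql_type, description), ...]).
SCHEMA = [
    ("customer", [
        ("id", "NUMBER(10)", "Primary key"),
        ("name", "VARCHAR2(100)", "Display name"),
    ]),
]

def to_ddl(schema):
    """Emit CREATE TABLE statements, one per table."""
    statements = []
    for table, columns in schema:
        cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type, _ in columns)
        statements.append(f"CREATE TABLE {table} (\n{cols}\n);")
    return "\n\n".join(statements)

def to_html(schema):
    """Emit a plain HTML table documenting each column."""
    parts = []
    for table, columns in schema:
        rows = "".join(
            f"<tr><td>{escape(name)}</td><td>{escape(sql_type)}</td>"
            f"<td>{escape(desc)}</td></tr>"
            for name, sql_type, desc in columns
        )
        parts.append(f"<h2>{escape(table)}</h2><table>{rows}</table>")
    return "\n".join(parts)

print(to_ddl(SCHEMA))
print(to_html(SCHEMA))
```

Because both outputs are pure functions of the one schema, neither needs to be kept in source control or edited by hand; whether the script output is preferable to running directly against the database is the question taken up in the replies.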

Oracle script sample: http://www.uaex.edu/srea/autoora.sql
More Oracle Scripts: http://www.orafaq.com/scripts/
Vba script: http://www.programmingmsaccess.com/Samples/VBAProcs/Related%20to%20SQL%20Server/VBAProcToEnumerateDatabasesOnASQLDMOSQLServer.htm
ATL-based C++ Class sample : http://www.codeguru.com/atl/PropBrowser.shtml

Are these examples of what you are talking about?

I wouldn't worry about it too much, if it works for you.

But...

Why are Oracle and VBA DDL scripts needed? If you have the data needed to generate the script, why not just run directly against the database instead of generating a script that will do the work later? Is this a ProcessSmell??
The need to generate classes indicates a weakness in the implementation language and/or library. Or it may be a PerformanceOptimization?.
HTML documentation for tables could be active content, generated by ASP or JSP pages.

Good ideas. But it's easier to generate scripts than it would be to write programs that directly manipulate the database. It's also easy for Oracle DBAs to examine and use the scripts, rather than learning how to use the tool (and deciding whether to trust it). And it's easier to create and deploy a static HTML page than it is to install ASP/JSP/CGI. The class-generation thing is a sore point with me - I think it is unnecessary and even detrimental, but others in the company want it. -- KrisJohnson

One could argue that today's DBA departments are a ProceduralSmell? - an indication of a dysfunctional organization. And why does an organization make it so difficult to deploy a web application? -- DevilsAdvocate

And there you have human attention being split between two separate points in the process: the programmers are looking at the Oracle code generation, but the DBAs are reading the generated scripts. So if the DBAs are in a hurry, and they find an error with the script, are they more likely to a) walk over to the programmers and tell them, and wait for them to fix it, or b) fix the Oracle script by hand, forget to tell the programmers, and then trip over the same problem a week later? Better, perhaps, to focus the thinking in one place, and since code is more versatile than Oracle scripting, well ... (obviously, I'm not a DBA.)

And HTML isn't so relevant to this discussion; it's not really "code" in the sense of "code generation". It's more properly viewed as formatted output, like CSV, PDF, or XML. (See also JustAnHtmlCoder.)

Oracle PL/SQL is not a very powerful language, it seems. It does not have a lot of meta ability, partly because Oracle often seems more concerned about machine performance than programmer or DBA productivity, so that they can win the benchmark contests. Then other vendors are forced to copy their practices in order to keep up on speed. I would like to see a vendor push a "dynamic RDBMS" as a productivity and adaptivity tool. It could have dynamic columns, as described in MultiParadigmDatabase, and a dynamic DB scripting language as its native language for stored procedures, etc. Code generation would rarely be needed. It may not do very well on performance, but it may change the way people think about RDBMS, and some niches may benefit greatly from its dynamism. The current "static" RDBMS are like only seeing Java and Eiffel, but never knowing anything like SmallTalk or Python existed.

I'll have to take issue with "Anything you can do by calling data driven subroutines, I can do by generating code." from the lists above.

CodeGeneration fixes the data into executable code at CodeGeneration time, while data driven code can change its behavior at run time, with just a change of the input data. Yes, you can do CodeGeneration of anything that can be done in a data driven way. But many languages' compile and link requirements, and many organizations' release control requirements will prevent you from making this approach as flexible as a data driven approach.

(Caveats: It's not unheard of to compile code at runtime. Java virtual machines have popularized this concept. And while the data that drives a data driven application is not subject to the same release controls as the source code, a good argument can be made that it should be.)

The point of that comment was that the "data-driven subroutines" and "generated code" are not really very different from one another in terms of capabilities, and there is no reason to say that one is always preferable over the other. Obviously, one cannot hard-code data that won't be known until run-time, but when the data is known at compile-time, then hard-coded solutions may be better. I agree that data-driven programs are usually more flexible than hard-coded programs, but there is often a cost in terms of performance, development time, and other factors, and those other factors sometimes make code the better choice. Use of generated code often passes the DoTheSimplestThingThatCouldPossiblyWork test, whereas data-driven programs are often unnecessarily generic and more difficult to test thoroughly. -- KrisJohnson
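To make the comparison concrete, here is a small sketch in Python of the same rule set handled both ways. Everything named here (SPEC, check_data_driven, generate_checker_source, check_generated) is hypothetical and chosen only for illustration. The data-driven version reads the spec on every call; the generated version fixes the spec into source text at generation time, so a spec change requires regenerating.

```python
# Hypothetical validation spec: field -> (minimum, maximum), both inclusive.
SPEC = {"age": (0, 150), "quantity": (1, 999)}

def check_data_driven(spec, record):
    """Interpret the spec on every call; changing the spec changes behavior at run time."""
    return all(lo <= record[field] <= hi for field, (lo, hi) in spec.items())

def generate_checker_source(spec):
    """Fix the spec into source code; the bounds become literals in the output."""
    lines = ["def check_generated(record):"]
    for field, (lo, hi) in spec.items():
        lines.append(f"    if not ({lo!r} <= record[{field!r}] <= {hi!r}):")
        lines.append("        return False")
    lines.append("    return True")
    return "\n".join(lines) + "\n"

# Compile the generated source; in a compiled language this would be a build step.
source = generate_checker_source(SPEC)
namespace = {}
exec(source, namespace)
check_generated = namespace["check_generated"]

record = {"age": 42, "quantity": 3}
print(check_data_driven(SPEC, record), check_generated(record))
```

Both checkers accept and reject the same records, which is the sense in which the two approaches are equivalent in capability; they differ in when the spec is bound and in what has to be rebuilt and released when it changes.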

Data driven isn't nearly as flexible as code generation, at least in a compiled language. Code generation can give you objects, data driven can't.

It all depends on what sort of code is being generated.

On the one hand, you may have a code generator to produce skeletons. Perhaps your team has a standard for writing classes (maybe you must explicitly create constructors and destructors), and your generator produces empty classes in normal form. In this case, you don't re-generate the class; you generate it once, and work on it manually. This doesn't add to maintenance costs.

This could be viewed as not really code generation at all, but rather as just an unusually large editor template.

On the other hand, you may have a code generator that takes in source from another language and produces what you need, without requiring you to touch the converted code. Consider LexAndYacc. These two tools take in their own special language, produce correct and inscrutable C code, and make parsing simpler than straight C. This works well, but you have to be careful in your configuration management. In this case, the input code to the Lex and Yacc programs are source code, but the C source that they produce should not be stored as source; it is an intermediate, like an object file. Because the Lex and Yacc code is so robust, it doesn't matter that you can't hand-hack it; if there's a problem, you fix it at the Lex or Yacc input level, where it belongs.

On the gripping hand, there are two particularly nasty types of generated code. The first kind is generated by a program that takes interactive (point and grunt) input, rather than an input file in a higher language (like Lex). The second is generated code that is almost, but not quite, what you want. The former often leaves you with nasty code that you can't wrap your head around, and thus cannot safely change. The latter often requires you to regenerate, then hand-hack, every time you want to make a change. For maintenance reasons, consider avoiding either of these types of generated code.

-- RobMandeville

Generating code that is almost what you want may not be quite so bad if you can bridge the gap to what you need by inheriting from the generated classes. This only works well when you can regenerate without needing to touch your inherited classes, of course. -- FalkBruegmann

The LexAndYacc example should be on your Gripping Hand. As suggested above, it is evidence of a Language Smell. C is inadequate for parsing, and LexAndYacc generate code for it instead of providing a library for C functions to call. Why do LexAndYacc generate code instead of compiling parsers into a library that you can link directly with? Because they aren't expressive enough to allow you to define a competent interface - since it's written in C. It's an Ouroboros of language smelliness.

["compiling parsers into a library you can link directly with" is in fact almost exactly what LexAndYacc do. Producing source rather than a binary is a feature - they compile to C code instead of binary, that's all. You don't run Lex once and then hand-tweak and maintain the code (which is when CodeGeneration smells), you maintain the grammar as the source to the output.]

It also depends on the nature of the code generation process.

If you treat the code generator, and the files that it reads, as part of your source code then you'll soon know if it's good or not. You DoTheSimplestThingThatCouldPossiblyWork, and then apply OnceAndOnlyOnce. If you've refactored the source code, and the code generator still exists, then there's no problem with it. I've been working on RefactoringCodeGenerators? recently, so I'll create a page to summarize my results. -- DaveWhipp

[See ReflectionVsCodeGenerationArticle]

In part, the article provides strong support for the idea that CodeGeneration is a PerformanceOptimization?: about six times faster, in his example.

I would say CodeGeneration can be a PerformanceOptimization? or CodeGeneration can often facilitate optimizations. The problem I solve in the article has nothing to do with performance. The statistics were one of the last additions to the article. -- MikeRettig

One could do type checking and better error handling on the runtime reflection side. System stack traces normally don't tell you much about the data you were working on at the time.

In the example stack trace, it appears that 'SimpleFileLoadManager?$1.load' is failing while reading an integer from the stream. One assumes (from the following stack trace) that it was trying to load an array of integer LineItems? to build an instance of the PurchaseOrderLoader? class. Instead of letting the 'java.lang.Integer.parseInt' NumberFormatException? bubble up, the SimpleFileLoadManager? object could catch it and reformat the exception into an application defined exception that includes the class and property it was working on at the time. The more informative application defined exception could be logged or displayed, as an aid to tracking down the error.

The point is that you can't rely on stack traces to provide all debugging information you could possibly want. Like, when I receive a database error while loading records from a file, I want to know more than where in the program the error occurred: I want to know where in the input file the bad data occurs. -- JeffGrigg
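The suggestion above, catching the low-level parse error and re-raising it with input context, might look something like this. It is a sketch in Python rather than the article's Java: ValueError plays the role of NumberFormatException, and LoadError and load_line_items are hypothetical names, not anything from the article.

```python
class LoadError(Exception):
    """Hypothetical application-defined exception carrying input context."""

def load_line_items(lines, source_name="<input>"):
    """Parse one integer per line, reporting where in the input bad data occurs."""
    items = []
    for line_number, text in enumerate(lines, start=1):
        try:
            items.append(int(text.strip()))
        except ValueError as exc:
            # Re-raise with the property and input position; keep the original as the cause.
            raise LoadError(
                f"{source_name}, line {line_number}: "
                f"cannot read LineItems value {text.strip()!r} as an integer"
            ) from exc
    return items

print(load_line_items(["10", "20", "30"]))
```

Chaining with "from exc" keeps the original stack trace available, so the reader gets both where in the program and where in the input the failure happened.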

Very true. Increased error handling could be added to either the GeneratedCode? or the RuntimeReflection? solution. IME, functionality is easier and safer to add in a CodeGenerator. My point in the article is that the GeneratedCode? gives you an informative StackTrace? for free. Sure you can use reflection to trap and handle errors, but this raises the complexity of the solution. -- MikeRettig

See RuntimeReflectionIsaDesignSmell.

I once worked on a project where all code was created using CodeGeneration. It was a complete mess - jumps everywhere, different calling conventions for different functions and all sorts of crazy things. I told the team they were crazy - how could anyone work with this? They replied that I shouldn't be programming in assembly and I should let the compiler take care of it.

CodeGeneration as "just" a shifting of the boundaries between SourceCode and MachineCode?

That anecdote demonstrates a common failure in code generators: for some reason we seem to throw the basic principles of software design out of the window when doing code generation. Projects based on code generation are not spaghetti of necessity. ExtremeProgramming appears to be an ideal methodology for code generation because it focuses on an OnsiteCustomer. The customer of the code generator is the project itself, so it is necessary to work with a customer who is intimately coupled with the engineering tasks. -- DaveWhipp

I think the anecdote is intended to show that compilation is code generation. Yes, it can produce messy (assembly) code, but it does raise the level of abstraction and can make the job easier and quicker. Remember, the compiler is just another program!

I agree with you regarding the intent of the anecdote's author. I have been known to make that argument frequently myself. More recently, however, I have discovered the benefits of applying a refactoring strategy to code generation. If the generated code is a mess, then the code-generator will be hard to maintain. There are some principles you can safely abandon in the generated code, but you must be careful (redundancy is OK; but spaghetti is generally not). It is too easy to generate code that is more complex than required, and then justify the complexity on the basis that "it's just like assembler". Unless you have a debugger that allows you to debug at the level of your source code (i.e. not the generated code), then you should err on the side of readability. -- DaveWhipp

Yes. In my view, CodeGeneration is a useful technique if we remember that whatever we generate code from has become source code. From language users we have become language maintainers. That's a more difficult job.

See HowToDoCodeGenerationWell for continuation of this discussion.

CodeGeneration may be TooDeepIntoTheBagOfTricks - and will earn you a PropellerBeanie.

What about generating code that is hard to write, as with the ANTLR parser generator? It would seem to make the case that writing a grammar and generating the code to parse it is much preferable to writing the grammar AND the code to parse it. -- StevenNewton

yacc, bison and other compiler-compilers could, instead of generating code, generate and interpret their state-transition tables at run time, and call application processing routines using function pointers that you provide. But in a number of applications where these tools are used, performance is fairly important, so doing code generation as a performance optimization makes a lot of sense. -- JeffGrigg
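
A toy sketch of the interpreted alternative mentioned above: a transition table walked at run time, with the application routine passed in as a callback. The table, the classify helper and the "digit"/"other" classes are made up for illustration and are far simpler than anything yacc or bison produce.

```python
# Hypothetical transition table for recognising unsigned integers:
# state -> {character class -> next state}
TABLE = {
    "start": {"digit": "number"},
    "number": {"digit": "number"},
}
ACCEPTING = {"number"}


def classify(ch):
    """Map a character to the character class used as a table column."""
    return "digit" if ch.isdigit() else "other"


def run(table, accepting, text, on_accept):
    """Interpret the table at run time; call on_accept(text) on success."""
    state = "start"
    for ch in text:
        # Two dictionary lookups per input character: this is the
        # interpretive overhead that generated code would avoid.
        state = table.get(state, {}).get(classify(ch))
        if state is None:
            return False
    if state in accepting:
        on_accept(text)
        return True
    return False
```

Whether the per-character lookups matter depends on the application, which is the trade-off being described.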

Code generation is an effective bridge between visual design tools & algorithms written in programming language. These are different levels of the software process, and code generation is the hand-over of skeleton structure so the code functionality can be attached. -- ThomasWhitmore

Why couldn't the graphical models be executable? Or, could the code be attached as attributes to the graphical model? -- DevilsAdvocate

My graphical models are executable. The mechanism that I use to execute them is CodeGeneration :-). -- DaveWhipp

My project is using Rational Rose to design a framework and generate C++ skeletons. Once our designs settle down, we add functionality in protected regions of the code. We can do round trip (both forward- and reverse-) engineering; if we change the model/design, we regenerate the code, but Rose preserves the stuff we added. This works just fine for our needs; it just takes a little getting used to. I don't see how anyone can make a blanket statement like "code generation is a design smell"...

Code generation is wonderful within (at least) the following framework:

We design a new language more appropriate to a particular problem than the target language of the code generator. The input to the code generator is now the source code. The output of the code generator is merely a convenient intermediate form, and despite being humanly readable it is not source code and should never be hand-modified.

The generated output is chosen to be, for example, C-language code primarily to leverage the existing optimizer and machine-code emitter of our C-language compiler. Again, that generated C-language code should not be considered source code.

(Hint: we never check generated code into our revision tracking system.)

Yes, to the author above, rather than generate code in another language we can usually merely interpret the source language. We could also run the rest of our system in an interpreter, but we usually don't. The reason why intermediate language code generation is often preferable for our special new language is the same reason why machine code generation is preferable for our general purpose language. Efficiency.

There's no reason to be frightened of code generation, and it is emphatically not a smell. There are, however, stupid uses of code generation, just as there are stupid uses of an "if" statement. Nobody calls conditionals a smell.

I'd like to put some of the above in list form and add to it a bit. Proper CodeGeneration requires that:

generated code is never ever checked into the repository (let alone modified by hand)
the code generation step is fully integrated with the build step
there are significant performance or simplicity gains from generation as opposed to interpretation

Anything that should be added?

The system, comprising code-generator(s), its meta-data and all non-generated code, is in ExtremeNormalForm

I don't think listing attributes of 'Proper CodeGeneration' can ever be right, because it's trying to either bundle at least three things into one, or to argue that only one of the things can be right. Over on AutomatedCodeGeneration I proposed that there were three kinds of code generation - one shot, round trip, and 'compiled'. I totally agree with your assertions when applied to what I called 'compiled' (aka ActiveCodeGeneration here). I would add, as others have pointed out, that best practice in this case may also be to report error positions with respect to the source, not the generated code.

However JeffGrigg argues above that (one-shot) generation of minimal 'shells' of code is also a good thing in some circumstances. And different rules must apply in this case - the developer edits this shell so it must be checked in and the code generation step happens way before the build step. This is clearly a different beast, and different criteria for using it must apply. I would accept the 'cut and paste' criticism for this kind of code generation, if it generated lots of repeated code. -- BrianEwins

In my opinion, code generation or not isn't the design smell issue. Whether you can or can't repeat the generation step, and how much it costs you if you need to repeat it, and how likely you'll be needing to repeat the code generation step - those are the design smell issues I'd worry about.

Pardon me if I state the obvious. -- AndreasKrueger

(PageAnchor: attribute_repeat)

Most generated code looks to me like a "database dump". Why not leave it in the database? It is easier to manage a bunch of parameters as a database than as code, at least for me. I can use database browsers and query languages to customize my view of that info. Would you rather edit a spreadsheet as a linear text dump? I surely wouldn't.

Performance is the only reason I can think of, but I have seen some pretty fast desk-top database engines, although they went belly-up IIRC. Most current DB engines seem optimized for million-record tables instead of the hundreds range, which is understandable because that is where the money is. Most current tools seem to assume that one is using code as a lite-duty database instead of where it should be. I think the pendulum will swing back the other way one of these days such that it will be "put as much as possible in the database" instead of "put as much as possible in code". Sure, some people may not like that, but I do. It fits my thinking patterns better. No trend/fad favors everybody. (DataStructureCentricViewDiscussion, DataDictionary)

Codifying such also is a violation of OnceAndOnlyOnce it appears to me. Example:

 As a table:

 Field    Type    Size  onValidate
 -----    ----    ----  ----------
 Name     String  30    nameValid()
 Rank     String  10    rankValid()
 Serial   Number  10    foo()
 Location Number  5     bar()

 As Code:

 unit Name {
 Type = String
 Size = 30
 onValidate = nameValid()
 }
 unit Rank {
 Type = String
 Size = 10
 onValidate = rankValid()
 }
 unit Serial{
 Type = Number
 Size = 10
 onValidate = foo()
 }
 unit Location {
 Type = Number
 Size = 5
 onValidate = bar()
 }

(This example is subject to frequent TabMunging. In fact the whole goddam page is. Tabs be damned!)

The tabled representation does not repeat the attribute names over and over again (a OnceAndOnlyOnce sin). For example, "Size" appears only in one place in the table representation, while it appears 4 times in the coded version. Plus, I find it much easier to see row-wise and column-wise patterns in the tabled version.

The reason I keep mentioning "to me" is because some have admitted to preferring the code version for some odd reason. It may be subjective, I don't know. Tablizing it just works for me.
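
A small sketch of what "leave it as a table" can look like in practice, assuming the table is kept as whitespace-separated text. The parse_table helper is hypothetical, and the validator names are stored as plain strings with the parentheses dropped. The column names appear once, and column-wise questions become one-liners:

```python
FIELD_TABLE = """\
Field     Type    Size  onValidate
Name      String  30    nameValid
Rank      String  10    rankValid
Serial    Number  10    foo
Location  Number  5     bar
"""


def parse_table(text):
    """Turn a whitespace-separated table into one dict per row."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]


fields = parse_table(FIELD_TABLE)

# A column-wise question that is awkward to ask of the coded version:
numeric_fields = [f["Field"] for f in fields if f["Type"] == "Number"]
```

A real database or table browser would replace the text parsing; the point is only that the attribute names are not repeated per row.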

However, you can achieve minimal duplication using OO:

 class unit:
     def __init__(self, Type, Size, onValidate):
         self.Type = Type
         self.Size = Size
         self.onValidate = onValidate

 Name = unit(String, 30, nameValid())
 Rank = unit(String, 10, rankValid())
 Serial = unit(Number, 10, foo())
 Location = unit(Number, 5, bar())

Your "nodes" are not connected, I would note. And, it would be easier to type it into a table browser in my opinion. For one, the columns line up automatically. Note that a variation of the above could look like:

  function unit(Type, Size, onValidate) {
    r = createArray();
    r['type'] = Type;
    r['size'] = Size;
    r['onvalidate'] = onValidate;
    return(r);
  }
  Name = unit(String, 30, nameValid());
  Rank = unit(String, 10, rankValid());
  Serial = unit(Number, 10, foo());
  Location = unit(Number, 5, bar());

- - -

What about the headache of putting the stuff into "the" (not! multiple instances) database, and the configuration management of doing so, and the "DBA may I have one or two tables" of doing so? I've worked on maintaining several projects that wanted to put everything possible into the database, no matter how non-volatile the information was. My thought after one or two updates was "Welcome to Hell".

This may be a case of DbasGoneBad. I have used "nimble table" tools, such as FoxPro, that made making such tables a cinch. Unfortunately the OOP fad apparently killed off the acceptance of such wonderful tools and techniques. People are afraid of tables these days because of the human overhead they now have. It's a crying shame. --top

There was a really good comment (or two) above about "generator's input is THE SOURCE" / don't check in generated output, and that it's faster.

OTOH, about the 7th time you change something, maybe you should put it into a database, AND give the user a proglet to tune whatever it is themselves.

The one instance that I have run into lately where code generation can be a GoodThing: Creating EJB interfaces with XDoclet. (http://xdoclet.sourceforge.net). This has been good, because generating interfaces is not truly important work, and I don't even version control the interfaces. Rather, ANT generates the interface code every time I build the application. While I consider this to be good, most J2EE tool vendors and specification leads could also realize that CodeGenerationIsaDesignSmell.... -- ChadThompson

Code generation can be a very good thing when dealing with language boundaries. For example, generating a Java proxy for a database stored procedure or SOAP interface. -- JeffDrost

A mirroring wrapper is perhaps a violation of OnceAndOnlyOnce or YagNi. If you look at such code, most of it repeats a theme over and over. The duplication itches like insufficient abstraction. The duplication should be compressed out somehow. I think dynamic languages are better at this, but this risks starting a HolyWar about static versus dynamic typing. Perhaps we should study some generated code together.

This page touches on many of the issues found in EffectiveCodeGeneration.

I spend a lot of time in some projects in building code generators to take away the grunt work of programming. Examples would be generating stored procedure templates from tables, generating calling code for stored procedures, etc. The amount of time thus spent is always worth the savings the team gets from the generated code.

In maintenance projects I also write a lot of scripts to generate scripts. I need to do this when I do not have direct access to the system I have to maintain and I need to run scripts based on some configuration.

A perfect example: In one project I needed to log all inserts/updates/deletes for some tables in an audit table. I wrote a script that will accept the table name as a parameter and generate the trigger based on the table metadata. I spent about 3 hours writing and testing it. The three hours was worth it as there were 80 tables where this trigger was required and furthermore I was able to use it in subsequent projects when required.
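
A rough sketch of what such a script might look like. The original script is not shown, so everything here is an assumption: the metadata is passed in as a plain dict instead of being read from the database catalog, audit_log and its columns are hypothetical, and the SQL is schematic rather than valid for any particular database.

```python
AUDIT_TRIGGER_TEMPLATE = """\
CREATE TRIGGER {table}_audit
AFTER INSERT OR UPDATE OR DELETE ON {table}
FOR EACH ROW
  -- audit_log and its columns are hypothetical names
  INSERT INTO audit_log (table_name, key_columns, changed_at)
  VALUES ('{table}', '{keys}', CURRENT_TIMESTAMP);
"""


def generate_audit_trigger(table, key_columns):
    """Return schematic trigger text for one table from its assumed metadata."""
    if not table.isidentifier():
        # Refuse names that could not be pasted into the template safely.
        raise ValueError(f"unexpected table name: {table!r}")
    return AUDIT_TRIGGER_TEMPLATE.format(table=table, keys=",".join(key_columns))


def generate_all(metadata):
    """metadata maps table name -> list of key column names."""
    return "\n".join(
        generate_audit_trigger(table, keys) for table, keys in sorted(metadata.items())
    )
```

With 80 tables, generate_all would be called once with 80 entries, and rerunning it after a template change regenerates every trigger.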

In projects I make a concerted effort to identify areas that will benefit from code generation. -- HemantSahgal?

Ideally, there would be an "event level" that would cover all tables. It is analogous to an "on_key" GUI event which can listen to key/mouse strokes for any widget on a form, or even have it at the application level, not just the form level. A status structure of some sort often tells which widget/form/table/etc. is the target of the event.

But lacking that, can't you call a stored procedure or some central routine with only parameter values that are different per table? It sounds like you might be dealing with a limited language if you have to truly replicate the entire logging function for each target table. I agree that limited languages or tools may force one to replicate to some extent, but I don't know your particular environment.

The code that you use to code your program, is automated. In order to get away from automation, you'd have to build your own computer from scratch. Otherwise you are automating. And there's no way you'll be able to just go and buy a keyboard at the store, to get this computer of yours running. You have to build it yourself. Get real, automation is the reason everything exists.

There is a difference between repeating a concept and repeating code. If you have 20 sections of code that are all 80% identical, that is a strong hint that more abstraction is needed in your design, or perhaps your language is not "powerful" enough. Why can't you create a subroutine or method with parameters instead of repeating the code 20 times? The point at which duplication of similar patterns is no longer tolerated varies per individual.

[The language not being powerful enough is probably the main reason for using code generation as a technique. Repeating similar code is generally necessary only when the differences can't be factored out in the language you're using - for instance, type differences in early C++. Since C++ compilers have started really implementing templates, I've had much less use for code generation in C++. In a language like Lisp, I'd think that you would never need a code generator, since the macro feature is so flexible.

Of course, in a very real sense, both C++ templates and Lisp macros are code generators - written using powerful built-in features of the respective languages. Hmmm. -- DanMuller]
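
To make the "factor the differences out in the language" point concrete, here is a small sketch in Python, using a closure where C++ would use a template and Lisp a macro. The field names and ranges are invented for illustration.

```python
def make_range_validator(field, low, high):
    """Build one validator; only the parts that differ are parameters."""
    def validate(record):
        value = record[field]
        if not (low <= value <= high):
            raise ValueError(f"{field}={value} outside {low}..{high}")
        return value
    return validate


# Near-identical routines collapse into a table of their differences.
validators = {
    name: make_range_validator(name, low, high)
    for name, low, high in [("age", 0, 150), ("quantity", 1, 999)]
}
```

Arguably this is still code generation of a sort: the routines are produced at run time by the language itself instead of by an external tool.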

Code generation of interfaces that other code depends on especially smells.

Let's say I automate generation of interfaces and default implementations for classes representing types I define in an XML Schema. I just give it the schema, and the tool takes care of the rest. Dependent code talks just to the interfaces.

Now let's say I use ActiveCodeGeneration for the interfaces. They are wiped and recreated for each build.

Here's the kicker: what about IDEs that passively validate and compile code as you type, providing UI artifacts displaying lists of methods, parameters, return types, fields, etc.? As soon as you start writing the code depending on the generated interfaces, the IDE will complain that the type does not exist. No auto-completion will be possible.

Sure, you can generate the interfaces only, but every time you do a clean, you have to remember to regenerate these interfaces. Such a task is not doing a build, per se. It's an intermediate step.

If the generated interfaces go into an archive, it's even worse. Some IDEs on some platforms will not let you overwrite archives you specified as a dependency for the project. The auto-completion process has some sort of file lock on the archive. Thus, when you complete the dependent code and want to do a build, the build will ultimately fail due to an I/O error.

Surely this indicates a bad smell coming from the IDE?

Code generation is a way of writing a compiler for a MiniLanguage? without most of the work that goes into writing a compiler. One problem which can arise with it lies in poorly defined interfaces between the code in the MiniLanguage? and the code in the main language. If the generated code gets used as if it was code in the main language, it creates hopelessly hairy interactions. It's like compiling chunks of C where the resulting assembly gets macro-substituted into assembly files. Defining the possible interactions between the generated code and the main code strictly makes a big difference.

Consider an interpreter instead of a compiler then.

I just sat through a 2-hour presentation on a 'software delivery process methodology' that shall remain nameless. Needless to say, it was huge (4+ Mb of HTML or four big ring-binders) and had all the usual process diagrams, roles, list of deliverables (one that made the button on the scroll bar shrink in a frightening way ... and one that didn't seem to include executable code as far as I could tell) and templates for those deliverables.

One of the project managers in the room noticed me rolling my eyes. He knows I'm a bit of an 'extremist' and asked me afterwards if I thought it was all bullshit. I said I did and he agreed with me. "But", he said, "managers of companies we consult to expect to see this sort of stuff. They don't read it but they do want to see it. It may be bullshit, but if we adopt it, we don't have to write that bullshit ourselves. Even better, these templates allow us to automatically generate the resultant bullshit the method requires. So the customer's managers are happy because they have the requisite weight of bullshit and we're happy because we don't have to do much to produce it."

An example of DoTheSimplestThingThatCouldPossiblyWork, I wonder? ;-)

-- PaulDyson

I had a manager once who told me there are two types of managers: one is concerned with progress of the checkmarks, the other is concerned with the quality of the work that was produced to satisfy a checkmark. Most are the first type. (see CargoCult)

Code generation is being used to satisfy the red tape b.s. that many companies require. No-one reads it, but clients want it, so code generation comes to the rescue. You can generate b.s. documentation just as well as anything else; the design smell isn't code generation, it's the bureaucracy that requires all that unnecessary paperwork. Code generation isn't a design smell, bad code generation is a design smell, and there's a world of difference between the two. This is a completely different topic, though. I'm not its author, but I agree with its point: you simply can't do business with many big companies without producing lots of paper, because it's the paper they're paying for. It gives all their mid-level managers a way to justify their jobs, and keeps them busy with meetings and revisions; they have to have something to do. -- RamonLeon

Isn't that called "boilerplating"?

EditHint: I moved the main points together, but the discussion could be moved to a separate page.

Alternatives

Code generators get you working code fairly quickly and early. Maybe a good framework would alleviate the need for such, but good frameworks are usually difficult to get right the first time and require a lot of domain knowledge. Alternatives include:

HelpersInsteadOfWrappers - Make micro-frameworks that can be ignored when needed
?

To do functional style programming in .NET (usually in static framework methods that operate on families of classes) involves heavy use of reflection. Reflection is a horrendous performance bottleneck and code-genning up the exact 'lightweight' object with all the properties you need is far faster and is done at compile-time (instead of run-time). I view code-gen as a substitute for reflection. Why slowly reflect on a big object and pick out what you need at runtime when you can compile exactly what you need? --BrianG
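
A sketch of the contrast in Python rather than .NET, so the details differ: Python has no separate compile step here, and generation happens once when the extractor is built rather than at build time. The names make_extractor and extract_by_reflection are invented for illustration, and no timing claim is made for this sketch.

```python
import keyword


def _check_name(name):
    """Reject anything that is not a plain identifier before generating code."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"not a usable identifier: {name!r}")


def make_extractor(class_name, properties):
    """Generate and compile a function that copies exactly the named properties."""
    _check_name(class_name)
    for name in properties:
        _check_name(name)
    body = ", ".join(f"{name!r}: obj.{name}" for name in properties)
    source = f"def extract_{class_name}(obj):\n    return {{{body}}}\n"
    namespace = {}
    exec(source, namespace)  # one-time cost, paid when the extractor is built
    return namespace[f"extract_{class_name}"], source


def extract_by_reflection(obj, properties):
    """The run-time alternative: look every property up by name on each call."""
    return {name: getattr(obj, name) for name in properties}
```

Both functions return the same dictionary; the generated one has the property names baked into its source, which is also what makes its stack traces point at a specific property access.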

Perhaps there are other ways to do something similar. May we request a UseCase?

I used to try to use DataDictionary techniques to map DB columns to forms or variables or maps for in-app processing. However, the languages, team familiarity, and tools just don't support that technique very well, especially with regard to handling those fields that don't map one-to-one. Thus, I've moved to using existing schemas to generate conversion function calls, similar to those under HelpersInsteadOfWrappers. Suppose something like this is generated:

  ...
  makeColumn(handle, obj.sourceX, "destinationX", myType ....);
  makeColumn(handle, obj.sourceY, "destinationY", myType ....);
  makeColumn(handle, obj.sourceZ, "destinationZ", myType ....);

If I need to custom-diddle something, I can just do this:

  ...
  makeColumn(handle, obj.sourceX, "destinationX", myType ....);
  temp = obj.sourceY . fiddleWith(obj.sourceB);   // append stuff
  makeColumn(handle, temp, "destinationY", myType ....);
  makeColumn(handle, obj.sourceZ, "destinationZ", myType ....);

It is possible to do such fiddling with DataDictionary frameworks, but requires pretty fancy frameworks that confuse newbies. The above is generally newbie-friendly and doesn't require a new hook be put into the framework. Thus, in an imperfect world, sometimes code generation is not a bad thing, as long as the code it generates is not unnecessarily verbose. One row per field is usually sufficient.
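
For what it's worth, the generator for rows like those can be tiny. This is a sketch only: the schema is a hand-written list of tuples where a real generator would read an existing schema, generate_make_column_calls is a made-up name, and the trailing arguments elided in the rows above are left out.

```python
def generate_make_column_calls(schema, handle="handle", source="obj"):
    """Emit one makeColumn(...) line per (source field, destination, type)."""
    lines = []
    for src_field, dest_column, type_name in schema:
        lines.append(
            f'makeColumn({handle}, {source}.{src_field}, "{dest_column}", {type_name});'
        )
    return "\n".join(lines)


# Hypothetical schema: (source field, destination column, type name)
schema = [
    ("sourceX", "destinationX", "myType"),
    ("sourceY", "destinationY", "myType"),
    ("sourceZ", "destinationZ", "myType"),
]
print(generate_make_column_calls(schema))
```

The output is one row per field, which keeps the generated text easy to hand-edit when a field needs special treatment.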

My company uses code generation, and I feel they made the right decision. The problem is that we made an engine which runs a program in a new language, a very domain specific language. We sell the engine and allow customers to write programs in this language. We have the engine written in C++ and we have most of the tools written in Java. We want the language to be expressible as XML, to be expressible as an object graph in C++ and an object graph in Java.

It's basically the serialization problem except that we want the objects to be in Java and C++, and code generation from like a Rose model to Java code and to C++ code is the only sane way to do it. (The key IMHO is keeping the model as small as possible, something which isn't being done, but that's a discussion for another time.)

Generating code that has to be subsequently edited by hand is bad.

E.g. Microsoft Wizards are bad. The generated code is often hard to understand. Worse, if the code generator is changed, and the automatically generated code cannot be regenerated because of the hand changes made... that's bad.

Hand editing or customization of generated code is not so bad if it does not interfere with regenerating code. E.g. if the hand edits or customizations are not blown away by the regeneration.
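
One way regeneration can leave hand edits alone is the "protected region" approach mentioned earlier on this page. A minimal sketch follows, with an invented marker syntax ("// BEGIN USER name" / "// END USER name"); real tools use their own conventions, and this is not how any specific product does it.

```python
import re

# Hypothetical marker syntax; real tools use their own conventions.
REGION = re.compile(
    r"// BEGIN USER (?P<name>\w+)\n(?P<body>.*?)// END USER (?P=name)\n",
    re.DOTALL,
)


def regenerate(fresh_output, previous_file):
    """Copy hand-written region bodies from previous_file into fresh_output."""
    saved = {m["name"]: m["body"] for m in REGION.finditer(previous_file)}

    def restore(match):
        # Keep the freshly generated body only if no hand-written one exists.
        body = saved.get(match["name"], match["body"])
        return f"// BEGIN USER {match['name']}\n{body}// END USER {match['name']}\n"

    return REGION.sub(restore, fresh_output)
```

Everything outside the markers comes from the new generator run; everything inside comes from the previous file. Hand edits made outside a region are still lost, which is the failure mode described above.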

So many of y'all are ragging on code generation. Why use code generation when a denser, more abstract description would do the job?

Because your boss doesn't want to hire people with specialized skillsets. They want to stick with JavaProgrammingLanguage? or CeePlusPlusLanguage?, because there are lots of programmers out there who can read and maintain it. Unfortunately, neither of those languages would be described as "dense" or "abstract" without having to build a few layers of classes, first. So, either sit down and start banging out boilerplate code (copied and pasted from other projects, tweaked as appropriate for this project) or write something dense and compact and let a code generator write it. Ideally, you'd revision-control the dense, compact version, but you may not want your boss knowing that your code generator created that pile of code.

I have been ordered to stop factoring at one org I did contracting for because it was claimed factoring made it too difficult for PlugCompatibleInterchangeableEngineers to come and go, and they cited past examples. It's the way of the industry, for good or bad. Rather than fight it, perhaps we should embrace it and find the best way to leverage code generation. If you can't beat 'em, join 'em, and join 'em using the best techniques possible.

CategoryCodeSmell, CategoryDiscussion, CategoryAbstraction