Refactoring Html

On BoredomIsaSmell, I mentioned the canonical example of a boring thing that programmers hate doing: HTML. I claimed that subjecting it to all the XpSimplicityRules can bring the some same benefits as subjecting real code to the XpSimplicityRules.

A reply was this:

Kinda hard since HTML supports neither variables nor subroutines. You can't factor it. Now, if you could write a 10-line Perl script which used s/// a few times to convert your own abbreviations to HTML, you'd be cooking.

Exactly. Write code to write that code.

Replace this:

 print "
  <td>Foo</td>
  <td>Bar</td>
  <td>Baz</td>
  ...
 ";

with this OnceAndOnlyOnce version:

  print map { "<td>$_</td>" } qw(Foo Bar Baz);

And, the very next time you have to output a <td>, refactor that to one place. When you have a conceptual grouping of these things, put those into one place (for example, the header/navigation bits are probably deserving of their own object).

Possibly you'll end up having something like this, to save yourself from seeing (or duplicating) any HTML directly:

 print Html::tableize($someTwoDimensionalArray);

Eventually, you hardly realize you're doing HTML at all.

Don't forget that you have column headings, and that each data column may have different alignments, such as right-justified for numbers, left justified for descriptions, and center-justified for small codes.

The tricky part comes when the output needs conditions or different cells need different formatting. One may have to decide whether to store formatting information with the intermediate structure, or just put it at the HTML. Because of such issues, an easy shortcut is sometimes to make simple, short wrappers:

  td(foo)
  td(bar)
  td(baz)

It ain't fancy, but it gets the job done. Add named parameters if you need them:

  td(foo)
  td(bar, bgcolor="red")
  td(baz)

However, you may end up reinventing HTML, which have resulted in ExpressionApiComplaints. A compromize may be to have an optional second parameter that just passes through to the HTML tag contents:

  td(bar, 'bgcolor="red" zarg=3')

Result:

  <td bgcolor="red" zarg=3>[bar's value]</td>

Note that dot-path-crap is anti-factoring:

  html.system.td(foo)
  html.system.td(bar)
  html.system.td(baz)

I don't know why some people tolerate such bloat. I find it ugly, stupid, and just plain poor factoring of syntax. If your language must have it, then run away.

Eventually, I made an entire TableWriter class to refactor data into tables (with formatting and such included). Now I love making HTML tables! -- SteveBushman?

I think m4 (or even cpp) is a superior way for this kind of stuff.

Can you give us some examples of this approach? My main goal is to treat HTML with the same requirements as code (to DeduplicateAndElucidate?). The above approach seems to fit well, because if the business logic is in a language (Perl, above), then keeping the formatting in that same language avoids some need to manually keep things in sync.

I HaveThisPattern; I was starting to write a tool to help a friend who wanted an online journal, but didn't want to run it on a blogger because of ads and control issues. After a few rounds of programming, I developed a few shell scripts that would do all the HTML formatting, and all I'd need to do is provide a text file every day (or whenever). -- PeteHardie

It's amusing to be having this discussion in a forum that automagically converts WikiCase to an appropriate hyperlink.

CascadingStyleSheets are an attempt to refactor HTML. I hate writing HTML with style attributes, the duplicative stench is overwhelming. Writing a style sheet is boring too, but at least it's OnceAndOnlyOnce. -- SeanOleary

Considering that much extant HTML has all manner of syntax errors and other crud (goes to show that browsers are smarter than compilers ;-), and that some HTML pages are tangled messes of nested tables, and you can see lots of opportunity for refactoring HTML. Just grab a page and clean up the code, fixing broken or unclosed tags, then tidy the tags so they becomes W3C compliant for starters. Then simplify the tables. Strip out any inline markup that forces a particular look and replace with CSS, and this includes any extraneous <BR> tags to force extra spaces in various places - you can do this by assigning class="foo" attributes to create special variations of basic tags, and in this process you'll also be adding semantic markup (e.g., all your inline book titles will be surrounded by <span class="booktitle">). Finally, start replacing table-based layouts with DIVs. At the end of it all, you will have cleaner and simpler HTML code which is very easy to modify and restyle as needed. -- EricScheid

One of the forces that makes this not always a good idea, however, is if you're working for a site that requires a lot of design changes. It's a waste of a programmer's time to be constantly tweaking HTML, especially if a production person who makes half the salary can do it instead. This is why people still do code in PHP or .rhtml or JaveServerPages?; it's fairly messy in practice, though.

Oddly, sites that require a lot of design changes are when I think this is best. The bare minimum is that the header and navigation are deduplicated. Any other kind of task that would require a "Web Monkey" to go do monotonous work is a prime candidate for automation (AutomateBoredom). As you mention, it's messy in practice.

The model I've gone back to time and time again is: Have the artsy people create a sketch, be it HTML produced from Dreamweaver, a printed picture from Photoshop, or a scribbling on the back of their hand. When they give it to the programmers, the programmers figure out the correct abstractions to use. If they decide a template is appropriate, great. More often, they'll say, "Oh, this is exactly like <some other page>. I'll make a handy function to generate both.

Notice one important point: Both sides are using their brains. NoMonkeysNeeded.

Headers and navigations are easily taken care of with server-side includes, which have the added benefit of requiring almost no programming ability to use. Other HTML is unbelievably monotonous at times, but that doesn't mean it can be easily automated.

In particular, my experience with turning HTML generation into code has involved a lot of assigning to names to particularly arbitrary elements that are meant to make the page look good in MSIE, not to actually satisfy any conceptual need. Some kinds of commercial HTML work are nothing like the SemanticWeb that TimBernersLee imagined, and henceforth lend themselves very poorly to programmatic abstraction. For some reason it feels wrong for me to write code where I create intermediate objects like BlackTopRow and SideSpacerColumn, especially if I'm simply going to have to rewrite those abstractions in a week.

Some sorts of problems make this abstraction a good thing, of course. If the client's design change is "The text in each column header should have non-breaking spaces instead of normal spaces", code lets you do that. But it's not always like that. Sometimes - too often, in my experience - the client fixates on more superficial things that can't really be refactored out no matter how clever your code is.

Replace this:

  print map { "<td>$_</td>" } qw(Foo Bar Baz);

With the non-obsfucated, and easy to change version of this:

 print "
  <td>Foo</td>
  <td>Bar</td>
  <td>Baz</td>
  ...
 ";

HTML breaks the idea of OnceAndOnlyOnce. It is one of those things you are going to have to sigh and cope with. Unless you are the single programmer/designer who will only ever be working on the project that is sent to one browser and one browser only, you want to make sure that your HTML is easy to work with. New versions of browsers are going to break it. Browsers on different platforms are going to break it. Designers are going to break it. Sites will get redesigned.

It is entirely too easy to get caught up in trying to wigitify and templatize every little piece of HTML to "Make things easier for the Web designer" that end up making things harder. Generally, the Web designers can cope with ASP, PHP and JSP. What they can't cope with so well is learning some obscure template language (at best) or even an obscure proprietary template language (at worst). (MetaProgramming)

Use PHP, or JSP as your application glue. Most designers can handle code like:

 <TD> <? echo $someval ?> </TD>

HTML belongs in One Place, and Only One Place. Not in three different places (Your templates directory, your content directory, and in an obscure location of your function library).

On the other hand, your designers might have trouble with the "extract method" concept. Those fancy, but repetitious, layout elements that require several different tags each to produce are perfect fodder for the programmer to want to write a function. The last designers I worked with had to be taught this concept, and then it didn't work with their toolchain, which was roughly a WYSIWYG HTML editor. You see, at the time (2003) the wysiwyg tools couldn't deal with repeated semantic units of layout. They tended to produce HTML intended for final consumption, when the ideal approach is to produce templates usable with some large collection of other tools.

Generally however, trying to treat HTML as just another language where you can apply Java/Smalltalk like refactorings at will seems counter-productive. Refactor HTML Mercilessly, but you must refactor it for different reasons.

HTML Refactorings:

ConvertTableLayoutToCss? - To move to a more CSS friendly layout, and to do better separation of design from content.
ConvertCssLayoutToTable? - Sometimes your layout/design requires funkiness that only tables can realistically provide.
RemoveIslandText? - Text with the parent tag being the <BODY> should be given a <DIV> or <SPAN> tag at the very least.
KillOverusedWhiteSpace? - Whitespace occasionally breaks things. The first instance of whitespace (tabs, spaces, cr/lfs) has meaning, but subsequent whitespace does not. For instance <img src="f00"> <img src="bar"> is different than <img src="foo"><img src="bar">. Whitespace at the end of tables can be particularly nasty. CR/LFs tend to really hose you. This isn't a refactoring, since it deals with correctness, rather than making the HTML more easy to maintain and modify.

The goal of refactoring HTML is to describe content in terms of a vocabulary and the exclusion of layout. Only tabular content should be described with <table> tags. The role of CSS is to apply the layout to the content vocabulary. Content will invariably contain repetitive elements, therefore the repetition of class selectors should not be considered malodorous. As programmers, we should refactor our code to generate X/HTML with the least number of class descriptors necessary to describe the content whilst providing layout artistes with a document structure that can support presentational requirements. A simple pattern:

GenerateMarkedupContentWithoutStyle?

and AntiPattern:

AbuseMarkup?

See HowImportantIsLeanCode, HyperTextMarkupLanguage