Commenting Challenge

(I've put six single quotes in lots of places to avoid functions and variables mindlessly being made into inappropriate links. However, I can't type tabs, so now I'm unable to create both bulleted lists and properly indented code. There appears to be no solution, so I'm leaving it for someone else to put the [tab][splat] where necessary.)

Here is DaveHarris' original program, which I will attempt to refactor until I think it doesn't need comments. (Not that it has many now.) Then folks can point out where the program is unclear without comments, and we can see how best to fix it.

See ToNeedComments. -- RonJeffries

See my CommentingChallengeResponse. For the challenge to be fair, you should look at my code first and see if you can understand it without the comments. Then look at Dave's original code (below) to see if his comments add anything to your understanding. -- JimLittle

I don't use Smalltalk nowadays. Here is a Java sample. Please be gentle with me. -- DaveHarris

    /** Turn a stream of "name=value&" pairs into a Dictionary. (The version
      in javax.servlet.http.UttpUtils throws exceptions if it sees "name"
      rather than "name=". */
    private static void parseArgs( Dictionary args, InputStream in ) throws IOException {
        StringBuffer key = new StringBuffer( 50 );
        StringBuffer value = new StringBuffer( 50 );
        StringBuffer curBuf = key;

        while (true) {
            int b = in.read();
            switch (b) {
            case '=':   // Second half of name=value pair.
                curBuf = value;
                curBuf.setLength( 0 );
                break;
            case '&':  // End of name=value pair.
                args.put( key.toString(), value.toString() );
                curBuf = key;
                key.setLength(0);
                value.setLength(0);
                break;
             case '%':  // Encoded special character.
                int num = getHexDigit( in.read() );
                num = num * 16 + getHexDigit( in.read() );
                curBuf.append( (char) num );
                break;
             case '+':  // Encoded space.
                 curBuf.append( ' ' );
                 break;
             default:
                 if (b < 0) {
                     if (key.length() > 0)
                         args.put( key.toString(), value.toString() );
                     return;
                }
                curBuf.append( (char) b );
                break;
             }
        }
   }

(I just tried to fix the indentation, which was lost during the copying. It's probably best not to use tabs as Wiki thinks the case labels are dictionary lists.)

This example is Java when the original request was for Smalltalk. I suspect it makes a difference. Java has manifest types, and function definitions generally have a lot more syntactic noise than in Smalltalk. As it stands, the whole routine fits within a single screen (on my machine at least). This makes it easier for me to understand. If it were refactored so that, say, every "case:" had its own method, I suspect it would - in Java - become twice as long, you could no longer see it all at once, and it'd be harder to understand. (This is for sound psychological reasons involving the limits of short-term memory.)

You say it doesn't have many comments now. I chose it partly because it had so many! Most of my routines are shorter and need fewer. Perhaps having a concrete example, albeit in Java, has made my position clearer. I'm not into vast amount of redundant commenting; few people are. Here's a brief commentary on the comments.

"Turn a stream of "name=value&" pairs into a Dictionary."

This is a method header comment which tells you what the method does. Arguably it would be unnecessary if the method had a better name. In this case, I also wanted to suggest the syntax of the (name, value) pairs. Ideally this sentence acts as a specification for the whole routine - you could write the routine from it. It should be at a higher level of abstraction. It's redundant, in that you could figure out what the comment would be from reading the method's implementation, but that would usually take me many seconds. With a comment I get the whole gestalt instantly.

"(The version in javax.servlet.http.UttpUtils..."

This is a cross-reference explaining stuff that isn't obvious from the method body. It mentions a problem with the standard library routine that took me little time to discover in the first place. I don't see how you could get rid of it.

"Second half of name=value pair."

Each branch of the switch has a comment which shows how that piece of code fits into the wider context. Again I'd claim its a specification for the code that follows, and at a higher level of abstraction. This is probably the weakest point - the comment is a hint that the branch could be turned into a separate function. I usually do that, but didn't here because of the way the branches use the state on the stack. I suspect the "perfect" factorization would turn the routine into a separate class, and turn the stack variables into instance variables.

I'm not convinced that would truly be better. See above for some of the reasons - generally it would mean more code and overhead and add unnecessary complexity. This is probably a judgement call - your opinion may differ. Because I don't have such a negative view of comments, I can see them as an acceptable solution, and in this case they are a lightweight, economical answer.

-- DaveHarris

Faced with this sort of thing, I'd tend to do something like this:

        final class NameValuePairToken {
                static final char
                        START_OF_VALUE         = '=' ,
                        END_OF_NAME_VALUE_PAIR = '&' ,
                        SPECIAL_CHARACTER      = '%' ,
                        SPACE                  = '+' ; 
        }
        /*..*/
        private static void parseArgs( Dictionary args, InputStream in ) throws IOException {
        /*...*/ 
        while (true) {
            switch (b) {
            case NameValuePairToken.START_OF_VALUE : 
                /*...*/ 
                break;
            case NameValuePairToken.END_OF_NAME_VALUE_PAIR :
                /*...*/ 
                break;
            case NameValuePairToken.SPECIAL_CHARACTER : 
                /*...*/ 
                break;
            case NameValuePairToken.SPACE : 
                /*...*/ 
                 break;
            default: //read name or value
                /*...*/ 
                break;
             }// switch
        }//while
   }//method ParseArgs

Added comments :

switch, while, and method closing '}' these stand in for an "end <<blockname>>" construct, so are justified as filling a hole in the language syntax. I've been bitten by this so many time in C/C++ that it's second nature to put these in. Your mileage may vary.
intent on the 'default' branch of the switch. Maybe I'm just thick, but when I read this code I thought: 'hang on, where does the name get read? Oh, it must be in the default block, because there's no token.' I then was thick and didn't consider where the value was read. Hence the amended comment.

Removed comments :

intent on the non-default branches of the switch. Since the constants are clearly named for their semantics these are now redundant. A small increase in code volume for a big increase in clarity, I submit.

Weakened comments :

It just got a lot easier to deduce that this code will "Turn a stream of "name=value&" pairs into a Dictionary" by reading it.

By providing the class NameValuePairToken? the specification for the "language" this method parses is show separately from the implementation.

Each block in the switch now has a label saying which part of the language it deals with, not what syntax it looks for, hence communicating intent.

-- KeithBraithwaite

Interesting. I think I prefer the original, though. I find

case '&': // End of name=value pair.

easier to read than:

case NameValuePairToken.END_OF_NAME_VALUE_PAIR :

because for me the indirection gets in the way. The phrase END_OF_NAME_VALUE_PAIR now occurs in two places (definition and use) where before it was only one (comment). It feels like an unnecessary shield. "NAME_VALUE" is less clear than "name=value" because it loses the equals sign. Indeed, I don't agree that it's easier to tell what the whole is doing, because the equals sign is hidden. The idea of using verbs instead of nouns to name symbols feels funny but that may be because I'm not used to it.

I don't agree with "}//switch" type comments; I can't remember being bitten by not having them. A matter of using correct indentation and small blocks, and knowing the language.

Overall I find the original clearer and simpler. I am an unreformed sinner.

-- DaveHarris

If anyone who can type tabs could fix the bullets without wrecking the code indentation, that would be nice.

I do this by tabbing in Notepad, then copying it to the clipboard, then pasting the tab into the edit box with ctrl-V.

I think I would be unhappy until I had something similar to (untested):

 private static void InputStreamReadStringBufferUntilChr_Decode(InputStream in, StringBuffer sb, char c) throws IOException{
     sb.setLength(0);
     while(true) {
         int b=in.read();
         if(b==c) {
             return;
         }
         switch (b) {
             case '%':  // Encoded special character.
                 int num = getHexDigit( in.read() );
                 num = num * 16 + getHexDigit( in.read() );
                 sb.append( (char) num );
                 break;
             case '+':  // Encoded space.
                 sb.append( ' ' );
                 break;
             default:
                 sb.append(b);
                 break;
         }        
     }
 }

 private static void InputStreamReadDictionary(InputStream in, Dictionary args) throws IOException {
     StringBuffer key = new StringBuffer( 50 );
     StringBuffer val = new StringBuffer( 50 );
     while(true) {
         InputStreamReadStringBufferUntilChr_Decode(in,key,'=');
         InputStreamReadStringBufferUntilChr_Decode(in,val,'&');
         if(key.length()==0) {
             return;
         } 
         args.put( key.toString(), val.toString() );               
     } 
 }

See LanguageOrientedProgramming, ThelopLanguage. -- HelmutLeitner

Helmut: Can you please explain to me how "parseArgs" or better yet "parseArguments" is somehow less clear, less intention-revealing, less obvious than InputStreamReadStringBufferUntilChr_Decode? I just don't see it.

You should compare the names parseArgs and InputStreamReadDictionary. What do they tell you? The function InputStreamReadStringBufferUntilChr_Decode - that you dislike - is something potentially reusable, that didn't exist in the original. And it does what it says.

parseArguments is a better name than InputStreamReadStringBufferUntilChr_Decode because parseArguments tells me the semantics of the method. It tells me what the method does. Just from the name (and the class), I know that this method will parse the arguments to an http request.

InputStreamReadStringBufferUntilChr_Decode only provides me syntax. It leaves me guessing as to what the method does. It tells me that there is an input stream that is being read and decoded. What input stream? What are its contents? The whole HTTP request? Just the URL? Why would I want to use this method? I can't tell from the name alone.

Phrased another way: I'm just some guy coding my first HTTP request server and I want to figure out how to get the arguments out of a request. I start scanning the code. Which name is more like to jump out at me and say, "This is the method you're looking for!": parseArguments() or InputStreamReadStringBufferUntilChr_Decode?

I'm also very confused about why "parseArgs", which is clear, gets turned into something long and unreadable, but "sb", which means nothing, doesn't get turned into stringBuffer or aBuffer or something? Where's the logic?

Within a simple function (3 parameters, 2 local variables) there is no need for long names, stringBuffer will not give you more than sb. But change it, if you like. What counts (in LOP) are the function/method names and the consistency of name and content.

Actually, stringBuffer gives me a lot more than sb. When I see stringBuffer, I say to myself, "Ah! This is a string buffer!" and I know what it does. When I see sb, I say to myself, "Now what the hell is an sb?" and I have to go look to the top of the method where sb is defined...and hope to hell that the fellow who wrote the class didn't call it SB too.

Here's two versions of a random chunk of Smalltalk code. Which one gives a clearer indication of what the method does?

NavigatorBrowser>>orderedCollectionCompiledMethodCopyToClass: cl

  | oc mo |
  mo := CodeFiler organizerFor: self selectedClassOrMetaclass.
  oc := OrderedCollection new.
  self selectedProtocols do: [:sp| oc addAll: (mo elementsOfCategory: sp)].
  self
    copyMethods: oc
    from: self selectedSmalltalkClassOrMetaclass
    to: cl.

...or...

NavigatorBrowser>>copySelectedProtocolsTo: newClass

  | methods methodOrganizer |
  methodOrganizer := CodeFiler organizerFor: self selectedClassOrMetaclass.
  methods := OrderedCollection new.
  self selectedProtocols do: [:protocol |
    methods addAll: (methodOrganizer elementsOfCategory: protocol)].

  self 
    copyMethods: methods 
    from: self selectedSmalltalkClassOrMetaclass 
    to: newClass

Dave: Rename "parseArgs" to "parseArguments", and rename "c" to "nextCharacter". It's clear as a bell, otherwise. -- a.l.

Firstly I am assuming this function is intended to parse the query string from a HTTP GET request?

Regardless, I am wondering what the deal is with the while(true) loop? Surely you want to loop until EOF, or some other condition occurs?

Secondly why are we doing the parsing manually?

(Note this code is uncompiled let alone untested, but an indication of my approach to this sort of problem)

 private static Map parseQueryString(InputStream in) throws IOException {
   Map dict = new HashMap();
   String queryString = new BufferedReader(new InputStreamReader(in)).readLine();
   org.apache.oro.text.perl.Perl5Util re = new Perl5Util();

   re.match("([^&]+&)*([^&]+$)", queryString);
   org.apache.oro.text.regex.MatchResult match = re.getMatch();
   for (int i = 1; i < match.groups(); i++) {
     dict.put(extractNameFromPair(match.group(i)), extractValueFromPair(match.group(i));
   }

   return dict;
 }

Where extractNameFromPair, and extractValueFromPair perform the appropriate unescaping. I'm pretty sure I've got the regex wrong, but it's close enough and we've eliminated the switch. We've expressed the syntax in the form of a regex and so that can be tweeked. Most importantly (although maybe not for this specific case), formats have a tendency to want to change. If even your simple parsing is done by regexes this change becomes trivial and so this is one more impedance to ReFactoring removed. -- AndraeMuys

I claim that the whole thing is an example of why obfuscated code is just WRONG for this sort of thing. Here is the same in RubyLanguage: -- IanKjos

 def parseUrlEncodedArguments(str)
   out = {}
   str.split('&').each do |part|
     name, value = part.split('=', 2)
     out[urlDecode(name)] = urlDecode(value)
   end
   out
 end

 def urlDecode(str)
   str.to_s.gsub('\+', ' ').gsub(/%([0-9A-Fa-f]{2})/) {'%c' % $1.hex}
 end

As you can see, the thing packs way more meaning into each line of code. Sure, it might require a few more CPU cycles, but why is that a problem? Has anyone found profiling results stating that we spend most of our time decoding URL parameters?

Ruby is on the right track, now, in APL:

 DictionaryOfURLArgList {<-} { defineWordinDictonary/{each}UrlDecode? {each}{each} '=' chop {each} '&' chop }{omega}

 chop {<-} '({omega}={zilde}{rho}{omega}){quad}pencloase {omega} }

 URLDecode {<-} PercentHHtoChar {jot}{'/+/ ' TEXTREPL {omega}}

 defineWordInDictionary {<-} { dictionary{<-}dictionary add {alpha}{omega} }

-- RandyMacDonald

Your program also changes the semantics. Given the args "a=b=c", the original code will bind the key "a" to the value "c" in the result dictionary. In other words, the original code is buggy.

The tests don't exist. Who is to say it's buggy? -- RandyMacDonald

 - Um, is it a problem when refactoring or porting points out bugs that you can fix?

Sarcasm, man! Humor! Cue the laugh track! In this case, the answer is to check the standard, but in the more general case, we should be able to ask the client about preferred behavior. Or, perhaps the general case is that if you feel the need for comments, you probably have bugs.

It would be straightforward to use Java to read the entire string in at once, then use String.indexOf('&', lastIndex + 1) to extract each "name=value" tuple, then use String.indexOf('=') to find the key & value, and finally go through and demangle the percent signs. That wouldn't be any fun, though, because then it would be easy to understand what the Java code is doing. This is precicely the point. If your code is easy to understand, then bugs in it come to the surface quickly or else (better yet) don't get written in the first place. -- IanKjos

Yes, good point. It's not Java that's wrong. It's trying to implement parsing state machines by hand that's wrong. Either perform simple and obvious transformations on the input text that match the definition of the problem, or else use a parser generator. Anyone care to write a URL parser in lex/yacc? It seems a trivial exercise, but I don't personally know flex/bison well enough to write it.

I don't know that I'd go for a full-bore parser generator. A simple recursive-descent parser should suffice -- there are only a couple of non-terminals.