DupTective

DupTective is a Python program that helps you find duplication in source code at the method level.

It currently works on PythonLanguage source, but can be easily extended to other languages.

DupTective is at version 0.1 now, so there is room for improvement, especially new features, and, um, "portability enhancement." It's known to work with Python 2.2 under LinuxOs, and will probably work on all other UnixOs systems (including MacOsx) and MicrosoftWindows systems with Python and gzip (but see below). Please post any successes and failures here.

This is also a good place to request new features (e.g., being able to specify boundaries by RegExp).

How to get it:

 http://gpaci.home.tiac.net/duptective/

If you just want to run it:

 http://gpaci.home.tiac.net/duptective/duptective.py

If you want to run the tests, this tarball has them, plus some test data files:

 http://gpaci.home.tiac.net/duptective/duptective.tgz

How to use it:

You can check a single source file in just a few seconds:

 python duptective.py myFile.py

This gives you the ten regions (usually methods or functions) most likely to be duplicated elsewhere or contain duplication themselves.

To find more regions, use a numeric flag:

 python duptective.py -20 myFile.py

This gives you the top twenty.

To find all regions, use the -all flag:

 python duptective.py -all myFile.py

You may want to redirect this output, since it can be large:

 python duptective.py -all myFile.py > /tmp/myFileDups

To find all regions with a signal below 20%, use a percent flag:

 python duptective.py -20% myFile.py

Running times barely depend on how much you report.
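The three report-size flags above (-N, -all and -N%) can be told apart by shape alone. The following is a minimal sketch of how such flags might be interpreted, written for a current Python 3. parse_limit_flag is a hypothetical helper invented for this page, not code taken from duptective.py; the default of ten comes from the description above.

```python
import re

def parse_limit_flag(flag=None):
    """Classify a report-size flag as ('count', n), ('percent', p) or ('all', None).

    Hypothetical helper for illustration; not taken from duptective.py.
    """
    if flag is None:
        return ("count", 10)              # default: the top ten regions
    if flag == "-all":
        return ("all", None)              # report every region
    match = re.fullmatch(r"-(\d+)(%?)", flag)
    if match is None:
        raise ValueError("unrecognized flag: %r" % flag)
    number = int(match.group(1))
    # A trailing % means "signal below this percentage"; otherwise "top N".
    return ("percent", number) if match.group(2) else ("count", number)
```

With this sketch, "-20" is read as a count of twenty and "-20%" as a signal threshold of twenty percent.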

Multiple files work the same as single files:

 python duptective.py myFile.py myOtherFile.py

This gives you the ten most duplication-prone regions of the two input files together. Methods duplicated (or nearly duplicated) between the two files should show up close to each other in the report.

On UnixOs (and possibly Cygwin), you can glob to specify files:

 python duptective.py /usr/local/lib/python2.2/email/*py

This gives you the top ten for all Python source files in the given directory. On an older machine (Pentium II at 266 MHz), this takes about 20 seconds (for 13 files).
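Where the shell does not expand wildcards (the plain Windows command prompt, for instance), the expansion can be done in Python before the file names are handed to DupTective. This is a minimal sketch for a current Python 3; expand_patterns and run_duptective are names invented here, and the script is assumed to sit in the current directory.

```python
import glob
import subprocess
import sys

def expand_patterns(patterns):
    """Expand shell-style wildcards into a sorted list of matching paths."""
    paths = []
    for pattern in patterns:
        paths.extend(glob.glob(pattern))
    return sorted(set(paths))

def run_duptective(patterns, script="duptective.py"):
    """Run DupTective on every file matching the patterns.

    The script location is an assumption; adjust it to wherever
    duptective.py actually lives.
    """
    files = expand_patterns(patterns)
    return subprocess.call([sys.executable, script] + files)
```

Something like run_duptective(["email/*.py"]) would then behave like the globbed command line above.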

Interpreting the Output:

The previous command yields the following output:

 Total info: 14759
   #    len  info  signal  share  location
  46:   346    18   5.20%  0.12%  /usr/local/lib/python2.2/email/Message.py(181:189)
  47:   343    20   5.83%  0.14%  /usr/local/lib/python2.2/email/Message.py(190:198)
   3:   283    19   6.71%  0.13%  /usr/local/lib/python2.2/email/Encoders.py(30:41)
  54:   343    26   7.58%  0.18%  /usr/local/lib/python2.2/email/Message.py(289:299)
  69:  1384   107   7.73%  0.72%  /usr/local/lib/python2.2/email/MIMEImage.py(19:46)
  53:   347    28   8.07%  0.19%  /usr/local/lib/python2.2/email/Message.py(278:288)
  48:   327    28   8.56%  0.19%  /usr/local/lib/python2.2/email/Message.py(199:207)
   4:   303    27   8.91%  0.18%  /usr/local/lib/python2.2/email/Encoders.py(42:53)
  65:  1464   132   9.02%  0.89%  /usr/local/lib/python2.2/email/MIMEAudio.py(43:71)
  70:   271    26   9.59%  0.18%  /usr/local/lib/python2.2/email/MIMEMessage.py(1:14)

The columns are as follows:

# is the number of the chunk as encountered by DupTective; this can be useful for spotting regions that are near each other
len is the length of the region in bytes
info is the number of bytes of information the region contributes to the whole
signal is the ratio of info/len, expressed as a percentage
share is the ratio of info/total, expressed as a percentage
location is the filename and range of lines (inclusive, starting at 1) where the region is located
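To make the column definitions concrete, the sketch below parses one row of the sample report and checks the two ratios against the printed percentages. The regular expression is an assumption based only on the layout of the sample output, and parse_report_line is a name invented for this example.

```python
import re

# Layout assumed from the sample report: "#: len info signal% share% path(first:last)"
REPORT_LINE = re.compile(
    r"\s*(\d+):\s+(\d+)\s+(\d+)\s+([\d.]+)%\s+([\d.]+)%\s+(.+)\((\d+):(\d+)\)\s*$"
)

def parse_report_line(line):
    """Split one report row into named fields; returns None for non-row lines."""
    match = REPORT_LINE.match(line)
    if match is None:
        return None
    chunk, length, info, signal, share, path, first, last = match.groups()
    return {
        "chunk": int(chunk), "len": int(length), "info": int(info),
        "signal": float(signal), "share": float(share),
        "path": path, "first_line": int(first), "last_line": int(last),
    }

row = parse_report_line(
    "46: 346 18 5.20% 0.12% /usr/local/lib/python2.2/email/Message.py(181:189)"
)
total_info = 14759  # from the "Total info" line of the same report

# signal = info / len and share = info / total, both as percentages
assert round(100.0 * row["info"] / row["len"], 2) == row["signal"]
assert round(100.0 * row["info"] / total_info, 2) == row["share"]
```

The same parser gives the file name and line range needed to jump to a region from an editor or script.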

The most important column is signal, and the report is sorted from lowest signal to highest. Low signal corresponds to duplication, either within the region itself, or between the region and some other region somewhere in the source.
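The link between low signal and duplication can be reproduced with any general-purpose compressor: text that repeats something already seen adds very few bytes to the compressed total. The sketch below illustrates that idea under assumptions and is not DupTective's actual algorithm. It uses Python's zlib module rather than the gzip program, and it estimates a region's info as the growth in compressed size when the region is appended to the rest of the source. The names marginal_info and signal_percent, and the shortened sample methods, are invented for this example.

```python
import zlib

def compressed_size(data):
    """Size in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data))

def marginal_info(region, rest):
    """Bytes the region adds to the compressed size of the rest (an assumed measure)."""
    return compressed_size(rest + region) - compressed_size(rest)

def signal_percent(region, rest):
    """Marginal info as a percentage of the region's raw length."""
    return 100.0 * marginal_info(region, rest) / len(region)

keys_src = b'''
    def keys(self):
        """Return a list of all the message's header field names."""
        return [k for k, v in self._headers]
'''
# A near-copy of keys(): only the method name and the selected variable change.
values_src = keys_src.replace(b"keys", b"values").replace(b"[k for", b"[v for")
unrelated_src = b'''
    def attach(self, payload):
        """Append payload to this message, converting a scalar body to a list."""
        if self._payload is None:
            self._payload = payload
        else:
            self._payload = [self._payload, payload]
'''

# The near-copy should score a much lower signal than the unrelated method.
print(signal_percent(values_src, keys_src))
print(signal_percent(unrelated_src, keys_src))
```

The exact percentages depend on the compressor and on how small the inputs are, so only the ordering of the two results is meaningful here.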

Very low signal levels (<5%) usually indicate blatantly obvious duplication. Look nearby in the report for prime suspects, or in the region itself if it's internally repetitive. Signal levels above 20% usually indicate non-duplicate code, at least as far as I can see.

These numbers may depend on the language, the coding standard, the amount of commenting (comments are usually English text, which is around 30% signal), and the average method size. More experience using DupTective will yield better rules of thumb.

Could you post the offending Message.py excerpt and explain what was bad about it? The explanation was a little bit too abstract for me. An example would be great.

Excellent point. Just about any explanation can be enriched by examples. Here is the text of the first two (hence worst two) chunks found above (they happen to be next to each other):

First chunk:

 def keys(self):
     """Return a list of all the message's header field names.

     These will be sorted in the order they appeared in the original
     message, and may contain duplicates.  Any fields deleted and
     re-inserted are always appended to the header list.
     """
     return [k for k, v in self._headers]

Second chunk:

 def values(self):
     """Return a list of all the message's header values.

     These will be sorted in the order they appeared in the original
     message, and may contain duplicates.  Any fields deleted and
     re-inserted are always appended to the header list.
     """
     return [v for k, v in self._headers]

(The parts that vary are the method name, the first line of the docstring, and the variable selected in the list comprehension.)

These two methods clearly duplicate each other quite heavily. As a result, they show up (a) at the top of the list, and (b) near each other in the list (right next to each other, in fact). Now, with something as simple as this, you may not choose to remove the duplication. But you may -- probably by making one method, and passing keys/values as a parameter:

 def headers(self, getValues=0):
     """Return a list of all the message's header names
     (or values, if specified).

     These will be sorted in the order they appeared in the original
     message, and may contain duplicates.  Any fields deleted and
     re-inserted are always appended to the header list.
     """
     return [v[getValues] for v in self._headers]

In fact, there's a third, similar method, right after keys() and values() in the source, but somewhat lower down on DupTective's list:

Seventh chunk:

 def items(self):
     """Get all the message's header fields and values.

     These will be sorted in the order they appeared in the original
     message, and may contain duplicates.  Any fields deleted and
     re-inserted are always appended to the header list.
     """
     return self._headers[:]

The body of this could be rewritten as:

 return [v for v in self._headers]

Then you could rewrite all three methods using a function parameter:

 def mapHeaders(self, f):
     """Return a new list containing the results of applying f()
     to each of the message's headers.

     These will be sorted in the order they appeared in the original
     message, and may contain duplicates.  Any fields deleted and
     re-inserted are always appended to the header list.
     """
     return [f(v) for v in self._headers]

 # Outside the class:
 def get_key(x): return x[0]
 def get_value(x): return x[1]
 def get_item(x): return x

 # so foo.keys()   => foo.mapHeaders(get_key)
 #    foo.values() => foo.mapHeaders(get_value)
 #    foo.items()  => foo.mapHeaders(get_item)

And this might lead you to some other refactorings, such as introducing a Header class, and putting the get_* methods on it, or introducing a Headers class to serve as a container for the headers.
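One possible shape for that follow-on refactoring is sketched below. Header and Headers here are hypothetical classes written for this illustration and are not taken from the email package; the method names are only suggestions.

```python
class Header:
    """A single header field: a name paired with its value (hypothetical class)."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def item(self):
        """Return the (name, value) pair for this header."""
        return (self.name, self.value)


class Headers:
    """Ordered container of Header objects; duplicates are allowed."""

    def __init__(self, pairs=()):
        self._headers = [Header(name, value) for name, value in pairs]

    def map(self, f):
        """Apply f to each header, preserving the original order."""
        return [f(header) for header in self._headers]

    def keys(self):
        return self.map(lambda header: header.name)

    def values(self):
        return self.map(lambda header: header.value)

    def items(self):
        return self.map(Header.item)
```

The shared docstring about ordering and duplicates now needs to be written only once, on the container, and each accessor shrinks to a single expression.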

Another Example:

One-line methods are, of course, somewhat special cases. So let's look at another case of cross-method duplication turned up in our example run -- the sixth and fourth chunks, which are also right next to each other in Message.py:

 def get_main_type(self, failobj=None):
     """Return the message's main content type if present."""
     missing = []
     ctype = self.get_type(missing)
     if ctype is missing:
         return failobj
     parts = ctype.split('/')
     if len(parts) > 0:
         return ctype.split('/')[0]
     return failobj

 def get_subtype(self, failobj=None):
     """Return the message's content subtype if present."""
     missing = []
     ctype = self.get_type(missing)
     if ctype is missing:
         return failobj
     parts = ctype.split('/')
     if len(parts) > 1:
         return ctype.split('/')[1]
     return failobj

These differ only in that the first looks for the first element, and the second looks for the second. Eliminating this duplication is trivial, and I think just about anybody would do it:

 def get_mime_type_part(self, part=0, failobj=None):
     """Return the message's specified content-type part if present."""
     missing = []
     ctype = self.get_type(missing)
     if ctype is missing:
         return failobj
     parts = ctype.split('/')
     if len(parts) > part:
         return ctype.split('/')[part]
     return failobj

 def get_main_type(self, failobj=None): return self.get_mime_type_part(0, failobj)
 def get_subtype(self, failobj=None): return self.get_mime_type_part(1, failobj)

There are other things you might want to improve (such as either using the parts local temp variable instead of the second split, or inlining it into the if expression), but now you only have to improve them in one place.
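As an illustration of the first of those clean-ups, here is a sketch that reuses the parts variable instead of splitting twice. It is written as a standalone function that takes the content-type string directly, so it can be run without a Message object; that signature, the name mime_type_part, and the handling of None are assumptions made for the example.

```python
def mime_type_part(ctype, part=0, failobj=None):
    """Return the given '/'-separated part of a content type, or failobj."""
    if ctype is None:
        return failobj          # no content type available at all
    parts = ctype.split('/')
    if len(parts) > part:
        return parts[part]      # reuse the list instead of splitting again
    return failobj
```

For "text/plain", part 0 is "text" and part 1 is "plain"; asking for part 1 of a bare "text" falls back to failobj.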

I hope these examples give a better idea of what using DupTective is like. I intentionally used code from the Python libraries since you should have them available already on your system (they may be in /usr/lib, instead of /usr/local/lib); I'd encourage you to explore further with your own code. -- GeorgePaci

[Should the order of the above examples be reversed?]

Emacs Integration:

If you use Emacs, I recommend using a keyboard macro similar to the following to jump to the region indicated:

 (defalias 'goto-chunk (read-kbd-macro
  "C-e M-b C-b C-SPC M-b M-w C-b C-SPC C-r SPC C-f C-x C-x M-w C-x C-f C-y RET M-g C-y M-y RET"))
 (local-set-key [f4] 'goto-chunk)

Feature Requests:

specify region beginnings by regular expression (to support other languages)
ignore whitespace
ignore comments (including docstrings)
report on a single region
Eclipse plugin. :)

Platform Experiences:

Works fine with Python 2.2 and gzip 1.3 under RedHatLinux 7.1.

Works fine on Python 2.3 on Windows so long as gzip and wc are in your path (for example, CygWin's).

However, if you replace the implementation of compressedSize() with this:

 def compressedSize(s):
     return len(s.encode('zlib'))

it should work everywhere. timeit() says the zlib version is significantly faster, too (especially on Windows, which has a high process creation cost), and it should give you a better view of the "real" amount of information, since with gzip, you're adding in the size of .gz file headers. --TimLesher
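The 'zlib' string codec used above is a Python 2 feature; in Python 3, calling str.encode('zlib') fails with LookupError. A rough Python 3 equivalent, assuming the function may be handed either text or bytes and that UTF-8 is an acceptable encoding for text, could look like this:

```python
import zlib

def compressedSize(s):
    """Return the zlib-compressed size of s in bytes (Python 3 sketch)."""
    if isinstance(s, str):
        s = s.encode("utf-8")   # assumed encoding for text input
    return len(zlib.compress(s))
```

Like the version above, this runs in-process and does not include .gz file headers in the count.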

CategoryDuplicationFindingTool CategorySoftwareTool CategoryPython