How does a diff algorithm work?
There isn't one true diff algorithm, but several with different characteristics. The basic idea is to find a 'modification script' that will turn Text A into Text B, using modification operations such as insertion and deletion. Usually you'd want the minimal number of changes required, but some applications also have to weigh operations depending on how much text each one affects, or on other factors.
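As a concrete illustration (my own example, not from any of the implementations discussed below), Python's standard difflib module, which uses a roughly Ratcliff/Obershelp-style matcher rather than the algorithms cited later, will print such a modification script:

  import difflib

  a = "the quick brown fox jumps over the lazy dog".split()
  b = "the quick red fox leaps over the dog".split()

  # get_opcodes() yields (tag, i1, i2, j1, j2): replace/delete/insert/equal
  # operations that turn a[i1:i2] into b[j1:j2].
  for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
      print(f"{tag:7} {a[i1:i2]!r:30} -> {b[j1:j2]!r}")

Applying those operations in order to Text A yields Text B.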
With a bit of further literature searching (well within the abilities of a university student), these comprehensive papers are easy to find:
Hunt-McIlroy is considerably faster for line-by-line comparison than the "folklore" dynamic programming algorithm that has been rediscovered too many times to trace the original discoverer, even though it's often called the WagnerFisherAlgorithm. Further discussion over at http://wiki.tcl.tk/3108.
Applications for diff algorithms include comparison of amino acid sequences (answering the question 'how far apart are they?'), Wikis, storage reduction in version control systems, historical analysis of different copies of old texts (say, the Magna Carta), and use by the "patch" command to update source files without needing to replace the whole file.
From the comments of one typical diff implementation, the source used by the cvs program, at: http://cvs.sourceforge.net/viewcvs.py/*checkout*/cvsgui/cvsgui/cvs-1.10/diff/analyze.c?&rev=1.1.1.3
The basic algorithm is described in: "An O(ND) Difference Algorithm and its Variations", Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, pp. 251-266; see especially section 4.2, which describes the variation used below. Unless the --minimal option is specified, this code uses the TOO_EXPENSIVE heuristic, by Paul Eggert, to limit the cost to O(N**1.5 log N) at the price of producing suboptimal output for large inputs with many differences.
The basic algorithm was independently discovered as described in: "Algorithms for Approximate String Matching", E. Ukkonen, Information and Control Vol. 64, 1985, pp. 100-118.
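To make the greedy search in the Myers paper concrete, here is a minimal Python sketch (mine, not the CVS/GNU code quoted above) of the forward pass. It computes only D, the length of the shortest edit script; recovering the actual script requires keeping the trace of V arrays described in the paper:

  def shortest_edit_script_length(a, b):
      # Greedy forward pass of Myers' O(ND) algorithm.  Returns D, the
      # minimum number of insertions + deletions that turn a into b.
      n, m = len(a), len(b)
      v = {1: 0}  # v[k]: furthest-reaching x on diagonal k (k = x - y)
      for d in range(n + m + 1):
          for k in range(-d, d + 1, 2):
              # Extend from whichever neighbouring diagonal reaches further.
              if k == -d or (k != d and v[k - 1] < v[k + 1]):
                  x = v[k + 1]        # step down  (an insertion from b)
              else:
                  x = v[k - 1] + 1    # step right (a deletion from a)
              y = x - k
              # Follow the diagonal "snake" over matching elements.
              while x < n and y < m and a[x] == b[y]:
                  x, y = x + 1, y + 1
              v[k] = x
              if x >= n and y >= m:
                  return d
      return n + m

For two mostly-similar inputs D stays small, which is why the algorithm is fast in the common case.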
Note: this wiki just calls the Unix diff command, which implements such an algorithm. For a discussion of how this wiki uses it, see QuickDiff
The core of diff algorithms seeks to compare two sequences and to discover how the first can be transformed into the second by a sequence of operations using the primitives delete-subsequence and insert-subsequence. If a delete and an insert coincide on the same range then it can be labeled as a change-subsequence. The user doesn't always want to be informed as to which subsequences remain identical, but this information is nonetheless typically computed as a by-product of the basic algorithm.
The sophisticated way to find those editing operations is to sort the two sequences by the values of their elements; when the basic unit is a text line, a hashcode of each line is usually used to represent it, and the sequences are sorted by hashcode in order to discover groups of identical elements/lines. Once identical elements are found, they are clumped together into runs that were adjacent in the pre-sorted sequences. One common variant looks for the longest possible such runs, which is an O(N^2) approach, but with N equal to the number of candidates in each group rather than the total number of elements/lines, so this is usually acceptably fast.
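A Python sketch of that clumping step (names are mine; a dictionary stands in for the sort-by-hashcode pass, mapping each distinct line to its positions in the second file), finding the single longest shared run of lines:

  def longest_matching_run(a, b):
      # Equivalence classes: each distinct line -> all of its positions in b.
      # (Hashing the lines and sorting by hashcode achieves the same grouping.)
      positions = {}
      for j, line in enumerate(b):
          positions.setdefault(line, []).append(j)

      best_i = best_j = best_size = 0
      run_ending_at = {}   # j -> length of the matching run ending at b[j]
      for i, line in enumerate(a):
          new_runs = {}
          for j in positions.get(line, ()):
              size = run_ending_at.get(j - 1, 0) + 1
              new_runs[j] = size
              if size > best_size:
                  best_i, best_j, best_size = i - size + 1, j - size + 1, size
          run_ending_at = new_runs
      return best_i, best_j, best_size

Applying this recursively to the regions before and after the best run yields all the matching regions (roughly what Python's difflib does internally, minus its junk heuristics).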
At this stage all matching regions between the two sequences/files have been found. Running sequentially through the elements/lines that have not been matched now trivially indicates which regions have been inserted, deleted, or changed. Usually moved regions are indicated as an insert and a delete, but with some extra effort (not typically done by most diff programs) they can be reported as moved.
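Continuing the sketch: once the matching regions are known (for example as (i, j, size) blocks found by repeatedly applying longest_matching_run above), classifying the gaps is a short loop; the empty block appended at the end is just a sentinel:

  def classify_gaps(a, b, blocks):
      # blocks: non-overlapping matching blocks [(i, j, size), ...] in order.
      ops, ai, bi = [], 0, 0
      for i, j, size in blocks + [(len(a), len(b), 0)]:   # sentinel at the end
          if ai < i and bi < j:
              ops.append(("change", a[ai:i], b[bi:j]))
          elif ai < i:
              ops.append(("delete", a[ai:i], []))
          elif bi < j:
              ops.append(("insert", [], b[bi:j]))
          ai, bi = i + size, j + size
      return ops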
Actual source code in C can of course be found in, e.g., the GNU version of diff or the one from Version 7 Unix, both available online.
The smallest versions of this algorithm can be implemented in less space than the above text of this article; it's less complicated than it may sound. -- DougMerritt
One aspect of 'diff' is that many times the user of the diff output is not actually interested in the differences between the two strings/files/texts/binaries. What they are more interested in is how similar the two things are.
An example of this is attempting to find plagiarism in written text, or code that was re-used in another program.
While 'diff' is typically used for this at the moment, it probably isn't the right type of algorithm to use.
This becomes especially apparent when the two strings/files/texts/binaries are mostly different, but share a small common, or 'nearly' common, part or parts. The individual parts are usually in blocks or groups, with only small segments (such as 'identifiers', or people's names) changed.
To add to this, the larger blocks of similar text may not be in the same order, but shuffled in sequence! And it is when that happens that the normal 'diff' algorithm breaks down.
'diff' is designed to compare things that are mostly the same, with small amounts of 'modifications'.
What can you use to deal with strings that have been 'shuffled', or which are 'mostly different'?
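One possible answer, sketched rather than taken from any of the tools above: when block order doesn't matter, compare unordered sets of small overlapping chunks ('shingles') instead of computing an edit script; shared blocks still overlap heavily even after being shuffled, which is closer to what plagiarism detectors do. The function name and the 5-word window below are arbitrary choices of mine:

  def shingle_similarity(text_a, text_b, k=5):
      # Jaccard similarity over k-word shingles: insensitive to block order.
      def shingles(text):
          words = text.split()
          return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
      sa, sb = shingles(text_a), shingles(text_b)
      return len(sa & sb) / len(sa | sb)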
I'd love to see someone implement diff in a modern language like Haskell.
http://hackage.haskell.org/package/Diff
The book that I learned Haskell from uses diff as an example of DynamicProgramming.
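For reference (a sketch in Python rather than Haskell), the dynamic-programming core is just the classic LCS table; walking the filled table backwards recovers the actual diff:

  def lcs_length(a, b):
      # Classic O(N*M) dynamic-programming LCS table (the 'folklore' algorithm).
      table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i, x in enumerate(a, 1):
          for j, y in enumerate(b, 1):
              if x == y:
                  table[i][j] = table[i - 1][j - 1] + 1
              else:
                  table[i][j] = max(table[i - 1][j], table[i][j - 1])
      return table[len(a)][len(b)]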
My favorite mode of the diff program I'm currently using puts these 2 characters in front of each line in the output:
If I don't see any (non-blank) lines that start with "<!", then I can be confident that nothing was deleted (shuffled around a bit, perhaps).
Here's a link to the popular GNU diff algorithm written for Java: http://www.bmsi.com/java/#diff. -- RobertDiFalco
There is certainly value in sharing that link, but I'm a little disappointed that, by implication, you're going with the notion that diff algorithms should just be a canned black box, like abs() or sqrt().
Do you think every programmer should RewriteCodeFromScratch rather than use this canned diff implementation?
Certainly. :-) No, that's not the point. Part of the point is that almost no-one looks inside the black box, but almost everyone is at times puzzled by the diff output. Understanding the algorithm lets you see that quite a few aspects of the output are chosen arbitrarily relative to the purpose it is used for, and that customization is possible to give output better suited to the task at hand.
One of the problems of code reuse and copy/paste programming is the propagation and proliferation of bugs and defects. Consider a very popular Unix tool that got branched and forked so much that every similar tool is based on or uses code from the same ancestry. Now consider what happens when the original code had some hidden bug. Finding another tool might be crucial in some cases. Consider data recovery and/or data-centric tools. Having multiple different implementations gives the user choice. And choice counts.
It's also true that most people are afraid to open up the black box, and so they suffer from a complete lack of an appropriate diff algorithm, when actually it is quite easy to write a custom one that works on the problem at hand.
Another reason would be to learn by example or hands-on experience. Yes, you need to write a linked list to learn how it works. Simply using one doesn't give you that experience, and there is a lot to be learned from (re)implementing and (re)solving problems and already-invented tools/solutions/programs/utilities/etc.
This comes up all the time; the problem is that the 30-year-old diff only operates on units of lines. If you want an XML diff or a binary executable diff, which people do, all the time, then you're out of luck. But it's a scary black box, so that's where it stops. Or if they're lucky, they find some vendor that sells a special diff for that purpose. For Windows. Too bad if you're on a different OS.
People are so cavalier about acting like OnceAndOnlyOnce means it's bad to reinvent the wheel, that once something has been done, it should never be done again. Some people elsewhere on this wiki were speaking contemptuously of anyone who writes a linked list routine. Give me a break; there are over 200 different styles of linked lists. Show me a canned implementation of every flavor.
I mentioned sqrt() - actually, I do in fact need to reimplement it once in a while, too, e.g. because I need a fast integer square root, not part of most standard language libraries. Or because I've implemented a new numeric type. Or because, although I'm using floating point, I only want about 3 bits of precision and I need it to be as fast as possible. The people who sneer most contemptuously at reinvention are just lucky that their own careers have been so focused that these needs haven't come up for them, but that doesn't mean such needs don't exist (InAllMyYearsIveNever).
What I'd really like is a language that assists in reinventing the wheel, so that I waste as little time on it as possible.
The point is that you can't pre-can every conceivable kind of algorithm. It's a fine line between avoiding unnecessary wheel reinvention versus being afraid to create a new wheel.
If someone comes to this page because they're interested in diffing two binary executables, then the algorithm is potentially useful to them; the GNU diff is not.
Notice that I posted code, not a library. That is so people could see how a diff algorithm can be coded in Java. -- RobertDiFalco
Diff algorithms are interesting. A lot of current programs still use the line-by-line variation (go back and forth through each file, finding a line that matches, "hook" there, and try to "hook" again). The old Hunt-McIlroy algorithm is a lot better and actually finds the LCS, unlike the line-by-line variation, which only approximates it (but is still "correct" in the sense that it finds the differences, just not as minimal a set of them).
Basically, the old Hunt-McIlroy algorithm is better for files which have a lot of differences in them (and is used in the current OpenBSD diff and old UNIX ones), while the Myers one with its "snakes" is better for files which don't have many differences, and thus better for general cases like CVS diff (it is used in GNU diff as well).
BTW those papers, despite being only about 11 pages, are extremely dense reading and will probably take you around a month of study to fully understand the stuff.
There are several implementations out there:
For the Myers-GNU Diff variation, there's http://www.ioplex.com/~miallen/libmba/dl/src/diff.c (BSD licensed) and of course GNU diff itself.
For the Hunt version, there's the public domain diff from the comp.sources.misc archive (which I use in WikiServer).
For the dynamic programming variations (line-by-line), well, there are countless implementations of this, several at the comp.sources.misc archive.
P.S. I've read the Myers paper "An O(ND) Difference Algorithm and its Variations" (you can see it at http://www.xmailserver.org/diff2.pdf), which has the same algorithm, but does anyone know where I can see a copy of "A file comparison program" on the net?
-- RyanNorton
Here's a quick-and-dirty way to do a rough compare. Load the text lines or paragraphs of two documents into two database tables, each with a "Text" column and one row per line or paragraph. Then run a query something like this:
  SELECT Text, COUNT(Text)
  FROM (
      SELECT Text, 1 AS Src FROM tableOne GROUP BY Text
      UNION ALL
      SELECT Text, 2 AS Src FROM tableTwo GROUP BY Text
  )
  GROUP BY Text
  HAVING COUNT(Text) <> 2
You may have to adjust it a bit for different dialects, such as adding explicit aliases.
It does not give you sequence info and does not distinguish between duplicate lines within the same text. But it tells you the text of the differences.
Reader exercise: Use nested queries and SUM(...) to also factor in line quantity occurrences.
--top
See also: