Joe White’s Blog

Life, .NET, and Cats


Diff-idence

Ah… long weekend. (Sigh… cold basement.)

So there’ve been a couple of times at work lately when it would’ve been nice to have some code that would do diffs, on arbitrary lists of things. I’ve been writing tools to semi-automate some complicated refactorings, and it’d be nice to see what the tool changed before I tell it to save and check in. There’s also an app my company sells that might benefit from some proper diff support (although it remains to be seen whether the customers actually want that).

I’ve been curious about the diff algorithm for years, and on several occasions I’ve searched for Delphi code that will calculate diffs. No luck yet. So a year or two ago, I went hunting for a diff algorithm that I could translate into Delphi.

I looked at GNU diff, but I wasn’t too excited about reading C source code. But I had a hunch that somebody had probably written this for Perl, and I wasn’t disappointed; a search of CPAN quickly turned up Algorithm::Diff.

It’s an interesting algorithm. A naïve approach to calculating diffs would take exponential time; but Algorithm::Diff is guaranteed to return the best possible set of diffs, and it runs in something like quadratic time, or possibly a little better. Its implementation is really pretty simple, once you get past the optimizations (including an if statement marked “optimization: most of the time this will be true” that never gets hit once by the test suite).

Once I did get past the optimizations, it wasn’t all that hard to understand how it works. Trying to figure out why it works took a little longer. In the end, the algorithm is pretty amazing — there are four or five seemingly-unrelated design decisions that all fit together, and the thing wouldn’t work unless every one of those things was just so. Hard to imagine how somebody figured this all out to begin with, but the end result certainly is slick.

Now I just need to sit down and code the algorithm. Sigh. (Although I’m thinking I can take a shortcut that the Perl code didn’t take, and make the code both a little simpler in a couple of spots, and a little more memory-efficient.)

Yes, I know there already two articles on CodeProject about how to write a diff algorithm in C#. But it doesn’t look like either one uses the same algorithm as the Perl code did, and I suspect Algorithm::Diff might end up being a better algorithm. (And, of course, the CodeProject articles don’t have Delphi code… although I’ll probably write a C# implementation too, mainly because Delphi is godawful expensive these days, so I don’t have my own copy of it at home.)

So here’s the question for you guys. I know I’m supposed to be writing a killer search tool, but what would y’all think about me writing a few articles picking apart the diff algorithm? Presumably I’d write some code along the way, and end up with a working diff algorithm in C# and/or Delphi (ideally both).

If I wrote about diff, would anybody read it? Let me know what you think.

13 Responses to “Diff-idence”

  1. Marcel Popescu Says:

    At least I would :-) I was always somewhat curious on the inner workings of a diff program (but never curious enough to actually start researching it, as you did).

  2. Dean Hill Says:

    Look no further. I have attached a URL with some Delphi code by the brilliant Angus Johnson with a full Delphi example application that is very cool.

    http://users.on.net/johnson/delphi/textdiff.html

  3. Dean Hill Says:

    He also includes an algorithm and paper in PDF form.

  4. Joe White Says:

    Hmm, interesting. TextDiff looks like it uses yet a different algorithm. I’ll have to play around with it a bit (and clean up the code — I don’t like that Application.ProcessMessages call).

  5. Dean Hill Says:

    Something that may also be of interest is the KOL Updater and Update Maker. Although they use an algorithm that is primarily for binary files it may be of use as this is something that Angus’s algorithm is very slow at doing.

    http://bonanzas.rinet.ru/e_tools.htm

    or the KOL download page

    http://bonanzas.rinet.ru/e_downloads.htm

  6. Danny Thorpe Says:

    It’s not what you’re looking for, but since you phrased your wish so broadly, I’ll point out that the TList class has an Assign method that takes an operator parameter to control how the two lists are combined. You can perform boolean set logic using this little routine - unions, intersections, unique items, etc.

    -Danny

  7. Joe White Says:

    Hmm, thanks for reminding me about that. I don’t have a copy of Delphi nearby at the moment, so I’ll have to look at it later. I’m not sure how useful it would be for things like comparing two text files line-by-line - you need more than just "what’s in one but not the other", since relative order matters too. But those operators are something I don’t think about often enough; I’m sure they’d be handy if I studied them a bit more.

  8. Brad White Says:

    Yeah, I’m sure that will be useful.

  9. Brad White Says:

    I use CSDiff from ComponentSoftware. They also have HTMLDiff, which is also useful for XML, and ExcelDiff.

    http://www.componentsoftware.com/products/CSDiff/index.htm

  10. Blake Hollingsworth Says:

    I would love to read about it.

  11. Blake Hollingsworth Says:

    Actually I just found a C# implementation

    http://taubz.for.net/code/diff/

  12. Angus Johnson Says:

    <<clean up the code — I don’t like that Application.ProcessMessages call

    >>

    re - http://users.on.net/~johnson/delphi/textdiff.html

    Joe, I’m very open to suggestions for improvements to any of my components.

    angusj[at]angusj[.]com.

  13. Paul Chung Says:

    Is the diff algorithm using "Dynamic Programming" paradaigm?

    sorry I have never tried digging into the code and check how GNU diff works. I just guess it would be a recursive algorithm computing from group up instead of ordinary top down recursion. (i.e. Dynamic Programming in Computer Science)

    Am I right?

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>


Joe White's Blog copyright © 2004-2008. Portions of the site layout use Yahoo! YUI Reset, Fonts, and Grids.
Proudly powered by WordPress. Entries (RSS) and Comments (RSS).