Diff, part 1: A definition, and a stupid implementation #.NET #Delphi #diff

I've once again run into a case where it'd be nice to have a diff library for Delphi and/or C#, so I figure it's time I get around to writing one. As I go, I'll be blogging about the underlying theory, partly to remind myself how it works, and partly just in case anyone might want to read about it. (Blogging seems, after all, to be one big exercise in YAGNI — it's the epitome of writing stuff without having a clue whether anyone's ever going to need it.)

If you don't understand anything I babble about here, please, post a comment. Don't be shy. This stuff starts simple, but gets hairy before long. If you're interested in this stuff, then I don't want to lose you in the details. If anything isn't clear, or if you have a question, or a better example, speak up. I don't bite. (Not strangers, anyway.)

Most of what I know about the diff algorithm comes from reverse-engineering the Perl Algorithm::Diff module from CPAN. It's a very Perlesque implementation with a lot of optimizations, but by reading the code, running the unit tests, finding out that the tests didn't exercise major portions of the code, and eventually adding a whole lot of print statements, I eventually managed to puzzle out how the silly thing worked.

So, let's start at the beginning. What is diff? Well, it's an algorithm that takes two lists (often representing the lines in a text file), and tells you how those lists are different. To be more specific, it results in a list of instructions on how to convert list A into list B. These instructions take the form of "insert" and "delete" instructions. Delete line 42, insert the line "if not Enabled then" at line 87, etc.

Well, given that definition, I could write a diff algorithm in five minutes. If list A is 40 lines, and list B is the same file with 2 extra lines added in the middle (for a total of 42 lines), I could write a diff algorithm that spits out 40 delete instructions and 42 insert instructions.

But that would be stupid.

The main idea is not just to write a set of diffs, but to write the best set of diffs. Specifically, the shortest set of diffs — the fewest number of insertions and deletions that will get you there. If list B really is just list A with two extra lines stuck in the middle, then we really want a diff algorithm that gives us two inserts, and that's it.

Next time, we'll get a little closer, but we'll get there by cheating.