Diff, part 2: Head and tail matches

Last time, I defined the diff algorithm, and gave a stupid implementation of it. Here, I’ll add improvement #1: better results by cheating (only I’ll call it “optimizing”).

We’re still very much in the realm of the mundane, here. But we can improve our diffs, a little, by first looking for head and tail matches, and then doing a stupid diff on what’s left. If the first twenty elements in list A are exactly the same as the first twenty elements in list B, and the last twenty elements in list A are exactly the same as the last twenty elements in list B, then you don’t need to bother comparing anything except what’s in between.

If we start by looking for head and tail matches, and then apply our stupid diff algorithm to what’s left, then (in our simple example, where list B simply added two consecutive lines in the middle) we get the ideal diff results: two inserts.

Of course, this breaks down pretty quickly. Suppose list A is the lines from a source file. List B is the same file, but with a comment added at the beginning, a blank line deleted at the end, and no other changes. Now head and tail matches are no help; our stupid algorithm will still give stupid results.

That doesn’t mean that head and tail matches are worthless. In fact, they’re still very valuable. Looking for head and tail matches is an O(N) operation, which is substantially more efficient than any good diff algorithm is going to be — and reduces the number of lines you need to run through the more expensive (something like O(N²), I think, or O(ND), or something like that) operation, which is a huge win. Head and tail matches aren’t very valuable on their own, but they’re the first step in any reasonable diff implementation, even a stupid one.

Next time, we’ll make things a little more interesting, by turning the lists inside-out.

Leave a Reply

Your email address will not be published. Required fields are marked *