Diff, part 3: Introducing the Longest Common Subsequence

In the last two articles, I’ve begun outlining what diff is all about. This time, I’ll start to go into how it’s actually done.

It turns out that it’s hard to find out what’s different between two lists, but much easier to find out what’s the same. So, we start by finding the Longest Common Subsequence (LCS), a much easier problem than diff. Once we’ve got the LCS, the diffs are easy. So we can throw away our stupid diff implementation, and use LCS instead… once we figure out how to do an LCS, anyway.

But first, what is a Longest Common Subsequence? I’ll illustrate with these two lists, A and B:

A is (1 2 3)
B is (1 4 5 3)

A subsequence of A is just a subset of A. (1 2) is a subsequence of A; so are (1 2 3), (3), and () (the empty list). However, (3 2) is not a subsequence of A. To be a subsequence, the elements have to stay in their original order. (2 2) isn’t a subsequence of A either; no fair repeating stuff (though it would’ve been fine if there were two 2s in the original).

A common subsequence of A and B would be anything that’s a subsequence of A, and is also a subsequence of B. There are four in this case: (), (1), (3), and (1 3).

The longest common subsequence is just the longest of all possible common subsequences. Here, it’s (1 3).

The LCS tells us what’s the same between A and B. And once we know what’s the same, the diff is pretty straightforward. Actually, any common subsequence would do (our stupid implementation is actually the degenerate case, using () as its common subsequence). But if we use the longest possible common subsequence, we’ll end up with the shortest possible set of diffs, which is exactly what we want.

So, how do we implement LCS? Well, we could build all the possible subsequences of A, check to see which ones are also subsequences of B, and keep track of the longest one we’ve found so far. When we’re done, we’ve got our LCS.

It would work, but it would be stupid. A list of length 3 has 8 possible subsequences. A list of length 10 has 1,024. Length 20? 1,048,576. Length 100? 1,267,650,600,228,229,401,496,703,205,376 possible subsequences. And if you’re comparing files with a few thousand lines? You’ll be there a while.

Hmm. Maybe we need a not-so-stupid LCS algorithm.