Diff, part 3: Introducing the Longest Common Subsequence
In the last two articles, I’ve begun outlining what diff is all about. This time, I’ll start to go into how it’s actually done.
It turns out that it’s hard to find out what’s different between two lists, but much easier to find out what’s the same. So, we start by finding the Longest Common Subsequence (LCS), a much easier problem than diff. Once we’ve got the LCS, the diffs are easy. So we can throw away our stupid diff implementation, and use LCS instead… once we figure out how to do an LCS, anyway.
But first, what is a Longest Common Subsequence? I’ll illustrate with these two lists, A and B:
A is (1 2 3)
B is (1 4 5 3)
A subsequence of A is just a subset of A. (1 2) is a subsequence of A; so are (1 2 3), (3), and () (the empty list). However, (3 2) is not a subsequence of A. To be a subsequence, the elements have to stay in their original order. (2 2) isn’t a subsequence of A either; no fair repeating stuff (though it would’ve been fine if there were two 2s in the original).
A common subsequence of A and B would be anything that’s a subsequence of A, and is also a subsequence of B. There are four in this case: (), (1), (3), and (1 3).
The longest common subsequence is just the longest of all possible common subsequences. Here, it’s (1 3).
The LCS tells us what’s the same between A and B. And once we know what’s the same, the diff is pretty straightforward. Actually, any common subsequence would do (our stupid implementation is actually the degenerate case, using () as its common subsequence). But if we use the longest possible common subsequence, we’ll end up with the shortest possible set of diffs, which is exactly what we want.
So, how do we implement LCS? Well, we could build all the possible subsequences of A, check to see which ones are also subsequences of B, and keep track of the longest one we’ve found so far. When we’re done, we’ve got our LCS.
It would work, but it would be stupid. A list of length 3 has 8 possible subsequences. A list of length 10 has 1,024. Length 20? 1,048,576. Length 100? 1,267,650,600,228,229,401,496,703,205,376 possible subsequences. And if you’re comparing files with a few thousand lines? You’ll be there a while.
Hmm. Maybe we need a not-so-stupid LCS algorithm.
March 31st, 2005 at 1:51 am
Hey, you’re on vacation or something? I need my fix!
March 31st, 2005 at 1:52 am
Ok, I now realize that that comment was too terse. I really appreciate this series. Please go on. (Now I sound like Anya in "Buffy the Vampire Slayer" - "thank you for your money; please go away now".)
March 31st, 2005 at 4:33 am
(grin) Hey, good to know someone’s enjoying these diff articles. I actually don’t have anything going on this evening (somewhat of a novelty these days), so I hope to crank out another installment today. Stay tuned…
(And no need to apologize for your first comment — it was quite amusing, actually! I just hope this doesn’t mean I have to get my blog regulated by the FDA…)
March 31st, 2005 at 7:41 pm
Diff, part 4: Matches
April 1st, 2005 at 6:11 am
Diff, part 5: Match strings
April 2nd, 2005 at 5:52 am
Diff, part 6: Match strings and common subsequences
April 5th, 2005 at 7:08 am
Diff, part 7: String ‘em all together? (1 of 2)
April 6th, 2005 at 5:49 am
Diff, part 8: String ‘em all together? (2 of 2)
April 7th, 2005 at 6:42 am
Diff, part 9: Try ‘em all? (1 of 2)
April 11th, 2005 at 6:22 am
Diff, part 11: Once more, with state
April 12th, 2005 at 6:03 am
Diff, part 12: More state, Mr. Sulu
April 15th, 2005 at 6:28 am
Diff, part 15: Implementing Expendable
April 20th, 2005 at 7:11 am
Diff, part 18: Optimizing the third loop
April 21st, 2005 at 6:41 am
Diff, part 19: De-dup by length?