Joe White’s Blog

Life, .NET, and Cats


Diff, part 3: Introducing the Longest Common Subsequence

In the last two articles, I’ve begun outlining what diff is all about. This time, I’ll start to go into how it’s actually done.

It turns out that it’s hard to find out what’s different between two lists, but much easier to find out what’s the same. So, we start by finding the Longest Common Subsequence (LCS), a much easier problem than diff. Once we’ve got the LCS, the diffs are easy. So we can throw away our stupid diff implementation, and use LCS instead… once we figure out how to do an LCS, anyway.

But first, what is a Longest Common Subsequence? I’ll illustrate with these two lists, A and B:

A is (1 2 3)
B is (1 4 5 3)

A subsequence of A is just a subset of A. (1 2) is a subsequence of A; so are (1 2 3), (3), and () (the empty list). However, (3 2) is not a subsequence of A. To be a subsequence, the elements have to stay in their original order. (2 2) isn’t a subsequence of A either; no fair repeating stuff (though it would’ve been fine if there were two 2s in the original).

A common subsequence of A and B would be anything that’s a subsequence of A, and is also a subsequence of B. There are four in this case: (), (1), (3), and (1 3).

The longest common subsequence is just the longest of all possible common subsequences. Here, it’s (1 3).

The LCS tells us what’s the same between A and B. And once we know what’s the same, the diff is pretty straightforward. Actually, any common subsequence would do (our stupid implementation is actually the degenerate case, using () as its common subsequence). But if we use the longest possible common subsequence, we’ll end up with the shortest possible set of diffs, which is exactly what we want.

So, how do we implement LCS? Well, we could build all the possible subsequences of A, check to see which ones are also subsequences of B, and keep track of the longest one we’ve found so far. When we’re done, we’ve got our LCS.

It would work, but it would be stupid. A list of length 3 has 8 possible subsequences. A list of length 10 has 1,024. Length 20? 1,048,576. Length 100? 1,267,650,600,228,229,401,496,703,205,376 possible subsequences. And if you’re comparing files with a few thousand lines? You’ll be there a while.

Hmm. Maybe we need a not-so-stupid LCS algorithm.

14 Responses to “Diff, part 3: Introducing the Longest Common Subsequence”

  1. Marcel Popescu Says:

    Hey, you’re on vacation or something? I need my fix! :)

  2. Marcel Popescu Says:

    Ok, I now realize that that comment was too terse. I really appreciate this series. Please go on. (Now I sound like Anya in "Buffy the Vampire Slayer" - "thank you for your money; please go away now".)

  3. Joe White Says:

    (grin) Hey, good to know someone’s enjoying these diff articles. I actually don’t have anything going on this evening (somewhat of a novelty these days), so I hope to crank out another installment today. Stay tuned…

    (And no need to apologize for your first comment — it was quite amusing, actually! I just hope this doesn’t mean I have to get my blog regulated by the FDA…)

  4. Joe White's Blog Says:

    Diff, part 4: Matches

  5. Joe White's Blog Says:

    Diff, part 5: Match strings

  6. Joe White's Blog Says:

    Diff, part 6: Match strings and common subsequences

  7. Joe White's Blog Says:

    Diff, part 7: String ‘em all together? (1 of 2)

  8. Joe White's Blog Says:

    Diff, part 8: String ‘em all together? (2 of 2)

  9. Joe White's Blog Says:

    Diff, part 9: Try ‘em all? (1 of 2)

  10. Joe White's Blog Says:

    Diff, part 11: Once more, with state

  11. Joe White's Blog Says:

    Diff, part 12: More state, Mr. Sulu

  12. Joe White's Blog Says:

    Diff, part 15: Implementing Expendable

  13. Joe White's Blog Says:

    Diff, part 18: Optimizing the third loop

  14. Joe White's Blog Says:

    Diff, part 19: De-dup by length?

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>


Joe White's Blog copyright © 2004-2008. Portions of the site layout use Yahoo! YUI Reset, Fonts, and Grids.
Proudly powered by WordPress. Entries (RSS) and Comments (RSS).