Joe White's Blog Life, .NET, and cats

Diff, part 9: Try 'em all? (1 of 2) #.NET #Delphi #diff

So in the last couple of diff articles, I've made a wild stab at a Longest Common Subsequence (LCS) algorithm. Unfortunately, it didn't work all the time, which makes it pretty silly as an algorithm.

The problem with that algorithm was the linear approach it took. It looked at one match at a time, and after each match, it expected to know the one-and-only longest-yet common subsequence, and be able to ignore everything else.

And it doesn't work that way. Let's go back to the example at the end of the last article:

A is (1 2 3 4 5 6 7)
B is (6 7 3 4 5 1 2)
Matches are [0,5], [1,6], [2,2], [3,3], [4,4], [5,0], and [6,1].

With the above inputs, we want our algorithm to come up with an LCS of [2,2] [3,3] [4,4]. So after it got to the [2,2], we wanted it somehow to know that it should keep that [2,2] instead of the [0,5] [1,6] it had come up with so far. But what if it had this input instead?

A is (1 2 3)
B is (4 5 3 6 7 1 2)
Matches are [0,5], [1,6], and [2,2].

The first three matches are the same, but this time, after it processes the [2,2], we expect it to somehow know that it should throw the [2,2] away and keep the [0,5] [1,6], because the latter is our LCS. So with the same inputs, we expect different results. The algorithm, as described so far, doesn't have enough information to make the right decision.

What it comes down to is, when the algorithm has [0,5] [1,6] in its register, and it sees the [2,2], there is no reasonable way to decide which one is better. We need an algorithm that will try them both, and see which one yields the longest common subsequence.

Well, does that mean we're back to generating all possible match strings? Here I'll take a page from Raymond Chen's book, and ask, "What would the world be like if we did try all the possible match strings?"

If we try all the possible combinations (permutations? someday I'm going to have to figure out the difference) of 3 matches — let's call them match a, match b, and match c — we'll end up with 15 match strings:

  • 3 match strings of length 1: a, b, c.
  • 6 match strings of length 2: ab, ac, ba, bc, ca, cb.
  • 6 match strings of length 3: abc, acb, bac, bca, cab, cba.

I won't spend a lot of time on the math, but this is 3 + 3*2 + 3*2*1 total match strings. With 4 matches, we would have 4 + 4*3 + 4*3*2 + 4*3*2*1 total match strings. With N matches, we would have N + (N)(N-1) + (N)(N-1)(N-2) + ... + (N)(N-1)(N-2)...(2)(1). Here are the actual numbers:

  • 3 matches: 15 match strings
  • 4 matches: 64 match strings
  • 5 matches: 325 match strings
  • 6 matches: 1,956 match strings
  • 10 matches: 9,864,100 match strings
  • 20 matches: way too damn many match strings

We're looking at a super-exponential progression. That's right: worse than 2n. So if we had to start by building all the possible match strings, then a diff of two 1000-line files would take longer — much longer — than cracking a 1000-bit symmetric encryption key by brute force. (As far as I know, 128-bit keys are still considered uncrackable, and each additional bit doubles the cracking time.) And when it finished, sometime after the heat death of the universe, you really wouldn't still care about the diff results anyway.

I'd say there's a strong possibility that there's a more efficient way to do it.