Joe White's Blog Life, .NET, and cats

Diff, part 13: MaxLength and driving directions #.NET #Delphi #diff

Last time in the diff saga, I showed a Longest Common Subsequence (LCS) algorithm with real promise, but a big weakness: it's O(2n). Heat death of the universe, etc. We can improve on that.

Remember that we build longer CSes by starting with shorter ones, and adding stuff onto the end. Suppose we write a new function MaxLength(CSA), which returns the length of the longest CS that starts with CSA. Let's go back to our last example:

A is (1 2 3 4)
B is (1 3 4 2)
Matches are [0,0], [1,3], [2,1], and [3,2].

Let's look at our function MaxLength with some CSes from this input:

  • MaxLength([0,0]) = 3, because [0,0] will grow into length-3 CS [0,0] [2,1] [3,2].
  • MaxLength([0,0] [2,1]) is also 3, because [0,0] [2,1] will grow into that same length-3 CS, [0,0] [2,1] [3,2].
  • MaxLength([0,0] [2,1] [3,2]) is — you guessed it — 3.
  • MaxLength([0,0] [3,2]) = 2, because [0,0] [3,2] has already maxed out; there are no matches we can add at the end to make a longer CS.
  • MaxLength([2,1]) = 2, because [2,1] will grow into [2,1] [3,2].

Now, here's the plan. We follow our stateful LCS algorithm, but for every CS we've accepted so far, we also keep track of its MaxLength. Then, when we come up with a new CS, we calculate its MaxLength, and compare that to the MaxLengths that are already in our list. If the new MaxLength is better than anything in the list, then we throw away the CSes with worse MaxLengths, and add the new one. But if the new MaxLength is worse than anything in the list, then we throw the new one away.

The highest MaxLength will always belong to CSes that are leading subsets of the LCS. That is, if the LCS is [0,0] [2,1] [3,2], then [0,0] will have the highest possible MaxLength — 3, in this case — and so will [0,0] [2,1] and [0,0] [2,1] [3,2]. So if we immediately throw away any CS with a lower MaxLength — [0,0] [1,3], for example, with a MaxLength of a mere 2 — then we'll home right in on our LCS. We won't waste time getting sidetracked, and we won't linger on CSes with obvious holes ([0,0] [3,2] has a MaxLength of 2, so out it goes).

Well, okay, the obvious holes don't play too big a part in an example this short. But they're what makes our previous algorithm take exponential time. Fixing them is what makes the algorithm practical. And with MaxLength, we can do just that. Goody!

Until reality sets in, at any rate.

See, you can't actually know what any of the MaxLengths are until after you've generated all the longer CSes. Which pretty much only happens after you're done running the entire O(2n) LCS algorithm. So you have to go through the algorithm once, without optimizations, so you can run it again, with optimizations. Yeah, that makes a lot of sense.

Driver: "Will we get there faster if I turn left or right at this intersection?"

Navigator: "Well, let's try turning left, and drive until we get there, and see how long it takes. Then we'll come back here, and the second time, we'll turn right at this intersection, and then drive until we get there. Then we'll know for sure which way will get us there fastest, and we can come back here again and take that route."

Okay. So, the MaxLength scheme probably isn't going to work. At all. However, it is a useful concept to build upon, and next time I'll show a variation that's nearly as useful, and not nearly as stupid.