Diff, part 14: Expendability #.NET #Delphi #diff

Last time, on the Diff Chronicles, I brought up a useful optimization called MaxLength. But useful as it is, it's totally impractical to put into practice, because finding the exact MaxLength takes too long.

But what if we could look at two CSes, and know that one's MaxLength is better than the other's, without actually calculating their MaxLengths?

Let's define a new function, called Expendable(CS₁, CS₂). We give it two CSes, and it tells us which of them we can throw away. Sometimes it returns nil, meaning we can't throw either of them away. (Thanks to Sam for suggesting that this be a function — it simplifies the explanation quite a bit.)

Assuming that we can implement the Expendable function, it's simple to plug it into yesterday's algorithm: Every time we come up with a new CS that we want to add to our list, we first loop through our existing CSes, and compare each one to our CS_new. If Expendable(CS_i, CS_new) ever returns CS_i, we delete CS_i from our list, and then move on to CS_i+1. If Expendable(CS_i, CS_new) ever returns CS_new, we stop there, and don't bother adding CS_new to our list. And whenever Expendable(CS_i, CS_new) returns nil, we just keep going.

This is essentially what we did yesterday, but we were using an Expendable function that was implemented in terms of MaxLength. Now we just have to figure out how to implement the Expendable function without using that expensive MaxLength calculation. How do we look at two CSes and figure out that we can eliminate one without hurting our chances of finding the LCS?

Let's look at (you knew this was coming, didn't you?) some examples:

Expendable([0,0] [1,1], [1,1]) => [1,1]. Anything we can add onto one of these, we can add onto the other. But one of them is already longer. It's a sure thing that MaxLength([0,0] [1,1]) = MaxLength([1,1]) + 1; we can tell that without knowing exactly what their MaxLengths are. Therefore, [0,0] [1,1] wins, and [1,1] gets flushed.
Expendable([0,0], [1,1]) => [1,1]. We know that there's a match [1,1] out there somewhere, because it's in the second CS. So we know that the [0,0] will end up growing to [0,0] [1,1] sooner or later. That means that this example is exactly the same thing as the previous example, and we already know that once it grows to that point, [1,1] is going to lose. Why wait that long? We can just cut our losses now, and toss the second CS right away.
Expendable([3,6], [5,2]) => nil. We can't eliminate [3,6] because it might well be part of the LCS. For example, there might be only one more match after this, and that match might be [4,7]. We can't eliminate [5,2] because there might be only one more match, and it might be [6,3]. We can't guarantee that either of these CSes is out of the running, so we can't eliminate either one.
Expendable([0,5] [1,6] [2,7], [3,0]) => nil. It's tempting to say, "We're working towards the longest possible CS, so when we need to throw one away, go for the shorter one." But we've tried that before, and it didn't work. If the next three matches are [4,1], [5,2], and [6,3], then [3,0] will grow into the next big powerhouse, because we can add those matches onto [3,0] to get a length-4 CS, but we can't add them onto [0,5] [1,6] [2,7]. Bottom line, we can't throw away either CS without waiting to see what other matches are yet to come.
Expendable([4,0], [4,4]) => [4,4]. We can guarantee that throwing away [4,4] won't hurt our chances of finding the right LCS. There is no possible way that MaxLength([4,4]) could be larger than MaxLength([4,0]), but there are possible ways that MaxLength([4,0]) could be better; for example, the next match might be [5,2], which would grow [4,0] but would not grow [4,4]. It's not guaranteed that [4,0] will result in a better MaxLength, but it is guaranteed that it won't ever be worse.

The basic concept is something I think of as "growth potential". [0,0] has more room to grow than [1,1] does; [1,1] is already closer to the end of the lists, and you can't build your Lego tower much higher if there's a ceiling in the way. [4,0] has more room to grow than [4,4] does. But whether [3,6] has more room to grow than [5,2] depends on a lot of things, like the size of the input lists, and where the matches fall; so neither one is guaranteed expendable.

Tomorrow, I'll provide an actual algorithm for the Expendable function, and put it to work.