Diff, part 18: Optimizing the third loop #.NET #Delphi #diff

Our diff series continues, once again focusing on optimization.

On Monday, I sketched out our overall algorithm to date. Last Friday, I showed an Expendable function for eliminating common subsequences that aren't going anywhere. This time, I'll look at an interaction between those two pieces.

Suppose we have three CSes so far:

[1,1]
[1,1] [2,2]
[1,1] [2,2] [3,3]
[1,1] [2,2] [3,3] [4,4]

And let's say the next match in our input is [5,5]. Our algorithm says to try appending the [5,5] to each of the existing CSes, then use Expendable to compare the new CS to existing CSes and trim things back. So we would come up with:

[1,1] [5,5], which gets eliminated because Expendable([1,1] [2,2], [1,1] [5,5]) = [1,1] [5,5]. (See Expendable scenario #10.)
[1,1] [2,2] [5,5], which gets eliminated because Expendable([1,1] [2,2] [3,3], [1,1] [2,2] [5,5]) = [1,1] [2,2] [5,5]. (See scenario #10.)
[1,1] [2,2] [3,3] [5,5], which gets eliminated because Expendable([1,1] [2,2] [3,3] [4,4], [1,1] [2,2] [3,3] [5,5]) = [1,1] [2,2] [3,3] [5,5]. (See scenario #10.)
[1,1] [2,2] [3,3] [4,4] [5,5], which we keep.

Our first four tries wound up getting thrown away, and it looks like a pattern may be forming. But it's hard to tell whether the pattern is related to our specific inputs, or if it's true in general. Will the first few CS_news we generate always be wasted effort? Let's look at it a different way, one that doesn't depend so much on the particular inputs and the current state. Our algorithm-to-date uses Expendable in a specific way: it always passes an existing CS_i as the first parameter to Expendable, and a proposed CS_new as the second parameter. But we're not restricted to using Expendable this way; we can actually call it with any two CSes, including two proposed CS_news. That means we can compare our four new CSes to each other, and anything that drops out is, by definition, less worthy. If we never bother to pass those less-worthy CSes on to the rest of the algorithm, we know, from the way we defined Expendable, that we won't be hurting our chances of finding the longest possible CS. There are several different combinations of ways we could call Expendable on our four CS_news, but let's just call it with successive pairs and see what happens:

Expendable([5,5], [1,1] [5,5]) = [5,5]. (See scenario #5.)
Expendable([1,1] [5,5], [1,1] [2,2] [5,5]) = [1,1] [5,5]. (See scenario #5.)
Expendable([1,1] [2,2] [5,5], [1,1] [2,2] [3,3] [5,5]) = [1,1] [2,2] [5,5]. (See scenario #5.)
Expendable([1,1] [2,2] [3,3] [5,5], [1,1] [2,2] [3,3] [4,4] [5,5]) = [1,1] [2,2] [3,3] [5,5]. (See scenario #5.)

So of our four CS_news, three turned out to be expendable with regards to each other. (This is the same thing we saw above, but this time we've eliminated one variable — we're just looking at the CS_news; the outcome no longer depends on what CSes we already happened to have on tap.) And there seems to be a pattern here: scenario #5 keeps showing up. Let's take a closer look at scenario #5:

#	Len	AI	BI	CS₁	CS₂	Make CS₁ win	Make CS₂ win	Exp
5	<	=	=	`[3,3]`	`[0,0] [3,3]`	None possible	EOF	1

What does this mean in English? It means that if two CSes are different lengths, but the last match is the same in both, the shorter CS can be thrown away.

This is an appealing rule, because it immediately eliminates obvious holes. But it also works in cases without obvious holes: if, for example, we have [0,3] [1,4] [5,5] and [2,0] [3,1] [4,2] [5,5], we have no obvious holes, but Expendable still tells us, with confidence, that we can throw the shorter one away.

This has a very practical application to our algorithm: we don't need to append our new match to every one of our extant CSes. If we just use the longest CS-so-far that the new match will fit onto, we're guaranteed safe — we don't even need to try the others. When we're deciding which existing CS(es) to extend with our new match, only two things matter: the existing CSes' length, and whether we can append our new match to them. After all, once we've appended the same new match onto a bunch of different CSes, they lose their previous identity; now their last matches are all the same, and the only thing that can vary is their length. So scenario 5 kicks in, and all but the longest of this newly-minted batch of CSes will drop out. Why bother with anything but the longest to begin with?

What if there's more than one "longest CS that our match will fit onto"? For example, let's say we already have these CSes:

[0,3]
[4,1]
[4,1] [6,2]

If the next match is [5,5], what do we do? First we look at our longest CS so far, [4,1] [6,2]. Can we append [5,5] to that, and still have a valid CS? No, we can't, so we have to go for the shorter CSes. Can we append [5,5] to [0,3]? Yes. What about [4,1]? Yes. So we now have these CS_news:

[0,3] [5,5]
[4,1] [5,5]

Scenario 5 doesn't help here, because these are both the same length. No, this is a job for scenario #14:

#	Len	AI	BI	CS₁	CS₂	Make CS₁ win	Make CS₂ win	Exp
14	=	=	=	`[3,3]`	`[3,3]`	None possible	None possible	Either

Meaning, if CS₁ and CS₂ are both the same length, and both end in the same match, then we should throw one of them away, and it really doesn't matter which.

With all this in mind, we can make another change to our algorithm's outline, this time to the third loop (with a minor tweak to the fourth):

Loop through all the CSes we have so far, going from longest to shortest. Stop on the first CS that we can construct a longer CS from by appending [i, j].
Loop through all the CSes again, calling Expendable on [CS_k, CS_new] and discarding CSes as needed.

A couple of finer points to mention here:

If we never find a CS that we can extend with [i, j], then we still generate a CS_new. In this case, our CS_new would be of length 1, containing only our new match. (This was true of the algorithm I sketched out on Monday, too; I should've specified this then. For example, if the only CS we have so far is [1,4], and our new match is [4,2], then we add [4,2] to our CSes-so-far list.)
If we have multiple existing CSes that have the same length (e.g., [0,3] [1,4] and [3,0] [4,1]), it doesn't matter which of them comes first in our "longest to shortest" ordering. It's not important whether our iteration visits [0,3] [1,4] or [3,0] [4,1] first, just that we don't try either of them until we've tried everything longer, and that we try both of them before we try anything shorter.

This is a nice optimization. We just took two nested O(N) loops, and made them into two successive O(N) loops. This leg of the algorithm used to be O(N²), and now it's just O(N). Making progress.