Diff, part 15: Implementing Expendable #.NET #Delphi #diff

In yesterday's episode of Diff Quest, I outlined a function called Expendable, which takes two CSes and tells you whether you can throw one of them away. Today I'll describe that function's implementation.

Expendable takes two parameters, CS₁ and CS₂. Each of those is a common subsequence, and therefore consists of a bunch of matches. And each match is a pair of indexes, like [0,1].

The Expendable function is all about the idea of growth potential. Growth potential depends on two things, and two things only: the length of the CS, and the last match in the CS. [0,0] [2,2] has the exact same growth potential as [1,1] [2,2] — the earlier matches don't matter; what matters is what we can tack onto the end.

This means there are three numbers that we care about, when we're thinking about a CS's growth potential: its length; the AIndex of its last match; and the BIndex of its last match. So the inputs to Expendable can be described by six individual numbers, and the different return values of Expendable are determined by the relationship between corresponding pairs:

Is CS₁Length less than, equal to, or greater than CS₂Length?
Is CS₁AIndex less than, equal to, or greater than CS₂AIndex?
Is CS₁BIndex less than, equal to, or greater than CS₂BIndex?

The following table lists all the possible combinations of these. The "Len" column shows "<" if CS₁Length < CS₂Length, "=" if CS₁Length = CS₂Length, and ">" if CS₁Length > CS₂Length. The "AI" column is the same thing for AIndex, and "BI" is BIndex. After that, there's a sample CS₁ and a sample CS₂ that exhibit these relationships (right-justified so the last matches always line up, since it's the last match that's important).

The next two columns are a kind of game, where I try to come up with a set of inputs that would make CS₁ "win" (i.e., inputs that would build CS₁ into a longer subsequence without also growing CS₂), and then a set of inputs that would make CS₂ win. It doesn't matter how contrived those inputs are; what matters is whether those inputs are possible (and sometimes they're not, which is when expendability kicks in). "EOF" means that no further input is required to make that CS win — if we reached end of file, and there were no more matches, then it's already won.

Finally, the last column shows the result of Expendable(CS₁, CS₂).

#	Len	AI	BI	CS₁	CS₂	Make CS₁ win	Make CS₂ win	Exp
1	<	<	<	`[3,3]`	`[0,0] [6,6]`	`[4,4] [5,5]`	EOF	nil
2	<	<	=	`[3,3]`	`[0,0] [6,3]`	`[4,4] [5,5]`	EOF	nil
3	<	<	>	`[3,6]`	`[0,0] [6,3]`	`[4,7] [5,8]`	EOF	nil
4	<	=	<	`[3,3]`	`[0,0] [3,6]`	`[4,4] [5,5]`	EOF	nil
5	<	=	=	`[3,3]`	`[0,0] [3,3]`	None possible	EOF	1
6	<	=	>	`[3,6]`	`[0,0] [3,3]`	None possible	EOF	1
7	<	>	<	`[6,3]`	`[0,0] [3,6]`	`[7,4] [8,5]`	EOF	nil
8	<	>	=	`[6,3]`	`[0,0] [3,3]`	None possible	EOF	1
9	<	>	>	`[6,6]`	`[0,0] [3,3]`	None possible	EOF	1
10	=	<	<	`[3,3]`	`[6,6]`	`[4,4]`	None possible	2
11	=	<	=	`[3,3]`	`[6,3]`	`[4,4]`	None possible	2
12	=	<	>	`[3,6]`	`[6,3]`	`[4,7]`	`[7,4]`	nil
13	=	=	<	`[3,3]`	`[3,6]`	`[4,4]`	None possible	2
14	=	=	=	`[3,3]`	`[3,3]`	None possible	None possible	Either
15	=	=	>	`[3,6]`	`[3,3]`	None possible	`[4,4]`	1
16	=	>	<	`[6,3]`	`[3,6]`	`[7,4]`	`[4,7]`	nil
17	=	>	=	`[6,3]`	`[3,3]`	None possible	`[4,4]`	1
18	=	>	>	`[6,6]`	`[3,3]`	None possible	`[4,4]`	1
19	>	<	<	`[0,0] [3,3]`	`[6,6]`	EOF	None possible	2
20	>	<	=	`[0,0] [3,3]`	`[6,3]`	EOF	None possible	2
21	>	<	>	`[0,0] [3,6]`	`[6,3]`	EOF	`[7,4] [8,5]`	nil
22	>	=	<	`[0,0] [3,3]`	`[3,6]`	EOF	None possible	2
23	>	=	=	`[0,0] [3,3]`	`[3,3]`	EOF	None possible	2
24	>	=	>	`[0,0] [3,6]`	`[3,3]`	EOF	`[4,4] [5,5]`	nil
25	>	>	<	`[0,0] [6,3]`	`[3,6]`	EOF	`[4,7] [5,8]`	nil
26	>	>	=	`[0,0] [6,3]`	`[3,3]`	EOF	`[4,4] [5,5]`	nil
27	>	>	>	`[0,0] [6,6]`	`[3,3]`	EOF	`[4,4] [5,5]`	nil

Row #14 is interesting, in that the Expendable function could meaningfully return either CS₁ or CS₂. That's because there's nothing to differentiate the two inputs — they are, as far as we care, the same. One might be [0,1] [2,2] and the other might be [1,0] [2,2], but they both have the same growth potential, so there's no need to keep both. We can throw one of them away, and it really doesn't matter which one; we can decide arbitrarily.

I spent a while staring at this table, trying to figure out the pattern, because of row #14. I originally had it marked as "nil", but I finally realized that that's not right; we do want to throw away one or the other. Once I figured out that both CS₁ and CS₂ were expendable for that row, the pattern became clear:

If CS₁Length < CS₂Length, and CS₁AIndex >= CS₂AIndex, and CS₁BIndex >= CS₂BIndex, then CS₁ is expendable.
If CS₁Length > CS₂Length, and CS₁AIndex <= CS₂AIndex, and CS₁BIndex <= CS₂BIndex, then CS₂ is expendable.

It's not the most graceful of algorithms, but it's not a hard one to implement.

At this point, I've described all the major ingredients for a practical (better-than-O(2ⁿ)) LCS algorithm; I could (or you could) go ahead and start writing code, and end up with a working LCS library. But I'm not going to start writing code for it, not just yet. Why? Because even though we're now well into the realm of polynomial time, we're still doing more work than we have to. There are a few more sneaky shortcuts in store, and I'll start delving into them next week.