Joe White's Blog Life, .NET, and cats

Diff, part 15: Implementing Expendable #.NET #Delphi #diff

In yesterday's episode of Diff Quest, I outlined a function called Expendable, which takes two CSes and tells you whether you can throw one of them away. Today I'll describe that function's implementation.

Expendable takes two parameters, CS1 and CS2. Each of those is a common subsequence, and therefore consists of a bunch of matches. And each match is a pair of indexes, like [0,1].

The Expendable function is all about the idea of growth potential. Growth potential depends on two things, and two things only: the length of the CS, and the last match in the CS. [0,0] [2,2] has the exact same growth potential as [1,1] [2,2] — the earlier matches don't matter; what matters is what we can tack onto the end.

This means there are three numbers that we care about, when we're thinking about a CS's growth potential: its length; the AIndex of its last match; and the BIndex of its last match. So the inputs to Expendable can be described by six individual numbers, and the different return values of Expendable are determined by the relationship between corresponding pairs:

  • Is CS1Length less than, equal to, or greater than CS2Length?
  • Is CS1AIndex less than, equal to, or greater than CS2AIndex?
  • Is CS1BIndex less than, equal to, or greater than CS2BIndex?

The following table lists all the possible combinations of these. The "Len" column shows "<" if CS1Length < CS2Length, "=" if CS1Length = CS2Length, and ">" if CS1Length > CS2Length. The "AI" column is the same thing for AIndex, and "BI" is BIndex. After that, there's a sample CS1 and a sample CS2 that exhibit these relationships (right-justified so the last matches always line up, since it's the last match that's important).

The next two columns are a kind of game, where I try to come up with a set of inputs that would make CS1 "win" (i.e., inputs that would build CS1 into a longer subsequence without also growing CS2), and then a set of inputs that would make CS2 win. It doesn't matter how contrived those inputs are; what matters is whether those inputs are possible (and sometimes they're not, which is when expendability kicks in). "EOF" means that no further input is required to make that CS win — if we reached end of file, and there were no more matches, then it's already won.

Finally, the last column shows the result of Expendable(CS1, CS2).

# Len AI BI CS1 CS2 Make CS1 win Make CS2 win Exp
1 < < <       [3,3] [0,0] [6,6] [4,4] [5,5] EOF nil
2 < < =       [3,3] [0,0] [6,3] [4,4] [5,5] EOF nil
3 < < >       [3,6] [0,0] [6,3] [4,7] [5,8] EOF nil
4 < = <       [3,3] [0,0] [3,6] [4,4] [5,5] EOF nil
5 < = =       [3,3] [0,0] [3,3] None possible EOF 1
6 < = >       [3,6] [0,0] [3,3] None possible EOF 1
7 < > <       [6,3] [0,0] [3,6] [7,4] [8,5] EOF nil
8 < > =       [6,3] [0,0] [3,3] None possible EOF 1
9 < > >       [6,6] [0,0] [3,3] None possible EOF 1
10 = < <       [3,3]       [6,6] [4,4] None possible 2
11 = < =       [3,3]       [6,3] [4,4] None possible 2
12 = < >       [3,6]       [6,3] [4,7] [7,4] nil
13 = = <       [3,3]       [3,6] [4,4] None possible 2
14 = = =       [3,3]       [3,3] None possible None possible Either
15 = = >       [3,6]       [3,3] None possible [4,4] 1
16 = > <       [6,3]       [3,6] [7,4] [4,7] nil
17 = > =       [6,3]       [3,3] None possible [4,4] 1
18 = > >       [6,6]       [3,3] None possible [4,4] 1
19 > < < [0,0] [3,3]       [6,6] EOF None possible 2
20 > < = [0,0] [3,3]       [6,3] EOF None possible 2
21 > < > [0,0] [3,6]       [6,3] EOF [7,4] [8,5] nil
22 > = < [0,0] [3,3]       [3,6] EOF None possible 2
23 > = = [0,0] [3,3]       [3,3] EOF None possible 2
24 > = > [0,0] [3,6]       [3,3] EOF [4,4] [5,5] nil
25 > > < [0,0] [6,3]       [3,6] EOF [4,7] [5,8] nil
26 > > = [0,0] [6,3]       [3,3] EOF [4,4] [5,5] nil
27 > > > [0,0] [6,6]       [3,3] EOF [4,4] [5,5] nil

Row #14 is interesting, in that the Expendable function could meaningfully return either CS1 or CS2. That's because there's nothing to differentiate the two inputs — they are, as far as we care, the same. One might be [0,1] [2,2] and the other might be [1,0] [2,2], but they both have the same growth potential, so there's no need to keep both. We can throw one of them away, and it really doesn't matter which one; we can decide arbitrarily.

I spent a while staring at this table, trying to figure out the pattern, because of row #14. I originally had it marked as "nil", but I finally realized that that's not right; we do want to throw away one or the other. Once I figured out that both CS1 and CS2 were expendable for that row, the pattern became clear:

  • If CS1Length < CS2Length, and CS1AIndex >= CS2AIndex, and CS1BIndex >= CS2BIndex, then CS1 is expendable.
  • If CS1Length > CS2Length, and CS1AIndex <= CS2AIndex, and CS1BIndex <= CS2BIndex, then CS2 is expendable.

It's not the most graceful of algorithms, but it's not a hard one to implement.

At this point, I've described all the major ingredients for a practical (better-than-O(2n)) LCS algorithm; I could (or you could) go ahead and start writing code, and end up with a working LCS library. But I'm not going to start writing code for it, not just yet. Why? Because even though we're now well into the realm of polynomial time, we're still doing more work than we have to. There are a few more sneaky shortcuts in store, and I'll start delving into them next week.