Joe White's Blog Life, .NET, and cats

Diff, part 12: More state, Mr. Sulu #.NET #Delphi #diff

The grand diff saga continues...

Last time, I showed an improved one-pass LCS algorithm. But I mentioned that, once again, this is an algorithm that won't work for all inputs. Here's an example:

A is (1 2 3 4)
B is (1 3 4 2)
Matches are [0,0], [1,3], [2,1], and [3,2].

Let's try our algorithm and see what we get.

  • First match: [0,0]. That's our only match so far, so we have one common subsequence (CS): [0,0].
  • Next match: [1,3]. Can we tack that onto [0,0] and still have a valid CS? Yes. We now have one longer CS: [0,0] [1,3].
  • Next match: [2,1]. Can we tack that onto [0,0] [1,3] and still have a valid CS? No, but we don't want to lose the [2,1], so we'll start a new CS. Now we have two: [0,0] [1,3] and [2,1].
  • Last match: [3,2]. Can we tack that onto [0,0] [1,3]? No. [2,1]? Yes. We end up with two CSes: [0,0] [1,3] and [2,1] [3,2].

Whoops! Given those inputs, we should have had a length-3 LCS: (1 3 4), or [0,0] [2,1] [3,2]. What happened?

What happened was that the [0,0] got sucked into another CS — [0,0] [1,3] — so it wasn't available as a starting point for us to build [0,0] [2,1] [3,2] from. How can we fix this, and make sure the [0,0] is still available even when it's been used in another match string?

Well, pretty easily, with a tweak to our list handling. Instead of dropping shorter CSes from the list and replacing them with longer ones, we can just add. So instead of removing [0,0] from our list when we add [0,0] [1,3], we just add them both.

Let's give that a shot:

  • First match: [0,0]. Can we append that to any existing CSes? No, so we'd better add it as a new CS. Our CS list now contains [0,0].
  • Next match: [1,3]. Append this to everything we can — that would be [0,0]. Our CS list now contains both [0,0] and [0,0] [1,3].
  • Next match: [2,1]. Append this to everything we can — [0,0] once again. Our CS list now has [0,0], [0,0] [1,3], and [0,0] [2,1].
  • Last match: [3,2]. Append this to everything we can — [0,0] and [0,0] [2,1]. Our CS list now has [0,0], [0,0] [1,3], [0,0] [2,1], [0,0] [3,2], and [0,0] [2,1] [3,2].

And, ta-daah — the longest of those, [0,0] [2,1] [3,2], is our expected LCS.

But wait a minute. Why is it trying both [0,0] [3,2] and [0,0] [2,1] [3,2]? They're the same thing, except that the first one has an obvious hole in the middle.

Actually, I took it easy in this example. If I'd been all proper about it, then I would've added each match as a length-1 match string, too; so the CS list would also contain [1,3], [2,1], and [3,2] as length-1 match strings, plus anything else we built off of those. Why? We have to put in length-1 match strings that aren't built off of anything else; after all, we do it in step 1! So after step 2, we would've had three CSes ([0,0], [0,0] [1,3], and [1,3]), not two. After step 3, we would've had five. After step 4, we would've had eight. The farther we go, the more obvious holes we accumulate.

This whole thing is smelling a lot like exponential time, even without adding the extra length-1 match strings. True, this algorithm is able to cut its losses — it doesn't keep trying longer and longer match strings that all start with, say, [1,3] [2,1], because it's smart enough to know that none of them will ever work. But it's not smart enough to avoid obvious holes, and if it can't cut very many losses (which will be the case a lot of the time), you'll be waiting a long, long while for your diff to finish.

But it's a good algorithm, once we cure its predilection for Swiss cheese. What it needs, at this point, are a few good shortcuts.