Joe White’s Blog

Life, .NET, and Cats


Diff, part 19: De-dup by length?

Thursday, April 21st, 2005

Today in the diff series, I’ll start in on what I think will be the last major optimization in our LCS algorithm.

Recall that we may have multiple same-length common subsequences in our “best CSes so far” pool. For example, we may have [0,0] [1,2] and [0,0] [2,1]. We’ve been keeping both because the next match might be [2,3], or [3,2], or maybe even [128,2]. We can’t throw away any CS that we might reasonably be able to build upon later (unless we can make an ironclad guarantee that it’s expendable).

Also recall that, when we generate a new CS, we do so by adding a new match onto an existing CS; specifically, the longest one that we can append our new match to. When we look for an existing CS to extend, the only thing we care about is: for the given match, will this CSk give us the longest CSnew that we can possibly generate?

We don’t actually care whether CSk ends in [1,2] or [2,1], as long as we can guarantee that when we append the current match — be it [2,3], [3,2], [3,3], or whatever else — we’ll end up with the longest CSnew we can get. Anywhere and everywhere along the way, we can throw away as many CSes as we want (just like we did yesterday), as long as we never compromise that guarantee. Never throw a CS away unless you can prove it has less growth potential than something else you already have; otherwise, anything goes.

These two properties — length is all-important, yet we keep multiple CSes of the same length — seem to be in a bit of tension with each other. Which suggests an intriguing question: can we get to the point where length alone is a good enough guarantee? Can we get to the point where we hang onto just one length-1 CS, one length-2 CS, one length-3 CS, etc., at a time?

If that were possible, it would mean that we could make a big optimization to our fourth loop — the one where we call Expendable to compare our CSnew to each and every CSk, so we can see if there’s anything (including possibly CSnew) that we can throw away. That fourth loop could, in fact, stop being a loop. It could become a single call to Expendable, comparing CSnew to the CSk of the same length. (Of course, if CSnew is the longest CS so far, then there would be no CSk of the same length, and we wouldn’t need to call Expendable at all.) And that call to Expendable would always have a definitive result: either you throw away CSk, or you throw away CSnew. Only one CS of any given length would be left standing at any given time.

A worthy optimization, to be sure. But is it doable? We would have to throw away one of those CSes-of-the-same-length, but Expendable([0,0] [1,2], [0,0] [2,1]) returns nil, meaning we need both CSes at one time or another. How could we reconcile this?

By siezing on that “at one time or another”, and making this into a timing issue. Then we can control the timing, by controlling the way we iterate. How, and in what order, do we generate our matches?

We would be home free if we could generate our matches in this order:

  • [0,0]
  • [2,1]
  • Every possible match that could extend [2,1] but not [1,2]
  • [1,2]

Or perhaps:

  • [0,0]
  • [1,2]
  • Every possible match that could extend [1,2] but not [2,1]
  • [2,1]

As long as our algorithm can consistently return matches in one of these orders (and can be consistent about which one), then when we get to the point where [0,0] [1,2] and [0,0] [2,1] are our CSnew and CSk (in whichever order), we know with absolute certainty that we can throw the old one away, replace it with the new one, and be sure that we’re still going to find the longest possible CS when we’re done.

So how do we handle the third bullet point in either of the above orderings? That third bullet is kind of the “and then a miracle occurs” stage. Next time, I’ll start to nail down what “every possible match that could extend x but not y” really means, and how we can possibly hope to pull it off.

Diff, part 18: Optimizing the third loop

Wednesday, April 20th, 2005

Our diff series continues, once again focusing on optimization.

&PBOX_R&

On Monday, I sketched out our overall algorithm to date. Last Friday, I showed an Expendable function for eliminating common subsequences that aren’t going anywhere. This time, I’ll look at an interaction between those two pieces.

Suppose we have three CSes so far:

  • [1,1]
  • [1,1] [2,2]
  • [1,1] [2,2] [3,3]
  • [1,1] [2,2] [3,3] [4,4]

And let’s say the next match in our input is [5,5]. Our algorithm says to try appending the [5,5] to each of the existing CSes, then use Expendable to compare the new CS to existing CSes and trim things back. So we would come up with:

  • [1,1] [5,5], which gets eliminated because Expendable([1,1] [2,2], [1,1] [5,5]) = [1,1] [5,5]. (See Expendable scenario #10.)
  • [1,1] [2,2] [5,5], which gets eliminated because Expendable([1,1] [2,2] [3,3], [1,1] [2,2] [5,5]) = [1,1] [2,2] [5,5]. (See scenario #10.)
  • [1,1] [2,2] [3,3] [5,5], which gets eliminated because Expendable([1,1] [2,2] [3,3] [4,4], [1,1] [2,2] [3,3] [5,5]) = [1,1] [2,2] [3,3] [5,5]. (See scenario #10.)
  • [1,1] [2,2] [3,3] [4,4] [5,5], which we keep.

Our first four tries wound up getting thrown away, and it looks like a pattern may be forming. But it’s hard to tell whether the pattern is related to our specific inputs, or if it’s true in general. Will the first few CSnews we generate always be wasted effort?

Let’s look at it a different way, one that doesn’t depend so much on the particular inputs and the current state. Our algorithm-to-date uses Expendable in a specific way: it always passes an existing CSi as the first parameter to Expendable, and a proposed CSnew as the second parameter. But we’re not restricted to using Expendable this way; we can actually call it with any two CSes, including two proposed CSnews. That means we can compare our four new CSes to each other, and anything that drops out is, by definition, less worthy. If we never bother to pass those less-worthy CSes on to the rest of the algorithm, we know, from the way we defined Expendable, that we won’t be hurting our chances of finding the longest possible CS.

There are several different combinations of ways we could call Expendable on our four CSnews, but let’s just call it with successive pairs and see what happens:

  • Expendable([5,5], [1,1] [5,5]) = [5,5]. (See scenario #5.)
  • Expendable([1,1] [5,5], [1,1] [2,2] [5,5]) = [1,1] [5,5]. (See scenario #5.)
  • Expendable([1,1] [2,2] [5,5], [1,1] [2,2] [3,3] [5,5]) = [1,1] [2,2] [5,5]. (See scenario #5.)
  • Expendable([1,1] [2,2] [3,3] [5,5], [1,1] [2,2] [3,3] [4,4] [5,5]) = [1,1] [2,2] [3,3] [5,5]. (See scenario #5.)

So of our four CSnews, three turned out to be expendable with regards to each other. (This is the same thing we saw above, but this time we’ve eliminated one variable — we’re just looking at the CSnews; the outcome no longer depends on what CSes we already happened to have on tap.) And there seems to be a pattern here: scenario #5 keeps showing up. Let’s take a closer look at scenario #5:

# Len AI BI CS1 CS2 Make CS1 win Make CS2 win Exp
5 < = = [3,3] [0,0] [3,3] None possible EOF 1

What does this mean in English? It means that if two CSes are different lengths, but the last match is the same in both, the shorter CS can be thrown away.

This is an appealing rule, because it immediately eliminates obvious holes. But it also works in cases without obvious holes: if, for example, we have [0,3] [1,4] [5,5] and [2,0] [3,1] [4,2] [5,5], we have no obvious holes, but Expendable still tells us, with confidence, that we can throw the shorter one away.

This has a very practical application to our algorithm: we don’t need to append our new match to every one of our extant CSes. If we just use the longest CS-so-far that the new match will fit onto, we’re guaranteed safe — we don’t even need to try the others. When we’re deciding which existing CS(es) to extend with our new match, only two things matter: the existing CSes’ length, and whether we can append our new match to them. After all, once we’ve appended the same new match onto a bunch of different CSes, they lose their previous identity; now their last matches are all the same, and the only thing that can vary is their length. So scenario 5 kicks in, and all but the longest of this newly-minted batch of CSes will drop out. Why bother with anything but the longest to begin with?

What if there’s more than one “longest CS that our match will fit onto”? For example, let’s say we already have these CSes:

  • [0,3]
  • [4,1]
  • [4,1] [6,2]

If the next match is [5,5], what do we do? First we look at our longest CS so far, [4,1] [6,2]. Can we append [5,5] to that, and still have a valid CS? No, we can’t, so we have to go for the shorter CSes. Can we append [5,5] to [0,3]? Yes. What about [4,1]? Yes. So we now have these CSnews:

  • [0,3] [5,5]
  • [4,1] [5,5]

Scenario 5 doesn’t help here, because these are both the same length. No, this is a job for scenario #14:

# Len AI BI CS1 CS2 Make CS1 win Make CS2 win Exp
14 = = = [3,3] [3,3] None possible None possible Either

Meaning, if CS1 and CS2 are both the same length, and both end in the same match, then we should throw one of them away, and it really doesn’t matter which.

With all this in mind, we can make another change to our algorithm’s outline, this time to the third loop (with a minor tweak to the fourth):

  • Loop through all the CSes we have so far, going from longest to shortest. Stop on the first CS that we can construct a longer CS from by appending [i, j].
  • Loop through all the CSes again, calling Expendable on [CSk, CSnew] and discarding CSes as needed.

A couple of finer points to mention here:

  • If we never find a CS that we can extend with [i, j], then we still generate a CSnew. In this case, our CSnew would be of length 1, containing only our new match. (This was true of the algorithm I sketched out on Monday, too; I should’ve specified this then. For example, if the only CS we have so far is [1,4], and our new match is [4,2], then we add [4,2] to our CSes-so-far list.)
  • If we have multiple existing CSes that have the same length (e.g., [0,3] [1,4] and [3,0] [4,1]), it doesn’t matter which of them comes first in our “longest to shortest” ordering. It’s not important whether our iteration visits [0,3] [1,4] or [3,0] [4,1] first, just that we don’t try either of them until we’ve tried everything longer, and that we try both of them before we try anything shorter.

This is a nice optimization. We just took two nested O(N) loops, and made them into two successive O(N) loops. This leg of the algorithm used to be O(N2), and now it’s just O(N). Making progress.

Diff, part 17: Optimizing match discovery

Tuesday, April 19th, 2005

The next really interesting topic in the diff series is going to take more time to write than I have this morning. So today, I’ll talk about a fairly simple optimization to the match-finding loops.

Our outer two loops looked like this:

  • Loop through all the elements in A, and for each Ai:
    • Loop through all the elements in B, and for each Bj that matches Ai:
      • (do some other stuff)

Now, imagine that we have input that’s a typical source-code file, perhaps C# or Delphi code, and that we’re ignoring leading whitespace as we compare.

In a source file of any length, we’re going to have repetition. There will be blank lines in both files. There will be lines containing only ‘{‘ (C#) or only the keyword ‘begin‘ (Delphi), which all compare the same because we’re stripping off the leading whitespace before we compare. Same thing for ‘}‘ or ‘end;‘.

Every time we come across one of those duplicated lines, we do a bunch of wasted work. Let’s see how:

  • Get the first element from A. Let’s say it’s “unit Foo;“.
    • Loop through all the elements in B, and for each Bj that matches “unit Foo;“, do some stuff.
  • Get the next element from A. This one is the empty string, “”.
    • Loop through all the elements in B, and for each Bj that matches “”, do some stuff.
  • Get the next element from A: “interface“.
    • Loop through all the elements in B, and for each Bj that matches “interface“, do some stuff.
  • Get the next element from A: the empty string again.
    • Loop through all the elements in B, and for each Bj that matches “”, do some stuff.
  • Get the next element from A: “uses SysUtils, Classes;“.
    • Loop through all the elements in B, and for each Bj that matches “uses SysUtils, Classes;“, do some stuff.
  • Get the next element from A: the empty string again.
    • Loop through all the elements in B, and for each Bj that matches “”, do some stuff.

Every time we get “” from A, we loop through B looking for all of its “”s. But we already did that. Couldn’t we save ourselves some work if we just remembered where all the “”s are, so we can just loop over the interesting bits of B, rather than the whole boring thing?

Indeed we could, and it would save us time even if there isn’t any duplication between the two lists. We do it by replacing those two outer loops with this:

  • Create a Hashtable called BHash.
  • Loop through all the elements in B, and for each Bj:
    • If BHash[Bj] is nil, then BHash[Bj] := TList.Create;
    • BHash[Bj].Add(j);
  • Loop through all the elements in A, and for each Ai:
    • Loop through all the matching “j”s we’ve already listed at BHash[Ai]:
      • (do some other stuff)

We start with a single pass through B to build a reverse-lookup table. (Ideally we would use a genuine Hashtable, so that adds and lookups are both always O(1). If we’re using Delphi for Win32, which doesn’t have a real Hashtable, we’ll have to fake one, or implement one from scratch.)

This reverse-lookup table maps each element in B to a list of indexes where that element appears:

BHash[“unit Foo;“] = (0)
BHash[“”] = (1, 3, 5, …)
BHash[“interface“] = (2)
BHash[“uses SysUtils, Classes;“] = (4)

Now, when we get to an element in A, we’re not looping through every element in B. That was a bit wasteful to begin with, because for typical input, most of those elements in B wouldn’t match — we were looping over all N even though only a handful will actually matter. Now, with the addition of the reverse-lookup hash, we only look at the elements in B that do match.

So the second loop, instead of being O(N), is now O(D) — the number of iterations is roughly proportional to the number of duplicates in list B. Actually, list A figures into that calculation too, because we’ll only care about those duplicates in B if they also appear in A — after all, if they don’t appear in A, we’re never going to look them up in B.

D may approach N, if the lists contain almost all duplicates (e.g., if both input files consist of nothing but blank lines); or it may approach 1, if the lists contain almost no duplicates. It may even approach 0, if the two input files have nothing in common, because every time we go to look for something in BHash, there’ll be nothing to find and therefore nothing to iterate.

So our outer two loops, instead of being O(N2), are now O(ND). Not bad for a short morning’s work.

Diff, part 16: A look back, and a big O

Monday, April 18th, 2005

I’ve been talking about the diff algorithm for three weeks now, and I’m proud to say that I now own the phrase “stupid diff” (at least as far as Google cares, which is all that really matters anyway). I think I’ll start putting a little trademark symbol after it.

[Update, 19 Apr 2005: Only a day after I posted this, and I’m no longer the #1 Google result for “stupid diff”. Oh well. Easy come, easy go.]

At this point, I’ve presented all the parts and pieces needed to make a serviceable Longest Common Subsequence algorithm. Now let’s look at just how serviceable. How efficient are these parts and pieces?

Last time, I said we’d reached polynomial time. But what degree polynomial are we talking about? Here’s a quick sketch of what the algorithm does:

  • Loop through all the elements in A, and for each Ai:
    • Loop through all the elements in B, and for each Bj that matches Ai:
      • Loop through all the CSes we have so far, and for each CS that we can construct a longer CS from by appending [i, j]:
        • Loop through all the CSes again, calling Expendable on [CSk, CSnew] and discarding CSes as needed.

Looks like our algorithm, so far, is O(N4). Sure beats our old super-exponential time, but N4 still leaves something to be desired.

Can we do better? You bet. In fact, by the time we’re done, the limiting factor is going to be how efficiently we can find matches: those outer two loops. Next time, I’ll take a look at making that a little more efficient (though still technically O(N2)), and start setting the stage for some of the later improvements.

Diff, part 15: Implementing Expendable

Friday, April 15th, 2005

In yesterday’s episode of Diff Quest, I outlined a function called Expendable, which takes two CSes and tells you whether you can throw one of them away. Today I’ll describe that function’s implementation.

Expendable takes two parameters, CS1 and CS2. Each of those is a common subsequence, and therefore consists of a bunch of matches. And each match is a pair of indexes, like [0,1].

The Expendable function is all about the idea of growth potential. Growth potential depends on two things, and two things only: the length of the CS, and the last match in the CS. [0,0] [2,2] has the exact same growth potential as [1,1] [2,2] — the earlier matches don’t matter; what matters is what we can tack onto the end.

This means there are three numbers that we care about, when we’re thinking about a CS’s growth potential: its length; the AIndex of its last match; and the BIndex of its last match. So the inputs to Expendable can be described by six individual numbers, and the different return values of Expendable are determined by the relationship between corresponding pairs:

  • Is CS1Length less than, equal to, or greater than CS2Length?
  • Is CS1AIndex less than, equal to, or greater than CS2AIndex?
  • Is CS1BIndex less than, equal to, or greater than CS2BIndex?

The following table lists all the possible combinations of these. The “Len” column shows “<” if CS1Length < CS2Length, “=” if CS1Length = CS2Length, and “>” if CS1Length > CS2Length. The “AI” column is the same thing for AIndex, and “BI” is BIndex. After that, there’s a sample CS1 and a sample CS2 that exhibit these relationships (right-justified so the last matches always line up, since it’s the last match that’s important).

The next two columns are a kind of game, where I try to come up with a set of inputs that would make CS1 “win” (i.e., inputs that would build CS1 into a longer subsequence without also growing CS2), and then a set of inputs that would make CS2 win. It doesn’t matter how contrived those inputs are; what matters is whether those inputs are possible (and sometimes they’re not, which is when expendability kicks in). “EOF” means that no further input is required to make that CS win — if we reached end of file, and there were no more matches, then it’s already won.

Finally, the last column shows the result of Expendable(CS1, CS2).

# Len AI BI CS1 CS2 Make CS1 win Make CS2 win Exp
1 < < <       [3,3] [0,0] [6,6] [4,4] [5,5] EOF nil
2 < < =       [3,3] [0,0] [6,3] [4,4] [5,5] EOF nil
3 < < >       [3,6] [0,0] [6,3] [4,7] [5,8] EOF nil
4 < = <       [3,3] [0,0] [3,6] [4,4] [5,5] EOF nil
5 < = =       [3,3] [0,0] [3,3] None possible EOF 1
6 < = >       [3,6] [0,0] [3,3] None possible EOF 1
7 < > <       [6,3] [0,0] [3,6] [7,4] [8,5] EOF nil
8 < > =       [6,3] [0,0] [3,3] None possible EOF 1
9 < > >       [6,6] [0,0] [3,3] None possible EOF 1
10 = < <       [3,3]       [6,6] [4,4] None possible 2
11 = < =       [3,3]       [6,3] [4,4] None possible 2
12 = < >       [3,6]       [6,3] [4,7] [7,4] nil
13 = = <       [3,3]       [3,6] [4,4] None possible 2
14 = = =       [3,3]       [3,3] None possible None possible Either
15 = = >       [3,6]       [3,3] None possible [4,4] 1
16 = > <       [6,3]       [3,6] [7,4] [4,7] nil
17 = > =       [6,3]       [3,3] None possible [4,4] 1
18 = > >       [6,6]       [3,3] None possible [4,4] 1
19 > < < [0,0] [3,3]       [6,6] EOF None possible 2
20 > < = [0,0] [3,3]       [6,3] EOF None possible 2
21 > < > [0,0] [3,6]       [6,3] EOF [7,4] [8,5] nil
22 > = < [0,0] [3,3]       [3,6] EOF None possible 2
23 > = = [0,0] [3,3]       [3,3] EOF None possible 2
24 > = > [0,0] [3,6]       [3,3] EOF [4,4] [5,5] nil
25 > > < [0,0] [6,3]       [3,6] EOF [4,7] [5,8] nil
26 > > = [0,0] [6,3]       [3,3] EOF [4,4] [5,5] nil
27 > > > [0,0] [6,6]       [3,3] EOF [4,4] [5,5] nil

Row #14 is interesting, in that the Expendable function could meaningfully return either CS1 or CS2. That’s because there’s nothing to differentiate the two inputs — they are, as far as we care, the same. One might be [0,1] [2,2] and the other might be [1,0] [2,2], but they both have the same growth potential, so there’s no need to keep both. We can throw one of them away, and it really doesn’t matter which one; we can decide arbitrarily.

I spent a while staring at this table, trying to figure out the pattern, because of row #14. I originally had it marked as “nil”, but I finally realized that that’s not right; we do want to throw away one or the other. Once I figured out that both CS1 and CS2 were expendable for that row, the pattern became clear:

  • If CS1Length < CS2Length, and CS1AIndex >= CS2AIndex, and CS1BIndex >= CS2BIndex, then CS1 is expendable.
  • If CS1Length > CS2Length, and CS1AIndex <= CS2AIndex, and CS1BIndex <= CS2BIndex, then CS2 is expendable.

It’s not the most graceful of algorithms, but it’s not a hard one to implement.

At this point, I’ve described all the major ingredients for a practical (better-than-O(2n)) LCS algorithm; I could (or you could) go ahead and start writing code, and end up with a working LCS library. But I’m not going to start writing code for it, not just yet. Why? Because even though we’re now well into the realm of polynomial time, we’re still doing more work than we have to. There are a few more sneaky shortcuts in store, and I’ll start delving into them next week.

Diff, part 14: Expendability

Thursday, April 14th, 2005

Last time, on the Diff Chronicles, I brought up a useful optimization called MaxLength. But useful as it is, it’s totally impractical to put into practice, because finding the exact MaxLength takes too long.

But what if we could look at two CSes, and know that one’s MaxLength is better than the other’s, without actually calculating their MaxLengths?

Let’s define a new function, called Expendable(CS1, CS2). We give it two CSes, and it tells us which of them we can throw away. Sometimes it returns nil, meaning we can’t throw either of them away. (Thanks to Sam for suggesting that this be a function — it simplifies the explanation quite a bit.)

Assuming that we can implement the Expendable function, it’s simple to plug it into yesterday’s algorithm: Every time we come up with a new CS that we want to add to our list, we first loop through our existing CSes, and compare each one to our CSnew. If Expendable(CSi, CSnew) ever returns CSi, we delete CSi from our list, and then move on to CSi+1. If Expendable(CSi, CSnew) ever returns CSnew, we stop there, and don’t bother adding CSnew to our list. And whenever Expendable(CSi, CSnew) returns nil, we just keep going.

This is essentially what we did yesterday, but we were using an Expendable function that was implemented in terms of MaxLength. Now we just have to figure out how to implement the Expendable function without using that expensive MaxLength calculation. How do we look at two CSes and figure out that we can eliminate one without hurting our chances of finding the LCS?

Let’s look at (you knew this was coming, didn’t you?) some examples:

  • Expendable([0,0] [1,1], [1,1]) => [1,1].
    Anything we can add onto one of these, we can add onto the other. But one of them is already longer. It’s a sure thing that MaxLength([0,0] [1,1]) = MaxLength([1,1]) + 1; we can tell that without knowing exactly what their MaxLengths are. Therefore, [0,0] [1,1] wins, and [1,1] gets flushed.
  • Expendable([0,0], [1,1]) => [1,1].
    We know that there’s a match [1,1] out there somewhere, because it’s in the second CS. So we know that the [0,0] will end up growing to [0,0] [1,1] sooner or later. That means that this example is exactly the same thing as the previous example, and we already know that once it grows to that point, [1,1] is going to lose. Why wait that long? We can just cut our losses now, and toss the second CS right away.
  • Expendable([3,6], [5,2]) => nil.
    We can’t eliminate [3,6] because it might well be part of the LCS. For example, there might be only one more match after this, and that match might be [4,7]. We can’t eliminate [5,2] because there might be only one more match, and it might be [6,3]. We can’t guarantee that either of these CSes is out of the running, so we can’t eliminate either one.
  • Expendable([0,5] [1,6] [2,7], [3,0]) => nil.
    It’s tempting to say, “We’re working towards the longest possible CS, so when we need to throw one away, go for the shorter one.” But we’ve tried that before, and it didn’t work. If the next three matches are [4,1], [5,2], and [6,3], then [3,0] will grow into the next big powerhouse, because we can add those matches onto [3,0] to get a length-4 CS, but we can’t add them onto [0,5] [1,6] [2,7]. Bottom line, we can’t throw away either CS without waiting to see what other matches are yet to come.
  • Expendable([4,0], [4,4]) => [4,4].
    We can guarantee that throwing away [4,4] won’t hurt our chances of finding the right LCS. There is no possible way that MaxLength([4,4]) could be larger than MaxLength([4,0]), but there are possible ways that MaxLength([4,0]) could be better; for example, the next match might be [5,2], which would grow [4,0] but would not grow [4,4]. It’s not guaranteed that [4,0] will result in a better MaxLength, but it is guaranteed that it won’t ever be worse.

The basic concept is something I think of as “growth potential”. [0,0] has more room to grow than [1,1] does; [1,1] is already closer to the end of the lists, and you can’t build your Lego tower much higher if there’s a ceiling in the way. [4,0] has more room to grow than [4,4] does. But whether [3,6] has more room to grow than [5,2] depends on a lot of things, like the size of the input lists, and where the matches fall; so neither one is guaranteed expendable.

Tomorrow, I’ll provide an actual algorithm for the Expendable function, and put it to work.

Diff, part 13: MaxLength and driving directions

Wednesday, April 13th, 2005

Last time in the diff saga, I showed a Longest Common Subsequence (LCS) algorithm with real promise, but a big weakness: it’s O(2n). Heat death of the universe, etc. We can improve on that.

Remember that we build longer CSes by starting with shorter ones, and adding stuff onto the end. Suppose we write a new function MaxLength(CSA), which returns the length of the longest CS that starts with CSA. Let’s go back to our last example:

A is (1 2 3 4)
B is (1 3 4 2)
Matches are [0,0], [1,3], [2,1], and [3,2].

Let’s look at our function MaxLength with some CSes from this input:

  • MaxLength([0,0]) = 3, because [0,0] will grow into length-3 CS [0,0] [2,1] [3,2].
  • MaxLength([0,0] [2,1]) is also 3, because [0,0] [2,1] will grow into that same length-3 CS, [0,0] [2,1] [3,2].
  • MaxLength([0,0] [2,1] [3,2]) is — you guessed it — 3.
  • MaxLength([0,0] [3,2]) = 2, because [0,0] [3,2] has already maxed out; there are no matches we can add at the end to make a longer CS.
  • MaxLength([2,1]) = 2, because [2,1] will grow into [2,1] [3,2].

Now, here’s the plan. We follow our stateful LCS algorithm, but for every CS we’ve accepted so far, we also keep track of its MaxLength. Then, when we come up with a new CS, we calculate its MaxLength, and compare that to the MaxLengths that are already in our list. If the new MaxLength is better than anything in the list, then we throw away the CSes with worse MaxLengths, and add the new one. But if the new MaxLength is worse than anything in the list, then we throw the new one away.

The highest MaxLength will always belong to CSes that are leading subsets of the LCS. That is, if the LCS is [0,0] [2,1] [3,2], then [0,0] will have the highest possible MaxLength — 3, in this case — and so will [0,0] [2,1] and [0,0] [2,1] [3,2]. So if we immediately throw away any CS with a lower MaxLength — [0,0] [1,3], for example, with a MaxLength of a mere 2 — then we’ll home right in on our LCS. We won’t waste time getting sidetracked, and we won’t linger on CSes with obvious holes ([0,0] [3,2] has a MaxLength of 2, so out it goes).

Well, okay, the obvious holes don’t play too big a part in an example this short. But they’re what makes our previous algorithm take exponential time. Fixing them is what makes the algorithm practical. And with MaxLength, we can do just that. Goody!

Until reality sets in, at any rate.

See, you can’t actually know what any of the MaxLengths are until after you’ve generated all the longer CSes. Which pretty much only happens after you’re done running the entire O(2n) LCS algorithm. So you have to go through the algorithm once, without optimizations, so you can run it again, with optimizations. Yeah, that makes a lot of sense.

Driver: “Will we get there faster if I turn left or right at this intersection?”

Navigator: “Well, let’s try turning left, and drive until we get there, and see how long it takes. Then we’ll come back here, and the second time, we’ll turn right at this intersection, and then drive until we get there. Then we’ll know for sure which way will get us there fastest, and we can come back here again and take that route.”

Okay. So, the MaxLength scheme probably isn’t going to work. At all. However, it is a useful concept to build upon, and next time I’ll show a variation that’s nearly as useful, and not nearly as stupid.

Diff, part 12: More state, Mr. Sulu

Tuesday, April 12th, 2005

The grand diff saga continues…

Last time, I showed an improved one-pass LCS algorithm. But I mentioned that, once again, this is an algorithm that won’t work for all inputs. Here’s an example:

A is (1 2 3 4)
B is (1 3 4 2)
Matches are [0,0], [1,3], [2,1], and [3,2].

Let’s try our algorithm and see what we get.

  • First match: [0,0]. That’s our only match so far, so we have one common subsequence (CS): [0,0].
  • Next match: [1,3]. Can we tack that onto [0,0] and still have a valid CS? Yes. We now have one longer CS: [0,0] [1,3].
  • Next match: [2,1]. Can we tack that onto [0,0] and still have a valid CS? No, but we don’t want to lose the [2,1], so we’ll start a new CS. Now we have two: [0,0] [1,3] and [2,1].
  • Last match: [3,2]. Can we tack that onto [0,0] [1,3]? No. [2,1]? Yes. We end up with two CSes: [0,0] [1,3] and [2,1] [3,2].

Whoops! Given those inputs, we should have had a length-3 LCS: (1 3 4), or [0,0] [2,1] [3,2]. What happened?

What happened was that the [0,0] got sucked into another CS — [0,0] [1,3] — so it wasn’t available as a starting point for us to build [0,0] [2,1] [3,2] from. How can we fix this, and make sure the [0,0] is still available even when it’s been used in another match string?

Well, pretty easily, with a tweak to our list handling. Instead of dropping shorter CSes from the list and replacing them with longer ones, we can just add. So instead of removing [0,0] from our list when we add [0,0] [1,3], we just add them both.

Let’s give that a shot:

  • First match: [0,0]. Can we append that to any existing CSes? No, so we’d better add it as a new CS. Our CS list now contains [0,0].
  • Next match: [1,3]. Append this to everything we can — that would be [0,0]. Our CS list now contains both [0,0] and [0,0] [1,3].
  • Next match: [2,1]. Append this to everything we can — [0,0] once again. Our CS list now has [0,0], [0,0] [1,3], and [0,0] [2,1].
  • Last match: [3,2]. Append this to everything we can — [0,0] and [0,0] [2,1]. Our CS list now has [0,0], [0,0] [1,3], [0,0] [2,1], [0,0] [3,2], and [0,0] [2,1] [3,2].

And, ta-daah — the longest of those, [0,0] [2,1] [3,2], is our expected LCS.

But wait a minute. Why is it trying both [0,0] [3,2] and [0,0] [2,1] [3,2]? They’re the same thing, except that the first one has an obvious hole in the middle.

Actually, I took it easy in this example. If I’d been all proper about it, then I would’ve added each match as a length-1 match string, too; so the CS list would also contain [1,3], [2,1], and [3,2] as length-1 match strings, plus anything else we built off of those. Why? We have to put in length-1 match strings that aren’t built off of anything else; after all, we do it in step 1! So after step 2, we would’ve had three CSes ([0,0], [0,0] [1,3], and [1,3]), not two. After step 3, we would’ve had five. After step 4, we would’ve had eight. The farther we go, the more obvious holes we accumulate.

This whole thing is smelling a lot like exponential time, even without adding the extra length-1 match strings. True, this algorithm is able to cut its losses — it doesn’t keep trying longer and longer match strings that all start with, say, [1,3] [2,1], because it’s smart enough to know that none of them will ever work. But it’s not smart enough to avoid obvious holes, and if it can’t cut very many losses (which will be the case a lot of the time), you’ll be waiting a long, long while for your diff to finish.

But it’s a good algorithm, once we cure its predilection for Swiss cheese. What it needs, at this point, are a few good shortcuts.

Diff, part 11: Once more, with state

Monday, April 11th, 2005

Continuing in the diff series:

It’s been a long time since Friday, so let’s start by reviewing some of the problems with the “try ’em all” approach to LCS that we looked at last time. Here’s our sample input again:

A is (1 2 3 4 5 6 7)
B is (6 7 3 4 5 1 2)
Matches are [0,5], [1,6], [2,2], [3,3], [4,4], [5,0], and [6,1].

Trying every possible match string will take a ridiculous amount of time, because it wastes its time a couple of different ways:

  • Obvious gaps: It tries match strings like [2,2] [4,4] and [3,3]. That’s bad — we want something that can recognize patterns a little better, and just try [2,2] [3,3] [4,4] without trying variants with gaps (since there are a lot of variants with gaps).
  • Invalid: It tries match strings like [0,4] [2,2] that aren’t going to pass our rule. That’s not so bad, but it would also try [0,4] [2,2] [3,3], [0,4] [2,2] [3,3] [4,4], etc., instead of cutting its losses as soon as it notices that no valid common subsequence will ever contain [0,4] [2,2].

There are really only three match strings — out of a possible 325 (31 if we can impose ordering on our matches, though we haven’t actually worked out a way to do that yet) — that make any sense: [0,4] [1,5]; [2,2] [3,3] [4,4]; and [5,0] [6,1]. We want to ignore the other 322 (28) possibilities, and go straight to the good stuff. How do we do that?

Actually, we’ve gotten pretty close to that already, in our “string ’em all together” algorithm. I made a big deal out of how it didn’t work, but most of that came down to the fact that we only gave it one register; it could only keep track of one “best-yet common subsequence” at a time. If it had a list instead, it would be a pretty decent fit. Let’s run through the way it would work.

  • We start with the first match, [0,5]. That’s the only match we’ve looked at so far, so we have one common subsequence (CS) at this point: [0,5].
  • The next match is [1,6]. Can we tack that onto [0,5] and still have a valid CS? We would end up with [0,5] [1,6], which passes our rule, so yes, we can append it. Our CS is now [0,5] [1,6].
  • Next: [2,2]. Can we tack that onto [0,5] [1,6] and still have a valid CS? We would end up with [0,5] [1,6] [2,2], which fails our rule, so no, we can’t append it to our existing CS. But we don’t want to throw the [2,2] away, so let’s add it to our list. Now we have two CSes going: [0,5] [1,6] and [2,2].
  • Next: [3,3]. Can we tack that onto [0,5] [1,6]? No. What about [2,2]? Yes. Our CSes are now [0,5] [1,6] and [2,2] [3,3].
  • Next: [4,4]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3]? Yes. Our CSes are now [0,5] [1,6] and [2,2] [3,3] [4,4].
  • Next: [5,0]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3] [4,4]? No, so we’ll have to start another CS. Our CSes are now [0,5] [1,6], [2,2] [3,3] [4,4], and [5,0].
  • Next: [6,1]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3] [4,4]? No. [5,0]? Yes. Our CSes are now [0,5] [1,6], [2,2] [3,3] [4,4], and [5,0] [6,1].

Three sensible CSes, and one is longer than the others. So that’s our LCS. And it does, in fact, match what we expected.

This still won’t quite work for all inputs. I’ll refine it a bit more next time.

Diff, part 10: Try ’em all? (2 of 2)

Friday, April 8th, 2005

Continuing in the diff series

So in the last few articles, I proposed and refined a simpleminded algorithm for finding the Longest Common Subsequence, only to find that the algorithm couldn’t run deterministically in a single pass. It gets to points where it has to make a decision, and it doesn’t have enough information to make that decision on the spot. It would have to try both options, and see which one works better.

Then I pondered the question, “Does this mean we just have to try all the possible match strings, see which ones work, and use the longest of those?” The answer was, “I hope not.

But wait. Last time, I was trying every possible match string, in every possible order: I was building both ab and ba as match strings. But that’s silly, because they can’t both pass our rule. Could we start by putting the matches in some sort of order — [1,1] before [2,2], and like that — and then only generate match strings that have the matches in that order? After all, we already know that [2,2] [1,1] isn’t even worth considering; why generate it in the first place?

We could indeed do this, and it would bring the problem from the super-exponential into the merely exponential. It basically turns the list of N match strings into an N-digit binary number. We ignore the number with all bits set to 0 (because if we have any matches at all, then our LCS will always be at least one match long), so we’re now dealing with 2n-1 possible match strings.

That’s a little better, but still not good enough for 1,000-line files:

  • 3 matches: 7 match strings
  • 4 matches: 15 match strings
  • 5 matches: 31 match strings
  • 6 matches: 63 match strings
  • 10 matches: 1,023 match strings
  • 20 matches: 1,048,575 match strings
  • 1,000 matches: 1.071*10301 match strings

So now, instead of taking much longer than cracking a 1,000-bit symmetric encryption key, our 1,000-line diff now takes merely as long as cracking a 1,000-bit key. My previous comments about the heat death of the universe still apply.

That answers our “What if?” question pretty conclusively.

The thing is, trying all the possible match strings is a bad idea for a much simpler reason: it tries way too many stupid options. It will try match strings with obvious holes in them — every possible combination of obvious holes. If you have matches [1,1], [2,2], [3,3], and [4,4], the “try ’em all” solution will be trying things like [1,1] [2,2] [4,4] and [2,2] [3,3] to see how they work. And it’ll take an obscene amount of time doing it.

So we have to have something less exhaustive than “try ’em all”, but it has to be more comprehensive than “one pass and keep the best so far”. Over the next few articles, I’ll start getting into what that algorithm should look like.


Joe White's Blog copyright © 2004-2011. Portions of the site layout use Yahoo! YUI Reset, Fonts, and Grids.
Proudly powered by WordPress. Entries (RSS) and Comments (RSS). Privacy policy