Diff, part 15: Implementing Expendable

In yesterday’s episode of Diff Quest, I outlined a function called Expendable, which takes two CSes and tells you whether you can throw one of them away. Today I’ll describe that function’s implementation.

Expendable takes two parameters, CS1 and CS2. Each of those is a common subsequence, and therefore consists of a bunch of matches. And each match is a pair of indexes, like [0,1].

The Expendable function is all about the idea of growth potential. Growth potential depends on two things, and two things only: the length of the CS, and the last match in the CS. [0,0] [2,2] has the exact same growth potential as [1,1] [2,2] — the earlier matches don’t matter; what matters is what we can tack onto the end.

This means there are three numbers that we care about, when we’re thinking about a CS’s growth potential: its length; the AIndex of its last match; and the BIndex of its last match. So the inputs to Expendable can be described by six individual numbers, and the different return values of Expendable are determined by the relationship between corresponding pairs:

  • Is CS1Length less than, equal to, or greater than CS2Length?
  • Is CS1AIndex less than, equal to, or greater than CS2AIndex?
  • Is CS1BIndex less than, equal to, or greater than CS2BIndex?

The following table lists all the possible combinations of these. The “Len” column shows “<” if CS1Length < CS2Length, “=” if CS1Length = CS2Length, and “>” if CS1Length > CS2Length. The “AI” column is the same thing for AIndex, and “BI” is BIndex. After that, there’s a sample CS1 and a sample CS2 that exhibit these relationships (right-justified so the last matches always line up, since it’s the last match that’s important).

The next two columns are a kind of game, where I try to come up with a set of inputs that would make CS1 “win” (i.e., inputs that would build CS1 into a longer subsequence without also growing CS2), and then a set of inputs that would make CS2 win. It doesn’t matter how contrived those inputs are; what matters is whether those inputs are possible (and sometimes they’re not, which is when expendability kicks in). “EOF” means that no further input is required to make that CS win — if we reached end of file, and there were no more matches, then it’s already won.

Finally, the last column shows the result of Expendable(CS1, CS2).

# Len AI BI CS1 CS2 Make CS1 win Make CS2 win Exp
1 < < <       [3,3] [0,0] [6,6] [4,4] [5,5] EOF nil
2 < < =       [3,3] [0,0] [6,3] [4,4] [5,5] EOF nil
3 < < >       [3,6] [0,0] [6,3] [4,7] [5,8] EOF nil
4 < = <       [3,3] [0,0] [3,6] [4,4] [5,5] EOF nil
5 < = =       [3,3] [0,0] [3,3] None possible EOF 1
6 < = >       [3,6] [0,0] [3,3] None possible EOF 1
7 < > <       [6,3] [0,0] [3,6] [7,4] [8,5] EOF nil
8 < > =       [6,3] [0,0] [3,3] None possible EOF 1
9 < > >       [6,6] [0,0] [3,3] None possible EOF 1
10 = < <       [3,3]       [6,6] [4,4] None possible 2
11 = < =       [3,3]       [6,3] [4,4] None possible 2
12 = < >       [3,6]       [6,3] [4,7] [7,4] nil
13 = = <       [3,3]       [3,6] [4,4] None possible 2
14 = = =       [3,3]       [3,3] None possible None possible Either
15 = = >       [3,6]       [3,3] None possible [4,4] 1
16 = > <       [6,3]       [3,6] [7,4] [4,7] nil
17 = > =       [6,3]       [3,3] None possible [4,4] 1
18 = > >       [6,6]       [3,3] None possible [4,4] 1
19 > < < [0,0] [3,3]       [6,6] EOF None possible 2
20 > < = [0,0] [3,3]       [6,3] EOF None possible 2
21 > < > [0,0] [3,6]       [6,3] EOF [7,4] [8,5] nil
22 > = < [0,0] [3,3]       [3,6] EOF None possible 2
23 > = = [0,0] [3,3]       [3,3] EOF None possible 2
24 > = > [0,0] [3,6]       [3,3] EOF [4,4] [5,5] nil
25 > > < [0,0] [6,3]       [3,6] EOF [4,7] [5,8] nil
26 > > = [0,0] [6,3]       [3,3] EOF [4,4] [5,5] nil
27 > > > [0,0] [6,6]       [3,3] EOF [4,4] [5,5] nil

Row #14 is interesting, in that the Expendable function could meaningfully return either CS1 or CS2. That’s because there’s nothing to differentiate the two inputs — they are, as far as we care, the same. One might be [0,1] [2,2] and the other might be [1,0] [2,2], but they both have the same growth potential, so there’s no need to keep both. We can throw one of them away, and it really doesn’t matter which one; we can decide arbitrarily.

I spent a while staring at this table, trying to figure out the pattern, because of row #14. I originally had it marked as “nil”, but I finally realized that that’s not right; we do want to throw away one or the other. Once I figured out that both CS1 and CS2 were expendable for that row, the pattern became clear:

  • If CS1Length < CS2Length, and CS1AIndex >= CS2AIndex, and CS1BIndex >= CS2BIndex, then CS1 is expendable.
  • If CS1Length > CS2Length, and CS1AIndex <= CS2AIndex, and CS1BIndex <= CS2BIndex, then CS2 is expendable.

It’s not the most graceful of algorithms, but it’s not a hard one to implement.

At this point, I’ve described all the major ingredients for a practical (better-than-O(2n)) LCS algorithm; I could (or you could) go ahead and start writing code, and end up with a working LCS library. But I’m not going to start writing code for it, not just yet. Why? Because even though we’re now well into the realm of polynomial time, we’re still doing more work than we have to. There are a few more sneaky shortcuts in store, and I’ll start delving into them next week.

First bike ride of spring… oof

We’ve had some beautiful weather lately, so last night I pumped up our bike tires, intending to go for a ride. And then I didn’t do it. Ah well.

I got out this morning, after getting the dust cleaned out of my helmet. I made it all the way around the block. It’s a largish block, with a hill. I’ve ridden up worse hills before, but I was in shape then.

Coasting down the hill was easy. I made it about 2/3 of the way back up before I had to get off and walk. At the time, I debated pedaling a little farther, but I’m glad I didn’t; I was puffing something fierce by the time I got back home. I’m now happily collapsed into my chair and debating how long I can lie here before I have to get back up and go to work. I’ve got some vacation days saved up, don’t I?

Lesson learned #1: Next time, I will take the garage door opener with me, so I don’t have to park my bike, climb up the outside stairs, unlock the front door, go inside, and then go back down the stairs to open the garage. Save myself a little extra exertion.

Lesson learned #2: Next time, I will pour the glass of Gatorade before I leave, so it’s ready when I get back.

Diff, part 14: Expendability

Last time, on the Diff Chronicles, I brought up a useful optimization called MaxLength. But useful as it is, it’s totally impractical to put into practice, because finding the exact MaxLength takes too long.

But what if we could look at two CSes, and know that one’s MaxLength is better than the other’s, without actually calculating their MaxLengths?

Let’s define a new function, called Expendable(CS1, CS2). We give it two CSes, and it tells us which of them we can throw away. Sometimes it returns nil, meaning we can’t throw either of them away. (Thanks to Sam for suggesting that this be a function — it simplifies the explanation quite a bit.)

Assuming that we can implement the Expendable function, it’s simple to plug it into yesterday’s algorithm: Every time we come up with a new CS that we want to add to our list, we first loop through our existing CSes, and compare each one to our CSnew. If Expendable(CSi, CSnew) ever returns CSi, we delete CSi from our list, and then move on to CSi+1. If Expendable(CSi, CSnew) ever returns CSnew, we stop there, and don’t bother adding CSnew to our list. And whenever Expendable(CSi, CSnew) returns nil, we just keep going.

This is essentially what we did yesterday, but we were using an Expendable function that was implemented in terms of MaxLength. Now we just have to figure out how to implement the Expendable function without using that expensive MaxLength calculation. How do we look at two CSes and figure out that we can eliminate one without hurting our chances of finding the LCS?

Let’s look at (you knew this was coming, didn’t you?) some examples:

  • Expendable([0,0] [1,1], [1,1]) => [1,1].
    Anything we can add onto one of these, we can add onto the other. But one of them is already longer. It’s a sure thing that MaxLength([0,0] [1,1]) = MaxLength([1,1]) + 1; we can tell that without knowing exactly what their MaxLengths are. Therefore, [0,0] [1,1] wins, and [1,1] gets flushed.
  • Expendable([0,0], [1,1]) => [1,1].
    We know that there’s a match [1,1] out there somewhere, because it’s in the second CS. So we know that the [0,0] will end up growing to [0,0] [1,1] sooner or later. That means that this example is exactly the same thing as the previous example, and we already know that once it grows to that point, [1,1] is going to lose. Why wait that long? We can just cut our losses now, and toss the second CS right away.
  • Expendable([3,6], [5,2]) => nil.
    We can’t eliminate [3,6] because it might well be part of the LCS. For example, there might be only one more match after this, and that match might be [4,7]. We can’t eliminate [5,2] because there might be only one more match, and it might be [6,3]. We can’t guarantee that either of these CSes is out of the running, so we can’t eliminate either one.
  • Expendable([0,5] [1,6] [2,7], [3,0]) => nil.
    It’s tempting to say, “We’re working towards the longest possible CS, so when we need to throw one away, go for the shorter one.” But we’ve tried that before, and it didn’t work. If the next three matches are [4,1], [5,2], and [6,3], then [3,0] will grow into the next big powerhouse, because we can add those matches onto [3,0] to get a length-4 CS, but we can’t add them onto [0,5] [1,6] [2,7]. Bottom line, we can’t throw away either CS without waiting to see what other matches are yet to come.
  • Expendable([4,0], [4,4]) => [4,4].
    We can guarantee that throwing away [4,4] won’t hurt our chances of finding the right LCS. There is no possible way that MaxLength([4,4]) could be larger than MaxLength([4,0]), but there are possible ways that MaxLength([4,0]) could be better; for example, the next match might be [5,2], which would grow [4,0] but would not grow [4,4]. It’s not guaranteed that [4,0] will result in a better MaxLength, but it is guaranteed that it won’t ever be worse.

The basic concept is something I think of as “growth potential”. [0,0] has more room to grow than [1,1] does; [1,1] is already closer to the end of the lists, and you can’t build your Lego tower much higher if there’s a ceiling in the way. [4,0] has more room to grow than [4,4] does. But whether [3,6] has more room to grow than [5,2] depends on a lot of things, like the size of the input lists, and where the matches fall; so neither one is guaranteed expendable.

Tomorrow, I’ll provide an actual algorithm for the Expendable function, and put it to work.

Whiskey Tango Foxtrot

We’ve been refactoring some legacy code over the past few days, and we’ve found a couple of amusing snippets. They were written by programmers who no longer work here, so we can poke fun at them.

Here’s an interesting bit from 1998:

  // do some stuff
  ShowMessage('Duplicate ID: ' + IntToStr(NewID) + #13 +
    'TableName: ' + tblNewRow.TableName );

So we’ll cancel the append even if the re-raise fails, and execution falls through to the line after the raise. I can honestly say that this bit of resource-protection code never would have occurred to me.

A co-worker found another gem yesterday. This one also dates back to 1998 (though written by a different programmer):

Query.Locate(Query.FieldByName('ID').FieldName, ID, []);

So we call FieldByName, which searches for, and returns, the field whose FieldName is ‘ID’… and then we read that field’s FieldName. Brilliant, no?

I’m sad to say, though, that these coding gems are no longer in our code base. Somehow, they got lost during some code cleanup. I can’t tell you how much that distresses me.

Diff, part 13: MaxLength and driving directions

Last time in the diff saga, I showed a Longest Common Subsequence (LCS) algorithm with real promise, but a big weakness: it’s O(2n). Heat death of the universe, etc. We can improve on that.

Remember that we build longer CSes by starting with shorter ones, and adding stuff onto the end. Suppose we write a new function MaxLength(CSA), which returns the length of the longest CS that starts with CSA. Let’s go back to our last example:

A is (1 2 3 4)
B is (1 3 4 2)
Matches are [0,0], [1,3], [2,1], and [3,2].

Let’s look at our function MaxLength with some CSes from this input:

  • MaxLength([0,0]) = 3, because [0,0] will grow into length-3 CS [0,0] [2,1] [3,2].
  • MaxLength([0,0] [2,1]) is also 3, because [0,0] [2,1] will grow into that same length-3 CS, [0,0] [2,1] [3,2].
  • MaxLength([0,0] [2,1] [3,2]) is — you guessed it — 3.
  • MaxLength([0,0] [3,2]) = 2, because [0,0] [3,2] has already maxed out; there are no matches we can add at the end to make a longer CS.
  • MaxLength([2,1]) = 2, because [2,1] will grow into [2,1] [3,2].

Now, here’s the plan. We follow our stateful LCS algorithm, but for every CS we’ve accepted so far, we also keep track of its MaxLength. Then, when we come up with a new CS, we calculate its MaxLength, and compare that to the MaxLengths that are already in our list. If the new MaxLength is better than anything in the list, then we throw away the CSes with worse MaxLengths, and add the new one. But if the new MaxLength is worse than anything in the list, then we throw the new one away.

The highest MaxLength will always belong to CSes that are leading subsets of the LCS. That is, if the LCS is [0,0] [2,1] [3,2], then [0,0] will have the highest possible MaxLength — 3, in this case — and so will [0,0] [2,1] and [0,0] [2,1] [3,2]. So if we immediately throw away any CS with a lower MaxLength — [0,0] [1,3], for example, with a MaxLength of a mere 2 — then we’ll home right in on our LCS. We won’t waste time getting sidetracked, and we won’t linger on CSes with obvious holes ([0,0] [3,2] has a MaxLength of 2, so out it goes).

Well, okay, the obvious holes don’t play too big a part in an example this short. But they’re what makes our previous algorithm take exponential time. Fixing them is what makes the algorithm practical. And with MaxLength, we can do just that. Goody!

Until reality sets in, at any rate.

See, you can’t actually know what any of the MaxLengths are until after you’ve generated all the longer CSes. Which pretty much only happens after you’re done running the entire O(2n) LCS algorithm. So you have to go through the algorithm once, without optimizations, so you can run it again, with optimizations. Yeah, that makes a lot of sense.

Driver: “Will we get there faster if I turn left or right at this intersection?”

Navigator: “Well, let’s try turning left, and drive until we get there, and see how long it takes. Then we’ll come back here, and the second time, we’ll turn right at this intersection, and then drive until we get there. Then we’ll know for sure which way will get us there fastest, and we can come back here again and take that route.”

Okay. So, the MaxLength scheme probably isn’t going to work. At all. However, it is a useful concept to build upon, and next time I’ll show a variation that’s nearly as useful, and not nearly as stupid.

Batchin’ it

Jennie just left for a half-week stay with her friend in Kansas City, so I’m batchin’ it here for a few days.

(I debated with myself over spelling. I mean, the phrase “batchin’ it” comes from the word “bachelor”, so maybe it should be “bachin'” instead. But that doesn’t really suggest a meaningful pronounciation — I mean, does “bachin'” mean grooving to Johann Sebastian, or what? And you do have to consider, this rite traditionally does involve queuing up household chores — laundry, dishes, etc. — to all be performed as a single batch operation when the wife gets home. Usually by the husband, at gunpoint. So I think my spelling is probably fairly appropriate.)

That said, the first thing I did after she left was to turn off the TV, unload the dishwasher, reload it, and start it. No kidding. We’ll see if it lasts.

Of course, the next thing I did was come down to the computer, so I may already be drifting into batch mode…

Diff, part 12: More state, Mr. Sulu

The grand diff saga continues…

Last time, I showed an improved one-pass LCS algorithm. But I mentioned that, once again, this is an algorithm that won’t work for all inputs. Here’s an example:

A is (1 2 3 4)
B is (1 3 4 2)
Matches are [0,0], [1,3], [2,1], and [3,2].

Let’s try our algorithm and see what we get.

  • First match: [0,0]. That’s our only match so far, so we have one common subsequence (CS): [0,0].
  • Next match: [1,3]. Can we tack that onto [0,0] and still have a valid CS? Yes. We now have one longer CS: [0,0] [1,3].
  • Next match: [2,1]. Can we tack that onto [0,0] and still have a valid CS? No, but we don’t want to lose the [2,1], so we’ll start a new CS. Now we have two: [0,0] [1,3] and [2,1].
  • Last match: [3,2]. Can we tack that onto [0,0] [1,3]? No. [2,1]? Yes. We end up with two CSes: [0,0] [1,3] and [2,1] [3,2].

Whoops! Given those inputs, we should have had a length-3 LCS: (1 3 4), or [0,0] [2,1] [3,2]. What happened?

What happened was that the [0,0] got sucked into another CS — [0,0] [1,3] — so it wasn’t available as a starting point for us to build [0,0] [2,1] [3,2] from. How can we fix this, and make sure the [0,0] is still available even when it’s been used in another match string?

Well, pretty easily, with a tweak to our list handling. Instead of dropping shorter CSes from the list and replacing them with longer ones, we can just add. So instead of removing [0,0] from our list when we add [0,0] [1,3], we just add them both.

Let’s give that a shot:

  • First match: [0,0]. Can we append that to any existing CSes? No, so we’d better add it as a new CS. Our CS list now contains [0,0].
  • Next match: [1,3]. Append this to everything we can — that would be [0,0]. Our CS list now contains both [0,0] and [0,0] [1,3].
  • Next match: [2,1]. Append this to everything we can — [0,0] once again. Our CS list now has [0,0], [0,0] [1,3], and [0,0] [2,1].
  • Last match: [3,2]. Append this to everything we can — [0,0] and [0,0] [2,1]. Our CS list now has [0,0], [0,0] [1,3], [0,0] [2,1], [0,0] [3,2], and [0,0] [2,1] [3,2].

And, ta-daah — the longest of those, [0,0] [2,1] [3,2], is our expected LCS.

But wait a minute. Why is it trying both [0,0] [3,2] and [0,0] [2,1] [3,2]? They’re the same thing, except that the first one has an obvious hole in the middle.

Actually, I took it easy in this example. If I’d been all proper about it, then I would’ve added each match as a length-1 match string, too; so the CS list would also contain [1,3], [2,1], and [3,2] as length-1 match strings, plus anything else we built off of those. Why? We have to put in length-1 match strings that aren’t built off of anything else; after all, we do it in step 1! So after step 2, we would’ve had three CSes ([0,0], [0,0] [1,3], and [1,3]), not two. After step 3, we would’ve had five. After step 4, we would’ve had eight. The farther we go, the more obvious holes we accumulate.

This whole thing is smelling a lot like exponential time, even without adding the extra length-1 match strings. True, this algorithm is able to cut its losses — it doesn’t keep trying longer and longer match strings that all start with, say, [1,3] [2,1], because it’s smart enough to know that none of them will ever work. But it’s not smart enough to avoid obvious holes, and if it can’t cut very many losses (which will be the case a lot of the time), you’ll be waiting a long, long while for your diff to finish.

But it’s a good algorithm, once we cure its predilection for Swiss cheese. What it needs, at this point, are a few good shortcuts.

Diff, part 11: Once more, with state

Continuing in the diff series:

It’s been a long time since Friday, so let’s start by reviewing some of the problems with the “try ’em all” approach to LCS that we looked at last time. Here’s our sample input again:

A is (1 2 3 4 5 6 7)
B is (6 7 3 4 5 1 2)
Matches are [0,5], [1,6], [2,2], [3,3], [4,4], [5,0], and [6,1].

Trying every possible match string will take a ridiculous amount of time, because it wastes its time a couple of different ways:

  • Obvious gaps: It tries match strings like [2,2] [4,4] and [3,3]. That’s bad — we want something that can recognize patterns a little better, and just try [2,2] [3,3] [4,4] without trying variants with gaps (since there are a lot of variants with gaps).
  • Invalid: It tries match strings like [0,4] [2,2] that aren’t going to pass our rule. That’s not so bad, but it would also try [0,4] [2,2] [3,3], [0,4] [2,2] [3,3] [4,4], etc., instead of cutting its losses as soon as it notices that no valid common subsequence will ever contain [0,4] [2,2].

There are really only three match strings — out of a possible 325 (31 if we can impose ordering on our matches, though we haven’t actually worked out a way to do that yet) — that make any sense: [0,4] [1,5]; [2,2] [3,3] [4,4]; and [5,0] [6,1]. We want to ignore the other 322 (28) possibilities, and go straight to the good stuff. How do we do that?

Actually, we’ve gotten pretty close to that already, in our “string ’em all together” algorithm. I made a big deal out of how it didn’t work, but most of that came down to the fact that we only gave it one register; it could only keep track of one “best-yet common subsequence” at a time. If it had a list instead, it would be a pretty decent fit. Let’s run through the way it would work.

  • We start with the first match, [0,5]. That’s the only match we’ve looked at so far, so we have one common subsequence (CS) at this point: [0,5].
  • The next match is [1,6]. Can we tack that onto [0,5] and still have a valid CS? We would end up with [0,5] [1,6], which passes our rule, so yes, we can append it. Our CS is now [0,5] [1,6].
  • Next: [2,2]. Can we tack that onto [0,5] [1,6] and still have a valid CS? We would end up with [0,5] [1,6] [2,2], which fails our rule, so no, we can’t append it to our existing CS. But we don’t want to throw the [2,2] away, so let’s add it to our list. Now we have two CSes going: [0,5] [1,6] and [2,2].
  • Next: [3,3]. Can we tack that onto [0,5] [1,6]? No. What about [2,2]? Yes. Our CSes are now [0,5] [1,6] and [2,2] [3,3].
  • Next: [4,4]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3]? Yes. Our CSes are now [0,5] [1,6] and [2,2] [3,3] [4,4].
  • Next: [5,0]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3] [4,4]? No, so we’ll have to start another CS. Our CSes are now [0,5] [1,6], [2,2] [3,3] [4,4], and [5,0].
  • Next: [6,1]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3] [4,4]? No. [5,0]? Yes. Our CSes are now [0,5] [1,6], [2,2] [3,3] [4,4], and [5,0] [6,1].

Three sensible CSes, and one is longer than the others. So that’s our LCS. And it does, in fact, match what we expected.

This still won’t quite work for all inputs. I’ll refine it a bit more next time.

Diff, part 10: Try ’em all? (2 of 2)

Continuing in the diff series

So in the last few articles, I proposed and refined a simpleminded algorithm for finding the Longest Common Subsequence, only to find that the algorithm couldn’t run deterministically in a single pass. It gets to points where it has to make a decision, and it doesn’t have enough information to make that decision on the spot. It would have to try both options, and see which one works better.

Then I pondered the question, “Does this mean we just have to try all the possible match strings, see which ones work, and use the longest of those?” The answer was, “I hope not.

But wait. Last time, I was trying every possible match string, in every possible order: I was building both ab and ba as match strings. But that’s silly, because they can’t both pass our rule. Could we start by putting the matches in some sort of order — [1,1] before [2,2], and like that — and then only generate match strings that have the matches in that order? After all, we already know that [2,2] [1,1] isn’t even worth considering; why generate it in the first place?

We could indeed do this, and it would bring the problem from the super-exponential into the merely exponential. It basically turns the list of N match strings into an N-digit binary number. We ignore the number with all bits set to 0 (because if we have any matches at all, then our LCS will always be at least one match long), so we’re now dealing with 2n-1 possible match strings.

That’s a little better, but still not good enough for 1,000-line files:

  • 3 matches: 7 match strings
  • 4 matches: 15 match strings
  • 5 matches: 31 match strings
  • 6 matches: 63 match strings
  • 10 matches: 1,023 match strings
  • 20 matches: 1,048,575 match strings
  • 1,000 matches: 1.071*10301 match strings

So now, instead of taking much longer than cracking a 1,000-bit symmetric encryption key, our 1,000-line diff now takes merely as long as cracking a 1,000-bit key. My previous comments about the heat death of the universe still apply.

That answers our “What if?” question pretty conclusively.

The thing is, trying all the possible match strings is a bad idea for a much simpler reason: it tries way too many stupid options. It will try match strings with obvious holes in them — every possible combination of obvious holes. If you have matches [1,1], [2,2], [3,3], and [4,4], the “try ’em all” solution will be trying things like [1,1] [2,2] [4,4] and [2,2] [3,3] to see how they work. And it’ll take an obscene amount of time doing it.

So we have to have something less exhaustive than “try ’em all”, but it has to be more comprehensive than “one pass and keep the best so far”. Over the next few articles, I’ll start getting into what that algorithm should look like.

Diff, part 9: Try ’em all? (1 of 2)

So in the last couple of diff articles, I’ve made a wild stab at a Longest Common Subsequence (LCS) algorithm. Unfortunately, it didn’t work all the time, which makes it pretty silly as an algorithm.

The problem with that algorithm was the linear approach it took. It looked at one match at a time, and after each match, it expected to know the one-and-only longest-yet common subsequence, and be able to ignore everything else.

And it doesn’t work that way. Let’s go back to the example at the end of the last article:

A is (1 2 3 4 5 6 7)
B is (6 7 3 4 5 1 2)
Matches are [0,5], [1,6], [2,2], [3,3], [4,4], [5,0], and [6,1].

With the above inputs, we want our algorithm to come up with an LCS of [2,2] [3,3] [4,4]. So after it got to the [2,2], we wanted it somehow to know that it should keep that [2,2] instead of the [0,5] [1,6] it had come up with so far. But what if it had this input instead?

A is (1 2 3)
B is (4 5 3 6 7 1 2)
Matches are [0,5], [1,6], and [2,2].

The first three matches are the same, but this time, after it processes the [2,2], we expect it to somehow know that it should throw the [2,2] away and keep the [0,5] [1,6], because the latter is our LCS. So with the same inputs, we expect different results. The algorithm, as described so far, doesn’t have enough information to make the right decision.

What it comes down to is, when the algorithm has [0,5] [1,6] in its register, and it sees the [2,2], there is no reasonable way to decide which one is better. We need an algorithm that will try them both, and see which one yields the longest common subsequence.

Well, does that mean we’re back to generating all possible match strings? Here I’ll take a page from Raymond Chen’s book, and ask, “What would the world be like if we did try all the possible match strings?”

If we try all the possible combinations (permutations? someday I’m going to have to figure out the difference) of 3 matches — let’s call them match a, match b, and match c — we’ll end up with 15 match strings:

  • 3 match strings of length 1: a, b, c.
  • 6 match strings of length 2: ab, ac, ba, bc, ca, cb.
  • 6 match strings of length 3: abc, acb, bac, bca, cab, cba.

I won’t spend a lot of time on the math, but this is 3 + 3*2 + 3*2*1 total match strings. With 4 matches, we would have 4 + 4*3 + 4*3*2 + 4*3*2*1 total match strings. With N matches, we would have N + (N)(N-1) + (N)(N-1)(N-2) + … + (N)(N-1)(N-2)…(2)(1). Here are the actual numbers:

  • 3 matches: 15 match strings
  • 4 matches: 64 match strings
  • 5 matches: 325 match strings
  • 6 matches: 1,956 match strings
  • 10 matches: 9,864,100 match strings
  • 20 matches: way too damn many match strings

We’re looking at a super-exponential progression. That’s right: worse than 2n. So if we had to start by building all the possible match strings, then a diff of two 1000-line files would take longer — much longer — than cracking a 1000-bit symmetric encryption key by brute force. (As far as I know, 128-bit keys are still considered uncrackable, and each additional bit doubles the cracking time.) And when it finished, sometime after the heat death of the universe, you really wouldn’t still care about the diff results anyway.

I’d say there’s a strong possibility that there’s a more efficient way to do it.