Joe White's Blog Life, .NET, and cats

Diff, part 19: De-dup by length? #.NET #Delphi #diff

Today in the diff series, I'll start in on what I think will be the last major optimization in our LCS algorithm.

Recall that we may have multiple same-length common subsequences in our "best CSes so far" pool. For example, we may have [0,0] [1,2] and [0,0] [2,1]. We've been keeping both because the next match might be [2,3], or [3,2], or maybe even [128,2]. We can't throw away any CS that we might reasonably be able to build upon later (unless we can make an ironclad guarantee that it's expendable).

Also recall that, when we generate a new CS, we do so by adding a new match onto an existing CS; specifically, the longest one that we can append our new match to. When we look for an existing CS to extend, the only thing we care about is: for the given match, will this CSk give us the longest CSnew that we can possibly generate?

We don't actually care whether CSk ends in [1,2] or [2,1], as long as we can guarantee that when we append the current match — be it [2,3], [3,2], [3,3], or whatever else — we'll end up with the longest CSnew we can get. Anywhere and everywhere along the way, we can throw away as many CSes as we want (just like we did yesterday), as long as we never compromise that guarantee. Never throw a CS away unless you can prove it has less growth potential than something else you already have; otherwise, anything goes.

These two properties — length is all-important, yet we keep multiple CSes of the same length — seem to be in a bit of tension with each other. Which suggests an intriguing question: can we get to the point where length alone is a good enough guarantee? Can we get to the point where we hang onto just one length-1 CS, one length-2 CS, one length-3 CS, etc., at a time?

If that were possible, it would mean that we could make a big optimization to our fourth loop — the one where we call Expendable to compare our CSnew to each and every CSk, so we can see if there's anything (including possibly CSnew) that we can throw away. That fourth loop could, in fact, stop being a loop. It could become a single call to Expendable, comparing CSnew to the CSk of the same length. (Of course, if CSnew is the longest CS so far, then there would be no CSk of the same length, and we wouldn't need to call Expendable at all.) And that call to Expendable would always have a definitive result: either you throw away CSk, or you throw away CSnew. Only one CS of any given length would be left standing at any given time.

A worthy optimization, to be sure. But is it doable? We would have to throw away one of those CSes-of-the-same-length, but Expendable([0,0] [1,2], [0,0] [2,1]) returns nil, meaning we need both CSes at one time or another. How could we reconcile this?

By siezing on that "at one time or another", and making this into a timing issue. Then we can control the timing, by controlling the way we iterate. How, and in what order, do we generate our matches?

We would be home free if we could generate our matches in this order:

  • [0,0]
  • [2,1]
  • Every possible match that could extend [2,1] but not [1,2]
  • [1,2]

Or perhaps:

  • [0,0]
  • [1,2]
  • Every possible match that could extend [1,2] but not [2,1]
  • [2,1]

As long as our algorithm can consistently return matches in one of these orders (and can be consistent about which one), then when we get to the point where [0,0] [1,2] and [0,0] [2,1] are our CSnew and CSk (in whichever order), we know with absolute certainty that we can throw the old one away, replace it with the new one, and be sure that we're still going to find the longest possible CS when we're done.

So how do we handle the third bullet point in either of the above orderings? That third bullet is kind of the "and then a miracle occurs" stage. Next time, I'll start to nail down what "every possible match that could extend x but not y" really means, and how we can possibly hope to pull it off.