Diff, part 6: Match strings and common subsequences #.NET #Delphi #diff

Our sponsor (we have a sponsor?) is proud to bring you the next episode of Diff for the Complete Goober.

A brief aside before I start: I'm going to start using a couple of funky Unicode characters. If these don't show up properly in your browser of choice, please let me know:

≮ should look like < with a slash through it (not less than)
≯ should look like > with a slash through it (not greater than)

So I've talked about the longest common subsequence (LCS) and its place in the diff algorithm. As I mentioned last time, we build the LCS by stringing matches together. But not every possible match string is, in fact, a common subsequence (CS) at all, much less the longest common subsequence.

So the next order of business is making sure we have a rule to separate the commons from the uncommons. Fortunately, this is easy, especially since we're keeping track of the indexes instead of the elements. The rule is just that the A-indexes must be increasing (that is, each match's A-index must be greater than the A-index of the previous match in the string), and so must the B-indexes.

This derives from the definition of a subsequence: its elements must be in the same order as the elements in the original list. If the first match's A-index was, say, 3, and the second match's A-index was 1, then the elements are not in the same order as the original list A (since you've got A[3] first, followed by A[1]); therefore, the match string is not a subsequence of A, at which point it certainly can't be a common subsequence of both A and B.

The explanation is a bit painful, but the rule really is pretty simple. Here are a few examples:

[0,0] [1,1] [2,2] [3,3]
Valid CS: 0 < 1 < 2 < 3.

[0,10] [1,11] [2,12] [3,13]
[10,0] [11,1] [12,2] [13,3]
Valid CSes. (Remember that you're only comparing A-indexes to A-indexes and B-indexes to B-indexes. 0 < 1 < 2 < 3 and 10 < 11 < 12 < 13, so we're OK.)

[0,0] [10,10] [5,5]
Not a valid CS: 0 < 10 ≮ 5.

[0,0] [1,0]
Not a valid CS: the A-indexes are OK (0 < 1), but the B-indexes are not (0 ≮ 0).

Pay particular attention to that last one. We'll come back to it later.