Joe White's Blog Life, .NET, and cats

Diff, part 4: Matches #.NET #Delphi #diff

Another installment in the How Diff Works series:

Last time, I talked about Longest Common Subsequence (LCS), basically a "what's the same between these two lists". So, how do you go about finding the LCS?

Well, you look for matches. (Note: "match" is my own word for this, and may not correspond with anyone else's terminology.) A match is an element that exists somewhere in list A, and is also somewhere in list B. The LCS is just a bunch of matches strung together.

So to continue our earlier example, if we have these two lists:

A is (1 2 3)
B is (1 4 5 3)

then we have two matches. The first match is the element 1, which appears at A[0] and B[0]. The second match is 3, which appears at A[2] and B[3].

Each "match" is, in fact, a common subsequence (of length 1). So if we take our two matches and string them together, we get the LCS, (1 3).

(If you think that's all there is to it, keep reading. I've got several more articles to go.)

Another useful way to express this LCS is [0,0] [2,3]. (Note that I'm making up this notation, so if square brackets are the wrong thing to use, please correct me.) This shows where the matches occur in the original lists, rather than showing what the elements are: the first match is at A[0] and B[0], the next match is at A[2] and B[3].

One situation where indexes are handier is when you're doing a case-insensitive diff. A[0] might be "Hello, world!" and B[0] might be "hElLo, WoRLd!". If you're doing a case-insensitive compare, then A[0] and B[0] are considered equal, even though, in the eyes of the computer, they're not; so which one shows up in the LCS? If it doesn't matter, then you probably didn't want the actual elements in your LCS anyway, and would be better off with the indexes. The same question applies when you're ignoring whitespace, an extremely common option when you're comparing source code.

Using the indexes also makes some operations easier, like showing the two files side-by-side, or generating Unix-diff-style output. And, most important for these articles, it turns out that keeping track of the indexes is essential to building the LCS.

Mathematically speaking, the LCS contains elements (like (1 3)), not indexes. But from here on, when I say LCS, I'll usually mean a list of pairs of indexes (like [0,0] [2,3]), because that's generally going to be more relevant.