Diff, part 10: Try 'em all? (2 of 2) #.NET #Delphi #diff
Continuing in the diff series...
So in the last few articles, I proposed and refined a simpleminded algorithm for finding the Longest Common Subsequence, only to find that the algorithm couldn't run deterministically in a single pass. It gets to points where it has to make a decision, and it doesn't have enough information to make that decision on the spot. It would have to try both options, and see which one works better.
Then I pondered the question, "Does this mean we just have to try all the possible match strings, see which ones work, and use the longest of those?" The answer was, "I hope not."
But wait. Last time, I was trying every possible match string, in every possible order: I was building both ab
and ba
as match strings. But that's silly, because they can't both pass our rule. Could we start by putting the matches in some sort of order — [1,1]
before [2,2]
, and like that — and then only generate match strings that have the matches in that order? After all, we already know that [2,2] [1,1]
isn't even worth considering; why generate it in the first place?
We could indeed do this, and it would bring the problem from the super-exponential into the merely exponential. It basically turns the list of N match strings into an N-digit binary number. We ignore the number with all bits set to 0 (because if we have any matches at all, then our LCS will always be at least one match long), so we're now dealing with 2n-1 possible match strings.
That's a little better, but still not good enough for 1,000-line files:
- 3 matches: 7 match strings
- 4 matches: 15 match strings
- 5 matches: 31 match strings
- 6 matches: 63 match strings
- 10 matches: 1,023 match strings
- 20 matches: 1,048,575 match strings
- 1,000 matches: 1.071*10301 match strings
So now, instead of taking much longer than cracking a 1,000-bit symmetric encryption key, our 1,000-line diff now takes merely as long as cracking a 1,000-bit key. My previous comments about the heat death of the universe still apply.
That answers our "What if?" question pretty conclusively.
The thing is, trying all the possible match strings is a bad idea for a much simpler reason: it tries way too many stupid options. It will try match strings with obvious holes in them — every possible combination of obvious holes. If you have matches [1,1]
, [2,2]
, [3,3]
, and [4,4]
, the "try 'em all" solution will be trying things like [1,1] [2,2] [4,4]
and [2,2] [3,3]
to see how they work. And it'll take an obscene amount of time doing it.
So we have to have something less exhaustive than "try 'em all", but it has to be more comprehensive than "one pass and keep the best so far". Over the next few articles, I'll start getting into what that algorithm should look like.