Diff, part 17: Optimizing match discovery #.NET #Delphi #diff

The next really interesting topic in the diff series is going to take more time to write than I have this morning. So today, I'll talk about a fairly simple optimization to the match-finding loops.

Our outer two loops looked like this:

Loop through all the elements in A, and for each A_i:
- Loop through all the elements in B, and for each B_j that matches A_i:
  - (do some other stuff)

Now, imagine that we have input that's a typical source-code file, perhaps C# or Delphi code, and that we're ignoring leading whitespace as we compare.

In a source file of any length, we're going to have repetition. There will be blank lines in both files. There will be lines containing only '{' (C#) or only the keyword 'begin' (Delphi), which all compare the same because we're stripping off the leading whitespace before we compare. Same thing for '}' or 'end;'.

Every time we come across one of those duplicated lines, we do a bunch of wasted work. Let's see how:

Get the first element from A. Let's say it's "unit Foo;".
- Loop through all the elements in B, and for each B_j that matches "unit Foo;", do some stuff.
Get the next element from A. This one is the empty string, "".
- Loop through all the elements in B, and for each B_j that matches "", do some stuff.
Get the next element from A: "interface".
Loop through all the elements in B, and for each B_j that matches "interface", do some stuff.
Get the next element from A: the empty string again.
- Loop through all the elements in B, and for each B_j that matches "", do some stuff.
Get the next element from A: "uses SysUtils, Classes;".
- Loop through all the elements in B, and for each B_j that matches "uses SysUtils, Classes;", do some stuff.
Get the next element from A: the empty string again.
- Loop through all the elements in B, and for each B_j that matches "", do some stuff.

Every time we get "" from A, we loop through B looking for all of its ""s. But we already did that. Couldn't we save ourselves some work if we just remembered where all the ""s are, so we can just loop over the interesting bits of B, rather than the whole boring thing?

Indeed we could, and it would save us time even if there isn't any duplication between the two lists. We do it by replacing those two outer loops with this:

Create a Hashtable called BHash.
Loop through all the elements in B, and for each B_j:
- If BHash[B_j] is nil, then BHash[B_j] := TList.Create;
- BHash[B_j].Add(j);
Loop through all the elements in A, and for each A_i:
- Loop through all the matching "j"s we've already listed at BHash[A_i]:
  - (do some other stuff)

We start with a single pass through B to build a reverse-lookup table. (Ideally we would use a genuine Hashtable, so that adds and lookups are both always O(1). If we're using Delphi for Win32, which doesn't have a real Hashtable, we'll have to fake one, or implement one from scratch.)

This reverse-lookup table maps each element in B to a list of indexes where that element appears:

BHash["unit Foo;"] = (0) BHash[""] = (1, 3, 5, ...) BHash["interface"] = (2) BHash["uses SysUtils, Classes;"] = (4)

Now, when we get to an element in A, we're not looping through every element in B. That was a bit wasteful to begin with, because for typical input, most of those elements in B wouldn't match — we were looping over all N even though only a handful will actually matter. Now, with the addition of the reverse-lookup hash, we only look at the elements in B that do match.

So the second loop, instead of being O(N), is now O(D) — the number of iterations is roughly proportional to the number of duplicates in list B. Actually, list A figures into that calculation too, because we'll only care about those duplicates in B if they also appear in A — after all, if they don't appear in A, we're never going to look them up in B.

D may approach N, if the lists contain almost all duplicates (e.g., if both input files consist of nothing but blank lines); or it may approach 1, if the lists contain almost no duplicates. It may even approach 0, if the two input files have nothing in common, because every time we go to look for something in BHash, there'll be nothing to find and therefore nothing to iterate.

So our outer two loops, instead of being O(N²), are now O(ND). Not bad for a short morning's work.