Diff, part 8: String ’em all together? (2 of 2)

The diff saga continues…

Last time, we took the plunge and tried to write a Longest Common Subsequence (LCS) algorithm without going to the bother of generating all the possible match strings first. But we hit a snag when our prototype algorithm failed to get the correct LCS for a particular set of inputs.

This time, we’ll add a refinement that makes that algorithm a little more sophisticated. Let’s start by revisiting the input lists:

A is (1 2 3 4 5)
B is (5 2 3 4 1)
Matches are [0,4], [1,1], [2,2], [3,3], and [4,0].

Now recall that we started with the first match — [0,4] — and stuffed it into our “best Common Subsequence (CS) so far” register. Then we came to the second match, and tried to tack it onto the end of our best-yet CS, yielding [0,4] [1,1]. This match string fails our rule; it is not a valid CS.

Last time, we simply threw away the [1,1] and went on to look at the next match, and thus failed to find the real LCS (since the real LCS starts with [1,1]). So this time, let’s look a little more closely. Suppose that instead of throwing the [1,1] away, we instead threw the [0,4] away, and put the [1,1] into our best-yet register instead. What would happen? If we were clever enough about it, we’d end up with [1,1] [2,2] [3,3] as our final LCS — we’d get the right answer.

Okay, so how do we teach a computer how to make that decision? Let’s try this: follow the same reasoning as last time, but this time, if the new match doesn’t fit, then instead of throwing the new match away, let’s throw away the old match string instead. So:

  • First match: [0,4]. This is our best CS so far (since it’s our only CS so far).
  • Next match: [1,1]. Add that to our best CS so far. Result: [0,4] [1,1]. This isn’t a valid CS, so we throw away our old best-yet CS and replace it with [1,1].

So far, so good. We can tell, just by visual inspection, that the LCS is [1,1] [2,2] [3,3]. This latest algorithm, so far, has us on the right track to get there. Let’s continue:

  • Next match: [2,2]. Add that to our best CS so far. Result: [1,1] [2,2]. This is a valid CS; keep it.
  • Next match: [3,3]. Add that to our best CS so far. Result: [1,1] [2,2] [3,3]. This is a valid CS; keep it. Looking good!
  • Next match: [4,0]. Add that to our best CS so far. Result: [1,1] [2,2] [3,3] [4,0]. This isn’t a valid CS, so we… uh-oh… throw away our old best-yet CS and replace it with [4,0].

Okay, that’s not it either. We had it — we had the right LCS — and then our algorithm threw away a length-3 in favor of a length-1. That’s not right! We shouldn’t be throwing away longer CSes and starting over at 1, not when the whole point is to get the longest CS possible. But the rule that made it possible for us to keep the [1,1] also forced us to throw away the [1,1] [2,2] [3,3].

What if we change our algorithm to say that longer CSes always take precedence, and that we only use our “throw it away” tactic when it won’t result in a shorter CS? That is, what if we do our “toss the old best-yet CS, and put our new match in the ‘best-yet’ register as a length-1 CS” only if the old CS was also length 1?

That’s kind of a smelly rule — too highly special-cased for my taste. But it works for this input. Let’s run through those steps again, with this tweak in force:

  • First match: [0,4]. This is our best CS so far (since it’s our only CS so far).
  • Next match: [1,1]. Add that to our best CS so far. Result: [0,4] [1,1]. This isn’t a valid CS. Can we replace our old best-yet CS with [1,1] without losing length? Yes, we can, so our new best-yet CS is [1,1].
  • Next match: [2,2]. Add that to our best CS so far. Result: [1,1] [2,2]. This is a valid CS; keep it.
  • Next match: [3,3]. Add that to our best CS so far. Result: [1,1] [2,2] [3,3]. This is a valid CS; keep it.
  • Next match: [4,0]. Add that to our best CS so far. Result: [1,1] [2,2] [3,3] [4,0]. This isn’t a valid CS. Can we replace our old best-yet CS with [4,0] without losing length? No, we can’t, so our best-yet CS is still [1,1] [2,2] [3,3].

Victory! But just to make sure, let’s take one more example. (My regular readers will, of course, understand this to mean “no, this algorithm doesn’t really work in the general case, and I’ll tell you why.”)

A is (1 2 3 4 5 6 7)
B is (6 7 3 4 5 1 2)
Matches are [0,5], [1,6], [2,2], [3,3], [4,4], [5,0], and [6,1].

Apply the algorithm, and away we go:

  • Start with [0,5].
  • Can we append [1,6] and still have a valid CS? Yep. We now have [0,5] [1,6].
  • Can we append [2,2] and still have a valid CS? No. Can we make [2,2] our “best yet” without losing length? No. We still have [0,5] [1,6].
  • [3,3], [4,4], [5,0], and [6,1]: Same as [2,2]. We still end up with [0,5] [1,6].

Close, but no banana: we end up with a length-2 CS when the real LCS should have been [2,2] [3,3] [4,4]. We went down the wrong path, and were unable to correct later.

Is there a way to tweak this algorithm still more, to make it work all the time? The answer is, not without changing some of the basic structure of the algorithm. The “string ’em all together” approach is fundamentally too limited to deal with the LCS problem, and I’ll go into that in more depth next time. However, we did try some interesting things, like throwing away one CS in favor of another of the same length, that will be useful later on.

Got a headache

(to the tune of “Found a Peanut“)

Got a headache, got a headache,
Got a headache just now,
Just now I’ve got a headache,
Got a headache just now.

It was rotten, it was rotten,
It was rotten just now,
Just now, it was rotten,
It was rotten just now.

Excedrin, Excedrin,
Excedrin just now,
Just now, Excedrin,
Excedrin just now.

Didn’t help much, didn’t help much,
Didn’t help much just now,
Just now it didn’t help much,
Didn’t help much just now.

Going to bed, going to bed,
Going to bed just now,
Just now I’m going to bed,
Going to bed just now.

Thank you, you’ve been a great crowd. I’ll be here all week.

(Hey, I get euphoric when the Excedrin starts to kick in. Whether it’s the caffeine, or something to do with the migraine itself, I don’t know… but I’ll spare you any more silliness by going to bed, just now.)

Diff, part 7: String ’em all together? (1 of 2)

Continuing the Diff for the Complete Goober series (after a brief hiatus on account of Deleted Sleep Time):

The core problem of diff is the building of the longest common subsequence, which we accomplish by finding all the matches in the two lists, then assembling match strings that match our rule (which filters out the match strings that are not common subsequences).

Does it have to be that complicated? Let’s just try it. Say we have these two lists:

A is (1 2 3 4)
B is (1 3 4)
Matches: [0,0], [2,1], and [3,2].

Do we really need to go through all the rigamarole? You can build a match string out of all three matches in their original order, and it’ll be a common subsequence — and at length 3, it’ll be the longest common subsequence. So our LCS has to be [0,0] [2,1] [3,2].

Well, that was easy. Grab all the matches, put them together, and there’s the LCS. It can’t be that easy, can it? Well… no, actually, it can’t:

A is (1 2 3)
B is (1 3 2)
Matches: [0,0], [1,2], and [2,1].

If we try stringing these all together, we get [0,0] [1,2] [2,1], which isn’t a valid common subsequence. In this case, the real LCS is going to be length 2: either [0,0] [1,2] or [0,0] [2,1]; take your pick.

Can we adapt our simple “string ’em all together” strategy to work with these inputs? Well, we could start with [0,0], and keep adding matches on, as long as the result is still a valid CS; and just throw away any matches that we can’t append to our longest-running-common-subsequence. Let’s try that.

  • First match: [0,0]. This gives us a length-1 match string that is a valid CS. (Actually, any length-1 match string will always be a valid CS.)
  • Next match: [1,2]. Add that to our best CS so far. Result: [0,0] [1,2]. This is a valid CS; keep it.
  • Next match: [2,1]. Add that to our best CS so far. Result: [0,0] [1,2] [2,1]. This is not a valid CS; throw it out. Our best CS so far is still [0,0] [1,2].

That looks promising, doesn’t it? Let’s try it with a slightly different input:

A is (1 2 3 4)
B is (1 3 2 4)
Matches: [0,0], [1,2], [2,1], and [3,3].

The lists are the same as above, but with one more match. So we do the same thing as above, but with one more step at the end:

  • Next match: [3,3]. Add that to our best CS so far. Result: [0,0] [1,2] [3,3]. This is a valid CS; keep it.

Again, we get the right LCS. Looking good! One more example:

A is (1 2 3 4 5)
B is (5 2 3 4 1)
Matches: [0,4], [1,1], [2,2], [3,3], and [4,0].

Steps:

  • First match: [0,4]. This gives us a length-1 match string that is a valid CS.
  • Next match: [1,1]. Add that to our best CS so far. Result: [0,4] [1,1]. This is not a valid CS; throw it out. Our best CS so far is still [0,4].
  • Next match: [2,2]. Add that to our best CS so far. Result: [0,4] [2,2]. This is not a valid CS; throw it out. Our best CS so far is still [0,4].
  • Next match: [3,3]. Add that to our best CS so far. Result: [0,4] [3,3]. This is not a valid CS; throw it out. Our best CS so far is still [0,4].
  • Next match: [4,0]. Add that to our best CS so far. Result: [0,4] [4,0]. This is not a valid CS; throw it out. Our best CS so far is still [0,4].

So our modified “string ’em all together” algorithm gives us an LCS of [0,4]. But wait — that’s not right. There are three matches right in a row, in the middle of the inputs. The LCS should be [1,1] [2,2] [3,3]. What does that mean?

Well, it means that our algorithm isn’t right yet. Next time, I’ll add a refinement that would make it work for all these inputs.

(Notice, though, that I said these inputs. I’ve still got several more articles to go.)

Sleep Losing Time

It’s that time of year again. Time to lose an extra hour of sleep every single day, decrease productivity, increase the number of severe auto accidents, and make everyone crabby.

That’s right — it’s Sanity Losing Time. (Oh, wait — sorry. Sanity Losing Time was when the legislature created this silly mess in the first place. What we’re starting now is Sleep Deprivation Time.)

In college, I wrote a paper about Daylight Saving Time. My point was that it’s stupid. It doesn’t save daylight, it doesn’t save time, and traffic fatalities, road rage, and extra time to deliver your pizza because the pizza guy has to weave around all the accidents, just aren’t worth the extra quality time you spend changing your clocks twice a year.

Instead, I argued, if we’re going to arbitrarily change clocks to be out of sync with the seasons, we should instead adopt what I call Daylight Losing Time. My proposal: once every, let’s say, two weeks, everyone sets their clock back an hour. Meaning, everyone gets an extra hour of sleep twice a month. Bye-bye, road rage.

Daylight Losing Time would have other benefits, too. Lovers could enjoy the starlight without staying up past their bedtime. People like me, who enjoy sunrises and early-morning air, but hate getting up early, would be accommodated for fully half the year. And everybody would be less grouchy. What’s not to like?

Write to your elected officials, and demand an end to the madness that is Daylight Saving Time. Urge them to adopt Daylight Losing Time instead. I’ll You’ll be glad you did.

New refactoring

I was pairing with Ron Jeffries yesterday, making a medium-small change (it affected 32 call sites scattered through a couple dozen units, but it was basically just changing those call sites to use a different constructor).

As we worked, we became vaguely aware that we often wound up passing two parameters that always varied together. They seemed to belong together on a parameter object. (The canonical example being that you always find yourself passing an X and a Y, so you create a Point type.)

So we briefly coded up that parameter object, amidst a good deal of spirited discussion on how this should work and how that should work. (That should’ve been a clue. Ron may have seen it, but I didn’t, yet.)

Then we went to use our nifty new class, and realized, for complicated reasons, that we couldn’t actually use it just yet. It was, in fact, a YAGNI — we didn’t have an actual call site using it yet, which is why we (me in particular) had such a hard time figuring out some of the details of how callers would use it. It’s hard to figure out what the code does when it doesn’t do it.

As soon as we figured out that it was a YAGNI, Ron said we should rip it back out. He had a point: it was just cluttering the code, and adding complexity we didn’t need. But I expected to need it within a day or so. It was simple code, but it wasn’t quite so simple a concept, and we had to wrestle with naming a few things. I wanted to hang onto what we’d done (particularly the type name and method names), for possible future reference, but we didn’t want to keep it in the code.

So I invented a new refactoring: Extract YAGNI to Notepad.

Diff, part 6: Match strings and common subsequences

Our sponsor (we have a sponsor?) is proud to bring you the next episode of Diff for the Complete Goober.

A brief aside before I start: I’m going to start using a couple of funky Unicode characters. If these don’t show up properly in your browser of choice, please let me know:

should look like < with a slash through it (not less than)
should look like > with a slash through it (not greater than)


So I’ve talked about the longest common subsequence (LCS) and its place in the diff algorithm. As I mentioned last time, we build the LCS by stringing matches together. But not every possible match string is, in fact, a common subsequence (CS) at all, much less the longest common subsequence.

So the next order of business is making sure we have a rule to separate the commons from the uncommons. Fortunately, this is easy, especially since we’re keeping track of the indexes instead of the elements. The rule is just that the A-indexes must be increasing (that is, each match’s A-index must be greater than the A-index of the previous match in the string), and so must the B-indexes.

This derives from the definition of a subsequence: its elements must be in the same order as the elements in the original list. If the first match’s A-index was, say, 3, and the second match’s A-index was 1, then the elements are not in the same order as the original list A (since you’ve got A[3] first, followed by A[1]); therefore, the match string is not a subsequence of A, at which point it certainly can’t be a common subsequence of both A and B.

The explanation is a bit painful, but the rule really is pretty simple. Here are a few examples:

[0,0] [1,1] [2,2] [3,3]
Valid CS: 0 < 1 < 2 < 3.

[0,10] [1,11] [2,12] [3,13]
[10,0] [11,1] [12,2] [13,3]
Valid CSes. (Remember that you’re only comparing A-indexes to A-indexes and B-indexes to B-indexes. 0 < 1 < 2 < 3 and 10 < 11 < 12 < 13, so we’re OK.)

[0,0] [10,10] [5,5]
Not a valid CS: 0 < 10 5.

[0,0] [1,0]
Not a valid CS: the A-indexes are OK (0 < 1), but the B-indexes are not (0 0).

Pay particular attention to that last one. We’ll come back to it later.

Moved In

We bought our house at the beginning of last July, and ended the lease on our old apartment at the end of July. That left us a month to move all our stuff out of the apartment.

Fortunately, we know ourselves, so we decided we were going to need more than just one month to pack and move all our stuff. So prior to July, I was boxing up all the stuff that wouldn’t care terribly about the temperature (books, stuffed animals, winter coats, etc.), and moving it out to our storage garage at the apartment complex. That way, at the very least, this stuff would be already boxed up when it came time to move, but wouldn’t be in our way before then.

This worked pretty well; come July, much of the time-consuming packing was already out of the way, so we were able to spend less time packing and more time moving. We were committed to getting everything out of the apartment by the end of July; and we did, so that large recurring payment went away after July (thus freeing up those funds to be some portion of a mortgage payment).

We were hoping to get everything out of the storage garage during July, too, but we figured that if we didn’t finish that, it wouldn’t be the end of the world (since the storage-garage rent was a whole lot less than the apartment). If we had to, so the plan went, we could keep the storage garage for an extra month, or maybe two months if we had to, while we got the last of our stuff moved out of it.

Yesterday — eight months after we moved out of the apartment — I finally got the storage garage cleaned out and the keys turned in. Oh, well…

This morning, as I lay groggily in bed before the alarm even went off, I realized that this was actually a pretty big deal. As of yesterday, we are finally and truly Moved In to our house.

(Or rather, we would be Moved In, if I pulled my car full of boxes into the garage.)

Being unpacked, of course, is an entirely different matter…

The Commentator

My dad sent me an interesting link this morning, for a development tool called The Commentator. Some of their marketspeak (emphasis theirs):

The Commentator uses revolutionary real-time language processing to actually grok your code and add the necessary comments on the fly.

The Commentator’s powerful Personality Controls allow you to tweak it’s output so completely that it’s as if The Commentator is speaking for you. In your voice.

Their tagline sums it up: “Time commenting could be time coding.

Check it out. The example comments are quite illuminating.

Caution: Note today’s date.

Diff, part 5: Match strings

Yes, folks, it’s time for another action-packed installment of the How Diff Works saga.

So I’ve talked about what diff is, and how you get diffs by inverting the Longest Common Subsequence (LCS), and how the LCS is just a bunch of matches strung together. And I gave an example of two lists and the matches between them. That example included some hand-waving that made it look easy to build the LCS.

But it isn’t always so. To help make the distinction, I’m going to make up some more terminology: when you string a bunch of matches together, I’ll call that a “match string”. (If anyone has suggestions for a better name, let me know.) A match string can contain just one match, or it can contain several.

Why the distinction? Because something can be a match string, and not be a common subsequence. (And if it isn’t a common subsequence, it certainly can’t be the longest common subsequence.)

Let’s illustrate this with a second example. (I’m using a made-up notation which I explain here.)

A is (1 2)
B is (2 1)

Here, again, we have two matches: the element 1 at [0,1], and the element 2 at [1,0]. This gives us four possible match strings:

[0,1]
[1,0]
[0,1] [1,0]
[1,0] [0,1]

So we just pick the longest match string, and that’s our LCS, right? Well, no. First, we have to figure out which of these match strings actually qualify as common subsequences. Then we take the longest of those.

[0,1]
is a valid CS: (1) is a subsequence of both A and B.

[1,0]
is a valid CS: (2) is a subsequence of both A and B.

[0,1] [1,0]
is NOT a valid CS: (1 2) is a subsequence of A, but is not a subsequence of B. (Remember that, to be a subsequence, the elements have to occur in the same order they did in the original. 1 and 2 do occur in B, but not in that order.)

[1,0] [0,1]
is NOT a valid CS: (2 1) is a subsequence of B, but is not a subsequence of A.

So we only have two valid common subsequences, both of equal length (1). The longest common subsequence, then, is either [0,1] or [1,0]. It doesn’t really matter which you pick; they both qualify as “longest”. So we end up with a length-1 LCS, which gives us the shortest possible diff: one insertion and one deletion. (This makes sense, actually; if the LCS were length 2, then that would mean both lists are the same, and there are no diffs.)

Next time, I’ll outline an algorithm for deciding whether a given match string is a valid common subsequence.