Joe White’s Blog

Life, .NET, and Cats


Archive for March, 2005

Diff, part 4: Matches

Thursday, March 31st, 2005

Another installment in the How Diff Works series:

Last time, I talked about Longest Common Subsequence (LCS), basically a “what’s the same between these two lists”. So, how do you go about finding the LCS?

Well, you look for matches. (Note: “match” is my own word for this, and may not correspond with anyone else’s terminology.) A match is an element that exists somewhere in list A, and is also somewhere in list B. The LCS is just a bunch of matches strung together.

So to continue our earlier example, if we have these two lists:

A is (1 2 3)
B is (1 4 5 3)

then we have two matches. The first match is the element 1, which appears at A[0] and B[0]. The second match is 3, which appears at A[2] and B[3].

Each “match” is, in fact, a common subsequence (of length 1). So if we take our two matches and string them together, we get the LCS, (1 3).

(If you think that’s all there is to it, keep reading. I’ve got several more articles to go.)

Another useful way to express this LCS is [0,0] [2,3]. (Note that I’m making up this notation, so if square brackets are the wrong thing to use, please correct me.) This shows where the matches occur in the original lists, rather than showing what the elements are: the first match is at A[0] and B[0], the next match is at A[2] and B[3].

One situation where indexes are handier is when you’re doing a case-insensitive diff. A[0] might be “Hello, world!” and B[0] might be “hElLo, WoRLd!”. If you’re doing a case-insensitive compare, then A[0] and B[0] are considered equal, even though, in the eyes of the computer, they’re not; so which one shows up in the LCS? If it doesn’t matter, then you probably didn’t want the actual elements in your LCS anyway, and would be better off with the indexes. The same question applies when you’re ignoring whitespace, an extremely common option when you’re comparing source code.

Using the indexes also makes some operations easier, like showing the two files side-by-side, or generating Unix-diff-style output. And, most important for these articles, it turns out that keeping track of the indexes is essential to building the LCS.

Mathematically speaking, the LCS contains elements (like (1 3)), not indexes. But from here on, when I say LCS, I’ll usually mean a list of pairs of indexes (like [0,0] [2,3]), because that’s generally going to be more relevant.

Diff category

Thursday, March 31st, 2005

I’ve created a new Diff category on my blog, for my articles about the diff algorithm and (hopefully, later) source code.

And for my junkies (grin), I’m hoping to get another diff article out tonight.

Diff, part 3: Introducing the Longest Common Subsequence

Tuesday, March 29th, 2005

In the last two articles, I’ve begun outlining what diff is all about. This time, I’ll start to go into how it’s actually done.

It turns out that it’s hard to find out what’s different between two lists, but much easier to find out what’s the same. So, we start by finding the Longest Common Subsequence (LCS), a much easier problem than diff. Once we’ve got the LCS, the diffs are easy. So we can throw away our stupid diff implementation, and use LCS instead… once we figure out how to do an LCS, anyway.

But first, what is a Longest Common Subsequence? I’ll illustrate with these two lists, A and B:

A is (1 2 3)
B is (1 4 5 3)

A subsequence of A is just a subset of A. (1 2) is a subsequence of A; so are (1 2 3), (3), and () (the empty list). However, (3 2) is not a subsequence of A. To be a subsequence, the elements have to stay in their original order. (2 2) isn’t a subsequence of A either; no fair repeating stuff (though it would’ve been fine if there were two 2s in the original).

A common subsequence of A and B would be anything that’s a subsequence of A, and is also a subsequence of B. There are four in this case: (), (1), (3), and (1 3).

The longest common subsequence is just the longest of all possible common subsequences. Here, it’s (1 3).

The LCS tells us what’s the same between A and B. And once we know what’s the same, the diff is pretty straightforward. Actually, any common subsequence would do (our stupid implementation is actually the degenerate case, using () as its common subsequence). But if we use the longest possible common subsequence, we’ll end up with the shortest possible set of diffs, which is exactly what we want.

So, how do we implement LCS? Well, we could build all the possible subsequences of A, check to see which ones are also subsequences of B, and keep track of the longest one we’ve found so far. When we’re done, we’ve got our LCS.

It would work, but it would be stupid. A list of length 3 has 8 possible subsequences. A list of length 10 has 1,024. Length 20? 1,048,576. Length 100? 1,267,650,600,228,229,401,496,703,205,376 possible subsequences. And if you’re comparing files with a few thousand lines? You’ll be there a while.

Hmm. Maybe we need a not-so-stupid LCS algorithm.

Diff, part 2: Head and tail matches

Tuesday, March 29th, 2005

Last time, I defined the diff algorithm, and gave a stupid implementation of it. Here, I’ll add improvement #1: better results by cheating (only I’ll call it “optimizing”).

We’re still very much in the realm of the mundane, here. But we can improve our diffs, a little, by first looking for head and tail matches, and then doing a stupid diff on what’s left. If the first twenty elements in list A are exactly the same as the first twenty elements in list B, and the last twenty elements in list A are exactly the same as the last twenty elements in list B, then you don’t need to bother comparing anything except what’s in between.

If we start by looking for head and tail matches, and then apply our stupid diff algorithm to what’s left, then (in our simple example, where list B simply added two consecutive lines in the middle) we get the ideal diff results: two inserts.

Of course, this breaks down pretty quickly. Suppose list A is the lines from a source file. List B is the same file, but with a comment added at the beginning, a blank line deleted at the end, and no other changes. Now head and tail matches are no help; our stupid algorithm will still give stupid results.

That doesn’t mean that head and tail matches are worthless. In fact, they’re still very valuable. Looking for head and tail matches is an O(N) operation, which is substantially more efficient than any good diff algorithm is going to be — and reduces the number of lines you need to run through the more expensive (something like O(N²), I think, or O(ND), or something like that) operation, which is a huge win. Head and tail matches aren’t very valuable on their own, but they’re the first step in any reasonable diff implementation, even a stupid one.

Next time, we’ll make things a little more interesting, by turning the lists inside-out.

Diff, part 1: A definition, and a stupid implementation

Tuesday, March 29th, 2005

I’ve once again run into a case where it’d be nice to have a diff library for Delphi and/or C#, so I figure it’s time I get around to writing one. As I go, I’ll be blogging about the underlying theory, partly to remind myself how it works, and partly just in case anyone might want to read about it. (Blogging seems, after all, to be one big exercise in YAGNI — it’s the epitome of writing stuff without having a clue whether anyone’s ever going to need it.)

If you don’t understand anything I babble about here, please, post a comment. Don’t be shy. This stuff starts simple, but gets hairy before long. If you’re interested in this stuff, then I don’t want to lose you in the details. If anything isn’t clear, or if you have a question, or a better example, speak up. I don’t bite. (Not strangers, anyway.)


Most of what I know about the diff algorithm comes from reverse-engineering the Perl Algorithm::Diff module from CPAN. It’s a very Perlesque implementation with a lot of optimizations, but by reading the code, running the unit tests, finding out that the tests didn’t exercise major portions of the code, and eventually adding a whole lot of print statements, I eventually managed to puzzle out how the silly thing worked.

So, let’s start at the beginning. What is diff? Well, it’s an algorithm that takes two lists (often representing the lines in a text file), and tells you how those lists are different. To be more specific, it results in a list of instructions on how to convert list A into list B. These instructions take the form of “insert” and “delete” instructions. Delete line 42, insert the line “if not Enabled then” at line 87, etc.

Well, given that definition, I could write a diff algorithm in five minutes. If list A is 40 lines, and list B is the same file with 2 extra lines added in the middle (for a total of 42 lines), I could write a diff algorithm that spits out 40 delete instructions and 42 insert instructions.

But that would be stupid.

The main idea is not just to write a set of diffs, but to write the best set of diffs. Specifically, the shortest set of diffs — the fewest number of insertions and deletions that will get you there. If list B really is just list A with two extra lines stuck in the middle, then we really want a diff algorithm that gives us two inserts, and that’s it.

Next time, we’ll get a little closer, but we’ll get there by cheating.

Wisdom

Tuesday, March 29th, 2005

I just had my last wisdom tooth pulled this morning. The others went reasonably well.

This one… well, didn’t. It was a fun time in the dentist’s office this morning.

He got it to the point where he said it was really loose, but it just wouldn’t come out. I could feel him twisting my entire lower jaw as he tried to pull the danged thing out. And despite the three syringefuls of Novocain he gave me before he started, the thing hurt like a son of a mother.

It wound up breaking into, at my best guess (I didn’t ask), three pieces before he was done. After wrestling with it for a good long while, he took a break and an X-ray, pored over it for a while, and told me I was done and I could go.

It turns out that there’s still a “sliver” of root left in there (don’t you love it when your wisdom teeth have little hooks on the end?), and that it’s right next to a major nerve, so if he did keep going after it, I’d probably end up with some temporary nerve damage and not be a happy camper. If he just leaves it there, there’s a good chance it’ll work its way out over the next couple of weeks; if so, I’m to go back in, and he’ll pull it out.

Otherwise, he said that 99 times out of 100, leaving a sliver in there won’t cause any problems.

(Given our history with health and cats, I’m not totally sure that 99% is all that reassuring…)

Anyway, given the amount of yanking he had to do, I decided to take the morning off work, so I don’t have to talk around my gauze. I’m feeling pretty decent so far (the Novocain hasn’t worn off yet, mind you), but I’d just as soon let this thing clot sooner rather than later.

(You really wanted to know all that, didn’t you?)

How to destroy the Earth

Wednesday, March 23rd, 2005

Wow. This ranks right up there with TRBBTDDA, strawberry Pop-Tart blowtorches, and the Evil Overlord list.

An excerpt from the Preamble of How to destroy the Earth:

Destroying the Earth is harder than you may have been led to believe.

You’ve seen the action movies where the bad guy threatens to destroy the Earth. You’ve heard people on the news claiming that the next nuclear war or cutting down rainforests or persisting in releasing hideous quantities of pollution into the atmosphere threatens to end the world.

Fools.

The Earth was built to last. It is a 4,550,000,000-year-old, 5,973,600,000,000,000,000,000-tonne ball of iron. It has taken more devastating asteroid hits in its lifetime than you’ve had hot dinners, and lo, it still orbits merrily. So my first piece of advice to you, dear would-be Earth-destroyer, is: do NOT think this will be easy.

Make sure you’ve got plenty of time. This thing is amazingly detailed, and extremely amusing. (I laughed out loud more than once. If you know me, you know that’s something.)

(via Bruce Schneier)

Glass Ceilings for the Complete Goober

Thursday, March 17th, 2005

We bought a used kitchen table a few weeks ago. It’s got a glass top.

This made for some entertaining viewing, because Noel climbed up on top of the table, lay down, and made herself comfortable. And Goober, who likes to harass Noel for as many of his waking hours as he can without getting distracted, spent a good forty-five minutes pacing around under the table, trying to figure out how to attack her.

Our kitchen table has more paw prints on the bottom than it does on the top. We get built-in amusement with our breakfast… all we need to do is put Noel on the table and wait. It’s not just a table, it’s an entertainment center!

The sofa state

Thursday, March 17th, 2005

From the first time I saw this picture in the newspaper, back when it was still one of several proposed designs…

I thought, “Why is there a sofa at the bottom of the license plate?”

(As I later found out, it’s actually supposed to be a covered wagon. But it still looks like a sofa to me.)

When Jennie and I were in the car yesterday, we happened to pull up to a stop behind a car with one of these new plates, and I made some comment about living in “the sofa state”.

Jennie had heard (and shared) my opinions about the sofa before, but she still laughed. She said, “I still don’t understand what a couch has to do with Nebraska.”

I shrugged, and said, “It must be because of Nebraska Furniture Mart.” It took her a while to stop laughing.

Nebraska: the Nebraska Furniture Mart state.

Now, won’t that look nice in the history books?

Flip side: Good service

Thursday, March 17th, 2005

Well, I blogged about some poor service I’ve gotten recently, so now it’s time for the flip side.

My car died over the weekend. It started by feeling a little underpowered getting going from a stop (almost like the clutch was slipping). I kept driving, hoping I could make it home and then take it to the shop on Monday, but then after a few minutes, the speedometer conked out, and I was sailing down L Street at 0 mph. Then the power steering went. And finally, the engine sputtered to a stop, halfway through a turn. What fun.

I was four blocks from home when it died, so I hiked home, called a tow truck, and drove Jennie’s car back out to meet the tow truck driver. He got it loaded up and hauled it off to Auto Solutions. This was on Sunday.

I planned to call Auto Solutions on Monday, and let them know why this car had mysteriously appeared on their lot over the weekend. But between meetings and other distractions, I wound up not getting around to calling them until midafternoon on Monday.

As soon as I told them my name, they said, “Oh, yes, the Saturn. The problem was with the belt tensioner; when it went out, the belt to the alternator quit running, so the battery went dead. We’ve already replaced the tensioner and recharged the battery. You can come pick the car up anytime.”

Not only didn’t I have to ever tell them what was wrong, I never even had to tell them which car was mine. And as if that weren’t enough, when I did show up to pick up the car, the receptionist recognized me by sight.

Either they’ve got some darned good customer service, or Jennie and I have got some darned bad cars! (Well, our cars have had their share of troubles. (grin)) But even so, these guys can’t have seen us more than a couple of times a year over the past three years. And yet, not just the receptionist, but also both the owners, know us by sight and by name.

It was a co-worker’s referral that sent me to these guys in the first place. It’s no accident that I’ve referred others to them too.


Joe White's Blog copyright © 2004-2011. Portions of the site layout use Yahoo! YUI Reset, Fonts, and Grids.
Proudly powered by WordPress. Entries (RSS) and Comments (RSS). Privacy policy