Diff, part 19: De-dup by length?
Today in the diff series, I’ll start in on what I think will be the last major optimization in our LCS algorithm.
Recall that we may have multiple same-length common subsequences in our “best CSes so far” pool. For example, we may have [0,0] [1,2] and [0,0] [2,1]. We’ve been keeping both because the next match might be [2,3], or [3,2], or maybe even [128,2]. We can’t throw away any CS that we might reasonably be able to build upon later (unless we can make an ironclad guarantee that it’s expendable).
Also recall that, when we generate a new CS, we do so by adding a new match onto an existing CS; specifically, the longest one that we can append our new match to. When we look for an existing CS to extend, the only thing we care about is: for the given match, will this CSk give us the longest CSnew that we can possibly generate?
We don’t actually care whether CSk ends in [1,2] or [2,1], as long as we can guarantee that when we append the current match — be it [2,3], [3,2], [3,3], or whatever else — we’ll end up with the longest CSnew we can get. Anywhere and everywhere along the way, we can throw away as many CSes as we want (just like we did yesterday), as long as we never compromise that guarantee. Never throw a CS away unless you can prove it has less growth potential than something else you already have; otherwise, anything goes.
These two properties — length is all-important, yet we keep multiple CSes of the same length — seem to be in a bit of tension with each other. Which suggests an intriguing question: can we get to the point where length alone is a good enough guarantee? Can we get to the point where we hang onto just one length-1 CS, one length-2 CS, one length-3 CS, etc., at a time?
If that were possible, it would mean that we could make a big optimization to our fourth loop — the one where we call Expendable to compare our CSnew to each and every CSk, so we can see if there’s anything (including possibly CSnew) that we can throw away. That fourth loop could, in fact, stop being a loop. It could become a single call to Expendable, comparing CSnew to the CSk of the same length. (Of course, if CSnew is the longest CS so far, then there would be no CSk of the same length, and we wouldn’t need to call Expendable at all.) And that call to Expendable would always have a definitive result: either you throw away CSk, or you throw away CSnew. Only one CS of any given length would be left standing at any given time.
A worthy optimization, to be sure. But is it doable? We would have to throw away one of those CSes-of-the-same-length, but Expendable([0,0] [1,2], [0,0] [2,1]) returns nil, meaning we need both CSes at one time or another. How could we reconcile this?
By siezing on that “at one time or another”, and making this into a timing issue. Then we can control the timing, by controlling the way we iterate. How, and in what order, do we generate our matches?
We would be home free if we could generate our matches in this order:
- [0,0]
- [2,1]
- Every possible match that could extend [2,1] but not [1,2]
- [1,2]
Or perhaps:
- [0,0]
- [1,2]
- Every possible match that could extend [1,2] but not [2,1]
- [2,1]
As long as our algorithm can consistently return matches in one of these orders (and can be consistent about which one), then when we get to the point where [0,0] [1,2] and [0,0] [2,1] are our CSnew and CSk (in whichever order), we know with absolute certainty that we can throw the old one away, replace it with the new one, and be sure that we’re still going to find the longest possible CS when we’re done.
So how do we handle the third bullet point in either of the above orderings? That third bullet is kind of the “and then a miracle occurs” stage. Next time, I’ll start to nail down what “every possible match that could extend x but not y” really means, and how we can possibly hope to pull it off.
April 30th, 2005 at 10:09 am
So when are u gonna write the next part?
Its been pretty long already and i am waiting
May 1st, 2005 at 4:53 am
Yeah, I know. (grin) Things have been hectic lately, and these diff articles keep getting more complicated as I go. I’ll try to get another one out early this week.
May 1st, 2005 at 7:00 am
I have found a different approach to your diff problem.
Let me know if this would work fine, and if not why not?
Right now, we are trying to find the ‘matching’ sequence. what if we right away tried finding the added/removed stuff.
Here is how -
We have 2 hashtables -
HashA, HashB.
We have 2 ArrayLists -
RemovedElements, AddedElements
1.)Loop through each element of A and put it in HashA.
2.)Loop through each element of B and put it in HashB.
3.)Loop through each element in A and find whether it exists in HashB. If it doesn’t then , we know that it has been removed. Add this item to RemovedElements.
4.)Loop through each element in HashB and find whether it exists in HashB. If it doesn’t then , we know that it has been added. Add this item to AddedElements.
Thus the running time of the algorithm comes out to be (assuming length of A =m & len of B =n)
O( 2m + 2n)
Is your final optimized version of the lcs algo faster than this? Is there anything wrong about this algo?
May 1st, 2005 at 8:23 am
I found a shortcoming in my own algo
It doesn’t take the order of elements in account.
So if we have -
(1,2) & (2,1) it would report no changes.
But still if u have to diff between 2 lists where the additions/removals are guaranteed to be unique, this algo can be used.
Eg - during incremental parsing when doing a diff to get a delta between 2 versions of an AST.
June 8th, 2005 at 6:55 pm
I’ve been working through file comparison algorithms for a while, now, off and on… mostly off until the last week. (What it comes down to is that the GPL is a PITA, so I decided to start from scratch.)
One key thing I’d missed but you pointed out was that "match string" was not synonymous with "common sub-sequence", so reading your line of reasoning would be worth it if only for that.
I’m impressed by your patience with mincing through all these details and experiments.
But have you already read the PDFs on the leading algorithms (Wu, Manber, Myers, Miller: O(NP) algorithm; and Myers: O(ND) algorithm; or the earlier Hunt)? I don’t see how they track.
The diff doc seems to jump straight to conversion of the objects (strings, code blocks, whatever) into one-word hashes associated with sequence numbers, and then sorting by the hash code as primary key and sequence numbera secondary key. And then there’s some magic, and then they’re traversing a digraph.
It’s that "magic" that worries me.
October 27th, 2005 at 11:24 am
Hi! Though as far as I can see Chapter 20 is still not written, these posts on diff really helped me! It turns out I was a good way on the way myself, but I couldn’t have gotten there without your help! thanks!
http://yoy.be/item.asp?i332
August 4th, 2006 at 4:27 pm
Excellent articles! I know this is about a year later, but I would love to see more in this series. I’m also curious how LCS helps finding diff? Once you’ve generated it all, how does that help us?
November 2nd, 2006 at 6:11 am
I don’t suppose you’re going to finish this? It’s a real shame. I read through all of it and found it highly interesting. Material like yours is what makes the web worth searching and browsing! Please continue…
November 28th, 2006 at 10:10 am
Oh, Please finish this series!
It started clear as mud, started getting clearer, a little muddy again (not your fault, mine) and then just when I thought you’d get to the real nitty gritty of the good stuff, that is the "magic" behind it, *poof* - s’all gone.
Great series, but I’m still left confused on the answer.
January 26th, 2007 at 1:11 pm
Yes, please continue this series. I feel like I’m reading a book series and am waiting for the last book in the series to wrap it all up.
What you’ve published has been extremely interesting, but I’m not sure I know what comes next.
November 15th, 2007 at 10:51 pm
I feel a bit stupid after having read all 19 chapters and thinking that it has an ending
really hoped i’d get to the "ultimate" diff algorithm smarter and/or faster than through the edit distance. real shame.
But jgo’s comment of 6/8/2005 giving the list of diff articles is also useful, a good place to continue.
Praises to the author for his patience, it isn’t easy to break it down in such detail