Joe White's Blog Life, .NET, and cats

Option interaction #.NET #Delphi #dgrok

I'm making slow but (somewhat) steady progress on my Delphi-source-code searching program.

(I was serious, by the way: if anyone has any ideas for a good name for this tool, let me know. Otherwise, you'll be stuck with whatever name I come up with.)

Directory recursion, complete with exclusions, is coded and ad-hoc tested. I don't have that logic put into a thread yet, though, partly because I need to figure out how the pieces will fit together for passing the search options into the thread. In an effort to get that part straight in my head, I've started coding on the GUI.

The part I just finished is the search options. There are five checkboxes:

  • Whole words
    • Strict (only enabled when "Whole words" is checked)
  • Regular expressions
  • Loose whitespace
  • Ignore case

I've got an object that takes the search string you type in, plus these search options, and munges them together to create a Regex object. All the actual searching will be done via this regex. That makes life easier for things like "whole words".

The transformation is interesting, though not terribly difficult, for the most part. "Ignore case" gets passed to the Regex constructor via the RegexOptions enum. "Loose whitespace" involves transforming any whitespace in the search string into \s+. Leaving "Regular expressions" unchecked passes the string through Regex.Escape(), and "Whole words" sticks a \b at the beginning of the string, and another \b at the end.
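Here's a rough sketch of that basic transformation, written in Python with its `re` module standing in for the .NET Regex class. The function and parameter names (`build_regex`, `whole_words`, `use_regex`, `ignore_case`) are made up for illustration and are not the tool's actual code. "Loose whitespace" is left out of this sketch because it interacts with escaping.

```python
import re

def build_regex(text, whole_words=False, use_regex=False, ignore_case=False):
    """Naive combination of a search string and three of the options."""
    # With "Regular expressions" unchecked, the text is taken literally.
    pattern = text if use_regex else re.escape(text)
    # Naive "Whole words": a word boundary on both ends.
    if whole_words:
        pattern = r"\b" + pattern + r"\b"
    # "Ignore case" travels as a flag rather than as part of the pattern.
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(pattern, flags)

rx = build_regex("TFoo", whole_words=True, ignore_case=True)
print(rx.pattern)                           # \bTFoo\b
print(bool(rx.search("x := tfoo.Create")))  # True
print(bool(rx.search("TFooBar.Create")))    # False
```

Everything downstream only ever sees the compiled regex, so the searching code doesn't need to know which boxes were checked.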

Actually, my "whole words" is a little smarter than that. \b matches the boundary between a word character (traditionally meaning "alphanumeric or underscore", but extended in Unicode; see Char.IsLetterOrDigit) and a non-word character. That means that if I'm searching for "TFoo =", and the search changes that to "\bTFoo =\b", the regex will only find cases where there's a non-word character (or the beginning of a line), followed by TFoo, a space, an equal sign, and then a word character (to establish the second word boundary). But our coding standards require another space after that equal sign, not a word character, so that regex will never match. (This is a constant source of frustration when I'm using GExperts grep; I periodically get zero search results because I forgot that "whole words" won't work for a particular search.)
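To see the problem concretely, here is a small Python demonstration (Python's `\b` behaves the same way as .NET's for this case). The sample lines of Delphi are invented for the example.

```python
import re

# Naive "whole words": always wrap the escaped text in \b ... \b.
naive = re.compile(r"\b" + re.escape("TFoo =") + r"\b")

# A space follows the "=", so there is no word boundary there: no match.
print(bool(naive.search("  TFoo = class(TObject)")))  # False

# Only a word character right after the "=" satisfies the trailing \b.
print(bool(naive.search("  TFoo =class(TObject)")))   # True
```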

So I don't do the naive "prepend and append \bs", unless you check Strict. In normal "whole words" mode, I only prepend a \b if the search string starts with a word character, and I only append a \b if the search string ends with a word character. Which is generally what I mean when I leave "whole words" checked and then type something like "TFoo =". Obviously I want a word break before the TFoo, but not after the equals. And this will figure that out. It'll be nice to reduce the number of times I cuss out the tool for not returning any results.
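A minimal sketch of that rule in Python, assuming the search string is literal text (so it gets escaped first). `is_word_char` and `whole_word_pattern` are hypothetical names, not anything from the tool itself.

```python
import re

def is_word_char(ch):
    """True if ch is a single character that \\w would match."""
    return re.fullmatch(r"\w", ch) is not None

def whole_word_pattern(text, strict=False):
    """Add \\b only on the ends where the text starts or ends with a word character."""
    pattern = re.escape(text)
    if strict or is_word_char(text[:1]):
        pattern = r"\b" + pattern
    if strict or is_word_char(text[-1:]):
        pattern = pattern + r"\b"
    return pattern

loose = whole_word_pattern("TFoo =")
print(bool(re.search(loose, "  TFoo = class(TObject)")))    # True
print(bool(re.search(loose, "  TMyTFoo = class(TObject)"))) # False

strict = whole_word_pattern("TFoo =", strict=True)
print(bool(re.search(strict, "  TFoo = class(TObject)")))   # False
```

Slicing with `text[:1]` and `text[-1:]` instead of indexing keeps an empty search string from raising an error.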

Thing is, this logic doesn't work so well when what you're actually typing is a regex (i.e., if you check "Regular expressions"). If you type in the regex [A-Z]+, and have "whole words" checked, you'd expect to find whole words that are in all caps. (Well, assuming you also uncheck "ignore case".) But my logic sees that the first character is [, which is not a word character, so it doesn't prepend the \b. Same thing with the last character, +; not a word character, so it doesn't append the \b. Net result being, it doesn't do what you expect.
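A quick Python illustration of the mismatch. `loose_whole_words` is a hypothetical helper that applies the "only add \b next to a word character" rule directly to a regex the user typed, and the sample text is invented.

```python
import re

def loose_whole_words(pattern):
    """Add \\b only where the pattern text itself starts/ends with a word character."""
    starts = re.match(r"\w", pattern) is not None
    ends = re.search(r"\w\Z", pattern) is not None
    return (r"\b" if starts else "") + pattern + (r"\b" if ends else "")

text = "Call FOO from MyUnit"
typed = "[A-Z]+"

# "[" and "+" are not word characters, so nothing gets added.
print(loose_whole_words(typed))                    # [A-Z]+
print(re.findall(loose_whole_words(typed), text))  # ['C', 'FOO', 'M', 'U']

# What the user presumably wanted: whole words in all caps.
print(re.findall(r"\b" + typed + r"\b", text))     # ['FOO']
```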

If anyone has any suggestions on how to fix this, let me know. (I'm thinking about disabling "whole words" if you check "regular expressions". I'm not sure I like that, but I'm not sure there's any good way around it without actually parsing the entire regex myself. Character classes, parentheses, alternation... whew. Hey, if you're typing in a regex, you can probably type the \b too.)

The other unusual interaction was between "Regular expressions" and "Loose whitespace". When you uncheck "Regular expressions", the app runs your search string through Regex.Escape, which puts a backslash in front of every special character so it's not special anymore. But... space characters also get a backslash in front of them. Which makes my job, of finding whitespace and turning it into \s+ (a regex meaning "one or more whitespace characters, of any kind, be they spaces or tabs or newlines"), a good bit more difficult. If I leave that extra backslash there, and then change the space after it to \s+, I'll be left with \\s+, meaning "a backslash followed by one or more 's' characters". Not good.
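Python's `re.escape` happens to escape spaces too, so the same trap can be reproduced there. This is only an analogy to the .NET behavior described above, with an invented search string.

```python
import re

escaped = re.escape("begin end")
print(escaped)  # begin\ end

# Replace each run of whitespace with \s+ AFTER escaping: the backslash
# that escaped the space is still there, and now escapes the new backslash.
broken = re.sub(r"\s+", lambda m: r"\s+", escaped)
print(broken)   # begin\\s+end

print(bool(re.search(broken, "begin   end")))    # False
print(bool(re.search(broken, r"begin\ssend")))   # True (backslash, then "ss")
```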

Regex.Escape escapes whitespace because whitespace means something special if you specify a particular flag in RegexOptions (IgnorePatternWhitespace). But you normally don't specify that, and so you normally don't need to escape whitespace. I looked for another version of Regex.Escape that covers this "normal" case, but I couldn't find one.

I got around this one by iterating through all the characters in the string, letting whitespace (Char.IsWhiteSpace) pass through untouched, and calling Regex.Escape on each individual non-whitespace character. (Yes, of course I put the results into a StringBuilder.) Once I puzzled out that this was the right way to do it, it worked well.
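The same idea, sketched in Python rather than .NET: `str.isspace` plays the role of Char.IsWhiteSpace, `re.escape` the role of Regex.Escape, and a list joined at the end the role of the StringBuilder. The function names are hypothetical.

```python
import re

def escape_keeping_whitespace(text):
    """Escape every character except whitespace, which passes through untouched."""
    parts = []
    for ch in text:
        parts.append(ch if ch.isspace() else re.escape(ch))
    return "".join(parts)

def loosen_whitespace(pattern):
    """Turn each run of (unescaped) whitespace into \\s+."""
    return re.sub(r"\s+", lambda m: r"\s+", pattern)

pattern = loosen_whitespace(escape_keeping_whitespace("a.b  c"))
print(pattern)                                 # a\.b\s+c
print(bool(re.search(pattern, "a.b\t\n c")))   # True
print(bool(re.search(pattern, "axb c")))       # False
```

Because the whitespace never gets a backslash in front of it, the \s+ substitution can run afterward without tripping over the escaping.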

Next step: Write a thread to build the directory list, build the file list, read and parse the files, and check the regexes. Then get the listview working properly (grumble grumble Microsoft grumble grumble not providing usable drag-and-drop out of the box grumble grumble Delphi has had fully-functional drag-and-drop since v1 grumble grumble). And then I'll probably post an EXE if anyone wants to look at it. (The tool will still have a long way to go, but the very very basics should be at least somewhat usable.)