Synthesis 2.0.3: Regex rocks
Sterling Camden
Synthesis version 2.0.3, which you can download below, includes numerous beneficial (I hope) changes to Regex. I have nearly filled out all of the features that I want to add to this class (I’ve only left a few to-dos on the list). What I added in this version can be separated into new syntax and performance enhancements.
Syntax extensions
- Similarly to *, +, or ?, you can now also use {n}, where n is a number of iterations. Or, you can specify {n,m}, where n is the minimum and m is the maximum. You can also omit one of these by keeping the comma. Thus, {3,} means “at least 3, no maximum” and {,8} means “up to 8, no minimum”. For the implementation, I considered replicating the state machine the appropriate number of times with transitions through or around them, but it occurred to me that you could easily create a monster with something like
/(a{,1000}b){,2000}/. Instead, I created a new facility to determine whether a transition is enabled within the current context. This proved very valuable, and made the implementation of word boundary checking (see below) dead simple. - For *, +, ?, and {}, you can immediately follow them with a ? to specify a non-greedy search (the default has always been greedy). For this, I added a scoring mechanism that adds to or subtracts from a match’s score for each repeating character, depending on the greediness mode of the current context.
- Like Perl, Ruby, et al I added \A as a ^ equivalent. It anchors to the beginning of the text, but unlike ^ does not anchor following a newline when in Multiline mode. Likewise, \z anchors to the very end without regard for newlines. \Z (uppercase) anchors to the end of text, but if the last character is a newline it anchors to the position of the newline (regardless of Multiline mode).
- \b now anchors to a word boundary. To use \b for backspace instead, you must enclose it in a character class – e.g., [\b]. \B can be used to match a word non-boundary.
- \a matches a bell character (\x07), and \e matches escape (\x1b).
- When not in a character class, \Q specifies that all characters up to the next \E will be treated as literals.
- We now have named group support, a la Python:
(?P<group>)captures a sub-expression named “group”,(?P=group)can be used to back-reference it within the expression, and\g<group>can be used in the substitution string of replace to insert its last captured content. - I added character class subtraction even though very few regex parsers include it, just because it’s so cool. For instance,
[a-z-[aeiou]]matches all lowercase consonants.
Performance enhancements
The above syntax extensions added a huge number of tests, but I wanted to make sure I covered the code well. Unfortunately, all that new code meant that running htmlcover on test_regex would run for days. Since CodeCoverage makes ample use of Regex, improving the performance of Regex itself would doubly benefit the elapsed time needed to test Regex testing. I implemented the following optimizations:
- If an expression begins with an unconditional \A or ^, then we don’t need to test the initial state of the machine on any part of the string except the beginning of a line. In the case of \A, or when using ^ with Multiline off, we can also give up as soon as the number of remaining states equals zero (because we won’t add any more).
- In previous versions, we always gathered all possible matches first and then picked the best one based on position, greediness score, and length. But since we process the string left to right and the highest priority goes to leftmost position, we can eliminate some possibilities as soon as we accept any match. We no longer need to test each remaining character against the initial machine state, and if the number of remaining states falls to zero at any point then we can take the best one so far, if we have one.
Obviously, the optimizations listed above have greater or lesser benefit depending on the expression and the input it is processing. The best cases reduce matching from having to walk the entire string to exiting after the first character. The worst cases impose a mild penalty for trying to optimize something that doesn’t benefit at all.
CodeCoverage performance
Based on these observations, I was able to tune the Regexen I use in CodeCoverage to boost performance even more. First of all, rather than jumping right into a Regex match, I compare the initial part of each debugger line literally for strings that could be what I’m seeking, then use a Regex to parse out the file name and line number. Adding a ^ anchor to each of those expressions prevents the overhead of the Regex engine trying to restart its state machine on each subsequent character.
I also replaced the associative lists in CodeCoverage with Hashes. That boosts performance on looking up the source file and line number associations in the Map, which happens many thousands of times during code sampling.
With all of those changes, I am now able to run coverage on test_regex in under 12 hours. So it’s once again something I can kick off at the end of the day and see the results in the morning. While that may still seem long, remember that Regex is a very complex case, involving lots of in-memory operations (so lots of statements to step through), and that test_regex tests all of the pathological cases that I could imagine. One test in particular (matching /(a*b*)*/ against a string of a’s and b’s, took a few hours all by itself when sampling, though in normal execution it finishes in a few seconds.
CodeCoverage also now provides a mechanism for getting feedback on its progress, and I’ve modified htmlcover to take advantage of that. During the Sample phase, it reports on each new line number within the main routine, so you can get an idea of where it is in the overall process.
I corrected a few bugs in this version as well. Please see the change log for details.
Posted in Regexen, SynergyDE |
1 Comment » RSS 2.0 | Sphere it!


A commenter going by Anonymous Coward (I bet that’s not his real name, and I beg forgiveness if the masculine pronoun doesn’t apply) 
I thought of a new use for my OPML Browser plugin, but the OPML file I intended to use with it threw me a couple of curve balls.


