The Java Explorer

Tips and insights on Java

  • Subscribe

  • If you find this blog useful, please enter your email address to subscribe and receive notifications of new posts by email.

    Join 37 other followers

Having performance issues with regex?

Posted by Eyal Schneider on May 21, 2009


Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet, one should really be an artist in order to use it right. Assuming that a regular expression is proved for correctness, it may still run extremely slow (even take hours) if it is not wisely written.

Continue reading for understanding the origin of the problem, or jump directly to the last section, containing 10 useful performance tips for regular expression writers in Java.

Can it really be so slow?

 Suppose that we want to match only strings composed of sequences of ‘a’s or ‘b’s. A legal regex would be:

(a*b*)*

However, if you run this regex against the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” for example, it takes several minutes to terminate and report that there is no match!
Of course, a better regex in this case would be:

(a|b)*

This one takes less than a millisecond on my machine, on the same input. Obviously, there is a performance issue here.

Why does this happen?

 Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. The engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking.
In the catastrophic example above, the engine will actually explore ALL the decompositions of the series of ‘a’s into smaller series, before it realizes that there is no match. This example shows how the backtracking algorithm can result in exponential time evaluation (in the length of the input sting). It also reveals an important property of NFA engines: the worst cases will always be inputs that almost match the pattern. If a match is found, the exploration stops.

The other main approach for regex engines is DFA (Deterministic Final Automata). In this approach, the regex compilation actually builds an automaton, to be used when traversing input strings. Inputs are traversed character by character, with no going back. This guarantees linear timein the length of the input string, regardless of the regex complexity. Instead of trying different match possibilities serially (as in NFA), DFAsimulates a parallel scanning on all possibilities.

So why does Java (and .NET, Perl, Python, Ruby, PHP etc) use NFA and not DFA, which has much better asymptotic behavior? The reason is that NFA has some significant benefits:

  • Its compilation is faster, and requires much less memory
  • It allows some powerful features (See Sun’s tutorial for details on them):
    • Capturing groups and back references
    • Lookarounds
    • Advanced quantifiers (Possessive and Reluctant)

It is important to note that the popular terms NFA and DFA for regex engines are inaccurate. Theoretically speaking, these 2 models have the same computation power, meaning that there can’t be a matching rule that can be expressed in one of them but not in the other. In practice, there was a need for more features, so the two implementation types diverged in their semantics. NFA engines were given more power, making them superior to DFA engines in computation power.

Due to the speed of DFA and the unique features of NFA, there are 2 more “integrative” approaches for regex implementations. Some implementations use both engines (e.g. GNU egrep, which chooses the specific engine at runtime), and some of them really implement a hybrid version (e.g. Tcl regex), enjoying all the benefits.

Tips

 Following are some tips on how to avoid regex efficiency issues in Java. Many of them are aimed at reducing backtracking.

 1)  Pre-compile
Trivial, but worth mentioning. If you use a regular expression more than once, remember to compile it first as a pre-process step:
Pattern p = Pattern.compile(regex,flags);

//Usage stage
Matcher a = p.matcher(input);

 2)  Reluctant quantifiers vs greedy quantifiers
The default quantifiers (* + and ?) are greedy. That means that they start by matching the longest possible sequence, and then they step back gradually if backtracking is needed. In case that you know that the match is usually short, you should use reluctant quantifiers instead. They start by the smallest match, and progress if needed.
For example, suppose that you want to match only strings containing the sequence “hello”. The regex .*hello.* will do the job, but if you know that ‘hello’ usually appears near the beginning of the text, then .*?hello.* will run faster on average.

 3)  Use possessive quantifiers where possible
Unlike reluctant quantifiers which affect performance but do not affect regex behavior, possessive quantifiers may really change the meaning of the regex.
When using *+ instead of *, the first attempted match will be greedy (i.e. longest match, just as with *), but there will be no backtracking in case of failure, even if this causes the complete match to fail. When does this become useful?
Suppose that you need to match text in quotes. The regex \”[^\”]*\”will work fine. However, it will do unnecessary backtracking on negative cases (e.g. “bla bla bla). Using the regex \”[^\”]*+\”instead will eliminate the backtracking, without changing the regex meaning.

Independent grouping can have the same effect, and it allows even more control (See Sun’s tutorial ).

4)  Avoid capturing groups
Any expression in parentheses is considered a capturing group by default. This has a small performance impact. Make your groups “non-capturing groups” when possible, by beginning them with (?: instead of with (.

5)  Use alternation wisely
When using alternation (e.g. Paul|Jane|Chris), the order in which the engine tries to match the options is the same order in which they appear. Therefore, you can take advantage of this property, and order your options from the most common one to the less common one. This will improve the average time of positive matches.

 6)  Avoid multiple interpretations
Write your regex in a way that minimizes the number of different ways to match a particular input string. For example, the regex (a*b*)*given in the beginning of this article, allows interpreting the string “aabb” in too many ways:
(a2b2)
(a1)(a1)(b1)(b1)
(a2)(b2)
(a1)(a1b2)
etc…

The regex (a|b)* on the other hand, forces a unique interpretation on a positive input.
This is very important in order to reduce backtracking, in cases of almost a match.

 7)  Lookarounds
Lookarounds  allow adding restrictions on sequences to the left/right of current position. Specifically, with negative lookahead you can search strings that do not contain some sequence (cumbersome thing to do without this feature!). How can this help improve performance?

Suppose you want to capture the URL from a link tag. Consider the following regex:
<a .* href=(\S*).*/>
For legal tags it will find a match and capture only the href attribute as required (\S stands for a non-whitespace character) .  On some illegal tags, however, exessive backtracking will occur. For example: “<a href= href=href=…. href=something”.
The following regex will prevent this, by replacing the “.*” expression with something more specific, that does not match “href’”:
<a ((?!href).)* href=(\S*)((?!href).)*/>

 8)  Specify lengths
Java has a regex optimizer that checks the input string’s length against the min/max length derived from the regex contents. This allows failing the match in some cases immediately. In order to help this mechanism, specify repetition number on groups whenever possible (e.g. [01]{6}, which matches all binary strings of length 6).

 9)  Extract mandatory strings
Sometimes, strings which are mandatory are hidden inside groups or alternations:
(hello|hell|heel)
This expression could be simplified to:
he(llo|ll|el)
By doing so, the regex optimizer has more information to work with.

 10)  Benchmark your regex
When the regex is used in a performance critical area of your application, it would be wise to test it first. Write a micro-benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.

 

Links:

 http://java.sun.com/docs/books/tutorial/essential/regex/index.html
http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page=1
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics-of-Expression-Processing/
Advertisements

5 Responses to “Having performance issues with regex?”

  1. Shai said

    If you really try you can even cause an infinite loop with jregex, for example when matching an empty string to the expression ()*

    • Eyal Schneider said

      I just tried the jregex package and saw this behavior (I got out of memory error).
      java.util.regex doesn’t enter into an infinite loop in this case, however.

  2. Thanks for the pithy tips.

    For background and history of DFA versus NFA see:

    http://swtch.com/~rsc/regexp/regexp1.html

  3. Rahul said

    thanks a lot for the information, i was searching for this only, you saved my lot of time, thanks

  4. […] https://eyalsch.wordpress.com/2009/05/21/regex/ […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: