The Java Explorer

Tips and insights on Java

  • Subscribe

  • If you find this blog useful, please enter your email address to subscribe and receive notifications of new posts by email.

    Join 39 other followers

Understanding Generics limitations

Posted by Eyal Schneider on August 14, 2009

Generics were introduced to Java in version 5 (JSR 14), and since then they became standard and very popular. Although there has been a great demand by developers for this feature, many are unsatisfied regarding HOW it was implemented. The design that was adopted has a huge impact on the syntax and semantics of this feature.

Java generics provides the following benefits:

  • Type safety
  • Less explicit casts
  • More declarative APIs

In this post I list some of the limitations of Generics. All of them are derived from two fundamental principles of the Java Generics design:

1) Erasure

Type parameters (e.g. String in  List<String>) are completely removed by the compiler. Formal type parameters (e.g. E in the source code of List) are replaced by their bound (usually Object). As a result, the bytecode has no generic information at all, consequently it is not available at runtime. Unlike C++, at runtime Java shares the same class for different parameterized instantiations of it. These facts have three interesting benefits:

1) JVM implementations prior to version 5 can theoretically stay unchanged in order to support Generics
2) When declaring a parameterized version of a class C, we don’t need the source code of the generic type C
3) The volume of the bytecode does not increase as we add more parameterized types of the same generic class

2) Type safety guarantee
According to Gilad Bracha, the Java Generics lead designer, if your entire application has been compiled without unchecked warnings, it is type safe. “Type safe” in this case means that there will be no unexpected class cast exception, i.e. if a ClassCastException occurs, then it must be caused by an explicit cast in the code. This principle is very important – we don’t want the implicit casts added when compiling generic code to raise runtime exceptions, since they would be hard to understand and fix. 


Formal type parameters usage

You can’t expect the formal type parameter to be used wherever a type name can fit. For example, instantiating an object from a formal type parameter is not allowed:

public class NonSense <T>{
    T obj = new T() ; //Illegal

Due to erasure, the compiler must generate code for NonSense class to be shared among all instantiations, but it doesn’t know anything about T at this point, and can’t assume anything.
The equivalent code in C++ IS allowed, since the compiler creates a dedicated class per parameterized type of NonSense.
Note that replacing the line above with List<T> obj = new ArrayList<T>() would compile, since the compiler generates code for creating an ArrayList, and the formal type parameter is erased.

For the same reasons, the following code does not compile:

public class Something <T> extends T{

Hierarchy of parameterized types

Two different instantiations (with different type parameter) of the same parameterized type don’t have any hierarchy relation. For example,  List<String> is not a List<Object>. The following code snippet illustrates what would have happened if this were allowed:

List<String> ls = new ArrayList<String>();
List<Object> lo = ls; //Illegal
lo.add(new Integer(3)); //The list ls is not homogeneous anymore!


According to the intuition, arrays should follow similar rules as collections (such as ArrayList) do, when it comes to Generics. The reality is a little painful, however. It is not allowed to create arrays of elements which are parameterized types (e.g. new ArrayList<Strings>[10]). If it were allowed by the compiler, we could have done something like:

1.    ArrayList<Integer>[] arrOfIntLists = new ArrayList<Integer>[10]; //Illegal
2.    Object[] arrOfObj = arrOfIntLists;

3.    ArrayList<String> listOfStrings = new ArrayList<String>();
4.    listOfStrings.add("hello");
5.    arrOfObj[0] = listOfStrings;

6.    ArrayList<Integer> list = arrOfIntLists[0];
7.    Integer num = list.get(0);

What happens here?

1) An array of ArrayList of Integers is created.
2) arrOfObj now points to the same array. This is a legal up-cast, since any array is also an Object array. There is no warning in this line.
3,4,5) A list of Strings is created and added to arrOfObj. Note that the addition does not produce an ArrayStoreException, since the runtime type of the array components is simply ArrayList (Due to erasure), and that’s exactly what we are storing.
6) We retrieve the list at position 0 in arrOfIntLists. There is no ClassCastException yet because the implicit runtime cast is a cast to ArrayList, which is the correct class.
7) When trying to retrieve the first item of the list of integers, we fail with a ClassCastException, caused by the implicit cast to Integer.

If this code were legal, it means that the type safety guarantee is broken. The code compiles with no warnings, but at runtime it results in an unexpected ClassCastException in the last line. This happens because we managed to store a string list in an array of integer lists. 

So now we know why creating an ArrayList<Integer>[] is illegal.  But why does the compiler allow creating an instance of ArrayList<ArrayList<Integer>>? Shouldn’t it lead to the same type safety problem? The answer is no. If we convert the example above to the analogical version that uses ArrayList instead of arrays, the second line will not pass compilation, since ArrayList<ArrayList<Integer>> is not a subtype of ArrayList<Object> (See “Hierarchy of parameterized types” above).

The same phenomenon may occur with generic methods that receive generic arrays:

public class MyUtils{ 
    public static <T> void fill(T[] arr, T value){
        for(int i=0;i<arr.length;i++)

This mini-utility looks type safe, but it isn’t. Surprisingly, the following code compiles with no errors/warnings:

String[] strArr = new String[10];

Why does this happen? When a generic method is called this way, the compiler performs type inference for determining what type T stands for. In our case, this is done by inspecting the parameter types, and finding the lowest common type shared by them in the class hierarchy. With fill(strArr,”hi”), the type is obviously String. However, in the example above the most specific type shared by strArr and 3 (which is converted to an Integer) is Object. This causes the code to be legal, but to fail with ArrayStoreException on runtime. Technically, this is not a violation of the type safety guarantee explained in the introduction, because the guarantee deals only with ClassCastException.

There are two simple ways to avoid this problem. The first is to be more specific with the call:

MyUtils.<String>(strArr,3) //Now it does not compile

The second is to fix the method’s signature as follows:

public static <T,S extends T> void fill(T[] arr, S value) 

Latent Typing

In C++, the implementation of a parameterized type can assume anything about the type parameter/s. For example, the implementation of MyList<E> may call any method on E, and the availability of this method is checked against the actual type parameters of the different instantiations at compile time.  Java does not allow it.  Due to the generics design, the compiler has to generate the full bytecode of the generic class, regardless of the instantiations, so calls to methods that are not bound to any class/interface can not be compiled.

Therefore, in Java you are encouraged to declare explicitly all your assumptions on a type parameter. This is done using bounds. For example:

class DrawableSortedSet<T extends Comparable<T>,Drawable>{

This clearly makes the code more declarative, but proponents of Latent Typing argue that the extra declarations are not really needed for runtime type safety (C++ approach proves this), and also that the declarative approach may limit the programmer and force him to define new artificial interfaces containing the desired methods.  


Bounds restrictions

As explained before, bounds allow us to specify constraints on the type parameters being used, both in a generic class or a generic method. These prerequisites give us more freedom. For example, assume that we want a method that manipulates a list of numbers.  We start by the following generic method:

public void manipulate(List<Number> list){

The problem here is that List<Integer> can not be passed as parameter. Only List<Number> is allowed (see “Hierarchy of parameterized types” above). If we want to pass lists of any descendant class of Number, we should use bounds:

public <T extends Number> void manipulate(List<T> list){ 

Now we can pass List<Integer>, List<Float> and so on,  but we face a new limitation. While in the previous version of the method we could call list.add(new Integer(10)) or list.add(new Double(3.2)), now the compiler will not accept it, because it can’t guarantee that the addition maintains the type homogeneousness in the list. If addition of Integers were allowed, then calling the method with a variable declared as List<Float> may result with a new non Float item in the list, and this is a violation of the type safety guarantee.


Static data members

Generic classes are not allowed to have references to their type parameters from their static data members or methods. For example, the following is illegal:

public MyClass<T>{
    private static T value; //Illegal

    public static void analyze(List<T> list){ //Illegal

C++ programmers who learn Java may be tempted to do it, because in C++ this is legal. Remember that unlike C++, Java uses the same class at runtime for MyClass<Integer>, MyClass<String> etc. Static members, by definition, are bound to the class rather than to the instance. It follows from these 2 facts that static members are actually shared by all instances of all generic instantiation of the same generic type. If the code above were allowed, type safety could not be guaranteed anymore. Imagine for example what happens when an instance of MyClass<Integer> stores an Integer in value, and then an instance of MyClass<String> retrieves it as a String.


Generic information at runtime 

Because of erasure, there is no sense in checking generic information at runtime. For example, the following is illegal:

if (list instanceof List<String>){ ....}


Primitive types as parameter types

Unlike C++, Java does not allow primitive types to be type parameters. This limitation is not so significant, since we can always replace the primitive type with the corresponding wrapper, and we can also use auto-boxing/auto-unboxing in the access methods to simplify the code. However, boxing/unboxing can lead to a performance penalty. In order to eliminate it there is no choice but to code a new, dedicated version of the class for the specific primitive type.


Generic exceptions

It is not allowed to define a generic type that extends (directly/indirectly) some Throwable. Trying to do so implies that the programmer wishes to handle different parameterized versions of the same exception at runtime, such as in the following example:


The problem is that erasure does not allow it.  There is no way for the JVM to distinguish at runtime between MyException<Byte> and MyException<Integer>. They are both simply MyException at runtime. Therefore there is no sense in having generic exceptions.


Generic enums

Defining generic enums is not allowed: 

enum State <T>{ //Illegal

     private T attribute;
     public void setAttribute(T attribute){

     public T getAttribute(){
        return attribute;

The reason for this to be illegal is related to the static members limitation (see “Static data members” above). Enums are implemented according to the “Type Safe Enum” design pattern, where the enum values are static data members of the class. Therefore, they are not allowed to have any reference to the class’ type parameter in their declarations.



Generics are a big step forward for Java, making it support a popular programming style common to Ada, C++, C# and other languages. However, simple intuition is not enough for really understanding Java generics. Sometimes one has to go under the hood of the compiler in order to understand its limitations and use it correctly. There has been a proposal of re-working the design by canceling erasure (“Reified Generics”), but it seems like this idea is no longer in the feature list of Java 7 (see



Posted in java, Language | Tagged: , , , , , , , , , , , , , , , , | 3 Comments »


Posted by Eyal Schneider on June 17, 2009

Any Java programmer has encountered at least once the disturbing OutOfMemoryError (OOME) message. The interesting thing is that this error occurs for different reasons, and the diagnosis and treatment in each case is completely different. In this post I will try to cover some of these cases, show how to reproduce them and how to deal with them.
Some important things to remember about OOME:

  • As the name implies, this is an Error, and not an Exception. Errors indicate a serious condition in the JVM. Catching Errors is usually useless – the code doesn’t have the means to recover from them. Since they are usually not caught, they result in some thread getting terminated.
  • The stack trace is not always present in OOME, since its creation also requires memory, which may not be available.
  • The circumstances where an OOME is thrown may vary between different JVM implementations, due to a different memory management model. The textual description of the error may also vary. This post refers to to my own experience with Sun’s JVM, although I believe that most of it is relevant for other implementations as well.
  • As explained above, OOME is not recoverable at runtime. Therefore, the right strategy is to avoid its recurring once it happens, by means of configuration, design, or hardware modifications.


Java memory organization

Let’s first start with a very short overview of how a Java process organizes its address space. Basically, the memory is divided into heap memory and non-heap memory.  Modern JVM implementations use a generational heap model, in order to improve garbage collection performance. In this model, the heap area is usually split into 3 generations: young, old (“tenured”) and permanent. The young generation is where new objects created with the new operator are allocated, and it is designed to hold “young” objects. Older objects are moved to the tenured generation during GC. The permanent generation is not common to all JVM implementations, and it differs from the other generations – it stores (among other things) metadata of classes that have been loaded. The permanent generation is sometimes regarded as part of the heap, and sometimes not (The heap sizing flags -Xmx and -Xms, for example, do not include the permanent generation size).
Non-heap memory is the additional memory space outside of the areas reserved for the heap, and it can be used for different purposes. I will refer to this memory as native memory.

java.lang.OutOfMemoryError: Java heap space

This common error indicates that an object creation failed because there was no available heap space for it. It will never be thrown unless the JVM tries a full garbage collection first, including removal of softly/weakly referenced objects. When the stack trace is present for this particular OOME, it is usually irrelevant – the specific allocation that could not be satisfied gives us no clue regarding previous allocations that probably caused the problem. Same with the thread where the error occured – It may occur in any thread, since they all share the same heap.
Following is a sample code that eventually generates this OOME:
List<String> list = new ArrayList<String>();
    list.add(new String("Consume more memory!"));
1) Check for memory leaks
  • Use the memory tab in JConsole in order to examine the graph of memory usage of your application over time. Focus on the “Tenured generation”, and click on “perform GC” every once in a while, to see whether the used memory returns to some expected baseline level after every garbage collection.
  • Run your application with the following JVM flags: -verbose:gc and -XX:+PrintGCDetails. This will add console outputs with details on every garbage collection that takes place. Using these outputs, you can analyze the memory consumption pattern, and see whether the memory usage increases constantly where it shouldn´t.                                    

 The following tools can help you analyze the reasons for the memory leak:

  • Run jmap -histo <java process id> at different stages of your application execution. This command prints the number of instances and total used memory per Java class. You can identify classes that have an unreasonable number of instances, or classes for which the number of instances increases unexpectedly. This can limit the search space in your code, when looking for the bug.
  • Generate a heap dump using JConsole or jmap, and analyze it using a heap analyzer (e.g. JHat, IBM HeapAnalyzer). The heap dump contains the full state of the heap at the point of time where the dump was created. JHat allows examining this heap snapshot, navigating through the different instances, and even executing complex queries (OQL language). An advanced feature allows comparing two snapshots taken at different points of time, in order to identify the allocations that occurred between the two. If you want the dump to be created automatically when the OOME occurs, add the -XX:+HeapDumpOnOutOfMemoryError JVM flag.

2) Increase the heap size

Use the JVM flags -Xms and -Xmx to set the initial and maximal heap size, respectively. Make sure not to exceed the physical memory size, otherwise you encourage paging, which affects performance. Sometimes installing additional physical memory is required.

3) Reconsider the design

An inadequate design can easily consume too much heap space and cause this OOME. Sometimes, it is useful to change the design in a way that compromises some desired system property (such as performance), but reduces memory use dramatically. For example, disk/database storage can replace some of the data structures in RAM, and memory sensitive mechanisms such as weak/soft references can be introduced.

java.lang.OutOfMemoryError: PermGen space

When the permanent generation becomes full, this error occurs, regardless of the available space in the object allocation area (young + old generations). Note that this specific message is issued by Sun’s JVM, and other JVM implementations may have different messages, specially those implementations with a different policy for storing class information (such as IBM’s JVM).
I will reproduce this OOME in two different ways. The first one uses a custom class loader to load too many classes. Assume that we have a custom class loader named MyClassLoader, capable of loading the class Dummy:
List<Class<?>> classes = new ArrayList<Class<?>>(); while(true){ MyClassLoader cl = new MyClassLoader(); try{ classes.add(cl.loadClass("Dummy")); }catch (ClassNotFoundException e) { e.printStackTrace(); } }
Note that we are actually loading the same class over and over, with a different class loader (remember that a class is identified by the combination of its name and its class loader object). Also, note that we store references to the loaded classes. If we don’t do so, the garbage collector realizes that the classes are not used, and performs garbage collection on the permanent generation.

A second way of filling up the permanent generation, is by exploiting the String intern pool. The class String manages a pool of strings that have been interned, using String.intern(…). This pool is also a part of the permanent generation.

List<String> list = new ArrayList<String>();
int i=0;
    list.add(("Consume more memory!"+(i++)).intern());

The examples above highlight the common cases where this OOME occurs in practice:

  • Applications which load classes dynamically, such as web servers and application containers
  • A class loader with a memory leak
  • Applications that make use of string interning, in order to improve performance or reducy heap usage


Start by analyzing. A good place to start with is the stack trace. If it contains the java.lang.String.intern(..) call, then the reason is probably string interning. Alternatively, if it contains the java.lang.ClassLoader.defineClass(..) call, then you are probably dealing with a class loading problem. In the latter case, you may want to track the loading/unloading of classes, using the JVM flags -XX:+TraceClassLoading(or -verbose:class) and -XX:+TraceClassUnloading. Also, make sure you don’t have the flag -Xnoclassgc, because it prevents GC on classes, and can easily cause this OOME in applications that keep loading classes dynamically. As a part of the analysis, you can also run the utility jmap -permstat <process id> (not available for Windows). It provides details on all class loaders that have been used since the application was launched. 

Once you have pinpointed the problematic piece of code, check whether the design can be changed to consume less memory. If this is not an option, use the -XX:MaxPermSize flag to increase the maximal size of the permanent generation (e.g. -XX:MaxPermSize=256m).
Remember that increasing the values of -Xmx and -Xms flags instead will not change a thing. As explained before, these flags control the size of the young+old generations only, and do not affect the permanent generation.


java.lang.OutOfMemoryError: unable to create new native thread

Whenever we start a a  thread in Java, the JVM allocates space in native memory, designated for the thread’s call stack. This space is freed only after the thread is terminated. In case that there is not enough native memory for the stack space, we get this error. This can easily be reproduced with the following thread spawner code:
    new Thread(new Runnable(){
        public void run() {
            try {
            } catch(InterruptedException e) { }        
This error is common in applications that try to manage too many live threads, as in the example above. As always, first ask yourself whether the design is reasonable. There are some generic solutions for reducing the number of threads. For example, an application that creates a thread per connection has a scalability problem in its design. The IO multiplexing feature introduced in Java 1.4 can solve this, and reduce dramatically the number of threads (See class java.nio.channels.Selector).
If you wish to try solving this problem by changing the memory settings only, then any of the following can help:
1) Reduce the heap size using -Xmx and -Xms flags. Sometimes, inexperienced programmers increase the heap size when they encounter this OOME, but it only makes the condition worse.
2) Reduce the permanent generation size using -XX:MaxPermSize
3) Reduce the size of memory allocated per stack using -Xss (e.g. -Xss512k). The default values depends on the environment, and range between256k and 1024k. Remember that when you reduce this value, you are increasing the risk of StackOverflowError, specially if you have deep recursion somewhere in your code or libraries that you use.

Other native memory OOMEs

There are many ways to use native memory other than thread stack allocation discussed before. Allocations can be made by the JVM itself, third party libraries, utilities in the JDK libraries,  and JNI code that you write. When the native memory gets exhausted, we may get an OOME of some kind. They are not consistent with their message, but one of them looks as follows:
java.lang.OutOfMemoryError: requested <bytes> bytes for <reason>. Out of swap space?
We can also cause a different message, by using the direct buffer feature of the java.nio package, in order to consume all available native memory:
List<ByteBuffer> list = new ArrayList<ByteBuffer>();
In this case we will get the message:
java.lang.OutOfMemoryError: Direct buffer memory
1)  Check whether there is a memory leak in some native code you use with JNI
2) Reduce the heap size using -Xmx and -Xms flags, to make more space for native memory.
3) Increase the swap space on disk. The way to do this depends on the operating system you use.


There are several types of OutOfMemoryError messages. It is important to know how to identify and analyze them,  and then how to find an appropriate solution. Solutions are in terms of prevention and not recovery at runtime.  A recovery attempt after an OOME may be too late – the data integrity may be broken, and there is no guarantee that there will be enough resources to complete it successfully. Instead, we should re-launch the application with a different setup.
An automatic recovery mechanism that occurs before the OOME can be effective, though. Since Java 1.5, we have the MemoryMXBean class, as part of the JMX services. An example of such a memory tracking mechanism was written by Dr. Heinz Kabutz in his 92nd issue of his newsletter.

Posted in java, JDK packages | Tagged: , , , , , , , , , , , , , , , , , , , , | Leave a Comment »

SimpleDateFormat pitfalls

Posted by Eyal Schneider on May 29, 2009

The class SimpleDateFormat is a very popular class, allowing flexible formatting and parsing of dates in Java. Formatting consists of transforming a Date into a String in a predefined format, and parsing performs the inverse operation.

In many cases, programmers fail to use SimpleDateFormat correctly, due to inattention to some (well documented) characteristics of this class. Maybe this will not be new for some of the readers, but there are 3 important things to remember when using SimpleDateFormat.

Time zone 

Every instance of SimpleDateFormat has an associated time zone to work with, as a TimeZone instance. Otherwise, a parsing operation would not be able to determine an absolute point in time (Date object), from a given string denoting a date. If not specified otherwise, the time zone associated with a new SimpleDateFormat is the default one, corresponding to the time zone reported by the operating system.

Consider the following code:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
Date date = sdf.parse("18-07-1976");

Since the time zone is not explicitly set (using setTimeZone(..)), the code above will result in different outputs when executed in different parts of the world. The midnight that started July 18, 1976 at Thailand, and the same midnight in Florida, have a different GMT representation. The reason is that the midnight arrived in Thailand first.

In many cases, the local default time zone is exactly what the programmer wants. However, on a client-server application, or on a distributed system, we usually want to avoid this time zone relativity. Date strings that are read from configuration / data files / user interface / external software interfaces must be specific with their time zone. We can achieve this in one of two ways. The first one is to include the timezone itself as a part of the expected date format:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy z");
Date date = sdf.parse("18-07-1976 ICT");

This program will always return the same value (206470800000), regardless of the execution location.

The other approach is to define a convention which needs to be respected by all system entities which generate textual dates. For example, we can decide that all dates should be interpreted as GMT dates. Then, we can bind the corresponding time zone to the SimpleDateFormat. For example:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
Date date = sdf.parse("18-07-1976");

 This program will also return the same value everywhere in the world (206496000000). 


Assume  that we are reading a date in the format dd/MM/yyyy from the GUI. We want to validate that the date is well formatted and legal. Consider the following validation code, which makes use of the ParseException thrown by the parse method:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
Date d = null;
try {
    d = sdf.parse(dateStr);
} catch (ParseException e) {   
    showErrorMessage("Incorrect date format. "+e.getMessage());

Will it do the job? The answer is – not always. Date strings such as “10//10/2008” will result in an error, but dates such as “40/20/2009” will be parsed successfully, with no error message!. The reason is that by default, SimpleDateFormat is lenient, which means that it is tolerant to some formatting errors. It will try to figure out the date by applying some heuristics to the given text. In the case above, month 20 will be interpreted as December plus 8 months (i.e. August 2010), and day 40 of the month will be interpretted as end of August plus 9 days. The resulting date will be September 9th, 2010.

In order to achieve a stricter validation, we should make the SimpleDateFormat non-lenient, by initializing it as follows:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");

Thread safety 

SimpleDateFormat is not thread safe. Concurrent formatting/parsing operations interfere with each other and result in inconsistent behavior. There are 3 ways to deal with this:

1) Create a SimpleDateFormat instance per usage
This will work fine if the usage is not frequent. Otherwise, the performance penalty may be significant.

2) Synchronization
We can achieve thread safety by synchronizing all usages of the SimpleDateFormat instance with the same lock. For example:

private SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
private Date parse(String dateStr) throws ParseException{
        return sdf.parse(dateStr);
private String format(Date date){
        return sdf.format(dateStr);

The disadvantage of this approach is that in case of many threads working with this code, thread contention will affect performance.

3) Use an instance per thread
The usage of ThreadLocal class allows associating a single instance of SimpleDateFormat with each thread that needs to do parsing/formatting of dates.
This way, we achieve both thread safety with no thread contention, and a minimal number of SimpleDateFormat instances.
In some cases, we can be extra economical on memory usage, if we know for example that the threads stop using the SimpleDateFormat at some point of time. In these cases, we could use a soft reference to hold the SimpleDateFormat instance.
An explanation of this technique plus sample code can be found here.


Posted in java, JDK packages | Tagged: , , , , , , | 3 Comments »

Having performance issues with regex?

Posted by Eyal Schneider on May 21, 2009

Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet, one should really be an artist in order to use it right. Assuming that a regular expression is proved for correctness, it may still run extremely slow (even take hours) if it is not wisely written.

Continue reading for understanding the origin of the problem, or jump directly to the last section, containing 10 useful performance tips for regular expression writers in Java.

Can it really be so slow?

 Suppose that we want to match only strings composed of sequences of ‘a’s or ‘b’s. A legal regex would be:


However, if you run this regex against the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” for example, it takes several minutes to terminate and report that there is no match!
Of course, a better regex in this case would be:


This one takes less than a millisecond on my machine, on the same input. Obviously, there is a performance issue here.

Why does this happen?

 Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. The engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking.
In the catastrophic example above, the engine will actually explore ALL the decompositions of the series of ‘a’s into smaller series, before it realizes that there is no match. This example shows how the backtracking algorithm can result in exponential time evaluation (in the length of the input sting). It also reveals an important property of NFA engines: the worst cases will always be inputs that almost match the pattern. If a match is found, the exploration stops.

The other main approach for regex engines is DFA (Deterministic Final Automata). In this approach, the regex compilation actually builds an automaton, to be used when traversing input strings. Inputs are traversed character by character, with no going back. This guarantees linear timein the length of the input string, regardless of the regex complexity. Instead of trying different match possibilities serially (as in NFA), DFAsimulates a parallel scanning on all possibilities.

So why does Java (and .NET, Perl, Python, Ruby, PHP etc) use NFA and not DFA, which has much better asymptotic behavior? The reason is that NFA has some significant benefits:

  • Its compilation is faster, and requires much less memory
  • It allows some powerful features (See Sun’s tutorial for details on them):
    • Capturing groups and back references
    • Lookarounds
    • Advanced quantifiers (Possessive and Reluctant)

It is important to note that the popular terms NFA and DFA for regex engines are inaccurate. Theoretically speaking, these 2 models have the same computation power, meaning that there can’t be a matching rule that can be expressed in one of them but not in the other. In practice, there was a need for more features, so the two implementation types diverged in their semantics. NFA engines were given more power, making them superior to DFA engines in computation power.

Due to the speed of DFA and the unique features of NFA, there are 2 more “integrative” approaches for regex implementations. Some implementations use both engines (e.g. GNU egrep, which chooses the specific engine at runtime), and some of them really implement a hybrid version (e.g. Tcl regex), enjoying all the benefits.


 Following are some tips on how to avoid regex efficiency issues in Java. Many of them are aimed at reducing backtracking.

 1)  Pre-compile
Trivial, but worth mentioning. If you use a regular expression more than once, remember to compile it first as a pre-process step:
Pattern p = Pattern.compile(regex,flags);

//Usage stage
Matcher a = p.matcher(input);

 2)  Reluctant quantifiers vs greedy quantifiers
The default quantifiers (* + and ?) are greedy. That means that they start by matching the longest possible sequence, and then they step back gradually if backtracking is needed. In case that you know that the match is usually short, you should use reluctant quantifiers instead. They start by the smallest match, and progress if needed.
For example, suppose that you want to match only strings containing the sequence “hello”. The regex .*hello.* will do the job, but if you know that ‘hello’ usually appears near the beginning of the text, then .*?hello.* will run faster on average.

 3)  Use possessive quantifiers where possible
Unlike reluctant quantifiers which affect performance but do not affect regex behavior, possessive quantifiers may really change the meaning of the regex.
When using *+ instead of *, the first attempted match will be greedy (i.e. longest match, just as with *), but there will be no backtracking in case of failure, even if this causes the complete match to fail. When does this become useful?
Suppose that you need to match text in quotes. The regex \”[^\”]*\”will work fine. However, it will do unnecessary backtracking on negative cases (e.g. “bla bla bla). Using the regex \”[^\”]*+\”instead will eliminate the backtracking, without changing the regex meaning.

Independent grouping can have the same effect, and it allows even more control (See Sun’s tutorial ).

4)  Avoid capturing groups
Any expression in parentheses is considered a capturing group by default. This has a small performance impact. Make your groups “non-capturing groups” when possible, by beginning them with (?: instead of with (.

5)  Use alternation wisely
When using alternation (e.g. Paul|Jane|Chris), the order in which the engine tries to match the options is the same order in which they appear. Therefore, you can take advantage of this property, and order your options from the most common one to the less common one. This will improve the average time of positive matches.

 6)  Avoid multiple interpretations
Write your regex in a way that minimizes the number of different ways to match a particular input string. For example, the regex (a*b*)*given in the beginning of this article, allows interpreting the string “aabb” in too many ways:

The regex (a|b)* on the other hand, forces a unique interpretation on a positive input.
This is very important in order to reduce backtracking, in cases of almost a match.

 7)  Lookarounds
Lookarounds  allow adding restrictions on sequences to the left/right of current position. Specifically, with negative lookahead you can search strings that do not contain some sequence (cumbersome thing to do without this feature!). How can this help improve performance?

Suppose you want to capture the URL from a link tag. Consider the following regex:
<a .* href=(\S*).*/>
For legal tags it will find a match and capture only the href attribute as required (\S stands for a non-whitespace character) .  On some illegal tags, however, exessive backtracking will occur. For example: “<a href= href=href=…. href=something”.
The following regex will prevent this, by replacing the “.*” expression with something more specific, that does not match “href’”:
<a ((?!href).)* href=(\S*)((?!href).)*/>

 8)  Specify lengths
Java has a regex optimizer that checks the input string’s length against the min/max length derived from the regex contents. This allows failing the match in some cases immediately. In order to help this mechanism, specify repetition number on groups whenever possible (e.g. [01]{6}, which matches all binary strings of length 6).

 9)  Extract mandatory strings
Sometimes, strings which are mandatory are hidden inside groups or alternations:
This expression could be simplified to:
By doing so, the regex optimizer has more information to work with.

 10)  Benchmark your regex
When the regex is used in a performance critical area of your application, it would be wise to test it first. Write a micro-benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.



Posted in java, JDK packages | Tagged: , , , , , , | 6 Comments »