The Java Explorer

Tips and insights on Java

  • Subscribe

  • Email Subscription

    If you find this blog useful, please enter your email address to subscribe and receive notifications of new posts by email.

Comparators and sorted sets

Posted by Eyal Schneider on November 23, 2009

Java suggests two techniques for defining an ordering on a collection of objects. The first is implementing the Comparator interface, as a stand alone comparison mechanism. The second is making the objects comparable, by making their class/es implement the Comparable interface.  

When the ordering is inherent to the class nature, such as with class Integer, implementing the Comparable interface makes sense. In the other cases, we don’t want to add specific ordering logic to the class implementation. Instead, we implement comparators on a need to basis. We may also want to implement multiple comparators for the same class, thus having many ordering schemes.

Regardless of the interface we choose to implement (Comparable/Comparator), there are some important rules to follow. Suppose that we need a class for maintaining a set of points in the plane, ordered by their distance from a fixed point. Let’s first define a class for representing a point:    

public class Point{
    private int x;
    private int y;
    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    public boolean equals(Object other){
        if (!(other instanceof Point))
            return false;
        Point otherPoint = (Point) other;
        return x==otherPoint.x && y==otherPoint.y;
    }

    public int hashCode(){
        return x ^ y;
    }

    public double distanceFrom(Point other){
        int dx = x-other.x;
        int dy = y-other.y;
        return Math.sqrt(dx*dx+dy*dy);
    }

    public String toString(){
        return "("+x+","+y+")";
    }

 }

And now let’s define the OrderedPointSet class, which uses a TreeSet with a comparator for maintaining the sorted set of points efficiently: 


public class OrderedPointSet {
    private SortedSet<Point> points;
  private Point refPoint;
  public OrderedPointSet(final Point refPoint){
        this.refPoint = refPoint;
        points = new TreeSet<Point>(new Comparator<Point>(){
            public int compare(Point p1, Point p2) {
                double dist1 = refPoint.distanceFrom(p1);
                double dist2 = refPoint.distanceFrom(p2);    
                if (dist1>dist2)
                    return 1;
                if (dist2>dist1)
                    return -1;
                return 0;
            }
        });
    }
    public void add(Point p){
        points.add(p);
    }
    //returns the reference point followed by a list of points ordered in ascending distance from it

    public String toString(){
        return refPoint+":"+points.toString();
    }

    //Other methods
    ....
}

We can now test our implementation: 

OrderedPointSet ps = new OrderedPointSet(new Point(0,0));
ps.add(new Point(1,5));
ps.add(new Point(10,10));
ps.add(new Point(5,20));
ps.add(new Point(5,5));
System.out.println(ps);

The output contains all points, with their right order:  

(0,0):[(1,5), (5,5), (10,10), (5,20)]
 

Apparently the job is done. However, the implementation has a major flaw, that can be detected with the following test scenario:  

OrderedPointSet ps = new OrderedPointSet(new Point(0,0));
ps.add(new Point(5,10));
ps.add(new Point(2,11));
ps.add(new Point(5,5));
System.out.println(ps);

We expect to see 3 points in the output, but we see only two:   

(0,0):[(5,5), (5,10)] 

How did this happen? The first two points, though not equal, have the same distance from the reference point  (square root of 125). The TreeSet implementation (as well as other SortedSet implementations) never invokes the equals() method on its items, and relies only on the comparator implementation for determining equality. Therefore, from its point of view, points (5,10) and (2,11) are equal. When trying to add the latter, the collection stays unchanged, because it obeys the contract of Set.add(), which doesn’t permit overriding. As a consequence, another part of Set’s contract is broken: we add two distinct items (according to equals method), and yet one of them is rejected.
The problem originates from the comparator implementation above, which doesn’t define an ordering consistent with equals, as required by the SortedSet documentation. The formal definition is as follows:  

The ordering imposed by a comparator c on a set of elements S is said to be consistent with equals if and only if c.compare(e1, e2)==0 has the same boolean value as e1.equals(e2) for every e1 and e2 in S.        

Actually, for solving the problem described above it suffices to enforce only one direction of this rule: if two items are equal according to the comparator, they must be equal according to the equals() method. However, failing to comply with the other direction will result in a different violation of the Set interface -  the collection may accept two items despite the fact that they are equal according to equals().
Returning to our example, the fixed version of the comparator enforces consistency with equals by breaking distance ties of distinct points arbitrarily:

public int compare(Point p1, Point p2) {
    double dist1 = refPoint.distanceFrom(p1);
    double dist2 = refPoint.distanceFrom(p2);    
    if (dist1>dist2)
        return 1;    
    if (dist2>dist1)
        return -1;
 
    int dx = p1.getX()-p2.getX();
    if (dx != 0)
        return dx;
    return p1.getY()-p2.getY();
}

Proving that the consistency rule holds for this code is easy:

Let p1,p2 be points.

(Direction 1) Assume that p1.equals(p2). Then, p1.x=p2.x, and p1.y=p2.y. In this case, when running compare(p1,p2), conditions on lines 4 , 6 and 10 will not be met. The returned value is then p1.y-p2.y, which is 0, as requested.
(Direction 2) Assume that compare(p1,p2)=0. That means that the method must have terminated at line 12 (the other return statements never return 0). Therefore we know that p1.y=p2.y. Also, we know that condition on line 10 was not met, hence p1.x=p2.x. In other words, p1 and p2 are identical, so p1.equals(p2) is true, as requested.

Summing up, when implementing a Comparator (or Comparable)  interface intended to be used in some SortedSet implementation (or as a SortedMap key), make sure the comparator complies with the consistency with equals rule. This will guarantee that the collection respects the Set contract, and does not exhibit weird behaviours. After implementing the comparator, it may be wise to prove that the consistency property really holds.

Posted in JDK packages, java | Tagged: , , , , , , | 4 Comments »

Strings and Memory leaks

Posted by Eyal Schneider on October 27, 2009

java.lang.String is one of the mostly used classes in Java. Interestingly,  it is important to know some of its implementation details in order to avoid memory leaks.
Consider the following seemingly innocent piece of code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

 

public class StrangeStringThing {
    private static Random rnd = new Random();
    private static final int MEGA = 1024*1024;
 
    private static String createRandomStr(int length){
        StringBuilder sb = new StringBuilder(length);
        for(int i=0;i<length;i++)
            sb.append((char)('a'+rnd.nextInt('z'-'a'+1)));
        return sb.toString();
    }
 
    public static void main(String[] args){
        List<String> substrings = new ArrayList<String>();
        for(int i=0;i<100;i++){
            String randomStr = createRandomStr(MEGA);
            String subStr = randomStr.substring(1000,1004);
            substrings.add(subStr);
        }
    }
}

This code generates a very long random string, and then keeps only a 5 chars long substring of it. This process is repeated 100 times.  We expect to end this program with 100 strings of length 5.
Surprisingly, if we try to run this code with the heap size limited to 100MB (-Xmx100m), we get an OutOfMemoryError.  How did that happen? all temporal large strings should have been cleared by the garbage collection, so there should be a really small memory footprint. Furthermore, the documentation of java.lang.String does not indicate any misuse of the API.

How are Java Strings stored?

The Java language specification defines precisely the API of class String, its Unicode details, its immutability, and the way String literals should be handled. Naturally, it does not define how Strings should be  represented internally, and what are the required time/space complexities of the different operations on them. These aren’t functional requirements. As a consequence, different JVM providers have the freedom to choose their preferred design for class String. Sun’s implementation (and also IBM’s) represent a string with a combination of 3 data members:

  • value – A reference to a char array
  • offset - A start index, inside the array
  • count - The actual length of the string.

When allocating a new string, the char array contains exactly the string content. Offset is set to 0, and count is set to the char array length. Then, when calling the method substring(..) upon it, the new string being returned contains a reference to the same char array, but its offset and count members are modified, reflecting the requested subsequence of chars. The sharing of the same char array by multiple String instances is possible, since strings are immutable. There are two benefits of this implementation approach in comparison to a substring implementation based on copying:
1) Memory usage is usually reduced, specially in cases where many substrings of the same string are taken, or if the substrings are long
2) substring(..) runs in constant time, instead of linear time

 Obviously there is a tradeoff here – if the string utilization pattern is not as described in (1), we may suffer from excessive memory consumption. The code above demonstrates this edge case: we take very small substrings of very large strings, where the latter are temporary. The temporary strings (referenced by randomStr) ARE collected by the garbage collector during the loop. However, their internal char arrays can not be collected, since they are being shared with the sub-strings. Therefore, the very long char arrays are kept unintentionally in memory, and that constitutes a memory leak.

The solution is simple – forcing the substrings to be copied into a new compact char array. This is done by replacing line 22 above with:

  String subStr = new String(randomStr.substring(1000,1004));

It looks like the API designers were aware of the the possibility of substring being implemented as described, so they added a copy constructor specifically for avoiding the memory leak. Note that otherwise this constructor is useless, because we are dealing with an immutable class.

Is this a Java bug?

There is currently an open bug report in Sun’s bug database regarding this issue.  It doesn’t look like it will be fixed soon, because doing so may introduce performance regression in many existing programs. Besides, there is a very simple workaround (described above), what makes the priority of this bug relatively low.

Summary

Java developers should be aware of the prevalent String.substring(..) implementation, and its memory leak potential risk. Whenever we know that the original string we take the substring from has a shorter life span than the substring itself, we should consider using the copy constructor to avoid sharing of the underlying char array.

Posted in JDK packages, java | Tagged: , , , , , , , | 2 Comments »

Java Quiz

Posted by Eyal Schneider on September 17, 2009

This time I posted a 20 questions quiz, developed for testing general Java skills.
Most of the questions address fundamental topics in Java(5), but the test is not a trivial one.  The right answers plus their explanations can be found on the bottom. You are welcome to add your comments.

Enjoy, and good luck!

Question 1

public class CrazyCount {
    private static volatile int count = 0;

    public static void main(String[] args) {
        for(int i=0;i<1000;i++)
            new Thread(new Runnable(){
                public void run() {
	           count++;
                }
            }).start();

        //.....
    }
}

What will be the value of count after all 1000 threads finish their work?

(a) Garbage, because the writes are not synchronized
(b) A value between 1 and 1000
(c) Exactly 1000
(d) At least 1000
Answer

Question 2

Consider the following multi threaded producer/consumer model:

 public class ProducerConsumer {
    private Job currentJob = null;

    public synchronized void addJob(Job job) throws InterruptedException{
        while (currentJob!=null)
            this.wait();
        currentJob=job;
    }

    public synchronized Job consumeJob(){
        Job res = currentJob;
        res.process();
        currentJob=null;
        this.notify();
        return res;
    }
}

Which of the following is correct?

(a) addJob will wait till the current job is executed, but consumeJob will never block.
(b) Jobs will be executed in the exact order in which they are added with addJob.
(c) This code may not work because of deadlocks
(d) ‘while’ in addJob can be replaced by ‘if’, without changing the behavior.
Answer

Question 3

Consider the following Singleton design pattern implementation:

public class ConnectionsMgr {
    private static ConnectionsMgr instance = null;

    public static ConnectionsMgr getInstance(){
        if (instance==null){
	    synchronized(ConnectionsMgr.class){
		if (instance==null)
		    instance = new ConnectionsMgr();
	    }
    	}
	return instance;
    }

    //...
}

Why is this implementation problematic?

(a) An uninitialized ConnectionsMgr object may be returned
(b) Synchronizing on ConnectionsMgr.class is not allowed
(c) The instance may be created more than once
Answer

Question 4

Which of the following statements regarding the final reserved word is correct?

(a) A final method can not be overriden by subclasses, and a final class can not be subclassed
(b) A final data member must be given a value in its declaration line
(c) A local variable declared as final can still be mutated using setter methods
(d) (a) and (c) are correct
Answer

Question 5

public static void main(String[] args) {
    try{
	String s = null;
	s=s.substring(10);
    }
    catch(Exception e){
	System.out.println("Caught exception!");
    }
    catch(RuntimeException e){
	System.out.println("Caught a runtime exception!");
    }
}

This code:

(a) Displays “Caught exception!” when executed
(b) Displays “Caught runtime exception!” when executed
(c) Runs without throwing an exception
(d) Does not compile
Answer

Question 6

Which of the following statements regarding exceptions is correct?

(a) IOException can be thrown without the need to declare it in the method header
(b) catch(Exception) will catch OutOfMemoryError
(c) ClassCastException can be thrown without the need to declare it in the method header
(d) (b) and (c) are correct
Answer

Question 7

Which of the following is the preferred way to stop a thread?

(a) Calling the Thread’s interrupt() method
(b) Calling interrupt() and then stop()
(c) Calling the stop() method
(d) Calling join() to wait for the thread to end
Answer

 Question 8 

public class Swap {
    public static void swapStrings(String x, String y){
	String temp = x;
	x=y;
	y=temp;
    }

    public static void main(String[] args) {
	String a = "1";
	String b = "2";
	swapStrings(a, b);
	System.out.println("a="+a+" ,b="+b);
    }
}

What will be the output when executing this main?

(a) An exception will be thrown
(b) “a=2 ,b=1”
(c) “a=1 ,b=2” because strings are immutable
(d) “a=1 ,b=2” because Java passes parameters by value
Answer

Question 9

public static void main(String[] args) {
    ArrayList<Point> list = new ArrayList<Point>();
    list.add(new Point(100,200));

    ArrayList<Point> listCopy = (ArrayList<Point>) list.clone();

    listCopy.get(0).setLocation(150,250);

    System.out.println(list);
    System.out.println(listCopy);
}

What will be the output of this main?

(a) [java.awt.Point[x=100,y=200]]
       [java.awt.Point[x=100,y=200]]

(b) [java.awt.Point[x=150,y=250]]
       [java.awt.Point[x=100,y=200]]

(c) [java.awt.Point[x=100,y=200]]
       [java.awt.Point[x=150,y=250]]

(d) [java.awt.Point[x=150,y=250]]
       [java.awt.Point[x=150,y=250]]
Answer

Question 10

Assume that class B extends class A.
Consider a method that is declared as follows:  public void callMe(List<A> l)

Which of the following data types can be passed as parameters to callMe(..)?
(a) List<B>
(b) ArrayList<B>
(c) ArrayList<A>
(d) All answers are correct
Answer

Question 11

Assumes that class B extends class A.

public void callMe(List<? extends A> l){
    l.add(new A());
    l.add(new B());
}

This method:

(a) Compiles, but with a warning
(b) Does not compile
(c) Runs correctly
(d) May cause ClassCastException
Answer

Question 12

What will be the output of the following code?

class Base {
    int i = 10;

    public Base(){
        System.out.println(get());
    }

    public int get(){
        return i;
    }

}
public class Derived extends Base{
    int j = 20;

    public int get(){
        return j;
    }

    public static void main(String argv[]){
        Base b = new Derived();
    }
}

(a) 0
(b) 10
(c) 20
(d) None of the above
Answer

Question 13

What will be the output of the following code?

class Base {
    protected int i = 10;

    public int get(){
        return i;
    }
}

public class Derived extends Base{
    protected int i = 20;

    public int get(){
        return i;
    }

    public static void main(String argv[])
    {
        Base b = new Derived();
        System.out.println(b.i);
        System.out.println(b.get());
    }
}

(a)     10
            10

(b)     10
            20

(c)     20
            10

(d)
     20
            20 
Answer

Question 14

What is the output of the following code?

String s1 = new String("Test");
String s2 = "Test";
if (s1==s2)
    System.out.println("Same");
if (s1.equals(s2))
    System.out.println("Equals");

(a) Same
(b) Equals
(c) Same
        Equals
(d) No output
Answer

Question 15

What is the output of the following code?

class Parent{
    private void f(){
        System.out.println("Parent.f()");
    }

    public void g(){
        System.out.println("Parent.g()");
        f();
    }
}

public class Child extends Parent{
    public void f(){
        System.out.println("Child.f()");
    }

    public static void main(String args[]){
        Parent p = new Child();
        p.g();
    }
}

(a) Parent.g()
       Child.f()
(b) Parent.g()
        Parent.f()
(c) The code does not compile
(d) An exception is thrown
Answer

Question 16

Consider the following overloading scenario:

public class Overloader{
    public static void read(String s){
        System.out.println("read(String)");
    }

    public static void read(Integer i){
        System.out.println("read(Integer)");
    }

    public static void read(Object o){
        System.out.println("read(Object)");
    }

    public static void main(String args[]){
        Object s = new String("Java");
        Integer i = 10;
//      Overloader.read(s);
//      Overloader.read(i);
//      Overloader.read(null);
    }
}

What would be the output of each of the 3 read commands in comment?

(a) read(s) prints “read(String)”
        read(i) prints “read(Integer)”
        read(null) prints “read(Object)”

(b) read(s) prints “read(Object)”
        read(i) prints “read(Integer)”
        read(null) prints “read(Object)”

(c) read(s) prints “read(Object)”
        read(i) prints “read(Integer)”
        read(null) does not compile

(d) None of the above
 Answer

Question 17

Which of the following statements is correct?

(a) A daemon thread is terminated when all non-daemon threads terminate
(b) A thread is created as a daemon thread if and only if it is created by a daemon thread
(c) Non daemon threads may keep the application from shutting down, even if the main thread is terminated.
(d) All of the above are correct
Answer

Question 18

class Parent{
    protected void x(){}
    public void y(){}
}

public class Child extends Parent{
    public void x(){}
    protected void y(){}
}

When compiling this code:
(a) It compiles successfully
(b) Compilation error – x can’t have its visibility increased
(c) Compilation error – y can not have its visibility reduced
(d) Compilation error – neither x nor y can have their visibility changed
Answer

Question 19

String s = new String("Wanna live forever!");
WeakReference<String> r1 = new WeakReference<String>(s);
SoftReference<String> r2 = new SoftReference<String>(s);
String s2 = r1.get();
s = null;

Assuming that no garbage collection took place, what is the reference strength of the string created at line #1, after these lines of code?
(a) Not referenced
(b) Weakly referenced
(c) Softly referenced
(d) Strongly referenced
Answer

Question 20

int x=0;
int y=0;
while(true){
    if ((++x==10) && (++y==10))
        return;
}

When executed, this code:
(a) Performs 10 iterations and ends
(b) Performs more than 10 iterations and ends
(c) Performs less than 10 iterations and ends
(d) Enters an infinite loop
Answer
 

Answers

1)  (b)
A race condition between threads may cause a a write to cancel other writes, thus multiple concurrent “incrementations” may result in a single incrementation.

2) (a)
(b) is incorrect, since there is no guarantee about the order in which waiting threads are awakened using Object.notify().
(d) is incorrect because when a producer thread is awakened, there is no guarantee that another producer thread was not awakened just before, setting the current job to another value.
Besides, Java documentation recommends always to use ‘while’ loops on wait() rather than ‘if’.

3) (a)
The line “instance=new ConnectionsMgr()” is not atomic. It results in multiple machine instructions, one of which is the reference assignment itself. Due to the Java memory model rules, this specific instruction is allowed to be reordered with respect to the other instructions. This reordering may cause the assignment to take effect before the object is initialized. Another thread calling getInstance() may therefore receive a non null reference to an uninitialized memory block.  See this article for more details.

4) (d)
An object data member or a local object variable declared as final can not change its address. However, the referenced object can be freely mutated.
(b) is not true because the member can be initialized in the constructor/s also.

5) (d)
RuntimeException derives from Exception, so the ordering of the catch blocks makes the second unreachable.

6) (c)
ClassCastException is an unchecked exception, so it does not need to be declared. IOException, however, is a checked exception, and you must either catch it or declare the method as “throws IOException”. OutOfMemoryError is a Throwable, but it derives from Error and not from Exception.

7)  (a)
Thread.stop() is a deprecated method that is not recommended for use. It stops a thread violently without guaranteeing anything about the integrity of the data manipulated by that thread. This may cause corrupt data to be viewed by other threads. Furthermore, there are some cases in which Thread.stop() will not terminate a thread.
Thread.interrupt() is preferred, but it relies on the code run by the thread to be friendly and react to the interruption.

8) (d)
Since Java passes parameters by value (not by reference!), there is no way to implement a swap method in Java. In C++ for example, it is possible, using references (&).

9) (d)
Collections such as java.util.ArrayList do not perform deep copying when cloned. Item references stay the same.

10) (c)
(a) and (b) are incorrect because if legal, they would have break the type safety. The method callMe could have added an instance of type A, and the caller now has a collection of both A’s and B’s, while it assumes that it has only elements of type B. See my previous post for more details.

11) (b)
Assume that C descends from B. The method callMe may receive as a parameter an object of type List<C>. If the additions were considered legal, this would result in a list defined as List<C> containing instances of super classes such as B and A. That breaks the generics type safety, since it will not protect the collection from an unexpected ClassCastException. See my previous post for more details.

12) (a)
Overriding methods that are called directly/indirectly by the superclass’ constructor should be done with care. If the overriden method uses data members belonging to the subclass, then they are still not initialized at that point! In this example, the data member is of type int, so the default initialization will give it a value 0.
For more explanations and a workaround for this problem, see posts 86 and 86b of the Java Specialists’ Newsletter.

13) (b)
Methods are invoked virtually – the exact method to be executed is determined at runtime according to the actual class of the object on which it was called. Data members, however, do not really override each other. Each class has its member i, and the one to be retrieved is determines at compile time, according to the declared class of the reference (in this case Base).

14) (b)
Using new String(…) forces Java to actually create a new instance of a string. Strings defined as literals (e.g. String x=”hello”), however, will always be represented by the same object reference at runtime.

15) (a)

16) (c)
The method signature to be invoked is determined at compile time. Therefore, it depends on the declared type and not the runtime type of the argument/s.
Regarding the null parameter – the compiler can not determine which signature is the correct one, since all the three reads are legitimate.

17) (d)

18) (c)
Reducing visibility of methods when overriding is not permitted.

19) (d)
s2 still hard-references the string.

20) (b)
Tricky thing… The optimizer causes the second part of the condition not to be executed when the first part evaluates to false. Consequently, only when x reaches 10 y will become 1. Later, x will overflow and reach 10 again, and so on; until both counters reach 10 and the loop will stop.

Posted in Language, java | Tagged: , , | 6 Comments »

Understanding Generics limitations

Posted by Eyal Schneider on August 14, 2009

Generics were introduced to Java in version 5 (JSR 14), and since then they became standard and very popular. Although there has been a great demand by developers for this feature, many are unsatisfied regarding HOW it was implemented. The design that was adopted has a huge impact on the syntax and semantics of this feature.

Java generics provides the following benefits:

  • Type safety
  • Less explicit casts
  • More declarative APIs

In this post I list some of the limitations of Generics. All of them are derived from two fundamental principles of the Java Generics design:

1) Erasure

Type parameters (e.g. String in  List<String>) are completely removed by the compiler. Formal type parameters (e.g. E in the source code of List) are replaced by their bound (usually Object). As a result, the bytecode has no generic information at all, consequently it is not available at runtime. Unlike C++, at runtime Java shares the same class for different parameterized instantiations of it. These facts have three interesting benefits:

1) JVM implementations prior to version 5 can theoretically stay unchanged in order to support Generics
2) When declaring a parameterized version of a class C, we don’t need the source code of the generic type C
3) The volume of the bytecode does not increase as we add more parameterized types of the same generic class

2) Type safety guarantee
According to Gilad Bracha, the Java Generics lead designer, if your entire application has been compiled without unchecked warnings, it is type safe. “Type safe” in this case means that there will be no unexpected class cast exception, i.e. if a ClassCastException occurs, then it must be caused by an explicit cast in the code. This principle is very important – we don’t want the implicit casts added when compiling generic code to raise runtime exceptions, since they would be hard to understand and fix. 

 

Formal type parameters usage

You can’t expect the formal type parameter to be used wherever a type name can fit. For example, instantiating an object from a formal type parameter is not allowed:

public class NonSense <T>{
    T obj = new T() ; //Illegal
}

Due to erasure, the compiler must generate code for NonSense class to be shared among all instantiations, but it doesn’t know anything about T at this point, and can’t assume anything.
The equivalent code in C++ IS allowed, since the compiler creates a dedicated class per parameterized type of NonSense.
Note that replacing the line above with List<T> obj = new ArrayList<T>() would compile, since the compiler generates code for creating an ArrayList, and the formal type parameter is erased.

For the same reasons, the following code does not compile:

public class Something <T> extends T{
}
 

Hierarchy of parameterized types

Two different instantiations (with different type parameter) of the same parameterized type don’t have any hierarchy relation. For example,  List<String> is not a List<Object>. The following code snippet illustrates what would have happened if this were allowed:

List<String> ls = new ArrayList<String>();
ls.add("one");
ls.add("two");
List<Object> lo = ls; //Illegal
lo.add(new Integer(3)); //The list ls is not homogeneous anymore!
 

Arrays

According to the intuition, arrays should follow similar rules as collections (such as ArrayList) do, when it comes to Generics. The reality is a little painful, however. It is not allowed to create arrays of elements which are parameterized types (e.g. new ArrayList<Strings>[10]). If it were allowed by the compiler, we could have done something like:
 

1.    ArrayList<Integer>[] arrOfIntLists = new ArrayList<Integer>[10]; //Illegal
2.    Object[] arrOfObj = arrOfIntLists;

3.    ArrayList<String> listOfStrings = new ArrayList<String>();
4.    listOfStrings.add("hello");
5.    arrOfObj[0] = listOfStrings;

6.    ArrayList<Integer> list = arrOfIntLists[0];
7.    Integer num = list.get(0);

What happens here?

1) An array of ArrayList of Integers is created.
2) arrOfObj now points to the same array. This is a legal up-cast, since any array is also an Object array. There is no warning in this line.
3,4,5) A list of Strings is created and added to arrOfObj. Note that the addition does not produce an ArrayStoreException, since the runtime type of the array components is simply ArrayList (Due to erasure), and that’s exactly what we are storing.
6) We retrieve the list at position 0 in arrOfIntLists. There is no ClassCastException yet because the implicit runtime cast is a cast to ArrayList, which is the correct class.
7) When trying to retrieve the first item of the list of integers, we fail with a ClassCastException, caused by the implicit cast to Integer.

If this code were legal, it means that the type safety guarantee is broken. The code compiles with no warnings, but at runtime it results in an unexpected ClassCastException in the last line. This happens because we managed to store a string list in an array of integer lists. 

So now we know why creating an ArrayList<Integer>[] is illegal.  But why does the compiler allow creating an instance of ArrayList<ArrayList<Integer>>? Shouldn’t it lead to the same type safety problem? The answer is no. If we convert the example above to the analogical version that uses ArrayList instead of arrays, the second line will not pass compilation, since ArrayList<ArrayList<Integer>> is not a subtype of ArrayList<Object> (See “Hierarchy of parameterized types” above).

The same phenomenon may occur with generic methods that receive generic arrays:

public class MyUtils{ 
    public static <T> void fill(T[] arr, T value){
        for(int i=0;i<arr.length;i++)
            arr[i]=value;
    } 
}

This mini-utility looks type safe, but it isn’t. Surprisingly, the following code compiles with no errors/warnings:

String[] strArr = new String[10];
fill(strArr,3);

Why does this happen? When a generic method is called this way, the compiler performs type inference for determining what type T stands for. In our case, this is done by inspecting the parameter types, and finding the lowest common type shared by them in the class hierarchy. With fill(strArr,”hi”), the type is obviously String. However, in the example above the most specific type shared by strArr and 3 (which is converted to an Integer) is Object. This causes the code to be legal, but to fail with ArrayStoreException on runtime. Technically, this is not a violation of the type safety guarantee explained in the introduction, because the guarantee deals only with ClassCastException.

There are two simple ways to avoid this problem. The first is to be more specific with the call:

MyUtils.<String>(strArr,3) //Now it does not compile

The second is to fix the method’s signature as follows:

public static <T,S extends T> void fill(T[] arr, S value) 
 

Latent Typing

In C++, the implementation of a parameterized type can assume anything about the type parameter/s. For example, the implementation of MyList<E> may call any method on E, and the availability of this method is checked against the actual type parameters of the different instantiations at compile time.  Java does not allow it.  Due to the generics design, the compiler has to generate the full bytecode of the generic class, regardless of the instantiations, so calls to methods that are not bound to any class/interface can not be compiled.

Therefore, in Java you are encouraged to declare explicitly all your assumptions on a type parameter. This is done using bounds. For example:

class DrawableSortedSet<T extends Comparable<T>,Drawable>{
....
}

This clearly makes the code more declarative, but proponents of Latent Typing argue that the extra declarations are not really needed for runtime type safety (C++ approach proves this), and also that the declarative approach may limit the programmer and force him to define new artificial interfaces containing the desired methods.  

 

Bounds restrictions

As explained before, bounds allow us to specify constraints on the type parameters being used, both in a generic class or a generic method. These prerequisites give us more freedom. For example, assume that we want a method that manipulates a list of numbers.  We start by the following generic method:

public void manipulate(List<Number> list){
  ....
}

The problem here is that List<Integer> can not be passed as parameter. Only List<Number> is allowed (see “Hierarchy of parameterized types” above). If we want to pass lists of any descendant class of Number, we should use bounds:

public <T extends Number> void manipulate(List<T> list){ 
    ....
}

Now we can pass List<Integer>, List<Float> and so on,  but we face a new limitation. While in the previous version of the method we could call list.add(new Integer(10)) or list.add(new Double(3.2)), now the compiler will not accept it, because it can’t guarantee that the addition maintains the type homogeneousness in the list. If addition of Integers were allowed, then calling the method with a variable declared as List<Float> may result with a new non Float item in the list, and this is a violation of the type safety guarantee.

 

Static data members

Generic classes are not allowed to have references to their type parameters from their static data members or methods. For example, the following is illegal:

public MyClass<T>{
    private static T value; //Illegal

    public static void analyze(List<T> list){ //Illegal
        ....
    }
}

C++ programmers who learn Java may be tempted to do it, because in C++ this is legal. Remember that unlike C++, Java uses the same class at runtime for MyClass<Integer>, MyClass<String> etc. Static members, by definition, are bound to the class rather than to the instance. It follows from these 2 facts that static members are actually shared by all instances of all generic instantiation of the same generic type. If the code above were allowed, type safety could not be guaranteed anymore. Imagine for example what happens when an instance of MyClass<Integer> stores an Integer in value, and then an instance of MyClass<String> retrieves it as a String.

 

Generic information at runtime 

Because of erasure, there is no sense in checking generic information at runtime. For example, the following is illegal:

if (list instanceof List<String>){ ....}

 

Primitive types as parameter types

Unlike C++, Java does not allow primitive types to be type parameters. This limitation is not so significant, since we can always replace the primitive type with the corresponding wrapper, and we can also use auto-boxing/auto-unboxing in the access methods to simplify the code. However, boxing/unboxing can lead to a performance penalty. In order to eliminate it there is no choice but to code a new, dedicated version of the class for the specific primitive type.

 

Generic exceptions

It is not allowed to define a generic type that extends (directly/indirectly) some Throwable. Trying to do so implies that the programmer wishes to handle different parameterized versions of the same exception at runtime, such as in the following example:

try{
    ....
}catch(MyException<Byte>){
    ....
}catch(MyException<Integer>){
    ....
} 

The problem is that erasure does not allow it.  There is no way for the JVM to distinguish at runtime between MyException<Byte> and MyException<Integer>. They are both simply MyException at runtime. Therefore there is no sense in having generic exceptions.

 

Generic enums

Defining generic enums is not allowed: 

enum State <T>{ //Illegal
    STANDING,WALKING,RUNNING;

     private T attribute;
     public void setAttribute(T attribute){
         this.attribute=attribute;
     }

     public T getAttribute(){
        return attribute;
    }
}

The reason for this to be illegal is related to the static members limitation (see “Static data members” above). Enums are implemented according to the “Type Safe Enum” design pattern, where the enum values are static data members of the class. Therefore, they are not allowed to have any reference to the class’ type parameter in their declarations.

 

Conclusion

Generics are a big step forward for Java, making it support a popular programming style common to Ada, C++, C# and other languages. However, simple intuition is not enough for really understanding Java generics. Sometimes one has to go under the hood of the compiler in order to understand its limitations and use it correctly. There has been a proposal of re-working the design by canceling erasure (“Reified Generics”), but it seems like this idea is no longer in the feature list of Java 7 (see http://tech.puredanger.com/java7).

 

Links

http://www.angelikalanger.com/GenericsFAQ/JavaGenericsFAQ.html
http://www.angelikalanger.com/GenericsFAQ/FAQSections/Fundamentals.html
http://mindview.net/WebLog/log-0051
http://java.sun.com/j2se/1.5/pdf/generics-tutorial.pdf
http://www.ibm.com/developerworks/java/library/j-jtp01255.html
http://www.devarticles.com/c/a/Java/Generics-and-Limitations-in-Java/
http://java.sun.com/developer/technicalArticles/J2SE/generics/#TJVM

Posted in Language, java | Tagged: , , , , , , , , , , , , , , , , | 1 Comment »

OutOfMemoryError

Posted by Eyal Schneider on June 17, 2009

Any Java programmer has encountered at least once the disturbing OutOfMemoryError (OOME) message. The interesting thing is that this error occurs for different reasons, and the diagnosis and treatment in each case is completely different. In this post I will try to cover some of these cases, show how to reproduce them and how to deal with them.
Some important things to remember about OOME:

  • As the name implies, this is an Error, and not an Exception. Errors indicate a serious condition in the JVM. Catching Errors is usually useless – the code doesn’t have the means to recover from them. Since they are usually not caught, they result in some thread getting terminated.
  • The stack trace is not always present in OOME, since its creation also requires memory, which may not be available.
  • The circumstances where an OOME is thrown may vary between different JVM implementations, due to a different memory management model. The textual description of the error may also vary. This post refers to to my own experience with Sun’s JVM, although I believe that most of it is relevant for other implementations as well.
  • As explained above, OOME is not recoverable at runtime. Therefore, the right strategy is to avoid its recurring once it happens, by means of configuration, design, or hardware modifications.

 

Java memory organization

Let’s first start with a very short overview of how a Java process organizes its address space. Basically, the memory is divided into heap memory and non-heap memory.  Modern JVM implementations use a generational heap model, in order to improve garbage collection performance. In this model, the heap area is usually split into 3 generations: young, old (“tenured”) and permanent. The young generation is where new objects created with the new operator are allocated, and it is designed to hold “young” objects. Older objects are moved to the tenured generation during GC. The permanent generation is not common to all JVM implementations, and it differs from the other generations - it stores (among other things) metadata of classes that have been loaded. The permanent generation is sometimes regarded as part of the heap, and sometimes not (The heap sizing flags -Xmx and -Xms, for example, do not include the permanent generation size).
Non-heap memory is the additional memory space outside of the areas reserved for the heap, and it can be used for different purposes. I will refer to this memory as native memory.

java.lang.OutOfMemoryError: Java heap space

This common error indicates that an object creation failed because there was no available heap space for it. It will never be thrown unless the JVM tries a full garbage collection first, including removal of softly/weakly referenced objects. When the stack trace is present for this particular OOME, it is usually irrelevant – the specific allocation that could not be satisfied gives us no clue regarding previous allocations that probably caused the problem. Same with the thread where the error occured - It may occur in any thread, since they all share the same heap.
  
Following is a sample code that eventually generates this OOME:
List<String> list = new ArrayList<String>();
while(true)
    list.add(new String("Consume more memory!"));
Solutions:
1) Check for memory leaks
  • Use the memory tab in JConsole in order to examine the graph of memory usage of your application over time. Focus on the “Tenured generation”, and click on “perform GC” every once in a while, to see whether the used memory returns to some expected baseline level after every garbage collection.
  • Run your application with the following JVM flags: -verbose:gc and -XX:+PrintGCDetails. This will add console outputs with details on every garbage collection that takes place. Using these outputs, you can analyze the memory consumption pattern, and see whether the memory usage increases constantly where it shouldn´t.                                    

 The following tools can help you analyze the reasons for the memory leak:

  • Run jmap -histo <java process id> at different stages of your application execution. This command prints the number of instances and total used memory per Java class. You can identify classes that have an unreasonable number of instances, or classes for which the number of instances increases unexpectedly. This can limit the search space in your code, when looking for the bug.
  • Generate a heap dump using JConsole or jmap, and analyze it using a heap analyzer (e.g. JHat, IBM HeapAnalyzer). The heap dump contains the full state of the heap at the point of time where the dump was created. JHat allows examining this heap snapshot, navigating through the different instances, and even executing complex queries (OQL language). An advanced feature allows comparing two snapshots taken at different points of time, in order to identify the allocations that occurred between the two. If you want the dump to be created automatically when the OOME occurs, add the -XX:+HeapDumpOnOutOfMemoryError JVM flag.

2) Increase the heap size

Use the JVM flags -Xms and -Xmx to set the initial and maximal heap size, respectively. Make sure not to exceed the physical memory size, otherwise you encourage paging, which affects performance. Sometimes installing additional physical memory is required.

3) Reconsider the design

An inadequate design can easily consume too much heap space and cause this OOME. Sometimes, it is useful to change the design in a way that compromises some desired system property (such as performance), but reduces memory use dramatically. For example, disk/database storage can replace some of the data structures in RAM, and memory sensitive mechanisms such as weak/soft references can be introduced.

java.lang.OutOfMemoryError: PermGen space

When the permanent generation becomes full, this error occurs, regardless of the available space in the object allocation area (young + old generations). Note that this specific message is issued by Sun’s JVM, and other JVM implementations may have different messages, specially those implementations with a different policy for storing class information (such as IBM’s JVM).
I will reproduce this OOME in two different ways. The first one uses a custom class loader to load too many classes. Assume that we have a custom class loader named MyClassLoader, capable of loading the class Dummy:
List<Class<?>> classes = new ArrayList<Class<?>>(); while(true){ MyClassLoader cl = new MyClassLoader(); try{ classes.add(cl.loadClass("Dummy")); }catch (ClassNotFoundException e) { e.printStackTrace(); } }
Note that we are actually loading the same class over and over, with a different class loader (remember that a class is identified by the combination of its name and its class loader object). Also, note that we store references to the loaded classes. If we don’t do so, the garbage collector realizes that the classes are not used, and performs garbage collection on the permanent generation.

A second way of filling up the permanent generation, is by exploiting the String intern pool. The class String manages a pool of strings that have been interned, using String.intern(…). This pool is also a part of the permanent generation.

List<String> list = new ArrayList<String>();
int i=0;
while(true)
    list.add(("Consume more memory!"+(i++)).intern());

The examples above highlight the common cases where this OOME occurs in practice:

  • Applications which load classes dynamically, such as web servers and application containers
  • A class loader with a memory leak
  • Applications that make use of string interning, in order to improve performance or reducy heap usage

Solutions:

Start by analyzing. A good place to start with is the stack trace. If it contains the java.lang.String.intern(..) call, then the reason is probably string interning. Alternatively, if it contains the java.lang.ClassLoader.defineClass(..) call, then you are probably dealing with a class loading problem. In the latter case, you may want to track the loading/unloading of classes, using the JVM flags -XX:+TraceClassLoading(or -verbose:class) and -XX:+TraceClassUnloading. Also, make sure you don’t have the flag -Xnoclassgc, because it prevents GC on classes, and can easily cause this OOME in applications that keep loading classes dynamically. As a part of the analysis, you can also run the utility jmap -permstat <process id> (not available for Windows). It provides details on all class loaders that have been used since the application was launched. 

Once you have pinpointed the problematic piece of code, check whether the design can be changed to consume less memory. If this is not an option, use the -XX:MaxPermSize flag to increase the maximal size of the permanent generation (e.g. -XX:MaxPermSize=256m).
Remember that increasing the values of -Xmx and -Xms flags instead will not change a thing. As explained before, these flags control the size of the young+old generations only, and do not affect the permanent generation.

 

java.lang.OutOfMemoryError: unable to create new native thread

Whenever we start a a  thread in Java, the JVM allocates space in native memory, designated for the thread’s call stack. This space is freed only after the thread is terminated. In case that there is not enough native memory for the stack space, we get this error. This can easily be reproduced with the following thread spawner code:
while(true){
    new Thread(new Runnable(){
        public void run() {
            try {
                Thread.sleep(60*60*1000);
            } catch(InterruptedException e) { }        
        }    
    }).start();
}  
Solutions:
This error is common in applications that try to manage too many live threads, as in the example above. As always, first ask yourself whether the design is reasonable. There are some generic solutions for reducing the number of threads. For example, an application that creates a thread per connection has a scalability problem in its design. The IO multiplexing feature introduced in Java 1.4 can solve this, and reduce dramatically the number of threads (See class java.nio.channels.Selector).
If you wish to try solving this problem by changing the memory settings only, then any of the following can help:
1) Reduce the heap size using -Xmx and -Xms flags. Sometimes, inexperienced programmers increase the heap size when they encounter this OOME, but it only makes the condition worse.
2) Reduce the permanent generation size using -XX:MaxPermSize
3) Reduce the size of memory allocated per stack using -Xss (e.g. -Xss512k). The default values depends on the environment, and range between256k and 1024k. Remember that when you reduce this value, you are increasing the risk of StackOverflowError, specially if you have deep recursion somewhere in your code or libraries that you use.
  

Other native memory OOMEs

There are many ways to use native memory other than thread stack allocation discussed before. Allocations can be made by the JVM itself, third party libraries, utilities in the JDK libraries,  and JNI code that you write. When the native memory gets exhausted, we may get an OOME of some kind. They are not consistent with their message, but one of them looks as follows:
 
java.lang.OutOfMemoryError: requested <bytes> bytes for <reason>. Out of swap space?
 
We can also cause a different message, by using the direct buffer feature of the java.nio package, in order to consume all available native memory:
List<ByteBuffer> list = new ArrayList<ByteBuffer>();
while(true)    
    list.add(ByteBuffer.allocateDirect(10000000)); 
 
In this case we will get the message:
java.lang.OutOfMemoryError: Direct buffer memory
 
 Solutions:
1)  Check whether there is a memory leak in some native code you use with JNI
2) Reduce the heap size using -Xmx and -Xms flags, to make more space for native memory.
3) Increase the swap space on disk. The way to do this depends on the operating system you use.
 
 

Summary

There are several types of OutOfMemoryError messages. It is important to know how to identify and analyze them,  and then how to find an appropriate solution. Solutions are in terms of prevention and not recovery at runtime.  A recovery attempt after an OOME may be too late - the data integrity may be broken, and there is no guarantee that there will be enough resources to complete it successfully. Instead, we should re-launch the application with a different setup.
An automatic recovery mechanism that occurs before the OOME can be effective, though. Since Java 1.5, we have the MemoryMXBean class, as part of the JMX services. An example of such a memory tracking mechanism was written by Dr. Heinz Kabutz in his 92nd issue of his newsletter.

Posted in JDK packages, java | Tagged: , , , , , , , , , , , , , , , , , , , , | Leave a Comment »

SimpleDateFormat pitfalls

Posted by Eyal Schneider on May 29, 2009

The class SimpleDateFormat is a very popular class, allowing flexible formatting and parsing of dates in Java. Formatting consists of transforming a Date into a String in a predefined format, and parsing performs the inverse operation.

In many cases, programmers fail to use SimpleDateFormat correctly, due to inattention to some (well documented) characteristics of this class. Maybe this will not be new for some of the readers, but there are 3 important things to remember when using SimpleDateFormat.

Time zone 

Every instance of SimpleDateFormat has an associated time zone to work with, as a TimeZone instance. Otherwise, a parsing operation would not be able to determine an absolute point in time (Date object), from a given string denoting a date. If not specified otherwise, the time zone associated with a new SimpleDateFormat is the default one, corresponding to the time zone reported by the operating system.

Consider the following code:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

Since the time zone is not explicitly set (using setTimeZone(..)), the code above will result in different outputs when executed in different parts of the world. The midnight that started July 18, 1976 at Thailand, and the same midnight in Florida, have a different GMT representation. The reason is that the midnight arrived in Thailand first.

In many cases, the local default time zone is exactly what the programmer wants. However, on a client-server application, or on a distributed system, we usually want to avoid this time zone relativity. Date strings that are read from configuration / data files / user interface / external software interfaces must be specific with their time zone. We can achieve this in one of two ways. The first one is to include the timezone itself as a part of the expected date format:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy z");
Date date = sdf.parse("18-07-1976 ICT");
System.out.println(date.getTime());

This program will always return the same value (206470800000), regardless of the execution location.

The other approach is to define a convention which needs to be respected by all system entities which generate textual dates. For example, we can decide that all dates should be interpreted as GMT dates. Then, we can bind the corresponding time zone to the SimpleDateFormat. For example:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

 This program will also return the same value everywhere in the world (206496000000). 

Validation

Assume  that we are reading a date in the format dd/MM/yyyy from the GUI. We want to validate that the date is well formatted and legal. Consider the following validation code, which makes use of the ParseException thrown by the parse method:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
Date d = null;
try {
    d = sdf.parse(dateStr);
} catch (ParseException e) {   
    showErrorMessage("Incorrect date format. "+e.getMessage());
}

Will it do the job? The answer is – not always. Date strings such as “10//10/2008″ will result in an error, but dates such as “40/20/2009″ will be parsed successfully, with no error message!. The reason is that by default, SimpleDateFormat is lenient, which means that it is tolerant to some formatting errors. It will try to figure out the date by applying some heuristics to the given text. In the case above, month 20 will be interpreted as December plus 8 months (i.e. August 2010), and day 40 of the month will be interpretted as end of August plus 9 days. The resulting date will be September 9th, 2010.

In order to achieve a stricter validation, we should make the SimpleDateFormat non-lenient, by initializing it as follows:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
sdf.setLenient(false);

Thread safety 

SimpleDateFormat is not thread safe. Concurrent formatting/parsing operations interfere with each other and result in inconsistent behavior. There are 3 ways to deal with this:

1) Create a SimpleDateFormat instance per usage
This will work fine if the usage is not frequent. Otherwise, the performance penalty may be significant.

2) Synchronization
We can achieve thread safety by synchronizing all usages of the SimpleDateFormat instance with the same lock. For example:

private SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
...
private Date parse(String dateStr) throws ParseException{
    synchronized(sdf){
        return sdf.parse(dateStr);
    }
}
private String format(Date date){
    synchronized(sdf){
        return sdf.format(dateStr);
    }
}

The disadvantage of this approach is that in case of many threads working with this code, thread contention will affect performance.

3) Use an instance per thread
The usage of ThreadLocal class allows associating a single instance of SimpleDateFormat with each thread that needs to do parsing/formatting of dates.
This way, we achieve both thread safety with no thread contention, and a minimal number of SimpleDateFormat instances.
In some cases, we can be extra economical on memory usage, if we know for example that the threads stop using the SimpleDateFormat at some point of time. In these cases, we could use a soft reference to hold the SimpleDateFormat instance.
An explanation of this technique plus sample code can be found here.

Links

http://www.javaspecialists.eu/archive/Issue172.html
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Posted in JDK packages, java | Tagged: , , , , , , | 1 Comment »

Having performance issues with regex?

Posted by Eyal Schneider on May 21, 2009

Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet, one should really be an artist in order to use it right. Assuming that a regular expression is proved for correctness, it may still run extremely slow (even take hours) if it is not wisely written.

Continue reading for understanding the origin of the problem, or jump directly to the last section, containing 10 useful performance tips for regular expression writers in Java.

Can it really be so slow?

 Suppose that we want to match only strings composed of sequences of ‘a’s or ‘b’s. A legal regex would be:

(a*b*)*

However, if you run this regex against the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” for example, it takes several minutes to terminate and report that there is no match!
Of course, a better regex in this case would be:

(a|b)*

This one takes less than a millisecond on my machine, on the same input. Obviously, there is a performance issue here.

Why does this happen?

 Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. The engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking.
In the catastrophic example above, the engine will actually explore ALL the decompositions of the series of ‘a’s into smaller series, before it realizes that there is no match. This example shows how the backtracking algorithm can result in exponential time evaluation (in the length of the input sting). It also reveals an important property of NFA engines: the worst cases will always be inputs that almost match the pattern. If a match is found, the exploration stops.

The other main approach for regex engines is DFA (Deterministic Final Automata). In this approach, the regex compilation actually builds an automaton, to be used when traversing input strings. Inputs are traversed character by character, with no going back. This guarantees linear timein the length of the input string, regardless of the regex complexity. Instead of trying different match possibilities serially (as in NFA), DFAsimulates a parallel scanning on all possibilities.

So why does Java (and .NET, Perl, Python, Ruby, PHP etc) use NFA and not DFA, which has much better asymptotic behavior? The reason is that NFA has some significant benefits:

  • Its compilation is faster, and requires much less memory
  • It allows some powerful features (See Sun’s tutorial for details on them):
    • Capturing groups and back references
    • Lookarounds
    • Advanced quantifiers (Possessive and Reluctant)

It is important to note that the popular terms NFA and DFA for regex engines are inaccurate. Theoretically speaking, these 2 models have the same computation power, meaning that there can’t be a matching rule that can be expressed in one of them but not in the other. In practice, there was a need for more features, so the two implementation types diverged in their semantics. NFA engines were given more power, making them superior to DFA engines in computation power.

Due to the speed of DFA and the unique features of NFA, there are 2 more “integrative” approaches for regex implementations. Some implementations use both engines (e.g. GNU egrep, which chooses the specific engine at runtime), and some of them really implement a hybrid version (e.g. Tcl regex), enjoying all the benefits.

Tips

 Following are some tips on how to avoid regex efficiency issues in Java. Many of them are aimed at reducing backtracking.

 1)  Pre-compile
Trivial, but worth mentioning. If you use a regular expression more than once, remember to compile it first as a pre-process step:
Pattern p = Pattern.compile(regex,flags);

//Usage stage
Matcher a = p.matcher(input);

 2)  Reluctant quantifiers vs greedy quantifiers
The default quantifiers (* + and ?) are greedy. That means that they start by matching the longest possible sequence, and then they step back gradually if backtracking is needed. In case that you know that the match is usually short, you should use reluctant quantifiers instead. They start by the smallest match, and progress if needed.
For example, suppose that you want to match only strings containing the sequence “hello”. The regex .*hello.* will do the job, but if you know that ‘hello’ usually appears near the beginning of the text, then .*?hello.* will run faster on average.

 3)  Use possessive quantifiers where possible
Unlike reluctant quantifiers which affect performance but do not affect regex behavior, possessive quantifiers may really change the meaning of the regex.
When using *+ instead of *, the first attempted match will be greedy (i.e. longest match, just as with *), but there will be no backtracking in case of failure, even if this causes the complete match to fail. When does this become useful?
Suppose that you need to match text in quotes. The regex \”[^\”]*\”will work fine. However, it will do unnecessary backtracking on negative cases (e.g. “bla bla bla). Using the regex \”[^\”]*+\”instead will eliminate the backtracking, without changing the regex meaning.

Independent grouping can have the same effect, and it allows even more control (See Sun’s tutorial ).

4)  Avoid capturing groups
Any expression in parentheses is considered a capturing group by default. This has a small performance impact. Make your groups “non-capturing groups” when possible, by beginning them with (?: instead of with (.

5)  Use alternation wisely
When using alternation (e.g. Paul|Jane|Chris), the order in which the engine tries to match the options is the same order in which they appear. Therefore, you can take advantage of this property, and order your options from the most common one to the less common one. This will improve the average time of positive matches.

 6)  Avoid multiple interpretations
Write your regex in a way that minimizes the number of different ways to match a particular input string. For example, the regex (a*b*)*given in the beginning of this article, allows interpreting the string “aabb” in too many ways:
(a2b2)
(a1)(a1)(b1)(b1)
(a2)(b2)
(a1)(a1b2)
etc…

The regex (a|b)* on the other hand, forces a unique interpretation on a positive input.
This is very important in order to reduce backtracking, in cases of almost a match.

 7)  Lookarounds
Lookarounds  allow adding restrictions on sequences to the left/right of current position. Specifically, with negative lookahead you can search strings that do not contain some sequence (cumbersome thing to do without this feature!). How can this help improve performance?

Suppose you want to capture the URL from a link tag. Consider the following regex:
<a .* href=(\S*).*/>
For legal tags it will find a match and capture only the href attribute as required (\S stands for a non-whitespace character) .  On some illegal tags, however, exessive backtracking will occur. For example: “<a href= href=href=…. href=something”.
The following regex will prevent this, by replacing the “.*” expression with something more specific, that does not match “href’”:
<a ((?!href).)* href=(\S*)((?!href).)*/>

 8)  Specify lengths
Java has a regex optimizer that checks the input string’s length against the min/max length derived from the regex contents. This allows failing the match in some cases immediately. In order to help this mechanism, specify repetition number on groups whenever possible (e.g. [01]{6}, which matches all binary strings of length 6).

 9)  Extract mandatory strings
Sometimes, strings which are mandatory are hidden inside groups or alternations:
(hello|hell|heel)
This expression could be simplified to:
he(llo|ll|el)
By doing so, the regex optimizer has more information to work with.

 10)  Benchmark your regex
When the regex is used in a performance critical area of your application, it would be wise to test it first. Write a micro-benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.

 

Links:

 http://java.sun.com/docs/books/tutorial/essential/regex/index.html
http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page=1
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics-of-Expression-Processing/

Posted in JDK packages, java | Tagged: , , , , , , | 3 Comments »