The Java Explorer

Tips and insights on Java

  • Subscribe

  • If you find this blog useful, please enter your email address to subscribe and receive notifications of new posts by email.

    Join 37 other followers

Posts Tagged ‘java’

Comparators – part 3

Posted by Eyal Schneider on April 30, 2012

In this post and in this one, I illustrate how a bad implementation of comparators can cause annoying bugs.
Lately I saw a strangely looking Comparator implementation, which looks like this:

class WeirdComparator implements Comparator<Long> {
    @Override
    public int compare(Long a, Long b) {
        if (a == null || b == null)
            return 0;
        if (a > b)
            return 1;
        if (b > a)
            return -1;
        return 0;
    }
}

The implementor decided not to rely on the natural ordering of Long, since it doesn’t handle null values, which are acceptable in his use-case.

The first two lines caught my eye: they are obviously wrong in terms of the Comparator contract, but it is not immediately clear what kind of damage they can cause. The contract of Comparator/Comparable states that the implementation must define a total order on all values of the type it handles. That means, among other things, the transitivity property:
If compare(b,a)>=0 and compare(c,b)>=0, then compare(c,a)>=0. With the current implementation, compare(null, 2L)>=0 and compare(1L, null)>=0, but compare (1L, 2L)<0.

What happens if we sort using this comparator?

    Long[] arr = {100L, 101L, 102L, 103L,
                  50L , null,  51L,  52L};
    Arrays.sort(arr,new WeirdComparator());
    System.out.println(Arrays.toString(arr));

The output in this case is a mess: “[50, 100, 101, 102, 103, null, 51, 52]”.

Java uses the Mergesort algorithm for sorting arrays of objects, and a variation of the Quicksort algorithm for sorting arrays of primitive types. Java’s Mergesort implementation has an optimization that defines a threshold of 7, such that any array range with a smaller size is not sorted recursively with Mergesort, but rather with Insertion sort.
The array size in the example is 8, so the two halves are sorted with Insertion Sort (leaving them unchanged), and then they are merged. In the merge phase, the value 50 is chosen first, and then the new head of the second half is null, which is defined to be equal to any value. This causes all the contents of the first half to be copied to the output.

The faulty comparator is also a disaster for sorted sets:

    SortedSet<Long> set = new TreeSet<Long>(new WeirdComparator());
    set.add(null);
    set.add(10L);
    set.add(20L);
    System.out.println(set);

Although we add 3 values, only the null value remains. As explained in my other post, TreeSet respects the Set invariants which prohibit duplicate values, and it uses the comparator to determine equality. Therefore, the values 10L and 20L are considered duplicates of null and aren’t added to the tree.

Posted in java | Tagged: , , , , , , , | 1 Comment »

Strings and Memory leaks

Posted by Eyal Schneider on October 27, 2009

java.lang.String is one of the mostly used classes in Java. Interestingly,  it is important to know some of its implementation details in order to avoid memory leaks.
Consider the following seemingly innocent piece of code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

 

public class StrangeStringThing {
    private static Random rnd = new Random();
    private static final int MEGA = 1024*1024;
 
    private static String createRandomStr(int length){
        StringBuilder sb = new StringBuilder(length);
        for(int i=0;i<length;i++)
            sb.append((char)('a'+rnd.nextInt('z'-'a'+1)));
        return sb.toString();
    }
 
    public static void main(String[] args){
        List<String> substrings = new ArrayList<String>();
        for(int i=0;i<100;i++){
            String randomStr = createRandomStr(MEGA);
            String subStr = randomStr.substring(1000,1004);
            substrings.add(subStr);
        }
    }
}

This code generates a very long random string, and then keeps only a 5 chars long substring of it. This process is repeated 100 times.  We expect to end this program with 100 strings of length 5.
Surprisingly, if we try to run this code with the heap size limited to 100MB (-Xmx100m), we get an OutOfMemoryError.  How did that happen? all temporary large strings should have been cleared by the garbage collection, so there should be a really small memory footprint. Furthermore, the documentation of java.lang.String does not indicate any misuse of the API.

How are Java Strings stored?

The Java language specification defines precisely the API of class String, its Unicode details, its immutability, and the way String literals should be handled. Naturally, it does not define how Strings should be  represented internally, and what are the required time/space complexities of the different operations on them. These aren’t functional requirements. As a consequence, different JVM providers have the freedom to choose their preferred design for class String. Sun’s implementation (and also IBM’s) represent a string with a combination of 3 data members:

  • value – A reference to a char array
  • offset – A start index, inside the array
  • count – The actual length of the string.

When allocating a new string, the char array contains exactly the string content. Offset is set to 0, and count is set to the char array length. Then, when calling the method substring(..) upon it, the new string being returned contains a reference to the same char array, but its offset and count members are modified, reflecting the requested subsequence of chars. The sharing of the same char array by multiple String instances is possible, since strings are immutable. There are two benefits of this implementation approach in comparison to a substring implementation based on copying:
1) Memory usage is usually reduced, specially in cases where many substrings of the same string are taken, or if the substrings are long
2) substring(..) runs in constant time, instead of linear time

 Obviously there is a tradeoff here – if the string utilization pattern is not as described in (1), we may suffer from excessive memory consumption. The code above demonstrates this edge case: we take very small substrings of very large strings, where the latter are temporary. The temporary strings (referenced by randomStr) ARE collected by the garbage collector during the loop. However, their internal char arrays can not be collected, since they are being shared with the sub-strings. Therefore, the very long char arrays are kept unintentionally in memory, and that constitutes a memory leak.

The solution is simple – forcing the substrings to be copied into a new compact char array. This is done by replacing line 22 above with:

  String subStr = new String(randomStr.substring(1000,1004));

It looks like the API designers were aware of the the possibility of substring being implemented as described, so they added a copy constructor specifically for avoiding the memory leak. Note that otherwise this constructor is useless, because we are dealing with an immutable class.

Is this a Java bug?

There is currently an open bug report in Sun’s bug database regarding this issue.  It doesn’t look like it will be fixed soon, because doing so may introduce performance regression in many existing programs. Besides, there is a very simple workaround (described above), what makes the priority of this bug relatively low.

Summary

Java developers should be aware of the prevalent String.substring(..) implementation, and its memory leak potential risk. Whenever we know that the original string we take the substring from has a shorter life span than the substring itself, we should consider using the copy constructor to avoid sharing of the underlying char array.

Posted in java, JDK packages | Tagged: , , , , , , , | 5 Comments »

Java Quiz

Posted by Eyal Schneider on September 17, 2009

This time I posted a 20 questions quiz, developed for testing general Java skills.
Most of the questions address fundamental topics in Java(5), but the test is not a trivial one.  The right answers plus their explanations can be found on the bottom. You are welcome to add your comments.

Enjoy, and good luck!

Question 1

public class CrazyCount {
    private static volatile int count = 0;

    public static void main(String[] args) {
        for(int i=0;i<1000;i++)
            new Thread(new Runnable(){
                public void run() {
	           count++;
                }
            }).start();

        //.....
    }
}

What will be the value of count after all 1000 threads finish their work?

(a) Garbage, because the writes are not synchronized
(b) A value between 1 and 1000
(c) Exactly 1000
(d) At least 1000
Answer

Question 2

Consider the following multi threaded producer/consumer model:

 public class ProducerConsumer {
    private Job currentJob = null;

    public synchronized void addJob(Job job) throws InterruptedException{
        while (currentJob!=null)
            this.wait();
        currentJob=job;
    }

    public synchronized Job consumeJob(){
        Job res = currentJob;
        res.process();
        currentJob=null;
        this.notify();
        return res;
    }
}

Which of the following is correct?

(a) addJob will wait till the current job is executed, but consumeJob will never block.
(b) Jobs will be executed in the exact order in which they are added with addJob.
(c) This code may not work because of deadlocks
(d) ‘while’ in addJob can be replaced by ‘if’, without changing the behavior.
Answer

Question 3

Consider the following Singleton design pattern implementation:

public class ConnectionsMgr {
    private static ConnectionsMgr instance = null;

    public static ConnectionsMgr getInstance(){
        if (instance==null){
	    synchronized(ConnectionsMgr.class){
		if (instance==null)
		    instance = new ConnectionsMgr();
	    }
    	}
	return instance;
    }

    //...
}

Why is this implementation problematic?

(a) An uninitialized ConnectionsMgr object may be returned
(b) Synchronizing on ConnectionsMgr.class is not allowed
(c) The instance may be created more than once
Answer

Question 4

Which of the following statements regarding the final reserved word is correct?

(a) A final method can not be overriden by subclasses, and a final class can not be subclassed
(b) A final data member must be given a value in its declaration line
(c) A local variable declared as final can still be mutated using setter methods
(d) (a) and (c) are correct
Answer

Question 5

public static void main(String[] args) {
    try{
	String s = null;
	s=s.substring(10);
    }
    catch(Exception e){
	System.out.println("Caught exception!");
    }
    catch(RuntimeException e){
	System.out.println("Caught a runtime exception!");
    }
}

This code:

(a) Displays “Caught exception!” when executed
(b) Displays “Caught runtime exception!” when executed
(c) Runs without throwing an exception
(d) Does not compile
Answer

Question 6

Which of the following statements regarding exceptions is correct?

(a) IOException can be thrown without the need to declare it in the method header
(b) catch(Exception) will catch OutOfMemoryError
(c) ClassCastException can be thrown without the need to declare it in the method header
(d) (b) and (c) are correct
Answer

Question 7

Which of the following is the preferred way to stop a thread?

(a) Calling the Thread’s interrupt() method
(b) Calling interrupt() and then stop()
(c) Calling the stop() method
(d) Calling join() to wait for the thread to end
Answer

Question 8

public class Swap {
    public static void swapStrings(String x, String y){
	String temp = x;
	x=y;
	y=temp;
    }

    public static void main(String[] args) {
	String a = "1";
	String b = "2";
	swapStrings(a, b);
	System.out.println("a="+a+" ,b="+b);
    }
}

What will be the output when executing this main?

(a) An exception will be thrown
(b) “a=2 ,b=1”
(c) “a=1 ,b=2” because strings are immutable
(d) “a=1 ,b=2” because Java passes parameters by value
Answer

Question 9

public static void main(String[] args) {
    ArrayList<Point> list = new ArrayList<Point>();
    list.add(new Point(100,200));

    ArrayList<Point> listCopy = (ArrayList<Point>) list.clone();

    listCopy.get(0).setLocation(150,250);

    System.out.println(list);
    System.out.println(listCopy);
}

What will be the output of this main?

(a) [java.awt.Point[x=100,y=200]]
[java.awt.Point[x=100,y=200]]

(b) [java.awt.Point[x=150,y=250]]
[java.awt.Point[x=100,y=200]]

(c) [java.awt.Point[x=100,y=200]]
[java.awt.Point[x=150,y=250]]

(d) [java.awt.Point[x=150,y=250]]
[java.awt.Point[x=150,y=250]]
Answer

Question 10

Assume that class B extends class A.
Consider a method that is declared as follows:  public void callMe(List<A> l)

Which of the following data types can be passed as parameters to callMe(..)?
(a) List<B>
(b) ArrayList<B>
(c) ArrayList<A>
(d) All answers are correct
Answer

Question 11

Assumes that class B extends class A.

public void callMe(List<? extends A> l){
    l.add(new A());
    l.add(new B());
}

This method:

(a) Compiles, but with a warning
(b) Does not compile
(c) Runs correctly
(d) May cause ClassCastException
Answer

Question 12

What will be the output of the following code?

class Base {
    int i = 10;

    public Base(){
        System.out.println(get());
    }

    public int get(){
        return i;
    }

}
public class Derived extends Base{
    int j = 20;

    public int get(){
        return j;
    }

    public static void main(String argv[]){
        Base b = new Derived();
    }
}

(a) 0
(b) 10
(c) 20
(d) None of the above
Answer

Question 13

What will be the output of the following code?

class Base {
    protected int i = 10;

    public int get(){
        return i;
    }
}

public class Derived extends Base{
    protected int i = 20;

    public int get(){
        return i;
    }

    public static void main(String argv[])
    {
        Base b = new Derived();
        System.out.println(b.i);
        System.out.println(b.get());
    }
}

(a) 10
10

(b) 10
20

(c) 20
10

(d)
20
20
Answer

Question 14

What is the output of the following code?

String s1 = new String("Test");
String s2 = "Test";
if (s1==s2)
    System.out.println("Same");
if (s1.equals(s2))
    System.out.println("Equals");

(a) Same
(b) Equals
(c) Same
Equals
(d) No output
Answer

Question 15

What is the output of the following code?

class Parent{
    private void f(){
        System.out.println("Parent.f()");
    }

    public void g(){
        System.out.println("Parent.g()");
        f();
    }
}

public class Child extends Parent{
    public void f(){
        System.out.println("Child.f()");
    }

    public static void main(String args[]){
        Parent p = new Child();
        p.g();
    }
}

(a) Parent.g()
Child.f()
(b) Parent.g()
Parent.f()
(c) The code does not compile
(d) An exception is thrown
Answer

Question 16

Consider the following overloading scenario:

public class Overloader{
    public static void read(String s){
        System.out.println("read(String)");
    }

    public static void read(Integer i){
        System.out.println("read(Integer)");
    }

    public static void read(Object o){
        System.out.println("read(Object)");
    }

    public static void main(String args[]){
        Object s = new String("Java");
        Integer i = 10;
//      Overloader.read(s);
//      Overloader.read(i);
//      Overloader.read(null);
    }
}

What would be the output of each of the 3 read commands in comment?

(a) read(s) prints “read(String)”
read(i) prints “read(Integer)”
read(null) prints “read(Object)”

(b) read(s) prints “read(Object)”
read(i) prints “read(Integer)”
read(null) prints “read(Object)”

(c) read(s) prints “read(Object)”
read(i) prints “read(Integer)”
read(null) does not compile

(d) None of the above
Answer

Question 17

Which of the following statements is correct?

(a) A daemon thread is terminated when all non-daemon threads terminate
(b) A thread is created as a daemon thread if and only if it is created by a daemon thread
(c) Non daemon threads may keep the application from shutting down, even if the main thread is terminated.
(d) All of the above are correct
Answer

Question 18

class Parent{
    protected void x(){}
    public void y(){}
}

public class Child extends Parent{
    public void x(){}
    protected void y(){}
}

When compiling this code:
(a) It compiles successfully
(b) Compilation error – x can’t have its visibility increased
(c) Compilation error – y can not have its visibility reduced
(d) Compilation error – neither x nor y can have their visibility changed
Answer

Question 19

String s = new String("Wanna live forever!");
WeakReference<String> r1 = new WeakReference<String>(s);
SoftReference<String> r2 = new SoftReference<String>(s);
String s2 = r1.get();
s = null;

Assuming that no garbage collection took place, what is the reference strength of the string created at line #1, after these lines of code?
(a) Not referenced
(b) Weakly referenced
(c) Softly referenced
(d) Strongly referenced
Answer

Question 20

int x=0;
int y=0;
while(true){
    if ((++x==10) && (++y==10))
        return;
}

When executed, this code:
(a) Performs 10 iterations and ends
(b) Performs more than 10 iterations and ends
(c) Performs less than 10 iterations and ends
(d) Enters an infinite loop
Answer

Answers

1)  (b)
A race condition between threads may cause a a write to cancel other writes, thus multiple concurrent “incrementations” may result in a single incrementation.

2) (a)
(b) is incorrect, since there is no guarantee about the order in which waiting threads are awakened using Object.notify().
(d) is incorrect because when a producer thread is awakened, there is no guarantee that another producer thread was not awakened just before, setting the current job to another value.
Besides, Java documentation recommends always to use ‘while’ loops on wait() rather than ‘if’.

3) (a)
The line “instance=new ConnectionsMgr()” is not atomic. It results in multiple machine instructions, one of which is the reference assignment itself. Due to the Java memory model rules, this specific instruction is allowed to be reordered with respect to the other instructions. This reordering may cause the assignment to take effect before the object is initialized. Another thread calling getInstance() may therefore receive a non null reference to an uninitialized memory block.  See this article for more details.

4) (d)
An object data member or a local object variable declared as final can not change its address. However, the referenced object can be freely mutated.
(b) is not true because the member can be initialized in the constructor/s also.

5) (d)
RuntimeException derives from Exception, so the ordering of the catch blocks makes the second unreachable.

6) (c)
ClassCastException is an unchecked exception, so it does not need to be declared. IOException, however, is a checked exception, and you must either catch it or declare the method as “throws IOException”. OutOfMemoryError is a Throwable, but it derives from Error and not from Exception.

7)  (a)
Thread.stop() is a deprecated method that is not recommended for use. It stops a thread violently without guaranteeing anything about the integrity of the data manipulated by that thread. This may cause corrupt data to be viewed by other threads. Furthermore, there are some cases in which Thread.stop() will not terminate a thread.
Thread.interrupt() is preferred, but it relies on the code run by the thread to be friendly and react to the interruption.

8) (d)
Since Java passes parameters by value (not by reference!), there is no way to implement a swap method in Java. In C++ for example, it is possible, using references (&).

9) (d)
Collections such as java.util.ArrayList do not perform deep copying when cloned. Item references stay the same.

10) (c)
(a) and (b) are incorrect because if legal, they would have break the type safety. The method callMe could have added an instance of type A, and the caller now has a collection of both A’s and B’s, while it assumes that it has only elements of type B. See my previous post for more details.

11) (b)
Assume that C descends from B. The method callMe may receive as a parameter an object of type List<C>. If the additions were considered legal, this would result in a list defined as List<C> containing instances of super classes such as B and A. That breaks the generics type safety, since it will not protect the collection from an unexpected ClassCastException. See my previous post for more details.

12) (a)
Overriding methods that are called directly/indirectly by the superclass’ constructor should be done with care. If the overriden method uses data members belonging to the subclass, then they are still not initialized at that point! In this example, the data member is of type int, so the default initialization will give it a value 0.
For more explanations and a workaround for this problem, see posts 86 and 86b of the Java Specialists’ Newsletter.

13) (b)
Methods are invoked virtually – the exact method to be executed is determined at runtime according to the actual class of the object on which it was called. Data members, however, do not really override each other. Each class has its member i, and the one to be retrieved is determines at compile time, according to the declared class of the reference (in this case Base).

14) (b)
Using new String(…) forces Java to actually create a new instance of a string. Strings defined as literals (e.g. String x=”hello”), however, will always be represented by the same object reference at runtime.

15) (b)

16) (c)
The method signature to be invoked is determined at compile time. Therefore, it depends on the declared type and not the runtime type of the argument/s.
Regarding the null parameter – the compiler can not determine which signature is the correct one, since all the three reads are legitimate.

17) (d)

18) (c)
Reducing visibility of methods when overriding is not permitted.

19) (d)
s2 still hard-references the string.

20) (b)
Tricky thing… The Java language specification specifies that the second operand of && is never evaluated when the first operand evaluates to false (See this section). Consequently, only when x reaches 10 y will become 1. Later, x will overflow and reach 10 again, and so on; until both counters reach 10 and the loop will stop.

Posted in java, Language | Tagged: , , | 10 Comments »

OutOfMemoryError

Posted by Eyal Schneider on June 17, 2009

Any Java programmer has encountered at least once the disturbing OutOfMemoryError (OOME) message. The interesting thing is that this error occurs for different reasons, and the diagnosis and treatment in each case is completely different. In this post I will try to cover some of these cases, show how to reproduce them and how to deal with them.
Some important things to remember about OOME:

  • As the name implies, this is an Error, and not an Exception. Errors indicate a serious condition in the JVM. Catching Errors is usually useless – the code doesn’t have the means to recover from them. Since they are usually not caught, they result in some thread getting terminated.
  • The stack trace is not always present in OOME, since its creation also requires memory, which may not be available.
  • The circumstances where an OOME is thrown may vary between different JVM implementations, due to a different memory management model. The textual description of the error may also vary. This post refers to to my own experience with Sun’s JVM, although I believe that most of it is relevant for other implementations as well.
  • As explained above, OOME is not recoverable at runtime. Therefore, the right strategy is to avoid its recurring once it happens, by means of configuration, design, or hardware modifications.

 

Java memory organization

Let’s first start with a very short overview of how a Java process organizes its address space. Basically, the memory is divided into heap memory and non-heap memory.  Modern JVM implementations use a generational heap model, in order to improve garbage collection performance. In this model, the heap area is usually split into 3 generations: young, old (“tenured”) and permanent. The young generation is where new objects created with the new operator are allocated, and it is designed to hold “young” objects. Older objects are moved to the tenured generation during GC. The permanent generation is not common to all JVM implementations, and it differs from the other generations – it stores (among other things) metadata of classes that have been loaded. The permanent generation is sometimes regarded as part of the heap, and sometimes not (The heap sizing flags -Xmx and -Xms, for example, do not include the permanent generation size).
Non-heap memory is the additional memory space outside of the areas reserved for the heap, and it can be used for different purposes. I will refer to this memory as native memory.

java.lang.OutOfMemoryError: Java heap space

This common error indicates that an object creation failed because there was no available heap space for it. It will never be thrown unless the JVM tries a full garbage collection first, including removal of softly/weakly referenced objects. When the stack trace is present for this particular OOME, it is usually irrelevant – the specific allocation that could not be satisfied gives us no clue regarding previous allocations that probably caused the problem. Same with the thread where the error occured – It may occur in any thread, since they all share the same heap.
  
Following is a sample code that eventually generates this OOME:
List<String> list = new ArrayList<String>();
while(true)
    list.add(new String("Consume more memory!"));
Solutions:
1) Check for memory leaks
  • Use the memory tab in JConsole in order to examine the graph of memory usage of your application over time. Focus on the “Tenured generation”, and click on “perform GC” every once in a while, to see whether the used memory returns to some expected baseline level after every garbage collection.
  • Run your application with the following JVM flags: -verbose:gc and -XX:+PrintGCDetails. This will add console outputs with details on every garbage collection that takes place. Using these outputs, you can analyze the memory consumption pattern, and see whether the memory usage increases constantly where it shouldn´t.                                    

 The following tools can help you analyze the reasons for the memory leak:

  • Run jmap -histo <java process id> at different stages of your application execution. This command prints the number of instances and total used memory per Java class. You can identify classes that have an unreasonable number of instances, or classes for which the number of instances increases unexpectedly. This can limit the search space in your code, when looking for the bug.
  • Generate a heap dump using JConsole or jmap, and analyze it using a heap analyzer (e.g. JHat, IBM HeapAnalyzer). The heap dump contains the full state of the heap at the point of time where the dump was created. JHat allows examining this heap snapshot, navigating through the different instances, and even executing complex queries (OQL language). An advanced feature allows comparing two snapshots taken at different points of time, in order to identify the allocations that occurred between the two. If you want the dump to be created automatically when the OOME occurs, add the -XX:+HeapDumpOnOutOfMemoryError JVM flag.

2) Increase the heap size

Use the JVM flags -Xms and -Xmx to set the initial and maximal heap size, respectively. Make sure not to exceed the physical memory size, otherwise you encourage paging, which affects performance. Sometimes installing additional physical memory is required.

3) Reconsider the design

An inadequate design can easily consume too much heap space and cause this OOME. Sometimes, it is useful to change the design in a way that compromises some desired system property (such as performance), but reduces memory use dramatically. For example, disk/database storage can replace some of the data structures in RAM, and memory sensitive mechanisms such as weak/soft references can be introduced.

java.lang.OutOfMemoryError: PermGen space

When the permanent generation becomes full, this error occurs, regardless of the available space in the object allocation area (young + old generations). Note that this specific message is issued by Sun’s JVM, and other JVM implementations may have different messages, specially those implementations with a different policy for storing class information (such as IBM’s JVM).
I will reproduce this OOME in two different ways. The first one uses a custom class loader to load too many classes. Assume that we have a custom class loader named MyClassLoader, capable of loading the class Dummy:
List<Class<?>> classes = new ArrayList<Class<?>>(); while(true){ MyClassLoader cl = new MyClassLoader(); try{ classes.add(cl.loadClass("Dummy")); }catch (ClassNotFoundException e) { e.printStackTrace(); } }
Note that we are actually loading the same class over and over, with a different class loader (remember that a class is identified by the combination of its name and its class loader object). Also, note that we store references to the loaded classes. If we don’t do so, the garbage collector realizes that the classes are not used, and performs garbage collection on the permanent generation.

A second way of filling up the permanent generation, is by exploiting the String intern pool. The class String manages a pool of strings that have been interned, using String.intern(…). This pool is also a part of the permanent generation.

List<String> list = new ArrayList<String>();
int i=0;
while(true)
    list.add(("Consume more memory!"+(i++)).intern());

The examples above highlight the common cases where this OOME occurs in practice:

  • Applications which load classes dynamically, such as web servers and application containers
  • A class loader with a memory leak
  • Applications that make use of string interning, in order to improve performance or reducy heap usage

Solutions:

Start by analyzing. A good place to start with is the stack trace. If it contains the java.lang.String.intern(..) call, then the reason is probably string interning. Alternatively, if it contains the java.lang.ClassLoader.defineClass(..) call, then you are probably dealing with a class loading problem. In the latter case, you may want to track the loading/unloading of classes, using the JVM flags -XX:+TraceClassLoading(or -verbose:class) and -XX:+TraceClassUnloading. Also, make sure you don’t have the flag -Xnoclassgc, because it prevents GC on classes, and can easily cause this OOME in applications that keep loading classes dynamically. As a part of the analysis, you can also run the utility jmap -permstat <process id> (not available for Windows). It provides details on all class loaders that have been used since the application was launched. 

Once you have pinpointed the problematic piece of code, check whether the design can be changed to consume less memory. If this is not an option, use the -XX:MaxPermSize flag to increase the maximal size of the permanent generation (e.g. -XX:MaxPermSize=256m).
Remember that increasing the values of -Xmx and -Xms flags instead will not change a thing. As explained before, these flags control the size of the young+old generations only, and do not affect the permanent generation.

 

java.lang.OutOfMemoryError: unable to create new native thread

Whenever we start a a  thread in Java, the JVM allocates space in native memory, designated for the thread’s call stack. This space is freed only after the thread is terminated. In case that there is not enough native memory for the stack space, we get this error. This can easily be reproduced with the following thread spawner code:
while(true){
    new Thread(new Runnable(){
        public void run() {
            try {
                Thread.sleep(60*60*1000);
            } catch(InterruptedException e) { }        
        }    
    }).start();
}  
Solutions:
This error is common in applications that try to manage too many live threads, as in the example above. As always, first ask yourself whether the design is reasonable. There are some generic solutions for reducing the number of threads. For example, an application that creates a thread per connection has a scalability problem in its design. The IO multiplexing feature introduced in Java 1.4 can solve this, and reduce dramatically the number of threads (See class java.nio.channels.Selector).
If you wish to try solving this problem by changing the memory settings only, then any of the following can help:
1) Reduce the heap size using -Xmx and -Xms flags. Sometimes, inexperienced programmers increase the heap size when they encounter this OOME, but it only makes the condition worse.
2) Reduce the permanent generation size using -XX:MaxPermSize
3) Reduce the size of memory allocated per stack using -Xss (e.g. -Xss512k). The default values depends on the environment, and range between256k and 1024k. Remember that when you reduce this value, you are increasing the risk of StackOverflowError, specially if you have deep recursion somewhere in your code or libraries that you use.
  

Other native memory OOMEs

There are many ways to use native memory other than thread stack allocation discussed before. Allocations can be made by the JVM itself, third party libraries, utilities in the JDK libraries,  and JNI code that you write. When the native memory gets exhausted, we may get an OOME of some kind. They are not consistent with their message, but one of them looks as follows:
 
java.lang.OutOfMemoryError: requested <bytes> bytes for <reason>. Out of swap space?
 
We can also cause a different message, by using the direct buffer feature of the java.nio package, in order to consume all available native memory:
List<ByteBuffer> list = new ArrayList<ByteBuffer>();
while(true)    
    list.add(ByteBuffer.allocateDirect(10000000)); 
 
In this case we will get the message:
java.lang.OutOfMemoryError: Direct buffer memory
 
 Solutions:
1)  Check whether there is a memory leak in some native code you use with JNI
2) Reduce the heap size using -Xmx and -Xms flags, to make more space for native memory.
3) Increase the swap space on disk. The way to do this depends on the operating system you use.
 
 

Summary

There are several types of OutOfMemoryError messages. It is important to know how to identify and analyze them,  and then how to find an appropriate solution. Solutions are in terms of prevention and not recovery at runtime.  A recovery attempt after an OOME may be too late – the data integrity may be broken, and there is no guarantee that there will be enough resources to complete it successfully. Instead, we should re-launch the application with a different setup.
An automatic recovery mechanism that occurs before the OOME can be effective, though. Since Java 1.5, we have the MemoryMXBean class, as part of the JMX services. An example of such a memory tracking mechanism was written by Dr. Heinz Kabutz in his 92nd issue of his newsletter.

Posted in java, JDK packages | Tagged: , , , , , , , , , , , , , , , , , , , , | Leave a Comment »

SimpleDateFormat pitfalls

Posted by Eyal Schneider on May 29, 2009

The class SimpleDateFormat is a very popular class, allowing flexible formatting and parsing of dates in Java. Formatting consists of transforming a Date into a String in a predefined format, and parsing performs the inverse operation.

In many cases, programmers fail to use SimpleDateFormat correctly, due to inattention to some (well documented) characteristics of this class. Maybe this will not be new for some of the readers, but there are 3 important things to remember when using SimpleDateFormat.

Time zone 

Every instance of SimpleDateFormat has an associated time zone to work with, as a TimeZone instance. Otherwise, a parsing operation would not be able to determine an absolute point in time (Date object), from a given string denoting a date. If not specified otherwise, the time zone associated with a new SimpleDateFormat is the default one, corresponding to the time zone reported by the operating system.

Consider the following code:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

Since the time zone is not explicitly set (using setTimeZone(..)), the code above will result in different outputs when executed in different parts of the world. The midnight that started July 18, 1976 at Thailand, and the same midnight in Florida, have a different GMT representation. The reason is that the midnight arrived in Thailand first.

In many cases, the local default time zone is exactly what the programmer wants. However, on a client-server application, or on a distributed system, we usually want to avoid this time zone relativity. Date strings that are read from configuration / data files / user interface / external software interfaces must be specific with their time zone. We can achieve this in one of two ways. The first one is to include the timezone itself as a part of the expected date format:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy z");
Date date = sdf.parse("18-07-1976 ICT");
System.out.println(date.getTime());

This program will always return the same value (206470800000), regardless of the execution location.

The other approach is to define a convention which needs to be respected by all system entities which generate textual dates. For example, we can decide that all dates should be interpreted as GMT dates. Then, we can bind the corresponding time zone to the SimpleDateFormat. For example:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

 This program will also return the same value everywhere in the world (206496000000). 

Validation

Assume  that we are reading a date in the format dd/MM/yyyy from the GUI. We want to validate that the date is well formatted and legal. Consider the following validation code, which makes use of the ParseException thrown by the parse method:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
Date d = null;
try {
    d = sdf.parse(dateStr);
} catch (ParseException e) {   
    showErrorMessage("Incorrect date format. "+e.getMessage());
}

Will it do the job? The answer is – not always. Date strings such as “10//10/2008” will result in an error, but dates such as “40/20/2009” will be parsed successfully, with no error message!. The reason is that by default, SimpleDateFormat is lenient, which means that it is tolerant to some formatting errors. It will try to figure out the date by applying some heuristics to the given text. In the case above, month 20 will be interpreted as December plus 8 months (i.e. August 2010), and day 40 of the month will be interpretted as end of August plus 9 days. The resulting date will be September 9th, 2010.

In order to achieve a stricter validation, we should make the SimpleDateFormat non-lenient, by initializing it as follows:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
sdf.setLenient(false);

Thread safety 

SimpleDateFormat is not thread safe. Concurrent formatting/parsing operations interfere with each other and result in inconsistent behavior. There are 3 ways to deal with this:

1) Create a SimpleDateFormat instance per usage
This will work fine if the usage is not frequent. Otherwise, the performance penalty may be significant.

2) Synchronization
We can achieve thread safety by synchronizing all usages of the SimpleDateFormat instance with the same lock. For example:

private SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
...
private Date parse(String dateStr) throws ParseException{
    synchronized(sdf){
        return sdf.parse(dateStr);
    }
}
private String format(Date date){
    synchronized(sdf){
        return sdf.format(dateStr);
    }
}

The disadvantage of this approach is that in case of many threads working with this code, thread contention will affect performance.

3) Use an instance per thread
The usage of ThreadLocal class allows associating a single instance of SimpleDateFormat with each thread that needs to do parsing/formatting of dates.
This way, we achieve both thread safety with no thread contention, and a minimal number of SimpleDateFormat instances.
In some cases, we can be extra economical on memory usage, if we know for example that the threads stop using the SimpleDateFormat at some point of time. In these cases, we could use a soft reference to hold the SimpleDateFormat instance.
An explanation of this technique plus sample code can be found here.

Links

http://www.javaspecialists.eu/archive/Issue172.html
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Posted in java, JDK packages | Tagged: , , , , , , | 3 Comments »

Having performance issues with regex?

Posted by Eyal Schneider on May 21, 2009

Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet, one should really be an artist in order to use it right. Assuming that a regular expression is proved for correctness, it may still run extremely slow (even take hours) if it is not wisely written.

Continue reading for understanding the origin of the problem, or jump directly to the last section, containing 10 useful performance tips for regular expression writers in Java.

Can it really be so slow?

 Suppose that we want to match only strings composed of sequences of ‘a’s or ‘b’s. A legal regex would be:

(a*b*)*

However, if you run this regex against the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” for example, it takes several minutes to terminate and report that there is no match!
Of course, a better regex in this case would be:

(a|b)*

This one takes less than a millisecond on my machine, on the same input. Obviously, there is a performance issue here.

Why does this happen?

 Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. The engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking.
In the catastrophic example above, the engine will actually explore ALL the decompositions of the series of ‘a’s into smaller series, before it realizes that there is no match. This example shows how the backtracking algorithm can result in exponential time evaluation (in the length of the input sting). It also reveals an important property of NFA engines: the worst cases will always be inputs that almost match the pattern. If a match is found, the exploration stops.

The other main approach for regex engines is DFA (Deterministic Final Automata). In this approach, the regex compilation actually builds an automaton, to be used when traversing input strings. Inputs are traversed character by character, with no going back. This guarantees linear timein the length of the input string, regardless of the regex complexity. Instead of trying different match possibilities serially (as in NFA), DFAsimulates a parallel scanning on all possibilities.

So why does Java (and .NET, Perl, Python, Ruby, PHP etc) use NFA and not DFA, which has much better asymptotic behavior? The reason is that NFA has some significant benefits:

  • Its compilation is faster, and requires much less memory
  • It allows some powerful features (See Sun’s tutorial for details on them):
    • Capturing groups and back references
    • Lookarounds
    • Advanced quantifiers (Possessive and Reluctant)

It is important to note that the popular terms NFA and DFA for regex engines are inaccurate. Theoretically speaking, these 2 models have the same computation power, meaning that there can’t be a matching rule that can be expressed in one of them but not in the other. In practice, there was a need for more features, so the two implementation types diverged in their semantics. NFA engines were given more power, making them superior to DFA engines in computation power.

Due to the speed of DFA and the unique features of NFA, there are 2 more “integrative” approaches for regex implementations. Some implementations use both engines (e.g. GNU egrep, which chooses the specific engine at runtime), and some of them really implement a hybrid version (e.g. Tcl regex), enjoying all the benefits.

Tips

 Following are some tips on how to avoid regex efficiency issues in Java. Many of them are aimed at reducing backtracking.

 1)  Pre-compile
Trivial, but worth mentioning. If you use a regular expression more than once, remember to compile it first as a pre-process step:
Pattern p = Pattern.compile(regex,flags);

//Usage stage
Matcher a = p.matcher(input);

 2)  Reluctant quantifiers vs greedy quantifiers
The default quantifiers (* + and ?) are greedy. That means that they start by matching the longest possible sequence, and then they step back gradually if backtracking is needed. In case that you know that the match is usually short, you should use reluctant quantifiers instead. They start by the smallest match, and progress if needed.
For example, suppose that you want to match only strings containing the sequence “hello”. The regex .*hello.* will do the job, but if you know that ‘hello’ usually appears near the beginning of the text, then .*?hello.* will run faster on average.

 3)  Use possessive quantifiers where possible
Unlike reluctant quantifiers which affect performance but do not affect regex behavior, possessive quantifiers may really change the meaning of the regex.
When using *+ instead of *, the first attempted match will be greedy (i.e. longest match, just as with *), but there will be no backtracking in case of failure, even if this causes the complete match to fail. When does this become useful?
Suppose that you need to match text in quotes. The regex \”[^\”]*\”will work fine. However, it will do unnecessary backtracking on negative cases (e.g. “bla bla bla). Using the regex \”[^\”]*+\”instead will eliminate the backtracking, without changing the regex meaning.

Independent grouping can have the same effect, and it allows even more control (See Sun’s tutorial ).

4)  Avoid capturing groups
Any expression in parentheses is considered a capturing group by default. This has a small performance impact. Make your groups “non-capturing groups” when possible, by beginning them with (?: instead of with (.

5)  Use alternation wisely
When using alternation (e.g. Paul|Jane|Chris), the order in which the engine tries to match the options is the same order in which they appear. Therefore, you can take advantage of this property, and order your options from the most common one to the less common one. This will improve the average time of positive matches.

 6)  Avoid multiple interpretations
Write your regex in a way that minimizes the number of different ways to match a particular input string. For example, the regex (a*b*)*given in the beginning of this article, allows interpreting the string “aabb” in too many ways:
(a2b2)
(a1)(a1)(b1)(b1)
(a2)(b2)
(a1)(a1b2)
etc…

The regex (a|b)* on the other hand, forces a unique interpretation on a positive input.
This is very important in order to reduce backtracking, in cases of almost a match.

 7)  Lookarounds
Lookarounds  allow adding restrictions on sequences to the left/right of current position. Specifically, with negative lookahead you can search strings that do not contain some sequence (cumbersome thing to do without this feature!). How can this help improve performance?

Suppose you want to capture the URL from a link tag. Consider the following regex:
<a .* href=(\S*).*/>
For legal tags it will find a match and capture only the href attribute as required (\S stands for a non-whitespace character) .  On some illegal tags, however, exessive backtracking will occur. For example: “<a href= href=href=…. href=something”.
The following regex will prevent this, by replacing the “.*” expression with something more specific, that does not match “href’”:
<a ((?!href).)* href=(\S*)((?!href).)*/>

 8)  Specify lengths
Java has a regex optimizer that checks the input string’s length against the min/max length derived from the regex contents. This allows failing the match in some cases immediately. In order to help this mechanism, specify repetition number on groups whenever possible (e.g. [01]{6}, which matches all binary strings of length 6).

 9)  Extract mandatory strings
Sometimes, strings which are mandatory are hidden inside groups or alternations:
(hello|hell|heel)
This expression could be simplified to:
he(llo|ll|el)
By doing so, the regex optimizer has more information to work with.

 10)  Benchmark your regex
When the regex is used in a performance critical area of your application, it would be wise to test it first. Write a micro-benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.

 

Links:

 http://java.sun.com/docs/books/tutorial/essential/regex/index.html
http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page=1
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics-of-Expression-Processing/

Posted in java, JDK packages | Tagged: , , , , , , | 5 Comments »