The Java Explorer

Tips and insights on Java

  • Subscribe

  • If you find this blog useful, please enter your email address to subscribe and receive notifications of new posts by email.

    Join 37 other followers

Archive for the ‘JDK packages’ Category

Generating DOM from XML preserving line numbers

Posted by Eyal Schneider on November 30, 2010

Lately I was implementing a simple scripting language based on XML. One of the goals was to emit informative and detailed error messages for syntax errors and script execution errors. In order to achieve this, I was required to include the relevant script line number in all error messages.  Default error messages for malformed XML or XML Schema validation failures do include line numbers, but they cover a limited set of possible errors. Scripting languages often require further validations, which are beyond the expressive power of standard XML schemas.

Parsing an XML into a corresponding DOM structure is straight forward when using the javax.xml.parsers.DocumentBuilder class:

public static Document readXML(InputStream is) throws IOException, SAXException{
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        return docBuilder.parse(is);
    } catch (ParserConfigurationException e) {
        throw new RuntimeException("Can't create DOM builder.", e);
    }
}
 



The problem is that once we have a DOM tree, the association between elements and their line number is lost. I couldn’t find any way of tweaking the DOM parser to be aware of line numbers. Therefore I tried using a SAX parser instead. SAX parsers traverse a document in its textual order from beginning to end, triggering callback methods whenever XML building blocks are encountered or a syntax error occurs. A SAX parser can be supplied with a DefaultHandler that uses a Locator. The latter is used for tracking the position of XML entities in the input document.

Following is a utility method that converts an XML given in an InputStream into a DOM structure, by using a SAX parser. Instead of keeping the line numbers in a new data structure, the DOM is enriched with a new attribute per element, indicating the line number of the element in the input document.


public static Document readXML(InputStream is, final String lineNumAttribName) throws IOException, SAXException {
    final Document doc;
    SAXParser parser;
    try {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        parser = factory.newSAXParser();
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        doc = docBuilder.newDocument();           
    } catch(ParserConfigurationException e){
        throw new RuntimeException("Can't create SAX parser / DOM builder.", e);
    }

    final Stack<Element> elementStack = new Stack<Element>();
    final StringBuilder textBuffer = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
        private Locator locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator; //Save the locator, so that it can be used later for line tracking when traversing nodes.
        }
       
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {               
            addTextIfNeeded();
            Element el = doc.createElement(qName);
            for(int i = 0;i < attributes.getLength(); i++)
                el.setAttribute(attributes.getQName(i), attributes.getValue(i));
            el.setAttribute(lineNumAttribName, String.valueOf(locator.getLineNumber()));
            elementStack.push(el);               
        }
       
        @Override
        public void endElement(String uri, String localName, String qName){
            addTextIfNeeded();
            Element closedEl = elementStack.pop();
            if (elementStack.isEmpty()) { // Is this the root element?
                doc.appendChild(closedEl);
            } else {
                Element parentEl = elementStack.peek();
                parentEl.appendChild(closedEl);                   
            }
        }
       
        @Override
        public void characters (char ch[], int start, int length) throws SAXException {
            textBuffer.append(ch, start, length);
        }
       
        // Outputs text accumulated under the current node
        private void addTextIfNeeded() {
            if (textBuffer.length() > 0) {
                Element el = elementStack.peek();
                Node textNode = doc.createTextNode(textBuffer.toString());
                el.appendChild(textNode);
                textBuffer.delete(0, textBuffer.length());
            }
        }           
    };
    parser.parse(is, handler);
   
    return doc;
}   

 

In order to compose a hierarchical structure from the linear traversal, a stack is used. The stack contains at any moment of the traversal the “path” to the current location. Whenever an element start is encountered we build the element and add it to the stack, and later when it closes we remove it from the stack and attach it to the parent node.

Note that the character(..) method implementation only appends the new text into a text buffer. This is due to the fact that it is not guaranteed that the method provides the full piece of text; it may return it in chunks.

Finally, note that while this implementation is fine for my purposes, it is not prepared to deal with any XML entity. Comments and processing instructions (such as <?xml version=”1.0″? encoding=”UTF-8″ standalone=”yes”?>) are ignored. In addition, CDATA sections are undressed and escaped, instead of copying them as is (actually there is no semantic difference between the two).

Posted in java, JDK packages | Tagged: , , , , , , , , | 1 Comment »

Exceptional exceptions

Posted by Eyal Schneider on December 27, 2009

From time to time, I find myself wondering what happens when an exception is thrown from some exotic place in the code. It is not always trivial how these exceptions are handled, and what are the consequences. In this post I summarize some of these cases.

Exception in static initialization code

Static initialization occurs when a class is loaded by a class loader. Its code can either come in the form of a static block (i.e. static{ ..}), or as a call to  a static method for initializing a static data member. In both cases, checked exceptions are not allowed by the compiler. When an unchecked exception occurs, on the other hand, it is wrapped by the ExceptionInInitializerError, which is then thrown in the context of the thread that triggered the class loading. The following code snippet produces this scenario:

 
class Useless {
    static {
        if (true) // Bypasses the compiler check for obvious exceptions in initializers
            throw new RuntimeException();
    }
}  

public class TestUselessClass {
    public static void main(String[] args) {
        try {
            new Useless(); // Triggers class loading
        } catch (ExceptionInInitializerError e) {
            e.printStackTrace();
        }
        new Useless();
    }
} 

 

Since class loading is lazy, the class gets loaded and initialized only when we try to create an instance from it. The class loading fails because initialization can’t complete, and then the ExceptionInInitializerError is thrown. In line 16 we try to use the class anyway. The class loading will not be attempted again, and this time we get a NoClassDefFoundError. This error is more severe than ClassNotFoundException, and it indicates in this case that we are trying to use a class that is in an erroneous state due to a failed initialization.              

It is important to note that in some cases the exception is not wrapped by ExceptionInInitializerError. If in line 5 we chose to throw an Error instead of a RuntimeException, then the JVM would have thrown the original exception without wrapping it.              

Summing up, we should be careful with complex static initializers, and be aware that an exception during initialization makes the class (and possibly the whole application) useless. Unless we want to report a fatal situation, we should not throw a runtime exception from a static initializer.

Exceptions in constructors

 

Exceptions in constructors are not rare at all. A checked exception can be used to indicate a “legitimate” problem when trying to create an instance, while an unchecked exception typically indicates a bug either in the client code or in the constructor itself. In both cases, the object is actually allocated in the heap space, but a reference to it is not returned. The object remains in a partially initialized state, until it gets garbage collected.  We learn from these facts that saving a reference to the object from the constructor itself (by using “this” reference) is a risky thing, since we may give access to an object in an invalid state.              

Another interesting thing to note about exceptions in constructors is related to reflection. When we need to invoke the empty constructor using a class object clzz, we sometimes use the method clzz.newInstance(). Reading its documentation reveals something unusual: any exceptions thrown by the constructor are propagated without a change. In other words, the newInstance method may throw checked exceptions that it does not even declare (!). We can exploit this flaw in order to bypass the compiler’s exception checking: 

 
public class UnsafeThrower {
    private static Exception toThrow; 

    private UnsafeThrower() throws Exception{
        throw toThrow;
    } 

    public static synchronized void throwExc(Exception e){
        toThrow = e;
        try {
            UnsafeThrower.class.newInstance();
        } catch (InstantiationException e1) {
            throw new IllegalArgumentException("Can't deal with InstantiationException");
        } catch (IllegalAccessException e1) {
            throw new IllegalArgumentException("Can't deal with IllegalAccessException");
        } finally{
            toThrow = null;
        }
    }
} 

 

Since newInstance declares two checked exceptions (InstantiationException and IllegalAccessException), the trick can not work with these two exceptions, and we get an IllegalArgumentException instead.           

To summarize this, we should avoid using Class.newInstance, since it does not handle exceptions correctly. Instead, we should use Constructor.newInstance, which generates an InvocationTargetException wrapper for any constructor exception.

Exceptions in a finally block 

 

A finally block is executed immediately after the try block finishes executing, either gracefully or with an exception. Any returned value or exception produced by the finally block overrides any returned value or exception ending the try block previously.    

The following flawed method implementation illustrates such a scenario:  

 
public static int countLines(String filePath) throws IOException{
    int count = 0;
    BufferedReader br = null;
    try{
        br = new BufferedReader(new FileReader(filePath));   
        while (br.readLine()!=null)
            count++;
    }finally{
     br.close(); //May cause a NullPointerException!
    }
    return count;
}

 

The method countLines in supposed to throw an IOException for all cases where an IO related problem is encountered. However, when the file is not found we get a NullPointerException instead, since it hides the original IOException being thrown. Obviously, The solution is to add a null verification on br in line 10.  

Exceptions in finalizers

Finalizers are intended for disposing of system resources or doing other kind of cleanup. We implement a finalizer by overriding Object.finalize().
What happens when we throw an exception from within an implementation of the finalize() method?  

The documentation of Object.finalize() states that:  

“Any exception thrown by the finalize method causes the finalization of this object to be halted, but is otherwise ignored.”   

 In other words, the thrown exception is ignored, and the object will eventually be garbage collected (assuming that no finalizer made this object reachable again).  

Let’s put this statement to the test:  

   
public class FinalizerTester {
    public void finalize() {
        throw new RuntimeException("Halting the finalizer...");
    }  

    public static void main(String[] args) {
        while (true)
            new FinalizerTester();
    }
}  

 

We would expect this code to run indefinitely, performing GC continuously. However, running this code on my machine ends with a OutOfMemoryError. Does this prove that the documentation is wrong, and an exception in the finalizer does prevent the object from being garbage collected after all?  

Inspecting the error’s stack trace supplies a clue of what really happens:  

Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
    at java.lang.ref.Finalizer.register(Finalizer.java:72)
    at java.lang.Object.<init>(Object.java:20)
    at exceptions.FinalizerTester.<init>(FinalizerTester.java:3)
    at exceptions.FinalizerTester.main(FinalizerTester.java:10)
  

Whenever an Object with a non-empty finalizer is created, the JVM registers the object in the finalizer queue. The registration does not actually add the object to the queue; it only makes sure that the object enters the queue later, if and when it is no longer reachable . A dedicated thread (called “Finalizer” in Sun’s JVM) works in the background , removes objects from the queue, and executes their finalizers. Sometimes the rate at which objects get ready for finalization is greater than the rate at which the finalizer thread empties the queue.  As a consequence, the queue may explode and cause a OutOfMemoryError. In our case, this is exactly what happens: we slowed down the dequeue rate by throwing an exception, which is a relatively expensive operation. In fact, any other time-consuming task inside finalize() can have the same effect. 

We can conclude that:

1) An exception in an object’s finalizer may cause the cleanup to be incomplete, but it does not make the object non-collectable by the GC.
2) There are many good reasons for avoiding the usage of finalizers. One of them is that time-consuming finalizers delay the recycling of dead objects, possibly causing OutOfMemoryError.

 

Posted in java, JDK packages | Tagged: , , , , , , , , , , , , , , | Leave a Comment »

Comparators and sorted sets

Posted by Eyal Schneider on November 23, 2009

Java suggests two techniques for defining an ordering on a collection of objects. The first is implementing the Comparator interface, as a stand alone comparison mechanism. The second is making the objects comparable, by making their class/es implement the Comparable interface.  

When the ordering is inherent to the class nature, such as with class Integer, implementing the Comparable interface makes sense. In the other cases, we don’t want to add specific ordering logic to the class implementation. Instead, we implement comparators on a need to basis. We may also want to implement multiple comparators for the same class, thus having many ordering schemes.

Regardless of the interface we choose to implement (Comparable/Comparator), there are some important rules to follow. Suppose that we need a class for maintaining a set of points in the plane, ordered by their distance from a fixed point. Let’s first define a class for representing a point:    

public class Point{
    private int x;
    private int y;
    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    public boolean equals(Object other){
        if (!(other instanceof Point))
            return false;
        Point otherPoint = (Point) other;
        return x==otherPoint.x && y==otherPoint.y;
    }

    public int hashCode(){
        return x ^ y;
    }

    public double distanceFrom(Point other){
        int dx = x-other.x;
        int dy = y-other.y;
        return Math.sqrt(dx*dx+dy*dy);
    }

    public String toString(){
        return "("+x+","+y+")";
    }

 }

And now let’s define the OrderedPointSet class, which uses a TreeSet with a comparator for maintaining the sorted set of points efficiently: 


public class OrderedPointSet {
    private SortedSet<Point> points;
  private Point refPoint;
  public OrderedPointSet(final Point refPoint){
        this.refPoint = refPoint;
        points = new TreeSet<Point>(new Comparator<Point>(){
            public int compare(Point p1, Point p2) {
                double dist1 = refPoint.distanceFrom(p1);
                double dist2 = refPoint.distanceFrom(p2);    
                if (dist1>dist2)
                    return 1;
                if (dist2>dist1)
                    return -1;
                return 0;
            }
        });
    }
    public void add(Point p){
        points.add(p);
    }
    //returns the reference point followed by a list of points ordered in ascending distance from it

    public String toString(){
        return refPoint+":"+points.toString();
    }

    //Other methods
    ....
}

We can now test our implementation: 

OrderedPointSet ps = new OrderedPointSet(new Point(0,0));
ps.add(new Point(1,5));
ps.add(new Point(10,10));
ps.add(new Point(5,20));
ps.add(new Point(5,5));
System.out.println(ps);

The output contains all points, with their right order:  

(0,0):[(1,5), (5,5), (10,10), (5,20)]
 

Apparently the job is done. However, the implementation has a major flaw, that can be detected with the following test scenario:  

OrderedPointSet ps = new OrderedPointSet(new Point(0,0));
ps.add(new Point(5,10));
ps.add(new Point(2,11));
ps.add(new Point(5,5));
System.out.println(ps);

We expect to see 3 points in the output, but we see only two:   

(0,0):[(5,5), (5,10)] 

How did this happen? The first two points, though not equal, have the same distance from the reference point  (square root of 125). The TreeSet implementation (as well as other SortedSet implementations) never invokes the equals() method on its items, and relies only on the comparator implementation for determining equality. Therefore, from its point of view, points (5,10) and (2,11) are equal. When trying to add the latter, the collection stays unchanged, because it obeys the contract of Set.add(), which doesn’t permit overriding. As a consequence, another part of Set’s contract is broken: we add two distinct items (according to equals method), and yet one of them is rejected.
The problem originates from the comparator implementation above, which doesn’t define an ordering consistent with equals, as required by the SortedSet documentation. The formal definition is as follows:  

The ordering imposed by a comparator c on a set of elements S is said to be consistent with equals if and only if c.compare(e1, e2)==0 has the same boolean value as e1.equals(e2) for every e1 and e2 in S.        

Actually, for solving the problem described above it suffices to enforce only one direction of this rule: if two items are equal according to the comparator, they must be equal according to the equals() method. However, failing to comply with the other direction will result in a different violation of the Set interface –  the collection may accept two items despite the fact that they are equal according to equals().
Returning to our example, the fixed version of the comparator enforces consistency with equals by breaking distance ties of distinct points arbitrarily:

public int compare(Point p1, Point p2) {
    double dist1 = refPoint.distanceFrom(p1);
    double dist2 = refPoint.distanceFrom(p2);    
    if (dist1>dist2)
        return 1;    
    if (dist2>dist1)
        return -1;
 
    int dx = p1.getX()-p2.getX();
    if (dx != 0)
        return dx;
    return p1.getY()-p2.getY();
}

Proving that the consistency rule holds for this code is easy:

Let p1,p2 be points.

(Direction 1) Assume that p1.equals(p2). Then, p1.x=p2.x, and p1.y=p2.y. In this case, when running compare(p1,p2), conditions on lines 4 , 6 and 10 will not be met. The returned value is then p1.y-p2.y, which is 0, as requested.
(Direction 2) Assume that compare(p1,p2)=0. That means that the method must have terminated at line 12 (the other return statements never return 0). Therefore we know that p1.y=p2.y. Also, we know that condition on line 10 was not met, hence p1.x=p2.x. In other words, p1 and p2 are identical, so p1.equals(p2) is true, as requested.

Summing up, when implementing a Comparator (or Comparable)  interface intended to be used in some SortedSet implementation (or as a SortedMap key), make sure the comparator complies with the consistency with equals rule. This will guarantee that the collection respects the Set contract, and does not exhibit weird behaviours. After implementing the comparator, it may be wise to prove that the consistency property really holds.

Posted in java, JDK packages | Tagged: , , , , , , | 5 Comments »

Strings and Memory leaks

Posted by Eyal Schneider on October 27, 2009

java.lang.String is one of the mostly used classes in Java. Interestingly,  it is important to know some of its implementation details in order to avoid memory leaks.
Consider the following seemingly innocent piece of code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

 

public class StrangeStringThing {
    private static Random rnd = new Random();
    private static final int MEGA = 1024*1024;
 
    private static String createRandomStr(int length){
        StringBuilder sb = new StringBuilder(length);
        for(int i=0;i<length;i++)
            sb.append((char)('a'+rnd.nextInt('z'-'a'+1)));
        return sb.toString();
    }
 
    public static void main(String[] args){
        List<String> substrings = new ArrayList<String>();
        for(int i=0;i<100;i++){
            String randomStr = createRandomStr(MEGA);
            String subStr = randomStr.substring(1000,1004);
            substrings.add(subStr);
        }
    }
}

This code generates a very long random string, and then keeps only a 5 chars long substring of it. This process is repeated 100 times.  We expect to end this program with 100 strings of length 5.
Surprisingly, if we try to run this code with the heap size limited to 100MB (-Xmx100m), we get an OutOfMemoryError.  How did that happen? all temporary large strings should have been cleared by the garbage collection, so there should be a really small memory footprint. Furthermore, the documentation of java.lang.String does not indicate any misuse of the API.

How are Java Strings stored?

The Java language specification defines precisely the API of class String, its Unicode details, its immutability, and the way String literals should be handled. Naturally, it does not define how Strings should be  represented internally, and what are the required time/space complexities of the different operations on them. These aren’t functional requirements. As a consequence, different JVM providers have the freedom to choose their preferred design for class String. Sun’s implementation (and also IBM’s) represent a string with a combination of 3 data members:

  • value – A reference to a char array
  • offset – A start index, inside the array
  • count – The actual length of the string.

When allocating a new string, the char array contains exactly the string content. Offset is set to 0, and count is set to the char array length. Then, when calling the method substring(..) upon it, the new string being returned contains a reference to the same char array, but its offset and count members are modified, reflecting the requested subsequence of chars. The sharing of the same char array by multiple String instances is possible, since strings are immutable. There are two benefits of this implementation approach in comparison to a substring implementation based on copying:
1) Memory usage is usually reduced, specially in cases where many substrings of the same string are taken, or if the substrings are long
2) substring(..) runs in constant time, instead of linear time

 Obviously there is a tradeoff here – if the string utilization pattern is not as described in (1), we may suffer from excessive memory consumption. The code above demonstrates this edge case: we take very small substrings of very large strings, where the latter are temporary. The temporary strings (referenced by randomStr) ARE collected by the garbage collector during the loop. However, their internal char arrays can not be collected, since they are being shared with the sub-strings. Therefore, the very long char arrays are kept unintentionally in memory, and that constitutes a memory leak.

The solution is simple – forcing the substrings to be copied into a new compact char array. This is done by replacing line 22 above with:

  String subStr = new String(randomStr.substring(1000,1004));

It looks like the API designers were aware of the the possibility of substring being implemented as described, so they added a copy constructor specifically for avoiding the memory leak. Note that otherwise this constructor is useless, because we are dealing with an immutable class.

Is this a Java bug?

There is currently an open bug report in Sun’s bug database regarding this issue.  It doesn’t look like it will be fixed soon, because doing so may introduce performance regression in many existing programs. Besides, there is a very simple workaround (described above), what makes the priority of this bug relatively low.

Summary

Java developers should be aware of the prevalent String.substring(..) implementation, and its memory leak potential risk. Whenever we know that the original string we take the substring from has a shorter life span than the substring itself, we should consider using the copy constructor to avoid sharing of the underlying char array.

Posted in java, JDK packages | Tagged: , , , , , , , | 5 Comments »

OutOfMemoryError

Posted by Eyal Schneider on June 17, 2009

Any Java programmer has encountered at least once the disturbing OutOfMemoryError (OOME) message. The interesting thing is that this error occurs for different reasons, and the diagnosis and treatment in each case is completely different. In this post I will try to cover some of these cases, show how to reproduce them and how to deal with them.
Some important things to remember about OOME:

  • As the name implies, this is an Error, and not an Exception. Errors indicate a serious condition in the JVM. Catching Errors is usually useless – the code doesn’t have the means to recover from them. Since they are usually not caught, they result in some thread getting terminated.
  • The stack trace is not always present in OOME, since its creation also requires memory, which may not be available.
  • The circumstances where an OOME is thrown may vary between different JVM implementations, due to a different memory management model. The textual description of the error may also vary. This post refers to to my own experience with Sun’s JVM, although I believe that most of it is relevant for other implementations as well.
  • As explained above, OOME is not recoverable at runtime. Therefore, the right strategy is to avoid its recurring once it happens, by means of configuration, design, or hardware modifications.

 

Java memory organization

Let’s first start with a very short overview of how a Java process organizes its address space. Basically, the memory is divided into heap memory and non-heap memory.  Modern JVM implementations use a generational heap model, in order to improve garbage collection performance. In this model, the heap area is usually split into 3 generations: young, old (“tenured”) and permanent. The young generation is where new objects created with the new operator are allocated, and it is designed to hold “young” objects. Older objects are moved to the tenured generation during GC. The permanent generation is not common to all JVM implementations, and it differs from the other generations – it stores (among other things) metadata of classes that have been loaded. The permanent generation is sometimes regarded as part of the heap, and sometimes not (The heap sizing flags -Xmx and -Xms, for example, do not include the permanent generation size).
Non-heap memory is the additional memory space outside of the areas reserved for the heap, and it can be used for different purposes. I will refer to this memory as native memory.

java.lang.OutOfMemoryError: Java heap space

This common error indicates that an object creation failed because there was no available heap space for it. It will never be thrown unless the JVM tries a full garbage collection first, including removal of softly/weakly referenced objects. When the stack trace is present for this particular OOME, it is usually irrelevant – the specific allocation that could not be satisfied gives us no clue regarding previous allocations that probably caused the problem. Same with the thread where the error occured – It may occur in any thread, since they all share the same heap.
  
Following is a sample code that eventually generates this OOME:
List<String> list = new ArrayList<String>();
while(true)
    list.add(new String("Consume more memory!"));
Solutions:
1) Check for memory leaks
  • Use the memory tab in JConsole in order to examine the graph of memory usage of your application over time. Focus on the “Tenured generation”, and click on “perform GC” every once in a while, to see whether the used memory returns to some expected baseline level after every garbage collection.
  • Run your application with the following JVM flags: -verbose:gc and -XX:+PrintGCDetails. This will add console outputs with details on every garbage collection that takes place. Using these outputs, you can analyze the memory consumption pattern, and see whether the memory usage increases constantly where it shouldn´t.                                    

 The following tools can help you analyze the reasons for the memory leak:

  • Run jmap -histo <java process id> at different stages of your application execution. This command prints the number of instances and total used memory per Java class. You can identify classes that have an unreasonable number of instances, or classes for which the number of instances increases unexpectedly. This can limit the search space in your code, when looking for the bug.
  • Generate a heap dump using JConsole or jmap, and analyze it using a heap analyzer (e.g. JHat, IBM HeapAnalyzer). The heap dump contains the full state of the heap at the point of time where the dump was created. JHat allows examining this heap snapshot, navigating through the different instances, and even executing complex queries (OQL language). An advanced feature allows comparing two snapshots taken at different points of time, in order to identify the allocations that occurred between the two. If you want the dump to be created automatically when the OOME occurs, add the -XX:+HeapDumpOnOutOfMemoryError JVM flag.

2) Increase the heap size

Use the JVM flags -Xms and -Xmx to set the initial and maximal heap size, respectively. Make sure not to exceed the physical memory size, otherwise you encourage paging, which affects performance. Sometimes installing additional physical memory is required.

3) Reconsider the design

An inadequate design can easily consume too much heap space and cause this OOME. Sometimes, it is useful to change the design in a way that compromises some desired system property (such as performance), but reduces memory use dramatically. For example, disk/database storage can replace some of the data structures in RAM, and memory sensitive mechanisms such as weak/soft references can be introduced.

java.lang.OutOfMemoryError: PermGen space

When the permanent generation becomes full, this error occurs, regardless of the available space in the object allocation area (young + old generations). Note that this specific message is issued by Sun’s JVM, and other JVM implementations may have different messages, specially those implementations with a different policy for storing class information (such as IBM’s JVM).
I will reproduce this OOME in two different ways. The first one uses a custom class loader to load too many classes. Assume that we have a custom class loader named MyClassLoader, capable of loading the class Dummy:
List<Class<?>> classes = new ArrayList<Class<?>>(); while(true){ MyClassLoader cl = new MyClassLoader(); try{ classes.add(cl.loadClass("Dummy")); }catch (ClassNotFoundException e) { e.printStackTrace(); } }
Note that we are actually loading the same class over and over, with a different class loader (remember that a class is identified by the combination of its name and its class loader object). Also, note that we store references to the loaded classes. If we don’t do so, the garbage collector realizes that the classes are not used, and performs garbage collection on the permanent generation.

A second way of filling up the permanent generation, is by exploiting the String intern pool. The class String manages a pool of strings that have been interned, using String.intern(…). This pool is also a part of the permanent generation.

List<String> list = new ArrayList<String>();
int i=0;
while(true)
    list.add(("Consume more memory!"+(i++)).intern());

The examples above highlight the common cases where this OOME occurs in practice:

  • Applications which load classes dynamically, such as web servers and application containers
  • A class loader with a memory leak
  • Applications that make use of string interning, in order to improve performance or reducy heap usage

Solutions:

Start by analyzing. A good place to start with is the stack trace. If it contains the java.lang.String.intern(..) call, then the reason is probably string interning. Alternatively, if it contains the java.lang.ClassLoader.defineClass(..) call, then you are probably dealing with a class loading problem. In the latter case, you may want to track the loading/unloading of classes, using the JVM flags -XX:+TraceClassLoading(or -verbose:class) and -XX:+TraceClassUnloading. Also, make sure you don’t have the flag -Xnoclassgc, because it prevents GC on classes, and can easily cause this OOME in applications that keep loading classes dynamically. As a part of the analysis, you can also run the utility jmap -permstat <process id> (not available for Windows). It provides details on all class loaders that have been used since the application was launched. 

Once you have pinpointed the problematic piece of code, check whether the design can be changed to consume less memory. If this is not an option, use the -XX:MaxPermSize flag to increase the maximal size of the permanent generation (e.g. -XX:MaxPermSize=256m).
Remember that increasing the values of -Xmx and -Xms flags instead will not change a thing. As explained before, these flags control the size of the young+old generations only, and do not affect the permanent generation.

 

java.lang.OutOfMemoryError: unable to create new native thread

Whenever we start a a  thread in Java, the JVM allocates space in native memory, designated for the thread’s call stack. This space is freed only after the thread is terminated. In case that there is not enough native memory for the stack space, we get this error. This can easily be reproduced with the following thread spawner code:
while(true){
    new Thread(new Runnable(){
        public void run() {
            try {
                Thread.sleep(60*60*1000);
            } catch(InterruptedException e) { }        
        }    
    }).start();
}  
Solutions:
This error is common in applications that try to manage too many live threads, as in the example above. As always, first ask yourself whether the design is reasonable. There are some generic solutions for reducing the number of threads. For example, an application that creates a thread per connection has a scalability problem in its design. The IO multiplexing feature introduced in Java 1.4 can solve this, and reduce dramatically the number of threads (See class java.nio.channels.Selector).
If you wish to try solving this problem by changing the memory settings only, then any of the following can help:
1) Reduce the heap size using -Xmx and -Xms flags. Sometimes, inexperienced programmers increase the heap size when they encounter this OOME, but it only makes the condition worse.
2) Reduce the permanent generation size using -XX:MaxPermSize
3) Reduce the size of memory allocated per stack using -Xss (e.g. -Xss512k). The default values depends on the environment, and range between256k and 1024k. Remember that when you reduce this value, you are increasing the risk of StackOverflowError, specially if you have deep recursion somewhere in your code or libraries that you use.
  

Other native memory OOMEs

There are many ways to use native memory other than thread stack allocation discussed before. Allocations can be made by the JVM itself, third party libraries, utilities in the JDK libraries,  and JNI code that you write. When the native memory gets exhausted, we may get an OOME of some kind. They are not consistent with their message, but one of them looks as follows:
 
java.lang.OutOfMemoryError: requested <bytes> bytes for <reason>. Out of swap space?
 
We can also cause a different message, by using the direct buffer feature of the java.nio package, in order to consume all available native memory:
List<ByteBuffer> list = new ArrayList<ByteBuffer>();
while(true)    
    list.add(ByteBuffer.allocateDirect(10000000)); 
 
In this case we will get the message:
java.lang.OutOfMemoryError: Direct buffer memory
 
 Solutions:
1)  Check whether there is a memory leak in some native code you use with JNI
2) Reduce the heap size using -Xmx and -Xms flags, to make more space for native memory.
3) Increase the swap space on disk. The way to do this depends on the operating system you use.
 
 

Summary

There are several types of OutOfMemoryError messages. It is important to know how to identify and analyze them,  and then how to find an appropriate solution. Solutions are in terms of prevention and not recovery at runtime.  A recovery attempt after an OOME may be too late – the data integrity may be broken, and there is no guarantee that there will be enough resources to complete it successfully. Instead, we should re-launch the application with a different setup.
An automatic recovery mechanism that occurs before the OOME can be effective, though. Since Java 1.5, we have the MemoryMXBean class, as part of the JMX services. An example of such a memory tracking mechanism was written by Dr. Heinz Kabutz in his 92nd issue of his newsletter.

Posted in java, JDK packages | Tagged: , , , , , , , , , , , , , , , , , , , , | Leave a Comment »

SimpleDateFormat pitfalls

Posted by Eyal Schneider on May 29, 2009

The class SimpleDateFormat is a very popular class, allowing flexible formatting and parsing of dates in Java. Formatting consists of transforming a Date into a String in a predefined format, and parsing performs the inverse operation.

In many cases, programmers fail to use SimpleDateFormat correctly, due to inattention to some (well documented) characteristics of this class. Maybe this will not be new for some of the readers, but there are 3 important things to remember when using SimpleDateFormat.

Time zone 

Every instance of SimpleDateFormat has an associated time zone to work with, as a TimeZone instance. Otherwise, a parsing operation would not be able to determine an absolute point in time (Date object), from a given string denoting a date. If not specified otherwise, the time zone associated with a new SimpleDateFormat is the default one, corresponding to the time zone reported by the operating system.

Consider the following code:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

Since the time zone is not explicitly set (using setTimeZone(..)), the code above will result in different outputs when executed in different parts of the world. The midnight that started July 18, 1976 at Thailand, and the same midnight in Florida, have a different GMT representation. The reason is that the midnight arrived in Thailand first.

In many cases, the local default time zone is exactly what the programmer wants. However, on a client-server application, or on a distributed system, we usually want to avoid this time zone relativity. Date strings that are read from configuration / data files / user interface / external software interfaces must be specific with their time zone. We can achieve this in one of two ways. The first one is to include the timezone itself as a part of the expected date format:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy z");
Date date = sdf.parse("18-07-1976 ICT");
System.out.println(date.getTime());

This program will always return the same value (206470800000), regardless of the execution location.

The other approach is to define a convention which needs to be respected by all system entities which generate textual dates. For example, we can decide that all dates should be interpreted as GMT dates. Then, we can bind the corresponding time zone to the SimpleDateFormat. For example:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

 This program will also return the same value everywhere in the world (206496000000). 

Validation

Assume  that we are reading a date in the format dd/MM/yyyy from the GUI. We want to validate that the date is well formatted and legal. Consider the following validation code, which makes use of the ParseException thrown by the parse method:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
Date d = null;
try {
    d = sdf.parse(dateStr);
} catch (ParseException e) {   
    showErrorMessage("Incorrect date format. "+e.getMessage());
}

Will it do the job? The answer is – not always. Date strings such as “10//10/2008” will result in an error, but dates such as “40/20/2009” will be parsed successfully, with no error message!. The reason is that by default, SimpleDateFormat is lenient, which means that it is tolerant to some formatting errors. It will try to figure out the date by applying some heuristics to the given text. In the case above, month 20 will be interpreted as December plus 8 months (i.e. August 2010), and day 40 of the month will be interpretted as end of August plus 9 days. The resulting date will be September 9th, 2010.

In order to achieve a stricter validation, we should make the SimpleDateFormat non-lenient, by initializing it as follows:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
sdf.setLenient(false);

Thread safety 

SimpleDateFormat is not thread safe. Concurrent formatting/parsing operations interfere with each other and result in inconsistent behavior. There are 3 ways to deal with this:

1) Create a SimpleDateFormat instance per usage
This will work fine if the usage is not frequent. Otherwise, the performance penalty may be significant.

2) Synchronization
We can achieve thread safety by synchronizing all usages of the SimpleDateFormat instance with the same lock. For example:

private SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
...
private Date parse(String dateStr) throws ParseException{
    synchronized(sdf){
        return sdf.parse(dateStr);
    }
}
private String format(Date date){
    synchronized(sdf){
        return sdf.format(dateStr);
    }
}

The disadvantage of this approach is that in case of many threads working with this code, thread contention will affect performance.

3) Use an instance per thread
The usage of ThreadLocal class allows associating a single instance of SimpleDateFormat with each thread that needs to do parsing/formatting of dates.
This way, we achieve both thread safety with no thread contention, and a minimal number of SimpleDateFormat instances.
In some cases, we can be extra economical on memory usage, if we know for example that the threads stop using the SimpleDateFormat at some point of time. In these cases, we could use a soft reference to hold the SimpleDateFormat instance.
An explanation of this technique plus sample code can be found here.

Links

http://www.javaspecialists.eu/archive/Issue172.html
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Posted in java, JDK packages | Tagged: , , , , , , | 3 Comments »

Having performance issues with regex?

Posted by Eyal Schneider on May 21, 2009

Java introduced the java.util.regex package in version 1.4. It is a powerful addition, and yet, one should really be an artist in order to use it right. Assuming that a regular expression is proved for correctness, it may still run extremely slow (even take hours) if it is not wisely written.

Continue reading for understanding the origin of the problem, or jump directly to the last section, containing 10 useful performance tips for regular expression writers in Java.

Can it really be so slow?

 Suppose that we want to match only strings composed of sequences of ‘a’s or ‘b’s. A legal regex would be:

(a*b*)*

However, if you run this regex against the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaax” for example, it takes several minutes to terminate and report that there is no match!
Of course, a better regex in this case would be:

(a|b)*

This one takes less than a millisecond on my machine, on the same input. Obviously, there is a performance issue here.

Why does this happen?

 Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. The engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking.
In the catastrophic example above, the engine will actually explore ALL the decompositions of the series of ‘a’s into smaller series, before it realizes that there is no match. This example shows how the backtracking algorithm can result in exponential time evaluation (in the length of the input sting). It also reveals an important property of NFA engines: the worst cases will always be inputs that almost match the pattern. If a match is found, the exploration stops.

The other main approach for regex engines is DFA (Deterministic Final Automata). In this approach, the regex compilation actually builds an automaton, to be used when traversing input strings. Inputs are traversed character by character, with no going back. This guarantees linear timein the length of the input string, regardless of the regex complexity. Instead of trying different match possibilities serially (as in NFA), DFAsimulates a parallel scanning on all possibilities.

So why does Java (and .NET, Perl, Python, Ruby, PHP etc) use NFA and not DFA, which has much better asymptotic behavior? The reason is that NFA has some significant benefits:

  • Its compilation is faster, and requires much less memory
  • It allows some powerful features (See Sun’s tutorial for details on them):
    • Capturing groups and back references
    • Lookarounds
    • Advanced quantifiers (Possessive and Reluctant)

It is important to note that the popular terms NFA and DFA for regex engines are inaccurate. Theoretically speaking, these 2 models have the same computation power, meaning that there can’t be a matching rule that can be expressed in one of them but not in the other. In practice, there was a need for more features, so the two implementation types diverged in their semantics. NFA engines were given more power, making them superior to DFA engines in computation power.

Due to the speed of DFA and the unique features of NFA, there are 2 more “integrative” approaches for regex implementations. Some implementations use both engines (e.g. GNU egrep, which chooses the specific engine at runtime), and some of them really implement a hybrid version (e.g. Tcl regex), enjoying all the benefits.

Tips

 Following are some tips on how to avoid regex efficiency issues in Java. Many of them are aimed at reducing backtracking.

 1)  Pre-compile
Trivial, but worth mentioning. If you use a regular expression more than once, remember to compile it first as a pre-process step:
Pattern p = Pattern.compile(regex,flags);

//Usage stage
Matcher a = p.matcher(input);

 2)  Reluctant quantifiers vs greedy quantifiers
The default quantifiers (* + and ?) are greedy. That means that they start by matching the longest possible sequence, and then they step back gradually if backtracking is needed. In case that you know that the match is usually short, you should use reluctant quantifiers instead. They start by the smallest match, and progress if needed.
For example, suppose that you want to match only strings containing the sequence “hello”. The regex .*hello.* will do the job, but if you know that ‘hello’ usually appears near the beginning of the text, then .*?hello.* will run faster on average.

 3)  Use possessive quantifiers where possible
Unlike reluctant quantifiers which affect performance but do not affect regex behavior, possessive quantifiers may really change the meaning of the regex.
When using *+ instead of *, the first attempted match will be greedy (i.e. longest match, just as with *), but there will be no backtracking in case of failure, even if this causes the complete match to fail. When does this become useful?
Suppose that you need to match text in quotes. The regex \”[^\”]*\”will work fine. However, it will do unnecessary backtracking on negative cases (e.g. “bla bla bla). Using the regex \”[^\”]*+\”instead will eliminate the backtracking, without changing the regex meaning.

Independent grouping can have the same effect, and it allows even more control (See Sun’s tutorial ).

4)  Avoid capturing groups
Any expression in parentheses is considered a capturing group by default. This has a small performance impact. Make your groups “non-capturing groups” when possible, by beginning them with (?: instead of with (.

5)  Use alternation wisely
When using alternation (e.g. Paul|Jane|Chris), the order in which the engine tries to match the options is the same order in which they appear. Therefore, you can take advantage of this property, and order your options from the most common one to the less common one. This will improve the average time of positive matches.

 6)  Avoid multiple interpretations
Write your regex in a way that minimizes the number of different ways to match a particular input string. For example, the regex (a*b*)*given in the beginning of this article, allows interpreting the string “aabb” in too many ways:
(a2b2)
(a1)(a1)(b1)(b1)
(a2)(b2)
(a1)(a1b2)
etc…

The regex (a|b)* on the other hand, forces a unique interpretation on a positive input.
This is very important in order to reduce backtracking, in cases of almost a match.

 7)  Lookarounds
Lookarounds  allow adding restrictions on sequences to the left/right of current position. Specifically, with negative lookahead you can search strings that do not contain some sequence (cumbersome thing to do without this feature!). How can this help improve performance?

Suppose you want to capture the URL from a link tag. Consider the following regex:
<a .* href=(\S*).*/>
For legal tags it will find a match and capture only the href attribute as required (\S stands for a non-whitespace character) .  On some illegal tags, however, exessive backtracking will occur. For example: “<a href= href=href=…. href=something”.
The following regex will prevent this, by replacing the “.*” expression with something more specific, that does not match “href’”:
<a ((?!href).)* href=(\S*)((?!href).)*/>

 8)  Specify lengths
Java has a regex optimizer that checks the input string’s length against the min/max length derived from the regex contents. This allows failing the match in some cases immediately. In order to help this mechanism, specify repetition number on groups whenever possible (e.g. [01]{6}, which matches all binary strings of length 6).

 9)  Extract mandatory strings
Sometimes, strings which are mandatory are hidden inside groups or alternations:
(hello|hell|heel)
This expression could be simplified to:
he(llo|ll|el)
By doing so, the regex optimizer has more information to work with.

 10)  Benchmark your regex
When the regex is used in a performance critical area of your application, it would be wise to test it first. Write a micro-benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.

 

Links:

 http://java.sun.com/docs/books/tutorial/essential/regex/index.html
http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html?page=1
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.devarticles.com/c/a/Java/NFA-DFA-POSIX-and-the-Mechanics-of-Expression-Processing/

Posted in java, JDK packages | Tagged: , , , , , , | 5 Comments »