The Java Explorer

Tips and insights on Java

  • Subscribe

  • If you find this blog useful, please enter your email address to subscribe and receive notifications of new posts by email.

    Join 35 other followers

Posts Tagged ‘validation’

Generating DOM from XML preserving line numbers

Posted by Eyal Schneider on November 30, 2010

Lately I was implementing a simple scripting language based on XML. One of the goals was to emit informative and detailed error messages for syntax errors and script execution errors. In order to achieve this, I was required to include the relevant script line number in all error messages.  Default error messages for malformed XML or XML Schema validation failures do include line numbers, but they cover a limited set of possible errors. Scripting languages often require further validations, which are beyond the expressive power of standard XML schemas.

Parsing an XML into a corresponding DOM structure is straight forward when using the javax.xml.parsers.DocumentBuilder class:

public static Document readXML(InputStream is) throws IOException, SAXException{
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        return docBuilder.parse(is);
    } catch (ParserConfigurationException e) {
        throw new RuntimeException("Can't create DOM builder.", e);
    }
}
 



The problem is that once we have a DOM tree, the association between elements and their line number is lost. I couldn’t find any way of tweaking the DOM parser to be aware of line numbers. Therefore I tried using a SAX parser instead. SAX parsers traverse a document in its textual order from beginning to end, triggering callback methods whenever XML building blocks are encountered or a syntax error occurs. A SAX parser can be supplied with a DefaultHandler that uses a Locator. The latter is used for tracking the position of XML entities in the input document.

Following is a utility method that converts an XML given in an InputStream into a DOM structure, by using a SAX parser. Instead of keeping the line numbers in a new data structure, the DOM is enriched with a new attribute per element, indicating the line number of the element in the input document.


public static Document readXML(InputStream is, final String lineNumAttribName) throws IOException, SAXException {
    final Document doc;
    SAXParser parser;
    try {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        parser = factory.newSAXParser();
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        doc = docBuilder.newDocument();           
    } catch(ParserConfigurationException e){
        throw new RuntimeException("Can't create SAX parser / DOM builder.", e);
    }

    final Stack<Element> elementStack = new Stack<Element>();
    final StringBuilder textBuffer = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
        private Locator locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator; //Save the locator, so that it can be used later for line tracking when traversing nodes.
        }
       
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {               
            addTextIfNeeded();
            Element el = doc.createElement(qName);
            for(int i = 0;i < attributes.getLength(); i++)
                el.setAttribute(attributes.getQName(i), attributes.getValue(i));
            el.setAttribute(lineNumAttribName, String.valueOf(locator.getLineNumber()));
            elementStack.push(el);               
        }
       
        @Override
        public void endElement(String uri, String localName, String qName){
            addTextIfNeeded();
            Element closedEl = elementStack.pop();
            if (elementStack.isEmpty()) { // Is this the root element?
                doc.appendChild(closedEl);
            } else {
                Element parentEl = elementStack.peek();
                parentEl.appendChild(closedEl);                   
            }
        }
       
        @Override
        public void characters (char ch[], int start, int length) throws SAXException {
            textBuffer.append(ch, start, length);
        }
       
        // Outputs text accumulated under the current node
        private void addTextIfNeeded() {
            if (textBuffer.length() > 0) {
                Element el = elementStack.peek();
                Node textNode = doc.createTextNode(textBuffer.toString());
                el.appendChild(textNode);
                textBuffer.delete(0, textBuffer.length());
            }
        }           
    };
    parser.parse(is, handler);
   
    return doc;
}   

 

In order to compose a hierarchical structure from the linear traversal, a stack is used. The stack contains at any moment of the traversal the “path” to the current location. Whenever an element start is encountered we build the element and add it to the stack, and later when it closes we remove it from the stack and attach it to the parent node.

Note that the character(..) method implementation only appends the new text into a text buffer. This is due to the fact that it is not guaranteed that the method provides the full piece of text; it may return it in chunks.

Finally, note that while this implementation is fine for my purposes, it is not prepared to deal with any XML entity. Comments and processing instructions (such as <?xml version=”1.0″? encoding=”UTF-8″ standalone=”yes”?>) are ignored. In addition, CDATA sections are undressed and escaped, instead of copying them as is (actually there is no semantic difference between the two).

Posted in java, JDK packages | Tagged: , , , , , , , , | 1 Comment »

SimpleDateFormat pitfalls

Posted by Eyal Schneider on May 29, 2009

The class SimpleDateFormat is a very popular class, allowing flexible formatting and parsing of dates in Java. Formatting consists of transforming a Date into a String in a predefined format, and parsing performs the inverse operation.

In many cases, programmers fail to use SimpleDateFormat correctly, due to inattention to some (well documented) characteristics of this class. Maybe this will not be new for some of the readers, but there are 3 important things to remember when using SimpleDateFormat.

Time zone 

Every instance of SimpleDateFormat has an associated time zone to work with, as a TimeZone instance. Otherwise, a parsing operation would not be able to determine an absolute point in time (Date object), from a given string denoting a date. If not specified otherwise, the time zone associated with a new SimpleDateFormat is the default one, corresponding to the time zone reported by the operating system.

Consider the following code:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

Since the time zone is not explicitly set (using setTimeZone(..)), the code above will result in different outputs when executed in different parts of the world. The midnight that started July 18, 1976 at Thailand, and the same midnight in Florida, have a different GMT representation. The reason is that the midnight arrived in Thailand first.

In many cases, the local default time zone is exactly what the programmer wants. However, on a client-server application, or on a distributed system, we usually want to avoid this time zone relativity. Date strings that are read from configuration / data files / user interface / external software interfaces must be specific with their time zone. We can achieve this in one of two ways. The first one is to include the timezone itself as a part of the expected date format:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy z");
Date date = sdf.parse("18-07-1976 ICT");
System.out.println(date.getTime());

This program will always return the same value (206470800000), regardless of the execution location.

The other approach is to define a convention which needs to be respected by all system entities which generate textual dates. For example, we can decide that all dates should be interpreted as GMT dates. Then, we can bind the corresponding time zone to the SimpleDateFormat. For example:

SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
Date date = sdf.parse("18-07-1976");
System.out.println(date.getTime());

 This program will also return the same value everywhere in the world (206496000000). 

Validation

Assume  that we are reading a date in the format dd/MM/yyyy from the GUI. We want to validate that the date is well formatted and legal. Consider the following validation code, which makes use of the ParseException thrown by the parse method:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
Date d = null;
try {
    d = sdf.parse(dateStr);
} catch (ParseException e) {   
    showErrorMessage("Incorrect date format. "+e.getMessage());
}

Will it do the job? The answer is – not always. Date strings such as “10//10/2008” will result in an error, but dates such as “40/20/2009” will be parsed successfully, with no error message!. The reason is that by default, SimpleDateFormat is lenient, which means that it is tolerant to some formatting errors. It will try to figure out the date by applying some heuristics to the given text. In the case above, month 20 will be interpreted as December plus 8 months (i.e. August 2010), and day 40 of the month will be interpretted as end of August plus 9 days. The resulting date will be September 9th, 2010.

In order to achieve a stricter validation, we should make the SimpleDateFormat non-lenient, by initializing it as follows:

SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
sdf.setLenient(false);

Thread safety 

SimpleDateFormat is not thread safe. Concurrent formatting/parsing operations interfere with each other and result in inconsistent behavior. There are 3 ways to deal with this:

1) Create a SimpleDateFormat instance per usage
This will work fine if the usage is not frequent. Otherwise, the performance penalty may be significant.

2) Synchronization
We can achieve thread safety by synchronizing all usages of the SimpleDateFormat instance with the same lock. For example:

private SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy");
...
private Date parse(String dateStr) throws ParseException{
    synchronized(sdf){
        return sdf.parse(dateStr);
    }
}
private String format(Date date){
    synchronized(sdf){
        return sdf.format(dateStr);
    }
}

The disadvantage of this approach is that in case of many threads working with this code, thread contention will affect performance.

3) Use an instance per thread
The usage of ThreadLocal class allows associating a single instance of SimpleDateFormat with each thread that needs to do parsing/formatting of dates.
This way, we achieve both thread safety with no thread contention, and a minimal number of SimpleDateFormat instances.
In some cases, we can be extra economical on memory usage, if we know for example that the threads stop using the SimpleDateFormat at some point of time. In these cases, we could use a soft reference to hold the SimpleDateFormat instance.
An explanation of this technique plus sample code can be found here.

Links

http://www.javaspecialists.eu/archive/Issue172.html
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Posted in java, JDK packages | Tagged: , , , , , , | 3 Comments »