The Java Explorer

Tips and insights on Java


Generating DOM from XML preserving line numbers

Posted by Eyal Schneider on November 30, 2010

Lately I was implementing a simple scripting language based on XML. One of the goals was to emit informative and detailed error messages for syntax errors and script execution errors. To achieve this, I needed to include the relevant script line number in all error messages. Default error messages for malformed XML or XML Schema validation failures do include line numbers, but they cover a limited set of possible errors. Scripting languages often require further validations, which are beyond the expressive power of standard XML schemas.

Parsing an XML document into a corresponding DOM structure is straightforward when using the javax.xml.parsers.DocumentBuilder class:

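import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public static Document readXML(InputStream is) throws IOException, SAXException{
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        return docBuilder.parse(is);
    } catch (ParserConfigurationException e) {
        throw new RuntimeException("Can't create DOM builder.", e);
    }
}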
 



The problem is that once we have a DOM tree, the association between elements and their line numbers is lost. I couldn’t find any way of tweaking the DOM parser to be aware of line numbers, so I tried using a SAX parser instead. SAX parsers traverse a document in its textual order from beginning to end, triggering callback methods whenever XML building blocks are encountered or a syntax error occurs. A SAX parser can be supplied with a DefaultHandler, and the parser hands that handler a Locator through setDocumentLocator(..). The Locator reports the current position (line and column) in the input document while the callbacks run.
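To see the Locator mechanism on its own, here is a minimal sketch. The class name LocatorDemo and the method elementLines are made up for illustration. The handler only records each element name together with the line reported when its start tag is seen. It assumes the parser actually supplies a locator, which SAX encourages but does not strictly require; the parser bundled with the JDK does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorDemo {

    // Hypothetical helper: returns entries of the form "name@line", in document order.
    public static List<String> elementLines(String xml) throws Exception {
        final List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator; // stays null if the parser does not provide one

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The position is only meaningful while a callback is running.
                int line = (locator == null) ? -1 : locator.getLineNumber();
                result.add(qName + "@" + line);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<script>\n  <print/>\n  <loop>\n  </loop>\n</script>";
        // With the JDK's bundled parser this prints [script@1, print@2, loop@3]
        System.out.println(elementLines(xml));
    }
}
```

Note that the reported position is where the parser stands after reading the start tag, so for a start tag that spans several lines the number refers to the line on which the tag ends.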

The following utility method converts an XML document, given as an InputStream, into a DOM structure by using a SAX parser. Instead of keeping the line numbers in a separate data structure, it enriches the DOM with one extra attribute per element, holding the line number of that element in the input document.


public static Document readXML(InputStream is, final String lineNumAttribName) throws IOException, SAXException {
    final Document doc;
    SAXParser parser;
    try {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        parser = factory.newSAXParser();
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        doc = docBuilder.newDocument();           
    } catch(ParserConfigurationException e){
        throw new RuntimeException("Can't create SAX parser / DOM builder.", e);
    }

    final Stack<Element> elementStack = new Stack<Element>();
    final StringBuilder textBuffer = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
        private Locator locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator; //Save the locator, so that it can be used later for line tracking when traversing nodes.
        }
       
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {               
            addTextIfNeeded();
            Element el = doc.createElement(qName);
            for(int i = 0;i < attributes.getLength(); i++)
                el.setAttribute(attributes.getQName(i), attributes.getValue(i));
            el.setAttribute(lineNumAttribName, String.valueOf(locator.getLineNumber()));
            elementStack.push(el);               
        }
       
        @Override
        public void endElement(String uri, String localName, String qName){
            addTextIfNeeded();
            Element closedEl = elementStack.pop();
            if (elementStack.isEmpty()) { // Is this the root element?
                doc.appendChild(closedEl);
            } else {
                Element parentEl = elementStack.peek();
                parentEl.appendChild(closedEl);                   
            }
        }
       
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            textBuffer.append(ch, start, length);
        }
       
        // Outputs text accumulated under the current node
        private void addTextIfNeeded() {
            if (textBuffer.length() > 0) {
                Element el = elementStack.peek();
                Node textNode = doc.createTextNode(textBuffer.toString());
                el.appendChild(textNode);
                textBuffer.delete(0, textBuffer.length());
            }
        }           
    };
    parser.parse(is, handler);
   
    return doc;
}   

 

In order to compose a hierarchical structure from the linear traversal, a stack is used. The stack contains at any moment of the traversal the “path” to the current location. Whenever an element start is encountered we build the element and add it to the stack, and later when it closes we remove it from the stack and attach it to the parent node.
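The "path" idea can be shown in isolation with a small sketch. The names PathTraceDemo and elementPaths are hypothetical. Instead of DOM elements, the stack holds element names, and on every element start the current stack content is recorded as a slash-separated path.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathTraceDemo {

    // Hypothetical helper: records the path of open elements at each element start.
    public static List<String> elementPaths(String xml) throws Exception {
        final List<String> paths = new ArrayList<>();
        final Deque<String> open = new ArrayDeque<>(); // root first, current element last
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                open.addLast(qName);                // entering an element: push
                paths.add(String.join("/", open));  // the stack is the path to this element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.removeLast();                  // leaving an element: pop
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return paths;
    }

    public static void main(String[] args) throws Exception {
        // prints [script, script/loop, script/loop/print, script/print]
        System.out.println(elementPaths("<script><loop><print/></loop><print/></script>"));
    }
}
```

The sketch uses ArrayDeque rather than java.util.Stack only because it iterates from the root downwards, which makes joining the path simple; the push and pop discipline is the same as in the method above.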

Note that the characters(..) method implementation only appends the new text to a text buffer. This is because a single call is not guaranteed to deliver the full piece of text; the parser may deliver it in several chunks.
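A small sketch makes the chunking visible. ChunkedTextDemo, Result and collect are illustrative names, not part of any API. It counts the calls to characters(..) and re-assembles the text with a buffer. How many calls are made depends on the parser; text containing an entity reference is a common case where more than one call occurs, but the sketch relies only on the accumulated result.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedTextDemo {

    /** Number of characters(..) calls and the text re-assembled from them. */
    public static final class Result {
        public final int calls;
        public final String text;

        Result(int calls, String text) {
            this.calls = calls;
            this.text = text;
        }
    }

    // Hypothetical helper: accumulates all character data of the document.
    public static Result collect(String xml) throws Exception {
        final StringBuilder buffer = new StringBuilder();
        final int[] calls = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                calls[0]++;
                buffer.append(ch, start, length); // never assume one call == one text node
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return new Result(calls[0], buffer.toString());
    }

    public static void main(String[] args) throws Exception {
        Result r = collect("<msg>a &amp; b</msg>");
        System.out.println(r.calls + " call(s), text = \"" + r.text + "\"");
    }
}
```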

Finally, note that while this implementation is fine for my purposes, it does not handle every XML construct. Comments and processing instructions are ignored, and the XML declaration (such as <?xml version="1.0" encoding="UTF-8" standalone="yes"?>) is not carried over either. In addition, CDATA sections are unwrapped and their content becomes ordinary text that is escaped on output, instead of being copied as is (there is no semantic difference between the two forms).
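If those constructs matter, one possible direction is SAX's LexicalHandler extension, which reports comments and CDATA boundaries; processing instructions already arrive through the regular processingInstruction(..) callback. The sketch below is not part of the utility above. LexicalEventsDemo and lexicalEvents are made-up names, and it merely lists the events rather than building DOM nodes. It assumes the parser recognizes the standard lexical-handler property, as the JDK's bundled parser does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

public class LexicalEventsDemo {

    // Hypothetical helper: lists comments, processing instructions and CDATA content.
    public static List<String> lexicalEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            private StringBuilder cdata; // non-null only while inside a CDATA section

            @Override
            public void comment(char[] ch, int start, int length) {
                events.add("comment:" + new String(ch, start, length));
            }

            @Override
            public void processingInstruction(String target, String data) {
                events.add("pi:" + target + " " + data);
            }

            @Override
            public void startCDATA() {
                cdata = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (cdata != null) {
                    cdata.append(ch, start, length); // CDATA text may also arrive in chunks
                }
            }

            @Override
            public void endCDATA() {
                events.add("cdata:" + cdata);
                cdata = null;
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Standard SAX2 property id for registering a LexicalHandler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lexicalEvents(
                "<s><!-- note --><?render fast?><v><![CDATA[a < b]]></v></s>"));
    }
}
```

In a DOM-building handler, the same callbacks could create nodes with doc.createComment(..), doc.createProcessingInstruction(..) and doc.createCDATASection(..) and attach them to the element on top of the stack.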
