Processing XML Files in Java

Robert Torok August 15, 2018

In this post we'll look at the SAX and DOM parsers in Java. The two parsers suit different situations (the most important factor is usually the size of the XML file to process), but we'll also compare how much time and memory they need to parse the same XML file.

SAX stands for Simple API for XML. It's an event-based parser: as it reads the document, it invokes the callback methods you implemented for events such as the start of an element, the end of an element and character data. No in-memory representation of the document is built during parsing, which keeps memory usage low compared to DOM.
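To make the event-based model concrete, here is a minimal sketch that records the callbacks fired for a tiny in-memory document. The class name SaxEventTrace and the sample XML are made up for illustration; only the standard JAXP and SAX classes are real.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: records the element events SAX reports for a document.
class SaxEventTrace {

    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Events arrive in document order; nothing is kept unless the handler keeps it.
        System.out.println(trace("<entry><name>foo</name></entry>"));
        // prints [start:entry, start:name, end:name, end:entry]
    }
}
```

The parser itself stores nothing here: the only state that survives parsing is the list the handler chose to fill.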

DOM (Document Object Model) loads the entire document into memory as a tree structure.
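Because the whole tree stays in memory, any node can be visited in any order and as often as needed. The following is a small sketch of that idea, not part of the phone book application; the class name DomRandomAccessDemo and the sample XML are assumptions made for the example.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class: shows that a DOM tree can be revisited in any order.
class DomRandomAccessDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    // Reads the last <name> first and the first <name> afterwards.
    // Assumes the document contains at least one <name> element.
    static String lastThenFirstName(Document document) {
        NodeList names = document.getElementsByTagName("name");
        String last = names.item(names.getLength() - 1).getTextContent();
        String first = names.item(0).getTextContent();
        return last + ", " + first;
    }

    public static void main(String[] args) {
        Document document = parse(
                "<entries><entry><name>foo bar</name></entry><entry><name>test</name></entry></entries>");
        System.out.println(lastThenFirstName(document)); // prints "test, foo bar"
    }
}
```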

In a nutshell, use DOM when the XML is small and you want to traverse it freely. For huge XML files, prefer the SAX parser: it's faster and needs far less memory. The cost is that SAX reads the document in a single forward pass, so once a part of the XML has been read there's no way to go back to it. In the end, choosing the right parser depends on your requirements.

Let's get started! We will implement XML parsers for a simple PhoneBook application that'll read a phone book file and return the records. An entry in the phone book is represented by the following class:

public class PhoneBookEntry {
    private String name;
    private String number;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getNumber() {
        return number;
    }

    public void setNumber(String number) {
        this.number = number;
    }

}

The phone book XML file looks as follows:

<phonebook>
    <entries>
        <entry>
            <name>foo bar</name>
            <number>12345</number>
        </entry>
        <entry>
            <name>test</name>
            <number>641616</number>
        </entry>
    </entries>
</phonebook>

In order to decouple our phone book file parser library from the client application, let's create an interface. Having the following interface will allow us to easily switch between the different parser implementations:

public interface PhoneBookParser {
    List<PhoneBookEntry> loadPhoneBook(String fileName);
}

Okay, let's implement the SAX parser first!

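import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class PhoneBookParserSaxImpl extends DefaultHandler implements PhoneBookParser {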
    private List<PhoneBookEntry> phoneBook = new LinkedList<>();
    private PhoneBookEntry currentEntry;
    private StringBuilder tempValue;

    private enum Tag {
        NAME, NUMBER, UNDEFINED
    }

    private Tag currentTag = Tag.UNDEFINED;

    @Override
    public void startElement(String namespace, String localName, String qName, Attributes attributes) throws SAXException {
        tempValue = new StringBuilder();
        if (qName.equalsIgnoreCase("entry")) {
            currentEntry = new PhoneBookEntry();
        } else if (qName.equalsIgnoreCase("name")) {
            currentTag = Tag.NAME;
        } else if (qName.equalsIgnoreCase("number")) {
            currentTag = Tag.NUMBER;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equalsIgnoreCase("entry")) {
            phoneBook.add(currentEntry);
        }

        switch (currentTag) {
            case NAME: {
                if (currentEntry != null) {
                    currentEntry.setName(tempValue.toString());
                }
                break;
            }
            case NUMBER: {
                if (currentEntry != null) {
                    currentEntry.setNumber(tempValue.toString());
                }
                break;
            }
        }

        currentTag = Tag.UNDEFINED;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Append the chunk directly; no intermediate String is needed.
        tempValue.append(ch, start, length);
    }

    @Override
    public List<PhoneBookEntry> loadPhoneBook(String fileName) {
        try {
            SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
            SAXParser parser = saxParserFactory.newSAXParser();
            XMLReader xmlReader = parser.getXMLReader();
            xmlReader.setContentHandler(this);
            xmlReader.parse(fileName);
            return phoneBook;
        } catch (IOException | SAXException | ParserConfigurationException e) {
            e.printStackTrace();
        }
        return Collections.emptyList();
    }
}

Notice the startElement, endElement and characters methods. They're callbacks that the parser invokes as it processes the document. You might expect the characters method to be invoked exactly once per tag, but that's not guaranteed: the parser is free to deliver an element's text in several chunks. That's why we collect the text in a StringBuilder in the characters method and only read the complete value in the endElement method. (Also, note the if statements in the startElement method. We only have them for simplicity; this approach is unmaintainable and fragile for real-world objects. We'll come back to this question in another post.)
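The chunking behaviour is easy to observe with a small experiment. The sketch below counts how often characters is called for a single element whose text contains an entity reference, a case where many parsers split the text. The class name CharactersChunkDemo is made up for this example, and the exact number of calls depends on the parser implementation, so only the accumulated text should be relied on.

```java
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class: counts characters() calls and accumulates the text.
class CharactersChunkDemo {

    static final class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int calls;

        @Override
        public void characters(char[] ch, int start, int length) {
            calls++;                        // one more chunk delivered
            text.append(ch, start, length); // accumulate; a chunk may be partial
        }
    }

    static CollectingHandler parse(String xml) {
        CollectingHandler handler = new CollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        CollectingHandler handler = parse("<name>foo &amp; bar</name>");
        // The text is complete regardless of how many chunks were delivered.
        System.out.println(handler.text + " (" + handler.calls + " call(s) to characters)");
    }
}
```

A handler that used only the last chunk instead of appending could silently lose part of the name.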

Now that we have implemented the SAX parser, let's proceed with the DOM implementation:

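import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

public class PhoneBookParserDomImpl implements PhoneBookParser {
    private List<PhoneBookEntry> phoneBook = new LinkedList<>();
    private Document document;

    @Override
    public List<PhoneBookEntry> loadPhoneBook(String fileName) {
        loadPhoneBookInternalImpl(fileName);
        if (document != null) {
            // Every <entry> element becomes one PhoneBookEntry.
            NodeList nodeList = document.getElementsByTagName("entry");
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node node = nodeList.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE) {
                    Element e = (Element) node;
                    String name = e.getElementsByTagName("name").item(0).getTextContent();
                    String number = e.getElementsByTagName("number").item(0).getTextContent();

                    PhoneBookEntry phoneBookEntry = new PhoneBookEntry();
                    phoneBookEntry.setName(name);
                    phoneBookEntry.setNumber(number);
                    phoneBook.add(phoneBookEntry);
                }
            }
        }
        return phoneBook;
    }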

    private void loadPhoneBookInternalImpl(String fileName) {
        try {
            DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
            document = documentBuilder.parse(new File(fileName));
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
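One thing to watch in the DOM implementation: getElementsByTagName(...).item(0) returns null when an entry has no such child element, so a phone book entry without a name or number would cause a NullPointerException. A possible defensive helper is sketched below; the class DomChildText, the method childText and the fallback parameter are hypothetical names for this example, not part of the original code.

```java
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper class for reading optional child elements safely.
class DomChildText {

    // Returns the text of the first descendant element named tagName,
    // or the fallback when the parent has no such element.
    static String childText(Element parent, String tagName, String fallback) {
        Node first = parent.getElementsByTagName(tagName).item(0);
        return first != null ? first.getTextContent() : fallback;
    }

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element entry = parseRoot("<entry><name>test</name></entry>");
        System.out.println(childText(entry, "name", "n/a"));   // prints "test"
        System.out.println(childText(entry, "number", "n/a")); // prints "n/a"
    }
}
```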

Great, now let's load the phone book by using our parsers:

public class Main {
    public static void main(String[] args) {
        // PhoneBookParser phoneBookParser = new PhoneBookParserDomImpl();
        PhoneBookParser phoneBookParser = new PhoneBookParserSaxImpl();
        List<PhoneBookEntry> phoneBookEntries = phoneBookParser.loadPhoneBook("phonebook.xml");
        for (PhoneBookEntry entry : phoneBookEntries) {
            System.out.println("Name: " + entry.getName());
            System.out.println("Number: " + entry.getNumber());
            System.out.println("----------");
        }
    }
}

The output should look as follows:

Name: foo bar
Number: 12345
----------
Name: test
Number: 641616
----------

At the beginning of the post I mentioned that we'd test these implementations with large XML files. I generated phone book XML files with 250k, 500k and 1m entries, so it's time to see how the SAX and DOM parsers cope with these huge files. Out of curiosity I also added JAXB to the comparison. It's not an entirely fair comparison, though, since JAXB is a higher-level API that builds on low-level parsers such as SAX.
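The generator and the measurement code behind these numbers aren't shown in this post, so the following is only a sketch of how such a test could be set up. The class PhoneBookBenchmarkSketch, its method names and the synthetic names and numbers are all assumptions. It writes a phone book in the format shown earlier and times a single SAX pass over it.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: generates a large phone book and times one parse.
class PhoneBookBenchmarkSketch {

    // Writes a phone book with entryCount synthetic entries to a new temporary file.
    static Path generateTempPhoneBook(int entryCount) {
        try {
            Path target = Files.createTempFile("phonebook", ".xml");
            target.toFile().deleteOnExit();
            try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                out.write("<phonebook>\n    <entries>\n");
                for (int i = 0; i < entryCount; i++) {
                    out.write("        <entry>\n");
                    out.write("            <name>name " + i + "</name>\n");
                    out.write("            <number>" + (100000 + i) + "</number>\n");
                    out.write("        </entry>\n");
                }
                out.write("    </entries>\n</phonebook>\n");
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts <entry> elements with SAX; stands in for the parser under test.
    static int countEntries(Path file) {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if ("entry".equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(file.toFile(), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse " + file, e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        Path file = generateTempPhoneBook(250_000);
        long start = System.nanoTime();
        int entries = countEntries(file);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(entries + " entries parsed in " + elapsedMillis + " ms");
    }
}
```

A single un-warmed run like this gives only a rough impression; JIT warm-up and garbage collection can distort the timing, so a benchmark harness or several repeated runs would be needed for reliable numbers.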

The following graph shows how much time they needed to process the test XML files:

Time to process graph

And the memory consumption:

Memory consumption graph

DOM memory footprint with 1m entries:

DOM memory footprint

SAX memory footprint with 1m entries:

SAX memory footprint
