Lavanya Jeyaratnam

parsing xml in chunks

I recently worked on ingesting data such as programmes, celebrities, and schedules into Atlas. Initially, the plan was to run the ingest in AWS Lambda. However, two of the files were roughly 500MB each, and parsing an entire file in memory (DOM parsing) wasn’t an option: a DOM tree can occupy several times the size of the source file, and Lambda had a 1.5GB memory limit.

stax parsing ftw!

The code first creates a JAXBContext for the Person class, and then creates an Unmarshaller instance from that JAXBContext.

Next, we create an XMLInputFactory and use it to create an XMLStreamReader. The current event is then validated with the XMLStreamReader#require method, which checks that the event type is a start element and that its local name is on. If the validation fails, the method throws an XMLStreamException.
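As a rough illustration of the validation step, here is a minimal sketch using only the StAX classes that ship with the JDK. The class name RootCheck, the method openAtRoot and the root element name feed are hypothetical and chosen for the example; it also assumes the document has no DOCTYPE, because nextTag() only skips whitespace, comments and processing instructions.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RootCheck {

    // Opens a cursor-style StAX reader and checks the root element's local name.
    // "expectedRoot" is supplied by the caller; the real feed's root name is not assumed here.
    static XMLStreamReader openAtRoot(Reader in, String expectedRoot) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);

        // A new reader sits on START_DOCUMENT; move to the first start tag.
        reader.nextTag();

        // Throws XMLStreamException unless the current event is a start element
        // with the expected local name. A null namespace URI means "do not check".
        reader.require(XMLStreamConstants.START_ELEMENT, null, expectedRoot);
        return reader;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical input: a root element called "feed".
        XMLStreamReader reader = openAtRoot(new StringReader("<feed><header/></feed>"), "feed");
        System.out.println("Positioned on: " + reader.getLocalName());
        reader.close();
    }
}
```

Failing fast here means a file with an unexpected layout is rejected before any of its content is read.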

Once the validation succeeds, we iterate over the events until we find a start element with the local name celebrities. This skips tags such as header, content, copyright and created, which are not relevant to our use case. With the XMLStreamReader now pointing at the celebrities tag, we continue parsing and look for start elements with the local name Person inside it. Each time one is found, the Unmarshaller unmarshals it from the XMLStreamReader into an instance of the Person class.
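The traversal described above can be sketched roughly as follows. To stay runnable on a plain JDK, this version reads a hypothetical name child by hand instead of using JAXB; with JAXB on the classpath, the body of readPersonName would be replaced by a call to Unmarshaller#unmarshal(XMLStreamReader, Class). The class name CelebrityStream, the method names and the name child element are assumptions made for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CelebrityStream {

    // Streams through the document and collects one value per Person element,
    // so only the Person currently being read is held in memory.
    static List<String> readPersonNames(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> names = new ArrayList<>();

        // Phase 1: skip everything (header, copyright, ...) up to <celebrities>.
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "celebrities".equals(reader.getLocalName())) {
                break;
            }
        }

        // Phase 2: handle each <Person> until </celebrities> is reached.
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "Person".equals(reader.getLocalName())) {
                names.add(readPersonName(reader));
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "celebrities".equals(reader.getLocalName())) {
                break;
            }
        }
        reader.close();
        return names;
    }

    // Stand-in for JAXB unmarshalling: reads the hypothetical <name> child
    // and consumes events up to the closing </Person> tag.
    static String readPersonName(XMLStreamReader reader) throws XMLStreamException {
        String name = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "name".equals(reader.getLocalName())) {
                name = reader.getElementText();
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "Person".equals(reader.getLocalName())) {
                break;
            }
        }
        return name;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample document, much smaller than the real feed.
        String xml = "<feed><header>h</header><celebrities>"
                + "<Person><name>Ada</name></Person>"
                + "<Person><name>Grace</name></Person>"
                + "</celebrities></feed>";
        System.out.println(readPersonNames(new StringReader(xml)));
    }
}
```

One detail to watch when swapping in JAXB: after a successful unmarshal, the reader is left on the event following the element's end tag, so the loop should check the current event before calling next() again, otherwise an adjacent Person could be skipped.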

conclusion

When parsing an XML file, DOM parsing is a reasonable choice if the file is small and there are no resource constraints. If the file is large or memory is constrained, StAX parsing is the better choice. The trade-off is that, with StAX parsing, the developer needs to know the XML structure beforehand, which could be a problem.
