I recently worked on ingesting data such as programmes, celebrities, and schedules into Atlas. Initially, the plan was to run the ingest in AWS Lambda. However, two of these files were roughly 500MB each, and parsing an entire file in memory (DOM parsing) wasn’t an option because Lambda had a 1.5GB memory limit.
StAX parsing FTW!
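A minimal sketch of the approach, with illustrative element names (`on` as the root, a `celebrities` list wrapping `Person` entries, a `name` field). One swap worth flagging: the walkthrough below unmarshals each `Person` element with a JAXB `Unmarshaller`, but this sketch reads the fields by hand with plain StAX so it runs on any modern JDK (`javax.xml.bind` was removed from the JDK in Java 11):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a JAXB-annotated Person class
class Person {
    final String name;
    Person(String name) { this.name = name; }
}

class CelebritiesParser {
    // Streams through the document without ever holding the whole tree in memory.
    static List<Person> parse(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);

        // Validate that the document starts with the expected root element;
        // require() throws an exception if it does not
        reader.nextTag();
        reader.require(XMLStreamConstants.START_ELEMENT, null, "on");

        // Skip tags such as header, content, copyright, created
        // until the <celebrities> start element is reached
        while (reader.hasNext()
                && !(reader.isStartElement() && "celebrities".equals(reader.getLocalName()))) {
            reader.next();
        }

        List<Person> people = new ArrayList<>();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "Person".equals(reader.getLocalName())) {
                // The original code hands the reader to a JAXB Unmarshaller here;
                // we read the fields manually to keep the sketch dependency-free
                people.add(readPerson(reader));
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "celebrities".equals(reader.getLocalName())) {
                break;
            }
        }
        return people;
    }

    private static Person readPerson(XMLStreamReader reader) throws Exception {
        String name = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "name".equals(reader.getLocalName())) {
                name = reader.getElementText();
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "Person".equals(reader.getLocalName())) {
                break;
            }
        }
        return new Person(name);
    }
}
```

Because the reader only ever holds the current event, memory usage stays flat no matter how large the file is.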
As observed from the above code snippet, a JAXBContext instance for the Person class was first created. Then, an Unmarshaller instance is created from that context. Using the XMLInputFactory instance, we create an XMLStreamReader instance. The current event is validated by checking whether its type is a start element and the local name is on, using the XMLStreamReader#require method; if the validation fails, it throws an exception. After the validation succeeds, we iterate over the elements until a start element with local name celebrities is found. This is done in order to skip tags such as header, content, copyright and created that are not relevant to our use case. With the XMLStreamReader instance now pointing at the celebrities tag, we continue to parse the XML. Within the celebrities tag, we search for a start element with local name Person, and the XMLStreamReader instance is then unmarshalled using the Unmarshaller.
When parsing an XML file, DOM parsing is a fine choice when the file is small and there are no resource constraints. If the file is large, or memory is constrained, then StAX parsing is the better option. The trade-off is that with StAX the developer needs to understand the XML structure beforehand, which can be a problem.
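For contrast, here is a toy DOM sketch (element names illustrative): the whole tree is built in memory before you can query it, which is exactly what ruled it out for our 500MB files under Lambda's memory cap.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class DomExample {
    // Counts <Person> elements — but only after the ENTIRE document
    // has been parsed into an in-memory tree.
    static int countPersons(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("Person").getLength();
    }
}
```

On the upside, once the tree is built you get random access to any node, with no need to know the document's structure up front.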
If you enjoyed the read, drop us a comment below or share the article, follow us on Twitter or subscribe to our #MetaBeers newsletter. Before you go, grab a PDF of the article, and let us know if it’s time we worked together.