Handling massive XML files with Perl and shell scripts

Recently, we were faced with an interesting problem. Our capture system uses master data from host systems for validation. Master data changes infrequently, but change it does, so it is imported from the host system daily. The master data is supplied as XML files, and massive ones too: a single file can be more than 300 MB in size. Every time, the entire master data set is handed to our system. Parsing through all of this data and inserting it into a database is time consuming, so we needed a solution where we could compare the current XML file with the last imported one and insert only the differences.

What is the best way to compare two very large XML files? What are the issues involved? Going the DOM way is fraught with risk, because an in-memory XML tree takes around 10 times the file's size on disk: a 320 MB file would expand to roughly 3.2 GB in memory. It would be difficult, if not impossible, to keep two such trees in memory.

We first evaluated some tools to do the task for us. The evaluation was done with two XML files of around 327 MB each, on a VMware VM with 3 GB of RAM and 10 GB of swap space, backed by an Intel Core i5 in a Dell Vostro 3700 laptop.

Libxmldiff looked like an efficient way of comparing large XML files; its web site promises to compare files of 100 MB and more easily. However, with two files of 327 MB each, it gave a segmentation fault.
Xmlcmp.de has an XML toolbox which promises to compare and merge large XML files easily. The web site states:

“The comparison of two 1.400 MB xml-files needs only 6 minutes.
(Configuration: AMD Opteron Processor 244, 2 processors, 7 GB memory, 2 RAID-10-Diskarrays)”
The product largely lives up to its promise. It takes some time to understand the configuration, but once that is sorted out, the tool compared our two files in around 6 minutes, which is not bad at all. However, the price is a bit steep (1,700 euros for a single-server license), so we decided to explore a bit more.
Since there were not many other products on the market, we decided to write the code ourselves. Perl was an obvious choice, due to its strong text processing capabilities and the fact that it is already installed on our production systems.
The first job was to choose the right XML parser. DOM or SAX? The answer was clear: it had to be SAX, or else we could break the file into pieces and use DOM on each piece. Both approaches were tried.
There are many XML parsers in the Perl world. Which one is fastest? Some existing studies helped. Here are two benchmarks:

http://www.xml.com/pub/a/2004/09/15/pl-perf.html

http://xmlbench.sourceforge.net/results/benchmark/index.html

The first one benchmarks the various Perl XML parser implementations. XML::SAX::ExpatXS comes out fastest, and our tests confirmed it.

The approach we followed was to

  • Convert both files to CSV
  • Sort both files according to a known key
  • Diff the sorted files using the Unix diff command (see the sketch after this list)
  • Convert the resulting diff back to an XML format.
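
As a rough illustration of the sort and diff steps, here is a minimal Perl sketch that simply shells out to sort and diff. It assumes the CSV files have already been produced and that the record key is the first comma-separated field; the file names are only illustrative.

#!/usr/bin/perl
# Sketch of the sort and diff steps; file names and key position are assumptions.
use strict;
use warnings;

for my $f ('old', 'new') {
    system("sort -t, -k1,1 $f.csv > $f.sorted.csv") == 0
        or die "sort of $f.csv failed: $?";
}

# diff exits with status 1 when the files differ, which is the expected case here,
# so only exit codes above 1 are treated as errors.
system("diff old.sorted.csv new.sorted.csv > master.diff");
die "diff failed: $?" if ($? >> 8) > 1;

The resulting master.diff is what gets converted back to XML in the last step.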

Of the different steps above, the first one is the most time consuming. So we tried out different approaches to convert an XML file to a CSV file.

Many small sub trees

The first approach was to read through the file and extract one record element at a time using Perl regular expressions. For each element, construct a small DOM tree with the ExpatXS parser, then extract the attributes and sub-elements from that tree and write them out to a CSV file.
Even though this approach looks crude, it is surprisingly effective: a 327 MB file can be converted to a CSV file in around 6 minutes and 7 seconds.
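
In outline, the conversion loop looks something like the sketch below. It assumes one record per <bp>...</bp> block with the addrCity and addrCompanyName fields from the STX example later in this post, uses XML::LibXML for the per-record parse instead of the ExpatXS-based DOM we actually used, and ignores CSV escaping, so treat it as illustrative only.

#!/usr/bin/perl
# Illustrative sketch of the "many small sub trees" approach.
# Assumptions: one <bp> record per block, fields addrCity and addrCompanyName,
# and XML::LibXML for the per-record parse (not the parser used in our code).
use strict;
use warnings;
use XML::LibXML;

open my $in,  '<', 'input.xml'  or die "input.xml: $!";
open my $out, '>', 'output.csv' or die "output.csv: $!";

my ($record, $inside) = ('', 0);
while (my $line = <$in>) {
    $inside = 1 if $line =~ /<bp[\s>]/;
    $record .= $line if $inside;
    if ($inside && $line =~ m{</bp>}) {
        # Parse just this small sub tree and pull out the fields we need.
        my $doc = XML::LibXML->load_xml(string => $record);
        my $city    = $doc->findvalue('/bp/addrCity');
        my $company = $doc->findvalue('/bp/addrCompanyName');
        print {$out} "$city,$company\n";
        ($record, $inside) = ('', 0);
    }
}
close $in;
close $out;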

STX

STX, or Streaming Transformations for XML (http://stx.sourceforge.net/), is a method for quickly transforming large XML documents using an XSLT-like language. There are implementations in Java (Joost) and Perl (XML::STX). The Perl version is available at http://search.cpan.org/~pcimprich/XML-STX-0.43/STX.pm; it is at version 0.43 and seems to have been last updated in 2004.
The idea here is to write a small XSLT-like stylesheet and run it using the STX processor. The stylesheet below matches elements at /bpColl/bp/addrCity and /bpColl/bp/addrCompanyName and writes out their text content using select=".". It can be run from a Perl script which uses XML::STX (which needs to be installed).

<?xml version="1.0" encoding="ISO-8859-1"?>
<stx:transform version="1.0" xmlns:stx="http://stx.sourceforge.net/2002/ns" xmlns="http://www.w3.org/1999/xhtml">
<stx:template match="/bpColl/bp/addrCity">
<stx:value-of select="."/>
<stx:text>,</stx:text>
</stx:template>
<stx:template match="/bpColl/bp/addrCompanyName">
<stx:value-of select="."/>
<stx:text>
</stx:text>
</stx:template>
</stx:transform>

This is a very nice, elegant API. However, the performance leaves much to be desired: it takes around 4 hours to convert the XML file to CSV format. The STX implementation uses XML::SAX::ExpatXS underneath, so the raw parsing should be fast; the slowdown presumably comes from overheads in STX itself.

XML::SAX::ExpatXS

It is possible to parse the whole file directly with this SAX parser. This is quite fast: it finishes in 6 minutes and 30 seconds, and the time goes down to 4 minutes and 52 seconds if we ignore the character nodes of elements we are not interested in.
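
A stripped-down version of such a SAX handler might look like the sketch below. It assumes the <bp> records and the addrCity and addrCompanyName fields from the STX example, writes to standard output, and skips proper CSV quoting, so it is an illustration rather than the actual converter.

#!/usr/bin/perl
# Sketch of the pure SAX approach with XML::SAX::ExpatXS.
# Assumptions: <bp> records with addrCity and addrCompanyName fields,
# output to standard output, no CSV quoting.
use strict;
use warnings;

package CsvHandler;
use base 'XML::SAX::Base';

my %wanted = map { $_ => 1 } qw(addrCity addrCompanyName);

sub start_element {
    my ($self, $el) = @_;
    $self->{current} = $el->{LocalName};
    $self->{row} = {} if $el->{LocalName} eq 'bp';
}

sub characters {
    my ($self, $chars) = @_;
    # Accumulate character data only for the fields we care about.
    my $name = $self->{current} // '';
    $self->{row}{$name} .= $chars->{Data} if $wanted{$name};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'bp') {
        print join(',', map { $self->{row}{$_} // '' } qw(addrCity addrCompanyName)), "\n";
    }
    $self->{current} = '';
}

package main;
use XML::SAX::ExpatXS;

my $parser = XML::SAX::ExpatXS->new(Handler => CsvHandler->new());
$parser->parse_uri($ARGV[0] // 'input.xml');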

XMLGAWK

Xmlgawk is a very interesting utility that applies the awk idiom to parsing XML documents: you write what is essentially an awk script with XML-specific patterns. The script itself is very short, and performance is fairly good; a simple script to convert some nodes to CSV format took around six and a half minutes.

Preprocessing the XML file before Perl parsing

The input XML file can be preprocessed with grep or sed to remove nodes that are not required, and this is very fast. The XML file we used consisted of multiple records, each record represented by an XML element that contains one XML element per field. If each element appears on a separate line, unwanted elements are easy to remove with a command such as
grep -Ev '<elementname' input.xml > cleanedinput.xml
Multiple element types can be removed in one pass by separating the patterns with |, for example grep -Ev '<elementone|<elementtwo' input.xml > cleanedinput.xml. Running the xmlgawk script or the Perl program after this preprocessing reduces the processing time by 75%.

Some other promising approaches

Some other approaches were considered but were not followed up due to paucity of time. One was to use a Java-based solution such as Saxon; Piccolo is a very fast Java SAX parser. XML::Filter::Dispatcher (http://search.cpan.org/~rbs/XML-Filter-Dispatcher-0.52/lib/XML/Filter/Dispatcher.pm) and XML::Twig are two other options, as mentioned in the article below (a bit dated though):
http://tomacorp.com/perl/xml/saxvstwig.html
This forum thread has a very good discussion of different techniques for parsing with SAX:
http://www.velocityreviews.com/forums/t167818-perl-sax2-slow.html