Existing XML-conscious compressors

XMLZIP parses XML data and breaks the structural tree into many components: a ``root'' component containing all data up to depth d from the root, and one component for each of the remaining subtrees starting at depth d. The root component is modified by adding references to the remaining subtrees, and the components are compressed by Java's built-in ZIP/DEFLATE archive library. This does not improve compression over compressing the whole document: in fact, using XMLZIP with d=2, the example XML file that comes with the XMLZIP distribution compresses to 225512 bytes, whereas ordinary ZIP compresses it to only 118848 bytes. Increasing d tends to decrease compression performance, since redundancies across separated subtrees cannot be used in compression. XMLZIP's chief benefit that it allows limited random access to XML documents without storing the whole document uncompressed or in memory.

XMILL parses the XML data, transforms it, and then compresses the result using an ordinary text compressor. The XMILL transformation splits the data into three components: one for element and attribute symbol names, one for plain text, and one for the document tree structure. The XML fragment "<elt att="abcd">XYZ</elt>" might be transformed to "[elt,att]; [abcd, XYZ]; [S0 S1 T X T X]", where "Si" refers to symbol table entry "i", "T" refers to the next unused text string, and "X" terminates an element or attribute list. In fact, XMILL allows multiple text containers within the text component, in which case a reference "Ti" refers to the next unused string in container "i". By default, XMILL uses a container-building heuristic of ``group by enclosing structure'', so for example all text immediately enclosed in elements "<elt>...</elt>" is put in one container. Users can provide other heuristics using a _path expression_/_user compressor_ language of command line options. This facility makes better compression possible, but requires expertise and effort on the part of the user.

Table 1(b) illustrates XMILL's compression performance on our corpus, using the default options. The ``XMILL'' column shows the bit rate of the XMILL transformation. The other columns show the bit rates obtained by the compressors on the transformed data.

XMILL improves overall compression for "gzip" and "ppmd+" by about 18%. Structured document compression with "gzip" and "ppmd+" also improves by 20-40%. Surprisingly, for compressors besides "gzip", XMILL-transformed documents compress 5-10% _worse_ than the originals. We speculate the reason is that XMILL attempts to do something that these compressors already do better. The "bzip2", "ppmd+", and "ppm*" compressors can take advantage of contexts much longer than the immediately enclosing XML element or attribute name. XMILL disrupts these contexts, cutting off opportunities for compression.

The XMILL transformation is itself fast, and speeds up subsequent compression by as much as 50% because compressors ultimately see fewer symbols. On the other hand, XMILL scatters parts of the document, making incremental processing impossible. Also, although the path-expression/user-compressor language is powerful, there is no guarantee that users will be able to put it to its best use. In the next section we will describe an alternative transformation that is also fast, compresses almost as well with "gzip", compresses better with other compressors, and works incrementally.

Table 2: SAX event encoding

Class	Symbol	Meaning	Next Class	Example
Elts	00-F9	Element start tag	Sym? Atts Elts	...<elt...
	FA	CDATA section	Str	...<![CDATA[...]]>...
	FB	Entity	Str	...&ent;...
	FC	Processing instruction	Str Str	...<?app inst?>...
	FD	Comment	Str	...<!-...->...
	FE	Characters	Str	...asdf...
	FF	Element end tag		...</elt>...
Atts	00-FE	Attribute name	Sym? Str	...<elt att="val"...
	FF	Attribute list end
Str	01-FF	Character data	Str
	00	End of string