next up previous
Next: Multiplexed hierarchical modeling Up: Compressing XML with Multiplexed Previous: Existing XML-conscious compressors

  
SAX event encoding

Simple API for XML (SAX) is an event-driven interface in which an application supplies the parser with ``callback'' event handlers that are invoked when certain parsing events occur. For example, in parsing the fragment "<elt att="abcd">XYZ</elt>", a SAX parser would report events


startElement("elt",("att","abcd"))
characters("XYZ")
endElement("elt")



These events provide all the information an XML-compliant application needs. We can leverage the work a SAX parser does by encoding the sequence of events. A decoder can decode these events, and reconstitute an XML document equivalent to the original. Alternatively, the decoder could act as a SAX parser, parsing encoded event sequences instead of text, and sending these events directly to the application.

We chose an event encoding that uses single bytes to encode element start tags, end tags, and attribute names and to indicate events such as ``begin/end characters'', ``begin/end comment'', and so on. The encoder and decoder maintain consistent symbol tables; whenever a new symbol is encountered, the encoder sends the symbol name and the decoder enters it into the table. Table 2 describes the encoding in more detail. As an example, assuming the tag "elt" has been seen before, and is represented by byte "10", but attribute name "att" has not, and the next available byte for attribute names is "0D", our running XML fragment would be encoded as

<elt att= "asdf" > XYZ </elt>
10 0D a t t 00 a s d f 00 FF FE X Y Z 00 FF

This encoding is implemented using the "expat" XML parser, version 1.95. In principle, the encoding can be implemented using any SAX parser. Table 3(a) shows the results of compressing the encoded SAX events; the ``ESAX'' column lists the bits of encoded SAX per original symbol.

Using "gzip" or "ppmd+", ESAX is only 7% or 1% worse than XMILL, respectively. With the other compressors , ESAX is 8-10% better than XMILL and even slightly (3%) better than "bzip2"/"ppm*" on the raw XML. Although ESAX is slower than XMILL, overall compression time is only substantially worse for "gzip" and is actually considerably better (60%) than XMILL for "ppm*". Thus, using "bzip2", ESAX combines XMILL's speedup factor of 2 with an improvement in compression. Furthermore, ESAX, unlike XMILL, can be encoded and decoded online, so that the XML data can be processed incrementally.

 
Table 3: (a) ESAX and (b) MHM compression results, (c) comparison
  (a) Encoded SAX (b) New Models (c) Comparison
name ESAX GZIP BZIP2 PPMD+ PPM* MM MHM MHM* XML/BZ2 % imp
elts 4.081 0.577 0.392 0.473 0.420 0.441 0.381 0.378 0.415 8.9%
proof 2.517 0.198 0.150 0.264 0.140 0.273 0.168 0.106 0.166 36.3%
play 6.329 2.068 1.532 1.504 1.535 1.517 1.489 1.499 1.549 3.2%
sprot 5.541 2.009 1.954 1.848 1.906 1.841 1.778 1.816 2.020 10.1%
stats 2.566 0.593 0.352 0.464 0.364 0.356 0.314 0.311 0.368 15.7%
tal 2.452 0.194 0.119 0.223 0.121 0.223 0.139 0.093 0.121 23.3%
tpc 5.687 1.438 1.081 1.290 1.215 1.236 1.092 1.096 1.100 0.4%
treebank 7.272 1.678 1.406 1.305 1.393 1.246 1.171 1.164 1.524 23.7%
spec2 6.826 1.840 1.507 1.470 1.473 1.515 1.441 1.446 1.539 6.1%
spec3 6.738 2.018 1.683 1.596 1.630 1.653 1.592 1.603 1.722 6.9%
weblog 5.030 2.068 2.267 1.897 1.936 1.783 1.765 1.772 2.389 25.8%
average 5.005 1.335 1.131 1.120 1.103 1.098 1.029 1.026 1.175 12.6%
weighted 3.912 0.877 0.649 0.726 0.660 0.704 0.630 0.615 0.667 7.8%
time(s) 2.56 5.14 8.41 42.32 74.13 39.18 46.97 314.94    

                   
  


next up previous
Next: Multiplexed hierarchical modeling Up: Compressing XML with Multiplexed Previous: Existing XML-conscious compressors
James Cheney
2000-11-24