next up previous
Next: Existing XML-conscious compressors Up: Compressing XML with Multiplexed Previous: XML background

  
Compressing XML as text

XML is stored in plain text files, so the most obvious approach to XML compression is to use existing text compressors. To establish a benchmark against which we can compare other methods of XML compression, we constructed a corpus containing a wide variety of forms of XML data. The corpus includes XML versions of formal proofs and safety-annotated assembly language ("proof", "tal"), as well as Shakespearean plays ("play"), statistical and scientific databases ("stats", "elts"), and XML versions of the XML specification ("spec1", "spec2"). We also included some of the small XML examples ("sprot", "tpc", "treebank", "weblog") distributed with XMILL; unfortunately the full corpus was not available for these experiments.

The documents chosen for the corpus reflect a wide a variety of sources, in order to avoid skewing our results to a particular kind of data. It is possible that the real uses of XML data will be different from the somewhat artificial uses listed above. Thus, any conclusions drawn based on our corpus must be tentative. Nevertheless, because XML is not yet a mature, everyday technology, this would likely be true of any alternative proposed corpus.

We compressed each document using "gzip"[10], "bzip2"[19], "ppmd+"[5,21] (with a context bound of 5), and "ppm*"[4,21]. The compression results, reported in bits per original XML document character, are shown in Table 1(a). The first column, ``XML'', lists the sizes of the documents in bytes, and the other columns show the bits per symbol rate of the compressors. For each file, the best compression in a table is in bold, and the best compression overall (including Tables 1 and 3) is in bold italic. At the bottom of each table is listed the average bit rate, weighted average (total compressed bits divided by total bytes), and compression time on the corpus.

Markup tends to be more redundant and compressible than text. _Textual_ documents (mostly text, little markup) tend to have high bit rates of around 1.5-2.0. This is below the 2.1-2.2 bpc usually seen with English text, and reflects an averaging of very redundant markup and less redundant text. _Structured_ documents (mostly markup, little text) compress up to several times smaller than text documents. Documents mixing text and structure tend to fall in between these extremes. This distinction between textual and structural data is a recurring theme. Some compressors compress text well, others compress structure well, but few compress both well.

There is no clear winner. The "bzip2" compressor is best overall; it does 20-30% better than "gzip" and about as well as "ppm*". However, each compressor performs poorly on at least one document. For structured documents, there is a large gap between "gzip" and "ppmd+" and the other compressors (as much as a factor of 4), but otherwise there are no obvious _a priori_ reasons to believe a particular compressor will be effective on an unknown document.

Because all XML documents are generated by a fairly restrictive context-free grammar, perhaps grammar-based compression techniques such as Nevill-Manning and Witten's SEQUITUR [18] or Kieffer and Yang's grammar-based codes [13] would do better. We have experimented with a version of YK compression and found that it compressed about as well as "bzip2" and "ppm*" on our corpus. Nevertheless other forms of grammar-based compression, perhaps tailored to XML, might offer improvements.

Plain-text XML compression is easy, fairly fast, and relies on existing, robust compressors. On the other hand, good compression is either very slow (using "ppm*"), or fairly slow and off-line (using "bzip2"). Off-line compression is undesirable for XML because it forces a long wait before document parsing and processing can begin. Moreover, there is no reason to believe that text compressors can take full advantage of the redundancies present in structured XML data, especially in light of the high variation in bit rates among the compressors.

 
Table 1: (a) Text and (b) XMILL compression results
  (a) Raw XML (b) XMILL
Name XML GZIP BZIP2 PPMD+ PPM* XMILL GZIP BZIP2 PPMD+ PPM*
elts 113135 0.620 0.415 0.597 0.443 2.683 0.440 0.408 0.408 0.405
proof 252682 0.311 0.166 0.481 0.156 2.846 0.171 0.142 0.283 0.134
play 251898 2.160 1.549 1.576 1.542 5.365 2.053 1.669 1.625 1.707
sprot 10248 2.033 2.020 1.899 1.991 4.613 1.998 2.158 1.943 2.083
stats 669309 0.798 0.368 0.591 0.378 2.145 0.446 0.352 0.360 0.387
tal 734535 0.312 0.121 0.467 0.127 2.167 0.168 0.131 0.223 0.139
tpc 287906 1.475 1.100 1.391 1.279 3.718 1.208 1.064 1.141 1.224
treebank 6298 1.782 1.524 1.336 1.434 2.267 1.285 1.367 1.172 1.222
spec1 196308 1.948 1.539 1.517 1.511 6.619 1.942 1.893 1.631 1.697
spec2 201918 2.139 1.722 1.653 1.663 6.656 2.128 2.190 1.752 1.829
weblog 2244 2.053 2.389 1.925 1.986 4.078 2.246 2.642 2.078 2.132
average   1.420 1.175 1.223 1.137 3.924 1.282 1.274 1.143 1.178
weighted   1.001 0.667 0.877 0.680 3.369 0.817 0.729 0.719 0.720
time(s)   2.98 16.55 62.23 399.16 0.82 2.88 7.65 40.38 189.92
  


next up previous
Next: Existing XML-conscious compressors Up: Compressing XML with Multiplexed Previous: XML background
James Cheney
2000-11-24