Compressing XML as text

The documents chosen for the corpus reflect a wide a variety of sources, in order to avoid skewing our results to a particular kind of data. It is possible that the real uses of XML data will be different from the somewhat artificial uses listed above. Thus, any conclusions drawn based on our corpus must be tentative. Nevertheless, because XML is not yet a mature, everyday technology, this would likely be true of any alternative proposed corpus.

We compressed each document using "gzip"[10], "bzip2"[19], "ppmd+"[5,21] (with a context bound of 5), and "ppm*"[4,21]. The compression results, reported in bits per original XML document character, are shown in Table 1(a). The first column, ``XML'', lists the sizes of the documents in bytes, and the other columns show the bits per symbol rate of the compressors. For each file, the best compression in a table is in bold, and the best compression overall (including Tables 1 and 3) is in bold italic. At the bottom of each table is listed the average bit rate, weighted average (total compressed bits divided by total bytes), and compression time on the corpus.

Markup tends to be more redundant and compressible than text. _Textual_ documents (mostly text, little markup) tend to have high bit rates of around 1.5-2.0. This is below the 2.1-2.2 bpc usually seen with English text, and reflects an averaging of very redundant markup and less redundant text. _Structured_ documents (mostly markup, little text) compress up to several times smaller than text documents. Documents mixing text and structure tend to fall in between these extremes. This distinction between textual and structural data is a recurring theme. Some compressors compress text well, others compress structure well, but few compress both well.

There is no clear winner. The "bzip2" compressor is best overall; it does 20-30% better than "gzip" and about as well as "ppm*". However, each compressor performs poorly on at least one document. For structured documents, there is a large gap between "gzip" and "ppmd+" and the other compressors (as much as a factor of 4), but otherwise there are no obvious _a priori_ reasons to believe a particular compressor will be effective on an unknown document.

Because all XML documents are generated by a fairly restrictive context-free grammar, perhaps grammar-based compression techniques such as Nevill-Manning and Witten's SEQUITUR [18] or Kieffer and Yang's grammar-based codes [13] would do better. We have experimented with a version of YK compression and found that it compressed about as well as "bzip2" and "ppm*" on our corpus. Nevertheless other forms of grammar-based compression, perhaps tailored to XML, might offer improvements.

Plain-text XML compression is easy, fairly fast, and relies on existing, robust compressors. On the other hand, good compression is either very slow (using "ppm*"), or fairly slow and off-line (using "bzip2"). Off-line compression is undesirable for XML because it forces a long wait before document parsing and processing can begin. Moreover, there is no reason to believe that text compressors can take full advantage of the redundancies present in structured XML data, especially in light of the high variation in bit rates among the compressors.

Table 1: (a) Text and (b) XMILL compression results

	(a) Raw XML					(b) XMILL
Name	XML	GZIP	BZIP2	PPMD+	PPM*	XMILL	GZIP	BZIP2	PPMD+	PPM*
elts	113135	0.620	0.415	0.597	0.443	2.683	0.440	0.408	0.408	0.405
proof	252682	0.311	0.166	0.481	0.156	2.846	0.171	0.142	0.283	0.134
play	251898	2.160	1.549	1.576	1.542	5.365	2.053	1.669	1.625	1.707
sprot	10248	2.033	2.020	1.899	1.991	4.613	1.998	2.158	1.943	2.083
stats	669309	0.798	0.368	0.591	0.378	2.145	0.446	0.352	0.360	0.387
tal	734535	0.312	0.121	0.467	0.127	2.167	0.168	0.131	0.223	0.139
tpc	287906	1.475	1.100	1.391	1.279	3.718	1.208	*1.064*	1.141	1.224
treebank	6298	1.782	1.524	1.336	1.434	2.267	1.285	1.367	1.172	1.222
spec1	196308	1.948	1.539	1.517	1.511	6.619	1.942	1.893	1.631	1.697
spec2	201918	2.139	1.722	1.653	1.663	6.656	2.128	2.190	1.752	1.829
weblog	2244	2.053	2.389	1.925	1.986	4.078	2.246	2.642	2.078	2.132
average		1.420	1.175	1.223	1.137	3.924	1.282	1.274	1.143	1.178
weighted		1.001	0.667	0.877	0.680	3.369	0.817	0.729	0.719	0.720
time(s)		2.98	16.55	62.23	399.16	0.82	2.88	7.65	40.38	189.92