We established a working XML compression benchmark based on text compression, and found that "bzip2" compresses XML best, albeit more slowly than "gzip". Experiments using the XMILL transformation verified that XMILL speeds up and improves compression using "gzip" and bounded-context "ppmd+" by up to 15%, but worsens compression for unbounded-context compressors such as "bzip2" and "ppm*". We presented an alternative, Encoded SAX, which speeds up and improves compression for all compressors, compresses 2-4% better than "bzip2" does on text XML, and has the additional advantage of allowing incremental transmission. Finally, we described a new technique called multiplexed hierarchical modeling (MHM) that combines existing text compressors with knowledge of XML structure. Using the PPMD+ and PPM* models as components, our MHM and MHM* models compress textual XML data about 5% better and structured data 10-25% better than the best existing method.
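As a rough illustration of the text-compression baseline, the sketch below compares "gzip" and "bzip2" on an XML document using Python's standard library. It is not the benchmark used here: the function name `compression_ratios` and the sample document are hypothetical, and a tiny synthetic input says nothing about the figures reported above.

```python
import bz2
import gzip


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size for each compressor."""
    sizes = {
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
    }
    return {name: size / len(data) for name, size in sizes.items()}


# Hypothetical sample document: a small catalog with many similar records.
record = b"<book><title>Title</title><author>Author</author></book>\n"
sample = b"<catalog>\n" + record * 500 + b"</catalog>\n"

ratios = compression_ratios(sample)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f} of original size")
```

A real comparison would run over a corpus of textual and structured XML documents and would also record compression time, since the trade-off between ratio and speed is part of the result.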

James Cheney