
Related and future work

XMLZIP and XMILL are the only XML-specific compressors of which we are aware; undoubtedly others are in development. In the Binary XML Content Format (WBXML) [9] of the Wireless Application Protocol (WAP) standards group, XML element and attribute tags are tokenized with respect to a fixed symbol table. WBXML is similar to ESAX, but packs more information into some bytes, which may hinder compression. Kanne and Moerkotte [11] have addressed storing XML efficiently for database querying.
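As a rough illustration of tokenizing tags against a fixed symbol table, the sketch below maps a stream of SAX-like events to bytes. The table contents, the marker bytes, and the event format are all hypothetical choices made for this example; they are not the actual WBXML code assignments or the ESAX encoding.

```python
# Hypothetical fixed symbol table: NOT the real WBXML or ESAX code assignments.
TAG_TABLE = {"book": 0x05, "title": 0x06, "author": 0x07}
END = 0x01   # assumed marker for "close the current element"
STR = 0x03   # assumed marker for "inline, NUL-terminated text follows"


def tokenize(events):
    """Encode (kind, value) events as bytes using the fixed table above."""
    out = bytearray()
    for kind, value in events:
        if kind == "start":
            out.append(TAG_TABLE[value])      # one byte per known tag
        elif kind == "end":
            out.append(END)                   # end tags need no name
        elif kind == "text":
            out.append(STR)
            out += value.encode("utf-8") + b"\x00"
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return bytes(out)


events = [
    ("start", "book"),
    ("start", "title"),
    ("text", "Hi"),
    ("end", "title"),
    ("end", "book"),
]
encoded = tokenize(events)
```

Because the table is fixed in advance, tag names never appear in the output; only text content is carried verbatim.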

Both grammar-conscious and grammar-inferring text compression are relevant to XML compression, since XML has context-free structure containing unstructured text. Katajainen et al. [12] and Cameron [3] were the first to investigate grammar-based compression; more recently, Lake [14] combined PPM and grammar modeling. Nevill-Manning and Witten [18] and Yang and Kieffer [13] have investigated text compression using grammars learned from the text. Stream-splitting techniques in machine-code compression also resemble our model multiplexing approach (Ernst et al. [8], Lucco [16]).

The MHM model we used was limited to one level of hierarchical context. We observed that additional element context helps considerably when compressing structured data, but hurts compression of textual data and slows the compressor. Furthermore, compressing structured data well required the considerably slower PPM*. We believe that it would be worthwhile to redesign existing compressors with both hierarchical and sequential structure in mind, in order to get better compression without sacrificing performance.

Inspired by XMILL's path expression/user compressor language for guiding compression, we also speculate that user assistance can improve compression further. Such assistance might take the form of XMILL command-line options or constraints such as DTDs. Knowing the constraints enables the model to infer many element or attribute symbols, which then need not be transmitted. Conversely, it would also be good to have "data mining" tools that assist the user in constructing a set of constraints that characterize an XML source. These tools could be used to find constraints that help compress the data; indeed, the resulting compression could be one measure of how accurate the proposed constraints are. We believe there is much room for improvement in this area of XML compression and data mining.
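To make the idea of inferring symbols from constraints concrete, here is a minimal sketch under strong simplifying assumptions. The content model below is a hypothetical stand-in for a DTD: each element lists, position by position, the set of child tags allowed there. The names `CONTENT_MODEL`, `symbols_to_send`, and `reconstruct` are invented for illustration and do not describe an implemented system.

```python
# Hypothetical, drastically simplified content model (not a real DTD parser):
# for each parent element, the set of child tags allowed at each position.
CONTENT_MODEL = {
    "book": [{"title"}, {"author"}, {"paperback", "hardcover"}],
}


def symbols_to_send(parent, children):
    """Return only the child tags that the decoder cannot infer."""
    model = CONTENT_MODEL[parent]
    if len(children) != len(model):
        raise ValueError("children do not match the content model")
    sent = []
    for allowed, child in zip(model, children):
        if child not in allowed:
            raise ValueError(f"{child!r} not allowed here")
        if len(allowed) > 1:          # a real choice: must be transmitted
            sent.append(child)
    return sent


def reconstruct(parent, sent):
    """Rebuild the full child sequence from the transmitted symbols."""
    remaining = iter(sent)
    children = []
    for allowed in CONTENT_MODEL[parent]:
        if len(allowed) == 1:         # forced by the constraint: inferred
            children.append(next(iter(allowed)))
        else:
            children.append(next(remaining))
    return children


sent = symbols_to_send("book", ["title", "author", "hardcover"])
restored = reconstruct("book", sent)
```

In this toy case only one of the three child symbols has to be sent; the other two are forced by the constraint and are recovered by the decoder.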


James Cheney
2000-11-24