Both grammar-conscious and grammar-inferring text compression are relevant to XML compression, since XML has context-free structure containing unstructured text. Katajainen et al. [12] and Cameron [3] were the first to investigate grammar-based compression; more recently, Lake [14] combined PPM and grammar modeling. Nevill-Manning and Witten [18] and Yang and Kieffer [13] have investigated text compression using grammars learned from the text. Stream-splitting techniques in machine-code compression also resemble our model multiplexing approach (Ernst et al. [8], Lucco [16]).
The MHM model we used was limited to one level of hierarchical context. We observed that additional element context helps considerably in compressing structured data, but it harms compression of textual data and slows the compressor down. Furthermore, compressing structured data well required the considerably slower PPM*. We believe it would be worthwhile to redesign existing compressors with both hierarchical and sequential structure in mind, in order to obtain better compression without sacrificing performance.
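To make the trade-off concrete, the following is a minimal, hypothetical sketch (in Python, not our compressor's actual code) of multiplexing text models by a single level of element context. The class and function names are ours, and a simple order-0 frequency model stands in for PPM; keying models on the full element path instead of the enclosing element alone would multiply the number of models and spread their statistics thinner, which is one reason deeper context can hurt textual data.
\begin{verbatim}
# Hypothetical sketch of one-level model multiplexing: each enclosing
# element name selects its own adaptive model.
import math
from collections import defaultdict

class FreqModel:
    """Adaptive order-0 byte model; a stand-in for a real PPM model."""
    def __init__(self):
        self.counts = defaultdict(lambda: 1)   # Laplace-smoothed counts
        self.total = 256                       # one pseudo-count per byte

    def cost_and_update(self, sym):
        bits = -math.log2(self.counts[sym] / self.total)
        self.counts[sym] += 1
        self.total += 1
        return bits                            # ideal arithmetic-code length

class OneLevelMux:
    """Route each text byte to the model of its enclosing element."""
    def __init__(self):
        self.models = defaultdict(FreqModel)

    def code_cost(self, element, sym):
        return self.models[element].cost_and_update(sym)

mux = OneLevelMux()
bits = sum(mux.code_cost("title", ord(c)) for c in "XML compression")
print(round(bits, 1))
\end{verbatim}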
Inspired by XMILL's path expression/user compressor language for guiding compression, we also speculate that user assistance can improve compression further. Such assistance might take the form of command-line options as in XMILL, or constraints such as DTDs. Knowing such constraints enables the model to infer many element or attribute symbols, which then need not be transmitted or received. Conversely, it would also be useful to have ``data mining'' tools that assist the user in constructing a set of constraints characterizing an XML source. Such tools could be used to find constraints that help compress the data; indeed, this could serve as one measure of how accurate proposed constraints are. We believe there is much room for improvement in this area, both in XML compression and in data mining.
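As a purely hypothetical illustration of constraint-driven inference: when a DTD content model is a fixed sequence, an encoder and decoder that both know the DTD can omit the child element names entirely. The sketch below is ours, assumes a trivial DTD representation, and is not XMILL's or our compressor's actual mechanism.
\begin{verbatim}
# Hypothetical illustration: element names fully determined by a DTD
# content model need not be transmitted at all.
DTD = {
    # <!ELEMENT book (title, author, year)> -- fixed sequence, no choice
    "book": ["title", "author", "year"],
}

def encode_children(parent, children, dtd):
    """Send only what the decoder cannot infer from the DTD."""
    if dtd.get(parent) == children:   # matches the content model exactly
        return []                     # nothing to transmit
    return children                   # otherwise send the names explicitly

def decode_children(parent, received, dtd):
    return received or dtd.get(parent, [])

sent = encode_children("book", ["title", "author", "year"], DTD)
print(sent)                                  # []
print(decode_children("book", sent, DTD))    # inferred child names
\end{verbatim}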