next up previous
Next: Related and future work Up: Compressing XML with Multiplexed Previous: SAX event encoding

Multiplexed hierarchical modeling

We now describe a modeling technique called multiplexed hierarchical modeling (MHM), based on the SAX encoding from the previous section and on PPM modeling. The technique employs two basic ideas: _multiplexing_ several text compression models based on XML's syntactic structure (one model for element structure, one for attributes, and so on), and _injecting_ hierarchical element structure symbols into the multiplexed models.

First, we must briefly review the PPM text compression model. PPM models maintain statistics concerning which symbols have been seen in which contexts of preceding symbols. For each symbol, the model is used to estimate a probability range. This probability range is used to transmit the symbol using arithmetic coding. Then, the model is updated to indicate the symbol has been seen in the context. Probability estimation works as follows: If the symbol has been seen in the longest matching context, then the probability is the relative frequency in the context. Otherwise, an ``escape symbol'' is encoded, and the next longest context is tried, and so on. The decoder maintains the same model and uses symbols seen so far and escape symbols to decode incoming symbols and update its model.

Based on our observations and on the behavior of XMILL, we conjectured that in XML, hierarchical contexts based on the path from the root to the symbol might be as accurate or more accurate than preceding-symbol contexts. To test this hypothesis, we modified the PPMD+ sequential model to allow backtracking based on the hierarchical structure. For example, in encoding "<a><b>X</b>Y</a>", hierarchical backtracking modeling (HBM) uses "<a><b>" as a context for "X" but uses "<a>" and not "X</b>" as a context for "Y". We implemented HBM in PPMD+ and found that it only improves compression for a few documents and worsens compression for others. Comparing the predictions made by sequential modeling and hierarchical backtracking revealed that hierarchical contexts are used in several different ways depending on the syntactic class (element, attribute, or string) of the next symbol. Since HBM is unaware of syntactic context, its estimates are averages of several distinct distributions. Consequently hierarchical backtracking predictions are often poorer than the predictions made by the sequential model.

In order to avoid this averaging phenomenon, we decided to multiplex several models, switching among them based on syntactic context supplied by the parser. We used four models: one for element and attribute names, one for element structure, one for attributes, and one for strings. Each model maintains its own state but all share access to one underlying arithmetic coder; encoding and decoding still run incrementally. Our running example gets encoded as:

Model <elt att= "asdf" > XYZ </elt>
Elts: 10         FE   FF
Atts:   0D   a s d f 00 FF      
Chars:             X Y Z 00  
Syms:     a t t 00          
Each row represents the symbols seen by a model.

If neighboring symbols from different syntactic classes are drawn from distinct independent sources, the multiplexed models adapt to these distributions better than a single model can and prediction improves. However, sometimes symbols are strongly correlated with nearby symbols in different syntactic classes. In this case, multiplexing harms compression by breaking up dependences. On average, these factors balance out, resulting in compression slightly better than sequential modeling (see Table 3, ``MM'' column).

Multiplexing enables more effective hierarchical structure modeling. Model multiplexing breaks existing cross-class sequential dependencies; if we can restore them, we can improve prediction. A common case for these dependencies is for the enclosing element tag to be strongly correlated with enclosed data (recall that this is the motivation for XMILL's text container grouping). The multiplexed hierarchical model (MHM) exploits this by by injecting the enclosing tag symbol into the element, attribute, or string model immediately before an element, attribute, or string is encoded. ``Injecting'' a symbol means telling the model that it has been seen but not explicitly encoding or decoding it. Symbols can only be injected if the decoder knows what they are when they are needed, otherwise decoding is impossible. Fortunately, enclosing element tag symbols satisfy this property. In our running example "<elt att="asdf">XYZ</elt>", the models see:

  <elt att= "asdf" > XYZ </elt>
Elts: 10         FE   FF
Atts:   <10> 0D   a s d f 00 <10> FF      
Chars:             <10> X Y Z 00  
Syms:     a t t 00          
where "<nn>" represents reporting a symbol to the model without explicitly encoding or decoding it.

We modified the multiplexed model to perform element symbol injection. We built two versions: ``MHM'', which used the PPMD+ model with 5 context symbols for all syntax classes, and ``MHM*'', which used the PPM* model for the element and attribute contexts instead. The results are summarized in Table 3(b).

The MM model is about 2% better than ESAX with "ppmd+"; the MHM model is 13% better, and the MHM* model is 15% better. The MHM and MHM* models also compare favorably with ESAX/BZ2 and XML/BZ2, ranging from 3-8% better. MHM's chief weakness is on structured documents, where it can compress 20-40% worse than MHM*; on the other hand, MHM* does only 1-4% worse on textual documents. In terms of speed, MM is about the same as ESAX/PPMD+, MHM is 20% slower, and MHM* is about eight times slower. The slowdown occurs because the compressors process many injected symbols as well as the symbols that are actually transmitted. All three compressors, MM, MHM, and MHM*, run in one pass.

The MHM and MHM* models compress the XML files (individually, and as a whole) better than any other methods of which we are aware. The MHM compressor seems to work best for textual XML with sparse markup, whereas the MHM* compressor does much better on structured documents and slightly worse on textual documents. Table 6(c) compares MHM* with XML compressed with "bzip2", the best previously existing compressor. MHM* compresses better on all documents tested, ranging from a few percent better on textual documents to 20-35% better on structured documents. The biggest problem with both compressors is speed. Since MHM* is only about 5% better overall than ESAX compressed with "bzip2", it is hard to justify the factor-of-40 slowdown. On the other hand, MHM compresses textual documents better and is only six times slower, but doesn't compress structured documents as well as MHM.

Figure 2: Weighted average over entire corpus (bits per symbol)
\put(0.5,0){\epsfig{file=chart.eps} }

next up previous
Next: Related and future work Up: Compressing XML with Multiplexed Previous: SAX event encoding
James Cheney