DTDPPM: DTD-Conscious Compression
DTDPPM is a new version of XMLPPM that performs DTD-specific optimizations to compress XML documents that are valid with respect to a given DTD. These optimizations include:
- Ignorable whitespace stripping: Whitespace occurring in elements whose content model does not include #PCDATA is ignored.
- Symbol table reuse: The element and attribute names in the DTD are used to build a symbol table shared by the encoder and decoder, instead of building the symbol tables dynamically as in XMLPPM.
- Element symbol prediction: Element symbol codes that can be predicted from context (that is, when there is only one possible "next symbol" in a valid document) are omitted. Similarly, end-element codes are omitted when a valid document can only have empty content at the current position.
- Attribute list coding: In valid XML, the attributes that can occur at a given position are known ahead of time, and the order of an attribute list doesn't matter, so attribute values are sent in key-sorted order. Care is also taken to avoid sending redundant information such as FIXED attributes, default values, and enumerated values.
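Two of the optimizations listed above, ignorable whitespace stripping and attribute list coding, can be illustrated with a small Python sketch. This is not DTDPPM's code or its actual wire format: the `TOY_DTD` dictionary, the `Marker` codes, and all function names are hypothetical and chosen only to make the idea concrete. The sketch assumes that an enumerated value can be sent as its index in the declared list and that a value equal to the declared default can be replaced by a short marker.

```python
from enum import Enum

class Marker(Enum):
    """Hypothetical one-symbol codes standing in for an attribute value."""
    ABSENT = 0   # attribute not present in the document
    DEFAULT = 1  # attribute has (or falls back to) its declared default

# Toy DTD description (illustrative structure, not DTDPPM's internal format).
TOY_DTD = {
    "list": {"pcdata": False, "attrs": {}},
    "item": {
        "pcdata": True,
        "attrs": {
            "version": ("fixed", "1.0"),
            "lang": ("default", "en"),
            "status": ("enum", ["draft", "final"]),
            "id": ("cdata", None),
        },
    },
}

def strip_ignorable_whitespace(element, children, dtd=TOY_DTD):
    """Drop whitespace-only text children when the content model has no #PCDATA.

    Children are modelled as a list in which str items are text nodes and
    anything else is a child element.
    """
    if dtd[element]["pcdata"]:
        return list(children)
    return [c for c in children if not (isinstance(c, str) and c.strip() == "")]

def encode_attributes(element, attrs, dtd=TOY_DTD):
    """Return one code per non-FIXED declared attribute, in key-sorted order."""
    codes = []
    for name in sorted(dtd[element]["attrs"]):
        kind, info = dtd[element]["attrs"][name]
        value = attrs.get(name)
        if kind == "fixed":
            continue  # value is known from the DTD, so nothing is sent
        if kind == "default":
            codes.append(Marker.DEFAULT if value in (None, info) else value)
        elif kind == "enum":
            codes.append(Marker.ABSENT if value is None else info.index(value))
        else:
            codes.append(Marker.ABSENT if value is None else value)
    return codes

def decode_attributes(element, codes, dtd=TOY_DTD):
    """Rebuild the attribute dictionary from codes produced by encode_attributes."""
    attrs = {}
    remaining = iter(codes)
    for name in sorted(dtd[element]["attrs"]):
        kind, info = dtd[element]["attrs"][name]
        if kind == "fixed":
            attrs[name] = info  # restored from the DTD, never transmitted
            continue
        code = next(remaining)
        if code is Marker.ABSENT:
            continue
        if kind == "default":
            attrs[name] = info if code is Marker.DEFAULT else code
        elif kind == "enum":
            attrs[name] = info[code]
        else:
            attrs[name] = code
    return attrs
```

For example, under this toy DTD, `encode_attributes("item", {"status": "final", "id": "a1", "lang": "en", "version": "1.0"})` yields the three codes `"a1"`, `Marker.DEFAULT`, and `1`. No attribute names are sent, because both sides derive the same sorted name list from the DTD, and `decode_attributes` restores the full attribute set, including the FIXED `version`.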
DTDPPM implementation
A preliminary (fairly stable, with a few known bugs) implementation has been completed and is available here. This should be regarded as alpha software.
James Cheney