XMLPPM: XML-Conscious PPM Compression

What is XMLPPM?

XMLPPM is a data compression program that compresses XML files from 5 to 30% better than any existing text or XML-specific compressors. It is a combination of the well-known Prediction by Partial Match (PPM) algorithm for text compression, first described by Cleary and Witten in 1984, and an approach to modeling tree-structured data called Multiplexed Hierarchical Modeling (MHM) that I have developed. The XMLPPM source code is part of a project at Sourceforge on XML compression.

Why compress XML?

XML is now a standard interchange format for a wide variety of data. It draws on considerable existing experience with HTML/CSS, but is much more general. XML is not a specific markup language like HTML, but instead is a meta-language for defining languages like HTML. Consequently, XML provides a home for many "niche" application areas that don't fit into the standard HTML data model. If you go to any of several web sites on XML, you will probably run across an exhaustive list of such applications, including e-business transactions, medical information, user interface descriptions, and theological commentary.

However, XML markup tends to inflate file sizes relative to application-specific, text or binary formats. This size blowup carries over to compressed formats as well, that is, text-compressed XML versions are typically still larger than compressed costomized files. Furthermore, compression tends to introduce another "pass" into XML processing since files must be decompressed before they can be parsed and processed.

Getting XMLPPM

XMLPPM is based on the expat XML parser library for C. You will need expat (http://expat.sourceforge.net) to compile the sources for XMLPPM.

Version 0.95-0.96 of XMLPPM used Bill Teahan's C implementation of the PPMD+ algorithm. More recent versions 0.96-0.98 use Dmitry Shkarin's PPMII algorithm. Modified versions of these programs are (with permission) incorporated into XMLPPM.

Using XMLPPM

Currently XMLPPM just reads XML text from standard input and writes compressed bits to standard output; XMLUNPPM, the decompressor, inverts this transformation from standard input to standard output.

To run XMLPPM on your machine, get the source, compile it, and run it as follows:

xmlppm < doc.xml > doc.xppm

To decompress the compressed file, do:

xmlunppm < doc.xppm > doc.new.xml

XMLPPM is fairly mature, but may still have bugs.

An XML Compression Corpus

I constructed the following corpus of XML files to test XMLPPM's compression performance relative to other methods. The files break into three rough categories:

  1. Mostly-text files containing little structured markup
  2. Mixed files containing lots of structure and lots of text
  3. Mostly-markup files containing little text

Downloads

XMLPPM project download page

Corpus (with DTDs): (xml-corpus-1.0.tar.gz)

Paper

I've written a short paper about the ideas underlying XMLPPM, which is available here (HTML version).

New

DTDPPM is a DTD-conscious extension of XMLPPM. More information is available here.
SourceForge.net Logo
James Cheney