Introduction

Whatever XML's advantages, it has one glaring disadvantage: document size. XML's parent standard, Standardized General Markup Language (SGML), made many provisions for minimizing document size. These options made SGML complex and difficult to implement, and they were omitted from XML. Indeed, the XML standard explicitly states that markup terseness was _not_ a design goal. Consequently, XML is easy to process but verbose. XML documents can be many times larger than equivalent non-standardized text or binary formats, even if compressed. There is growing concern in the XML community that inefficiency arising from document size will hinder adoption and use of XML.

From an information-theoretic point of view, this situation is vexing because two documents carrying the same message should have the same entropy and so compress to about the same size using any universal compressor. The problem is that XML documents may have nonlocal redundancy arising from XML's tree structure, which is difficult for text compressors to discover. Conversely, XML-conscious compression techniques might be able to compress XML documents as well as or better than other representations. Fortuitously, the design emphasis on simplicity makes testing this hypothesis easier than in previous structured-data compression approaches (such as syntax-based compression of program files [12,3] or machine code compression[8,16]).

Liefke and Suciu [15] describe XMILL, an XML compressor that transforms documents to expose redundancy, then applies standard text compressors. XMILL combined with "gzip" compresses XML data about 10% better than "gzip" on equivalent non-XML forms; further improvement (up to 50%) is possible with user assistance in the form of complex command-line parameters. This work shows that XML-conscious compression can do better than text compression alone. However, XMILL has several drawbacks: namely, it precludes incremental (online) processing of compressed documents, it actually hinders compressors other than "gzip", and it requires user assistance to achieve the best compression.

In this paper, we will describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using a variety of text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called _Encoded SAX_ (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) [5] called _Multiplexed Hierarchical Modeling_ (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail.

Figure: XML example

<?xml version="1.0"?>
<doc>
 <title>The Great American Novel</title>
 <author>A. Nonymous</author>
 <copyright>Copyright 2000</copyright>
 <chapter id="1">
  <title>The Beginning</title>

  <body>
   ...body of chapter...
  </body>
 </chapter>
 ...other chapters...
</doc>