XMill version 0.8
Hedzer Westra <hhwestra@cs.uu.nl>
March 2003


 ----
 
Contents
 Introduction
 Additions in version 0.8
 Related products
 New User Compressors
 XMill memory API
 XMill file format
 Semantic compressors and path expressions
 XMill source documentation
 XMill internal control-flow and UML data
 Thanks
 Known problems
 Todo

 ----
 
Introduction

XMill is an XML compressor created by Hartmut Liefke and Dan Suciu at AT&T. It
was released under an AT&T license around December 1999; that release was 
dubbed XMill v0.7. It was developed further by me, Hedzer Westra, from the 
autumn of 2002 through the spring of 2003. Please read the README.txt and 
MANUAL.txt written by Dan & Hartmut before reading this file.

The original developers have moved on to other things and are not interested 
in developing XMill any further. Please contact me, Hedzer, if you need info to 
use or extend XMill.

And now for some real-life proof: I tested XMill on a 150MiB XML database I 
acquired from one of the customers of my employer. Using path expressions and 
PPMDi, I was able to achieve almost twice the compression rate of ZIP using just 
over a minute of CPU time on my Athlon XP 2000+ with 512MB RAM. (Okay, ZIP is 
faster, but you have to give up something for that increased compression 
rate.. ;)


 ----
 
Additions in version 0.8

Here is a list of changes and additions that I did for version 0.8:

- rewrote XMill so that it can be compiled as a fully re-entrant library.
  XMill was developed as two separate command-line tools that had a lot 
  of intertwined code, so this was no simple task, let me tell you that..
  While doing this, I also moved a lot of C functions to (static) classes and
  moved code inside .hpp files to .cpp files. The .hpp and .h files contain
  very little code now, so they should be a bit more readable.
  The only global variables that are left are three read-only array variables.
  
- with the rewrite to a library, it is now possible to compress and 
  decompress XML files in memory. The command-line tools only provided 
  file-based (or Unix filter-based) functionality. This in-memory feature is 
  accessible using a C++ API. A small example program is included.
  
- a unit testing tool, xmilltest, has been implemented to test this C++ API.
  XMilltest reads in its commands using a simplistic script language and creates 
  detailed statistics and error messages. Please see the annotated example 
  xmilltest.script file for a how-to.
  
- a new User Compressor, BaseXNum, has been created, along with some subclasses.
  See below for more details.
  
- an XMI version number is stored by adding it to the XMill magic, 0x05e3d29e.
  (BTW: I don't know where this magic comes from. As far as I know, it's just a
  random number.)
  The major version number is added to the second least significant Byte (0xd2),
  the minor number is added to the least significant Byte (0x9e). The current 
  version (for XMill version 0.8) is 01.00, which gives 0x05e3d39e.
  I also added a uint32 to the header that can be used in future versions.
  Note that the v0.8 decompressor can decompress v0.7 XMIs without problems. 
  The v0.7 decompressor cannot read v0.8 files.

- the type of general purpose compressor can now be chosen to be gzip, bzip2, 
  ppmdi (a variant of ppmd+) or nozip. Previously, the GZIP or BZIP2 support 
  was determined at compile-time.
  
  gzip accepts an index value of 1 through 9. Higher means slower execution
  but better compression.
  
  bzip2 usually compresses better than gzip, but needs considerably more CPU 
  time.
  
  ppmdi always compresses better than gzip and bzip2, but is slower than gzip.
  Compression is faster than bzip2, but decompression is slower. ppmdi accepts 
  a compression index value of 0 through 18. Higher is not always slower 
  execution/better compression, but I did try to give the PPMDi size/order 
  variables some sensible values. Note that you can't change those hardcoded 
  size/order values at will; doing that will break compatibility with 
  v0.8-ppmdi-compressed XMI files since only the compression index is saved in 
  the encoded file, not the actual size/order values.
  
  nozip is a 'compressor' that just stores the data as-is, with 'NOZIP' strings 
  as prefix and postfix. NoZip was developed for debugging purposes. Using nozip
  also shows the 'internal' XMill compression capability, i.e., the amount of
  data reduction that XMill itself yields. 
  
- I analysed the XMI file structure, documented it and wrote an XMI file inspector
  that prints out all kinds of cool numbers and texts about it. It also performs 
  integrity checking. Note that the documentation and implementation are not 
  complete yet. They're almost complete, though.

- Added quoting to constant compressors. If you want to add a double quote "
  inside a constant, do this by prefixing it with a backslash \. This of course 
  means that you have to escape \ itself as \\. You can also use \n, \t and \r.
  Note that the path expressions stored in the XMI file are *not unescaped*.
 
- Decompressing to an MSXML4.0 instance. This is not totally finished yet, and 
  is only available on Windows, since I don't know how to use DOM or SAX on Unix
  (yet). Feel free to submit code!
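
The quoting rules for constant compressors listed above can be sketched in a
few lines of Python. This is only an illustration, not XMill's own parser: the
function name is made up, and rejecting unknown escapes is an assumption (the
text above does not say what XMill does with them).

```python
def unescape_constant(text):
    """Undo the backslash escapes allowed inside a constant compressor."""
    escapes = {'"': '"', '\\': '\\', 'n': '\n', 't': '\t', 'r': '\r'}
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != '\\':
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of constant")
        nxt = text[i + 1]
        if nxt not in escapes:
            # Assumption: anything other than \" \\ \n \t \r is an error.
            raise ValueError("unknown escape: \\" + nxt)
        out.append(escapes[nxt])
        i += 2
    return "".join(out)
```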

  
 ----
 
Related products

Since XMill was released, a few other XML compressors have been released. Some
are Open Source, some commercial. ICT, Inc.'s XML-Xpress (www.ictcompress.com) 
is the most prominent commercial product. ICT claims it has far better 
performance (in both time and size) than XMill, but since I'm looking for an 
Open Source solution, using it is not an option for me. Some Open Source 
implementations of XML compressors are:

 - XMLPPM which uses ppmdi as the underlying GP compressor. Encoded 
   SAX events (or ESAX) and MHM are two fundamentals of this compressor. See 
   www.cs.cornell.edu/people/jcheney/xmlppm/xmlppm.html and
   sourceforge.net/projects/xmlppm
   XMLPPM only supports command-line operations, just like XMill 0.7 did.
 - XGrind, xgrind.sourceforge.net, claims compression comparable to XMill's 
   (i.e. not as good, but not really bad) and supports fast querying & updating,
   which XMill does not. There is no Windows release of XGrind at this time, 
   which is what my employer needs. Also, the released version (0.0) came out 
   half a year ago. What does this mean for stability and maintenance?

BTW, Liefke and Suciu used the name XMill after they discovered that XMLZIP was
already taken. Unfortunately, XMill seems taken too: xMill, www.xmill.com is a 
computer consultant, and XMILL is the name of some sort of CAD/CAM scripting 
language. Well, as long as 'they' don't complain I'll keep on using the name 
XMill...

There is a W3C Note on XML encoding: wbXML, www.w3.org/TR/wbxml,
but that only does rudimentary encoding, no real compression. I checked the W3C 
encryption pages, but was not able to find any compressor mentioned. Which is 
surprising, given that compression increases entropy and therefore decreases the
chance of breaking encrypted data. That is, if I remember my cryptology classes
correctly :-)

Some more webpages:
 - XMLS or XML Streams (www.sosnoski.com/opensrc/xmls) seems to do something 
   comparable to wbXML.
 - MLpress, www.gi.k.u-tokyo.ac.jp/~leo/work/mlpress. Unfortunately, I don't
   read Japanese >-|
 - gnosis.cx/publish/programming/xml_compression.html. Some Python example code,
   yay!
 - ZXML, www.xml.com/pub/r/904. Free registration required!
 - Bin-XML, www.expway.tv. Commercial.
 - Milau, www9.org/w9cdrom/154/154.html. Looks like wbXML all over again..
 - XCompress and PipeBoost: commercial IIS ISAPI extensions that send compressed
   data.
 - dblab.usc.edu/microsoft compares networked behaviour using binary data, 
   XML-ized and XMill-compressed data.
 
I got these links from www.peterindia.com/XMLCompression.html which is a bit
of a weird list. Some solutions are linked more than once (via different web
sites) and some links just aren't about XML compression.

XMill is mentioned on a good many websites nowadays. lists.xml.org, 
www.xmlhack.com, www.xmlsoftware.com and xmltk.sourceforge.net are the most
interesting. xmltk is a toolkit that redefines default Unix tools like
grep and tail for XML. XMill v0.7beta is part of the toolkit. The xmltk 
project seems to have been dormant for some time now, though. I received no 
answer to my question - suggested by Suciu - whether they were interested in 
XMill v0.8.

Levene and Wood describe 'XML structure compression' on 
www.webdyn.org/proceedings/levene_xml_structure_compression.pdf but I could not
find an implementation of their compression algorithm.

XMLSoc is another proprietary XML compressor. It was made by someone working for
www.syntion.com, but they don't seem to have released the program (yet).

So, concluding, there are two real competitors: XGrind and XMLPPM. I included
ppmdi into XMill so now the compression is almost as good as that of XMLPPM. The
XGrind querying/updating feature is really nice, but needs a completely different
compression approach, which will inevitably lead to lower compression. I've
been thinking about rewriting xmillinspect to be able to do basic querying. I 
think it's possible; xmillinspect streams all the data and does not read in a 
full run of data at a time, like xdemill does.

Anyways, now that I've invested all this time and effort into XMill, I'm not 
jumping ship (yet) ;-)


 ----
 
New User Compressors
  
  The PathExpr names of the new User Compressors are: 
   basex, b, x, X and base64. 
   
  Their meaning and accepted [optional] parameters:
   basex:	arbitrary-length, arbitrary-base positive numbers
   		params: base [mindigits]
   		example: basex (12 14)
   			  creates a 12-base number compressor that decompresses 
   			  to at least 14 digits (left-filled with zeros)
   			 basex (7)
   			  creates a 7-base number compressor 
   b:		same, but specialized for binary strings
   		params: [mindigits]
   		example: b(16)
   			  creates a binary number compressor that decompresses 
   			  to 16 bits minimum
   x:		same, but specialized for hexadecimal numbers. 0x and 0X are 
                 accepted as prefixes, which are decompressed as well.
   		params: [mindigits]
   		example: x
   			  creates a hex number compressor, no 0 digits filled
   X:		same as 'x', but stores the case of hex digits, using a base-22
   		(10+6+6) number. Not really efficient, since the arbitrary-length
   		feature is implemented by storing uint32 numbers, each of which
   		holds only 7 base-22 digits instead of 8 base-16 digits.
   base64:	RFC2045 (Mime) compliant base64 decoder & encoder. The data is 
                stored in the XMI file as the original binary data that the base64
                data was created from (subsequently compressed using the 
                general purpose compressor of course).
   		params: none
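
The remark about 7 base-22 digits versus 8 base-16 digits can be checked with
a small calculation. The sketch below assumes that a full uint32 holds as many
digits as are guaranteed to fit in 32 bits; the function name is made up for
this example.

```python
def digits_per_uint32(base):
    """Largest d such that every d-digit number in `base` fits in 32 bits."""
    if base < 2:
        raise ValueError("base must be at least 2")
    digits = 0
    limit = 1  # base ** digits
    while limit * base <= 2 ** 32:
        limit *= base
        digits += 1
    return digits
```

For base 2 this gives 32 digits per full uint32, which is why a partial last 
value never needs more than 31.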
   		
   
All compressors store their data as follows:
 char 			flag
 uint32			# blocks of uint32's that follow
 sequence of uint32's   values that should be decoded to string upon decompress 
 uint32			optional last value
 
The flag contains:
 b7     reserved
 b6-2	# valid digits in last uint32 (max. 31 for base2, which fits in 5 bits)
 b1-0   prefix flag, only valid for x or X:
	 0 - no prefix
	 1 - '0X' prefix
	 2 - '0x' prefix
	 3 - not defined
	 
If the number of valid digits (b6-2) is 0, there is no trailing uint32 value.
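
As an illustration, the flag byte described above can be taken apart like 
this. The function and constant names are made up for the example; only the 
bit layout comes from the description above.

```python
PREFIXES = {0: "", 1: "0X", 2: "0x"}  # value 3 is not defined

def parse_flag(flag):
    """Split a BaseXNum flag byte into (valid_digits, prefix, has_last_value)."""
    if not 0 <= flag <= 0xFF:
        raise ValueError("flag must be a single byte")
    valid_digits = (flag >> 2) & 0x1F   # b6-2: digits in the last uint32
    prefix_code = flag & 0x03           # b1-0: prefix, only used by x and X
    if prefix_code not in PREFIXES:
        raise ValueError("prefix code 3 is not defined")
    # A trailing uint32 is only present when there are valid digits for it.
    return valid_digits, PREFIXES[prefix_code], valid_digits != 0
```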

For base64, there is no sequence of uint32's, but a sequence of 3-char tuples. 
Each tuple of 3 Bytes (3*8 = 24 bits) 'decodes' to 4 characters (4 * 6 = 24, 
2^6 = 64) of base64 encoded data. The optional last value is not an extra
3-char tuple, but a sequence of 0 to 3 additional chars. The decompressor will
fill this last sequence with '=' chars at the end, as prescribed by RFC 2045.
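
The base64 round trip can be illustrated with Python's standard base64 module 
as a stand-in for the XMill code (this is not the XMill implementation, and 
the two function names are made up). Note that this sketch does not handle 
the line breaks that MIME allows inside base64 data.

```python
import base64

def base64_to_stored(text):
    """Return the raw Bytes that would be stored instead of the base64 text."""
    return base64.b64decode(text, validate=True)

def stored_to_base64(raw):
    """Rebuild the base64 text, including the trailing '=' padding."""
    return base64.b64encode(raw).decode("ascii")
```

Three stored Bytes become four characters; one or two leftover Bytes are 
padded with '=' on the way back.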

    
 ----
 
XMill memory API

The memory API is implemented using a single C++ class, XMill, defined in 
XMillApi.h. The main methods are Init(), Compress() and Decompress(). Errors are
thrown using C++ exceptions. The error values that are encapsulated in the 
XMillException instance are defined in xmillapi.h as well. 
Note: after a (de)compress action has caused an error, the Init() method *must* 
be called before the next Compress() or Decompress() action, to reset some 
internal data. Calling Init() before *each* (de)compress action is OK, so that 
is the safest way to go.

The Init() method accepts various settings:
 output type (text or MSXML4)
 lossy or lossless encoding: store all whitespace info or not
 compressor type: gzip, bzip2, ppmdi or nozip
 path expressions
 use dos newline or not
 ignore type (for ignoring CDATA, PI, DTD and/or comments)
 ignore value: yes/no
 indent type for decompressions
 indent count
 gzip/ppmdi compression index
 whether or not to 'decompress' XML data (note that the compressor can be 
  instructed to memcpy() the original XML data if the XMI is larger)
 
The (de)compress actions are defined for files or memory. The memory methods 
work on a full XML or XMI buffer, or on a block of (possibly incomplete) data.

See xmilltest (esp. the testsettings class) and xmillexample for example code.


 ----
 
XMill file format

Based upon the description below - to test if it was correct - I wrote 
XMillInspect, a program that prints info and contents of an XMI file, and 
performs some integrity checks. It might help in developing decompressors for 
other implementation languages, or an XMill query processor.

An XMI file is stored as a series of compressed blocks. Blocks are compressed 
using bzip2, ppmdi, gzip or nozip as general purpose (gp) compressor.
A set of compressed blocks is called a 'run'. Each run is comprised of a run 
header and run data blocks. The first run header contains the global XMI header.
An example:

bzip2 stream 1:
 global XMI header
 run 1 header
bzip2 stream 2:
 run 1 data block 1
bzip2 stream 3:
 run 1 data block 2
..
bzip2 stream i-1:
 run 1 data block j
bzip2 stream i:
 run 2 header
bzip2 stream i+1:
 run 2 data block 1
bzip2 stream i+2:
 run 2 data block 2
etc.

About runs: while compressing, the size of all intermediate data is maintained. 
If this size exceeds a user-definable setting (default 8MiB, which was determined 
by the developers of XMill as the optimal value), the current run is flushed to 
the XMI file, using the current gp compressor. The next run is then filled until 
the limit is reached again, etc.

A basic thing that must be explained first is the storing of int32 numbers. These
are not stored verbatim (using 4 Bytes), but using a UTF-8 like encoding to
avoid storing lots of Bytes that are just 0. The semantics:

(all data is stored in network Byte order, i.e. big-endian, so non-Intel)

Unsigned:

first Byte b7=0:  b6-0 contain a number 0-127 (7 bits)

first Byte b7=1:  a larger number was stored
b6=0:	a 14-bit number is stored in this byte (b5-0) and the next

b7=1, b6=1: an even larger number
b5=0:	a 29-bit number is stored in this Byte (b4-0) and the next three Bytes

first Byte = 224 (b7-5 = 1): the next 4 Bytes contain an int32 value 
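
A minimal Python sketch of the unsigned encoding described above, written 
from this description and not taken from the XMill sources. The function 
names are made up; rejecting a 5-Byte form whose first Byte is not exactly 
224 is an assumption.

```python
def encode_uint32(value):
    """Encode 0..2**32-1 in 1, 2, 4 or 5 Bytes (big-endian)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    if value < 1 << 7:
        return bytes([value])                           # 0xxxxxxx
    if value < 1 << 14:
        return (0x8000 | value).to_bytes(2, "big")      # 10xxxxxx + 1 Byte
    if value < 1 << 29:
        return (0xC0000000 | value).to_bytes(4, "big")  # 110xxxxx + 3 Bytes
    return b"\xe0" + value.to_bytes(4, "big")           # 224 + 4 Bytes

def decode_uint32(data, pos=0):
    """Return (value, next_pos) for the number starting at data[pos]."""
    first = data[pos]
    if first & 0x80 == 0:
        size, mask = 1, 0x7F
    elif first & 0x40 == 0:
        size, mask = 2, 0x3FFF
    elif first & 0x20 == 0:
        size, mask = 4, 0x1FFFFFFF
    elif first == 0xE0:
        pos, size, mask = pos + 1, 4, 0xFFFFFFFF  # skip the marker Byte
    else:
        raise ValueError("invalid first Byte: %d" % first)
    chunk = data[pos:pos + size]
    if len(chunk) != size:
        raise ValueError("truncated number")
    return int.from_bytes(chunk, "big") & mask, pos + size
```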

Signed:

b7=0:	b5-0 contain a number 0-63 (6 bits)
b6 = sign bit (usually used as a flag, not as a real sign!)

b7=1:   a larger number was stored
b6=0:	a 13-bit number is stored in this byte (b4-0) and the next
b5 = sign (not b6!)

b7=1, b6=1: an even larger number
b4=0:	a 28-bit number is stored in this Byte (b3-0) and the next three Bytes
b5 = sign

first Byte = 208 (positive) or 240 (negative) (b7,6,4 = 1, b5=sign): 
	the next 4 Bytes contain an int31 value 
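
The signed variant can be decoded along the same lines. Since the sign bit is 
usually a flag, this sketch returns it separately from the number; treating 
the number as a plain magnitude (not two's complement) is an assumption. The 
function name is made up for the example.

```python
def decode_sint32(data, pos=0):
    """Return (sign_flag, value, next_pos) for the number at data[pos]."""
    first = data[pos]
    if first & 0x80 == 0:
        # 1 Byte: b6 is the sign, b5-0 the value.
        return bool(first & 0x40), first & 0x3F, pos + 1
    sign = bool(first & 0x20)  # b5 is the sign in all longer forms
    if first & 0x40 == 0:
        size, mask = 2, 0x1FFF          # 13 bits
    elif first & 0x10 == 0:
        size, mask = 4, 0x0FFFFFFF      # 28 bits
    elif first & 0xDF == 0xD0:          # 208 or 240
        pos, size, mask = pos + 1, 4, 0xFFFFFFFF
    else:
        raise ValueError("invalid first Byte: %d" % first)
    chunk = data[pos:pos + size]
    if len(chunk) != size:
        raise ValueError("truncated number")
    return sign, int.from_bytes(chunk, "big") & mask, pos + size
```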

These two encoded number types will be referred to as sint32 and uint32 in the 
remainder of this document.

Some other global remarks about this file format description:
 - lists of data entities are prefixed with a uint32 number, describing the
   number of elements in the list. A list is described with data type 'list'.
 - strings are stored just like this: first a uint32 with the string length,
   then the string data (no NULL terminator). They will be described with the 
   data type 'lstring'.
 - a list of strings (so first a uint32 followed by a number of lstrings) will
   be described with 'slist'.
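
To make 'lstring' and 'slist' concrete, here is a small Python reader for 
both, based only on the description above. All names are made up for the 
example, and read_uint32 is a compact copy of the unsigned decoding rules.

```python
def read_uint32(data, pos):
    """Decode one variable-length unsigned number; return (value, next_pos)."""
    first = data[pos]
    if first & 0x80 == 0:
        return first, pos + 1
    if first & 0x40 == 0:
        return int.from_bytes(data[pos:pos + 2], "big") & 0x3FFF, pos + 2
    if first & 0x20 == 0:
        return int.from_bytes(data[pos:pos + 4], "big") & 0x1FFFFFFF, pos + 4
    return int.from_bytes(data[pos + 1:pos + 5], "big"), pos + 5

def read_lstring(data, pos):
    """Read a length-prefixed string (no NULL terminator)."""
    length, pos = read_uint32(data, pos)
    end = pos + length
    if end > len(data):
        raise ValueError("truncated lstring")
    return data[pos:end], end

def read_slist(data, pos):
    """Read a uint32 count followed by that many lstrings."""
    count, pos = read_uint32(data, pos)
    items = []
    for _ in range(count):
        item, pos = read_lstring(data, pos)
        items.append(item)
    return items, pos
```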
   
The global XMI header:

sint32	magic (0x05e3d29e + version, so 0x05e3d39e for version 01.00)
	the sign bit indicates whether the -w option (store global full 
	whitespaces) was used. On means: ignore whitespaces, the decompressor
	is then responsible for inserting XML formatting whitespaces.
uint32  reserved. Added in v0.8 for future use.
slist	path expressions (verbatim from the .xmill file)
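
The version arithmetic on the magic can be sketched as follows. The names are 
made up for the example, and the range checks (no carry from one Byte into 
the next) are an assumption that is not stated anywhere in the sources.

```python
XMILL_MAGIC = 0x05E3D29E  # the v0.7 magic, i.e. version 00.00

def make_magic(major, minor):
    """Add a major.minor version to the two lowest Bytes of the magic."""
    if not 0 <= major <= 0xFF - 0xD2 or not 0 <= minor <= 0xFF - 0x9E:
        raise ValueError("version would overflow into the next Byte")
    return XMILL_MAGIC + (major << 8) + minor

def split_magic(magic):
    """Recover (major, minor) from a magic value."""
    diff = magic - XMILL_MAGIC
    if diff < 0:
        raise ValueError("not an XMill magic")
    return diff >> 8, diff & 0xFF
```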

A run header:

A run header contains some global information, and the run data blocks that are 
smaller than a certain #define (currently 2000 Bytes). This kind of data is called
'small' data (as opposed to 'large' data) and is incorporated in the run header 
to increase the gp compression rate, since gp compressors don't perform well on 
small data streams.

run data can be split up into two sections: global data and container data. 
Currently, only the enumeration compressor uses global data.

uint32	totaldatasize (total data size of all run data. This size is used 
		in the decompressor to set the cutoff length: the maximum length
		for the decompressor memory manager. It is implementation dependent;
		the calculation function uses the sizeof(..) of a certain struct, and 
		aligns string sizes to a 4-Byte word.)
list	container blocks. per block:
 uint32	0 (for the first block which contains the special data) 
 	or the index of the path expression (1-based)
 list	containers. per container:
  uint32 containersize
slist	labels (each label string length is encoded as sint32 instead of uint32.
		The sign bit On indicates that the label is an XML attribute.)
	These labels *extend* the list that was cumulatively defined in the 
	previous runs.
list	global data. for each compressor (currently only enumeration, 'e'):
 uint32 state index (== dictionary size)
 uint32 state size
 for each global with state size < 1024 (another #define):
  slist	enumeration values
  (note that this list is *not* prefixed with a uint32. The number can
   be derived from looking at the index number of the corresponding state.)
smallcontainers (no length prefix - see the block count above), for each container block:
 for each container with size < 2000:
  all data (verbatim) - data is dependent on semantic compressor

run data blocks:

A run data block is either a 'large global data' block or a 'large container' block.
First are all the global blocks, then follow the container blocks, each in their
own bzip2/gzip stream.

The global data is currently only defined for enumeration. It contains:
 for all states with size >= 1024:
  slist	enumeration values (not prefixed; see above for the small global data)

The container data:

for each container block:
 for each container with size >= 2000:
  all data (verbatim)

Note: all verbatim blocks for a single container are split out to separate streams.
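
The small/large split comes down to a size comparison per container. A tiny 
Python sketch, with made-up names, shows where the boundary lies (a container 
of exactly 2000 Bytes counts as large):

```python
SMALL_LIMIT = 2000  # Bytes; a #define in the XMill sources

def split_containers(sizes, limit=SMALL_LIMIT):
    """Return (small, large): the container indexes on each side of the limit."""
    small = [i for i, size in enumerate(sizes) if size < limit]
    large = [i for i, size in enumerate(sizes) if size >= limit]
    return small, large
```

The small ones end up in the run header, the large ones each get their own 
stream.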

The first container block is always present and always contains three containers:
 1: structure container
 2: PI, DTD, CDATA & comment container (might be empty)
 3: global whitespace container (might be empty)

Container 2 and 3 contain a non-prefixed slist.

The structure container (1) contains a list of sint32 values which can be 
interpreted as XML macros:
 0 = insert end label
 1 = insert empty end label (i.e. end the currently open label immediately, e.g.,
      <myLabel/>)
 2 = insert white space
 3 = insert attribute white space
 4 = insert special
 5 and higher = insert the corresponding label (remember, labels are stored in 
   the run header). Subtract 5 from the number and you get the index in the label 
   list
 negative = the absolute value is a block id. The corresponding semantic block 
  decompressor must decode some data now. This decompressor can be found by
  using the block id as an index into the container block list that was stored
  in the run header. The index stored with that container block is the 1-based 
  number of the path expression. An example:

XMI compressed without path expressions:

<test>
 bar
 <foo>312415</foo>
</test>

container block	 index	# containers	remark
 0		 0	3		special container
 1		 1	1		'test' text container
 2		 1	1		'foo' text container

XMI compressed with path expression:

/test/foo => u
//#			(automatically added)
/			(automatically added)

<test>
 bar
 <foo>312415</foo>
</test>

container block	 index	# containers	remark
 0		 0	3		special container
 1		 2	1		'test' text container
 2		 1	1		'foo' uint32 container

The structure container will contain references to container block 1 or 2.
Without path exprs, they both refer to path expression 1 (//#). With path exprs,
the first one will refer to 2 (//# for <test> data), the second to 1 
(/test/foo =>u) for <foo> uint32 data. Note that the path exprs that are stored
in the XMI only contain the part *after* the '=>' so in this case only 'u'.
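
The interpretation of the structure container values can be sketched as a 
small dispatch function. This only illustrates the numbering described above: 
the names are made up, and the values are taken as plain Python integers that 
have already been decoded.

```python
FIXED_TOKENS = {
    0: "end_label",
    1: "empty_end_label",
    2: "white_space",
    3: "attribute_white_space",
    4: "special",
}

def interpret_token(token, labels):
    """Map one structure container value to an (action, argument) pair."""
    if token < 0:
        # The absolute value is the id of the block that must decode data.
        return "decode_block", -token
    if token in FIXED_TOKENS:
        return FIXED_TOKENS[token], None
    # 5 and higher: index into the label list from the run header(s).
    return "start_label", labels[token - 5]
```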

All other container blocks' data is dependent on the semantic compressor. This 
is described in the next chapter.

The large block streams do not contain any meta info, e.g., their type (global
or container), sequence number or semantic compressor type is not incorporated.
This makes salvaging of a damaged XMI file a bit problematic if a header stream 
is damaged. Run header streams (except the first one) don't have a magic either,
so identifying them is impossible too. This no-magic regime also has its 
repercussions on the decompressor algorithm: to map the path expressions to 
semantic decompressors, it must use exactly the same algorithm as xdemill.


 ----
 
Semantic compressors and path expressions

The chapter about basexnum ('New User Compressors') explains its data structure. 
This chapter explains the data structure of the other user compressors (that 
were already available in v0.7beta) and how the container blocks that store this
data relate to the path expressions.

These are the XMill basic semantic compressors and their data formats:

name	long name	stream of..
------------------------------------
 p	print		nothing!
 "..."	constant	nothing!
 t	text		lstring
 u8	Byte		char
 u	Unsigned Int	uint32
 i	Signed Int	sint32
 di	Delta		sint32
 rl	RunLength	tuples of (lstring, uint32 count)
 e	Dictionary	uint32 indexes into the dictionary 
 			the dictionary values are stored in a global block
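
For readers who don't know the ideas behind 'di' and 'rl': the sketch below 
shows delta and run-length coding in general terms. It does not claim to 
reproduce XMill's exact Byte streams; the names are made up, and starting the 
delta series from 0 is an assumption.

```python
def delta_encode(values):
    """Store each value as the difference from the previous one."""
    previous = 0  # assumption: the series starts from 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Rebuild the original values by summing the differences."""
    values = []
    current = 0
    for delta in deltas:
        current += delta
        values.append(current)
    return values

def run_length_encode(strings):
    """Collapse repeated neighbours into (string, count) tuples."""
    runs = []
    for text in strings:
        if runs and runs[-1][0] == text:
            runs[-1] = (text, runs[-1][1] + 1)
        else:
            runs.append((text, 1))
    return runs
```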

Each container that uses 'e' stores a global block. E.g., the expression
//#=>e will store a global block for each distinct element in the XML file.
//Name=>e and //Value=>e will store two global blocks. 

How the mapping from container to global blocks is handled is a bit unclear to 
me. I assumed it to be in the order of declaration of path expressions, but that 
does not seem to be true. I've included a warning in xmillinspect when a clear 
mismatch is detected. I'll try to figure this out..

These are the XMill combining semantic compressors and their data formats:

name		stream of..
----------------------------
 seq		nothing!
 seqcomb	nothing!
 or		uint32 with succeeded subcompressor index
 rep		uint32 with repeat counts
 orcomb 	?? The technical paper names this compressor, but it doesn't
 		seem to exist anymore.
 
The combining compressors all have an associated list of subcompressors.
Their respective subcontainers are stored in the same container block as the 
container block of the combining compressor, in the order of declaration in the 
path expression.

The seq and seqcomb compressors do not have a container of their own, since they
don't need to store any information. The seqcomb compressor combines the 
containers of its subcompressors, and thus stores the maximum number of 
containers that any subcompressor needs. In contrast, 'seq', 'or' and 'rep' all 
store a separate (set of) container(s) for each of their subcompressors.

Some examples to clear this up:

or(i t) will store 3 containers in its container block:
Container 1 contains the or's uint32s (a series of 0 or 1 values)
Container 2 contains the i's sint32s
Container 3 contains the t's lstrings

The path expression:

or (seq (u "-" u)
    u
    "text"
    t)
    
will store 5 containers:
Container 1 contains the or's uint32s (a series of 0-3 values)
Container 2 contains the first u's uint32s
Container 3 contains the second u's uint32s
Container 4 contains the third u's uint32s
Container 5 contains the t's lstrings
    
The slightly modified path expression:

or (seqcomb (u "-" u)
    u
    "text"
    t)
    
will store 4 containers:
Container 1 contains the or's uint32s (a series of 0-3 values)
Container 2 contains the first *and second* u's uint32s
Container 3 contains the third u's uint32s
Container 4 contains the t's lstrings
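
The counting rules behind these three examples can be written down as a small 
recursive function. This is a sketch of the rules as described in this 
chapter, not code from XMill; the nested-tuple representation of a path 
expression and the function name are made up for the example.

```python
def count_containers(expr):
    """Count the containers a (sub)compressor expression needs.

    A leaf is a string such as "u", "t" or '"-"'; a combining compressor is
    a (name, [subexpressions]) tuple.
    """
    if isinstance(expr, str):
        # print (p) and constants ("...") store nothing at all.
        return 0 if expr == "p" or expr.startswith('"') else 1
    name, children = expr
    counts = [count_containers(child) for child in children]
    if name == "seq":
        return sum(counts)              # no container of its own
    if name == "seqcomb":
        return max(counts, default=0)   # subcompressors share containers
    if name in ("or", "rep"):
        return 1 + sum(counts)          # one container for the uint32s
    raise ValueError("unknown combining compressor: " + name)
```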


 ----
 
XMill source documentation

For people who are interested in extending or upgrading XMill, here is a short 
explanation of its source code. Note that any changes have to be reported to 
AT&T, which is the copyright holder. Also note that I e-mailed them about my 
changes but have not received any answer. According to Suciu, nobody at AT&T
is maintaining the code so I guess they kind-of abandoned their rights?!?

The distribution contains the following directories:
 tmp		Unix .o folder
 unix		Unix binaries
 examples	example .xml, .xmi, .xmill and .script files
		also: a base64 encoded image and its original
 documentation	project documentation, copyright notices, etc.
 XMill		XMill, xcmill and xdemill source code
 xmilltest	unit tester source code
 bzlib		BZIP2 source code
 zlib		GZIP source code
 v0.7-binaries	'old' binaries
 xcmill		MS VS.Net 2002 metadata for the compressor 
 xdemill	MS VS.Net 2002 metadata for the decompressor 
 xmillexample	Very simple XMill C++ API example
 xmillinspect	XMI file inspector
 ppmzip		PPMDi (de)compressor
 
The base directory contains a MS VS.Net 2002 (Visual Studio 7) solution and a 
Unix Makefile.

VS.Net projects:
 XMill		(de)compression	library
 xmilltest	unit testing
 xmillexample	command-line (de)compressor test
 xcmill		command-line compressor
 xdemill	command-line decompressor
 xmillinspect	command-line XMI file inspector
 bzlib		BZIP2 library
 zlib		GZIP library
 ppmdi		PPMDi library
 ppmzip		PPMDi (de)compressor
 
All projects are capable of compilation with precompiled headers, but due to some
corruption in my VS.Net installation I had problems with it so .pch files are
currently not enabled.

Note that all projects are configured to use 1-Byte alignment and link with the
multi-threading variant of libc. You'll need to define this in your own projects
as well (or revert the XMill projects to default alignment / single-threading
which is Visual Studio's default value when creating a new project) to be able to 
link & run your program without problems.

The xcmill and xdemill projects contain two files (realmain.cpp and Options.cpp)
that are shared. Those must be compiled with either XDEMILL or XMILL #defined. 
This is a relic of the 0.7 sources, which contained them everywhere.

The bzlib and zlib sources have been modified to support precompiled headers,
but other than that they're the exact sources you can download from their 
respective maintainers.

The PPMDi code (which was already revised by a number of people) had to be 
adapted a bit to be able to stream chunks of data from memory.

XMilltest contains a main function (xmilltest.cpp), some small functions 
(utils.cpp) and the unit tester methods (testset.cpp). It should be easy to 
find out its internals..

XMillInspect contains nine source files. Together, it's about 2.5KLOC, so that's
not too much to figure out for yourself. IMHO it's pretty well documented, 
although not very well programmed in some parts. Those parts are identifiable
by huge warnings in the comments.

The XMill project has been split into separate folders to increase understandability:
 Compressors		.xmill path expression compressors
 XMLFiles		file I/O & XML parsing
 API			API functions
 			(de)compression
 			error handling
 			session management
 			file handling
 MemMan			memory manager (to work around new/malloc's bad performance)
 PathEngine		XML Path management
 			Finite State Machine for .xmill expression handling
 ContainerManager	intermediate data containers
 GPCompressors		interface to BZIP2, GZIP, PPMDi and NoZip
 
Where you should be for what intention:
  Extending:
 - adding a user compressor:
    copy-paste-edit & add your source in Compressors
    register your compressor in XMillData.cpp at RegisterCompressorFactories()
    increase the XMILL_CURRENT_VERSION_.. #defines in XMill.h
    update Unix makefile
 - adding a new general purpose compressor:
    add & link a new project containing your (de)compressor of choice
    copy-paste-edit a Zipper subclass & add in GPCompressors
    extend the Zipper::NewZipper() factory
    add a #define for your XMILL_GPC_.. compressor in XMillAPI.h
    add a recognition heuristic in Uncompress.cpp and inspectfile.cpp 
     (the latter being an xmillinspect file)
    extend testset.cpp (xmilltest) in two places (trust me, you'll find out 
     where :-)
    increase the XMILL_CURRENT_VERSION_.. #defines in XMill.h
    update Unix makefile

  Debugging:
 - debugging the parser: (error 11 or something)
    inspect PathEngine files. Warning: the FSM is a pain in the ass to debug..
 - debugging (de)compression errors:
    if you have those, there's probably something wrong around the XMill::Init() 
    call. Debug the code in the API, ContainerManager and PathEngine folder 
    files and you're bound to find the problem somewhere in the xmillapi.cpp or 
    XMillData.cpp files. Crash'n'burn, because FSM debugging is easy compared 
    to this..
 - debugging/upgrading the API and higher level (de)compression routines:
    see API files

I guess the XMLFiles and MemMan files are okay and don't need inspection anymore, 
but you'll have to find that out for yourself..

BTW there are some unused (commented-out) features in the XMill sources that I
didn't really understand so I left that all in. Those things are:

file:
 VRegExpr.cpp / .hpp		no idea what the purpose is or was, I didn't 
 				 really look into that
 FSM.cpp:			commented-out code at the end
 MemStreamer.cpp:		commented-out code at the end
 
#defines:
 SPECIAL_DELETE			?? something to do with delete[] and the magical 
 				 internal memory manager
 TIMING				debugging / testing code for (de)compression speed
 XDEMILL_NOOUTPUT		probably debugging / testing code
 USE_FORWARD_DATAGUIDE		forward instead of backward path handling; old code
 PROFILE			debugging / testing code for Belgian debugging
 ZALLOC_COUNT			debugging / testing code for BZIP2/GZIP memory 
 				 allocations
 RELEASEMEM_SAFE		clean memory after use, a security protection
 NOTHREAD			code I commented out when converting the internal 
 				 memory manager from global variable use to 
 				 thread-safe code
 
 
 ----

XMill internal control-flow and UML data

Not done yet. Be my guest :-)


 ----

Thanks

My gratitude goes out to Dan Suciu and Hartmut Liefke for helping me out with 
converting xmill to thread-safe code, and other technical issues, and to James 
Cheney (of xmlppm fame) for aiding me in adding ppmdi as a general purpose 
compressor. And of course AT&T research for allowing me to use and modify xmill 
0.7. Open Source is neat!


 ----

Known problems

There are several issues:
 - there might be a problem when you try to compress an XML text item that's 
   larger than 64kB, because the internal memory manager allocates 64kB per block
   at most, but I didn't test this. In the 0.7 documentation I read that this is
   handled OK, but due to my partial compression modifications (see below) I 
   suspect that this has ceased to work. Bummer..
 - as can be read above, I had quite some problems with the FSM and XMill::Init()
   code. The whole thing comes crashing down on me every once in a while - after
   a code change of course -, since the code was originally meant for a single 
   series of compression or decompression actions in the command line tools. 
   The re-initialization of all have-been global variables (there were *a lot* 
   of those!) might still be sloppy.
 - Compression of partial XML files is a bit wobbly. There might be odd-ball 
   cases where the detection of incomplete XML elements screws up. So either fix 
   this or supply complete elements if you have problems. To elaborate, here are
   the degrees of freedom:
    - supply complete XML files (either on disc or in memory)
    - supply blocks of XML data, e.g., first '<?xml version="1.0"?><test>' and 
      then 'YourData</test>'
    - supply incomplete blocks, e.g., '<?xml versi', 'on="1.0"?><test>You' and 
      'rData</test>'
   The first two should be handled okay, but the last one is a bit doubtful. 
   A minimum requirement for that is that you supply each element fully at some 
   time during compression. Unclosed items will not be processed by the 
   compressor, which will return you the size of the text that *has* been 
   processed. It's then up to the client to supply more data that contains the 
   full item. In the previous example:
    '<?xml versi' supplied, nothing processed
    'on="1.0"?><test>You' added by the client to the unprocessed data
    '<?xml version="1.0"?><test>You' supplied, '<?xml version="1.0"?><test>' 
     processed, 'You' remains
    'rData</test>' added by the client to the unprocessed data
    'YourData</test>' processed
    data ended and all unclosed tags are closed, so the compression is finished
   Nifty, eh? Note that it can be quite slow to handle large files in small 
   blocks; XMill handles complete files *much* faster. FYI: the same problem 
   arises when you define the MEMORY_CUTOFF considerably smaller than the 
   current value of 8MiB.
 - Decompression of blocks of XMI data is even more problematic. I implemented 
   front-end support for it, but I'm not sure whether it's actually possible. 
   I'll get back to you 'bout that :-)
 - Compression of XML files with unclosed elements thrashes the compressor.
 - xmillinspect does not always interpret path expressions correctly.
  

 ----
 
Todo

- fix the known problems mentioned above
- integration test for xcmill, xdemill, xmillinspect and xmilltest
- XMill:
  + add more User Compressors for domain-specific data, e.g., 3D coordinates, 
    floating point numbers, dates, timestamps, etc.
  + user compressor macros and include files in .xmill definitions
  + test robustness on bad XML, XMI and .xmill data
  + control-flow and UML data
  + find out if XGrind (querying) or PPM (ESAX, MHM) techniques can be 
    incorporated into XMill
  + finish MSXML decompression & add better API support for it
  + decompression to SAX events
  + better API: separate setting of all XMill (de)compression parameters i.s.o.
    setting them all at once using Init().
- xcmill:
  + skip non-arguments in .xmill 
- xmilltest: 
  + XML output of statistics i.s.o. text
  + detailed timings
- xmillinspect: 
  + option to skip non-fatal errors, to retrieve as much info as possible from 
    a damaged XMI file.
  + add command line options for the number of checks to execute, and for
    storing compressed / uncompressed buffers (mainly for debugging purposes)
  + create an XMI fix program for corrupt XMI files based on xmillinspect
  + fix container -> global data (dictionary) mapping
  + research the possibility of creating a query processor out of this code
  + fix container type mapping for complex path expressions