*********************************************************************
****                                                             ****
****                                                             ****
****                    User Manual for XMill                    ****
****                                                             ****
****                                                             ****
*********************************************************************

   Content

   1.   Introduction
   1.1. Underlying Techniques

   2.   Options
   2.1. Basic Options
   2.2. Options for White Space and Special String Handling
   2.2.1. Options for White Space Handling
   2.2.2. Options for Special String Handling

   3.   Path Expressions
   3.1. The '#' symbol
   3.2. Order of Path Expressions
   3.3. Default Grouping

   4.   User Compressors

   5.   Options of the Decompressor XDemill

   6.   The Compression Verbose Mode 

*********************************************************************

1. Introduction

XMill is a special-purpose compressor for XML documents that typically
achieves substantially better compression rates than general-purpose
compressors such as gzip. For large files, we achieved compression
rates twice as good as gzip's.

Similar to gzip, XMill is a command-line tool that works on a
file-by-file basis. A given file with extension '.xml' is compressed
into a file with extension '.xmi'. A file with any other extension
is compressed into a file named by appending extension '.xm'.
Conversely, the original file name is obtained by replacing extension
'.xmi' with extension '.xml' or by removing extension '.xm'.
Alternatively, the user can write the output to the standard output
and optionally read the input from the standard input.
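
The naming rule above can be sketched in a few lines of Python. This is
only an illustration of the rule as stated in this manual, not XMill's
actual code; the function names 'compressed_name' and 'original_name'
are our own.

```python
def compressed_name(path: str) -> str:
    """Name of the compressed file for a given input file name."""
    if path.endswith(".xml"):
        # 'name.xml' becomes 'name.xmi'
        return path[: -len(".xml")] + ".xmi"
    # any other file name simply gets '.xm' appended
    return path + ".xm"


def original_name(path: str) -> str:
    """Inverse mapping: recover the original file name."""
    if path.endswith(".xmi"):
        return path[: -len(".xmi")] + ".xml"
    if path.endswith(".xm"):
        return path[: -len(".xm")]
    raise ValueError(f"not an XMill output name: {path!r}")
```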

1.1. Underlying Techniques

XMill is based on a grouping strategy that groups and compresses
text items together based on their semantics. For example, a sequence
of <Person> elements (with <Name>, <Age>, <Shoesize>, ... elements)
in an XML document could be rearranged by grouping all names, all ages,
and all shoe sizes together. This typically leads to a higher
compression ratio, since each group contains text items with
high similarity.
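
As a rough illustration of this rearrangement, the following Python
sketch collects the text of each element into one list ("container")
per element label. It is not XMill's implementation: the function name
'group_by_parent_label' and the sample document are assumptions, and
attributes and text following a child element are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_parent_label(xml_text: str) -> dict:
    """Collect element text into one list per element label."""
    containers = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        # Only the text directly after the start tag is considered here.
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return dict(containers)


doc = """<People>
  <Person><Name>Ann</Name><Age>34</Age><Shoesize>38</Shoesize></Person>
  <Person><Name>Bob</Name><Age>41</Age><Shoesize>44</Shoesize></Person>
</People>"""

containers = group_by_parent_label(doc)
# All names end up together, all ages together, and so on.
```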

The default strategy groups each text item by its parent label.
Even though this works well, there are cases where a label
has different meanings in different parts of the document (e.g. <Title>
in <Person> has a different meaning from <Title> in <Book>) or where
different labels have the same meaning (e.g. a <ChildName> in <Person>
contains a person's name, just like <Name>).

Therefore, XMill provides a powerful regular path expression language
for grouping text items with respect to their meaning. Each text item
is reachable over a 'path' of labels from the root of the XML document.
For example, the social security number of an employee of a company
might be stored in the 'ssno' attribute of element '<Employee>' in
element '<Company>' in the root element '<Root>'. Hence, the path to
the text item is /Root/Company/Employee/@ssno. Section 3 describes how
regular path expressions are specified and matched against a specific
path.

After the grouping into containers, conventional compressors such as
gzip are applied to the containers and exploit those similarities.
Since the number and size of containers grow with the size of the
XML file, a memory window mechanism is implemented: once the overall
size of the containers reaches a user-specified memory window limit,
the containers are compressed and stored in the output file. The
compressed content of the memory window is called a 'run'. After the
'run' is stored in the output file, the containers are filled with
data again.
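
The memory window mechanism can be sketched as follows. This is a
simplified model under several assumptions that are ours, not XMill's:
the window is measured in bytes of uncompressed text, text items inside
a container are joined with a NUL byte, and each container is
compressed separately with Python's zlib module. The name
'compress_in_runs' is hypothetical.

```python
import zlib
from collections import defaultdict


def compress_in_runs(items, window_bytes):
    """items: iterable of (container_name, text) pairs.

    Returns a list of runs; each run maps a container name to the
    zlib-compressed content of that container.
    """
    runs = []
    containers = defaultdict(list)
    used = 0

    def flush():
        nonlocal containers, used
        if containers:
            runs.append({
                name: zlib.compress("\0".join(texts).encode("utf-8"))
                for name, texts in containers.items()
            })
        containers = defaultdict(list)
        used = 0

    for name, text in items:
        containers[name].append(text)
        used += len(text.encode("utf-8"))
        if used >= window_bytes:
            flush()  # window exhausted: emit one run, start refilling
    flush()  # emit whatever is left as the final run
    return runs


items = [("Name", "Ann"), ("Age", "34"), ("Name", "Bob"), ("Age", "41")]
```

With a very small window, 'compress_in_runs(items, 5)' produces two
runs; with a large window, all data ends up in a single run, which
gives the compressor more context to work with.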

In addition to path expressions, the user can also specify how to
"pre-compress" specific text items. For example, the user might
want to replace the 'Age' string by its binary integer representation.
As a more complex example, an IP address might be replaced by four bytes.

XMill allows the user to specify additional "user compressors" that
pre-compress the text items before they are stored in the containers.
Note that the gzip compression is still applied to the containers
afterwards.
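
To make the idea of pre-compression concrete, here is a small Python
sketch that turns an age into one byte and a dotted IP address into four
bytes. The helper names 'pack_age', 'pack_ip' and 'unpack_ip' are
hypothetical and the byte layout is an assumption; XMill's own user
compressors may encode values differently.

```python
def _is_small_number(text: str) -> bool:
    """True for ASCII decimal strings with a value in 0..255."""
    return text.isascii() and text.isdigit() and int(text) <= 255


def pack_age(text: str) -> bytes:
    """Replace a decimal string such as '34' by a single byte."""
    if not _is_small_number(text):
        raise ValueError(f"not an age in 0..255: {text!r}")
    return bytes([int(text)])


def pack_ip(text: str) -> bytes:
    """Replace 'a.b.c.d' by four bytes."""
    parts = text.split(".")
    if len(parts) != 4 or not all(_is_small_number(p) for p in parts):
        raise ValueError(f"not a dotted IP address: {text!r}")
    return bytes(int(p) for p in parts)


def unpack_ip(data: bytes) -> str:
    """Inverse of pack_ip (leading zeros in the input are not restored)."""
    return ".".join(str(b) for b in data)
```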

XMill provides an interface for writing custom user compressors in C++.
This is particularly useful for domain-specific data, such as DNA
sequences, 3D coordinates, etc.


2. Options

2.1. Basic Options

The XMill compression can be fine-tuned by several options.
The set of available options of XMill is given by the following syntax:

  xmill [-i file] [-v] [-p path] [-m num] [-1..9] [-c] [-d] [-f] [-w] [-h] file...

with

       -i file  - include options from file
       -v       - verbose mode
       -p path  - define path expression
       -m num   - set memory limit
       -1..9    - set the compression factor of zlib (default=6)
       -c       - write on standard output
       -d       - delete input files
       -f       - force overwrite of output files
       -w       - preserve white spaces
       -h       - show extended white space options and user compressors

The options have the following meaning:

 -i file  - Reads additional options from text file 'file'. The text
            file must contain a sequence of options that are separated
            by white spaces. The external option file is particularly
            useful for path expressions (Section 3).

      -v  - Outputs additional information about the grouping and
            compression ratios achieved by the compressor for the given
            setting. The format of the output is described in Section 6.

 -p path  - Defines a new path expression to group text items together.
            Path expressions are described in Section 3.

  -m num  - Sets the memory window limit for compression. The compressor
            fills the memory window until it is full, compresses
            it afterwards and writes the compressed output to the file.
            Then, the parsing continues and the memory window is filled again.

            A smaller memory window creates more, smaller 'runs'
            in the output file. Hence, the compression rate will
            slightly decrease.

            The window limit is specified in MByte.
            The default memory window limit is 8 MByte.

   -1..9  - Sets the compression factor of the zlib library. The zlib
            library is a programming interface to the compression used
            by gzip. It is based on the Lempel-Ziv compression method.
            The Lempel-Ziv compression can yield better results for
            higher compression factors, but can also be substantially
            slower.
            The default value is 6.

      -c  - Write the compressed output to the standard output instead
            of to some file. This allows the user to pipe the output
            directly into some other process.

            If no file is specified in the command line and option '-c'
            is specified, then the program also reads the input from
            the standard input.

      -d  - Deletes the input XML files after the compression. The default
            is to keep the input file.

      -f  - Overwrites any existing compressed files. If this option is
            not specified, then the user is asked for every existing
            output file whether it should be overwritten.

      -w  - Preserves all white spaces in the XML document. If the option
            is not specified, only the elements and the text strings are
            stored in the compressed file. The decompressor can optionally
            add different kinds of indentation to produce a legible
            XML document.
            The preservation of white spaces guarantees the exact
            preservation of the XML document, provided the specified
            user compressors are not "lossy" (see Section 4).

            The preservation and storage of white spaces can be controlled
            through more options described in Section 2.2.

      -h  - Shows the extended options and lists all predefined user
            compressors.

  file... - XML file(s) to be compressed.  For each file named
            'name.xml', a compressed file 'name.xmi' will be created.


2.2. Options for White Space and Special String Handling

Several extended options are available to control the handling of
white spaces and special XML strings, such as DTDs or processing
instructions. The general syntax of those extended options is as
follows:

  xmill ... [-w(i|g|t)] [-l(i|g|t)] [-r(i|g|t)] [-a(i|g)]
            [-n(c|t|p|d)]  file ...

with the following short description:

    -wi      - ignore complete white spaces (default)
    -wg      - store complete white spaces in global container
    -wt      - store complete white spaces as normal text
    -li      - ignore left white spaces (default)
    -lg      - store left white spaces in global container
    -lt      - store left white spaces as normal text
    -ri      - ignore right white spaces (default)
    -rg      - store right white spaces in global container
    -rt      - store right white spaces as normal text
    -ai      - ignore attribute white spaces (default)
    -ag      - store attribute white spaces in global container

    -nc      - ignore comments
    -nt      - ignore DOCTYPE sections
    -np      - ignore PI sections
    -nd      - ignore CDATA sections

2.2.1. Options for White Space Handling

Before we can describe the exact meaning of each white space option,
the following explanations are necessary.

A white space sequence is a consecutive sequence of white space characters
(' ', '\t', '\n', and '\r'). XMill distinguishes four types of
white space sequences that can occur in XML documents:

   1. A "complete white space sequence" is a sequence that
      is delimited by some start or end tags. Consider the following
      example:

         <Employee @name=" Tom "  @ssno="123-45-6789" >
            <Address> New York City </Address>
            <Age>     34            </Age>
         </Employee>

      All white spaces (newlines, tabulators, ...) between
      element <Employee @name=" Tom "> and element <Address>,
      element </Address> and <Age>, and </Age> and </Employee>
      are complete white space sequences.

      
   2. A "left white space sequence" is a sequence that is delimited
      by an element tag on the left and some text on the right.
      For example, the space between <Address> and New York City and the
      space between <Age> and 34 are left white space sequences.
      Furthermore, the attribute string " Tom " has a left white space
      at the beginning.

   3. Analogously, a "right white space sequence" is delimited by some text
      on the left and some element tag on the right.
      For example, the space between New York City and </Address> and
      the space between 34 and </Age> are right white space sequences.
      Furthermore, the trailing space in " Tom " forms a right white
      space sequence.

   4. An "attribute white space sequence" is a white space sequence between
      attributes in the start tag. This includes white spaces between
      the element name and the first attribute, the white spaces
      between attributes and the white spaces between the last attribute
      and the '>' symbol at the end of the element.

      In the example above, the spaces between 'Employee' and '@name...',
      between '...=" Tom "' and '@ssno...' and '...-6789' and '>'
      are attribute white space sequences.

For the first three types of white spaces, there are three possible
storage techniques:

   1. The white spaces could be ignored. In this case, the decompressor
      cannot reconstruct the exact original content. Instead, the
      decompressor creates a user-defined indentation using tabulators
      or spaces.

   2. The white spaces are treated as normal text. In this case, the 
      grouping mechanism and user compressors are invoked on each of
      the white space sequences. Left and right white spaces are
      joined together with the text before user compressors are applied.
      For example, '     34            ' would form a single text item.

   3. All white spaces are stored in a separate global container.
      For example, the left white space sequence '     ' and
      the right white space sequence '            ' could both be
      stored in the global container while the string '34' is
      handled by a user compressor.

For attribute white spaces, only options 1 and 3 are possible. If attribute
white spaces are ignored, then the decompressor automatically inserts
a single space ' ' between all attribute definitions.
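
The three storage techniques for left and right white spaces can be
illustrated with the Python sketch below. It models the effect of the
'-l(i|g|t)' and '-r(i|g|t)' options on a single text item; the function
names and the representation of the global container as a plain list
are assumptions made for this example only.

```python
WHITE_SPACE = " \t\n\r"


def split_white_space(item: str):
    """Split a text item into (left white space, text, right white space)."""
    stripped_left = item.lstrip(WHITE_SPACE)
    left = item[: len(item) - len(stripped_left)]
    text = stripped_left.rstrip(WHITE_SPACE)
    right = stripped_left[len(text):]
    return left, text, right


def handle(item: str, left_mode: str = "i", right_mode: str = "i"):
    """Apply mode 'i' (ignore), 'g' (global container) or 't' (text).

    Returns the text item passed on to the user compressor and the
    white space sequences stored in the global container.
    """
    left, text, right = split_white_space(item)
    global_container = []
    if left_mode == "g":
        global_container.append(left)
    if right_mode == "g":
        global_container.append(right)
    if left_mode == "t":
        text = left + text
    if right_mode == "t":
        text = text + right
    return text, global_container


age = " " * 5 + "34" + " " * 12  # content of the <Age> element above
```

With the defaults, 'handle(age)' yields only '34'; with modes 'g' and
'g', the two white space sequences go to the global container while
'34' remains available to a user compressor such as 'u'.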

2.2.2. Options for Special String Handling

There are four types of special XML strings: comments (of the form
'<!-- ... -->' ), DTDs ( '<!DOCTYPE ... >' ), processing instructions
( '<? ... ?>' ) and CDATA sections ( '<![CDATA[ ... ]]>' ).

By default, all special XML strings are stored in the compressed file
in the same global container. Using options '-nc', '-nt', '-np', and '-nd',
some of the special XML strings can be ignored.


3. Path Expressions

Path expressions are used to group text items together. For example,
one might want to group all social security numbers of employees together.
All text items in the same group are stored and compressed in the same
container. Because of the strong similarities between the text items,
substantially higher compression rates can be achieved.

Path expressions can be specified through option '-p'.
Each path expression describes paths from the root of the XML
document to some node within the document.
The language of path expressions is based on XPath.
Each path must start with symbol '/' and components of the 
path are separated by '/'. For example, '/Company/Employee/@ssno'
denotes the social security numbers of employees in company
elements. The symbol '@' precedes attributes.

The special label '*' matches any label. For example, /*/Employee/@ssno
matches the SS# of any employee that is located one level deep
in the XML document.

The separator '//' can be used instead of '/' to specify an arbitrary
sequence of labels between two labels. For example, '/Company//Employees'
describes employees *anywhere* within a company element.
Or, '//Employees' describes all employees in the entire document.

It is also possible to specify alternative labels for each item in the
path. For example, (Employee|Secretary) represents Employee or Secretary
labels and '/Company//(Employee|Secretary)' has the expected meaning.
Note that the parentheses can be omitted, since '|' has a higher precedence
than '/' and '//'.
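
One way to picture the matching rules above is to translate a path
expression into a regular expression over label paths. The Python
sketch below does this for '/', '//', '*' and '|' only. It is not how
XMill matches paths internally, it does not handle the '#' symbols, and
it assumes that '//' stands for zero or more intermediate labels. The
names 'path_expr_to_regex' and 'matches' are hypothetical.

```python
import re


def path_expr_to_regex(expr: str) -> "re.Pattern[str]":
    """Translate a path expression into a regex over '/a/b/c' paths."""
    if not expr.startswith("/"):
        raise ValueError("a path expression must start with '/'")
    pattern = ""
    # Each step is a separator ('/' or '//') followed by a label part.
    for sep, step in re.findall(r"(//|/)([^/]*)", expr):
        if not step:
            raise ValueError("empty steps are not supported in this sketch")
        if sep == "//":
            pattern += r"(?:/[^/]+)*"  # any sequence of labels
        alternatives = step.strip("()").split("|")
        pattern += "/(?:" + "|".join(
            "[^/]+" if alt == "*" else re.escape(alt)
            for alt in alternatives
        ) + ")"
    return re.compile(pattern)


def matches(expr: str, path: str) -> bool:
    """True if the whole label path is described by the expression."""
    return path_expr_to_regex(expr).fullmatch(path) is not None
```

For instance, 'matches("/*/Employee/@ssno", "/Company/Employee/@ssno")'
holds, while the same expression does not match
'/Root/Company/Employee/@ssno', because '*' stands for exactly one
label.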

Each path expression identifies all text items that should be grouped
in one container. Note that text items in deeper nested levels of the
XML document are not included by the same path expression.
This is particularly important in the context of mixed content.
Consider the following XML data:

   <Employee>
      Alex <middle> P. </middle> Miller
   </Employee>

Here, a path expression for employees (e.g. '//Employee')
would only group text items 'Alex' and 'Miller' and not the text item 'P.'.

3.1. The '#' symbol

Conventional path expressions only allow us to specify the text items
for one single container. Hence, a separate container expression
must be specified for each container.

The special label '#' allows us to group data items into multiple
containers at the same time. Basically, the '#' label stands for
any kind of label. For each possible instantiation of the '#' label
with a specific label, a different data container is generated.
For example, the path expression '/Employee/#' generates a container
for each subelement of 'Employee'.
The path expression '#//' would group text items depending on their
top level element in the XML document.

There are two additional symbols for specifying multiple container
groupings: '@#' and '##'.
While the '#' symbol can be instantiated as an element or attribute label,
the '@#' symbol can only be instantiated as an attribute.
Similar to '//', the '##' symbol describes a sequence of labels. Each
possible path instantiation for '##' identifies a separate container.

Such generalized path expressions with '#' symbols are typically
ambiguous. For example, consider the path expression '//#//'. It is
not clear what the instantiation of the '#' symbol should be, since
it could be any label on any path.
Therefore, we adopt the following policy:

Path expressions are always parsed from *right to left*, i.e. the parsing
starts at the leaf node for a specific text item. The reason for this
rather counterintuitive approach is described in [1].

The basic policy for '#' symbols is to instantiate them *as early as possible*
during the parsing, i.e. '#' symbols are instantiated in the right-most
manner. Therefore, the '#' symbol in '//#//' would be instantiated
to the last label in the path. This label would identify the container
for the text item.

3.2. Order of Path Expressions

Several path expressions can identify the same text item. For example,
'//Employee/@ssno' identifies a subset of the text items identified by
'//*/@ssno'. Therefore, the order of the path expressions on the
command line (or in the option file) determines their precedence:
the first matching path expression is used.
For example,

   xmill -p//Employee/@ssno -p//*/@ssno

will group all SS#s of employees in a container separate from
all other SS#s. In contrast,

   xmill -p//*/@ssno -p//Employee/@ssno 

would group all SS#s together in one container. There is no text
item that matches the second path expression but does not
match the first path expression.
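
The effect of the ordering can be reproduced with a small Python model
in which the first matching expression wins. The matcher is a compact
sketch that supports only '/', '//' and '*', and assumes that '//'
stands for zero or more labels; the names 'to_regex' and 'first_match'
are our own and not part of XMill.

```python
import re


def to_regex(expr: str) -> "re.Pattern[str]":
    """Compact translation of '/', '//' and '*' into a regex."""
    pattern = ""
    for sep, step in re.findall(r"(//|/)([^/]+)", expr):
        if sep == "//":
            pattern += r"(?:/[^/]+)*"
        pattern += "/" + ("[^/]+" if step == "*" else re.escape(step))
    return re.compile(pattern)


def first_match(exprs, path):
    """Index of the first expression that matches the path, or None."""
    for index, expr in enumerate(exprs):
        if to_regex(expr).fullmatch(path):
            return index
    return None


specific_first = ["//Employee/@ssno", "//*/@ssno"]
general_first = ["//*/@ssno", "//Employee/@ssno"]
```

With 'specific_first', employee SS#s are assigned to expression 0 and
all other SS#s to expression 1. With 'general_first', every SS# is
assigned to expression 0 and the second expression is never used.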

3.3. Default Grouping

In order to guarantee that each text item has at least one matching
path expression, the two path expressions '//#' and '/' are appended
to the end of the list of user-defined path expressions.
The expression '//#' captures all non-empty paths in the document and
builds containers based on the last label in the path.
The expression '/' captures the empty path and all text items at
the root of the XML tree.
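
In other words, the fallback behaves roughly like the following Python
function, which picks a container key from the last label of the path.
The function name 'default_container' and the use of '/' as the key for
the empty path are assumptions made for this illustration.

```python
def default_container(path: str) -> str:
    """Container key chosen by the fallback expressions '//#' and '/'."""
    labels = [label for label in path.split("/") if label]
    if not labels:
        return "/"     # empty path: text items at the root
    return labels[-1]  # '//#': '#' is instantiated to the last label
```

Note that '/Person/Title' and '/Book/Title' end up in the same default
container, which is exactly the situation that user-defined path
expressions are meant to resolve.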


4. User Compressors

Path expressions are a powerful mechanism to group text items
with respect to their meaning. It is quite often the case that
these text items have similar origin and syntax. For example,
social security numbers will be strings of the form 'DDD-DD-DDDD'
where 'D' denotes a digit.

It is often possible to apply additional "semantic compression"
to those strings by taking advantage of this semantic knowledge.

In XMill, user compressors can be used to compress text items
grouped by the same path expression together. Path expressions with
user compressors have the following syntax:

   pathexpr=>usercompressor

For example, to compress all ages of employees as positive integers, 
one could write '//Employee/Age=>u'. To convert a social security number
into three integer parts, one could write
'//Employee/@ssno=>seqcomb(u "-" u "-" u)'.

XMill currently provides the following user compressors:

    di       - Delta compressor for signed integers
    e        - Compressor for small number of distinct data values
    i        - Compressor for signed integers
    or       - Variant compressor
    p        - Print compressor
    rep      - Compressor for substrings separated by some delimiter string
    rl       - Run-length encoder for arbitrary text strings
    seq      - Sequence compressor for strings with separators
    seqcomb  - Combined sequence compressor for strings with separators
    t        - Plain text compressor
    u        - Compressor for unsigned integers
    u8       - Compressor for integers between 0 and 255
    "..."    - Constant compressor

It is important to note, however, that XMill is extensible and that
the programmer can easily add additional special-purpose user compressors.

Each user compressor consumes strings with a specific syntax.
Each string is replaced by a different (typically much smaller) string
that represents the same information.

If the string does not have the expected syntax, the user
compressor rejects the string and the next matching path expression
with a different user compressor is considered.
If no user compressor accepts the string, the default
user compressor 't' of path expressions '//#' or '/' is used.

We distinguish two types of user compressors: atomic user compressors
and combined user compressors. While atomic user compressors parse and
compress certain text atoms such as integers, combined user
compressors are parameterized by other, smaller user compressors.
In the example above, the user compressor 'seqcomb(u "-" u "-" u)'
has parameters 'u', "-", 'u', "-", and 'u'.

In general, user compressors can have any parameter string (...)
following the identification name. Depending on the user compressor,
the parameter string must satisfy certain restrictions.

Each user compressor operates on a fixed number of containers.
For example, the user compressor 'u' only needs one container
(to store the binary integer values), while 'seq(u "-" u)' requires
two containers: one for each of the integers.
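
The interplay of accepting, rejecting and container usage can be
sketched in Python for a sequence of unsigned integers, in the spirit
of 'seq(u "-" u "-" u)'. This is a simplified model, not XMill's C++
interface: 'compress_unsigned' and 'seq_unsigned' are hypothetical
names, containers are plain lists, and leading zeros are not preserved
here.

```python
def compress_unsigned(text: str):
    """Stand-in for 'u': accept ASCII digit strings, reject the rest."""
    if not (text.isascii() and text.isdigit()):
        return None  # reject
    return int(text)


def seq_unsigned(text: str, delimiter: str, containers: list) -> bool:
    """Split at the delimiter and store one integer per container.

    Returns False (reject) without touching the containers if the
    string does not have the expected syntax.
    """
    parts = text.split(delimiter)
    if len(parts) != len(containers):
        return False
    values = [compress_unsigned(part) for part in parts]
    if any(value is None for value in values):
        return False
    for container, value in zip(containers, values):
        container.append(value)
    return True


containers = [[], [], []]  # three subcompressors, three containers
accepted = seq_unsigned("123-45-6789", "-", containers)
rejected = seq_unsigned("unknown", "-", containers)
```

After these two calls only the first string has been stored; the
second one was rejected and would be passed on to the next matching
path expression.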


The meaning of the user compressors follows in detail:


      di  - Delta compressor for signed integers.
            The delta compressor stores the changes between successive
            integer numbers. For example, the sequence '10010', '10024',
            '10005', '10006' would be represented as '10010', '14', '-19', '1'.
            
            Note that integers in general are compressed by using a
            variable length encoding scheme: Small values are represented
            by only one byte, medium values by two, and large values by four bytes.

            The delta compressor can have an integer value as a parameter
            describing the minimal number of characters in the output.
            If a given number is too short, then leading zeros are inserted.
            For example, 'di(3)' specifies that numbers should have at least
            3 characters. Hence, '23' and '-2' would be displayed as
            '023' and '-02', respectively.

            Note that the parameter only affects the output (i.e. decompression)
            of the string. The input XML file is parsed as before and no
            additional syntactic restrictions are imposed.

       e  - "Dictionary" or "Enumeration" Encoder.
            The dictionary encoder replaces each text item by a specific
            integer number that is an index into a dictionary.
            The dictionary is extended whenever a new text item is found.

            In addition to the indices, the user compressor must also
            store the dictionary in the compressed file. The dictionaries
            are separately stored at the beginning of the file.

            Note that dictionaries are particularly efficient for path
            expressions that identify a small set of distinct text items,
            such as airport codes, country names, etc.

       i  - Compressor for signed integers.
            This compressor converts signed integers into their binary
            format. Variable length encoding is applied to reduce the size
            of the binary representation.

            Similar to delta encoding, an optional parameter can 
            specify the minimal number of characters in the result.

      or  - Variant compressor.
            The variant compressor tries to apply different kinds of
            compressors to the same text item until one of them
            succeeds. The parameters of the variant compressor are
            the alternative subcompressors that should be tested
            for each string. For example, the user compressor string
            'or(u seq(u "-" u))' accepts simple integers such as '45'
            or integer pairs such as '34-67'. This is useful when
            related information is represented in different ways, such
            as for page numbers.

            Note that if none of the subcompressors accepts the string,
            then the variant compressor itself rejects the string.

            To determine which subcompressor was used, it is necessary
            to keep an additional index value. Therefore, the 'or'
            compressor fills an additional container that stores the
            index.

       p  - Print compressor.
            The print compressor can be used for debug purposes. It
            never accepts or compresses any string. Instead, it simply
            prints the string to the standard output.
            This is useful to verify that a certain text item always
            complies with a certain format. For example,
            '/Employee/Age=>or(u p)' will print all ages that are not
            positive integers.

     rep  - Compressor for substrings separated by some delimiter string.
            The repeat compressor compresses a string by dividing it
            by certain delimiters and compressing each part separately.
            For example, to compress a comma-separated sequence of
            keywords, one could use the string 'rep("," e)'.
            The string "," denotes the delimiter and 'e' denotes the
            user compressor applied to each string piece - i.e. the
            keywords.
            In addition to the container(s) required by the subcompressor,
            one extra container is needed to store the number of string
            pieces for a given string. Intuitively, the decompressor
            first reads the number of string pieces and then reads
            as many string pieces from the containers of the
            subcompressor.

            Optionally, a third parameter can be specified to compress
            the last string piece after the last delimiter separately.
            For example, 'rep("," e u)' applied to string
            'xml,compress,xmill,5' would apply compressor 'e' to
            the three text pieces 'xml', 'compress', and 'xmill' and
            would apply compressor 'u' to string '5'.

            A current restriction of the repeat compressor is that
            the subcompressor must always be accepting. I.e. one could not
            write 'rep("," u)'. The reason for this is that in the case of a
            rejection, the previous parsing and compression of text items
            must be undone.
            The problem can easily be solved by using a variant
            compressor such as 'or(u t)':  'rep("," or(u t))'.

      rl  - Run-length encoder for arbitrary text strings.
            The run-length encoder represents several identical
            consecutive strings by a tuple (len,string) where 'len' is
            the number of identical strings and 'string' is the actual
            string.
            This is particularly effective for data items whose
            value does not change frequently within the file.

    seq   - Sequence compressor for strings with separators.
            The sequence compressor divides a given string along delimiter
            lines and compresses each piece separately by different
            compressors.
            The parameter of a sequence compressor is an alternating
            sequence of subcompressors and delimiter strings, i.e.
            each subcompressor must be followed by a delimiter string
            and vice versa.

            For a given string, the sequence compressor first traverses
            the string to find the first delimiter. The part of the string 
            before the delimiter is compressed using the first compressor.
            Then, the second delimiter is identified and so on.

            The subcompressors parse the string pieces separately and
            store the compressed representation in separate containers.
            For example, 'seq(u "-" u "-" u)' will store the three integers
            of each string in three different containers.

  seqcomb - Combined sequence compressor for strings with separators.
            The only difference between 'seqcomb' and 'seq' is that the
            containers of the subcompressors 'overlap'.
            For example, 'seqcomb(u "-" u "-" u)' only uses one container
            to store all three integers.

            In general, 'seqcomb' takes the maximum of the number
            of containers required by the subcompressors.

            In practice, 'seqcomb' will often lead to surprisingly different
            compression results. For example, consider some structured
            values (such as SS#, IP address, or date) with only a small
            number of different values: a large number
            of elements might refer to the same IP address as the
            default router.
            Then, it is often better to store the single IP bytes together
            instead of in four separate containers, since Lempel-Ziv
            can exploit more similarities between entire IP numbers.

       t  - Plain text compressor.
            The plain text compressor is the default user compressor
            and simply copies the text item into the container.

       u  - Compressor for unsigned integers.
            This user compressor is very similar to the user compressor
            'i' for signed integers. It only accepts positive integers.
            It also allows an optional additional parameter specifying
            the minimum number of digits.

      u8  - Compressor for integers between 0 and 255.
            This compressor is similar to 'u', but accepts a smaller
            range of integers. These integers are always stored in a
            single byte.
            Like in 'di', 'i', and 'u', an additional minimum number
            of digits can be specified.

    "..." - Constant compressor.
            The constant compressor only verifies that the given text item
            is equivalent to the constant. Otherwise, the compressor
            rejects the string. The constant compressor does not store
            any data in any containers.
            
            The constant compressor is useful for specifying small sets
            of distinct values. For example, one might specify that
            a price could be a number, 'low' or 'high':
            '//price=>or(u "low" "high")'.
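
To give a feeling for what some of these compressors compute, the
following Python sketch models the core idea of 'di', 'rl' and 'e' on
lists of values. It only illustrates the transformations described
above; the function names are hypothetical, and XMill's actual binary
encoding (including the variable length integer format) is not
reproduced.

```python
def delta_encode(values):
    """'di': store the first value and then the changes between values."""
    previous, deltas = 0, []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode."""
    total, values = 0, []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def format_min_width(value: int, width: int) -> str:
    """Output with a minimal number of characters, as in 'di(3)'."""
    return f"{value:0{width}d}"


def run_length_encode(items):
    """'rl': represent identical consecutive strings as (len, string)."""
    runs = []
    for item in items:
        if runs and runs[-1][1] == item:
            runs[-1] = (runs[-1][0] + 1, item)
        else:
            runs.append((1, item))
    return runs


def dictionary_encode(items):
    """'e': replace each text item by an index into a growing dictionary."""
    dictionary, indices = {}, []
    for item in items:
        # A new item gets the next free index; known items reuse theirs.
        indices.append(dictionary.setdefault(item, len(dictionary)))
    return list(dictionary), indices
```

For the sequence used in the description of 'di',
'delta_encode([10010, 10024, 10005, 10006])' yields
'[10010, 14, -19, 1]'.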


5. Options of the Decompressor XDemill

The options of the decompressor are largely used for controlling the
output format if white spaces are omitted on the compressor side.
The general syntax of the decompressor XDemill is as follows:

  xdemill [-i file] [-v] [-c] [-d] [-f] [-os num] [-ot] [-on] [-od] [-ou] file ...

       -i file  - include options from file
       -c       - write on standard output
       -d       - delete input files
       -f       - force overwrite of output files
       -os num  - output formatted XML with space indentation
       -ot      - output formatted XML with tabular indentation
       -on      - output unformatted XML (without white spaces)
       -od      - uses DOS newline convention (default)
       -ou      - uses UNIX newline convention
      file ...  - compressed file(s) to be decompressed; for each
                  file 'name.xmi' a new file called 'name.xml' will be
                  generated.


The detailed descriptions of the options follow:

 -i file  - Reads additional options from text file 'file'. The text
            file must contain a sequence of options that are separated
            by white spaces.

      -c  - Write the XML output to the standard output instead
            of to some file. This allows the user to pipe the output
            directly into some other process.

            If no file is specified in the command line and option '-c'
            is specified, then the program also reads the input from
            the standard input.

      -d  - Deletes the compressed input files after the decompression.
            The default is to keep the input file.

      -f  - Overwrites any existing output files. If this option is
            not specified, then the user is asked for every existing
            output file whether it should be overwritten.

  -os num - Outputs the XML data with space indentation. The number of
            spaces is specified by 'num'. Note that 'num' can also be zero.
            In this case, element tags are printed on separate lines
            with no indentation.
            Space indentation with a single space is the default
            if the compressed file does not store the white spaces of
            the original document.

      -ot - Outputs the XML data with tabulator indentation. Each indentation
            is represented by one tabulator character.

      -on - No white spaces are added to the XML data. This is the default,
            if the compressed file already stores white spaces.

      -od - Uses DOS notation for printing newlines in formatted XML.
      -ou - Uses UNIX notation for printing newlines in formatted XML.
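
The effect of the formatting options can be pictured with the
following Python sketch. It is an illustration only and not XDemill's
implementation: the function 'render', its parameters, and the
representation of the document as (depth, text) pairs are assumptions
made for this example, and the exact line breaking of XDemill may differ.

```python
def render(lines, mode="os", num=1, dos_newlines=True):
    """Render (depth, text) pairs in the style of -os, -ot or -on."""
    if mode == "on":
        # -on: no white spaces are added at all
        return "".join(text for _, text in lines)
    # -ot: one tabulator per level; -os: 'num' spaces per level
    unit = "\t" if mode == "ot" else " " * num
    # -od: DOS newlines (CR LF); -ou: UNIX newlines (LF)
    newline = "\r\n" if dos_newlines else "\n"
    return "".join(unit * depth + text + newline for depth, text in lines)


doc = [(0, "<a>"), (1, "<b>x</b>"), (0, "</a>")]
print(repr(render(doc, mode="os", num=2, dos_newlines=False)))
print(repr(render(doc, mode="ot")))
print(repr(render(doc, mode="on")))
```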


6. The Compression Verbose Mode

To analyse the effect of specified path expressions and user compressors,
the user can choose the 'verbose mode' (option '-v') to print important
statistical information about the compression achieved for each single
container. The information is always printed to the standard output.

Each 'run' (i.e. the container data stored each time the memory
window is exhausted) produces a separate section in the output.
At the beginning of the section, information about the compression
of the structure, white space and special data containers is printed.
Afterwards, the compression details for each of the data containers are
displayed.
For example, consider the following command line for compressing
the content of weblog data (4MByte):

   xmill -w -v weblog.xml

Note that we also preserve white spaces in the compressed file (option '-w').
The verbose output has the following form:


      Structure:  504693 ==>     4923 (0.975444%)
    Whitespaces:  582175 ==>     4459 (0.765921%)
        Special:       0 ==> Small...
           Sum:  1086868 ==>     9382 (0.863214%)

   //# <- CLF:host
             0:   141432 ==>    23811 (16.835652%)

   //# <- CLF:requestLine
             0:   309891 ==>    22199 (7.163487%)

   //# <- CLF:contentType
             0:   101496 ==>     3548 (3.495704%)

   //# <- CLF:statusCode
             0:    40000 ==>     1908 (4.770000%)

   //# <- CLF:date
             0:   200000 ==>    19235 (9.617500%)

   //# <- CLF:byteCount
             0:    42061 ==>    10869 (25.841040%)

   //# <- CLF:referer
             0:   276922 ==>    27856 (10.059150%)

   //# <- CLF:userAgent
             0:   400736 ==>    24348 (6.075820%)

   //# <- CLF:cooke2
             0:     7733 ==>      399 (5.159705%)

   Header:          Uncompressed:      207   Compressed:      176 (85.024155%)
   Structure:       Uncompressed:   504693   Compressed:     4923 (0.975444%)
   Whitespaces:     Uncompressed:   582175   Compressed:     4459 (0.765921%)
   Data:            Uncompressed:  1520271   Compressed:   134173 (8.825598%)
                                             --------------------
   Sum:                                      Compressed:   143731

The first few lines show the information about the structure,
white space, and special data containers. The left column shows the
uncompressed size of the data in bytes and the right column the compressed
size, followed by the compressed size as a percentage of the uncompressed
size. Note that the structure and the white spaces compress extremely
well because of the regularity of the data.
Since there are no special data sections in the XML file, the
corresponding container size is zero.
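
The percentage in parentheses is the compressed size relative to the
uncompressed size. The following Python sketch extracts the numbers
from one line of the verbose output and recomputes the percentage.
The regular expression and the name 'parse_line' are assumptions based
on the output shown above, not a documented format.

```python
import re

# label:  <uncompressed> ==> <compressed> (<percentage>%)
LINE = re.compile(
    r"^\s*(?P<label>[^:]+):\s+(?P<raw>\d+) ==>\s+(?P<packed>\d+)"
    r" \((?P<pct>[\d.]+)%\)\s*$"
)


def parse_line(line):
    """Return (label, uncompressed, compressed, percent) or None."""
    m = LINE.match(line)
    if m is None:
        return None  # e.g. lines ending in 'Small...'
    return m["label"].strip(), int(m["raw"]), int(m["packed"]), float(m["pct"])


label, raw, packed, pct = parse_line(
    "      Structure:  504693 ==>     4923 (0.975444%)")
# The recomputed value agrees with the printed one up to rounding.
print(label, raw, packed, pct, round(100.0 * packed / raw, 6))
```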

Containers whose uncompressed size is smaller than 3KByte are not
compressed directly by gzip. Instead, all such containers are
grouped together and compressed in one single gzip run. This is more
efficient, since gzip yields better results for larger data blocks.
The compressed size of small containers is therefore not determinable
and only described as 'Small...'.
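
The benefit of grouping small containers can be reproduced with a few
lines of Python. The sketch uses Python's zlib module as a stand-in
for gzip (both are based on the DEFLATE algorithm). The limit of
3*1024 bytes, the helper names, and the sample data are assumptions
made for this example.

```python
import zlib

SMALL_LIMIT = 3 * 1024  # assumed meaning of '3KByte'


def is_small(container):
    """True if the container falls below the small-container limit."""
    return len(container) < SMALL_LIMIT


def separate_vs_grouped(containers):
    """Compare per-container compression with one combined run."""
    small = [c for c in containers if is_small(c)]
    separate = sum(len(zlib.compress(c)) for c in small)
    grouped = len(zlib.compress(b"".join(small)))
    return separate, grouped


# Twenty small containers with similar content (100 bytes each).
sample = [b"GET /index.html HTTP/1.0\n" * 4 for _ in range(20)]
separate, grouped = separate_vs_grouped(sample)
print("separate:", separate, "grouped:", grouped)
```

Each separately compressed block carries its own stream overhead and
cannot reuse repetitions found in the other blocks, so the grouped run
is smaller for data of this kind.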

The lines afterwards show the compression of the actual data
containers. Each instantiation of symbol '#' in path expression '//#'
(the default path expression) has an associated container with the
data from the corresponding path(s) in the XML file. For example,
all 'CLF:host' fields in the XML file are stored in the first container
and have an accumulated size of 141432 bytes. This compresses to
23811 bytes.
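
The grouping performed by the default path expression can be
approximated in Python as follows. This is a simplified sketch and not
XMill's algorithm: it groups text items by element name only, ignores
attributes and white space, and uses unprefixed element names so that
no namespace declaration is needed. The function name
'default_containers' and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def default_containers(xml_text):
    """Collect the text items of each element name in one container."""
    containers = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return dict(containers)


sample = (
    "<log>"
    "<entry><host>10.0.0.1</host><statusCode>200</statusCode></entry>"
    "<entry><host>10.0.0.2</host><statusCode>404</statusCode></entry>"
    "</log>"
)
for name, items in default_containers(sample).items():
    # Accumulated size of the text items, as in the verbose output.
    print(name, items, sum(len(i) for i in items))
```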

The container compression details are followed by a summary of the
compression achieved. The summary is divided into six parts:
1) the header, 2) the structure, 3) the white spaces,
4) the special data, 5) the user compressor data, and 6) the actual data.
In the example above, the special data summary and the user compressor
data summary are missing since they are empty.

1) The header summary shows the size of the header. The header contains
essential information about the compressed file, such as the label
dictionary, the path expressions, the number and size of the containers,
etc.

2) The structure summary shows the sum of the size of all structure
containers. Recall that each run in the file has a separate structure
container.

3) The white space summary shows the sum of the size of all white space
containers.

4) The special data summary shows the sum of the size of all special data
containers.

5) The user compressor summary shows the size of the user compressor data.
Certain user compressors need to store data structures separate from
the actual data. For example, the dictionary compressor must store
the dictionaries. Section 6.1 shows a complex example of using
dictionary compressors.

6) The actual data summary shows the overall size of all data containers.

Finally, the overall size of the compressed file can be computed as the
sum of all sections.
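
As a check, the figures of the weblog example above can be added up
in Python. The numbers are copied from the verbose output shown
earlier in this section; the variable and function names are chosen
for this example only.

```python
# Compressed sizes from the summary of the weblog example.
SUMMARY = {
    "Header": 176,
    "Structure": 4923,
    "Whitespaces": 4459,
    "Data": 134173,
}

# Compressed sizes of the nine data containers, in output order.
DATA_CONTAINERS = [23811, 22199, 3548, 1908, 19235,
                   10869, 27856, 24348, 399]


def total_compressed(summary):
    """Overall compressed size as the sum of all summary sections."""
    return sum(summary.values())


print(sum(DATA_CONTAINERS))       # compare with the 'Data' line
print(total_compressed(SUMMARY))  # compare with the 'Sum' line
```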

6.1. Applying User Compressors

We have illustrated at the beginning of the section how the default
path expression '//#' creates a container for each instantiation of '#'.
As described in Section 4, the compression rate can be increased
substantially by applying user compressors to the text items identified
by the path expressions.

The following set of path expressions with user compressors is used
for compressing the weblog data:

   -p//CLF:host=>seqcomb(u8 "." u8 "." u8 "." u8)
   -p//CLF:userAgent=>rt:seq(e "/" e)
   -p//CLF:byteCount=>u
   -p//CLF:cookie2=>e
   -p//CLF:statusCode=>e
   -p//CLF:contentType=>e
   -p//CLF:requestLine=>seq("GET " rep("/" e) " HTTP/1." e)
   -p//CLF:date=>seq(u "/" u8(2) "/" u8(2) "-" u8(2) ":" di(2) ":" di(2))
   -p//CLF:referer=>or(seq("file:" t) seq("http://" or(seq(rep("." e) "/" rep("/" e)) rep("." e))) t)

Here, the user compressor 'seqcomb(u8 "." u8 "." u8 "." u8)' compresses
IP addresses (note that we use 'seqcomb' instead of 'seq' to achieve higher
compression rates!). The text items in 'CLF:userAgent' are separated by
symbol '/' and stored in two dictionaries for the left and the right part,
respectively. Note that we use the option 'rt' to treat right white spaces
as normal text. Similarly, the other user compressors are chosen to
achieve high compression rates based on the syntax of the text items.
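
The idea behind the IP address compressor can be sketched in Python.
The sketch assumes that 'u8' denotes an unsigned integer that fits
into one byte and that the four numbers are stored next to each other;
the actual storage layout of XMill is not shown here. The function
names 'pack_ip' and 'unpack_ip' are hypothetical.

```python
def pack_ip(text):
    """Encode 'a.b.c.d' as four bytes, or return None if it does not fit."""
    parts = text.split(".")
    if len(parts) != 4:
        return None
    values = []
    for part in parts:
        # Reject anything that would not be restored exactly,
        # e.g. non-numeric parts or numbers with leading zeros.
        if not part.isdecimal() or str(int(part)) != part:
            return None
        value = int(part)
        if value > 255:
            return None
        values.append(value)
    return bytes(values)


def unpack_ip(data):
    """Restore the original text from the four stored bytes."""
    return ".".join(str(b) for b in data)


packed = pack_ip("192.168.0.1")
print(len("192.168.0.1"), "characters ->", len(packed), "bytes")
print(unpack_ip(packed))
print(pack_ip("proxy.example.com"))  # host names do not match
```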

The following shows the verbose output for the user compressor
specification shown above:

      Structure:  494835 ==>     5412 (1.093698%)
    Whitespaces:  562428 ==>     4474 (0.795480%)
        Special:       0 ==> Small...
           Sum:  1057263 ==>     9886 (0.935056%)

   //CLF:host=>seqcomb(u8 "." u8 "." u8 "." u8) <-
             0:    40000 ==>    15856 (39.640000%)

   //CLF:requestLine=>seq("GET " rep("/" e) " HTTP/1." e) <-
          Enum:     6411 ==>     2792 (43.550148%)
          Enum:        4 ==> Small...
             0:     9833 ==>     2796 (28.434862%)
             1:    28921 ==>    10625 (36.738010%)
             2:     9833 ==>      266 (2.705176%)
           Sum:    55002 ==>    16479 (29.960729%)

   //CLF:contentType=>e <-
          Enum:       92 ==> Small...
             0:     9982 ==>     2079 (20.827489%)
           Sum:    10074 ==>     2079 (20.637284%)

   //CLF:statusCode=>e <-
          Enum:       36 ==> Small...
             0:    10000 ==>     1387 (13.870000%)
           Sum:    10036 ==>     1387 (13.820247%)

   //CLF:date=>seq(u "/" u8(2) "/" u8(2) "-" u8(2) ":" di(2) ":" di(2)) <-
             0:    20000 ==>       45 (0.225000%)
             1:    10000 ==>       34 (0.340000%)
             2:    10000 ==>       33 (0.330000%)
             3:    10000 ==>       44 (0.440000%)
             4:    10000 ==>      385 (3.850000%)
             5:    10000 ==>     3974 (39.740000%)
           Sum:    70000 ==>     4515 (6.450000%)

   //CLF:byteCount=>u <-
             0:    18957 ==>     9238 (48.731339%)

   //CLF:referer=>or(seq("file:" t) seq("http://" or(seq(rep("." e) "/" rep("/" e)) rep("." e))) t) <-
          Enum:     3054 ==>     2027 (66.371971%)
          Enum:     9013 ==>     4717 (52.335515%)
          Enum:     3763 ==>     2561 (68.057401%)
             0:     8873 ==>      156 (1.758143%)
             1:      789 ==> Small...
             2:     8830 ==>     1458 (16.511891%)
             3:     5347 ==>      395 (7.387320%)
             4:    16884 ==>     2441 (14.457475%)
             5:     5347 ==>     1376 (25.734056%)
             6:    10184 ==>     5502 (54.025923%)
             7:     3483 ==>      406 (11.656618%)
             8:    11428 ==>     2441 (21.359818%)
             9:      973 ==> Small...
           Sum:    86206 ==>    23480 (27.237083%)

   //CLF:userAgent=>rt:seq(e "/" e) <-
          Enum:      598 ==> Small...
          Enum:    19562 ==>     3922 (20.049075%)
             0:     9858 ==>      956 (9.697707%)
             1:    13631 ==>     9074 (66.568850%)
           Sum:    43649 ==>    13952 (31.964077%)

   //# <- CLF:cooke2
             0:     7733 ==>      399 (5.159705%)

   //# <- CLF:requestLine
             0:     3570 ==>      598 (16.750700%)

   //# <- CLF:userAgent
             0:     2012 ==>      275 (13.667992%)

   Header:          Uncompressed:     3010   Compressed:     1738 (57.740864%)
   Structure:       Uncompressed:   494835   Compressed:     5412 (1.093698%)
   Whitespaces:     Uncompressed:   562428   Compressed:     4474 (0.795480%)
   User Compressor: Uncompressed:    42533   Compressed:    16019 (37.662521%)
   Data:            Uncompressed:   304706   Compressed:    72239 (23.707771%)
                                             --------------------
   Sum:                                      Compressed:    99882


Similar to the compression without any user compressor specification,
the output is divided into information about structure, white space,
special data, and actual data containers.

The main difference from the default compression based on '//#' is that
each instantiation of a path expression can have several associated
containers used by the underlying user compressor. Consider, for example,
the second path expression:

   //CLF:requestLine=>seq("GET " rep("/" e) " HTTP/1." e) <-
          Enum:     6411 ==>     2792 (43.550148%)
          Enum:        4 ==> Small...
             0:     9833 ==>     2796 (28.434862%)
             1:    28921 ==>    10625 (36.738010%)
             2:     9833 ==>      266 (2.705176%)
           Sum:    55002 ==>    16479 (29.960729%)

The 'seq' compressor parses HTTP request strings of the form
"GET <directory> HTTP/1.?" where <directory> is a UNIX style
directory path with separator '/'. The protocol can either be
'HTTP/1.0' or 'HTTP/1.1'. Hence, the '?'-symbol can either be
'0' or '1'.

The repeat compressor 'rep("/" e)' divides the directory path
into its components and stores the directory names in a dictionary.

There are two dictionary compressors involved. Each of them
maintains a dictionary that is stored in the compressed file. The
first dictionary (in 'rep("/" e)') has 6411 bytes, while the
second dictionary has only 4 bytes, since it has only two elements:
'0' and '1' for the protocol version in 'HTTP/1.?'.

The repeat compressor requires one container that stores
the number of path components, i.e. components separated by '/'.
This is container 0 in the list above, with a size of 9833 bytes.
Furthermore, the dictionary compressor 'e' within the repeat compressor
stores the dictionary index for each component of the path.
This is container 1 in the list above, with a size of 28921 bytes.

Lastly, the second dictionary compressor stores the dictionary indices
(0 or 1) for the string '0' or '1' in 'HTTP/1.?'. This is container 2
in the list above, with a size of 9833 bytes.
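
The division of a request string into these containers can be
sketched in Python. This is an illustration under stated assumptions,
not XMill's implementation: the function 'split_request' is
hypothetical, the dictionaries are plain Python dicts, and the path is
simply split at every '/', so the empty string before the leading '/'
counts as a component.

```python
def split_request(line, dir_dict, ver_dict):
    """Return (component count, component indices, version index)."""
    prefix, middle = "GET ", " HTTP/1."
    if not line.startswith(prefix):
        return None
    rest = line[len(prefix):]
    if middle not in rest:
        return None
    path, version = rest.rsplit(middle, 1)
    components = path.split("/")

    def index(dictionary, key):
        # New strings get the next free index; known ones are reused.
        return dictionary.setdefault(key, len(dictionary))

    return (
        len(components),                           # container 0
        [index(dir_dict, c) for c in components],  # container 1
        index(ver_dict, version),                  # container 2
    )


dirs, versions = {}, {}
print(split_request("GET /docs/index.html HTTP/1.0", dirs, versions))
print(split_request("GET /docs/faq.html HTTP/1.1", dirs, versions))
print(dirs, versions)
```

Strings that do not have the expected form yield None in this sketch.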

The order of the containers in the verbose output is directly
derived from the order of the user compressors in the user
compressor string. However, the additional user compressor data (such as
the dictionaries of dictionary compressors) are stored separately
and their information is printed at the beginning of the list.

The last line contains the sum of the (un)compressed data. Note the
improvement that is achieved over the original compression using '//#':

   //# <- CLF:requestLine
             0:   309891 ==>    22199 (7.163487%)

