File Formats - ZTR-Chunk Format

Chunk Format

The basic structure of a ZTR file is (header,chunk*) - ie header followed by zero or more chunks. Each chunk consists of a type, some meta-data and some data, along with the lengths of both the meta-data and data.

Byte number   0  1  2  3  4  5  6  7  8  9
            +--+--+--+--+---+---+---+---+--+--+  -  +--+--+--+--+--+--  -  --+
Hex values  |   type    |meta-data length  | meta-data |data length| data .. |
            +--+--+--+--+---+---+---+---+--+--+  -  +--+--+--+--+--+--  -  --+

Ie in C:

typedef struct {
    uint4 type;			/* chunk type (be) */
    uint4 mdlength;		/* length of meta-data field (be) */
    char *mdata;		/* meta data */
    uint4 dlength;		/* length of data field (be) */
    char *data;			/* a format byte and the data itself */
} ztr_chunk_t;

All 2 and 4-byte integer values are stored in big endian format.

The meta-data is uncompressed (and so it does not start with a format byte). The format of the meta-data is chunk specific, and many chunk types will have no meta-data. In this case the meta-data length field will be zero and this will be followed immediately by the data-length field.

The data length is the length in bytes of the entire 'data' block, including the format information held within it.

The first byte of the data consists of a format byte. The most basic format is zero - indicating that the data is "as is"; it's the real thing. Other formats exist in order to encode various filtering and compression techniques. The information encoded in the next bytes will depend on the format byte.

Data format 0 - Raw

Byte number   0 1  2       N
            +--+--+--  -  --+
Hex values  | 0|  raw data  |
            +--+--+--  -  --+

Raw data has no compression or filtering. It just contains the unprocessed data. It consists of a one byte header (0) indicating raw format followed by N bytes of data.

Data format 1 - Run Length Encoding

Byte number   0  1    2     3     4      5     6  7  8               N
            +--+----+----+-----+-----+-------+--+--+--+--  -  --+--+--+
Hex values  | 1| Uncompressed length | guard | run length encoded data|
            +--+----+----+-----+-----+-------+--+--+--+--  -  --+--+--+

Run length encoding replaces stretches of N identical bytes (with value V) with the guard byte G followed by N and V. All other byte values are stored as normal, except for occurrences of the guard byte, which is stored as G 0. For example with a guard value of 8:

Input data:

	20 9 9 9 9 9 10 9 8 7

Output data:

	1			(rle format)
	0 0 0 10		(original length)
	8			(guard)
	20 8 5 9 10 9 8 0 7	(rle data)

Data format 2 - ZLIB

Byte number   0  1    2     3     4    5  6  7         N
            +--+----+----+-----+-----+--+--+--+--  -  --+
Hex values  | 2| Uncompressed length | Zlib encoded data|
            +--+----+----+-----+-----+--+--+--+--  -  --+

This uses the zlib code to compress a data stream. The ZLIB data may itself be encoded using a variety of methods (LZ77, Huffman), but zlib will automatically determine the format itself. Often using zlib mode Z_HUFFMAN_ONLY will provide best compression when combined with other filtering techniques.

Data format 64/0x40 - 8-bit delta

Byte number   0       1        2      N 
            +--+-------------+--  -  --+
Hex values  |40| Delta level |   data  |
            +--+-------------+--  -  --+

This technique replaces successive bytes with their differences. The level indicates how many rounds of differencing to apply, which should be between 1 and 3. For determining the first difference we compare against zero. All differences are internally performed using unsigned values with automatic an wrap-around (taking the bottom 8-bits). Hence 2-1 is 1 and 1-2 is 255.

For example, with level set to 1:

Input data:

      10 20 10 200 190 5

Output data:

       1			(delta1 format)
       1			(level)
       10 10 246 190 246 71	(delta data)

For level set to 2: Input data:

      10 20 10 200 190 5

Output data:

       1			(delta1 format)
       2			(level)
       10 0 236 200 56 81	(delta data)

Data format 65/0x41 - 16-bit delta

Byte number   0       1        2      N 
            +--+-------------+--  -  --+
Hex values  |41| Delta level |   data  |
            +--+-------------+--  -  --+

This format is as data format 64 except that the input data is read in 2-byte values, so we take the difference between successive 16-bit numbers. For example "0x10 0x20 0x30 0x10" (4 8-bit numbers; 2 16-bit numbers) yields "0x10 0x20 0x1f 0xf0". All 16-bit input data is assumed to be aligned to the start of the buffer and is assumed to be in big-endian format.

Data format 66/0x42 - 32-bit delta

Byte number   0       1        2  3  4      N 
            +--+-------------+--+--+--  -  --+
Hex values  |42| Delta level | 0| 0|   data  |
            +--+-------------+--+--+--  -  --+

This format is as data formats 64 and 65 except that the input data is read in 4-byte values, so we take the difference between successive 32-bit numbers.

Two padding bytes (2 and 3) should always be set to zero. Their purpose is to make sure that the compressed block is still aligned on a 4-byte boundary (hence making it easy to pass straight into the 32to8 filter).

Data format 67-69/0x43-0x45 - reserved

At present these are reserved for dynamic differencing where the 'level' field varies - applying the appropriate level for each section of data. Experimental at present...

Data format 70/0x46 - 16 to 8 bit conversion

Byte number   0
            +--+--  -  --+
Hex values  |46|   data  |
            +--+--  -  --+

This method assumes that the input data is a series of big endian 2-byte signed integer values. If the value is in the range of -127 to +127 inclusive then it is written as a single signed byte in the output stream, otherwise we write out -128 followed by the 2-byte value (in big endian format). This method works well following one of the delta techniques as most of the 16-bit values are typically then small enough to fit in one byte.

Example input data:

	0 10 0 5 -1 -5 0 200 -4 -32 (bytes)
	(As 16-bit big-endian values: 10 5 -5 200 -800)

Output data:

       70			(16-to-8 format)
       10 5 -5 -128 0 200 -128 -4 -32

Data format 71/0x47 - 32 to 8 bit conversion

Byte number   0
            +--+--  -  --+
Hex values  |47|   data  |
            +--+--  -  --+

This format is similar to format 70, but we are reducing 32-bit numbers (big endian) to 8-bit numbers.

Data format 72/0x48 - "follow" predictor

Byte number   0  1     FF 100  101   N
            +--+--  -  -  - --+-- - --+
Hex values  |48| follow bytes |  data |
            +--+--  -  -  - --+-- - --+

For each symbol we compute the most frequent symbol following it. This is stored in the "follow bytes" block (256 bytes). The first character in the data block is stored as-is. Then for each subsequent character we store the difference between the predicted character value (obtained by using follow[previous_character]) and the real value. This is a very crude, but fast, method of removing some residual non-randomness in the input data and so will reduce the data entropy. It is best to use this prior to entropy encoding (such as huffman encoding).

Data format 73/0x49 - floating point 16-bit chebyshev polynomial predictor

Version 1.1 only. Replaced by format 74 in Version 1.2.

WARNING: This method was experimental and has been replaced with an integer equivalent. The floating point method may give system specific results.

Byte number   0  1  2      N
            +--+--+--  -  --+
Hex values  |49| 0|   data  |
            +--+--+--  -  --+

This method takes big-endian 16-bit data and attempts to curve-fit it using chebyshev polynomials. The exact method employed uses the 4 preceeding values to calculate chebyshev polynomials with 5 coefficents. Of these 5 coefficients only 4 are used to predict the next value. Then we store the difference between the predicted value and the real value. This procedure is repeated throughout each 16-bit value in the data. The first four 16-bit values are stored with a simple 1-level 16-bit delta function. Reversing the predictor follows the same procedure, except now adding the differences between stored value and predicted value to get the real value.

Data format 74/0x4A - integer based 16-bit chebyshev polynomial predictor

Version 1.2 onwards This replaces the floating point code in ZTR v1.1.

Byte number   0  1  2      N
            +--+--+--  -  --+
Hex values  |4A| 0|   data  |
            +--+--+--  -  --+

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/formats_unix_14.html