This document contains the proposed SCF version 3.10 specification. Differences between this version and version 3.00, which is already in use, are displayed in bold. (No titles have been modified.)
Please submit all comments on this specification to jkb@mrc-lmb.cam.ac.uk by the end of July.
-- start of spec --
SCF format files are used to store data from DNA sequencing instruments. Each file contains the data for a single reading and includes: its trace sample points, its called sequence, the positions of the bases relative to the trace sample points, and numerical estimates of the accuracy of each base. Comments and "private data" can also be stored. The format is machine independent and the first version was described in Dear, S and Staden, R. "A standard file format for data from DNA sequencing instruments", DNA Sequence 3, 107-110, (1992).
Since then it has undergone several important changes. The first allowed for different sample point resolutions. The second, in response to the need to reduce file sizes for large projects, involved a major reorganisation of the ordering of the data items in the file and of the way they are represented. Note that despite these changes we have retained the original data structures into which the data is read. This reorganisation has not in itself made the files smaller, but it has produced files that are compressed more effectively by standard programs such as gzip. The io library included in the package contains routines that can read and write all the different versions of the format (including reading of compressed files). The header record was not affected by this change. This documentation covers both the format of SCF files and the data structures that are used by the io library. Prior to version 3.00 these two things corresponded much more closely.
In order for programs to label themselves as supporting SCF files they need to adhere to one of the SCF versions. If they do not support the latest version then the version of SCF supported should be clearly labelled. Note that although SCF 3.00 and SCF 3.10 are binary compatible, a key difference is that 3.10 does not allow programs to fail due to the existence or non-existence of comment types.
The file begins with a 128 byte header record that describes the location and size of the chromatogram data in the file. Nothing is implied about the order in which the components (samples, sequence and comments) appear. The version field is a 4 byte character array representing the version and revision of the SCF format. The current value of this field is "3.10".
/*
 * Basic type definitions
 */
typedef unsigned int   uint_4;
typedef signed   int   int_4;
typedef unsigned short uint_2;
typedef signed   short int_2;
typedef unsigned char  uint_1;
typedef signed   char  int_1;

/*
 * Type definition for the Header structure
 */
#define SCF_MAGIC (((((uint_4)'.'<<8)+(uint_4)'s'<<8) \
                     +(uint_4)'c'<<8)+(uint_4)'f')

typedef struct {
    uint_4 magic_number;
    uint_4 samples;          /* Number of elements in Samples matrix */
    uint_4 samples_offset;   /* Byte offset from start of file */
    uint_4 bases;            /* Number of bases in Bases matrix */
    uint_4 bases_left_clip;  /* OBSOLETE: No. bases in left clip (vector) */
    uint_4 bases_right_clip; /* OBSOLETE: No. bases in right clip (qual) */
    uint_4 bases_offset;     /* Byte offset from start of file */
    uint_4 comments_size;    /* Number of bytes in Comment section */
    uint_4 comments_offset;  /* Byte offset from start of file */
    char   version[4];       /* "version.revision", eg '3' '.' '0' '0' */
    uint_4 sample_size;      /* Size of samples in bytes 1=8bits, 2=16bits*/
    uint_4 code_set;         /* code set used (may be ignored) */
    uint_4 private_size;     /* No. of bytes of Private data, 0 if none */
    uint_4 private_offset;   /* Byte offset from start of file */
    uint_4 spare[18];        /* Unused */
} Header;
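As an illustration of how the header record might be checked, the sketch below decodes a few fields from a 128 byte header held in memory and verifies the magic number. The names HeaderSketch, scf_be32 and scf_parse_header_sketch are hypothetical and are not part of the io library. The field offsets are derived from the Header structure above on the assumption that every field is stored as 4 bytes with no padding, using the byte ordering described later in this document.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of a few header fields
 * (not the io library's Header type). */
typedef struct {
    uint32_t samples;
    uint32_t samples_offset;
    uint32_t bases;
    uint32_t bases_offset;
    char     version[5];      /* 4 characters plus a terminating NUL */
    uint32_t sample_size;
} HeaderSketch;

/* Decode a 4 byte unsigned integer stored most significant byte first. */
uint32_t scf_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer does not start with ".scf". */
int scf_parse_header_sketch(const unsigned char buf[128], HeaderSketch *h)
{
    if (memcmp(buf, ".scf", 4) != 0)
        return -1;
    h->samples        = scf_be32(buf + 4);
    h->samples_offset = scf_be32(buf + 8);
    h->bases          = scf_be32(buf + 12);
    h->bases_offset   = scf_be32(buf + 24);
    memcpy(h->version, buf + 36, 4);   /* version is not NUL terminated */
    h->version[4] = '\0';
    h->sample_size    = scf_be32(buf + 40);
    return 0;
}
```

Comparing the first four bytes with ".scf" is equivalent to comparing the decoded magic_number with SCF_MAGIC, because the value is stored most significant byte first.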
For SCF files of version 2.0 or greater (Header.version is `greater than' "2.00"), the version number, the precision of the data (sample_size) and the uncertainty code set are specified in the header. Otherwise, the precision is assumed to be 1 byte, and the code set to be the default code set. The following uncertainty code sets are recognised (but are generally ignored by programs reading this file). If in doubt, set code_set to 0 or 2.
0       {A,C,G,T,-} (default)
1       Staden
2       IUPAC (NC-IUB)
3       Pharmacia A.L.F. (NC-IUB)
4       {A,C,G,T,N} (ABI 373A)
5       IBI/Pustell
6       DNA*
7       DNASIS
8       IG/PC-Gene
9       MicroGenie
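The version test described above could be sketched as follows. This assumes that Header.version always has the form of one digit, a full stop and two digits (as in "3.10"), so that a plain byte comparison orders versions correctly. The function names are hypothetical.

```c
#include <string.h>

/* Returns non-zero if the 4 character version field is at least `min`,
 * where `min` is a string such as "2.00". Assumes the "d.dd" form. */
int scf_version_at_least(const char version[4], const char *min)
{
    return memcmp(version, min, 4) >= 0;
}

/* Sample precision to use: the header value for version 2.00 or later,
 * otherwise the assumed default of 1 byte. */
unsigned scf_effective_sample_size(const char version[4],
                                   unsigned sample_size)
{
    return scf_version_at_least(version, "2.00") ? sample_size : 1;
}
```

A header whose version field was never filled in (all zero bytes) compares as older than "2.00" under this scheme and so falls back to the 1 byte default.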
The sample data is the four chromatogram channels. If none exists the Header.samples value should be zero.
The trace information is stored at byte offset Header.samples_offset from the start of the file. For each sample point there are values for each of the four bases. Header.sample_size holds the precision of the sample values. The precision must be either 1 (unsigned byte) or 2 (unsigned short). The sample points need not be normalised to any particular value, though it is assumed that they represent positive values; that is, they are of unsigned type.
With the introduction of scf version 3.00, in an attempt to produce efficiently compressed files, the sample points are stored in A,C,G,T order; i.e. all the values for base A, followed by all those for C, etc. In addition they are stored, not as their original magnitudes, but in terms of the differences between successive values. The C language code used to transform the values for precision 2 samples is shown below.
void delta_samples2 ( uint_2 samples[], int num_samples, int job) {

    /* If job == DELTA_IT:
     *  change a series of sample points to a series of delta delta values:
     *  ie change them in two steps:
     *  first: delta = current_value - previous_value
     *  then: delta_delta = delta - previous_delta
     * else
     *  do the reverse
     */

    int i;
    uint_2 p_delta, p_sample;

    if ( DELTA_IT == job ) {
        p_delta = 0;
        for (i=0;i<num_samples;i++) {
            p_sample = samples[i];
            samples[i] = samples[i] - p_delta;
            p_delta = p_sample;
        }
        p_delta = 0;
        for (i=0;i<num_samples;i++) {
            p_sample = samples[i];
            samples[i] = samples[i] - p_delta;
            p_delta = p_sample;
        }
    }
    else {
        p_sample = 0;
        for (i=0;i<num_samples;i++) {
            samples[i] = samples[i] + p_sample;
            p_sample = samples[i];
        }
        p_sample = 0;
        for (i=0;i<num_samples;i++) {
            samples[i] = samples[i] + p_sample;
            p_sample = samples[i];
        }
    }
}
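Only the precision 2 routine is given above. By analogy, a precision 1 version might look like the sketch below; it is an assumption made for illustration rather than the io library's own routine, and the name delta_samples1_sketch and the two job codes are hypothetical. All arithmetic wraps modulo 256, so the transformation is exactly reversible.

```c
#include <stddef.h>

#define SKETCH_DELTA_IT   1   /* hypothetical job code: encode */
#define SKETCH_UNDELTA_IT 0   /* hypothetical job code: decode */

void delta_samples1_sketch(unsigned char samples[], size_t num_samples,
                           int job)
{
    size_t i;
    int pass;
    unsigned char prev, cur;

    /* Two passes give "delta delta" values (or undo them). */
    for (pass = 0; pass < 2; pass++) {
        prev = 0;
        for (i = 0; i < num_samples; i++) {
            if (job == SKETCH_DELTA_IT) {
                /* difference from the previous original value */
                cur = samples[i];
                samples[i] = (unsigned char)(cur - prev);
                prev = cur;
            } else {
                /* running sum restores the previous level */
                samples[i] = (unsigned char)(samples[i] + prev);
                prev = samples[i];
            }
        }
    }
}
```

For example, the samples 0, 10, 30, 60 become 0, 10, 20, 30 after the first pass and 0, 10, 10, 10 after the second, which is the kind of repetitive data that general purpose compressors handle well.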
The io library data structure is as follows:
/*
 * Type definition for the Sample data
 */
typedef struct {
    uint_1 sample_A;    /* Sample for A trace */
    uint_1 sample_C;    /* Sample for C trace */
    uint_1 sample_G;    /* Sample for G trace */
    uint_1 sample_T;    /* Sample for T trace */
} Samples1;

typedef struct {
    uint_2 sample_A;    /* Sample for A trace */
    uint_2 sample_C;    /* Sample for C trace */
    uint_2 sample_G;    /* Sample for G trace */
    uint_2 sample_T;    /* Sample for T trace */
} Samples2;
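To connect the version 3.00 file layout with the structure above, the following sketch unpacks a block of precision 2 samples, stored as four consecutive channels of delta delta values, into an array of per-point structures. It is an illustration only: Samples2Sketch and unpack_samples2_v3_sketch are hypothetical names, and the two reverse passes of the delta code are folded into a single loop that keeps a running delta and a running value.

```c
#include <stddef.h>

/* Hypothetical stand-in for the Samples2 structure. */
typedef struct {
    unsigned short sample_A, sample_C, sample_G, sample_T;
} Samples2Sketch;

/*
 * buf holds 4 * n values of 2 bytes each, most significant byte first:
 * all A values, then all C, then all G, then all T.
 */
void unpack_samples2_v3_sketch(const unsigned char *buf, size_t n,
                               Samples2Sketch *out)
{
    size_t ch, i;

    for (ch = 0; ch < 4; ch++) {
        const unsigned char *src = buf + ch * n * 2;
        unsigned delta = 0, value = 0;

        for (i = 0; i < n; i++) {
            unsigned dd = ((unsigned)src[2 * i] << 8) | src[2 * i + 1];
            delta = (delta + dd) & 0xFFFFu;     /* undo second delta */
            value = (value + delta) & 0xFFFFu;  /* undo first delta  */
            switch (ch) {
            case 0:  out[i].sample_A = (unsigned short)value; break;
            case 1:  out[i].sample_C = (unsigned short)value; break;
            case 2:  out[i].sample_G = (unsigned short)value; break;
            default: out[i].sample_T = (unsigned short)value; break;
            }
        }
    }
}
```

Masking with 0xFFFF reproduces the 16 bit wrap-around that the uint_2 arithmetic in delta_samples2 relies on.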
If no sequence exists then Header.bases should be zero. Otherwise this holds the number of called bases.
Information relating to the base interpretation of the trace is stored at byte offset Header.bases_offset from the start of the file. Stored for each base are: its character representation and a number (an index into the Samples data structure) indicating its position within the trace. The relative probabilities of each of the 4 bases occurring at the point where the base is called can be stored in prob_A, prob_C, prob_G and prob_T, and substitution, insertion and deletion probabilities may also be stored. Note that although the base calls are in sequential order it should not be assumed that the positions will therefore be numerically sorted (due to the possibility of compressions in the data).
From version 3.00 these items are stored in the following order: all the "peak indexes", i.e. the positions in the sample points to which the bases correspond; all the accuracy estimates for base type A, then all for C, G and T; the called bases; and finally three sets of 1-byte data items (unused in version 3.00, now the substitution, insertion and deletion scores; see below). These values are read into the Base data structure shown below by the routines in the io library.
Before version 3.10 the format of the prob_A, prob_C, prob_G and prob_T values was not specified, apart from each being a 1-byte integral value. From version 3.10 we specify that all probability values will be stored as -10 * log10(P(error)), where P(error) is the probability of an error. This is the same scale as used by phred and, more recently, by other similar tools. If no probabilities are available, all four values should be set to zero. When only one value is available (for the called base), the relevant "prob_" field should be set accordingly and the other three should be left as zero. For uncalled bases, or bases that are not A, C, G or T, all four probability values should be specified. Specifically for "N" or "-", all four probability values should be set to the same value (which is typically very low; note that on the log scale above a probability of correctness of 0.25 equates to a prob_* value of 1.25).
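The scale can be illustrated with the sketch below, which converts between an error probability and a 1-byte prob_* value. The function names are hypothetical. Rounding to the nearest integer and capping at 255 are assumptions, since the text above does not say how fractional values are to be stored. Repeated multiplication by precomputed constants is used in place of log10() and pow() purely so that the example needs no maths library.

```c
/* 10^(-0.1) and 10^(-0.05), precomputed. */
#define SCF_STEP_SKETCH      0.79432823472428150
#define SCF_HALF_STEP_SKETCH 0.89125093813374556

/* prob_* value -> P(error), i.e. 10^(-q/10). */
double scf_prob_to_perror_sketch(unsigned char q)
{
    double p = 1.0;
    unsigned i;

    for (i = 0; i < q; i++)
        p *= SCF_STEP_SKETCH;
    return p;
}

/* P(error) -> -10 * log10(P(error)), rounded to nearest, capped at 255. */
unsigned char scf_perror_to_prob_sketch(double p_error)
{
    unsigned q = 0;
    double threshold = SCF_HALF_STEP_SKETCH;  /* boundary between q, q+1 */

    while (q < 255 && p_error <= threshold) {
        q++;
        threshold *= SCF_STEP_SKETCH;
    }
    return (unsigned char)q;
}
```

Under these assumptions an error probability of 0.01 is stored as 20, and the "N" case mentioned above (P(error) = 0.75, exact value about 1.25) is stored as 1.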
From version 3.10 onwards we may also store the substitution, insertion and deletion probability values. These are stored using the same scale as the prob_A, prob_C, prob_G and prob_T values. It is expected that the four prob_A, prob_C, prob_G and prob_T values will encode the absolute probability of that base call being correct, taking into account the chance of it being an overcalled base. For alignment algorithms it may be useful to obtain individual confidence values for the chance of insertion, deletion and substitution. These are stored in prob_ins, prob_del and prob_sub. In version 3.00 these fields existed in the SCF files, but were labelled as "uint_1 spare[3]".
/*
 * Type definition for the sequence data
 */
typedef struct {
    uint_4 peak_index;    /* Index into Samples matrix for base posn */
    uint_1 prob_A;        /* Probability of it being an A */
    uint_1 prob_C;        /* Probability of it being a C */
    uint_1 prob_G;        /* Probability of it being a G */
    uint_1 prob_T;        /* Probability of it being a T */
    char   base;          /* Called base character */
    uint_1 prob_sub;      /* Probability of this base call being a */
                          /* substitution for another base */
    uint_1 prob_ins;      /* Probability of it being an overcall */
    uint_1 prob_del;      /* Probability of an undercall at this point */
                          /* (extra base between this base and the */
                          /* previous base) */
} Base;
Comments are stored at offset Header.comments_offset from the start of the file. Lines in this section are of the format:
<Field-ID>=<Value>
<Field-ID> can be any single-line string, not including spaces or equals sign. The <Value> may be any single-line string. If you need to include newline characters in <Value> it is recommended that you escape them by using "\n".
No program should fail because a particular <Field-ID> is missing (all should be considered optional); however, certain <Field-ID>s have historically become commonplace. Hence if any of the following <Field-ID>s are present they must be in the format specified below.
/*
 * Type definition for the comments
 */
typedef char Comments[];    /* Zero terminated list of \n separated entries */
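A lookup over the comment section might be sketched as follows. The function name scf_find_comment_sketch is hypothetical. It treats the text up to the first "=" on each line as the <Field-ID> and reports a missing field by returning NULL, so that callers can treat every field as optional.

```c
#include <stddef.h>
#include <string.h>

/*
 * Search a zero terminated, newline separated comment block for
 * field_id. On success returns a pointer to the first character of the
 * value and stores the value length in *value_len. The value is not
 * NUL terminated. Returns NULL if the field is absent.
 */
const char *scf_find_comment_sketch(const char *comments,
                                    const char *field_id,
                                    size_t *value_len)
{
    size_t id_len = strlen(field_id);
    const char *line = comments;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t line_len = end ? (size_t)(end - line) : strlen(line);

        if (line_len > id_len &&
            strncmp(line, field_id, id_len) == 0 &&
            line[id_len] == '=') {
            *value_len = line_len - id_len - 1;
            return line + id_len + 1;
        }
        line += line_len;
        if (*line == '\n')
            line++;             /* step over the separator */
    }
    return NULL;
}
```

Because only the first "=" ends the <Field-ID>, a value may itself contain further equals signs.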
The private data section is provided to store any information required that is not supported by the SCF standard. If the field in the header is 0 then there is no private data section. We impose no restrictions upon the format of this section. However, we feel it may be a good idea to use the first four bytes as a magic number identifying the format used for the private data.
(Note Samples1 can be replaced by Samples2 as appropriate.)
Length in bytes                        Data
---------------------------------------------------------------------
128                                    header
Number of samples * 4 * sample size    Samples1 or Samples2 structure
Number of bases * 12                   Base structure
Comments size                          Comments
Private data size                      private data
Length in bytes                        Data
---------------------------------------------------------------------------
128                                    header
Number of samples * sample size        Samples for A trace
Number of samples * sample size        Samples for C trace
Number of samples * sample size        Samples for G trace
Number of samples * sample size        Samples for T trace
Number of bases * 4                    Offset into peak index for each base
Number of bases                        Accuracy estimate bases being 'A'
Number of bases                        Accuracy estimate bases being 'C'
Number of bases                        Accuracy estimate bases being 'G'
Number of bases                        Accuracy estimate bases being 'T'
Number of bases                        The called bases
Number of bases * 3                    Reserved for future use
Comments size                          Comments
Private data size                      Private data
---------------------------------------------------------------------------
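The second table implies fixed positions for each array within the bases section, which the sketch below computes from Header.bases_offset and Header.bases. The structure and function names are hypothetical. The final 3 bytes per base are treated as one block, which in version 3.10 holds the substitution, insertion and deletion values described earlier.

```c
#include <stddef.h>

/* Hypothetical record of byte offsets (from the start of the file)
 * of each array in the version 3 bases section. */
typedef struct {
    size_t peak_index;  /* 4 bytes per base */
    size_t prob_A;      /* 1 byte per base  */
    size_t prob_C;
    size_t prob_G;
    size_t prob_T;
    size_t base;        /* called bases, 1 byte per base */
    size_t extra;       /* 3 bytes per base */
    size_t end;         /* first byte after the bases section */
} BasesLayoutSketch;

BasesLayoutSketch scf_v3_bases_layout_sketch(size_t bases_offset,
                                             size_t num_bases)
{
    BasesLayoutSketch l;

    l.peak_index = bases_offset;
    l.prob_A     = l.peak_index + 4 * num_bases;
    l.prob_C     = l.prob_A + num_bases;
    l.prob_G     = l.prob_C + num_bases;
    l.prob_T     = l.prob_G + num_bases;
    l.base       = l.prob_T + num_bases;
    l.extra      = l.base + num_bases;
    l.end        = l.extra + 3 * num_bases;
    return l;
}
```

The section occupies 12 bytes per base in total, the same as the Base structure in the first table; only the ordering differs.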
"Forward byte and reverse bit" ordering will be used for all integer values. This is the same as used in the MC680x0 and SPARC processors, but the reverse of the byte ordering used on the Intel 80x86 processors.
          Off+0   Off+1
        +-------+-------+
uint_2  |  MSB  |  LSB  |
        +-------+-------+

          Off+0   Off+1   Off+2   Off+3
        +-------+-------+-------+-------+
uint_4  |  MSB  |  ...  |  ...  |  LSB  |
        +-------+-------+-------+-------+
To read integers on systems with any byte order use something like this:
#include <stdio.h>

/* Note: the return value of fread() is not checked in these examples. */

uint_2 read_uint_2(FILE *fp)
{
    unsigned char buf[sizeof(uint_2)];

    fread(buf, sizeof(buf), 1, fp);
    return (uint_2)
        (((uint_2)buf[1]) +
         ((uint_2)buf[0]<<8));
}

uint_4 read_uint_4(FILE *fp)
{
    unsigned char buf[sizeof(uint_4)];

    fread(buf, sizeof(buf), 1, fp);
    return (uint_4)
        (((uint_4)buf[3]) +
         ((uint_4)buf[2]<<8) +
         ((uint_4)buf[1]<<16) +
         ((uint_4)buf[0]<<24));
}
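The reverse operation, storing integers in the required byte order regardless of the host, might be sketched as below. The function names are hypothetical, and the sketch writes to a memory buffer rather than a FILE so that it stays independent of any particular output routine.

```c
/* Store a 2 byte unsigned value at p, most significant byte first. */
void store_uint_2_sketch(unsigned char *p, unsigned int v)
{
    p[0] = (unsigned char)((v >> 8) & 0xFF);
    p[1] = (unsigned char)(v & 0xFF);
}

/* Store a 4 byte unsigned value at p, most significant byte first. */
void store_uint_4_sketch(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)((v >> 24) & 0xFF);
    p[1] = (unsigned char)((v >> 16) & 0xFF);
    p[2] = (unsigned char)((v >> 8) & 0xFF);
    p[3] = (unsigned char)(v & 0xFF);
}
```

Storing the magic number this way yields the four characters ".scf" at the start of the file on any host.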