SCF 3.10 Request For Comments

This document contains the proposed SCF version 3.10 specification. Differences between this version and version 3.00, which is already in use, are displayed in bold. (No titles have been modified.)

Please submit all comments on this specification to jkb@mrc-lmb.cam.ac.uk by the end of July.

-- start of spec --

SCF File Format version 3.10

Intro

SCF format files are used to store data from DNA sequencing instruments. Each file contains the data for a single reading and includes: its trace sample points, its called sequence, the positions of the bases relative to the trace sample points, and numerical estimates of the accuracy of each base. Comments and "private data" can also be stored. The format is machine independent and the first version was described in Dear, S and Staden, R. "A standard file format for data from DNA sequencing instruments", DNA Sequence 3, 107-110, (1992).

Since then it has undergone several important changes. The first allowed for different sample point resolutions. The second, in response to the need to reduce file sizes for large projects, involved a major reorganisation of the ordering of the data items in the file and also in the way they are represented. Note that despite these changes we have retained the original data structures into which the data is read. Also this reorganisation in itself has not made the files smaller but it has produced files that are more effectively compressed using standard programs such as gzip. The io library included in the package contains routines that can read and write all the different versions of the format (including reading of compressed files). The header record was not affected by this change. This documentation covers both the format of scf files and the data structures that are used by the io library. Prior to version 3.00 these two things corresponded much more closely.

In order for programs to label themselves as supporting SCF files they need to adhere to one of the SCF versions. If they do not support the latest version then the version of SCF supported should be clearly labelled. Note that although SCF 3.00 and SCF 3.10 are binary compatible, a key difference is that 3.10 does not allow programs to fail due to the existence or non-existence of comment types.

Header Record

The file begins with a 128 byte header record that describes the location and size of the chromatogram data in the file. Nothing is implied about the order in which the components (samples, sequence and comments) appear. The version field is a 4 byte character array representing the version and revision of the SCF format. The current value of this field is "3.10".

  /*
   * Basic type definitions
   */
  typedef unsigned int   uint_4;
  typedef signed   int    int_4;
  typedef unsigned short uint_2;
  typedef signed   short  int_2;
  typedef unsigned char  uint_1;
  typedef signed   char   int_1;
  
  /*
   * Type definition for the Header structure
   */
  #define SCF_MAGIC (((((uint_4)'.'<<8)+(uint_4)'s'<<8) \
                       +(uint_4)'c'<<8)+(uint_4)'f')
  
  typedef struct {
      uint_4 magic_number;
      uint_4 samples;          /* Number of elements in Samples matrix */
      uint_4 samples_offset;   /* Byte offset from start of file */
      uint_4 bases;            /* Number of bases in Bases matrix */
      uint_4 bases_left_clip;  /* OBSOLETE: No. bases in left clip (vector) */
      uint_4 bases_right_clip; /* OBSOLETE: No. bases in right clip (qual) */
      uint_4 bases_offset;     /* Byte offset from start of file */
      uint_4 comments_size;    /* Number of bytes in Comment section */
      uint_4 comments_offset;  /* Byte offset from start of file */
      char version[4];         /* "version.revision", eg '3' '.' '0' '0' */
      uint_4 sample_size;      /* Size of samples in bytes 1=8bits, 2=16bits*/
      uint_4 code_set;         /* code set used (may be ignored) */
      uint_4 private_size;     /* No. of bytes of Private data, 0 if none */
      uint_4 private_offset;   /* Byte offset from start of file */
      uint_4 spare[18];        /* Unused */
  } Header;
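
As an illustration of the header layout, the following minimal sketch decodes the 128 byte header from a memory buffer without relying on the byte order of the host. The names be_uint_4, decode_header and SCF_HEADER_SIZE are hypothetical and not part of the io library; the sketch also assumes that unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int uint_4;    /* assumed to be 32 bits wide */

#define SCF_HEADER_SIZE 128     /* hypothetical name for the header length */
#define SCF_MAGIC (((((uint_4)'.'<<8)+(uint_4)'s'<<8) \
                     +(uint_4)'c'<<8)+(uint_4)'f')

typedef struct {
    uint_4 magic_number;
    uint_4 samples;
    uint_4 samples_offset;
    uint_4 bases;
    uint_4 bases_left_clip;
    uint_4 bases_right_clip;
    uint_4 bases_offset;
    uint_4 comments_size;
    uint_4 comments_offset;
    char   version[4];
    uint_4 sample_size;
    uint_4 code_set;
    uint_4 private_size;
    uint_4 private_offset;
    uint_4 spare[18];
} Header;

/* Hypothetical helper: assemble a big-endian 32-bit value from 4 bytes. */
static uint_4 be_uint_4(const unsigned char *p)
{
    return ((uint_4)p[0] << 24) | ((uint_4)p[1] << 16) |
           ((uint_4)p[2] << 8)  |  (uint_4)p[3];
}

/* Hypothetical helper: fill *h from the first 128 bytes of a file.
 * Returns 0 on success, -1 if the magic number is not ".scf". */
int decode_header(const unsigned char *buf, Header *h)
{
    size_t i;

    assert(buf != NULL && h != NULL);
    h->magic_number     = be_uint_4(buf +  0);
    h->samples          = be_uint_4(buf +  4);
    h->samples_offset   = be_uint_4(buf +  8);
    h->bases            = be_uint_4(buf + 12);
    h->bases_left_clip  = be_uint_4(buf + 16);
    h->bases_right_clip = be_uint_4(buf + 20);
    h->bases_offset     = be_uint_4(buf + 24);
    h->comments_size    = be_uint_4(buf + 28);
    h->comments_offset  = be_uint_4(buf + 32);
    memcpy(h->version, buf + 36, 4);   /* not NUL terminated */
    h->sample_size      = be_uint_4(buf + 40);
    h->code_set         = be_uint_4(buf + 44);
    h->private_size     = be_uint_4(buf + 48);
    h->private_offset   = be_uint_4(buf + 52);
    for (i = 0; i < 18; i++)
        h->spare[i] = be_uint_4(buf + 56 + 4 * i);
    return h->magic_number == SCF_MAGIC ? 0 : -1;
}
```

A caller would typically read the first 128 bytes of the file into a buffer, pass it to decode_header, and then seek to the offsets reported in the structure.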

For SCF files of version 2.0 or greater (Header.version is `greater than' "2.00"), the version number, the precision of the data (sample_size) and the uncertainty code set are specified in the header. Otherwise, the precision is assumed to be 1 byte, and the code set to be the default code set. The following uncertainty code sets are recognised (but are generally ignored by programs reading this file). If in doubt, set code_set to 0 or 2.

0       {A,C,G,T,-}   (default)
1       Staden
2       IUPAC (NC-IUB)
3       Pharmacia A.L.F. (NC-IUB)
4       {A,C,G,T,N}   (ABI 373A)
5       IBI/Pustell
6       DNA*
7       DNASIS
8       IG/PC-Gene
9       MicroGenie

Sample Points

The sample data is the four chromatogram channels. If none exists the Header.samples value should be zero.

The trace information is stored at byte offset Header.samples_offset from the start of the file. For each sample point there are values for each of the four bases. Header.sample_size holds the precision of the sample values. The precision must be either "1" (unsigned byte) or "2" (unsigned short). The sample points need not be normalised to any particular value, though it is assumed that they represent positive values. That is, they are of unsigned type.

With the introduction of scf version 3.00, in an attempt to produce efficiently compressed files, the sample points are stored in A,C,G,T order; i.e. all the values for base A, followed by all those for C, etc. In addition they are stored, not as their original magnitudes, but in terms of the differences between successive values. The C language code used to transform the values for precision 2 samples is shown below.

  void delta_samples2 ( uint_2 samples[], int num_samples, int job) {
   
      /* If job == DELTA_IT:
       *  change a series of sample points to a series of delta delta values:
       *  ie change them in two steps:
       *  first: delta = current_value - previous_value
       *  then: delta_delta = delta - previous_delta
       * else
       *  do the reverse
       */
   
      int i;
      uint_2 p_delta, p_sample;
   
      if ( DELTA_IT == job ) {
          p_delta  = 0;
          for (i=0;i<num_samples;i++) {
              p_sample = samples[i];
              samples[i] = samples[i] - p_delta;
              p_delta  = p_sample;
          }
          p_delta  = 0;
          for (i=0;i<num_samples;i++) {
              p_sample = samples[i];
              samples[i] = samples[i] - p_delta;
              p_delta  = p_sample;
          }
      }
      else {
          p_sample = 0;
          for (i=0;i<num_samples;i++) {
              samples[i] = samples[i] + p_sample;
              p_sample = samples[i];
          }
          p_sample = 0;
          for (i=0;i<num_samples;i++) {
              samples[i] = samples[i] + p_sample;
              p_sample = samples[i];
          }
      }
  }
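
For precision 1 samples the same transformation can be applied to bytes. The following is a sketch rather than the io library routine: the name delta_samples1 and the value chosen for DELTA_IT are assumptions, and the arithmetic is assumed to wrap modulo 256 in the same way that the uint_2 version wraps modulo 65536.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uint_1;

#define DELTA_IT 1      /* assumed value; not defined in the text above */

/* Two passes of differencing (job == DELTA_IT) or two passes of
 * running sums (any other job), both modulo 256. */
void delta_samples1(uint_1 samples[], int num_samples, int job)
{
    int i, pass;
    uint_1 prev, cur;

    assert(samples != NULL || num_samples == 0);
    for (pass = 0; pass < 2; pass++) {
        prev = 0;
        for (i = 0; i < num_samples; i++) {
            if (job == DELTA_IT) {
                cur = samples[i];
                samples[i] = (uint_1)(cur - prev);
                prev = cur;
            } else {
                samples[i] = (uint_1)(samples[i] + prev);
                prev = samples[i];
            }
        }
    }
}
```

Applying the function with DELTA_IT and then with any other job value restores the original samples, because the modulo 256 wrap in the differences is undone by the same wrap in the sums.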

The io library data structure is as follows:

  /*
   * Type definition for the Sample data
   */
  typedef struct {
          uint_1 sample_A;           /* Sample for A trace */
          uint_1 sample_C;           /* Sample for C trace */
          uint_1 sample_G;           /* Sample for G trace */
          uint_1 sample_T;           /* Sample for T trace */
  } Samples1;
  
  typedef struct {
          uint_2 sample_A;           /* Sample for A trace */
          uint_2 sample_C;           /* Sample for C trace */
          uint_2 sample_G;           /* Sample for G trace */
          uint_2 sample_T;           /* Sample for T trace */
  } Samples2;
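
Because version 3 files hold the four traces one after another, a reader has to regroup them into the per-point structure. A minimal sketch, assuming the four traces have already been byte swapped and converted back from delta delta values into one array of 4 * n values in A, C, G, T order (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short uint_2;

typedef struct {
    uint_2 sample_A;
    uint_2 sample_C;
    uint_2 sample_G;
    uint_2 sample_T;
} Samples2;

/* Regroup n values per trace (all A, then all C, all G, all T) into an
 * array of n Samples2 records. */
void planes_to_samples2(const uint_2 *planes, size_t n, Samples2 *out)
{
    size_t i;

    assert(n == 0 || (planes != NULL && out != NULL));
    for (i = 0; i < n; i++) {
        out[i].sample_A = planes[i];
        out[i].sample_C = planes[n + i];
        out[i].sample_G = planes[2 * n + i];
        out[i].sample_T = planes[3 * n + i];
    }
}
```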

Sequence Information

If no sequence exists then Header.bases should be zero. Otherwise this holds the number of called bases.

Information relating to the base interpretation of the trace is stored at byte offset Header.bases_offset from the start of the file. Stored for each base are: its character representation and a number (an index into the Samples data structure) indicating its position within the trace. The relative probabilities of each of the 4 bases occurring at the point where the base is called can be stored in prob_A, prob_C, prob_G and prob_T, and there may also be substitution, insertion and deletion probabilities. Note that although the base calls are in sequential order it should not be assumed that the positions will therefore be numerically sorted (due to the possibility of compressions in the data).

From version 3.00 these items are stored in the following order: all "peak indexes", i.e. the positions in the sample points to which the bases correspond; all the accuracy estimates for base type A, then all for C, G and T; the called bases; and finally 3 sets of 1-byte data items that were empty in version 3.00 (now the substitution, insertion and deletion scores; see below). These values are read into the Base data structure, shown below, by the routines in the io library.

The format of the prob_A, prob_C, prob_G and prob_T values was previously not specified, apart from being a 1-byte integral value. From version 3.10 we specify that all probability values will be stored as -10 * log10(P(error)), where P(error) is the probability of an error. This is the same format as phred and, more recently, other similar tools. If no probabilities are available, all four values should be set to zero. When only one value is available (for the called base), the relevant "prob_" field should be set accordingly and the other three should be left as zero. For uncalled bases, or bases that are not A, C, G or T, all four probability values should be specified. Specifically for "N" or "-", all four probability values should be set to the same value (which is typically very low; note that, using the log scale above, a probability of correctness of 0.25 equates to a prob_* value of 1.25).
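
The scale can be illustrated with a pair of conversion routines. This is a sketch only: the function names are hypothetical, rounding to the nearest integer is an assumption (the text does not say how values should be rounded), and repeated multiplication is used instead of log10() and pow() so that the example does not depend on the maths library.

```c
#include <assert.h>

typedef unsigned char uint_1;

#define ONE_STEP  0.7943282347242815   /* 10^(-1/10): one unit on the scale */
#define HALF_STEP 0.8912509381337456   /* 10^(-1/20): half a unit           */

/* P(error) represented by a stored value: 10^(-prob/10). */
double prob_to_perror(uint_1 prob)
{
    double p = 1.0;
    unsigned int i;

    for (i = 0; i < prob; i++)
        p *= ONE_STEP;
    return p;
}

/* Stored value for a given P(error): -10 * log10(p_error), rounded to
 * the nearest integer and capped at 255. */
uint_1 perror_to_prob(double p_error)
{
    double threshold = HALF_STEP;      /* boundary between q and q + 1 */
    unsigned int q = 0;

    assert(p_error > 0.0 && p_error <= 1.0);
    while (p_error < threshold && q < 255) {
        q++;
        threshold *= ONE_STEP;
    }
    return (uint_1)q;
}
```

For instance, a P(error) of 0.75 (the "N" case mentioned above) gives -10 * log10(0.75) = 1.25, which this sketch stores as 1, and a P(error) of 0.001 is stored as 30.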

From version 3.10 onwards we may also store the substitution, insertion and deletion probability values. These are stored using the same scale as the prob_A, prob_C, prob_G and prob_T values. It is expected that the four prob_A, prob_C, prob_G and prob_T values will encode the absolute probability of that base call being correct, taking into account the chance of it being an overcalled base. For alignment algorithms it may be useful to obtain individual confidence values for the chance of insertion, deletion and substitution. These are stored in prob_ins, prob_del and prob_sub. In version 3.00 these fields existed in the SCF files, but were labelled as "uint_1 spare[3]".

  /*
   * Type definition for the sequence data
   */
  typedef struct {
      uint_4 peak_index;        /* Index into Samples matrix for base posn */
      uint_1 prob_A;            /* Probability of it being an A */
      uint_1 prob_C;            /* Probability of it being a C  */
      uint_1 prob_G;            /* Probability of it being a G  */
      uint_1 prob_T;            /* Probability of it being a T  */
      char   base;              /* Called base character        */

      uint_1 prob_sub;          /* Probability of this base call being a */
                                /*     substitution for another base */
      uint_1 prob_ins;          /* Probability of it being an overcall */
      uint_1 prob_del;          /* Probability of an undercall at this point */
                                /*     (extra base between this base and the */
                                /*      previous base) */

  } Base;
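
To show how the version 3 ordering maps onto this structure, here is a minimal sketch that decodes the raw bytes of the sequence section (12 bytes per base). The function name is hypothetical, and the order of the last three sets (substitution, then insertion, then deletion) is an assumption based on the order in which they are listed above.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int  uint_4;   /* assumed to be 32 bits wide */
typedef unsigned char uint_1;

typedef struct {
    uint_4 peak_index;
    uint_1 prob_A;
    uint_1 prob_C;
    uint_1 prob_G;
    uint_1 prob_T;
    char   base;
    uint_1 prob_sub;
    uint_1 prob_ins;
    uint_1 prob_del;
} Base;

/* raw points at the n * 12 bytes found at Header.bases_offset. */
void decode_bases_v3(const unsigned char *raw, size_t n, Base *out)
{
    const unsigned char *planes;    /* the eight 1-byte sets */
    size_t i;

    assert(n == 0 || (raw != NULL && out != NULL));
    planes = raw + 4 * n;           /* skip the peak indexes */
    for (i = 0; i < n; i++) {
        const unsigned char *p = raw + 4 * i;

        /* Peak indexes are big-endian 4 byte integers. */
        out[i].peak_index = ((uint_4)p[0] << 24) | ((uint_4)p[1] << 16) |
                            ((uint_4)p[2] << 8)  |  (uint_4)p[3];
        out[i].prob_A   = planes[0 * n + i];
        out[i].prob_C   = planes[1 * n + i];
        out[i].prob_G   = planes[2 * n + i];
        out[i].prob_T   = planes[3 * n + i];
        out[i].base     = (char)planes[4 * n + i];
        out[i].prob_sub = planes[5 * n + i];   /* assumed order */
        out[i].prob_ins = planes[6 * n + i];
        out[i].prob_del = planes[7 * n + i];
    }
}
```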

Comments

Comments are stored at offset Header.comments_offset from the start of the file. Lines in this section are of the format:

<Field-ID>=<Value>

<Field-ID> can be any single-line string, not including spaces or an equals sign. The <Value> may be any single-line string. If you need to include newline characters in <Value> it is recommended that you escape them by using "\n".
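
A single comment line might be split into its two parts as follows. This is a sketch with a hypothetical function name; it splits at the first equals sign, so a <Value> may itself contain equals signs, and it rejects lines whose <Field-ID> is empty or contains a space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "<Field-ID>=<Value>" in place. On success the '=' is replaced
 * by a NUL, *field_id and *value point into line, and 0 is returned.
 * Returns -1 (leaving line unchanged) if the line is not well formed. */
int split_comment_line(char *line, char **field_id, char **value)
{
    char *eq;

    assert(line != NULL && field_id != NULL && value != NULL);
    eq = strchr(line, '=');
    if (eq == NULL || eq == line)
        return -1;                  /* no '=' or empty Field-ID */
    *eq = '\0';
    if (strchr(line, ' ') != NULL) {
        *eq = '=';                  /* restore the line before failing */
        return -1;
    }
    *field_id = line;
    *value = eq + 1;
    return 0;
}
```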

No program should fail due to particular <Field-ID>s being missing (all should be considered as optional); however, certain <Field-ID>s have historically become commonplace. Hence if any of the following <Field-ID>s are present they must be in the format specified below.

BandSpreadRatio: This indicates the amount of image processing done to create the curves in this file. It is used as reference only. (requested by LI-COR)
BCSW: Base calling software (and optionally version).
CONV: The software used to convert the trace to this file. See DATF and DATN.
DATE: Human readable date for the production of this sequence.
DATF: Format (and optionally version) of the original data that this file was created from (assuming that it was not written natively). See DATN and CONV.
DATN: The filename of the original data file. See DATF and CONV.
DYEP: This indicates the type of chemistry and dye(s) used to create this file. The format is an arbitrary string.
Enhancement ENHANCEMENT: This indicates whether any image processing has been done to create the curves in this file. It is used as reference only. (requested by LI-COR)
IMAGE: If this exists it contains the full drive, path and file name of the image that was used to generate the SCF file. This is used as a reference when locating raw data to reprocess. (requested by LI-COR)
LANE: The lane number of the sequence when loaded onto the gel, counting from the left edge.
MACH: Sequencing machine type and model.
MTXF: Matrix file name (relative or absolute path name) specified using whatever format is suitable for the OS under which this file was created.
NAME: This is the name of the sample. If it doesn't exist, software that needs it should generate it from the filename. If it does exist, it should be the name of the sample ONLY: no drive, path or suffix information should be included, and it must be limited to 31 characters in length. Most software assumes that the name of the file is also the name of the sample, and this is fine; the NAME parameter, however, allows some deviation from that.
OPER: The name of the operator who produced this sequence.
PRIM: The position, in samples, of the first base call in the raw (unprocessed) trace data.
RUND: A machine readable "run date" for the production of this sequence. The format should be "YYYYMMDD.HHMMSS - YYYYMMDD.HHMMSS" where YYYYMMDD encodes 4 digits of year, 2 digits of month number and 2 digits of day number (in the month) and HHMMSS encodes the time (hours, minutes, seconds) using a 24 hour clock.
SampleRemark: This is used only to pass along comment information from one processing step to the next. (requested by LI-COR)
SIGN: Average signal strength specified as "A=x,C=x,G=x,T=x" where 'x' is an integer or floating point number.
SPAC: Average base spacing specified as the number of trace samples per base call - an integer or floating point number.
SRCE: File source - synonym for MACH.
TPSW: Trace processing software (and optionally version).

  /*
   * Type definition for the comments
   */
  typedef char Comments[];                /* Zero terminated list of
                                             \n separated entries */
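
Given such a zero terminated list, the value of one field could be looked up as sketched below. The function name is hypothetical, and the sketch assumes the comment section has been read into memory with a terminating zero byte. A missing field is reported to the caller rather than treated as an error, in keeping with the rule that all fields are optional.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the value of the first entry named field_id into out (truncated
 * to out_size - 1 characters). Returns 0 if found, -1 otherwise. */
int find_comment(const char *comments, const char *field_id,
                 char *out, size_t out_size)
{
    const char *line = comments, *end;
    size_t id_len, line_len, n;

    assert(comments != NULL && field_id != NULL);
    assert(out != NULL && out_size > 0);
    id_len = strlen(field_id);
    while (*line != '\0') {
        end = strchr(line, '\n');
        line_len = end != NULL ? (size_t)(end - line) : strlen(line);
        if (line_len > id_len && line[id_len] == '=' &&
            strncmp(line, field_id, id_len) == 0) {
            n = line_len - id_len - 1;
            if (n >= out_size)
                n = out_size - 1;
            memcpy(out, line + id_len + 1, n);
            out[n] = '\0';
            return 0;
        }
        line += line_len;           /* move to the end of this entry */
        if (*line == '\n')
            line++;                 /* and past its separator */
    }
    return -1;
}
```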

Private data

The private data section is provided to store any information required that is not supported by the SCF standard. If the field in the header is 0 then there is no private data section. We impose no restrictions upon the format of this section. However we feel it may be a good idea to use the first four bytes as a magic number identifying the format used for the private data.

File structure

From SCF version 3.0 onwards the in-memory structures and the data on the disk are not in the same format. The layout of the data on disk for the different versions is summarised below.

Versions 1 and 2:

(Note Samples1 can be replaced by Samples2 as appropriate.)

Length in bytes                        Data
---------------------------------------------------------------------
128                                    header
Number of samples * 4 * sample size    Samples1 or Samples2 structure
Number of bases * 12                   Base structure
Comments size                          Comments
Private data size                      private data

Version 3:

Length in bytes                        Data
---------------------------------------------------------------------------
128                                    header
Number of samples * sample size        Samples for A trace
Number of samples * sample size        Samples for C trace
Number of samples * sample size        Samples for G trace
Number of samples * sample size        Samples for T trace
Number of bases * 4                    Peak index for each base
Number of bases                        Accuracy estimates for bases being 'A'
Number of bases                        Accuracy estimates for bases being 'C'
Number of bases                        Accuracy estimates for bases being 'G'
Number of bases                        Accuracy estimates for bases being 'T'
Number of bases                        The called bases
Number of bases * 3                    Substitution, insertion and deletion probabilities (spare in 3.00)
Comments size                          Comments
Private data size                      Private data
---------------------------------------------------------------------------
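
The version 3 table translates into simple offset arithmetic. The sketch below computes where each item starts; the structure and function names are hypothetical. Because nothing is implied about the order of the components in the file, the sample and sequence items are positioned relative to their own header offsets rather than to one another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of where each version 3 item starts (in bytes
 * from the start of the file). */
typedef struct {
    unsigned long sample_plane[4];  /* traces A, C, G, T             */
    unsigned long peak_index;       /* 4 bytes per base              */
    unsigned long prob_plane[4];    /* accuracy for A, C, G, T       */
    unsigned long called_bases;     /* 1 byte per base               */
    unsigned long last_three;       /* 3 bytes per base              */
    unsigned long bases_end;        /* first byte after the sequence */
} V3Layout;

void v3_layout(unsigned long samples_offset, unsigned long samples,
               unsigned long sample_size, unsigned long bases_offset,
               unsigned long bases, V3Layout *out)
{
    int i;

    assert(out != NULL);
    assert(sample_size == 1 || sample_size == 2);
    for (i = 0; i < 4; i++)
        out->sample_plane[i] = samples_offset +
                               (unsigned long)i * samples * sample_size;
    out->peak_index = bases_offset;
    for (i = 0; i < 4; i++)
        out->prob_plane[i] = bases_offset + 4 * bases +
                             (unsigned long)i * bases;
    out->called_bases = bases_offset + 8 * bases;
    out->last_three   = bases_offset + 9 * bases;
    out->bases_end    = bases_offset + 12 * bases;
}
```

For a file with 100 samples of 2 bytes at offset 128 and 10 bases at offset 928, the traces start at 128, 328, 528 and 728, the accuracy estimates at 968, 978, 988 and 998, and the sequence section ends at 1048.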

Byte ordering and integer representation

"Forward byte and reverse bit" ordering will be used for all integer values. This is the same as used in the MC680x0 and SPARC processors, but the reverse of the byte ordering used on the Intel 80x86 processors.

         Off+0   Off+1  
       +-------+-------+  
uint_2 |  MSB  |  LSB  |  
       +-------+-------+  

         Off+0   Off+1   Off+2   Off+3
       +-------+-------+-------+-------+
uint_4 |  MSB  |  ...  |  ...  |  LSB  | 
       +-------+-------+-------+-------+

To read integers on systems with any byte order use something like this:

  /*
   * Both routines need <stdio.h> and the basic type definitions given
   * with the header record. The return value of fread() is not checked
   * here; real code should test it.
   */
  uint_2 read_uint_2(FILE *fp)
  {
      unsigned char buf[sizeof(uint_2)];
  
      fread(buf, sizeof(buf), 1, fp);
      return (uint_2)
          (((uint_2)buf[1]) +
           ((uint_2)buf[0]<<8));
  }
  
  uint_4 read_uint_4(FILE *fp)
  {
      unsigned char buf[sizeof(uint_4)];
  
      fread(buf, sizeof(buf), 1, fp);
      return (uint_4)
          (((uint_4)buf[3]) +
           ((uint_4)buf[2]<<8) +
           ((uint_4)buf[1]<<16) +
           ((uint_4)buf[0]<<24));
  }
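
Writing works the same way in reverse. The following sketch is not taken from the io library and all four function names are hypothetical: each value is first laid out in a byte buffer, most significant byte first, and then written, so the result does not depend on the byte order of the host.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned int   uint_4;  /* assumed to be 32 bits wide */
typedef unsigned short uint_2;  /* assumed to be 16 bits wide */

/* Store v in buf[0..1], most significant byte first. */
void put_uint_2(unsigned char *buf, uint_2 v)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)((v >> 8) & 0xff);
    buf[1] = (unsigned char)(v & 0xff);
}

/* Store v in buf[0..3], most significant byte first. */
void put_uint_4(unsigned char *buf, uint_4 v)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)((v >> 24) & 0xff);
    buf[1] = (unsigned char)((v >> 16) & 0xff);
    buf[2] = (unsigned char)((v >> 8) & 0xff);
    buf[3] = (unsigned char)(v & 0xff);
}

/* Both writers return 0 on success and -1 on a write error. */
int write_uint_2(FILE *fp, uint_2 v)
{
    unsigned char buf[2];

    assert(fp != NULL);
    put_uint_2(buf, v);
    return fwrite(buf, sizeof(buf), 1, fp) == 1 ? 0 : -1;
}

int write_uint_4(FILE *fp, uint_4 v)
{
    unsigned char buf[4];

    assert(fp != NULL);
    put_uint_4(buf, v);
    return fwrite(buf, sizeof(buf), 1, fp) == 1 ? 0 : -1;
}
```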

-- end of spec --