File Formats - Exp-Explain

Explanation of Records

Record: AC, ACcession line
Format: AC string
Explanation: A unique identifier for the reading.

Record: AP, Assembly Position
Format: AP Name_of_anchor_reading sense offset tolerance
Explanation: For readings whose position has been mapped by an external program, these records tell the "directed assembly" algorithm where to assemble the data. Positions are defined as offsets from an "anchor reading" which is the name of any reading already in the database, an orientation (sense, + or -), and a tolerance. Readings are aligned at relative position offset + or - tolerance.
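As an illustration only, the following Python sketch splits an AP line into its four fields and computes the range of offsets allowed by the tolerance. The names parse_ap, offset_window and AssemblyPosition, and the reading name used in the usage note, are hypothetical and are not part of the package.

```python
from typing import NamedTuple


class AssemblyPosition(NamedTuple):
    anchor: str      # name of the anchor reading
    sense: str       # "+" or "-"
    offset: int      # offset relative to the anchor reading
    tolerance: int   # allowed deviation either side of the offset


def parse_ap(line):
    """Parse 'AP anchor sense offset tolerance' (whitespace separated)."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "AP":
        raise ValueError(f"not an AP record: {line!r}")
    _, anchor, sense, offset, tolerance = fields
    if sense not in ("+", "-"):
        raise ValueError(f"sense must be + or -, got {sense!r}")
    return AssemblyPosition(anchor, sense, int(offset), int(tolerance))


def offset_window(ap):
    """Return the (lowest, highest) offset permitted by the tolerance."""
    return ap.offset - ap.tolerance, ap.offset + ap.tolerance
```

For a line such as "AP   anchor1.s1 + 120 10", offset_window(parse_ap(line)) gives the pair (110, 130).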

Record: AQ, Average Quality of the reading.
Format: AQ Numeric value in range 1 - 99.
Explanation: The average value of the "numerical estimate of base calling accuracy" as calculated by program eba. The value is useful for monitoring data quality and could also be used for deciding on an order of assembly - for example assemble the highest quality readings first.

Record: AV, Accuracy Values
Format: AV q1 q2 q3 ... or a1,c1,g1,t1 a2,c2,g2,t2 ...
Explanation: The accuracy values lie in the range 1-99 and are given either as one value per base (eg 89 50 ...) or as four values per base (eg 0,89,5,2 50,3,7,10). Bonfield,J.K and Staden,R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23 1406-1410, (1995).
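The two AV layouts can be told apart by the presence of commas. The sketch below is a minimal reader for the text that follows the AV line type; the function name parse_av is hypothetical, and it assumes the four values per base are in the a,c,g,t order given in the Format line.

```python
def parse_av(value):
    """Parse the value part of an AV record.

    Returns a list of ints (one per base) or a list of
    (a, c, g, t) tuples (four per base).
    """
    entries = value.split()
    if any("," in entry for entry in entries):
        result = []
        for entry in entries:
            quad = tuple(int(v) for v in entry.split(","))
            if len(quad) != 4:
                raise ValueError(f"expected 4 values per base, got {entry!r}")
            result.append(quad)
        return result
    return [int(entry) for entry in entries]
```

For example, parse_av("89 50") yields two integers, while parse_av("0,89,5,2 50,3,7,10") yields two 4-tuples.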

Record: BC, Base Calling software

Record: CC, Comment line
Format: CC string
Explanation: Any comments can be added on any number of lines.

Record: CF, Cloning vector sequence File
Format: CF string
Explanation: The name of the file containing the sequence of the cloning vector, to be used by vector_clip (see section Screening Against Vector Sequences).

Record: CH, Special CHemistry
Format: CH number
Explanation: Used to flag readings as having been sequenced using a "special chemistry". The number is a bit pattern with a bit for each chemistry type, thus allowing combinations of chemistries to be listed. Currently bit 0 is used to distinguish between dye-primer (0) and dye-terminator (1) chemistries. Bits 1 to 4 inclusive indicate the type of chemistry: unknown (0, 0000), ABI Rhodamine (1, 0001), ABI dRhodamine (2, 0010), BigDye (3, 0011), Energy Transfer (4, 0100) and LiCor (5, 0101). So for example a BigDye Terminator has bits 00111 set which is 7 in decimal.
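The bit layout described above can be decoded with a shift and a mask. The following Python sketch is illustrative only; the function name decode_ch and the "unrecognised" fallback for values not listed above are assumptions.

```python
# Chemistry types held in bits 1 to 4 of the CH value, as listed above.
CHEMISTRY_TYPES = {
    0: "unknown",
    1: "ABI Rhodamine",
    2: "ABI dRhodamine",
    3: "BigDye",
    4: "Energy Transfer",
    5: "LiCor",
}


def decode_ch(value):
    """Split a CH number into (primer/terminator, chemistry type)."""
    # Bit 0: 0 = dye-primer, 1 = dye-terminator.
    kind = "dye-terminator" if value & 1 else "dye-primer"
    # Bits 1 to 4: chemistry type.
    chemistry = (value >> 1) & 0b1111
    return kind, CHEMISTRY_TYPES.get(chemistry, "unrecognised")
```

With this sketch decode_ch(7) gives ("dye-terminator", "BigDye"), matching the BigDye Terminator example, and decode_ch(6) gives the dye-primer form of the same chemistry.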

Record: CL, Cloning vector Left end
Format: CL number
Explanation: The base position in the sequence that contains the last base in the cloning vector. Currently gap4 only uses the CS line.

Record: CN, Clone Name
Format: CN string
Explanation: The name of the segment of DNA that the reading has been derived from. Typically the name of a physical map clone.

Record: CR, Cloning vector Right end
Format: CR number
Explanation: The base position in the sequence that contains the first base in the cloning vector. Currently gap4 only uses the CS line.

Record: CS, Cloning vector Sequence present in sequence
Format: CS range
Explanation: Regions of sequence found by vector_clip (see section Screening Against Vector Sequences) to be cloning vector. Used in assembly to exclude unwanted sequence.

Record: CV, Cloning Vector type
Format: CV string
Explanation: The type of the cloning vector used.

Record: DR, Direction of Read
Format: DR direction
Explanation: Whether forward or reverse primers were used. Allows mapping of forward and reverse reads off the same template. NOTE however that we do not encourage the use of this method as the terms direction, sense and strand can be confusing. Instead we encourage the use of the PRimer line.

Record: DT, DaTe of experiment
Format: DT dd-mon-yyyy
Explanation: Any date information.

Record: EN, Entry Name
Format: EN string
Explanation: The name given to the reading.

Record: EX, EXperimental notes
Format: EX string
Explanation: Another type of comment line for additional information.

Record: FM, sequencing vector Fragmentation Method
Format: FM string
Explanation: Fragmentation method used to create sequencing library.

Record: ID, IDentifier
Format: ID string
Explanation: This is the name given to the reading inside the assembly database and is equivalent to the ID line of an EMBL entry.

Record: LE, Can be used to identify the location of materials
Format: LE string
Explanation: Originally a micro titre dish well number. Used in combination with LI.

Record: LI, Can be used to identify the location of materials
Format: LI string
Explanation: Originally a micro titre dish identifier. Used in combination with LE.

Record: LN, Local format trace file Name
Format: LN string
Explanation: The name of the local format trace file. This information is passed onto gap4, and allows for local formats to be used.

Record: LT, Local format trace file Type
Format: LT string
Explanation: The type of the local format trace file (usually SCF).

Record: MC, MaChine on which sequencing experiment was run
Format: MC string
Explanation: The lab's name for the sequencing machine used to create the data. Used for logging the performance of individual machines.

Record: MN, Machine generated trace file Name
Format: MN string
Explanation: The name of the trace file generated by the sequencing machine MC.

Record: MT, Machine generated trace file Type
Format: MT string
Explanation: The type of machine generated trace file.

Record: ON, Original base Numbers (positions)
Format: ON (eg) 1..43 0 45..63 65..74 0 75..536
Explanation: The A..B notation means values A to B inclusive, so this example reads as follows: bases 1 to 43 are unchanged, there is a change at 44, etc.
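To make the A..B notation concrete, this Python sketch expands an ON value into one number per base of the current sequence. The function name expand_on is hypothetical. Reading a 0 as "this base has no original position" is an assumption based on the example, not a statement of how gap4 treats it.

```python
def expand_on(value):
    """Expand an ON value such as '1..43 0 45..63' into a list of ints."""
    positions = []
    for token in value.split():
        if ".." in token:
            first, last = token.split("..")
            # A..B covers A to B inclusive.
            positions.extend(range(int(first), int(last) + 1))
        else:
            # A lone number (0 in the example) stands for a single base.
            positions.append(int(token))
    return positions
```

Applied to the example value "1..43 0 45..63 65..74 0 75..536", the result has 536 entries; the 44th entry is 0 and the 45th is 45.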

Record: OP, OPerator
Format: OP string
Explanation: Someone's name, possibly the person who ran the sequencing machine. Useful, with expansion of the string field, for monitoring the performance of individuals!

Record: PC, Position in Contig
Format: PC number
Explanation: For preassembled data, the position to put the left end of the reading.

Record: PD, Primer Data
Format: PD sequence
Explanation: The primer sequence.

Record: PN, Primer Name
Format: PN string
Explanation: Name of primer used, using local naming convention. Could be a universal primer.

Record: PR, PRimer type
Format: PR number
Explanation: This record shows the direction of the reading and distinguishes between primers from the ends of the insert and those that are internal. It is important for the analysis of the relative orientations and positions of readings on templates. When the positions of readings on templates are analysed (see section Find read pairs) primer types 1, 2, 3 and 4 are represented using the symbols F, R, f and r respectively.

0: Unknown
1: Forward from beginning of insert
2: Reverse from end of insert
3: Custom forward i.e. a forward primer other than type 1.
4: Custom reverse i.e. a reverse primer other than type 2.
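The primer types and their display symbols can be held in a small lookup table, sketched here in Python. The names are hypothetical, and since no symbol is given above for type 0, the sketch returns None for it rather than inventing one.

```python
# Symbols used for primer types 1 to 4 when read pairs are analysed.
PRIMER_SYMBOLS = {1: "F", 2: "R", 3: "f", 4: "r"}

PRIMER_DESCRIPTIONS = {
    0: "Unknown",
    1: "Forward from beginning of insert",
    2: "Reverse from end of insert",
    3: "Custom forward",
    4: "Custom reverse",
}


def primer_symbol(pr_value):
    """Return F, R, f or r for types 1 to 4, or None for other values."""
    return PRIMER_SYMBOLS.get(int(pr_value))


def is_custom_primer(pr_value):
    """True for the internal (custom) primer types 3 and 4."""
    return int(pr_value) in (3, 4)
```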

Record: PS, Processing Status
Format: PS explanation
Explanation: Indication of processing status.

Record: QL, poor Quality sequence present at Left (5') end
Format: QL position
Explanation: The sequence up to and including the base at the marked position is considered to be of too poor quality to be used. It may overlap with other marked sequences - CS, SL or SR. Used in assembly to exclude unwanted sequence.

Record: QR, poor Quality sequence present at Right (3') end
Format: QR position
Explanation: The sequence from and including the base at the marked position to the end is considered to be of too poor quality to be used. It may overlap with other marked sequences - CS, SL or SR. Used in assembly to exclude unwanted sequence.

Record: RS, Reference Sequence
Format: RS string
Explanation: The name of a sequence, usually in EMBL format, used to define the target sequence, base numbering and feature table data for a project. Used to define the numbering and changes produced by mutations in individual sequence readings (see section Introduction to mutation detection).

Record: SC, Sequencing vector Cloning site
Format: SC position
Explanation: The cloning site of the sequence vector. Used by vector_clip (see section Screening Against Vector Sequences).

Record: SE, SEnse (ie whether complemented)
Format: SE number
Explanation: For preassembled data, the sense of the reading (0 for forward, 1 for reverse).

Record: SF, Sequencing vector sequence File
Format: SF string
Explanation: The name of the file containing the sequence of the sequencing vector, to be used by vector_clip (see section Screening Against Vector Sequences).

Record: SI, Sequencing vector Insertion length
Format: SI range
Explanation: Expected insertion length of sequence in sequencing vector. Useful for selecting templates for further experiments.

Record: SL, Sequencing vector sequence present at Left (5') end
Format: SL position
Explanation: The sequence up to and including the base at the marked position is considered to be sequencing vector. Written by vector_clip (see section Screening Against Vector Sequences).

Record: SP, Sequencing vector Primer site (relative to cloning site)
Format: SP position
Explanation: Location of the primer used to sequence, relative to the cloning site. Used by vector_clip (see section Screening Against Vector Sequences).

Record: SQ, SeQuence
Format: SQ \nsequence blocks...\n//\n
Explanation: Complete sequence, as determined by the sequencing machine. The sequence is broken into blocks of 10 bases with 6 blocks per line separated by a space (see the example below).
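The block layout can be produced as in the Python sketch below. The function name format_sq is hypothetical, and the indentation of the sequence lines is an assumption (it is exposed as a parameter because the Format line above does not specify it).

```python
def format_sq(sequence, indent="     "):
    """Write a sequence as an SQ record: blocks of 10 bases, 6 per line."""
    lines = ["SQ"]
    for i in range(0, len(sequence), 60):
        chunk = sequence[i:i + 60]
        # Cut each 60 base line into blocks of 10 separated by one space.
        blocks = [chunk[j:j + 10] for j in range(0, len(chunk), 10)]
        lines.append(indent + " ".join(blocks))
    # The record is terminated by a line containing only "//".
    lines.append("//")
    return "\n".join(lines) + "\n"
```

A 65 base sequence therefore gives one full line of six blocks, a second line holding the remaining 5 bases, and the closing "//" line.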

Record: SR, Sequencing vector sequence present at Right (3') end
Format: SR position
Explanation: The sequence from and including the base at the marked position to the end is considered to be sequencing vector. Written by vector_clip (see section Screening Against Vector Sequences).
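The QL, QR, SL and SR positions can be combined to find the part of a reading that is left once the marked ends are excluded. The Python sketch below is one way to do this and is not taken from the package: the function name is hypothetical, positions are assumed to count from 1, and the innermost left and right marks are assumed to win. CS ranges are not handled.

```python
def usable_region(sequence, ql=0, qr=None, sl=0, sr=None):
    """Return the sequence remaining after left and right clipping.

    ql, sl: last excluded base at the left end (0 = nothing excluded).
    qr, sr: first excluded base at the right end (None = nothing excluded).
    """
    past_end = len(sequence) + 1
    # Bases up to and including the left mark are excluded.
    left = max(ql, sl)
    # Bases from and including the right mark are excluded.
    right = min(past_end if qr is None else qr,
                past_end if sr is None else sr)
    if right <= left:
        return ""
    # Convert 1-based inclusive marks to a Python slice.
    return sequence[left:right - 1]
```

For the 10 base string "ACGTACGTAC" with QL 2 and QR 9, bases 3 to 8 remain, that is "GTACGT"; adding SL 3 shortens this to "TACGT".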

Record: SS, Screening Sequence
Format: SS string
Explanation: Note that in earlier versions of this documentation this field was explained incorrectly. Due to this the field is not currently being used by any of our programs. The original meaning was to specify a sequence to screen against. Any number of SS lines could be present to denote any number of screening sequences. In the future we may change the meaning of this field to be a single SS line containing a file of filenames of screening sequences. If this causes problems for people then we will choose a new line type, so please inform us now. Also note that contrary to previous documentation, vector_clip does not use this field (it uses the SF field instead).

Record: ST, STrands
Format: ST number
Explanation: Denotes whether this is a single or double stranded template. This is useful for deducing suitable templates for later experiments.

Record: SV, Sequencing Vector type
Format: SV string
Explanation: Type of sequencing vector used. Can be used for choosing templates for custom primer experiments.

Record: TC, Tag to be placed on the Consensus.
Format: TC TYPE S position..length
Explanation: These lines instruct gap4 to place tags on the consensus. The format defines the tag type (a 4 character identifier, which should start at column position 5), its strand ("+", "-" or "=", which means both strands), and its start position followed by the position of its end. These two values are separated by "..". Following lines starting TC with space characters up to column 10 are written into the comment field of the tag. For example, the next three lines define a tag of type comment that is to be on both strands over the range 100 to 110, and the comment field will contain "This comment contains several lines".

TC   COMM = 100..110
TC        This comment contains
TC          several lines

Record: TG, Tag to be placed on the reading.
Format: TG TYPE S position..length
Explanation: These lines instruct gap4 to place tags on the reading. See TC for further information.
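As a rough sketch, the tag lines described under TC can be read as follows in Python. The function name parse_tag is hypothetical. The sketch splits on white space rather than fixed columns and joins comment lines with single spaces, which is an assumption; gap4 itself may preserve the original spacing and line breaks.

```python
import re

# First line of a tag: line type, 4 character tag type, strand, start..end.
TAG_HEADER = re.compile(r"^(TC|TG)\s+(\S{4})\s+([+\-=])\s+(\d+)\.\.(\d+)\s*$")


def parse_tag(lines):
    """Parse one tag given as a header line plus optional comment lines."""
    match = TAG_HEADER.match(lines[0])
    if match is None:
        raise ValueError(f"not a tag header: {lines[0]!r}")
    record, tag_type, strand, start, end = match.groups()
    comment_parts = []
    for line in lines[1:]:
        if not line.startswith(record):
            raise ValueError(f"expected a {record} continuation: {line!r}")
        # Drop the two letter line type and the padding that follows it.
        comment_parts.append(line[2:].strip())
    return {
        "record": record,
        "type": tag_type,
        "strand": strand,
        "start": int(start),
        "end": int(end),
        "comment": " ".join(comment_parts),
    }
```

Given the three TC lines of the example above, the sketch returns type "COMM", strand "=", start 100, end 110 and the comment "This comment contains several lines".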

Record: TN, Template Name
Format: TN string
Explanation: The name of the template used in the experiment.

Record: WT, Wild Type trace file
Format: WT string
Explanation: The filename of the wild type trace file. Used for mutation studies.

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/formats_unix_20.html