Calculating Consensus Sequences

In this section we describe the types of consensus which gap4 can produce, the formats they can be written in, and the algorithms that can be used. The algorithms are not only used to produce consensus sequence files, but in many other places throughout gap4 where an analysis of the current quality of the data is required. One important place is inside the Contig Editor (see section Editing in gap4) where they are used to produce an "on-the-fly" consensus, responding to every edit made by the user.

The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see section Consensus Algorithm).

There are four main types of consensus sequence file that can be produced by the program: Normal, Extended, Unfinished, and Quality. They are all invoked from the File menu.

"Normal" is the type of consensus file that would be expected: a consensus from the non-hidden parts of a contig. "Extended" is the same as "Normal", except that the consensus is extended by including the hidden, non-vector sequence at the ends of the contig.

"Unfinished" is the same as "Normal" except that any position where the consensus does not have good data for both strands is written using A,C,G,T characters, and the rest (which has good data for both strands) is written using a different set of symbols. This sequence can be used for screening against new readings: only the regions needing more readings will produce matches. By screening readings in this way, prior to assembly, users can avoid entering readings which will not help finish the project, and which may require further editing work to be performed.
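The idea behind the "Unfinished" consensus can be illustrated with a minimal Python sketch. It is not gap4's implementation: the function name, the per-position flags and the replacement symbols ("1234") are all hypothetical, since this section does not state which symbols gap4 writes for the finished regions.

```python
def unfinished_consensus(consensus, both_strands_ok, finished_table):
    """Rewrite finished positions in a different symbol set.

    consensus       -- consensus string using A, C, G, T and "-"
    both_strands_ok -- one boolean per position; True where both strands
                       have good data (hypothetical input, not a gap4 structure)
    finished_table  -- translation table for finished positions; the symbols
                       are a placeholder chosen by the caller
    """
    if len(consensus) != len(both_strands_ok):
        raise ValueError("consensus and flags must have the same length")
    return "".join(
        base.translate(finished_table) if ok else base
        for base, ok in zip(consensus, both_strands_ok)
    )


# Placeholder symbols only: finished A, C, G, T become 1, 2, 3, 4.
PLACEHOLDER_TABLE = str.maketrans("ACGT", "1234")

example = unfinished_consensus(
    "ACGTAC",
    [True, True, False, False, True, True],
    PLACEHOLDER_TABLE,
)
```

Because only the unfinished positions remain as A,C,G,T, a search of new readings against such a sequence can match only in the regions that still need data.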

"Quality" produces a sequence of characters of the same length as the consensus, in which each character encodes the reliability of the consensus at that position.

Consensus sequence files can also encode the positions of the currently active tag types by changing the case of the tagged characters (marking) or writing them in a different character set (masking) (see section Active tags and masking).
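The difference between marking and masking can be sketched in Python as follows. This is an illustration under stated assumptions, not gap4 code: the tag ranges are taken to be 0-based and half-open, marking is assumed to lower-case an upper-case consensus, and the masking symbols ("1234") are placeholders, because the character set gap4 uses is described in the section on active tags and masking rather than here.

```python
def mark(consensus, tagged_ranges):
    """Marking: change the case of the tagged characters (assumed: to lower case)."""
    chars = list(consensus)
    for start, end in tagged_ranges:
        chars[start:end] = [c.lower() for c in chars[start:end]]
    return "".join(chars)


def mask(consensus, tagged_ranges, mask_table):
    """Masking: write the tagged characters in a different character set."""
    chars = list(consensus)
    for start, end in tagged_ranges:
        chars[start:end] = [c.translate(mask_table) for c in chars[start:end]]
    return "".join(chars)


# Hypothetical tag covering positions 2..4 (0-based, end exclusive).
tags = [(2, 5)]
marked = mark("ACGTACGT", tags)
masked = mask("ACGTACGT", tags, str.maketrans("ACGT", "1234"))
```

Marking keeps the base identity readable, whereas masking replaces it with symbols that a later search program will not treat as ordinary bases.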

The consensus algorithms are usually configured to produce only the characters A,C,G,T and "-", but it is possible to set them to produce the complete set of IUB codes. This mode is useful for some types of work and allows the range of observed base types at any position to be coded in the consensus. How the IUB codes are chosen is described in the introduction to the consensus algorithms (see section The Consensus Algorithms).
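As a reminder of what the IUB codes express, the following Python sketch maps a set of observed base types to the standard single-letter code. Only the code table itself is standard; the function name is hypothetical, and the rules gap4 applies to decide which base types count as "observed" at a position belong to the consensus algorithms and are not reproduced here.

```python
# Standard IUB/IUPAC nucleotide codes, keyed by the set of bases they represent.
IUB_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

VALID_BASES = set("ACGT")


def iub_code(observed_bases):
    """Return the IUB code for the observed base types, or "-" if there are none.

    Characters other than A, C, G, T (in either case) are ignored.
    """
    observed = frozenset(
        b.upper() for b in observed_bases if b.upper() in VALID_BASES
    )
    return IUB_CODES.get(observed, "-")
```

For example, a position at which both A and G are observed would be coded as R, and one at which all four base types are observed as N.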

Depending on the type of consensus produced, the consensus sequence files can be written in three different formats: Experiment files (see section Experiment File), FASTA (Pearson,W.R. Using the FASTA program to search protein and DNA sequence databases. Methods in Molecular Biology. 25, 365-389 (1994)) or staden format. If Experiment file format is selected, a further menu appears that allows users to include tag data in the output file.

For FASTA format the sequence headers include the contig identifier as the sequence name, and the project database name, version number and the number of the leftmost reading in the contig as comments. For example, ">xyzzy.s1 B0334.0.274" is database B0334, copy 0, and the leftmost reading for the contig is number 274, which has a name of xyzzy.s1.

For staden format the headers include the project database name and the number of the leftmost reading in the contig. For example, "<B0334.00274------->" is database B0334 and the leftmost reading for the contig is number 274. Staden format is maintained only for historical reasons, i.e. there may still be a few unfortunate people using it. Experiment file format can contain much more information, and can serve as the basis of a submission to the sequence library.
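Scripts that post-process consensus files often need to recover these header fields. The Python sketch below parses the two example headers given above. It is based only on those examples, so the exact layout rules are assumptions: that the database name, version and reading number are separated by dots, and that the staden header pads the reading number with zeros and fills the remainder with "-". The type and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ContigHeader(NamedTuple):
    database: str
    version: Optional[str]       # not present in staden headers
    leftmost_reading: int
    reading_name: Optional[str]  # not present in staden headers


def parse_fasta_header(line):
    """Parse a header such as '>xyzzy.s1 B0334.0.274'."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    name, comment = line[1:].split(None, 1)
    # Split from the right so that a dot in the database name is preserved.
    database, version, number = comment.strip().rsplit(".", 2)
    return ContigHeader(database, version, int(number), name)


def parse_staden_header(line):
    """Parse a header such as '<B0334.00274------->'."""
    match = re.fullmatch(r"<(.+)\.(\d+)-*>", line.strip())
    if match is None:
        raise ValueError("not a staden header: %r" % line)
    return ContigHeader(match.group(1), None, int(match.group(2)), None)


fasta = parse_fasta_header(">xyzzy.s1 B0334.0.274")
staden = parse_staden_header("<B0334.00274------->")
```

Both calls yield database B0334 and leftmost reading 274; only the FASTA header also carries the copy number and the reading name.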

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_112.html