Gap4 - Con-Calculation

The Consensus Algorithms

The consensus calculation is a very important component of gap4. It is used to produce an "on-the-fly" consensus, responding to every individual change in the Contig Editor (see section Editing in gap4) and is used to produce the final sequence for submission to the sequence libraries. Some years ago Bonfield, J.K. and Staden, R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23, 1406-1410 (1995) we put forward the idea of using base call accuracy estimates in sequencing projects, and this has been partially realised with the values from the Phred program (Ewing, B. and Green, P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)). These values are widely used and have defined a decibel type scale for base call confidence values and gap4 is currently set to use confidence values defined on this scale. An overview of our use of confidence values is contained in the introductory sections of the manual (see section The use of numerical estimates of base calling accuracy).

As is described elsewhere (see section List Consensus Confidence) being able to calculate the confidence for each base in the consensus sequence makes it possible to estimate the number of errors it contains, and hence the number of errors that will be removed if particular bases are checked and, if necessary, edited.

Gap4 caters for base calls with and without confidence values and hence provides a choice of algorithms. There are currently three consensus algorithms that may be used. The choice of the best algorithm will depend on the data that you have available and the purpose for which you are using gap4.

The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see section Consensus Algorithm).

The only way to produce a consensus sequence for which the reliability of each base is known, is to use reading data with base call confidence values. Their use, in combination with the Confidence Value algorithm (see section Consensus Calculation Using Confidence Values). is strongly recommended.

For base calls without confidence values use the Base Frequencies algorithm (see section Consensus Calculation Using Base Frequencies). This is also a fast algorithm so it may be appopriate for very high depth assemblies such those for mutation studies.

For data with simple base call accuracy estimates rather than those on the decibel scale, the Weighted Base Frequencies algorithm should be used (see section Consensus Calculation Using Weighted Base Frequencies).

All confidence values lie in the range 0 to 100. When readings are entered into a database, gap4 assigns a confidence of 99 to all bases without confidence values. For all three algorithms, a base with confidence of 100 is used to force the consensus base to that base type and to have a confidence of 100. However,if two or more base types at any position have confidence 100, the consensus will be set to "unknown", i.e. "-", and will have a confidence of 0. Note that dash ("-") is our preferred symbol for "unknown" as, within a sequence, it is more easily distinguished from A,C,G,T than "N".

The consensus sequence is also assigned a confidence, even when base call confidence values are not used to calculate it. The scale and meaning of the consensus confidence changes between consensus algorithms. However the consensus cutoff parameter always has the same meaning. A consensus base with a confidence 'X' will be called as a dash when 'X' is lower than the consensus cutoff, otherwise it is the determined base type.

Both the consensus cutoff and quality cutoff values can be set by using the "Configure cutoffs" command in the "Consensus algorithm" dialogue in the main gap4 Options menu (see section Consensus Algorithm). Within the Contig Editor (see section Editing in gap4) these values can be adjusted by clicking on the "<" and ">" symbols adjacent to the "C:" (consensus cutoff) and "Q:" (quality cutoff) displays in the top left corner of the editor. These buttons are repeating buttons - the values will adjust for as long as the left mouse button is held down. Changing these values lasts only as long as that invocation of the contig editor.

The consensus algorithms are usually configured to produce only the characters A,C,G,T,* and "-", but it is possible to set them to produce the complete set of IUB codes. This mode is useful for some types of work and allows the range of observed base types at any position to be coded in the consensus. The IUB code at any position is determined in the following way.

We assume that the user wants to know which base types have occurred at any point, but may want some control over the quality and relative frequency of those that are used to calculate the "consensus". For the simplest consensus algorithm there is no control over the quality of the base calls that are included, but the Consensus Cutoff can be used to control how the relative frequency affects the chosen IUB code. All base types whose computed "confidence" exceeds the Consensus Cutoff will be included in the selection of the IUB code. For example if only base type T reaches the Consenus Cutoff the IUB code will be T; if both T and C reach the cutoff the code will be Y; if A, C and T each reach the cutoff the code will be H; if A, C, G and T all reach the cutoff the code will be "N". For the Confidence Value algorithm the Quality Cutoff can be used to exclude base calls of low quality, so that all those that do not reach the Quality Cutoff are excluded from the IUB code calculation. Otherwise the logic of the code selection is the same as for the two simpler algorithms.

The algorithms are explained below.

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_117.html