Gap4 - Editor-Techniques-Cutoffs

Consensus and Quality Cutoffs

The most rapid editing technique (see section The use of numerical estimates of base calling accuracy) is only available if base call confidence values have been assigned to the reading data using a scale proportional to -log(error_rate). Using the "confidence" consensus method will make use of confidence values to give the most probable consensus sequence and a probability of each base being correct. Using the editor "consensus quality" search then provides an extremely quick way of identifying the lowest quality consensus bases. The List Confidence command will give information on the expected number of errors that can be fixed by examining all consensus bases with a quality less than a particular amount. This gives a good indication to the choice of theshold to use in the consensus quality search. Additionally you will also be told the expected error rates. With this system it is possible to stop editing once a particular average quality has been achieved.

Care should be taken in considering your desired error rate. An average error rate of 1 in 10,000 may be easily achievable. However there could still be consensus bases with very low confidence. Hence it is perhaps best to choose both an average error rate and a minimum consensus confidence for your finishing criteria. The consensus confidence values are scaled such that a confidence of 20 is a 1 in 100 error rate, 30 is 1 in 1000, 40 is 1 in 10000 and so on.

The rest of this section described methods to use when the aforementioned confidence values are not available.

The Consensus and Quality cutoff values used whilst editing are personal preference. Rather than state suggested values, we discuss the merits of using example values.

The meaning of the consensus and quality cutoff values changes slightly depending on the consensus algorithm in use. For more information on the algorithms and these values see section The Consensus Calculation.

With the "Base type frequencies" and "Quality weighted base type" methods, a consensus cutoff value of 100 means that every sequence disagreement will yield a dash in the consensus. Hence the "Next Search" button when in "problem" search mode can be used to verify every potential problem. This is a lot of work, but if you wish to make sure that all disagreements are checked this is the easiest way.

With a quality cutoff of -1, lowering the consensus cutoff value to (eg) 90 means that a base in the consensus will only be a dash when over 10% of the bases disagree with the majority at that point. So a base covered by 11 sequences, 10 of which state A and one of which states C would not be considered a problem and would not be found by the problem search. Note that this is regardless of the strand information. So if the As are on the positive strand and the single C is on the negative strand then this is still not considered a problem. However, see below.

Still working with the "Base type frequencies" and "Quality weighted base type" consensus methods, changing the quality cutoff to be 0 or more means that the consensus base is derived from the relative quality of bases instead of simple frequency counts. A quality cutoff of 0 and a consensus cutoff of 90 means that the base will be a dash only when the sum of the quality values for the most common base type (defined by the highest quality sum) is less than 90% of the total. In comparison with a quality cutoff of -1, this means that the above example of 10 A bases and 1 C base would be considered a problem if the C base had a sufficiently high quality.

If you have confidence values for each base available you may consider it unnecessary to check disagreements caused by poor quality data disagreeing with good quality data, although disagreements between good data and good data should always be checked. However it should be obvious from this that with a quality cutoff of 0 and a consensus cutoff of 100% every sequence conflict is still considered a potential problem. A specific change in the consensus cutoff (eg from 100% to 90%) will typically find less problems when the quality cutoff is 0 than when it is -1. This is entirely due to differences between good quality data and poor quality data being excluded.

Finally, the "Compare Strands" editor setting calculates two independent consensus sequences; one for each strand. The consensus shown is then the base calculated in each of the two consensus sequences if they agree, or dash if they do not. The "confidence" consensus algorithm already takes into account strand and chemistry when calculating the consensus base type and confidence, but will only lower the confidence value for strand disagreements, rather than setting the consensus base to be a dash. For all consensus methods enabling "Compare Strands" will force you to check all consensus bases where the evidence from each strand is conflicting.

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_71.html