#include <qual.h> int calc_consensus( int contig, int start, int end, int mode, char *con, char *con2, float *qual, float *qual2, float cons_cutoff, int qual_cutoff, int (*info_func)(int job, void *mydata, info_arg_t *theirdata), void *info_data); int database_info( int job, void *mydata, info_arg_t *theirdata);
This function calculates the consensus sequence for a given segment of a contig. It can produce a single consensus sequence using all readings, or split it into two sequences; one for each strand. Additionally, it can produce either one (combinded strands) or two (individual strands) sets of values relating to the accuracy of the returned consensus.
The contig, start and end arguments hold the contig and range to calculate the consensus for. The ranges are inclusive and start counting with the first base as position 1.
con and con2 are buffers to store the consensus. These are
allocated by the caller to be at least of size end-start+1. If
con2 is NULL
both strands are calculated as a single consensus to
be stored in con. Otherwise the top strand is stored in con and
the bottom strand is stored in con2.
mode should be one of CON_SUM
or CON_WDET
. CON_SUM
is the "normal" mode, which indicates that the consensus sequence is simply
the most likely base or a dash (depending on cons_cutoff. The
CON_WDET
mode is used to return special characters for bases that are
good quality and identical on both strands. Where one strand has a dash, the
consensus base for the other strand is used. Where both strands differ, and
are not dashes, the consensus is returned as dash. Note that despite requiring
the consensus for each starnd independently, this mode requires that
con2 is NULL
. To summarise the action of the CON_WDET
mode, the final consensus is derived as follows:
Top Bottom Resulting Strand Strand Base --------------------------- A A d C C e G G f T T i - - - - x x x - x x y -
[Where x and y are one of A, C, G or T, and x != y.]
qual_cutoff and cons_cutoff hold the quality and consensus cutoff paramaters used in the consensus algorithm for determining which bases are of sufficient quality to use and by how big a majority this base type must have before it is returned as the consensus base (otherwise "-" is used). For a complete description of how these parameters operate see the consensus algorithm description in the main Gap4 manual. (FIXME: should we duplicate this here?)
The qual and qual2 buffers are allocated by the caller to be the same size as the con and con2 buffers. They are filled with the a floating point representing the ratio of score for the consensus base type to the score for all base types (where the definition of score depends on the qual_cutoff parameter). This is the value compared against cons_cutoff to determine whether the consensus base is a dash. Either or both of qual and qual2 can be passed as NULL if no accuracy information is required. Note that the accuracy information for qual2 is only available when con2 has also been passed as non NULL.
The algorithm uses info_func to obtain information about the readings
from the database. info_data is passed as the second argument
(mydata) to info_func. info_func is called each time some
information is required about a reading or contig. It's purpose is to abstract
out the algorithm from the data source. There are currently two such
functions, the most commonly used of which is database_info
function
(the other being contEd_info
to fetch data from the contig editor
structures). The database_info
function obtains the sequence details
from the database. It requires a GapIO pointer to be passed as
info_data.
The function returns 0 for success, -1 for failure.