Programming with Gap4 - calc

calc_consensus

#include <qual.h>

int calc_consensus(
        int     contig,
        int     start,
        int     end,
        int     mode,
        char   *con,
        char   *con2,
        float  *qual,
        float  *qual2,
        float   cons_cutoff,
        int     qual_cutoff,
        int    (*info_func)(int          job,
                            void        *mydata,
                            info_arg_t  *theirdata),
        void   *info_data);

int database_info(
        int          job,
        void        *mydata,
        info_arg_t  *theirdata);

This function calculates the consensus sequence for a given segment of a contig. It can produce a single consensus sequence using all readings, or split it into two sequences; one for each strand. Additionally, it can produce either one (combinded strands) or two (individual strands) sets of values relating to the accuracy of the returned consensus.

The contig, start and end arguments hold the contig and range to calculate the consensus for. The ranges are inclusive and start counting with the first base as position 1.

con and con2 are buffers to store the consensus. These are allocated by the caller to be at least of size end-start+1. If con2 is NULL both strands are calculated as a single consensus to be stored in con. Otherwise the top strand is stored in con and the bottom strand is stored in con2.

mode should be one of CON_SUM or CON_WDET. CON_SUM is the "normal" mode, which indicates that the consensus sequence is simply the most likely base or a dash (depending on cons_cutoff. The CON_WDET mode is used to return special characters for bases that are good quality and identical on both strands. Where one strand has a dash, the consensus base for the other strand is used. Where both strands differ, and are not dashes, the consensus is returned as dash. Note that despite requiring the consensus for each starnd independently, this mode requires that con2 is NULL. To summarise the action of the CON_WDET mode, the final consensus is derived as follows:

 Top     Bottom   Resulting
Strand   Strand     Base
---------------------------
   A        A         d
   C        C         e
   G        G         f
   T        T         i
   -        -         -
   -        x         x
   x        -         x
   x        y         -

[Where x and y are one of A, C, G or T, and x != y.]

qual_cutoff and cons_cutoff hold the quality and consensus cutoff paramaters used in the consensus algorithm for determining which bases are of sufficient quality to use and by how big a majority this base type must have before it is returned as the consensus base (otherwise "-" is used). For a complete description of how these parameters operate see the consensus algorithm description in the main Gap4 manual. (FIXME: should we duplicate this here?)

The qual and qual2 buffers are allocated by the caller to be the same size as the con and con2 buffers. They are filled with the a floating point representing the ratio of score for the consensus base type to the score for all base types (where the definition of score depends on the qual_cutoff parameter). This is the value compared against cons_cutoff to determine whether the consensus base is a dash. Either or both of qual and qual2 can be passed as NULL if no accuracy information is required. Note that the accuracy information for qual2 is only available when con2 has also been passed as non NULL.

The algorithm uses info_func to obtain information about the readings from the database. info_data is passed as the second argument (mydata) to info_func. info_func is called each time some information is required about a reading or contig. It's purpose is to abstract out the algorithm from the data source. There are currently two such functions, the most commonly used of which is database_info function (the other being contEd_info to fetch data from the contig editor structures). The database_info function obtains the sequence details from the database. It requires a GapIO pointer to be passed as info_data.

The function returns 0 for success, -1 for failure.

This page is maintained by staden-package. Last generated on 1 March 2001.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/scripting_154.html