What is a contig?

The need for the original use of the word contig has not diminished, but the current diversity of uses is causing confusion. Here we explain the origins of the term. The Sanger shotgun sequencing method relies on computer programs to find matching sequences and so to order sets of overlapping clones (Staden, R., "A strategy of DNA sequencing employing computer programs", Nucleic Acids Res. 7, 2601-2610 (1979)). To aid discussion of this new class of object, the word contig was invented (Staden, R., "A new computer method for the storage and manipulation of DNA gel reading data", Nucleic Acids Res. 8, 3673-3694 (1980)). That paper contained the following:

"Definition of a contig

In order to make it easier to talk about our data gained by the shotgun method of sequencing we have invented the word "contig". A contig is a set of gel readings that are related to one another by overlap of their sequences. All gel readings belong to one and only one contig, and each contig contains at least one gel reading. The gel readings in a contig can be summed to form a contiguous consensus sequence and the length of this sequence is the length of the contig."

This defines a contig as a set of overlapping segments of DNA. It naturally allows a set to contain a single element (a reading that is still available for further comparison). It also defines the length of a contig as the length of the consensus sequence derived from it. In the light of the current confusion, note that consensus sequences and contigs are entirely different classes of object: a contig is the set of readings, and the consensus is a sequence computed from that set.
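The distinction can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not the method of the 1980 paper: the names `Reading`, `group_into_contigs` and `consensus` are hypothetical, each reading is assumed to come with an already known offset on a shared coordinate axis (in practice the offsets are what the overlap search has to discover), and the consensus is taken by a simple majority vote per position.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One gel reading. `start` is an assumed, already known offset."""
    name: str
    start: int
    bases: str

    @property
    def end(self) -> int:
        return self.start + len(self.bases)


def group_into_contigs(readings):
    """Partition readings into contigs: sets of readings linked by overlap.

    Every reading ends up in exactly one contig, and a reading that
    overlaps nothing forms a contig of its own.
    """
    contigs = []
    current, current_end = [], None
    for r in sorted(readings, key=lambda r: r.start):
        if current and r.start < current_end:
            # Overlaps the contig being built: add it to the set.
            current.append(r)
            current_end = max(current_end, r.end)
        else:
            if current:
                contigs.append(current)
            current, current_end = [r], r.end
    if current:
        contigs.append(current)
    return contigs


def consensus(contig):
    """Derive a consensus sequence (a plain string) from a contig (a set)."""
    left = min(r.start for r in contig)
    right = max(r.end for r in contig)
    columns = [Counter() for _ in range(right - left)]
    for r in contig:
        for i, base in enumerate(r.bases):
            columns[r.start - left + i][base] += 1
    # Majority vote per column; an assumption, not the original algorithm.
    return "".join(col.most_common(1)[0][0] for col in columns)


readings = [
    Reading("r1", 0, "ACGTAC"),
    Reading("r2", 4, "ACGGT"),   # overlaps r1 on the two bases "AC"
    Reading("r3", 20, "TTGCA"),  # overlaps nothing: a one-reading contig
]
contigs = group_into_contigs(readings)
for contig in contigs:
    seq = consensus(contig)
    print([r.name for r in contig], seq, len(seq))
```

Here `contigs` holds the sets of readings, while `consensus(contig)` returns a separate object, a string, whose length is the length of the contig.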

Its meaning was broadened when the same term was used by Coulson, A.R., Sulston, J.E., Brenner, S. and Karn, J., Proc. Natl. Acad. Sci. 83, 7821-7825 (1986) to describe sets of overlapping fingerprinted clones used in physical mapping of the nematode genome. Unfortunately they did not refer to the published definition, but in a footnote defined contigs to be "groups of clones with contiguous nucleotide sequences". In the fingerprint comparison software, however, their "groups of clones" had to be allowed to contain single clones; otherwise overlaps would have been missed. The length of the contigs could be estimated, but the Sanger Centre had to be created before it was possible to calculate their consensus sequences.

This was a natural and consistent extension to overlapping clones, but since then the value of the term has slowly been eroded, not helped by various dictionaries and encyclopedias. Here are two examples reported by confused correspondents.

"Contig: A continuous sequence of DNA that has been assembled from overlapping cloned DNA fragments." (The Encyclopedia of Molecular Biology, Blackwell, 1994.) Here a contig is no longer a set, merely a string of A, C, G, T, formerly known as a "sequence".

"Contig: One of a set of overlapping clones that represent a continuous region of DNA. Each contig is a genomic clone, usually in a cosmid or a yac. They are used in contig mapping. Contig mapping is a technique which relies on the use of overlapping clones, referred to as contigs." (Oxford Dictionary of Biochemistry and Molecular Biology.) That is, a contig is a single clone that is part of a set of overlapping clones, and that set is itself a contig.

Currently there are several contradictory definitions, often reducing the meaning to that contained in the Blackwell Encyclopedia of Molecular Biology, i.e. a replacement for "sequence". A web search will reveal many more. For example, http://www.ncbi.nlm.nih.gov/genome/guide/build.html defines a contig as a "Contiguous sequence constructed from many clone sequences. It may include draft and finished sequence. It may contain sequence gaps (within a clone), but it does not include gaps between clones", i.e. another type of sequence.

