Gap4 - Assembly-CAP2

Assembly CAP2

This mode of assembly uses the global assembly program CAP2, developed by Xiaoqiu Huang. Huang, X. An improved sequence assembly program. Genomics 33, 21-31 (1996).

The CAP2 program can be accessed via the Gap4 interface through the "Assembly" menu or as a stand alone program.

The CAP2 program, for use with Gap4, must be obtained via ftp from the author, Xiaoqiu Huang.

Email Xiaoqiu Huang (huang@cs.mtu.edu) stating that you want CAP2 for use with gap4 and the operating system for which you need the program (one of: SunOS 4.1.1; Solaris 2.4; DEC OSF/1 V3.0 and Digital Unix; Irix 5.3). He will then contact you to arrange for the retrieval of the binary file. The binary file is called cap2_s. Make this executable (eg chmod a+x cap2_s) and move it to the directory $STADENROOT/$MACHINE-bin. The CAP2 options on the "Assembly" menu should now be available.

Perform CAP2 assembly

[picture]

The assembly works on either a file or list of reading names in experiment file format (see section Experiment File). CAP2 assembles the readings and the alignments are written to the output window. Irrespective of the original file format, new reading files are written in the destination directory in experiment file format. If the destination directory does not already exist, then it is created. These new files contain the additional information required to recreate the same assembly within Gap4. This is done by the addition of an AP line. See section Directed Assembly.

It is also possible to tell the program to identify chimeric fragments, report repeat structures and resolve them by setting the "Find repeats/chimerics" radiobutton to "Yes". If this is set to "No", these tasks are not performed. At the present time, CAP2 can only resolve direct repeats and not reverse repeats.

Import CAP2 assembly

This mode imports the aligned sequences produced after CAP2 assembly into Gap4 and maintains the same alignment. Importing the files requires the directory containing the newly aligned readings, ie the destination directory used in "Perform CAP2 assembly". Readings which are not entered are written to a "list" or "file" specified in the "Save failures" entry box. This mode is functionally equivalent to "Directed assembly". See section Directed Assembly.

Perform and import CAP2 assembly

This mode performs both the assembly section Perform CAP2 assembly and the import section Import CAP2 assembly together. The assembled readings are written to the destination directory and then are automatically imported from this directory into Gap4.

Stand alone CAP2 assembly

The program can be alternatively accessed as a stand alone program with the following command line arguments

cap2_s -{format} file_of_filenames [-r] [-out destination_directory]

{format} is the file format of the file of filenames and is either in experiment file format or fasta format. Legal inputs are exp, EXP, fasta or FASTA.

file_of_filenames is the name of the file containing the reading names to be assembled for experiment files or a single file of readings in fasta format.

destination_directory is the name of a directory to which the new experiment files are written to. The default directory is "assemble".

-r is optional and is equivalent to the "Find repeats/chimerics" option above.

Further details about CAP2

The comments provided with CAP2 by Huang are detailed below.

   copyright (c) 1995-96 Xiaoqiu Huang and Michigan Technological University
   No part of this program may be distributed without prior written
   permission of the author.

        Xiaoqiu Huang
        Department of Computer Science
        Michigan Technological University
        Houghton, MI 49931
        E-mail: huang@cs.mtu.edu

        Proper attribution of the author as the source of the software would
        be appreciated:
             Huang, X. (1996)
             An Improved Sequence Assembly Program
             Genomics, 33:21-31.

   The CAP2 program assembles short DNA fragments into long sequences.
   CAP2 contains a number of improvements to the original version
   described in Genomics 14, pages 18-25, 1992. These improvements are:

   o  Use of a more efficient filter for quickly detecting pairs of
      fragments that could not overlap.
   
   o  Accurate evaluation of overlap strengths through the use
      of internally generated fragment-specific confidence vectors.

   o  Identification of fragments from repetitive sequences and
      resolution of ambiguities in assembly of those fragments.

   o  Identification of chimeric fragments.

   o  Automated refinement of poorly aligned regions of fragment
      alignments

   A chimeric fragment is made of two short pieces from non-adjacent
   regions of the DNA molecule. CAP2 may report a repeat structure like:

        F1      5' flanking
        F2      5' flanking
        I1      Internal
        I2      Internal
        I3      Internal
        T1      3' flanking
        T2      3' flanking

   where F1, F2, I1, I2, I3, T1 and T2 are fragment names. The
   structure means that I1 ,I2 and I3 are from two copies of
   a repetitive element, F1 and F2 flank the two copies at their
   5' end, T1 and T2 flank them at their 3' end.
   CAP2 produces the two copies in the final sequence by
   resolving the ambiguities in the repeat structure.

   CAP2 is efficient in computer memory: a large number of DNA 
   fragments can be assembled. The time requirement is acceptable;
   for example, CAP2 took 1.5 hours to assemble 829 fragments of a total
   of 393 kb nucleotides into a single contig on a Sun SPARC 5.
   The program is written in C and runs on Sun workstations.

   The CAP2 program can be run with the -r option. If this option
   is specified, then the program identifies chimeric fragments,
   reports repeat structures and resolves them.
   Otherwise, these tasks are not performed.

   Large integer values should be used for MATCH, MISMAT, EXTEND.

   The comments given above are for CAP2. Written on Feb. 11, 95. 

   Acknowledgements
     
      Kathryn Beal found a bug in the Filter procedure.
      The array elen was not always initialized.

   Below is a description of the parameters in the #define section of CAP.
   Two specially chosen sets of substitution scores and indel penalties
   are used by the dynamic programming algorithm: heavy set for regions
   of low sequencing error rates and light set for fragment ends of high
   sequencing error rates. (Use integers only.)

        Heavy set:                       Light set:

        MATCH     =  2                   MATCH     =  2
        MISMAT    = -6                   LTMISM    = -3
        EXTEND    =  4                   LTEXTEN   =  2

    In the initial assembly, any overlap must be of length at least OVERLEN,
    and any overlap/containment must be of identity percentage at least
    PERCENT. After the initial assembly, the program attempts to join
    contigs together using weak overlaps. Two contigs are merged if the
    score of the overlapping alignment is at least CUTOFF. The value for
    CUTOFF is chosen according to the value for MATCH.

    POS5 and POS3 are fragment positions such that the 5' end between base 1
    and base POS5, and the 3' end after base POS3 are of high sequencing
    error rates, say more than 5%. For mismatches and indels occurring in
    the two ends, light penalties are used.

    Acknowledgments
     The function diff() of Gene Myers is modified and used here.

    A file of input fragments looks like:

>G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG
GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT
AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT
ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT
TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
>G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC
TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC
ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT
TCCCCATCCCATCAGTCT
>G022uabh
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG
TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC
CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA
GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC
ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT
GATTGAT
>G023uabh
AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA
ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA
AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC
GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT
GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT
TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT
ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC
>G006uaah
ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG
CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT
AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG
GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT
GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC

   A string after ">" is the name of the following fragment.
   Only the five upper-case letters A, C, G, T and N are allowed
   to appear in fragment data. No other characters are allowed.
   A common mistake is the use of lower case letters in a fragment.

   To run the program, type a command of form

        cap2 file_of_filenames [-r]

   The output goes to the terminal screen. So redirection of the
   output into a file is necessary. The output consists of three parts:
   overview of contigs at fragment level, detailed display of contigs
   at nucleotide level, and consensus sequences.
   The output of CAP on the sample input data looks like:

'+' = direct orientation; '-' = reverse complement

OVERLAPS            CONTAINMENTS

******************* Contig 1 ********************
G022uabh+
G019uabh+
                    G028uaah+ is in G019uabh+
G023uabh-
                    G006uaah- is in G023uabh-

DETAILED DISPLAY OF CONTIGS
******************* Contig 1 ********************
                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
            ____________________________________________________________
consensus   TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
            ____________________________________________________________
consensus   GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
G019uabh+   ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
G028uaah+                            CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
            ____________________________________________________________
consensus   ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
G019uabh+   AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
G028uaah+   AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
            ____________________________________________________________
consensus   AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   ATTGATTGATTGATTGAT                                          
G019uabh+   ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
G028uaah+   ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
            ____________________________________________________________
consensus   ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
G028uaah+   AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
            ____________________________________________________________
consensus   AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
G028uaah+   GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCC-ATCAGTCT     
            ____________________________________________________________
consensus   GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
G023uabh-         GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGC
            ____________________________________________________________
consensus   AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC         
G023uabh-   AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
G006uaah-                   GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
            ____________________________________________________________
consensus   AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
G006uaah-   TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
            ____________________________________________________________
consensus   TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
G006uaah-   AGTTTTTTTCCTATTTGTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
            ____________________________________________________________
consensus   AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
G006uaah-   AAAAAGCATGACATCTTAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAA
            ____________________________________________________________
consensus   AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
G006uaah-   AACAGTTTATTTTATGT                                       
            ____________________________________________________________
consensus   AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT

CONSENSUS SEQUENCES
>Contig 1
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
*/

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_87.html