Gap4 - Assembly-CAP3

Assembly CAP3

This mode of assembly uses the global assembly program CAP3, developed by Xiaoqiu Huang. Huang, X. DNA Sequence Assembly under Forward-Reverse Constraints. In preparation. (1998).

The CAP3 program can be accessed via the Gap4 interface through the "Assembly" menu or as a stand alone program.

The CAP3 files for use with Gap4 must be obtained via ftp from the author, Xiaoqiu Huang.

Email Xiaoqiu Huang (huang@mtu.edu) stating that you want CAP3 for use with gap4 and the operating system for which you need the program (one of: Solaris 2; Digital Unix; SGI Irix; linux x86). He will then contact you to arrange for the retrieval of the binary files. The binary files are called cap3_s and cap3_create_exp_constraints. Make these executable (eg chmod a+x cap3_s) and move them to the directory $STADENROOT/$MACHINE-bin. The CAP3 options on the "Assembly" menu should now be available.

Perform CAP3 assembly

[picture]

The assembly works on either a file or list of reading names in experiment file format (see section Experiment File). CAP3 assembles the readings and the alignments are written to the output window. New reading files are written in the destination directory in experiment file format. If the destination directory does not already exist, then it is created. These new files contain the additional information required to recreate the same assembly within Gap4. This is done by the addition of an AP line. See section Directed Assembly.

CAP3 uses forward-reverse constraints to correct errors in assembly of reads. The constraints file is generated automatically using the information in the experiment files by setting the "Use constraint file" radiobutton to "Yes". The constraints file is named after the input file with the addition of ".con" ie if the input file is called fofn, the constraint file is called fofn.con. Note that if the "Use constraint file" is set to "No", then any files of the format input_file.con will be deleted from the current directory. For further details, see section Further details about CAP3.

CAP3 also can use quality values to determine the consensus sequence. If the quality values are present in the experiment files, then they are automatically used. For further details, see section Further details about CAP3.

Import CAP3 assembly

This mode imports the aligned sequences produced after CAP3 assembly into Gap4 and maintains the same alignment. Importing the files requires the directory containing the newly aligned readings, ie the destination directory used in "Perform CAP3 assembly". Readings which are not entered are written to a "list" or "file" specified in the "Save failures" entry box. This mode is functionally equivalent to "Directed assembly". See section Directed Assembly.

Perform and import CAP3 assembly

This mode performs both the assembly, see section Perform CAP3 assembly and the import, see section Import CAP3 assembly together. The assembled readings are written to the destination directory and then are automatically imported from this directory into Gap4.

Stand alone CAP3 assembly

The program can be alternatively accessed as a stand alone program with the following command line arguments

cap3_s -format file_of_filenames [-out destination_directory]

format is the file format of the file of filenames and is either in experiment file format or fasta format. Legal inputs are exp, EXP, fasta or FASTA.

file_of_filenames is the name of the file containing the reading names to be assembled for experiment files or a single file of readings in fasta format.

destination_directory is the name of a directory to which the new experiment files are written to. The default directory is "assemble".

To use forward-reverse reading constraints, an appropriate file_of_filenames.con file must exist in the current directory. This file can be created from experiment files using the program:

cap3_create_exp_constraints file_of_filenames

where file_of_filenames is the same file as used for cap3_s. For fasta files, the constraint file is created using the program:

formcon File_of_Reads Min_Distance Max_Distance

See below for more information.

If quality values are present in the experiment files, then these will be used automatically. For fasta files, the quality values must be in a separate file of the type file_of_filenames.qual. See below for more information.

Further details about CAP3

The comments provided with CAP3 by Huang are detailed below.

CONTIG ASSEMBLY PROGRAM Version 3 (CAP3)

     Xiaoqiu Huang
     Department of Computer Science
     Michigan Technological University
     Houghton, MI 49931
     E-mail: huang@cs.mtu.edu

Proper attribution of the author as the source of the software would be appreciated:

     Huang, X. (1998)
     DNA Sequence Assembly under Forward-Reverse Constraints.
     In preparation.

CAP3 uses forward-reverse constraints to correct errors in assembly of reads. CAP3 works better if a lot more constraints are used. If the file of sequence reads in FASTA format is named "xyz", then the file of forward-reverse constraints must be named "xyz.con". Each line of the constraint file specifies one forward-reverse constraint of the form:

ReadA   ReadB    MinimumDistance    MaximumDistance

where ReadA and ReadB are names of two reads, and MinimumDistance and MaximumDistance are distances (integers) in base pairs. The constraint is satisfied if ReadA in forward orientation occurs in a contig before ReadB in reverse orientation, or ReadB in forward orientation occurs in a contig before ReadA in reverse orientation, and their distance is between MinimumDistance and MaximumDistance. We have a separate program to generate a constraint file from the sequence file.

The program reports whether each constraint is satisfied or not. The report is in file `xyz.con.results'. A sample report file is given here:

CPBKY55F  CPBKY55R  500  6000  3210  satisfied
CPBKY92F  CPBKY92R  500  6000  497   unsatisfied in distance
CPBKY28F  CPBKY28R  500  6000   unsatisfied
CPBKY56F  CPBKY56R  500  6000   10th link between CPBKI23F+ and CPBKT37R-

The first four columns are simply taken from the constraint file.

Line 1 indicates that the constraint is satisfied, where the actual distance between the two reads is given on the fifth column.

Line 2 indicates that the constraint is not satisfied in distance, that is, the two reads in opposite orientation occur in the same contig, but their distance (given on the fifth column) is out of the given range.

Line 3 indicates that the constraint is not satisfied.

Line 4 indicates that this constraint is the 10th one that links two contigs, where the 3' read of one contig is CPBKI23F in plus orientation and the 5' read of the other is CPBKT37R in minus orientation. The information suggests that the two contigs should go together in the gap closure phase. Information about corrections made using constraints is reported in file named `.info'.

A feature to use quality values in determination of consensus sequences has been added. The file of quality values must be named `xyz.qual', where `xyz' is the name of the sequence file. Only the sequence file is given as an argument to the program. All the other input files must be in the same directory. CAP3 uses the same format of a quality file as Phrap. The quality values of contig consensuses are given in file `xyz.contigs.qual'. The results of CAP3 go to the standand output.

CAP3 also uses a more effective filter to speed up overlap computation.

CAP3 assumes that the low-quality ends of sequence reads have been trimmed. Otherwise, CAP3 may not work well. We have a separate program to trim low-quality ends and to produce a corresponging Phred quality file. If you need this program, please let us know. We plan to remove this assumption in the future.

The CAP3 program consists of two C source files: `cap3.c' and `filter.c'. To produce the executable code named cap3, use the command:

cc -O  cap3.c filter.c -o cap3

The usage is:

cap3  File_of_Reads  >  output

The file `output' contains the output of CAP3.

The features given above are new in CAP3. Below is for CAP2.

The CAP2 program assembles short DNA fragments into long sequences. CAP2 contains a number of improvements to the original version described in Genomics 14, pages 18-25, 1992. These improvements are:

Use of a more efficient filter for quickly detecting pairs of fragments that could not overlap.
Accurate evaluation of overlap strengths through the use of internally generated fragment-specific confidence vectors.
Identification of fragments from repetitive sequences and resolution of ambiguities in assembly of those fragments.
Identification of chimeric fragments.
Automated refinement of poorly aligned regions of fragment alignments

A chimeric fragment is made of two short pieces from non-adjacent regions of the DNA molecule. CAP2 may report a repeat structure like:

F1	5' flanking
F2	5' flanking
I1	Internal
I2	Internal
I3	Internal
T1	3' flanking
T2	3' flanking

where F1, F2, I1, I2, I3, T1 and T2 are fragment names. The structure means that I1 ,I2 and I3 are from two copies of a repetitive element, F1 and F2 flank the two copies at their 5' end, T1 and T2 flank them at their 3' end. CAP2 produces the two copies in the final sequence by resolving the ambiguities in the repeat structure.

CAP2 is efficient in computer memory: a large number of DNA fragments can be assembled. The time requirement is acceptable; for example, CAP2 took 1.5 hours to assemble 829 fragments of a total of 393 kb nucleotides into a single contig on a Sun SPARC 5. The program is written in C and runs on Sun workstations.

The CAP2 program can be run with the -r option. If this option is specified, then the program identifies chimeric fragments, reports repeat structures and resolves them. Otherwise, these tasks are not performed.

Large integer values should be used for MATCH, MISMAT, EXTEND.

The comments given above are for CAP2. Written on Feb. 11, 95.

Acknowledgements
  
   I thank Gene Spier for finding a problem with quality values for
   reverse complements.

Below is a description of the parameters in the #define section of CAP. Two specially chosen sets of substitution scores and indel penalties are used by the dynamic programming algorithm: heavy set for regions of low sequencing error rates and light set for fragment ends of high sequencing error rates. (Use integers only.)

	Heavy set:			 Light set:

	MATCH     =  2			 MATCH     =  2
	MISMAT    = -6			 LTMISM    = -3
	EXTEND    =  4			 LTEXTEN   =  2

In the initial assembly, any overlap must be of length at least OVERLEN, and any overlap/containment must be of identity percentage at least PERCENT. After the initial assembly, the program attempts to join contigs together using weak overlaps. Two contigs are merged if the score of the overlapping alignment is at least CUTOFF. The value for CUTOFF is chosen according to the value for MATCH.

POS5 and POS3 are fragment positions such that the 5' end between base 1 and base POS5, and the 3' end after base POS3 are of high sequencing error rates, say more than 5%. For mismatches and indels occurring in the two ends, light penalties are used.

Acknowledgments
   The function diff() of Gene Myers is modified and used here.

A file of input fragments looks like:

>G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG
GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT
AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT
ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT
TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
>G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC
TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC
ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT
TCCCCATCCCATCAGTCT
>G022uabh
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG
TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC
CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA
GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC
ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT
GATTGAT
>G023uabh
AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA
ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA
AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC
GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT
GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT
TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT
ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC
>G006uaah
ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG
CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT
AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG
GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT
GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC

A string after ">" is the name of the following fragment. Only the five upper-case letters A, C, G, T and N are allowed to appear in fragment data. No other characters are allowed. A common mistake is the use of lower case letters in a fragment.

To run the program, type a command of form

cap  file_of_fragments

The output goes to the terminal screen. So redirection of the output into a file is necessary. The output consists of three parts: overview of contigs at fragment level, detailed display of contigs at nucleotide level, and consensus sequences. The output of CAP on the sample input data looks like:

'+' = direct orientation; '-' = reverse complement

OVERLAPS            CONTAINMENTS

******************* Contig 1 ********************
G022uabh+
G019uabh+
                    G028uaah+ is in G019uabh+
G023uabh-
                    G006uaah- is in G023uabh-

DETAILED DISPLAY OF CONTIGS
******************* Contig 1 ********************
                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
            ____________________________________________________________
consensus   TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
            ____________________________________________________________
consensus   GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
G019uabh+   ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
G028uaah+                            CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
            ____________________________________________________________
consensus   ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
G019uabh+   AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
G028uaah+   AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
            ____________________________________________________________
consensus   AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG

                .    :    .    :    .    :    .    :    .    :    .    :
G022uabh+   ATTGATTGATTGATTGAT                                          
G019uabh+   ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
G028uaah+   ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
            ____________________________________________________________
consensus   ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
G028uaah+   AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
            ____________________________________________________________
consensus   AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
G028uaah+   GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCC-ATCAGTCT     
            ____________________________________________________________
consensus   GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
G023uabh-         GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGC
            ____________________________________________________________
consensus   AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC

                .    :    .    :    .    :    .    :    .    :    .    :
G019uabh+   AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC         
G023uabh-   AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
G006uaah-                   GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
            ____________________________________________________________
consensus   AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
G006uaah-   TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
            ____________________________________________________________
consensus   TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
G006uaah-   AGTTTTTTTCCTATTTGTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
            ____________________________________________________________
consensus   AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
G006uaah-   AAAAAGCATGACATCTTAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAA
            ____________________________________________________________
consensus   AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA

                .    :    .    :    .    :    .    :    .    :    .    :
G023uabh-   AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
G006uaah-   AACAGTTTATTTTATGT                                       
            ____________________________________________________________
consensus   AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT

CONSENSUS SEQUENCES
>Contig 1
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_88.html