This mode of assembly uses the global assembly program CAP2, developed by Xiaoqiu Huang. Huang, X. An improved sequence assembly program. Genomics 33, 21-31 (1996).
The CAP2 program can be accessed via the Gap4 interface through the "Assembly" menu or as a stand alone program.
The CAP2 program, for use with Gap4, must be obtained via ftp from the author, Xiaoqiu Huang.
Email Xiaoqiu Huang (huang@cs.mtu.edu) stating that you want CAP2 for
use with gap4 and the operating system for which you need the program
(one of: SunOS 4.1.1; Solaris 2.4; DEC OSF/1 V3.0 and Digital Unix; Irix
5.3). He will then contact you to arrange for the retrieval of the
binary file. The binary file is called cap2_s. Make this executable (eg
chmod a+x cap2_s) and move it to the directory
$STADENROOT/$MACHINE-bin
. The CAP2 options on the "Assembly" menu
should now be available.
The assembly works on either a file or list of reading names in experiment file format (see section Experiment File). CAP2 assembles the readings and the alignments are written to the output window. Irrespective of the original file format, new reading files are written in the destination directory in experiment file format. If the destination directory does not already exist, then it is created. These new files contain the additional information required to recreate the same assembly within Gap4. This is done by the addition of an AP line. See section Directed Assembly.
It is also possible to tell the program to identify chimeric fragments, report repeat structures and resolve them by setting the "Find repeats/chimerics" radiobutton to "Yes". If this is set to "No", these tasks are not performed. At the present time, CAP2 can only resolve direct repeats and not reverse repeats.
This mode imports the aligned sequences produced after CAP2 assembly into Gap4 and maintains the same alignment. Importing the files requires the directory containing the newly aligned readings, ie the destination directory used in "Perform CAP2 assembly". Readings which are not entered are written to a "list" or "file" specified in the "Save failures" entry box. This mode is functionally equivalent to "Directed assembly". See section Directed Assembly.
This mode performs both the assembly section Perform CAP2 assembly and the import section Import CAP2 assembly together. The assembled readings are written to the destination directory and then are automatically imported from this directory into Gap4.
The program can be alternatively accessed as a stand alone program with the following command line arguments
cap2_s -{format} file_of_filenames [-r] [-out destination_directory]
{format} is the file format of the file of filenames and is either in experiment file format or fasta format. Legal inputs are exp, EXP, fasta or FASTA.
file_of_filenames is the name of the file containing the reading names to be assembled for experiment files or a single file of readings in fasta format.
destination_directory is the name of a directory to which the new experiment files are written to. The default directory is "assemble".
-r is optional and is equivalent to the "Find repeats/chimerics" option above.
The comments provided with CAP2 by Huang are detailed below.
copyright (c) 1995-96 Xiaoqiu Huang and Michigan Technological University No part of this program may be distributed without prior written permission of the author. Xiaoqiu Huang Department of Computer Science Michigan Technological University Houghton, MI 49931 E-mail: huang@cs.mtu.edu Proper attribution of the author as the source of the software would be appreciated: Huang, X. (1996) An Improved Sequence Assembly Program Genomics, 33:21-31. The CAP2 program assembles short DNA fragments into long sequences. CAP2 contains a number of improvements to the original version described in Genomics 14, pages 18-25, 1992. These improvements are: o Use of a more efficient filter for quickly detecting pairs of fragments that could not overlap. o Accurate evaluation of overlap strengths through the use of internally generated fragment-specific confidence vectors. o Identification of fragments from repetitive sequences and resolution of ambiguities in assembly of those fragments. o Identification of chimeric fragments. o Automated refinement of poorly aligned regions of fragment alignments A chimeric fragment is made of two short pieces from non-adjacent regions of the DNA molecule. CAP2 may report a repeat structure like:
F1 5' flanking F2 5' flanking I1 Internal I2 Internal I3 Internal T1 3' flanking T2 3' flanking
where F1, F2, I1, I2, I3, T1 and T2 are fragment names. The structure means that I1 ,I2 and I3 are from two copies of a repetitive element, F1 and F2 flank the two copies at their 5' end, T1 and T2 flank them at their 3' end. CAP2 produces the two copies in the final sequence by resolving the ambiguities in the repeat structure. CAP2 is efficient in computer memory: a large number of DNA fragments can be assembled. The time requirement is acceptable; for example, CAP2 took 1.5 hours to assemble 829 fragments of a total of 393 kb nucleotides into a single contig on a Sun SPARC 5. The program is written in C and runs on Sun workstations. The CAP2 program can be run with the -r option. If this option is specified, then the program identifies chimeric fragments, reports repeat structures and resolves them. Otherwise, these tasks are not performed. Large integer values should be used for MATCH, MISMAT, EXTEND. The comments given above are for CAP2. Written on Feb. 11, 95. Acknowledgements Kathryn Beal found a bug in the Filter procedure. The array elen was not always initialized. Below is a description of the parameters in the #define section of CAP. Two specially chosen sets of substitution scores and indel penalties are used by the dynamic programming algorithm: heavy set for regions of low sequencing error rates and light set for fragment ends of high sequencing error rates. (Use integers only.)
Heavy set: Light set: MATCH = 2 MATCH = 2 MISMAT = -6 LTMISM = -3 EXTEND = 4 LTEXTEN = 2
In the initial assembly, any overlap must be of length at least OVERLEN, and any overlap/containment must be of identity percentage at least PERCENT. After the initial assembly, the program attempts to join contigs together using weak overlaps. Two contigs are merged if the score of the overlapping alignment is at least CUTOFF. The value for CUTOFF is chosen according to the value for MATCH. POS5 and POS3 are fragment positions such that the 5' end between base 1 and base POS5, and the 3' end after base POS3 are of high sequencing error rates, say more than 5%. For mismatches and indels occurring in the two ends, light penalties are used. Acknowledgments The function diff() of Gene Myers is modified and used here. A file of input fragments looks like:
>G019uabh ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC >G028uaah CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT TCCCCATCCCATCAGTCT >G022uabh TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT GATTGAT >G023uabh AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC >G006uaah ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC
A string after ">" is the name of the following fragment. Only the five upper-case letters A, C, G, T and N are allowed to appear in fragment data. No other characters are allowed. A common mistake is the use of lower case letters in a fragment. To run the program, type a command of form cap2 file_of_filenames [-r] The output goes to the terminal screen. So redirection of the output into a file is necessary. The output consists of three parts: overview of contigs at fragment level, detailed display of contigs at nucleotide level, and consensus sequences. The output of CAP on the sample input data looks like: '+' = direct orientation; '-' = reverse complement
OVERLAPS CONTAINMENTS ******************* Contig 1 ******************** G022uabh+ G019uabh+ G028uaah+ is in G019uabh+ G023uabh- G006uaah- is in G023uabh- DETAILED DISPLAY OF CONTIGS ******************* Contig 1 ******************** . : . : . : . : . : . : G022uabh+ TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG ____________________________________________________________ consensus TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG . : . : . : . : . : . : G022uabh+ GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC ____________________________________________________________ consensus GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC . : . : . : . : . : . : G022uabh+ ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG G019uabh+ ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG G028uaah+ CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG ____________________________________________________________ consensus ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG . : . : . : . : . : . : G022uabh+ AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG G019uabh+ AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG G028uaah+ AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG ____________________________________________________________ consensus AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG . : . : . : . : . : . : G022uabh+ ATTGATTGATTGATTGAT G019uabh+ ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC G028uaah+ ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC ____________________________________________________________ consensus ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC . : . : . : . : . : . : G019uabh+ AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT G028uaah+ AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT ____________________________________________________________ consensus AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT . : . : . : . : . : . : G019uabh+ GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC G028uaah+ GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCC-ATCAGTCT ____________________________________________________________ consensus GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC . : . : . : . : . : . : G019uabh+ AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC G023uabh- GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGC ____________________________________________________________ consensus AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC . : . : . : . : . : . : G019uabh+ AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC G023uabh- AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT G006uaah- GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT ____________________________________________________________ consensus AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT . : . : . : . : . : . : G023uabh- TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT G006uaah- TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT ____________________________________________________________ consensus TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT . : . : . : . : . : . : G023uabh- AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT G006uaah- AGTTTTTTTCCTATTTGTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGAT ____________________________________________________________ consensus AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT . : . : . : . : . : . : G023uabh- AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA G006uaah- AAAAAGCATGACATCTTAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAA ____________________________________________________________ consensus AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA . : . : . : . : . : . : G023uabh- AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT G006uaah- AACAGTTTATTTTATGT ____________________________________________________________ consensus AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT CONSENSUS SEQUENCES >Contig 1 TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG AATTAAAGACTTGTTTAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTG ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT */