Gap4 - Assembly-Tips

General Comments and Tips on Assembly

The program has several methods for assembly and it may not be obvious which is most appropriate for a given problem. The following notes may help. They also contain information on methods for checking the correctness of an assembly.

If you have access to an external program that can generate the order and approximate positions of readings then Directed Assembly can be used. The same is true if the experimental method used generates an ordered set of readings (see section Directed Assembly).

If you have access to an external global assembly program that can produce an assembly and write out correct experiment files, then Directed Assembly can still be used by specifying a "tolerance" of -1 (in the experiment file AP lines).
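As an illustration of the point above, the following Python sketch writes AP lines with a tolerance of -1 for readings whose positions are already known from an external assembler. It assumes the AP record layout is "AP", anchor reading name, sense (+ or -), offset, tolerance; check this against the Directed Assembly and experiment file documentation before relying on it. The function name ap_line and the reading names are hypothetical.

```python
def ap_line(anchor, sense, offset, tolerance=-1):
    """Format one AP record.

    ASSUMPTION: the record layout is 'AP   anchor sense offset tolerance'.
    A tolerance of -1 is the value the text gives for accepting the
    position supplied by an external assembly program.
    """
    if sense not in ("+", "-"):
        raise ValueError("sense must be '+' or '-'")
    return f"AP   {anchor} {sense} {int(offset)} {int(tolerance)}"


# Hypothetical placements: (sense, offset relative to the anchor reading).
placements = [("+", 150), ("-", 420)]
records = [ap_line("read_a", sense, offset) for sense, offset in placements]
for record in records:
    print(record)
```

Each generated line would be added to the experiment file of the reading being placed, not to that of the anchor reading.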

For routine shotgun assembly of whole data-sets or incremental data-sets, Normal Shotgun Assembly can be used. Through the idea of "Masked assembly" this option can also restrict the assembly to particular regions of the consensus (see section Normal shotgun assembly).

Note that the following options may require the parameter maxseq to be set beforehand (see section Set Maxseq): Normal shotgun assembly (see section Normal Shotgun Assembly), Assemble independently (see section Assembly Independently), Assembly into single stranded regions (see section Assembly Single), Screen only (see section Screen Only), and Put all readings in separate contigs (see section Assembly new). This parameter defines the maximum length of consensus that can be created. If you find that the assembly process is only entering the first few hundred of a batch of readings, try increasing maxseq.

If you have a batch of readings that are known to overlap one another, but which, due to repeats, may also match other places in the consensus, then it can be helpful to use Assemble Independently. This ensures that the readings in the batch are compared only to one another, and hence are not assembled into the wrong places (see section Assemble independently).

Almost all readings are assembled automatically in their first pass through the assembly routine. Those that are not can be dealt with in two ways. Either they can be put through assembly again with less stringent parameters, or they can be entered using the "Put all readings in new contigs" routine and then joined to the contig they overlap using Find Internal Joins (see section Find Internal Joins). If readings are not being assembled in their first pass through the assembler, then it is likely that the contigs require some editing to improve the consensus. It may also be that poor quality data is being used, possibly because users are over-interpreting films or traces. In the long term it can be more efficient to stop reading early and save time on editing. For those using fluorescent sequencing machines, the unused data can be incorporated after assembly using the Contig Editor and Double Strand.

An independent and important check on assembly is obtained by sequencing both ends of templates. Provided the correct information is given in the experiment files, gap can check the positions and orientations of readings from the same template (see section Find read pairs). Any inconsistencies are shown both textually and graphically. In addition, this information can be used to find possible joins between contigs.
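The kind of check described above can be sketched in Python as follows. This is only an illustration of the idea, not the algorithm used by Find read pairs: the Reading record, the check_pair function, the result labels and the insert-size limits are all hypothetical. It assumes that two readings from opposite ends of one template should lie on opposite strands, point towards each other, and span a distance within the expected template size range.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    name: str
    template: str
    contig: str
    start: int     # leftmost base position in the contig
    end: int       # rightmost base position in the contig
    forward: bool  # True if the reading lies on the forward strand


def check_pair(a, b, min_size, max_size):
    """Classify a pair of readings taken from the same template."""
    if a.template != b.template:
        raise ValueError("readings are not from the same template")
    if a.contig != b.contig:
        # Ends in different contigs suggest a possible join.
        return "possible join"
    if a.forward == b.forward:
        return "inconsistent orientation"
    fwd, rev = (a, b) if a.forward else (b, a)
    if fwd.start > rev.end:
        # The two readings point away from each other.
        return "inconsistent orientation"
    span = rev.end - fwd.start + 1
    if not (min_size <= span <= max_size):
        return "inconsistent distance"
    return "consistent"


r1 = Reading("t1.f", "t1", "c1", 100, 500, True)
r2 = Reading("t1.r", "t1", "c1", 1600, 2000, False)
print(check_pair(r1, r2, 1000, 3000))
```

Pairs reported as "possible join" correspond to the last point in the paragraph above: templates whose two ends fall in different contigs indicate contigs that may be neighbours.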

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_91.html