For most assembly engines to work well, they must be given data that is of good quality and contains only the target sequence. One pre-assembly task is to locate and mark all segments of readings that contain sequence from the vectors used in their production. In our package this task is performed by vector_clip, which compares batches of readings against vector sequences. Sequence readings are stored in experiment file format (see section Experiment File) and, for the majority of projects, each experiment file should contain the data required by vector_clip: the file names of the vectors to screen against and, for the sequencing vector, the positions of the cloning and primer sites. See section Vector_Primer file format for an alternative and simpler method of defining vector data for vector_clip. The program pregap4 (see section Pregap4) contains modules for creating experiment files from trace files, and for adding data about the vectors used. When vector_clip runs, it adds records to the reading's experiment file to denote the start and end of any segments that are found to match the vectors.
For conventional sequencing projects there are two types of vector for which readings will need to be screened: the sequencing vector and, for cases where, say, whole cosmids or BACs have been shotgunned, the cloning vector. These two screening tasks are different. When screening for the sequencing vector we may expect to find data to exclude, both from the primer region and, when the insert is short, from the other side of the cloning site. It is also a wise precaution to check for rearrangements of the sequencing vector. When screening out cosmid vector we may find that the 5' end, the 3' end, or the whole of the sequence is vector. In addition, for the cloning vector search we need to compare both strands of the sequence.
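The need to compare both strands in the cloning vector search can be illustrated with a minimal sketch. It is not vector_clip's algorithm: it assumes a plain exact match of short words (k-mers), and the function names and the word length `k` are hypothetical choices made only for this example.

```python
# Illustrative sketch only: exact k-mer lookup, not the alignment method
# that vector_clip itself uses. All names here are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_vector_hit(reading, vector, k=8):
    """Return (strand, offset) for the first k-mer of `reading` that occurs
    in `vector` ("+") or in its reverse complement ("-"), or None."""
    reading = reading.upper()
    strands = {
        "+": vector.upper(),
        "-": reverse_complement(vector).upper(),
    }
    for offset in range(len(reading) - k + 1):
        kmer = reading[offset:offset + k]
        for strand, target in strands.items():
            if kmer in target:
                return strand, offset
    return None
```

A reading that matches only the reverse complement of the vector would be missed by a single-strand search, which is why both orientations are checked here.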
To filter out readings that contain the sequences of contaminant DNA, such as E. coli, a separate program, screen_seq, should be used (see section Screening for known possible contaminant sequences).
A further type of search is required for a new method that is being developed at MRC HGMP, Hinxton, UK. This new method (M. Starkey, personal communication) is an application of a technology described as "molecular indexing" (Unrau, P. and Deugau, K.V. (1994) Non-cloning amplification of specific DNA fragments from whole genomic DNA digests using DNA indexers. Gene 145, 163-169). It produces sequences with a primer at their 3' ends, which needs to be found and removed.
Some groups are using transposons to produce random start points for sequencing reactions, and vector_clip contains an experimental search procedure for dealing with the data generated by such methods.
Vector_clip is usually run as part of the pregap4 process (see section Pregap4) and will usually be called three times: the first to locate and mark the sequencing vector; next to check for vector rearrangements; and finally to locate and mark cosmid vector segments.
Vector_clip operates on batches of readings using files of file names: one input file and two output files - one for the names of the readings that pass and one for those that fail. The program also modifies the reading files.
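The file-of-file-names convention described above can be sketched as follows. This is only an illustration of the bookkeeping (one input list, two output lists), assuming one reading name per line; the function name `split_readings` and the `screen` callback are hypothetical and stand in for whatever test decides that a reading passes.

```python
from pathlib import Path
from typing import Callable, Tuple


def split_readings(fofn: Path, passed: Path, failed: Path,
                   screen: Callable[[str], bool]) -> Tuple[int, int]:
    """Read reading names from `fofn` (assumed one per line) and write each
    name to `passed` or `failed` according to the hypothetical `screen`
    predicate. Returns (number passed, number failed)."""
    names = [line.strip() for line in fofn.read_text().splitlines()
             if line.strip()]
    ok = [name for name in names if screen(name)]
    bad = [name for name in names if not screen(name)]
    passed.write_text("".join(name + "\n" for name in ok))
    failed.write_text("".join(name + "\n" for name in bad))
    return len(ok), len(bad)
```

Keeping the two output lists as plain files of file names means that each can be passed directly as the input list of a later processing step.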
In earlier versions of vector_clip all the information needed about the vector for each reading (i.e. its name, location on disk, and the cloning and primer sites used) was expected to be stored in the reading's experiment file (see section Experiment File) but, as is explained in the next paragraph, the newest version employs an alternative method for providing data about sequencing vectors. For notes on defining the cloning and primer sites, see section Defining Cloning and Primer Sites for Vector_Clip.
The 1999.0 release of the package contained an experimental new method of providing vector_clip with data about the vectors to search for. Using feedback from the trial period we have simplified the method and improved the algorithm.
The new method uses files containing not the complete vector sequences, but the segments of sequence between the primers and the cloning site. These files are termed "vector_primer" files (see section Vector_Primer file format), and the vector_primer mode of vector_clip uses the data in these files to search for the vectors.
The vector_primer file can contain the data for up to (at present) 100 vector and primer combinations, although it would not be efficient to compare each reading against an unnecessarily large number of records. When vector_clip finds a match to one of the vectors defined in the vector_primer file, it not only marks the matching segment in the reading, but also adds the name of the file containing the vector sequence and the primer type to the reading's experiment file. The vector file name can then be used by vector_clip in its search for vector rearrangements, and the primer type can be used by gap4 in its analysis of read pairs. (Note, however, that for read pair analysis gap4 still needs to know which readings came from the same template, so that data must be added to the experiment file in some other way.)
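As a rough illustration of this bookkeeping, the sketch below models a vector_primer record and the data attached to a reading when a record matches. Every field name, the dictionary keys, and the exact-match search are assumptions made for the example; the real record layout is the one given in section Vector_Primer file format, and vector_clip's own matching is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_RECORDS = 100  # limit quoted in the text ("at present")


@dataclass(frozen=True)
class VectorPrimerRecord:
    # Hypothetical field names, chosen only for this sketch.
    name: str          # label for the vector and primer combination
    segment: str       # sequence between the primer and the cloning site
    vector_file: str   # file holding the complete vector sequence
    primer_type: str   # e.g. "forward" or "reverse" (assumed values)


def annotate_reading(reading: str,
                     records: List[VectorPrimerRecord]
                     ) -> Optional[Dict[str, object]]:
    """Return the marked segment plus vector file and primer type for the
    first record whose segment occurs in the reading, or None."""
    if len(records) > MAX_RECORDS:
        raise ValueError("too many vector_primer records")
    upper = reading.upper()
    for rec in records:
        start = upper.find(rec.segment.upper())  # exact match: a simplification
        if start != -1:
            return {
                "start": start,
                "end": start + len(rec.segment),
                "vector_file": rec.vector_file,
                "primer_type": rec.primer_type,
            }
    return None
```

Because every reading is compared against every record until one matches, the cost grows with the number of records, which is the reason for keeping the file no larger than necessary.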
A big advantage of the vector_primer file method is that it simplifies the task of providing vector_clip with data. In addition, the task of creating the vector_primer files is simplified in that the -V option in vector_clip removes the necessity for the records in the vector_primer file to contain precisely the sequence between the primer and the cloning site (see section Vector_Primer File Notes).
Vector_primer files are also used by the search for transposon data.
If setting up these programs seems a little daunting, it is important to realise that the majority of users need not concern themselves with the details of vector_clip and the creation of experiment files for their readings. For those who do, these configuration operations are performed only once per project, and are made relatively easy by the use of pregap4.