screen_seq -- filters out sequence readings containing contaminating DNA
screen_seq -[lcwmiIsSpft]
    [-l Length of minimum match (25)]
    [-m Maximum vector length (100000)]
    [-i Input file of reading file names]
    [-I Input file of single reading to screen]
    [-s Input file of sequence file names]
    [-S Input file of single sequence to screen against]
    [-p Passed output file of file names]
    [-f Failed output file of file names]
    [-t Test only mode]
screen_seq searches sequence readings to filter out those from
extraneous DNA such as vector or bacterial sequences. We have separated
this task from that of locating and marking the extents of sequencing
vector and other cloning vectors, which requires precise identification
of the junction between the vectors and the target DNA. The filtering
process described here is designed to spot strong matches between
readings and a panel of possible contaminating sequences, and it splits
readings into passes and fails. Readings that fail have a PS line
containing the word "contaminant" and a tag of type "CONT" added to
their experiment file.
Normal usage would be to compare a batch of readings in experiment file format against a batch of possible contaminant sequences stored in (at present) simple text files. Each batch is presented to the program as a file of file names, and the program will write out two new files of file names: one containing the names of the files that do not match any of the contaminant sequences (the passes), and the other those that do match (the fails). It is also possible to compare single readings and single contaminant files by giving their file names (i.e. it is not necessary to use a file of file names for single files).
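The bookkeeping described above can be pictured with a short Python sketch. It is not part of screen_seq and does not reproduce its code. It assumes a file of file names holds one name per line, and the reading names other than xpg33.g1 are invented for illustration. The decision about whether a reading matches is left to a caller-supplied function.

```python
def parse_fofn(text):
    """Return the file names listed in a file of file names.

    Assumption: one name per line; blank lines are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_batch(names, matches_contaminant):
    """Partition names into (passed, failed) using a caller-supplied test."""
    passed, failed = [], []
    for name in names:
        (failed if matches_contaminant(name) else passed).append(name)
    return passed, failed


def format_fofn(names):
    """Render names as file-of-file-names text, one name per line."""
    return "".join(name + "\n" for name in names)


if __name__ == "__main__":
    # Hypothetical batch of reading names.
    readings = parse_fofn("xpg33.g1\nxpg34.g1\n\nxpg35.g1\n")
    # Stand-in test: pretend that only xpg34.g1 matched a contaminant.
    passed, failed = split_batch(readings, lambda name: name == "xpg34.g1")
    print("passes:", passed)
    print("fails:", failed)
```

Every input name ends up in exactly one of the two lists, which mirrors the way the -p and -f files between them account for the whole batch.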
Given the frequent need to compare against the full E. coli genome, the algorithm is designed to be fast. The user controls the speed and sensitivity by supplying a single parameter, "min_match" (the -l option, default 25). The program will find the longest exact match of at least min_match characters; a reading for which such a match is found is treated as a fail.
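To make the role of min_match concrete, here is a minimal Python sketch of one way to find the longest exact match of at least min_match characters on either strand. It is an illustration only: the word-index approach, the function names and the test sequences are assumptions, and the actual algorithm used by screen_seq is not described here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement; characters other than ACGT are kept."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_exact_match(reading, target, min_match=25):
    """Return (length, reading_pos, target_pos) of the longest exact match
    of at least min_match characters, or None if there is none."""
    if min_match < 1 or min(len(reading), len(target)) < min_match:
        return None
    # Index every word of length min_match in the target sequence.
    index = {}
    for j in range(len(target) - min_match + 1):
        index.setdefault(target[j:j + min_match], []).append(j)
    best = None
    for i in range(len(reading) - min_match + 1):
        for j in index.get(reading[i:i + min_match], ()):
            # A seed whose preceding characters also agree lies inside a
            # match that was already measured from its true left end.
            if i > 0 and j > 0 and reading[i - 1] == target[j - 1]:
                continue
            length = min_match
            while (i + length < len(reading) and j + length < len(target)
                   and reading[i + length] == target[j + length]):
                length += 1
            if best is None or length > best[0]:
                best = (length, i, j)
    return best


def is_contaminant(reading, target, min_match=25):
    """True if either strand of the reading shares an exact match."""
    return (longest_exact_match(reading, target, min_match) is not None
            or longest_exact_match(reverse_complement(reading), target,
                                   min_match) is not None)


if __name__ == "__main__":
    target = "TTTTACGTACGGAAAA"   # invented contaminant sequence
    reading = "CCACGTACGGCC"      # invented reading sharing ACGTACGG
    print(longest_exact_match(reading, target, min_match=5))
    print(is_contaminant(reverse_complement(reading), target, min_match=8))
```

In this sketch a smaller min_match reports shorter, weaker matches and makes the index larger and less selective, while a larger value ignores short chance matches.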
The search is conducted only over the clipped portion of the readings. On our Alpha machine it takes about 1 second to compare both strands of a reading against the 4.7 million bases of E. coli.
-l  Length of minimum match (25)
-m  Maximum vector length (100000)
-i  Input file of reading file names
-I  Input file of single reading to screen
-s  Input file of sequence file names to screen against
-S  Input file of single sequence to screen against
-p  Passed output file of file names
-f  Failed output file of file names
-t  Test only mode
Usage: screen_seq [options and parameters]
Where options and parameters are:
 [-l minimum match (25)]
 [-m Max vector length (100000)]
 [-i readings to screen fofn]
 [-I reading to screen]
 [-s seqs to screen against fofn]
 [-S seq to screen against]
 [-t test only]
 [-p passed fofn]
 [-f failed fofn]
1. Screen the readings whose names are stored in fofn against a batch of possible contaminant sequences whose names are stored in vnames. Write the names of the readings that pass to file p and those that fail to file f. Increase the maximum sequence length to 5,000,000 characters and require a minimum match of 20.
screen_seq -i fofn -s vnames -p p -f f -l20 -m5000000
2. Screen the single reading stored in xpg33.g1 against a batch of possible contaminant sequences whose names are stored in vnames. If the reading does not match, write its name to file p, otherwise to file f. Increase the maximum sequence length to 5,000,000 characters and require a minimum match of 20.
screen_seq -I xpg33.g1 -s vnames -p p -f f -l20 -m5000000
3. Screen the readings whose names are stored in fofn against a single possible contaminant sequence stored in ecoli.seq. Write the names of the readings that pass to file pass and those that fail to file fails. Increase the maximum sequence length to 5,000,000 characters and require a minimum match of 20.
screen_seq -i fofn -S ecoli.seq -p pass -f fails -l20 -m5000000
Limits
screen_seq is currently set to process a maximum of 10,000 readings and 5,000 screening sequences in a single run. The maximum length of any screening sequence is 100,000 characters, although this can be overridden with the -m parameter (set it to 5000000 for E. coli). At present the sequences to screen against must be stored in simple text files containing individual sequences, with no entry names and fewer than 100 characters per line.
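As an illustration of the file format described above, the following Python sketch turns a FASTA-style record into plain sequence text with no entry name and short lines. The FASTA input, the function name and the 60-character line width are assumptions made for this example; the sketch also reads "individual sequences" as one sequence per file and rejects input with more than one header.

```python
def fasta_to_plain(fasta_text, width=60):
    """Strip the '>' header from a single FASTA record and rewrap the
    sequence so that every line is shorter than 100 characters."""
    if not 0 < width < 100:
        raise ValueError("width must be between 1 and 99")
    headers = 0
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers += 1
            if headers > 1:
                raise ValueError("expected a single sequence per file")
            continue  # drop the entry name
        chunks.append(line)
    seq = "".join(chunks)
    # Emit fixed-width lines, each terminated by a newline.
    return "".join(seq[i:i + width] + "\n" for i in range(0, len(seq), width))


if __name__ == "__main__":
    # Invented record used only to show the transformation.
    print(fasta_to_plain(">example\nACGTACGT\nACGT\n", width=5), end="")
```

The length of the resulting sequence can then be compared with the value given to -m before screening against it.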
The following errors can be reported.
Inconsistencies in the selection of options, such as selecting both -I and -i, should also cause the usage message (shown above) to appear, and the program to terminate.
A PS record is added to the experiment file for any reading that matches.
See section Experiment File. See section Screening Against Vector Sequences.