
Writing Your Own Naming Schemes

The naming schemes are defined in the "component" files. At present two examples exist; both are naming schemes taken from the Sanger Centre. It is possible to define your own naming scheme, or indeed any other component. A component is simply a file whose entire contents are added to the user's pregap4 configuration file. Typically these files end in the extension ".p4t".

The naming schemes are defined by use of three variables: ns_name, ns_regexp and ns_lt.

ns_name is simply a text name for the naming scheme.

ns_regexp is a regular expression which will be matched against each sequence identifier. The bracketed segments are assigned to Tcl variables which can be referenced as $1, $2, $3 etc.

ns_lt is an array indexed by Experiment File line types. Each array element contains either a string holding the value for that line type, or the word subst followed by a substitution list of the following format:

subst {string {pattern replacement} ... default_replacement}

In addition to this we need a bit of preamble stating that the following component is part of the pregap4 naming scheme section. This can be done by making sure the first line of the component file is [naming_scheme].

A completely new example naming scheme may be, in English, as follows:

The reading identifier will consist of the template name, followed by a full stop, followed by two characters to determine the primer type and position, a single character to determine the chemistry, and any extra characters needed to create a unique name. Forward and reverse readings from the same "insert" or "template" will share the same template name. This in turn allows gap4 to know the relative positions, orientations and distances of two such readings, and hence to point out possible problems.

Putting this more specifically: a template name is any string of alpha-numerics (a-z, 0-9 and underscore). The primer type could be defined as:

uf
universal forward primer
ur
universal reverse primer
cf
custom forward primer
cr
custom reverse primer

The chemistry can be defined as:

p
Dye-Primer
P
Big dye-primer
t
Dye-Terminator
T
Big dye-terminator

For example fred.ufp, fred.urp and bert.cfT are all valid names.

The above variable definitions may seem complex, so we shall work through the example naming scheme. Firstly we need to define the regular expression. Regular expressions can be daunting for new users, but they are described in great detail in many places (try the Unix "grep" manual page). In the shortest form: dot (.) matches any character; square brackets delimit a set of characters, any one of which is allowed (or, if the set starts with ^, it is the complement set: any character except those listed). Following a character or set with + indicates one or more copies of the preceding expression, * is for zero or more copies, and ? is for zero or one copy.
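The constructs just listed can be tried out interactively. The following is a minimal sketch in Python rather than Tcl, chosen only because it is easy to run; pregap4 itself uses Tcl regular expressions, which treat these basic constructs in the same way. The helper name matches is our own and is not part of pregap4.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"a.c", "abc"))   # dot matches any one character      -> True
print(matches(r"[pt]", "t"))    # set: any one of p or t             -> True
print(matches(r"[^.]", "."))    # complement set excludes full stop  -> False
print(matches(r"ab+", "a"))     # + needs one or more b              -> False
print(matches(r"ab*", "a"))     # * allows zero or more b            -> True
print(matches(r"ab?", "abb"))   # ? allows at most one b             -> False
```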

So to define our example names we would start our component file with:

[naming_scheme]
set ns_name "Example naming scheme"
set ns_regexp {([^.]*)\.(..)(.).*}

The backslash in the above text states that we want to match a real full stop character instead of the "any character" that a full stop usually means in a regular expression. The ns_regexp will store the three bracketed segments in $1, $2 and $3: for the identifier fred.ufp these would be fred, uf and p.
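To see how the three bracketed segments fall out of a reading identifier, the same expression can be applied outside pregap4. This is only an illustrative Python sketch: the function name split_name is hypothetical, and we assume the expression is applied to the whole identifier (for this particular expression an unanchored match gives the same result).

```python
import re

# The same expression as ns_regexp in the component file.
NS_REGEXP = r"([^.]*)\.(..)(.).*"

def split_name(identifier: str):
    """Return the ($1, $2, $3) segments, or None if the name does not fit."""
    m = re.fullmatch(NS_REGEXP, identifier)
    return m.groups() if m else None

print(split_name("fred.ufp"))   # ('fred', 'uf', 'p')
print(split_name("bert.cfT"))   # ('bert', 'cf', 'T')
print(split_name("fred"))       # None: no full stop, so no match
```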

The first segment is the template name. To use this we simply add:

set ns_lt(TN) {$1}

The next segment is the primer type. The primer type is defined for gap4 as a single digit number. 0 is for unknown, 1 is universal forward primer, 2 is universal reverse primer, 3 is custom forward primer, and 4 is custom reverse primer. So we wish to map uf to 1, ur to 2, cf to 3, cr to 4, and anything else to 0. This is done with the following command:

set ns_lt(PR) {subst {$2 {uf 1} {ur 2} {cf 3} {cr 4} 0}}

The final segment is the chemistry. At present gap4 only distinguishes between dye-primer and dye-terminators, although our naming scheme also "knows about" big dyes. So we wish to map both p and P to chemistry type 0, and t and T to chemistry type 1. Anything else we'll also assume is dye-primer. In much the same way that the regular expressions work, we can use square brackets in our patterns to say "any of these letters". So the command for this is:

set ns_lt(CH) {subst {$3 {[pP] 0} {[tT] 1} 0}}
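The behaviour of a substitution list can be modelled in a few lines. The sketch below is not how pregap4 is implemented; it assumes that patterns are tried in order, that the first pattern matching the whole string wins, and that patterns are glob-style (as the square bracket syntax suggests). The names resolve_subst, PRIMER_RULES and CHEMISTRY_RULES are invented for this illustration.

```python
from fnmatch import fnmatchcase

def resolve_subst(value, rules, default):
    """Return the replacement for the first pattern matching value.

    Assumption: first match wins, and the default is used when nothing matches.
    """
    for pattern, replacement in rules:
        if fnmatchcase(value, pattern):
            return replacement
    return default

# Mirrors {subst {$2 {uf 1} {ur 2} {cf 3} {cr 4} 0}}
PRIMER_RULES = [("uf", "1"), ("ur", "2"), ("cf", "3"), ("cr", "4")]
# Mirrors {subst {$3 {[pP] 0} {[tT] 1} 0}}
CHEMISTRY_RULES = [("[pP]", "0"), ("[tT]", "1")]

print(resolve_subst("cf", PRIMER_RULES, "0"))     # 3
print(resolve_subst("zz", PRIMER_RULES, "0"))     # 0 (unknown primer)
print(resolve_subst("T", CHEMISTRY_RULES, "0"))   # 1 (dye-terminator)
```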

The final line to add to the component file is set_name_scheme. This is a pregap4 command which tells it that you have finished defining the naming scheme. So the completed component file is simply:

[naming_scheme]
set ns_name "Example naming scheme"
set ns_regexp {([^.]*)\.(..)(.).*}
set ns_lt(TN) {$1}
set ns_lt(PR) {subst {$2 {uf 1} {ur 2} {cf 3} {cr 4} 0}}
set ns_lt(CH) {subst {$3 {[pP] 0} {[tT] 1} 0}}
set_name_scheme
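As a check on the completed scheme, the values we would expect for the TN, PR and CH line types can be worked out for the three example names. The following Python sketch only imitates the component file under stated assumptions (glob-style patterns, first match wins, whole-identifier matching); the SCHEME table and the line_types function are hypothetical, and returning an empty result for a non-matching name is our choice, not documented pregap4 behaviour.

```python
import re
from fnmatch import fnmatchcase

NS_REGEXP = r"([^.]*)\.(..)(.).*"

# line type -> (regexp group, substitution rules, default)
# A default of None means "use the matched segment unchanged".
SCHEME = {
    "TN": (1, [], None),
    "PR": (2, [("uf", "1"), ("ur", "2"), ("cf", "3"), ("cr", "4")], "0"),
    "CH": (3, [("[pP]", "0"), ("[tT]", "1")], "0"),
}

def line_types(identifier: str) -> dict:
    """Derive the line type values for one reading identifier."""
    m = re.fullmatch(NS_REGEXP, identifier)
    if m is None:
        return {}  # assumption: nothing is set when the name does not fit
    result = {}
    for line_type, (group, rules, default) in SCHEME.items():
        value = m.group(group)
        if default is not None:
            # First matching pattern wins; otherwise fall back to the default.
            value = next(
                (rep for pat, rep in rules if fnmatchcase(value, pat)),
                default,
            )
        result[line_type] = value
    return result

for name in ("fred.ufp", "fred.urp", "bert.cfT"):
    print(name, line_types(name))
```

Note that fred.ufp and fred.urp yield the same TN value, which is what lets the forward and reverse readings be recognised as coming from the same template.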

This page is maintained by staden-package. Last generated on 22 October 2002.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/pregap4_unix_52.html