The naming schemes are defined in the "component" files. At present two examples exist; both are naming schemes taken from the Sanger Centre. It is possible to define your own naming scheme, or indeed any other component. A component is basically just a file which you want to add (in its entirety) to the user's pregap4 configuration file. Typically these files end in the extension `.p4t'.
The naming schemes are defined by use of three variables: ns_name, ns_regexp and ns_lt.
ns_name is simply a text name for the naming scheme.
ns_regexp is a regular expression which will be matched against each
sequence identifier. The bracketed segments are assigned to Tcl variables
which can be referenced as $1, $2, $3 etc.
ns_lt is an array indexed by Experiment File line types. Each array element
holds either a string containing the value for that line type, or the word
subst followed by a substitution list of the following format:
subst {string {pattern replacement} ... default_replacement}
In addition to this we need a bit of preamble stating that the following
component is part of the pregap4 naming scheme section. This can be done by
making sure the first line of the component file is [naming_scheme].
A completely new example naming scheme may be, in English, as follows:
The reading identifier will consist of the template name, followed by a full stop, followed by two characters to determine the primer type and position, a single character to determine the chemistry, and any extra characters needed to create a unique name. Forward and reverse readings from the same "insert" or "template" will share the same template name. This in turn allows for gap4 to know the relative positions, orientations and distances of two such readings and hence will allow it to point out possible problems.
Putting this more specifically: a template name is any string of alpha-numerics (a-z, 0-9 and underscore). The primer type could be defined as:
uf    universal forward primer
ur    universal reverse primer
cf    custom forward primer
cr    custom reverse primer
The chemistry can be defined as:
p, P    dye-primer chemistries
t, T    dye-terminator chemistries
For example fred.ufp, fred.urp and bert.cfT are all valid names.
The above variable definitions may seem complex so we shall work through the
example naming scheme. Firstly we need to define the regular expression. To
new users this can be complex, but it is described in great detail in many
places (try the Unix "grep" manual page). In the shortest form: dot (.)
matches any character; square brackets delimit a set of characters, any one of
which is allowed (or, if the set starts with ^, it is the complement set: any
character except those listed). Following a character or set with + indicates
one or more copies of the preceding expression, * is for zero or more copies,
and ? is for zero or one copy.
So to define our example names we would start our component file with:
[naming_scheme]
set ns_name "Example naming scheme"
set ns_regexp {([^.]*)\.(..)(.).*}
The backslash in the above text is to state that we want to match a real full
stop character instead of the "any character" that regular expressions usually
regard full stop as meaning. The ns_regexp will store the three bracketed
segments in $1, $2 and $3.
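Before putting an expression into a component file it can be useful to try it
against a few identifiers in a plain tclsh session. The following is a minimal
sketch which assumes that pregap4 applies ns_regexp with Tcl's standard regexp
command; the procedure name split_name is hypothetical and is not part of
pregap4.

```tcl
# Hypothetical helper (not a pregap4 command): apply the example expression
# to a reading identifier and return the three bracketed segments as a list,
# or an empty list when the identifier does not match.
proc split_name {name} {
    set pattern {([^.]*)\.(..)(.).*}
    if {[regexp $pattern $name whole template primer chemistry]} {
        return [list $template $primer $chemistry]
    }
    return {}
}

foreach name {fred.ufp fred.urp bert.cfT} {
    puts "${name} -> [split_name $name]"
}
```

An identifier containing no full stop, such as fred, fails to match and yields
an empty list, which is a quick way to spot names that fall outside the scheme.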
The first segment is the template name. To use this we simply add:
set ns_lt(TN) {$1}
The next segment is the primer type. The primer type is defined for gap4 as a
single digit number. 0 is for unknown, 1 is universal forward primer, 2 is
universal reverse primer, 3 is custom forward primer, and 4 is custom reverse
primer. So we wish to map uf to 1, ur to 2, cf to 3, cr to 4, and anything
else to 0.
This is done with the following command:
set ns_lt(PR) {subst {$2 {uf 1} {ur 2} {cf 3} {cr 4} 0}}
The final segment is the chemistry. At present gap4 only distinguishes between
dye-primer and dye-terminator chemistries, although our naming scheme also
"knows about" big dyes. So we wish to map both p and P to chemistry type 0,
and t and T to chemistry type 1. Anything else we'll also assume is
dye-primer. In much the same way that the regular expressions work, we can use
square brackets in our patterns to say "any of these letters". So the command
for this is:
set ns_lt(CH) {subst {$3 {[pP] 0} {[tT] 1} 0}}
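To make the behaviour of the two substitution lists above concrete, here is a
small sketch of how such a list could be evaluated. It is an illustration
only: the procedure apply_subst is hypothetical, and the use of Tcl's
glob-style "string match" for the patterns is an assumption rather than a
statement of how pregap4 implements subst.

```tcl
# Hypothetical helper (not a pregap4 command): return the replacement paired
# with the first pattern that matches $value, or $default if none match.
# Assumption: patterns are compared using glob-style "string match".
proc apply_subst {value pairs default} {
    foreach pair $pairs {
        set pattern     [lindex $pair 0]
        set replacement [lindex $pair 1]
        if {[string match $pattern $value]} {
            return $replacement
        }
    }
    return $default
}

# Primer type: "ur" is the universal reverse primer, type 2.
puts [apply_subst ur {{uf 1} {ur 2} {cf 3} {cr 4}} 0]
# Chemistry: "T" is a dye-terminator, type 1.
puts [apply_subst T {{[pP] 0} {[tT] 1}} 0]
# Anything unrecognised falls through to the default, type 0.
puts [apply_subst x {{[pP] 0} {[tT] 1}} 0]
```

Under these assumptions the three calls print 2, 1 and 0 respectively, which
matches the mappings described in the text.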
The final line to add to the component file is set_name_scheme. This is a
pregap4 command which tells pregap4 that you have finished defining the naming
scheme. So the completed component file is simply:
[naming_scheme]
set ns_name "Example naming scheme"
set ns_regexp {([^.]*)\.(..)(.).*}
set ns_lt(TN) {$1}
set ns_lt(PR) {subst {$2 {uf 1} {ur 2} {cf 3} {cr 4} 0}}
set ns_lt(CH) {subst {$3 {[pP] 0} {[tT] 1} 0}}
set_name_scheme