Version-2000.0 Release Notes
James Bonfield, Kathryn Beal, Matthew Betts, Mark Jordan and Rodger Staden
This release has been slow to come out because we wanted it to
coincide with, and be identical to, the first Windows 9X/NT release of
the package. Achieving this has taken longer than planned, but has
left us with the same source code for all systems, and so in a strong
position for future developments. We have also given the manual a
major overhaul and produced separate versions for UNIX and
Windows. The different manuals include their system-specific
screendumps and we have also introduced alternatives for users with
two button mice. (The middle mouse button may now be simulated by
using the Alt key with the left mouse button.) A new "mini-manual",
giving a quick (45 page) introduction to the programs, has also been
written. Finally, we now also include some notes and data for a
course in using the sequence assembly tools. These may be found in the
'course' subdirectory.
Note that only the "modern" programs have been ported to Windows
(i.e. pregap4 and its ancillary programs like vector_clip,
screen_seq,..., trev, gap4, nip4 and sip4.) To complete the
equivalence between the UNIX and Windows versions of the package we
have removed all the "old" (mostly FORTRAN) programs from the UNIX
release. These old programs (such as nip, sip,...) are now freely
available to anyone as Digital UNIX, IRIX, Solaris and Linux binaries
from our ftp site and will not be upgraded in the future. The Windows
version will only be available through commercial distributors, but
the UNIX versions will still be available directly from us at LMB.
A major change is the addition of a software licensing system. This
means that the package can be made available to anybody in
demonstration mode, and can be switched to full functionality with a
key specific to that copy of the package on that machine. We can also
provide temporary licence keys that will enable full functionality for
a limited period. We plan for these temporary licences to be made
available to applicants who simply fill in a web form.
To accompany the demo versions of the package we have included a set of data
files. In demonstration mode, only these files can be loaded into the programs.
Program version numbers
Gap4 4.5
Nip4 1.2
Sip4 1.3
Pregap4 1.1
Trev 1.5
Operating systems
The binaries have been created in the following build environments. Typically
newer environments for the same operating system should work fine, but not
necessarily older systems. (For example, the binaries will not run under
RedHat Linux 5.x.)
- Digital Unix 4.0E
- Irix 5.3
- RedHat Linux 6.1
- Solaris 2.6
Demo data sets
All pathnames listed below are relative to the installation root for the
package.
Here is a list of sequences which may be loaded into Sip4.
- userdata/atpase.embl
- userdata/blue.seq
- userdata/cemyo1.seq
- userdata/ecoli.0*
- userdata/lambda.seq
- userdata/lorist6.seq
- userdata/m13mp18.seq
- userdata/m13mp7.seq
- userdata/mysa_drome.embl
- userdata/mysa_human.embl
- userdata/mysd_caeel.seq
- userdata/pjb8.seq
- userdata/puc18.seq
For a good example of protein-protein similarity plots, try using
mysa_drome.embl and mysa_human.embl.
For DNA-protein plots, try using cemyo1.seq against mysd_caeel.seq.
To see how Sip4 handles large sequences try using ecoli.00003 and
lambda.seq. This is a large comparison: 250Kb against 48.5Kb. Hence
the slower searches, such as Find Similar Spans, will take a long
time. We suggest searching with Find Matching Words using a word
length of 12.
For Nip4, the following sequences are accepted.
- userdata/atpase.embl
- userdata/blue.seq
- userdata/cemyo1.seq
- userdata/ecoli.0*
- userdata/lambda.seq
- userdata/lorist6.seq
- userdata/m13mp18.seq
- userdata/m13mp7.seq
- userdata/pjb8.seq
- userdata/puc18.seq
Gap4 in demonstration mode only allows access to one data set at present.
This has been base called with phred. The trace files may also be viewed in
Trev.
This database is a section from a C. elegans cosmid, still in several
contigs. You may try joining and editing contigs. The full
functionality of Gap4 is available except for assembling or
disassembling sequences.
For Pregap4 there is no demonstration mode. Pregap4 may be started and
configured, but it will not process any data.
Sequence Library Access Using SRS
We have not upgraded our sequence library interface to SRS 6 and
are aware of some problems in our use of SRS 5. We hope to include
a Web-based interface in a later release.
Linux, Gnome and Enlightenment
There is a known problem with the Gap4 contig editor when using Enlightenment
as the window manager. This is the default for Gnome, but it is not known
whether the problem arises when using Enlightenment in other environments. The
symptom is that the program will terminate instantly upon starting the contig
editor with a complaint about X_ConfigureEvents.
The solution is to change window managers, which may be adjusted using the
Gnome control panel.
Change log
Here is a list of changes since the 1999.0 (patched) release.
Gap4
Changes
- New plot: confidence values.
- New plot: reading coverage.
- New plot: readpair coverage.
- Primer suggestion from within the editor now ignores padding
characters.
- Searching by sequence from the editor now ignores padding characters
and searches both strands. Optionally it may also find non-exact
matches (mismatches, but not insertions or deletions) and can search
top, bottom or both strands.
- An extra editor search method is available: unpadded position
search. This makes it possible to jump to a specific unpadded position
within a padded consensus.
- The contig editor base numbers may now be shown as unpadded
positions. Note this may be slow on very large projects as it will
require frequent recalculation of the complete consensus.
- Confidence values can now be plotted in the Trace Display.
- The trace display now lists the original base confidence in its
information line.
- Primer and Chemistry information is now visible as single character
codes in the trace display.
- The full reading name is now shown in the trace display. It is
superimposed over the top-left corner of the trace.
- The trace file search path is now adjustable from the options menu
(although the RAWDATA environment variable may still be used if
desired). This writes a database note (of type RAWD) and so the search
path will be remembered.
- New commands in the option menu: set alignment scores and set
genetic code.
- The alignment weight tables for Gap4 are now configurable; stored in
$STADTABL/nuc_matrix. The alignment gap open and gap extension
penalties may also be changed. The matrix file and penalties are
stored in tables/gaprc as the ALIGNMENT.MATRIX_FILE,
ALIGNMENT.OPEN.COST and ALIGNMENT.EXTEND.COST variables.
- Improved Find Internal Joins alignment algorithm. This is now a banded
alignment, and so is much faster. Also improved the "extended
consensus" calculation such that there are no more discontinuities in
the sequence (caused by padding). This does not fix any bugs, but the
change improves the sensitivity of find internal joins.
- The template display now separates out the plotting of consistent and
inconsistent templates.
- The colours in the Template Display for reverse and custom-reverse
primers have been changed from shades of grey to orange and
orange-red.
- The consensus algorithm now has a "Display IUB codes in consensus"
mode. The exact definition depends on the consensus algorithm being
used.
- The consensus algorithm may now be chosen from within the contig
editor.
- The consensus, when produced in fasta format, now uses the contig
identifier as a fasta sequence name instead of a reading number. The
reading number is included as a fasta comment after the contig
identifier.
- Gap4 now has "user levels"; currently only "beginner" and
"expert". Gap4 will start in beginner mode. Use the Options menu to
change to expert mode. Beginner mode has several of the less-often
used functions removed.
- The contig editor now has the notion of a reference sequence. This is
the sequence to which base numbering should be applied. The sequence
numbering can optionally start from an offset other than 1 and may
wrap-around at a predefined length. All these details are written into
the contig REFS note.
- Pressing the middle mouse button on the reading names in the editor
will now 'copy' the name to the paste buffer, allowing for easy
cut-n-paste.
- The control-h keybinding to remove readings from within the contig
editor now also works when the mouse pointer is above the
sequence names sub-window.
- Using the popup-menu (right mouse button) in the contig editor now
also moves the editor cursor. This makes displaying of tag information
(for example) more intuitive.
- Added a gaprc parameter to control whether the Delete key should act
in a Motif/Windows style or an Emacs style. Defaults on for Windows,
off for Unix.
- Added an assembly interface to cap3. (Cap3 is not included.)
- Multiple files for assembly may be picked directly from the assembly
dialogue without needing to first create a file of filenames.
- Added a new gap4 command line switch "-menu_file". This may be used to
customise local menu configurations. Try "gap4 -menu_file mito" for
example; see $STADENROOT/tables/gaprc_menu_mito.
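The unpadded position search, the unpadded base numbering and the reference sequence numbering described above all rest on a mapping between padded and unpadded coordinates. The following is a minimal Python sketch of that idea, not Gap4's own code. It assumes that pads appear as '*' in the consensus, that positions are 1-based, and that wrap-around means numbering restarts at 1 after the given length; the function names are hypothetical.

```python
PAD = "*"  # assumption: pads are shown as '*' in the padded consensus


def unpadded_to_padded(consensus, unpadded_pos):
    """Return the 1-based padded position of a 1-based unpadded position."""
    if unpadded_pos < 1:
        raise ValueError("positions are 1-based")
    seen = 0
    for padded_pos, base in enumerate(consensus, start=1):
        if base != PAD:
            seen += 1
            if seen == unpadded_pos:
                return padded_pos
    raise ValueError("unpadded position lies beyond the end of the consensus")


def padded_to_unpadded(consensus, padded_pos):
    """Count the non-pad bases up to and including a 1-based padded position."""
    if not 1 <= padded_pos <= len(consensus):
        raise ValueError("padded position is outside the consensus")
    return sum(1 for base in consensus[:padded_pos] if base != PAD)


def reference_number(unpadded_pos, offset=1, wrap_length=None):
    """Number a base relative to a reference sequence.

    Numbering starts at `offset` and, if `wrap_length` is given,
    restarts at 1 after `wrap_length` (an assumed interpretation).
    """
    number = unpadded_pos - 1 + offset
    if wrap_length:
        number = (number - 1) % wrap_length + 1
    return number


consensus = "AC*GT**A"
print(unpadded_to_padded(consensus, 3))
print(padded_to_unpadded(consensus, 4))
print(reference_number(5, offset=1, wrap_length=4))
```

Both mappings scan the consensus from its start, which suggests why showing unpadded base numbers may be slow on very large projects.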
Bug fixes
- Specifying ranges for masked consensus was masking the wrong
regions. (This worked correctly when all the consensus was output.)
This affected both saving of masked consensus and the masked assembly
options.
- Attempting to align sequences in the join editor with an overlap of
greater than 8Kb could crash Gap4. Now works with any length (although
it may be slow).
- Removed a problem with the "lock" mode of the join editor: by
extending cutoffs it was possible to `break' the lock position.
- The colours used for highlighting disagreements in the join editor have
now been fixed.
- Fixed problems with unneeded cursors being left in various graphical
plots after joining contigs.
- The trace display could crash when scrolling traces immediately after
quitting an individual trace display.
- Diff against consensus trace (contig editor) was sometimes crashing.
- Fixed crash when trying to display multiple diffs between the same
pair of sequences.
- Using "Quality Clip" on databases that were not base called with a
phred-scaled confidence system would incorrectly clip.
- Fixed a crash with searching by reading name within the editor.
- Changed the focus bindings for the contig editor. This should prevent
the auto-raising problems that some systems have.
- Added back the "lost" editor menu item: group readings by template.
- Fixed a problem with the editor "search by file" mechanism and
specifying reading names when there are more than 10,000 sequences
assembled.
- Dump Contig (contig editor) using line lengths of > 300 would
sometimes crash gap4. It now supports up to 1000 (the limit allowed by
the dialogue).
- The day field of dates listed in the Note Selector window was
incorrect. (NOTE: This was not a year 2000 problem, but simply a mixup
of the date formatting.)
- Fixed a crash in Find Repeats when it found many copies of a
repeat on both strands.
- The contig comparator now ensures that all items are visible (even if
it means moving them). Previously there were some cases where joins
found in the cutoff data of the right-most contig were not visible
(although the "next" button worked).
- Fixed bug with the crosshairs in contig selector/comparator, which
when zoomed up displayed the position of the crosshairs incorrectly.
- Fixed a problem when trying to use the popup-menus from the
restriction enzyme plot.
- The restriction enzyme selector sometimes had problems updating the
enzyme list when using the filebrowser to pick a "personal" enzyme
file.
- Removed a crash in the restriction enzyme plot, triggered when leaving
the plot displaying while making a contig join.
- Fixed window resizing problems in the restriction enzyme plot.
- The default confidence value for sequences without confidence values
is now 2, instead of 99.
- Improved the error messages when attempting to assemble non-existent
files.
- The debug "log file" was switched after a copy database command:
subsequent logs went to the copy's log file.
- Local GTAGDB files did not correctly override tags found in the tables
directory.
- The show relationships output was sometimes poorly tabulated.
- The difference clip function no longer quality clips the extreme start
and end sequences. This depended on correct confidence values and
sometimes failed when they were not present.
Nip4
Changes
- Added a general-purpose weight matrix search function.
- A new program, make_weights, is available to produce weight matrices
from a set of aligned sequences.
- The weight matrix searching functions now use log-odds scores.
- The splice junction search is improved for human data - using a
better weight matrix (which is now also user-definable). The new
search uses log-odds scores.
- The gene search functions now all use log-odds scores, which means
they may be compared to one another easily. The codon preference
search has been generalised; it now has the ability to make use of
coding and non-coding tables (automatically generating non-coding
tables from the coding table if it is missing), and hence now the base
preference search is a sub-option of the codon preference search.
There are now also options to normalise to average amino acid
composition and to use the amino acid composition alone.
- Stop codons may now automatically be plotted when requesting a gene
search plot.
- The author test gene search now requests percentage error instead of
window length.
- Filenames may now be specified on the command line.
- Codon tables may now be appended to existing tables. This concatenated
codon table file is required for some of the newer search options.
- Codon tables may now be written to the output window without needing
to save them to disk.
- Improved the genetic code selection.
- The string search options can now distinguish between IUB codes and
"literal" characters.
- Changed the default scoring for the tRNA search and improved the use
of conserved bases.
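As an illustration of what log-odds scoring means for the weight matrix searches listed above, here is a minimal Python sketch. It is not the code or file format used by Nip4 or make_weights: the function names, the uniform background frequencies, the pseudocount and the example sequences are all assumptions made for the sketch. Each matrix entry is the natural log of the observed base frequency at that position divided by the background frequency, and a window's score is the sum of its entries.

```python
import math

BASES = "ACGT"


def build_log_odds(aligned, background=None, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sequences."""
    if background is None:
        # Assumption: uniform background base composition.
        background = {b: 0.25 for b in BASES}
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for col in range(length):
        # Pseudocounts avoid log(0) for bases never seen in a column.
        counts = {b: pseudocount for b in BASES}
        for seq in aligned:
            base = seq[col].upper()
            if base in counts:
                counts[base] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log(counts[b] / total / background[b]) for b in BASES}
        )
    return matrix


def score_window(matrix, window):
    """Sum the log-odds entries for a window; unknown symbols score 0."""
    return sum(col.get(base, 0.0) for col, base in zip(matrix, window.upper()))


def search(matrix, sequence, min_score=0.0):
    """Return (1-based start, score) for every window scoring >= min_score."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(matrix, sequence[start:start + width])
        if score >= min_score:
            hits.append((start + 1, score))
    return hits


# Hypothetical aligned motif instances, for illustration only.
matrix = build_log_odds(["GTAAGT", "GTGAGT", "GTAAGT"])
for position, score in search(matrix, "CCGTAAGTCC", min_score=2.0):
    print(position, round(score, 2))
```

Because every search is expressed on the same log scale relative to a background, scores from different searches can be compared directly, which is the benefit noted for the gene search functions.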
Bug fixes
- Fixed bug in the stop codons plot where the determination of the
reading frame was incorrect if the starting position was not 1.
- Fixed window resizing problems in the restriction enzyme plot.
- Fixed bug with base bias search if all the results are 0 - this caused
the program to crash.
- Fixed miscellaneous bugs dealing with moving and superimposing plots.
Sip4
Changes
- Filenames may now be specified on the command line. The last two will
be the horizontal and vertical sequences.
- Added a "nearest dot" option to the sequence display which moves to
the visibly-nearest point in the 2D dotplot, rather than the
mathematically nearest match.
Bug fixes
- The local alignment code now properly handles mixed case alignments.
- If a comparison function was performed on a sequence which doesn't
have a library (e.g. an aligned sequence), sip4 would crash.
- Switched the default values for the gap start and gap extend penalties
(they were the wrong way around).
- Both sip4 and nip4 no longer complain when the default sequence
library cannot be found.
Pregap4
Changes
- New module: qclip (Quality Clip) which replaces the old clip module.
This can clip a sequence based on the average confidence values
assigned by eba, phred, ATQA or other similar tools.
- Added an ATQA module.
- Added an extract_seq module.
- Wrote a new program find_renz which searches for a restriction
enzyme site in a vector file, and added a corresponding hook to
pregap4 so that we can now type in (eg) SmaI to the cut-site box
and pregap4 will work things out accordingly.
- Included a more complete vector-primer file. The sequencing vector
clip module now has an interface to select subsets from this file.
- Multiple files may now be picked for processing directly from the
pregap4 GUI without needing to first create a file of filenames.
- Improved handling of files contained within remote directories.
Pregap4 now has an output directory, allowing the results from
processing data spread over several directories to be stored within a
single output directory.
- The E. coli genome used by Screen Sequences has now been split up into
several small chunks, which slightly speeds up the search and uses
much less memory.
- The blast screen module can now output tags if desired. This tags all
matches, regardless of whether the sequence meets the specified "match
fraction". Hence by setting the match fraction to greater than 1.0
this module may be used to add arbitrary match tags to sequences
without rejecting any of them.
- The E value is now adjustable in the blast module.
- The cross_match module now works with the newer cross_match releases
(which no longer support experiment file format).
- The cross_match module now has a gap_size parameter (adjustable only
from the pregap config files). Matches found within $gap_size of the
end of the sequence or within $gap_size of another match are stitched
together. gap_size is initially set to 15 bases.
- Added an "other arguments" interface to the phrap module.
- We now report simple status information such as "module x needs
configuring" in the blank space between the Run and Help buttons at
the bottom of the configure window.
- The "save all parameters" buttons have had their names changed to
(hopefully) become clearer.
- A new function has been added, named "Load Naming Convention". This
is identical to "Include Config Component" except for the default
location for the file-browser.
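The gap_size behaviour of the cross_match module described above can be pictured as interval merging. The following Python sketch is one possible reading of that description, not the module's actual code: it assumes matches are 1-based inclusive (start, end) pairs, that "within gap_size" means at most gap_size unmatched bases, and the function name is hypothetical.

```python
def stitch_matches(matches, seq_length, gap_size=15):
    """Stitch together matches lying close to each other or to a sequence end.

    matches    -- iterable of 1-based inclusive (start, end) pairs
    seq_length -- length of the sequence the matches were found in
    gap_size   -- largest run of unmatched bases that is bridged
    """
    stitched = []
    for start, end in sorted(matches):
        # Extend a match to the sequence end if it lies within gap_size of it.
        if start - 1 <= gap_size:
            start = 1
        if seq_length - end <= gap_size:
            end = seq_length
        # Merge with the previous match if the gap between them is small
        # (overlapping matches give a negative gap and are merged too).
        if stitched and start - stitched[-1][1] - 1 <= gap_size:
            stitched[-1][1] = max(stitched[-1][1], end)
        else:
            stitched.append([start, end])
    return [tuple(match) for match in stitched]


matches = [(10, 50), (60, 100), (300, 350), (490, 495)]
print(stitch_matches(matches, seq_length=500, gap_size=15))
```

With the default of 15 bases, the first two matches are joined and extended to the start of the sequence, the last is extended to the end, and the isolated middle match is left alone.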
Bug fixes
- Removing a module with the Add/Remove modules command no longer gives
Tcl error messages when the module being removed is the currently
active module.
- The saving of the sequencing vector clip module parameters now
correctly saves the vector-primer file details.
- Fixed problems when dealing with configuration files produced on
different systems with different $STADENROOT settings.
- Improved error handling when dealing with third-party tools which have
not been installed on the local system (eg Phred, RepeatMasker, etc).
- Improved error handling in phred module.
Trev
Changes
- Now displays sequence confidence values.
- Can now handle multiple files from within the "open" command.
- Printed trace files may now have a title, which defaults to the trace
file name.
- Saving in plain-text format now only writes out the good
quality region.
Bug fixes
- Trev failed to correctly write edited confidence values back to
experiment files.
- The "unable to configure trace options" message no longer appears when
failing to load a file.
- The "undo clipping" command failed when the previous right-hand clip
point was the very end of the sequence.
- Certain "corrupt" trace files would cause crashes in the trace
printing code. Specifically when base calls are positioned beyond the
end of the 'sample' data.
Misc
Changes
- We now include a full tutorial on using the sequence
assembly tools, including documentation and data. This may
be found in the "course" subdirectory, but also please check
for newer versions at
ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/course/
- Vector_clip now has a minimum length of 5' sequence to match when
using vector_primer files. This allows us to over-specify the vector
sequence so that we can use the same vector_primer file line for
multiple primer sites (with the same cut site).
- Vector_clip now writes SF and PR records to the experiment files
(derived from the matches found). If no match is found, no PR record
is written. If the vector rearrangement search finds there is no SF
record in an experiment file it writes a CC record to that effect,
but writes the reading name to the passed file of file names.
- Vector_primer file format simplified: the numeric values included
in case they were needed in the future have been dropped.
- Screen_seq will now add tags to the failing sequences.
- Extract_seq can now handle multiple files on the command line. It
also has a -fasta_out option which, when combined with specifying
multiple files, provides a handy way of converting many Experiment
files into a single fasta file.
- Init_exp now has a "-conf" option to force the confidence values to be
written to the Experiment Files (even when no edits have been made).
- The order of SCF comments written by makeSCF has been sanitised. The
textual date format has also been fixed (it was two months out,
although the numerical date output was correct).
- Online help now automatically looks for Netscape for viewing HTML
documents.
Bug fixes
- Gap4, Sip4 and Nip4 work better with small screen sizes (eg 800x600)
although this still isn't perfect. We recommend a screen resolution of
at least 1024x768.
- The pathname expansion functions no longer expand environment
variables unless a dollar symbol precedes them.
- In vector_clip, default 5' positions that are higher than the sequence
length were causing crashes.
- Fixed a bug with the cosmid clipping in extract_seq. It worked fine
with CS lines, but was incorrect for Experiment files using a CL and
CR notation.
- The sequence format recognition in Nip4 and Sip4 would fail when a
protein sequence in plain text format started with "SQ".
- More robust handling of ABI format data. We make up data with 0 level
samples when the base data extends beyond the stored sample data.
- Improved the sequence format recognising code used within nip4 and
sip4.
- MakeSCF no longer adds duplicate MACH fields to the SCF comments.
- Fixed a trace rescaling bug in MakeSCF. Only causes problems when
using the -normalise command line switch.
- Using non-default colours now works better. (Certain dialogue
components were ignoring the user-defined colours.)
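To illustrate the pathname expansion fix listed above, here is a minimal Python sketch of expansion that only touches names preceded by a dollar symbol. It is not the package's own implementation: the function name, the support for a ${NAME} form, the choice to leave unknown variables untouched and the example installation path are all assumptions.

```python
import re

# Matches $NAME or ${NAME}; a bare NAME without a dollar is never matched.
_VARIABLE = re.compile(r"\$(\w+)|\$\{(\w+)\}")


def expand_path(path, env):
    """Expand $NAME and ${NAME} in path using the mapping env.

    Variables missing from env are left as written (an assumption).
    """
    def replace(match):
        name = match.group(1) or match.group(2)
        return env.get(name, match.group(0))

    return _VARIABLE.sub(replace, path)


env = {"STADENROOT": "/usr/local/staden"}  # hypothetical installation root
print(expand_path("$STADENROOT/tables/gaprc", env))
print(expand_path("STADENROOT/tables/gaprc", env))
```

The second call returns its argument unchanged, since the directory name merely happens to match an environment variable and carries no dollar symbol.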