Version-2000.0 Release Notes

James Bonfield, Kathryn Beal, Matthew Betts, Mark Jordan and Rodger Staden

This release has been slow to come out because we wanted it to coincide with, and be identical to, the first Windows 9X/NT release of the package. Achieving this has taken longer than planned, but has left us with the same source code for all systems, and so in a strong position for future developments. We have also given the manual a major overhaul and produced separate versions for UNIX and Windows. The different manuals include their system-specific screendumps and we have also introduced alternatives for users with two-button mice. (The middle mouse button may now be simulated by using the Alt key with the left mouse button.) A new "mini-manual", giving a quick (45 page) introduction to the programs, has also been written. Finally, we now also include some notes and data for a course in using the sequence assembly tools. These may be found in the 'course' subdirectory.

Note that only the "modern" programs have been ported to Windows (i.e. pregap4 and its ancillary programs like vector_clip, screen_seq, ..., trev, gap4, nip4 and sip4). To complete the equivalence between the UNIX and Windows versions of the package we have removed all the "old" (mostly FORTRAN) programs from the UNIX release. These old programs (such as nip, sip, ...) are now freely available to anyone as Digital UNIX, IRIX, Solaris and Linux binaries from our ftp site and will not be upgraded in the future. The Windows version will only be available through commercial distributors, but the UNIX versions will still be available directly from us at LMB.

A major change is the addition of a software licensing system. This means that the package can be made available to anybody in demonstration mode, and can be switched to full functionality with a key specific to that copy of the package on that machine. We can also provide temporary licence keys that will enable full functionality for a limited period. We plan for these temporary licences to be made available to applicants who simply fill in a web form.

To accompany the demo versions of the package we have included a set of data files. In demonstration mode, only these files can be loaded into the programs.

Program version numbers

Gap4 4.5

Nip4 1.2

Sip4 1.3

Pregap4 1.1

Trev 1.5

Operating systems

The binaries have been created in the following build environments. Typically newer environments for the same operating system should work fine, but older systems may not. (For example, the binaries will not run under RedHat Linux 5.x.)

Digital Unix 4.0E
Irix 5.3
RedHat Linux 6.1
Solaris 2.6

Demo data sets

All pathnames listed below are relative to the installation root for the package.

Here is a list of sequences which may be loaded into Sip4.

userdata/atpase.embl
userdata/blue.seq
userdata/cemyo1.seq
userdata/ecoli.0*
userdata/lambda.seq
userdata/lorist6.seq
userdata/m13mp18.seq
userdata/m13mp7.seq
userdata/mysa_drome.embl
userdata/mysa_human.embl
userdata/mysd_caeel.seq
userdata/pjb8.seq
userdata/puc18.seq

For a good example of protein-protein similarity plots, try using mysa_drome.embl and mysa_human.embl.

For DNA-protein plots, try using cemyo1.seq against mysd_caeel.seq.

To see how Sip4 handles large sequences try using ecoli.00003 and lambda.seq. This is a large comparison: 250Kb against 48.5Kb. Hence the slower searches, such as Find Similar Spans, will take a long time. We suggest searching with Find Matching Words using a word length of 12.
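
As an illustration of why an exact word search copes well with sequences of this size, here is a minimal Python sketch that finds matching words using a hash index. This is not Sip4's implementation: the function name matching_words is hypothetical, only the forward strand is considered, and the comparison is case-sensitive.

```python
from collections import defaultdict


def matching_words(horizontal, vertical, word_length=12):
    """Return (i, j) pairs where the word starting at horizontal[i]
    is identical to the word starting at vertical[j] (0-based)."""
    if word_length < 1:
        raise ValueError("word_length must be at least 1")

    # Index every word of the horizontal sequence by its start position.
    index = defaultdict(list)
    for i in range(len(horizontal) - word_length + 1):
        index[horizontal[i:i + word_length]].append(i)

    # Look up every word of the vertical sequence in that index.
    hits = []
    for j in range(len(vertical) - word_length + 1):
        for i in index.get(vertical[j:j + word_length], ()):
            hits.append((i, j))
    return hits
```

In this sketch each position of each sequence is visited once, so the work grows roughly with the sum of the two sequence lengths plus the number of hits, rather than with their product as it would when scoring a window at every pair of positions.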

For Nip4, the following sequences are accepted.

userdata/atpase.embl
userdata/blue.seq
userdata/cemyo1.seq
userdata/ecoli.0*
userdata/lambda.seq
userdata/lorist6.seq
userdata/m13mp18.seq
userdata/m13mp7.seq
userdata/pjb8.seq
userdata/puc18.seq

Gap4 in demonstration mode only allows access to one data set at present. This has been base called with phred. The trace files may also be viewed in Trev.

demo/gap4/DEMO.0

This database is a section from a C. elegans cosmid, still in several contigs. You may try joining and editing contigs. The full functionality of Gap4 is available except for assembling or disassembling sequences.

For Pregap4 there is no demonstration mode. Pregap4 may be started and configured, but it will not process any data.

Sequence Library Access Using SRS

We have not upgraded our sequence library interface to SRS 6 and are aware of some problems in our use of SRS 5. We hope to include a Web-based interface in a later release.

Linux, Gnome and Enlightenment

There is a known problem with the Gap4 contig editor when using Enlightenment as the window manager. This is the default for Gnome, but it is not known whether the problem arises when using Enlightenment in other environments. The symptom is that the program will terminate instantly upon starting the contig editor with a complaint about X_ConfigureEvents.

The solution is to change window managers, which may be adjusted using the Gnome control panel.

Change log

Here is a list of changes since the 1999.0 (patched) release.

Gap4

Changes

New plot: confidence values.
New plot: reading coverage.
New plot: readpair coverage.
Primer suggestion from within the editor now ignores padding characters.
Searching by sequence from the editor now ignores padding characters and searches both strands. Optionally it may also find non-exact matches (mismatches, but not insertions or deletions) and can search top, bottom or both strands.
An extra editor search method is available: unpadded position search. This makes it possible to jump to a specific unpadded position within a padded consensus.
The contig editor base numbers may now be shown as unpadded positions. Note this may be slow on very large projects as it will require frequent recalculation of the complete consensus.
Confidence values can now be plotted in the Trace Display.
The trace display now lists the original base confidence in its information line.
Primer and Chemistry information is now visible as single character codes in the trace display.
The full reading name is now shown in the trace display. It is superimposed over the top-left corner of the trace.
The trace file search path is now adjustable from the options menu (although the RAWDATA environment variable may still be used if desired). This writes a database note (of type RAWD) and so the search path will be remembered.
New commands in the option menu: set alignment scores and set genetic code.
The alignment weight tables for Gap4 are now configurable; they are stored in $STADTABL/nuc_matrix. The alignment gap open and gap extension penalties may also be changed. The matrix file and penalties are stored in tables/gaprc as the ALIGNMENT.MATRIX_FILE, ALIGNMENT.OPEN.COST and ALIGNMENT.EXTEND.COST variables.
Improved Find Internal Joins alignment algorithm. This is now a banded alignment, and so is much faster. Also improved the "extended consensus" calculation such that there are no more discontinuities in the sequence (caused by padding). This does not fix any bugs, but the change improves the sensitivity of find internal joins.
The template display now separates out the plotting of consistent and inconsistent templates.
The colours in the Template Display for reverse and custom-reverse primers have been changed from shades of grey to orange and orange-red.
The consensus algorithm now has a "Display IUB codes in consensus" mode. The exact definition depends on the consensus algorithm being used.
The consensus algorithm may now be chosen from within the contig editor.
The consensus, when produced in fasta format, now uses the contig identifier as a fasta sequence name instead of a reading number. The reading number is included as a fasta comment after the contig identifier.
Gap4 now has "user levels"; currently only "beginner" and "expert". Gap4 will start in beginner mode. Use the Options menu to change to expert mode. Beginner mode has several of the less-often used functions removed.
The contig editor now has the notion of a reference sequence. This is the sequence to which base numbering should be applied. The sequence numbering can optionally start from an offset other than 1 and may wrap-around at a predefined length. All these details are written into the contig REFS note.
Pressing the middle mouse button on the reading names in the editor will now 'copy' the name to the paste buffer, allowing for easy cut-n-paste.
The control-h keybinding to remove readings from within the contig editor now also works when the mouse pointer is above the sequence names sub-window.
Using the popup-menu (right mouse button) in the contig editor now also moves the editor cursor. This makes displaying of tag information (for example) more intuitive.
Added a gaprc parameter to control whether the Delete key should act in a Motif/Windows style or an Emacs style. Defaults on for Windows, off for Unix.
Added an assembly interface to cap3. (Cap3 is not included.)
Multiple files for assembly may be picked directly from the assembly dialogue without needing to first create a file of filenames.
Added a new gap4 command line switch "-menu_file". This may be used to customise local menu configurations. Try "gap4 -menu_file mito" for example; see $STADENROOT/tables/gaprc_menu_mito.
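
The unpadded position search and the unpadded base numbering listed above both depend on converting between padded and unpadded consensus coordinates. The following Python sketch shows one way to do that conversion. It is an illustration only, not Gap4 code: the function names are hypothetical, "*" is assumed to be the pad character, and positions are assumed to be 1-based.

```python
PAD = "*"  # assumed pad character


def unpadded_to_padded(consensus, unpadded_pos):
    """Return the 1-based padded position of a 1-based unpadded position."""
    if unpadded_pos < 1:
        raise ValueError("positions are 1-based")
    seen = 0  # number of non-pad bases passed so far
    for padded_pos, base in enumerate(consensus, start=1):
        if base != PAD:
            seen += 1
            if seen == unpadded_pos:
                return padded_pos
    raise ValueError("position is beyond the end of the unpadded consensus")


def padded_to_unpadded(consensus, padded_pos):
    """Return how many non-pad bases lie at or before a 1-based padded position."""
    if not 1 <= padded_pos <= len(consensus):
        raise ValueError("position is outside the padded consensus")
    return sum(1 for base in consensus[:padded_pos] if base != PAD)
```

Each lookup in this sketch scans the consensus from the start, which gives a feel for why displaying unpadded numbers can be slow when the complete consensus must be recalculated frequently.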

Bug fixes

Specifying ranges for masked consensus was masking the wrong regions. (This worked correctly when all the consensus was output.) This affected both saving of masked consensus and the masked assembly options.
Attempting to align sequences in the join editor with an overlap of greater than 8Kb could crash Gap4. Now works with any length (although it may be slow).
Removed a problem with the "lock" mode of the join editor: by extending cutoffs it was possible to `break' the lock position.
The colours used for highlighting disagreements in the join editor have now been fixed.
Fixed problems with unneeded cursors being left in various graphical plots after joining contigs.
The trace display could crash when scrolling traces immediately after quitting an individual trace display.
Diff against consensus trace (contig editor) was sometimes crashing.
Fixed crash when trying to display multiple diffs between the same pair of sequences.
Using "Quality Clip" on databases that were not base called with a phred-scaled confidence system would incorrectly clip.
Fixed a crash with searching by reading name within the editor.
Changed the focus bindings for the contig editor. This should prevent the auto-raising problems that some systems have.
Added back the "lost" editor menu item: group readings by template.
Fixed a problem with the editor "search by file" mechanism and specifying reading names when there are more than 10,000 sequences assembled.
Dump Contig (contig editor) using line lengths of > 300 would sometimes crash gap4. It now supports up to 1000 (the limit allowed by the dialogue).
The day field of dates listed in the Note Selector window was incorrect. (NOTE: This was not a year 2000 problem, but simply a mixup of the date formatting.)
Fixed a crash in Find Repeats when it found many copies of a repeat on both strands.
The contig comparator now ensures that all items are visible (even if it means moving them). Previously there were some cases where joins found in the cutoff data of the right-most contig were not visible (although the "next" button worked).
Fixed bug with the crosshairs in contig selector/comparator, which when zoomed up displayed the position of the crosshairs incorrectly.
Fixed a problem when trying to use the popup-menus from the restriction enzyme plot.
The restriction enzyme selector sometimes had problems updating the enzyme list when using the filebrowser to pick a "personal" enzyme file.
Removed a crash in the restriction enzyme plot, triggered when leaving the plot displaying while making a contig join.
Fixed window resizing problems in the restriction enzyme plot.
The default confidence value for sequences without confidence values is now 2, instead of 99.
Improved the error messages when attempting to assemble non-existent files.
The debug "log file" was switched after a copy database command: subsequent logs went to the copy's log file.
Local GTAGDB files did not correctly override tags found in the tables directory.
The show relationships output was sometimes poorly tabulated.
The difference clip function no longer quality clips the extreme start and end sequences. This depended on correct confidence values and sometimes failed when they were not present.

Nip4

Changes

Added a general-purpose weight matrix search function.
A new program, make_weights, is available to produce weight matrices from a set of aligned sequences.
The weight matrix searching functions now use log-odds scores.
The splice junction search is improved for human data - using a better weight matrix (which is now also user-definable). The new search uses log-odds scores.
The gene search functions now all use log-odds scores, which means they may be compared to one another easily. The codon preference search has been generalised; it now has the ability to make use of coding and non-coding tables (automatically generating non-coding tables from the coding table if it is missing), and hence now the base preference search is a sub-option of the codon preference search. There are now also options to normalise to average amino acid composition and to use the amino acid composition alone.
Stop codons may now automatically be plotted when requesting a gene search plot.
The author test gene search now requests percentage error instead of window length.
Filenames may now be specified on the command line.
Codon tables may now be appended to existing tables. This concatenated codon table file is required for some of the newer search options.
Codon tables may now be written to the output window without needing to save them to disk.
Improved the genetic code selection.
The string search options can now distinguish between IUB codes and "literal" characters.
Changed the default scoring for the tRNA search and improved the use of conserved bases.
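
Several of the changes above refer to weight matrices and log-odds scores. As a rough illustration of the idea, the Python sketch below builds a log-odds matrix from a set of aligned sequences and uses it to score windows of a target sequence. It is not the make_weights or Nip4 algorithm, and it does not use their file formats: the function names, the uniform background frequency of 0.25, the pseudocount of 1 and the use of natural logarithms are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_log_odds(aligned, pseudocount=1.0, background=0.25):
    """Build one {base: log-odds score} table per alignment column."""
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    matrix = []
    for col in range(length):
        counts = {base: pseudocount for base in BASES}
        for seq in aligned:
            base = seq[col].upper()
            if base in counts:  # ignore anything other than A, C, G, T
                counts[base] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log((counts[b] / total) / background) for b in BASES}
        )
    return matrix


def score_window(matrix, window):
    """Sum the column scores for a window; unknown characters score 0."""
    return sum(col.get(b, 0.0) for col, b in zip(matrix, window.upper()))


def best_hit(matrix, sequence):
    """Return (score, 0-based start) of the highest scoring window."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the weight matrix")
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

Because every score is a log ratio against the same background model, a score above zero means the window looks more like the motif than like background, which is the property that lets scores from different searches be compared with one another.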

Bug fixes

Fixed bug in the stop codons plot where the determination of the reading frame was incorrect if the starting position was not 1.
Fixed window resizing problems in the restriction enzyme plot.
Fixed bug with base bias search if all the results are 0 - this caused the program to crash.
Fixed miscellaneous bugs dealing with moving and superimposing plots.

Sip4

Changes

Filenames may now be specified on the command line. The last two will be the horizontal and vertical sequences.
Added a "nearest dot" option to the sequence display which moves to the visibly-nearest point in the 2D dotplot, rather than the mathematically nearest match.

Bug fixes

The local alignment code now properly handles mixed case alignments.
Sip4 would crash if a comparison function was performed on a sequence which doesn't have a library (e.g. an aligned sequence).
Switched the default values for the gap start and gap extend penalties (they were the wrong way around).
Both sip4 and nip4 no longer complain when the default sequence library cannot be found.

Pregap4

Changes

New module: qclip (Quality Clip) which replaces the old clip module. This can clip a sequence based on the average confidence values assigned by eba, phred, ATQA or other similar tools.
Added an ATQA module.
Added an extract_seq module.
Wrote a new program find_renz which searches for a restriction enzyme site in a vector file, and added a corresponding hook to pregap4 so that we can now type in (eg) SmaI to the cut-site box and pregap4 will work things out accordingly.
Included a more complete vector-primer file. The sequencing vector clip module now has an interface to select subsets from this file.
Multiple files may now be picked for processing directly from the pregap4 GUI without needing to first create a file of filenames.
Improved handling of files contained within remote directories. Pregap4 now has an output directory, allowing the results from processing data spread over several directories to be stored within a single output directory.
The E. coli genome used by Screen Sequences has now been split up into several small chunks, which slightly speeds up the search and uses much less memory.
The blast screen module can now output tags if desired. This tags all matches, regardless of whether the sequence meets the specified "match fraction". Hence by setting the match fraction to greater than 1.0 this module may be used to add arbitrary match tags to sequences without rejecting any of them.
The E value is now adjustable in the blast module.
The cross_match module now works with the newer cross_match releases (which no longer support experiment file format).
The cross_match module now has a gap_size parameter (adjustable only from the pregap config files). Matches found within $gap_size of the end of the sequence or within $gap_size of another match are stitched together. gap_size is initially set to 15 bases.
Added an "other arguments" interface to the phrap module.
We now report simple status information such as "module x needs configuring" in the blank space between the Run and Help buttons at the bottom of the configure window.
The "save all parameters" buttons have had their names changed to (hopefully) become clearer.
A new function has been added, named "Load Naming Convention". This is identical to "Include Config Component" except for the default location for the file-browser.
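
The gap_size behaviour of the cross_match module described above can be pictured as interval merging with a tolerance. The Python sketch below is an illustration under stated assumptions, not the pregap4 code: the function name stitch_matches is hypothetical, matches are assumed to be 0-based half-open (start, end) pairs, and "within gap_size" is assumed to mean a distance of at most gap_size bases.

```python
def stitch_matches(matches, seq_len, gap_size=15):
    """Merge matches that lie within gap_size of each other, and extend
    matches that lie within gap_size of either end of the sequence."""
    stitched = []
    for start, end in sorted(matches):
        if not 0 <= start < end <= seq_len:
            raise ValueError("match lies outside the sequence")

        # Snap to the sequence ends when close enough to them.
        if start <= gap_size:
            start = 0
        if seq_len - end <= gap_size:
            end = seq_len

        # Join to the previous match when the gap between them is small.
        if stitched and start - stitched[-1][1] <= gap_size:
            stitched[-1][1] = max(stitched[-1][1], end)
        else:
            stitched.append([start, end])
    return [tuple(match) for match in stitched]
```

For example, with a 240 base sequence and the default gap_size of 15, the matches (5, 40), (50, 80) and (200, 230) become (0, 80) and (200, 240).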

Bug fixes

Removing a module with the Add/Remove modules command no longer gives Tcl error messages when the module being removed is the currently active module.
The saving of the sequencing vector clip module parameters now correctly saves the vector-primer file details.
Fixed problems when dealing with configuration files produced on different systems with different $STADENROOT settings.
Improved error handling when dealing with third-party tools which have not been installed on the local system (eg Phred, RepeatMasker, etc).
Improved error handling in phred module.

Trev

Changes

Now displays sequence confidence values.
Can now handle multiple files from within the "open" command.
Printed trace files may now have a title, which defaults to the trace file name.
Saving in plain-text format now only writes out the good quality region.

Bug fixes

Trev failed to correctly write edited confidence values back to experiment files.
The "unable to configure trace options" message no longer appears when failing to load a file.
The "undo clipping" command failed when the previous right-hand clip point was the very end of the sequence.
Certain "corrupt" trace files would cause crashes in the trace printing code, specifically when base calls are positioned beyond the end of the 'sample' data.

Misc

Changes

We now include a full tutorial on using the sequence assembly tools, including documentation and data. This may be found in the "course" subdirectory, but also please check for newer versions at ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/course/
Vector_clip now has a minimum length of 5' sequence to match when using vector_primer files. This allows us to over-specify the vector sequence so that we can use the same vector_primer file line for multiple primer sites (with the same cut site).
Vector_clip now writes SF and PR records to the experiment files (derived from the matches found). If no match is found, no PR record is written. If the vector rearrangement search finds there is no SF record in an experiment file it writes a CC record to that effect, but writes the reading name to the passed file of file names.
Vector_primer file format simplified: the numeric values included in case they were needed in the future have been dropped.
Screen_seq will now add tags to the failing sequences.
Extract_seq can now handle multiple files on the command line. It also has a -fasta_out option which, when combined with specifying multiple files, provides a handy way of converting many Experiment files into a single fasta file.
Init_exp now has a "-conf" option to force the confidence values to be written to the Experiment Files (even when no edits have been made).
The order of SCF comments written by makeSCF has been sanitised. The textual date format has also been fixed (it was two months out, although the numerical date output was correct).
Online help now automatically looks for Netscape for viewing HTML documents.

Bug fixes

Gap4, Sip4 and Nip4 work better with small screen sizes (eg 800x600) although this still isn't perfect. We recommend a screen resolution of at least 1024x768.
The pathname expansion functions no longer expand environment variables unless a dollar symbol precedes them.
In vector_clip, default 5' positions that are higher than the sequence length were causing crashes.
Fixed a bug with the cosmid clipping in extract_seq. It worked fine with CS lines, but was incorrect for Experiment files using a CL and CR notation.
The sequence format recognition in Nip4 and Sip4 would fail when a protein sequence in plain text format started with "SQ".
More robust handling of ABI format data. We make up data with 0 level samples when the base data extends beyond the stored sample data.
Improved the sequence format recognising code used within nip4 and sip4.
MakeSCF no longer adds duplicate MACH fields to the SCF comments.
Fixed a trace rescaling bug in MakeSCF. This only caused problems when using the -normalise command line switch.
Using non-default colours now works better. (Certain dialogue components were ignoring the user-defined colours.)