Version-2001.0 Release Notes

James Bonfield, Kathryn Beal, Yaping Cheng, Mark Jordan and Rodger Staden

The major visible changes in this release include a new sequencing project experiment suggestion program, "pre_finish"; spin, a replacement for and combination of nip4 and sip4 which, very importantly, provides the first graphical user interface to EMBOSS (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/); plus improvements and additions to our existing programs pregap4, gap4 and trev.

At the time of the last release we stated that the MS Windows version would only be available commercially, but we are very pleased to say that at the beginning of this year we obtained permission to distribute the MS Windows version under identical conditions to those for the UNIX versions. The functionality of the programs is the same on all systems (except that there is no EMBOSS release for Windows).

Providing the MS Windows version has had quite a big startup cost in terms of programming effort, but the problems have now been solved and lessons learnt, and it should not slow down progress in the future. Simple things like allowing spaces in file names or different window manager behavior can create a lot of work.

This release is the first to use our new trace file format ZTR which has advantages of reduced file size and flexibility over SCF and removes the need for the use of external compression programs. Over time we expect it to replace SCF as the preferred trace file format. In addition to ZTR, Gap4 now also contains an interface to Perkin Elmer's BioLIMS database.

Sequencing projects are growing ever larger, for example the shotgunning of whole bacteria, and although it may not be evident to users doing smaller projects, a major improvement in this release of gap4 is the numerous speedups of what were becoming slow tasks. These larger projects also require more readings, so we have increased the maximum number of readings to 99,999,999; some sites are already using around 200,000. The permitted length of reading names has also been increased to 40 characters.

We have always recommended making the fewest possible changes to the original reading data. This results in many pads appearing in contigs (which are stripped when the final consensus is created) and has meant that some searches in gap4 failed when the targets included pad characters. We have improved some of the affected routines by stripping pads before the searches are applied.
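
The idea of stripping pads before searching can be sketched in Python as follows. This is an illustration only and not gap4's own code: the pad character is assumed to be "*", and the function names strip_pads and find_in_padded are invented for this example. The sketch removes the pads, searches the unpadded sequence, and then maps each hit back to padded coordinates.

```python
PAD = "*"  # assumption: pads are represented by "*"


def strip_pads(padded):
    """Return the unpadded sequence plus, for each unpadded base,
    its 0-based index in the original padded sequence."""
    bases = []
    positions = []
    for i, ch in enumerate(padded):
        if ch != PAD:
            bases.append(ch)
            positions.append(i)
    return "".join(bases), positions


def find_in_padded(padded, target):
    """Find every occurrence of target in padded, ignoring pads.

    Returns a list of (start, end) pairs in padded coordinates
    (0-based, end inclusive)."""
    unpadded, positions = strip_pads(padded)
    target = target.replace(PAD, "")
    hits = []
    if not target:
        return hits
    start = unpadded.find(target)
    while start != -1:
        end = start + len(target) - 1
        hits.append((positions[start], positions[end]))
        start = unpadded.find(target, start + 1)  # allow overlapping hits
    return hits


consensus = "ACG*TT*GCA"
print(consensus.find("GTTG"))              # -1: the plain search is defeated by the pads
print(find_in_padded(consensus, "GTTG"))   # [(2, 7)]
```

Keeping the list of original positions is what allows a hit to be reported in the padded coordinates that the rest of the display uses.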

Hyperlinks have been introduced to results in the Output window and to the results produced by the new reading name, template name and tag content searches. The hyperlinks can be used to invoke the Contig Editor etc.

The gap4_viewer has been free for some time, even for commercial users, and provides full gap4 functionality in a read-only mode. Making the gap4_viewer free to all means that anybody (including commercial sequencing companies returning results to customers) can usefully send their gap4 databases to colleagues at other sites: the databases are machine independent and anybody with access to MS Windows or UNIX/LINUX could obtain and use the viewer. Anyone who has done any sequencing will know that seeing the assembly and the traces is the best way of assessing the reliability of the consensus. From this release we will not be distributing a separate gap4_viewer as we have incorporated its limited functionality into the standard version of gap4. This means that the downloaded version of gap4 will work with full functionality on the demo and course data included in the package and in viewer mode on any other data. Full functionality on all data will require a licence as before (free to academic users).

A more detailed list of the program changes made is given below. The manual has also been updated. Whenever we receive queries on topics that are not documented or which are inadequately explained we add to or improve the corresponding sections in the manual.

Note that we welcome comments and suggestions about the package, particularly ideas about what users would like to see added. For this release we are looking especially for sites that would like to try out our new experiment suggestion program (pre_finish) and to contribute to the design of its user interface. See the $STADENROOT/lib/finish/METHODS file for more information.

Program version numbers

Gap4 4.6
Spin 1.0
Pregap4 1.2
Trev 1.6

Operating systems

The binaries have been created in the following build environments. Typically newer environments for the same operating system should work fine, but not necessarily older systems. (For example, the binaries will not run under RedHat Linux 5.x, but will run on RedHat Linux 7.x)

Digital Unix 4.0E
Irix 6.5
RedHat Linux 7.1
Solaris 8

Demo data sets

The course (see course/*_docs/*.pdf) may be run when in demonstration mode (ie without needing a licence). Specifically all demonstration data files are considered as valid sequences and so are exempt from the licence restrictions. All pathnames listed below are relative to the installation root for the package.

Here is a list of sequences which may be loaded into spin:

userdata/5H1E_HUMAN.seq
userdata/atpase.seq
userdata/cemyo1.seq
userdata/ECAE129.seq
userdata/ecoli.0*
userdata/lambda.seq
userdata/mysd_caeel.seq
userdata/mysa_human.seq
userdata/mysa_drome.seq
userdata/spin_dna.seq
tables/vectors/lorist6.seq
tables/vectors/m13mp18.seq
tables/vectors/m13mp7.seq
tables/vectors/pBs.seq
tables/vectors/pgem3zfm.seq
tables/vectors/pgem3zfp.seq
tables/vectors/pgem5zfm.seq
tables/vectors/psc194.seq
tables/vectors/puc18.seq

For a good example of protein-protein similarity plots, try using mysa_drome.seq and mysa_human.seq.

For DNA-protein plots, try using cemyo1.seq against mysd_caeel.seq.

To see how spin handles large sequences try using ecoli.00003 and lambda.seq. This is a large comparison: 250Kb against 48.5Kb. Hence the slower searches, such as Find Similar Spans, will take a long time. We suggest searching with Find Matching Words using a word length of 12.
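
Word matching is quick on sequences of this size because identical words can be found by table lookup rather than by scoring every pair of positions. The following Python sketch shows the general idea of word matching with a hash table. It is an illustration only and is not spin's actual implementation; the function name find_matching_words and its parameters are invented for this example.

```python
from collections import defaultdict


def find_matching_words(seq_a, seq_b, word_length=12):
    """Return (pos_a, pos_b) pairs (0-based) at which seq_a and seq_b
    contain an identical word of the given length."""
    # Index every word of seq_a by its text.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_length + 1):
        index[seq_a[i:i + word_length]].append(i)

    # Look up every word of seq_b in that index.
    matches = []
    for j in range(len(seq_b) - word_length + 1):
        for i in index.get(seq_b[j:j + word_length], ()):
            matches.append((i, j))
    return matches


seq_a = "TTTTACGTACGTACGGAAAA"
seq_b = "CCACGTACGTACGGCC"
print(find_matching_words(seq_a, seq_b, word_length=12))  # [(4, 2)]
```

The work grows roughly with the sum of the two sequence lengths plus the number of matches found, which is why a longer word length, giving fewer chance matches, suits a large comparison.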

Gap4 in demonstration mode allows access to:

demo/gap4/traces.tar
A tar of trace files base called with phred. Gap4 will automatically read the files directly from the tar file (via the traces.tar.index lookup file).
demo/gap4/DEMO.0*
This database is a section from a C. elegans cosmid, still in several contigs. You may try joining and editing contigs. The full functionality of Gap4 is available except for assembling or disassembling sequences.
course/data/shotgun_data/* trace and experiment files
course/data/long_reads/* trace and experiment files (long reads)
course/data/ABI_Data/* original ABI files
course/data/phred_data/* trace files base called with phred
course/data/mutations/* scf files for mutation studies

Pregap4 in demonstration mode allows access to the same files listed for Gap4.

Sequence Library Access Using EMBOSS

The spin interface to EMBOSS provides access to the sequence libraries provided you know the names of the files you want to extract!

Linux, Gnome and Enlightenment

There is a known problem with the Gap4 contig editor when using Enlightenment as the window manager, although this may have been fixed by now. Enlightenment is the default window manager for earlier versions of Gnome, but it is not known whether the problem also arises when using Enlightenment in other environments. The symptom is that the program terminates immediately upon starting the contig editor, with a complaint about X_ConfigureEvents.

The solution is to change window manager, which may be done using the Gnome control panel.

Change log

Here is a list of changes since the 2000.0 release.

Gap4

Changes

Major changes to Find Internal Joins, consisting of new hashing and alignment algorithms. The new alignments allow Find Internal Joins to deal with very long sequences and sequences containing many pads. There now exists a fast (less sensitive) mode and a slower (more sensitive) mode. It is recommended that the fast mode is used first to join any particularly long matches.
The join-editor "align" algorithm has been greatly improved. It aligns long sequences quickly and works better with highly padded data.
Gap4 database I/O now uses less memory and is much faster at opening databases.
Large speed improvement in directed assembly. Speed can be increased further by setting the minimum percentage mismatch to zero. This disables updating of the consensus, so it should only be used when no alignments are needed (for example when inputting data from phrap).
Large speed improvement to Enter Tags.
Determining reading and template numbers from their names is much faster. This speeds up many algorithms.
The contig editor has a "store undo" toggle, which may be used to prevent undo information being stored. This gives useful memory saving in some complex cases, such as trying to align two very long sequences or using shuffle pads on very long contigs.
New search commands in the list menu: search by reading name, template name and annotation contents. These produce clickable-lists, to bring up the contig editor or template displays centred at the relevant locations.
The text output window now contains 'hypertext' links for reading names. This allows the user to left-click to bring up the contig editor centred on that reading, or right click to get a list of options (template display and display notes). At present only the output from the new search tools and from show relationships has these links. Please let us know if you would like further sequence-links.
Improved shuffle pads algorithm which should work better on data with a high sequence depth. However it doesn't work better in all cases. If people prefer the older method then please tell us.
The use of hidden data in some algorithms (Find Internal Joins, Check Assembly, Save Consensus) now allows selection of the best poor quality data by analysis of average confidence value.
Directed assembly now has an "enter all readings" mode. If a sequence fails to enter then subsequent sequences which depended on it will now start new contigs (instead of also failing).
More chemistry information is now available, including BigDye terminator, Licor, etc.
Now correctly handles more than 99,999 sequences - now up to 99,999,999.
Reading names can now be up to 40 characters long. To help control the ever-lengthening output the name display width can be adjusted in the Contig Editor's dump contig command.
(BETA) Interface to PE's BIOLIMS database. This includes direct reading and writing of both trace files and sequence assemblies.
ZTR, the new trace file format, can be used by gap4.
Consensus output in fasta format now allows characters other than a,c,g and t. (This is not strictly correct, but makes fasta output more useful in conjunction with masking and marking.)
New "NOTE" type 'INFO'. Notes of this type contain user-defined comments and are displayed in the contig editor information line.
(BETA) There is now a new program (pre_finish) to automatically choose finishing reactions. At present this deals with long gel readings, resequencing (using different chemistries), primer walks and clone walks. It is completely customisable, but at present has no easy to use interface.
Improved the database.log file information, for easier debugging. It now also contains the output from the Check Database function.
The "use special chemistry" (double chemistry implies double stranded) option in the consensus algorithm dialogue box is now disabled by default.
The ruler lines displayed in the template display (and similar plots) now have the ruler ticks shown (by default). More ticks are now also shown.
Restriction enzymes can now be detected in data containing pads.
Traces can now be manually complemented within the trace display.
The editor search by "quality" mode now ignores pads and so-called "bad data" (N's) as problems when the user is using the confidence values consensus mode.
Invoking the contig editor from the contig selector or template display ruler now brings up the editor with the editor cursor positioned approximately where the popup menu was placed.
The maximum database size (-maxdb command line argument) can be set within gap4 itself, although this only takes effect when opening the next database.
Directed assembly automatically increases both maxseq and maxdb when entering large external assemblies. (Unfortunately this is still not true for the Normal Shotgun Assembly function.)

Bug fixes

Worked around an X server bug on Linux, where the X server could crash after changing the number of columns of traces displayed.
Minor improvements to trace-editor cursor positioning.
The phrap assemble option was stripping lower-case t's from sequence names when supplied via a list (but worked via the more normal route of a file-of-filenames).
Remove file-descriptor leak when changing genetic code tables.
Fixed a problem with list confidence where the error rate could sometimes overflow (only a problem for extremely low error-rates).
Contig editor searching (backwards only) for "edits" could sometimes miss edits that were neighbouring, and in some cases could even crash.
Templates containing one sequence with both 5' and 3' vector are now correctly labelled. (Mainly affects read pair and template display functions.)
The colours for the read coverage and strand consistency plots were swapped around (now red is +ve and black is -ve).
The "list contigs" window (contig selector) was not working correctly when it was resorted - the popup menu would invoke the commands on the wrong contigs.
Solved some display glitches in the trace display when scrolling with the confidence values shown.
We now reject zero length sequences in directed assembly.
The Copy Database with garbage collection enabled no longer crashes when the database is opened in read-only mode.
Complementing contigs from within the template display would crash if the display was showing many contigs.
Fixed a bug in the template display where, if only the ruler was displayed and you zoomed in, scrolled, zoomed out and then zoomed again, the ticks were out of sync with the ruler.
Fixed the Screen Only assembly method. We broke the dialogue when updating the hidden data controls.
Joining contig A to contig B when contig B has the restriction enzyme map displayed would crash when the enzymes are next "touched" in the restriction enzyme map.
When zero sequences cover a consensus point (which is legitimately possible when marking sequences for removal) clicking on the consensus in the editor to display the confidence values could cause a crash. The display could also be invalid when scrolling. We now initialise the consensus confidences correctly in such cases.
Trace confidences on Solaris would sometimes be displayed incorrectly.
The template display "information" popup (from highlighting templates) now displays the observed template size when appropriate, and indicates the estimated size for templates which span contigs.
Fixed the problem where occasionally the editor "sequence names" panel would be blank (until a refresh occurs).
Some entries in the vector-primer file had the forward and reverse sequences wrong; these have now been switched around. The vectors are pBS, pgem* and puc18. Users should be aware that this means old projects had forward and reverse sequence attributes switched for sequences from these vectors.

Nip4/Sip4 (now spin)

Changes

Merged Nip4 and Sip4 into a single program named spin.
Added a graphical interface to the EMBOSS suite of programs.
New prokaryotic ribosome binding site weights file (PERCEPTRON.WTS).
Added E. coli promoter weight matrices (for spin) to the tables directory: tables/prokprom_35.wts and tables/prokprom_10.wts.
Alignments now use multiple symbols for displaying the similarity (controlled by the user). This is also visible in the spin two-sequence window.
EMBL feature tables are now loaded. At present these are used only by the translation code, allowing selection of one or more CDS features.
If EMBOSS is installed and setup, spin can fetch sequences directly from remote sequence databanks.
Now have the choice to save in EMBL and Fasta format. The EMBL format writes out the feature table, provided that the sequence has not been edited.

Bugs fixed

Fixes relating to the "Sequence manager" and sequence selection.
Fixed comparison functions which didn't update correctly when using a _rf123 sequence (which should be treated as a protein sequence).
Fixed comparison functions which didn't fill out some of the entry boxes if you pressed cancel and then brought up the same dialogue box again.
Fixed a bug in plot base composition and EMBOSS graphical functions when plotting over a range.
Fixed a crash on alphas which occurred when the y dimension was 0, eg plot base composition over a range (not including the ends).
Fixed a crash on alphas when searching for a string.
Fixed various sip functions which crashed or hung if no matches were found.
When using rf123 sequences, the program automatically creates 3 new translated sequences but added a unique identifier to the end of the name in the same way as the other new sequence generating functions do.
When using rf123 sequences over a range, the start position was not calculated correctly, which caused functions such as align sequences to crash.
The cursors in sequence pair display when there are several sequences in the display (eg when comparing dna vs protein) could not be selected as a pair.
Fixed the weight matrix search, which added the minimum y value to the score when it shouldn't have.
Fixed an uninitialised memory read causing crashes (mainly on Solaris) when reading experiment/EMBL files.
Zooming plots when the crosshairs are shown sometimes produced tcl error messages.
The default score for find similar spans was sometimes blank.
Blank lines in score matrices were producing incorrect results. We now also check for non square matrices, to check for missing rows or columns.
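
The two score matrix checks mentioned in the last item (ignoring blank lines and rejecting non-square matrices) can be sketched in Python as follows. The file layout used here is an assumption made for the example and not necessarily the format spin reads: a first line of column labels, followed by one line per row consisting of a row label and its integer scores. The function name parse_score_matrix is likewise invented.

```python
def parse_score_matrix(text):
    """Parse a whitespace-separated score matrix, skipping blank lines.

    Assumed layout: the first non-blank line holds the column labels;
    each later non-blank line holds a row label followed by its scores.
    Raises ValueError if a row is the wrong length or the matrix is not
    square (the row labels differ from the column labels)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # drop blank lines
    if not lines:
        raise ValueError("empty score matrix")

    columns = lines[0].split()
    rows = {}
    for ln in lines[1:]:
        fields = ln.split()
        label = fields[0]
        scores = [int(x) for x in fields[1:]]
        if len(scores) != len(columns):
            raise ValueError(
                f"row {label!r} has {len(scores)} scores, expected {len(columns)}"
            )
        rows[label] = dict(zip(columns, scores))

    if sorted(rows) != sorted(columns):
        raise ValueError("matrix is not square: row and column labels differ")
    return rows


matrix = parse_score_matrix("  A  C\n\nA  2 -1\n\nC -1  2\n")
print(matrix["A"]["C"])  # -1
```

Filtering blank lines before any row is interpreted means a stray empty line cannot shift rows out of step with their labels.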

Pregap4

Changes

Changed the widget-set used from Tix to incr-widgets. This solves some of the odd colour and font problems.
The mutation detection module now allows the forward and reverse strand wild-type traces to be specified independently.
The vector-primer file mode of vector clip is now the default mode.
ZTR support, to allow for the new compressed trace format.
The phred and ATQA modules now automatically convert their input files to SCF, so an explicit conversion is no longer required.
The ABI/ALF to SCF module has been replaced with a generalised convert trace file format module. This can convert to SCF, CTF and ZTR formats. Additionally it allows for trace normalisation and quantisation (down-scaling to 8-bit (for example) values).
The quality clip module may now also reject files if they are shorter than a specified length.
Made the eba module auto-convert ABI and ALF format to SCF. This means that it can be placed before the convert trace file module and hence is in a consistent position with similar tools (phred, atqa). The native output format is now ZTR.
The ticks and crosses are now [x] and [ ] respectively. The module name is also appropriately highlighted. This makes it easier to see which modules are enabled while also solving a problem with the dingbats fonts on some systems.
Removed the "Select parameters to save" buttons from each module. Please tell us if you really need this feature. (It caused too much confusion.)
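
To make the quantisation option of the convert trace file format module more concrete, here is a minimal Python sketch of down-scaling trace sample values to 8 bits. It shows one simple way of doing this (linear scaling so that the largest sample maps to the largest 8-bit value) and is not necessarily the method the module uses; the function name quantise_trace is invented for this example.

```python
def quantise_trace(samples, bits=8):
    """Linearly scale non-negative integer samples into 0 .. 2**bits - 1.

    Samples that already fit in the target range are returned unchanged."""
    top = (1 << bits) - 1            # 255 for 8 bits
    peak = max(samples, default=0)
    if peak <= top:
        return list(samples)
    # Integer arithmetic with rounding to the nearest value.
    return [(s * top + peak // 2) // peak for s in samples]


print(quantise_trace([0, 500, 1000, 2000]))  # [0, 64, 128, 255]
```

Scaling like this loses amplitude resolution, which is the price of the smaller file.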

Bug fixes

Blast module now handles very long filenames much better.
Fixed a problem in the RepeatMasker, Crossmatch and Blast modules when dealing with sequences containing lowercase letters.
Some parameters of the Extract Sequence module were not saved when "Save All Parameters" was used.
Better handling of duplicate sample names.
Fixed problems in the phred module when dealing with filenames containing odd characters (regexp meta-characters).
The enter assembly module now sets maxseq and maxdb correctly. It also now removes the BUSY file when an error occurs.
Solved a problem when loading files of filenames containing partially relative pathnames (eg D:dir/file).
Fixed a bug in the history tracking of files for the convert trace module. It was setting the file_orig_name to be file_id(name) instead of just name. Hence the pregap.log file contained incorrect values.
Screen_seq reported match positions using the coordinates from the wrong sequences when writing the tags i.e. the reading positions were from the vector and vice versa.
The cross_match pregap4 module can now handle spaces in filenames and has improved error handling for when cross_match fails.
The blast screen pregap4 module has problems with spaces in database pathnames. It now works with spaces in the directory components (achieved using the BLASTDB environment variable instead of a .ncbirc file), but the filename component still fails with spaces.
Fixed a bug in exp2read function which cleared the base positions regardless of whether the experiment file overrode them using ON records.
Exiting pregap4 when trev is open no longer produces trev error messages.

Vector_clip

Changes

No longer writes out primer type (PR line) information when this already exists in an experiment file.

Bug fixes

Fixed out-by-one positioning of matches on the reverse strand.
In vector_primer mode, the default left-clip value was incorrectly applied.
In vector_primer mode, if a match is found at 3' only then the SF line (vector filename) was incorrectly set.

Trace file handling (io_lib)

Changes

New formats supported: CTF and ZTR (highly compressed). A new program 'convert_trace' may be used to switch between the formats. Convert_trace subsumes the previous functionality of init_exp and makeSCF, although both still exist for backwards compatibility.
A new program get_comment is now used to extract the text fields from all formats of trace files. This replaces the getABI* and get_scf_field programs. Get_comment is used instead of these in pregap4.
Support for on-the-fly decompression via szip.
Traces can be read directly from tar files. This requires RAWDATA to be set to "TAR=filename.tar".
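
Reading a trace directly from a tar file, as in the last item, amounts to knowing where each member's data starts so that it can be read without unpacking the archive. The Python sketch below illustrates that idea using the standard tarfile module. It is not io_lib's code and does not reproduce the format of any index file used by the package; the function names and the example member names are invented.

```python
import io
import os
import tarfile
import tempfile


def build_index(tar_path):
    """Map each regular file in an uncompressed tar to (data offset, size)."""
    index = {}
    with tarfile.open(tar_path, "r:") as tar:  # "r:" accepts uncompressed tar only
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Read one member's bytes by seeking straight to its data."""
    offset, size = index[name]
    with open(tar_path, "rb") as fh:
        fh.seek(offset)
        return fh.read(size)


def demo():
    """Build a small tar of made-up 'traces' and read one back via the index."""
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = os.path.join(tmp, "traces.tar")
        with tarfile.open(tar_path, "w") as tar:
            for name, data in [("a.ztr", b"first trace"), ("b.ztr", b"second trace")]:
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        return read_member(tar_path, build_index(tar_path), "b.ztr")


print(demo())  # b'second trace'
```

Once an index exists, fetching any one trace costs a single seek and read, however many files the archive holds.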

Bug fixes

Fixed a minor problem dealing with space characters in plain text files.
Wrong timestamps were sometimes reported for ABI files.

Trev

Changes

No longer needs a licence; can freely view any file.

Bug fixes

More robust when dealing with partially complete traces; such as traces missing confidence values, base positions or even sequence.
The trace widget "sequence" method crashed if a trace wasn't loaded. This could be called from trev when bringing up the search box and then switching to a malformed trace without cancelling the search box.
Fixed a crash in trev where switching "next" to a trace which is malformed or does not exist caused the postscript printing options to be freed multiple times.
Allow deletion of the last base.
Fixed the trace printing code for traces with no trace data.

Misc

Changes

GUI updated to use tk8.3.x instead of 8.0.
Demonstration mode (ie no licence present) will now allow the course data to be used; processed via pregap4 and assembled and edited within Gap4.
Improved Unix package initialisation. The staden.login and staden.profile files are now optional when simply running the main (graphical) tools.
The licence file may now be pointed to by the STADLICENCE environment variable. Licences for multiple system types (unix/windows) and multiple package releases may now be mixed in the same file.
The demo data, as well as the course data, can be assembled and checked in demo mode.

Bug fixes

Removed a file-descriptor 'leak' in extract_seq.
Under Linux the MANPATH is now set correctly in the login setup.
Minor bug-fixes to eba (estimate base accuracies), potentially causing crashes (not observed).
Redirecting output from the text-output window could cause crashes on some systems (typically Linux).
Many fixes relating to the differences between the way MS Windows and UNIX window managers function. For example which window is "fronted".
Many fixes relating to the fact that MS frequently uses spaces (" ") in filenames.
Set the line width for spin plots to 0 (was 2). This greatly speeds up drawing on Windows.
Program colours under CDE and certain X windows emulators now appear correctly. We override the CDE system defaults. If you wish to change the Staden Package colours use the Set Colours command in the Options menu.