Version-2002.0 Release Notes, October 2002

James Bonfield, Kathryn Beal, Yaping Cheng, Mark Jordan and Rodger Staden

The most visible additions to this release are our improved methods for automatic mutation detection and for the visualisation and reporting of mutations within Gap4. For more information on this topic we have prepared a separate web page demonstrating the new mutation features. These changes affect both Gap4 and Pregap4. (Be sure to remove (or edit) any pregap4.config or config.pg4 files to see the new modules listed within Pregap4.)

Other key changes include: (i) the reading length limit of 30,000 bases in gap4 has been removed, so that "readings" of any length can now be assembled (as before, the length of a contig is effectively unlimited); (ii) EMBL format sequence files and their feature tables can now be assembled into gap4 databases; (iii) a new program, copy_reads, for copying useful finishing data from one gap4 database to another; (iv) a new program, polyA_clip, for trimming polyA/T from readings. A more detailed list of the program changes is given below.
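As a rough illustration of what trimming polyA/T from a reading involves, here is a minimal Python sketch. It is not the polyA_clip algorithm: the function name, the minimum run length of 10 and the exact-match rule are all assumptions made for the example.

```python
def clip_poly_tail(seq, min_run=10):
    """Return (start, end), the half-open range of seq to keep.

    Hypothetical helper: clips a run of T at the start and a run of A
    at the end when the run is at least min_run bases long.
    """
    s = seq.upper()
    start, end = 0, len(s)

    # Count the leading run of T.
    n = 0
    while n < len(s) and s[n] == "T":
        n += 1
    if n >= min_run:
        start = n

    # Count the trailing run of A, never reaching back past 'start'.
    m = 0
    while m < len(s) - start and s[len(s) - 1 - m] == "A":
        m += 1
    if m >= min_run:
        end = len(s) - m

    return start, end
```

A real clipper would probably also tolerate occasional miscalled bases inside the run; this sketch does not.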

This is also the first official release of the MacOSX version (10.2) of the package. Initially we created and made available an Aqua version, but we felt this was too buggy (although many users said that having the functionality of our package available on the Mac made the bugs seem insignificant), so this version requires X11. See the Apple XFree86 site to download X11. As far as we have checked, the MacOSX-X11 version appears to be as bug-free as the other versions of the package. Many of the bugs in the Aqua version were in the Tk libraries and it looked as though fixes would require greater knowledge of the Mac than we possess. We would be interested to receive comments about fixing these bugs and the need for an Aqua version. As the Mac version was built on 10.2 we are not sure if it will work correctly with 10.1.

Our experiment suggestion / finishing program, now named "prefinish", is being used to help complete around 500 clones per month at the Wellcome Trust Sanger Institute. The program is still under development, but is clearly very useful, and we welcome requests from others interested in trying it.

A large number of bugs have been fixed in this release (some are listed below), many with the help of purify and valgrind.

Program version numbers

Gap4 4.7
Spin 1.1
Pregap4 1.3
Trev 1.7

Operating systems

The binaries for this beta release have been created in the following build environments. Newer environments for the same operating system should typically work fine, but older systems may not. (For example, the binaries will not run under RedHat Linux 5.x, but will run on RedHat Linux 7.x.)

Digital Unix 4.0E
RedHat Linux 7.1
Solaris 8
Windows 2000
MacOSX 10.2

Demo data sets

The course (see course/*_docs/*.pdf) may be run when in demonstration mode (ie without needing a licence). Specifically all demonstration data files are considered as valid sequences and so are exempt from the licence restrictions. All pathnames listed below are relative to the installation root for the package.

Here is a list of sequences which may be loaded into spin:

userdata/5H1E_HUMAN.seq
userdata/atpase.seq
userdata/cemyo1.seq
userdata/ECAE129.seq
userdata/ecoli.0*
userdata/lambda.seq
userdata/mysd_caeel.seq
userdata/mysa_human.seq
userdata/mysa_drome.seq
userdata/spin_dna.seq
tables/vectors/lorist6.seq
tables/vectors/m13mp18.seq
tables/vectors/m13mp7.seq
tables/vectors/pBs.seq
tables/vectors/pgem3zfm.seq
tables/vectors/pgem3zfp.seq
tables/vectors/pgem5zfm.seq
tables/vectors/psc194.seq
tables/vectors/puc18.seq

For a good example of protein-protein similarity plots, try using mysa_drome.seq and mysa_human.seq.

For DNA-protein plots, try using cemyo1.seq against mysd_caeel.seq.

To see how spin handles large sequences try using ecoli.00003 and lambda.seq. This is a large comparison: 250Kb against 48.5Kb. Hence the slower searches, such as Find Similar Spans, will take a long time. We suggest searching with Find Matching Words using a word length of 12.
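To illustrate why a word-based search copes with a comparison of this size, the sketch below finds exact shared words by indexing one sequence in a dictionary. It is an illustration only, not spin's Find Matching Words code; find_matching_words and its arguments are hypothetical names.

```python
from collections import defaultdict

def find_matching_words(seq_a, seq_b, word_len=12):
    """Yield (pos_a, pos_b) for every exact word of length word_len
    shared by seq_a and seq_b (0-based positions)."""
    # Index every word of seq_a by its starting position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_len + 1):
        index[seq_a[i:i + word_len]].append(i)
    # Look up each word of seq_b in that index.
    for j in range(len(seq_b) - word_len + 1):
        for i in index.get(seq_b[j:j + word_len], ()):
            yield i, j
```

In this sketch the work grows with the two sequence lengths plus the number of matches found, rather than with the product of the lengths, which is why an exact-word search is a reasonable first pass on large sequences.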

Gap4 in demonstration mode allows access to:

demo/gap4/traces.tar
A tar of trace files base called with phred. Gap4 will automatically read the files directly from the tar file (via the traces.tar.index lookup file).
demo/gap4/DEMO.0*
This database is a section from a C. elegans cosmid, still in several contigs. You may try joining and editing contigs. The full functionality of Gap4 is available except for assembling or disassembling sequences.
course/data/shotgun_data/* trace and experiment files
course/data/long_reads/* trace and experiment files (long reads)
course/data/ABI_Data/* original ABI files
course/data/phred_data/* trace files base called with phred
course/data/mutations/* scf files for mutation studies
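
Reading a file directly out of a tar archive, as described for demo/gap4/traces.tar above, can be sketched with Python's standard tarfile module. This only illustrates the idea: it does not use or parse traces.tar.index, whose format is not described in these notes, and read_member is a hypothetical helper name.

```python
import tarfile

def read_member(tar_path, member_name):
    """Return the bytes of one archive member without unpacking the tar."""
    with tarfile.open(tar_path, "r") as tar:
        # extractfile() returns None for members that are not regular files.
        fobj = tar.extractfile(member_name)
        if fobj is None:
            raise ValueError(f"{member_name} is not a regular file")
        return fobj.read()
```

Without an index this approach has to scan the archive headers to find the member, which is presumably the cost that a separate lookup file avoids.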

Pregap4 in demonstration mode allows access to the same files listed for Gap4.

Change log

Here is a list of changes since the 2001.0 release.

Gap4

Changes

Removed the maximum single-sequence length. This was 30000, but now we can handle sequences of any length (implying that split_seq is now largely redundant).
Added Report Mutations function to tabulate mutations (from MUTA/HETE tags or from differences to the reference sequence).
Greatly improved "reference sequence" base numbering in the editor.
Can now right-click on sequence names to specify sequences as a reference sequence or trace.
We can now assemble EMBL files. The features get turned into (sometimes several) tags.
The Y-scaling on difference traces now zooms centred on the baseline.
New associated program: copy_reads. This is used with gap4 databases for overlapping clones (eg two BAC clones). It copies the overlapping sequences from one database into the other.
More chemistry types now supported and listed appropriately: BigDye v3 and MegaBACE ET.
The editor "show reading quality" command now shows the quality regardless of whether the quality cutoff is -1.
Re-enabled the clear button in the list editor for the "readings" list.
Under Unix, Gap4 now distinguishes BUSY files belonging to databases that are really in use from old BUSY files left behind after an abort. In the latter case Gap4 will override the BUSY file.
Disabled automatic execution of OPEN and CLOS notes unless the "-exec_notes" command line switch is used.
New "Save Settings" option in the editor.
The buttons around traces are now hidden in popup menus. This provides more room for traces (especially when in 4 columns mode). The old-style look is still available for those who prefer it.
Improved tag selector window to cope with having many more tag types.
Pads are now stripped from the sequence string search options (both in the contig editor search tool and the main gap4 menus).
The Template Display now has an option to turn off auto contig positioning (based on read-pair overlaps).
The default consensus algorithm is now to use confidence values (as EBA has been recalibrated).
New "List Contigs" window to replace the old one. This has several columns showing name, length and number of sequences. Clicking on a column sorts the list.
Tweaked the Template Display colours so that they are easier for screen projection and hopefully easier for colour-blind users.
Zero cutters are now listed separately in the restriction enzyme "Output enzyme by enzyme" output.
After assembly we now detect contigs that have the vast majority of fwd/rev reading pairs aligned in the same orientation (such as would be the norm for a mutation detection study). Having found such cases we complement these if necessary to "correct" the orientation.
All restriction enzymes may now be selected by using the Control-A binding.
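
The post-assembly orientation check on forward/reverse reading pairs mentioned above can be pictured as a simple majority vote. The sketch below is one possible reading of that description, not Gap4's code: the input format, the 0.9 threshold standing in for "vast majority", and the function name are all assumptions.

```python
def should_complement(readings, threshold=0.9):
    """Decide whether a contig looks consistently reversed.

    readings: iterable of (strand, is_complemented) pairs, where strand
    is "fwd" or "rev". A reading votes "reversed" when a forward read is
    complemented or a reverse read is not.
    """
    readings = list(readings)
    if not readings:
        return False
    reversed_votes = sum(
        1 for strand, comp in readings
        if (strand == "fwd") == comp
    )
    return reversed_votes / len(readings) >= threshold
```

For example, a contig in which every forward read is complemented and every reverse read is not would be flagged for complementing, while an evenly mixed contig would be left alone.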

Bug fixes

Fixed out-by-one error in reporting of the length of spanning read-pairs.
Fixed determination of (in)consistent status for spanning templates when shown in the template display.
Contig selector now remembers "display diagonal" setting after a clear command.
Removed a potential crash in Find Read Pairs.
The scrollbar arrows in the trace display now work properly.
Fixed crash in Find Internal Joins when sequences at the end of the contig contained more than 2050 base pairs of hidden data.
Fixed a rare crash in the trace display.
Reporting of probability values for consensus bases in the contig editor now works correctly (instead of sometimes reporting zero).
The "Clear All" command of the contig selector no longer loses the status line.
Fixed various out-by-one errors in directed assembly.
Fixed crashes in saving the consensus as experiment file format when tags exist without comments.
Improved handling of long filenames during assembly when the sequences being assembled are in one directory and the program was started from another directory.
Contig selector: fixed a crash in "list contigs" when tags have been selected to draw in the contig selector.
Prevent attempts to update the contig order using the template display when opening databases in read-only mode.
Better checking of maxseq before shotgun assembly starts (to avoid a crash).
Disassemble readings now corrects maxseq if it requires increasing.
Quality Clip can no longer adjust the contig length. This cures shifting of consensus tags.
Disabled use of sequence ranges when selecting a single contig for Suggest Primers. This case was buggy, but the code is being superseded anyway.
Disabled use of sequence ranges for Double Stranding (as it was buggy).
Fixed reading confidence values from non-SCF files that are referenced from an Experiment file TN record when the Experiment file does not have confidence values itself. These would sometimes incorrectly get set to "2".
Fixed the positioning of plots in Strand Coverage when contig sub-ranges were used.
Fixed a rare crash in the manual primer selection code (in the contig editor).
Many plots now only zoom in X when Y zooming is inappropriate (eg consistency plots).
Fixed a few bugs where zoom or selection drag-out boxes could be left on the screen.
Restriction enzymes that have recognition sequences within a sequence range, but have their cut site outside the sequence range are now visible in the graphical plots.
Fixed a corruption of sequence "notes" when disassembling the last reading in the database.

Trev

Changes

"Information" now displays the number of bases, number of samples, the baseline (used for difference traces) and the maximum trace amplitude.
Minor menu reorganisation. File->Save As is now a cascading menu including the File Type (to avoid problems with selecting this on some operating systems). The View->Display menu is now part of the main View menu. It is also no longer possible to turn off the trace portion (which was somewhat pointless).

Bug fixes

Fixed divide-by-zero errors that could be caused by some page printing parameters.
Fixed a bug caused by attempts to perform edit operations before loading a trace.
Confidence values are now loaded from experiment files instead of the associated trace file. Fixed problems in saving them too.

Spin

Changes

May now select multiple files from the sequence load function.
Merged the "Configure max number of matches" and "Configure default number of matches" dialogues into "Configure matches".
The sequence editor search command now gives the same output as string search when finding results on the bottom strand.
Increased the size of entry names for fasta files from 20 to 50. Also added a scrollbar to the personal file browser.
Can now set the sequence structure (linear or circular) via the Sequences menu.
Added "Change Directory" dialogue to File menu.
Removed size constraints on the sequence display window.
Now uses drop-down boxes for selecting sequence names.
Can now also use the left mouse button to drag plots around (in addition to the existing middle button binding).
Changed the allocation of colours for plots. These are now specified in the 'spinrc' file.
Renamed "Count base composition" to be "Count sequence composition" to reflect that it is not DNA specific.
Allow use of the left mouse button for dragging cursors. This should make things much easier for Mac users.

Bug fixes

Dropping a plot onto non-plot type window (eg the main text output window) no longer produces Tk errors.
Better handling of tabbed notebooks in the auto-generated EMBOSS interface (eg for primer3). Updated prebuilt interface files to EMBOSS 2.5.1.
The "all" key box in plots now only brings up a menu for results in that window (rather than all results).
Fixed several bugs relating to dragging and dropping plots.
Aligning a short protein against a long protein now correctly reads in the new aligned sequences, instead of possibly confusing it with a DNA sequence.
Fixed problems with multiple sequence displays showing multiple, different, lists of enzymes.
Fixed a rare graphics display bug of the Author Test whereby a horizontal line could be drawn across a plot.
Fixed the "rescan matches" option of Find Similar Spans. It produced an incorrect plot when a sequence sub-range had been chosen.
String Search now works if a sequence sub-range is used.
The Restriction Enzyme Map now works if a sequence sub-range is used.
Fixed a problem where temporary "lists" were not deleted when using backslashes in Windows pathnames.
Fixed some zooming problems in the consistency displays.
Linear/Circular status is now inherited when new sequences are created (eg by complementing).

Pregap4

Changes

New modules: reference traces, trace difference and heterozygous scanner. These form part of the new mutation detection methods. (See elsewhere for details.) This replaces the old trace_diff module.
New module: polyA clip.
The Phred module now allows for additional phred arguments to be specified. It also behaves better when finding non-trace files.
Extra suffixes on Sanger Centre naming scheme to support BigDyeV3 and MegaBACE ET chemistries.
Estimate Base Accuracies (eba) now produces values that are scaled to have equivalent magnitude to phred scores. (The old scale is still available as an option.)
Removed the non-compact (separate) window layout style.
We can now read FASTA files into pregap4. The Initialise Experiment file module automatically splits them into separate experiment files.
Added Select All Modules and Deselect All Modules options.
Rationalised the menus a bit. It may take a bit of getting used to, but now things like Run are in the Modules menu and the various Save options are in File.
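
For reference, a phred score Q is related to the estimated probability P of a base-call error by Q = -10 log10(P). The short sketch below converts between the two. It only illustrates the scale that the rescaled eba values are meant to be comparable to and says nothing about how eba derives its estimates; the function names are illustrative.

```python
import math

def phred_from_error(p_error):
    """Phred-scaled score for an error probability (0 < p_error <= 1)."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Error probability implied by a phred-scaled score."""
    return 10.0 ** (-q / 10.0)
```

On this scale a score of 20 corresponds to an estimated 1 in 100 chance that the base is wrong, and a score of 30 to 1 in 1000.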

Bug fixes

The Edit Experiment File Line Types window is now modal to prevent corruptions by attempting to do other things at the same time.
Disabled the vector clipping modules' rejection of sequences shorter than 16 base pairs (as this is a task for the quality clipping module).
Fixed the "del_temp_files" error message from the convert trace module.
Fixed a rare crash in screen against vector.

Misc

Changes

NT experiment file line type (Gap4 "NoTe"s) is now read by the assembly functions.
Retired the Repe program - we recommend use of RepeatMasker instead.
Upgraded the Tcl/Tk versions used to 8.4.0. This should not have any visible effect except perhaps for a few graphical bug fixes on Windows (ticks in menu, for example).

Bug fixes

Ran all the programs through Purify and Valgrind and removed any bugs spotted (mostly small memory leaks).

Prefinish

There is still no graphical interface to this, but considerable improvements have been made in the experiment suggestions. Please contact jkb@mrc-lmb.cam.ac.uk if you are interested in testing this component.