Version-2001.0 Release Notes
James Bonfield, Kathryn Beal, Yaping Cheng, Mark Jordan and Rodger Staden
The major visible changes in this release include a new sequencing project
experiment suggestion program, "pre_finish"; spin, a replacement for and
combination of nip4 and sip4 which, very importantly, provides the first
graphical user interface to EMBOSS
(http://www.hgmp.mrc.ac.uk/Software/EMBOSS/); plus improvements and
additions to our existing programs pregap4, gap4 and trev.
At the time of the last release we stated that the MS Windows version would
only be available commercially, but we are very pleased to say that at the
beginning of this year we obtained permission to distribute the MS version
under identical conditions to those for the UNIX versions. The functionality
of the programs is the same on all systems (except there is no EMBOSS release
for Windows).
Providing the MS Windows version has had quite a big startup cost in terms of programming
effort, but the problems have now been solved and lessons learnt, and it should not slow
down progress in the future. Simple things like allowing spaces in file names or
different window manager behavior can create a lot of work.
This release is the first to use our new trace file format ZTR which has
advantages of reduced file size and flexibility over SCF and removes the need
for the use of external compression programs. Over time we expect it to replace
SCF as the preferred trace file format. In addition to ZTR, Gap4 now also
contains an interface to Perkin Elmer's BioLIMS database.
Sequencing projects keep growing in size, for example shotgunning whole
bacteria, and although it may not be evident to users doing smaller projects,
a major improvement in this release of gap4 is the numerous speedups of what
were becoming slow tasks. These larger projects also require
more readings so we have increased the possible number of readings to
99,999,999 and already sites are using around 200,000. The permitted
length of reading names has also been increased to 40 characters.
We have always recommended making the fewest number of changes to the original
reading data. This results in many pads appearing in contigs (which are
stripped when the final consensus is created) and meant that some searches in
gap4 failed when the targets included pad characters. We have improved some of
the affected routines by stripping pads before the searches are applied.
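The idea of stripping pads before searching can be sketched as follows. This is only an illustration, not gap4's own code: it assumes the pad character is "*", and the function name find_ignoring_pads is made up for the example. Matches are reported in the original padded coordinates.

```python
PAD = "*"  # assumption: pads are represented by "*"


def find_ignoring_pads(padded, target):
    """Return 0-based padded start positions where target matches once pads are removed."""
    target = target.replace(PAD, "")
    if not target:
        return []
    # positions[i] is the index in the padded string of the i-th unpadded base
    positions = [i for i, base in enumerate(padded) if base != PAD]
    unpadded = "".join(padded[i] for i in positions)
    hits = []
    start = unpadded.find(target)
    while start != -1:
        hits.append(positions[start])  # map back to padded coordinates
        start = unpadded.find(target, start + 1)
    return hits


consensus = "AC*GT*TAGC*GTT"
print(find_ignoring_pads(consensus, "GTTAG"))
```

A plain search for GTTAG in the padded string above would fail because a pad splits the match.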
Hyperlinks have been introduced to results in the Output window and to the
results produced by the new reading name, template name and tag content
searches. The hyperlinks can be used to invoke the Contig Editor etc.
The gap4_viewer has been free for some time, even for commercial users, and provides full
gap4 functionality in a read-only mode. Making the gap4_viewer free to all means that
anybody (including commercial sequencing companies returning results to customers)
can usefully send their gap4 databases to colleagues at other sites: the databases
are machine independent and anybody with access to MS Windows or UNIX/LINUX could obtain
and use the viewer. Anyone who has done any sequencing will know that seeing the
assembly and the traces is the best way of assessing the reliability of the consensus.
From this release we will not be distributing a separate gap4_viewer as we have
incorporated its limited functionality into the standard version of gap4. This means
that the downloaded version of gap4 will work with full functionality on the demo and
course data included in the package and in viewer mode on any other data. Full
functionality on all data will require a licence as before (free to academic users).
A more detailed list of the program changes made is given below. The manual has also been
updated. Whenever we receive queries on topics that are not documented or which are
inadequately explained we add to or improve the corresponding sections in the manual.
Note that we welcome comments and suggestions about the package, particularly
ideas about what users would like to see added. For this release we are
looking especially for sites that would like to try out our new experiment
suggestion program (pre_finish) and to contribute to the design of its
user interface. See the $STADENROOT/lib/finish/METHODS file for more
information.
Program version numbers
- Gap4 4.6
- Spin 1.0
- Pregap4 1.2
- Trev 1.6
Operating systems
The binaries have been created in the following build environments. Typically
newer environments for the same operating system should work fine, but not
necessarily older systems. (For example, the binaries will not run under
RedHat Linux 5.x, but will run on RedHat Linux 7.x.)
- Digital Unix 4.0E
- Irix 6.5
- RedHat Linux 7.1
- Solaris 8
Demo data sets
The course (see course/*_docs/*.pdf) may be run when in demonstration mode (ie
without needing a licence). Specifically all demonstration data files are
considered as valid sequences and so are exempt from the licence
restrictions.
All pathnames listed below are relative to the installation root for the
package.
Here is a list of sequences which may be loaded into spin:
- userdata/5H1E_HUMAN.seq
- userdata/atpase.seq
- userdata/cemyo1.seq
- userdata/ECAE129.seq
- userdata/ecoli.0*
- userdata/lambda.seq
- userdata/mysd_caeel.seq
- userdata/mysa_human.seq
- userdata/mysa_drome.seq
- userdata/spin_dna.seq
- tables/vectors/lorist6.seq
- tables/vectors/m13mp18.seq
- tables/vectors/m13mp7.seq
- tables/vectors/pBs.seq
- tables/vectors/pgem3zfm.seq
- tables/vectors/pgem3zfp.seq
- tables/vectors/pgem5zfm.seq
- tables/vectors/psc194.seq
- tables/vectors/puc18.seq
For a good example of protein-protein similarity plots, try using
mysa_drome.seq and mysa_human.seq.
For DNA-protein plots, try using cemyo1.seq against mysd_caeel.seq.
To see how spin handles large sequences try using ecoli.00003 and
lambda.seq. This is a large comparison: 250Kb against 48.5Kb. Hence
the slower searches, such as Find Similar Spans, will take a long
time. We suggest searching with Find Matching Words using a word
length of 12.
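To show why a word search scales well to comparisons of this size, here is a minimal sketch of exact word matching by hashing. It is not spin's implementation; the function name matching_words and its return format are assumptions made for the example. One sequence is indexed by its words, and the other is scanned once against that index.

```python
from collections import defaultdict


def matching_words(seq_a, seq_b, word_length=12):
    """Return (pos_a, pos_b) pairs (0-based) where identical words of word_length start."""
    # Index every word of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_length + 1):
        index[seq_a[i:i + word_length]].append(i)
    # Scan seq_b once, looking each word up in the index.
    matches = []
    for j in range(len(seq_b) - word_length + 1):
        for i in index.get(seq_b[j:j + word_length], ()):
            matches.append((i, j))
    return matches


# Tiny example with a short word length so the result is easy to follow.
print(matching_words("ACGTACGT", "TTACGTAA", word_length=4))
```

Longer words give fewer chance matches and a smaller result set, which is why a word length of 12 suits the large comparison described above.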
Gap4 in demonstration mode allows access to:
- demo/gap4/traces.tar
A tar of trace files base called with phred. Gap4 will automatically
read the files directly from the tar file (via the traces.tar.index
lookup file).
- demo/gap4/DEMO.0*
This database is a section from a C. elegans cosmid, still in several
contigs. You may try joining and editing contigs. The full
functionality of Gap4 is available except for assembling or
disassembling sequences.
- course/data/shotgun_data/* trace and experiment files
- course/data/long_reads/* trace and experiment files (long reads)
- course/data/ABI_Data/* original ABI files
- course/data/phred_data/* trace files base called with phred
- course/data/mutations/* scf files for mutation studies
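Reading trace files straight out of traces.tar through a lookup file can be pictured with the sketch below. The format of traces.tar.index is not described here, so the example builds a hypothetical in-memory index with Python's standard tarfile module; the function names build_index and read_member are likewise assumptions. It works on uncompressed tar files only.

```python
import tarfile


def build_index(tar_path):
    """Map each regular member name to (data offset, size) within the tar file."""
    with tarfile.open(tar_path) as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


def read_member(tar_path, index, name):
    """Read one member's bytes by seeking directly to its data, without unpacking the tar."""
    offset, size = index[name]
    with open(tar_path, "rb") as fh:
        fh.seek(offset)
        return fh.read(size)
```

Once the index exists, fetching a single trace costs one seek and one read, however many files the tar holds.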
Pregap4 in demonstration mode allows access to the same files listed for
Gap4.
Sequence Library Access Using EMBOSS
The spin interface to EMBOSS provides access to the sequence libraries
provided you know the names of the files you want to extract!
Linux, Gnome and Enlightenment
There is a known problem with the Gap4 contig editor when using Enlightenment
as the window manager, although this may have been fixed by now. This is the
default for earlier versions of Gnome, but it is not known whether the problem
arises when using Enlightenment in other environments. The symptom is that the
program will terminate instantly upon starting the contig editor with a
complaint about X_ConfigureEvents.
The solution is to change window managers, which may be adjusted using the
Gnome control panel.
Change log
Here is a list of changes since the 2000.0 release.
Gap4
Changes
- Major changes to Find Internal Joins, consisting of new hashing and
alignment algorithms. The new alignments allow Find Internal
Joins to deal with very long sequences and sequences containing many
pads.
There now exists a fast (less sensitive) mode and a slower (more
sensitive) mode. It is recommended that the fast mode is used first to
join any particularly long matches.
- The join-editor "align" algorithm has been greatly improved. It aligns
long sequences quickly and works better with highly padded data.
- Gap4 database I/O now uses less memory and is much faster at opening
databases.
- Large speed improvement in directed assembly. Speed can also be
further increased by setting the minimum percentage mismatch to
zero. This disables updating the consensus, so it should only be used
when no alignments are needed (for example when inputting data from
phrap).
- Large speed improvement to Enter Tags.
- Determining reading and template numbers from their names is much
faster. This speeds up many algorithms.
- The contig editor has a "store undo" toggle, which may be used to
prevent undo information being stored. This gives useful memory saving in
some complex cases, such as trying to align two very long sequences or
using shuffle pads on very long contigs.
- New search commands in the list menu: search by reading name, template
name and annotation contents. These produce clickable-lists, to bring
up the contig editor or template displays centred at the relevant
locations.
- The text output window now contains 'hypertext' links for reading
names. This allows the user to left-click to bring up the contig
editor centred on that reading, or right click to get a list of
options (template display and display notes).
At present only the output from the new search tools and from show
relationships has these links. Please let us know
if you would like further sequence-links.
- Improved shuffle pads algorithm which should work better on data
with a high sequence depth. However it doesn't work better in all
cases. If people prefer the older method then please tell us.
- The use of hidden data in some algorithms (Find Internal Joins, Check
Assembly, Save Consensus) now allows selection of the best poor
quality data by analysis of average confidence value.
- Directed assembly now has an "enter all readings" mode. If a sequence
fails to enter then subsequent sequences which depended on it will now
start new contigs (instead of also failing).
- More chemistry information is now available, including BigDye
terminator, Licor, etc.
- Now correctly handles more than 99,999 sequences - now up to
99,999,999.
- Reading names can now be up to 40 characters long. To help control the
ever-lengthening output the name display width can be adjusted in the
Contig Editor's dump contig command.
- (BETA) Interface to PE's BIOLIMS database. This includes direct
reading and writing of both trace files and sequence assemblies.
- ZTR, the new trace file format, can be used by gap4.
- Consensus output in fasta format now allows characters other than
a,c,g and t. (This is not strictly correct, but makes fasta output
more useful in conjunction with masking and marking.)
- New "NOTE" type 'INFO'. These notes contain user defined comments and
are displayed in the contig editor information line.
- (BETA) There is now a new program (pre_finish) to automatically choose
finishing reactions. At present this deals with long gel readings,
resequencing (using different chemistries), primer walks and clone
walks. It is completely customisable, but at present has no easy to
use interface.
- Improved the database.log file information, for easier debugging. It
now also contains the output from the Check Database function.
- The "use special chemistry" (double chemistry implies double stranded)
option in the consensus algorithm dialogue box is now disabled by
default.
- The ruler lines displayed in the template display (and similar plots)
now have the ruler ticks shown (by default). More ticks are now
also shown.
- Restriction enzymes can now be detected in data containing pads.
- Traces can now be manually complemented within the trace display.
- The editor search by "quality" mode now ignores pads and so-called
"bad data" (N's) as problems when the user is using the confidence-
values consensus mode.
- Invoking the contig editor from the contig selector or template
display ruler now brings up the editor with the editor cursor
positioned approximately where the popup menu was placed.
- The maximum database size (-maxdb command line argument) can be set
within gap4 itself, although this only takes effect when opening the
next database.
- Directed assembly automatically increases both maxseq and maxdb when
entering large external assemblies. (Unfortunately this is still not
true for the Normal Shotgun Assembly function.)
Bug fixes
- Worked around an X server bug on Linux, where the X server could crash
after changing the number of columns of traces displayed.
- Minor improvements to trace-editor cursor positioning.
- The phrap assemble option was stripping lower-case t's from sequence
names when supplied via a list (but worked via the more normal route
of a file-of-filenames).
- Remove file-descriptor leak when changing genetic code tables.
- Fixed a problem with list confidence where the error rate could
sometimes overflow (only a problem for extremely low error-rates).
- Contig editor searching (backwards only) for "edits" could sometimes
miss edits that were neighbouring, and in some cases could even
crash.
- Templates containing one sequence with both 5' and 3' vector are now
correctly labelled. (Mainly affects read pair and template display
functions.)
- The colours for the read coverage and strand consistency plots were
swapped around (now red is +ve and black is -ve).
- The "list contigs" window (contig selector) was not working correctly
when it was resorted - the popup menu would invoke the commands on the
wrong contigs.
- Solved some display glitches in the trace display when scrolling with
the confidence values shown.
- We now reject zero length sequences in directed assembly.
- The Copy Database with garbage collection enabled no longer crashes
when the database is opened in read-only mode.
- Complementing contigs from within the template display would crash if
the display was showing many contigs.
- Fixed a bug in the template display which occurred if only the ruler was
displayed and you zoomed in, scrolled, zoomed out then zoomed again:
the ticks were out of sync with the ruler.
- Fixed the Screen Only assembly method. We broke the dialogue when
updating the hidden data controls.
- Joining contig A to contig B when contig B has the restriction enzyme
map displayed would crash when the enzymes are next "touched" in the
restriction enzyme map.
- When zero sequences cover a consensus point (which is legitimately
possible when marking sequences for removal) clicking on the consensus
in the editor to display the confidence values could cause a crash.
The display could also be invalid when scrolling. We now initialise
the consensus confidences correctly in such cases.
- Trace confidences on Solaris would sometimes be displayed
incorrectly.
- The template display "information" popup (from highlighting templates)
now displays the observed template size when appropriate, and
indicates the estimated size for templates which span contigs.
- Fixed the problem where occasionally the editor "sequence names" panel
would be blank (until a refresh occurs).
- Some entries in the vector-primer file had the forward and reverse
sequences wrong; these have now been switched around. The vectors are
pBS, pgem* and puc18. Users should be aware that this means old
projects had forward and reverse sequence attributes switched for
sequences from these vectors.
Nip4/Sip4 (now spin)
Changes
- Merged Nip4 and Sip4 into a single program named spin.
- Added a graphical interface to the EMBOSS suite of programs.
- New prokaryotic ribosome binding site weights file (PERCEPTRON.WTS).
- Added E. coli promoter weight matrices (for spin) to tables directory.
tables/prokprom_35.wts
tables/prokprom_10.wts
- Alignments now use multiple symbols for displaying the similarity
(controlled by the user). This is also visible in the spin
two-sequence window.
- EMBL feature tables are now loaded. At present this is just used by
the translation code, allowing selection of one or more CDS features.
- If EMBOSS is installed and setup, spin can fetch sequences directly
from remote sequence databanks.
- Now have the choice to save in EMBL and Fasta format. The EMBL format
writes out the feature table, provided that the sequence has not been
edited.
Bugs fixed
- Fixes relating to the "Sequence manager" and sequence selection.
- Fixed comparison functions which didn't update correctly when using a
_rf123 sequence (should be treated as being a protein sequence)
- Fixed comparison functions which don't fill out some of the entry
boxes if you press cancel and then bring up the same dialogue box
- Fixed bug in plot base composition and emboss graphical functions when
plotting over a range.
- Fixed crash on alphas which occurs when the y dimension is 0 eg plot
base composition over a range (not including the ends).
- Fixed crash on alphas when searching for a string.
- Fixed various sip functions which crashed or hung if no matches
were found
- When using rf123 sequences, the program automatically creates 3 new
translated sequences but added a unique identifier to the end of the
name in the same way as the other new sequence generating functions do.
- When using rf123 sequences over a range, didn't calculate the start
position correctly which caused eg align sequences to crash
- The cursors in sequence pair display when there are several sequences
in the display (eg when comparing dna vs protein) could not be
selected as a pair.
- Fixed weight matrix search which added the minimum y value
to the score when it shouldn't have
- Fixed an uninitialised memory read causing crashes (mainly on Solaris)
when reading experiment/EMBL files.
- Zooming plots when the crosshairs are shown sometimes produced tcl
error messages.
- The default score for find similar spans was sometimes blank.
- Blank lines in score matrices were producing incorrect results. We now
also check for non square matrices, to check for missing rows or
columns.
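The kind of checking described in the last item (skipping blank lines and rejecting non-square matrices) might look like the sketch below. The file layout is an assumption made for the example: a header line of column symbols, then one row per symbol with whitespace-separated integer scores. The function name parse_score_matrix is hypothetical and the code is not taken from spin.

```python
def parse_score_matrix(text):
    """Parse a score matrix, ignoring blank lines and rejecting non-square input."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]  # drop blank lines
    if not lines:
        raise ValueError("empty score matrix")
    columns = lines[0]
    rows = {}
    for fields in lines[1:]:
        symbol, scores = fields[0], fields[1:]
        if len(scores) != len(columns):
            raise ValueError(
                f"row {symbol!r} has {len(scores)} scores, expected {len(columns)}"
            )
        rows[symbol] = dict(zip(columns, map(int, scores)))
    # A square matrix has exactly one row for every column symbol.
    if sorted(rows) != sorted(columns):
        raise ValueError("matrix is not square: row and column symbols differ")
    return rows
```

Raising an error on a missing row or column is safer than silently scoring with a partial matrix.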
Pregap4
Changes
- Changed the widget-set used from Tix to incr-widgets. This solves some
of the odd colour and font problems.
- The mutation detection module now allows the forward and reverse
strand wild-type traces to be specified independently.
- The vector-primer file mode of vector clip is now the default mode.
- ZTR support, to allow for the new compressed trace format.
- The phred and ATQA modules now automatically convert their input files
to SCF, so an explicit conversion is no longer required.
- The ABI/ALF to SCF module has been replaced with a generalised convert
trace file format module. This can convert to SCF, CTF and ZTR
formats. Additionally it allows for trace normalisation and
quantisation (down-scaling to 8-bit (for example) values).
- The quality clip module may now also reject files if they are shorter
than a specified length.
- Made the eba module auto-convert ABI and ALF format to SCF. This means
that it can be placed before the convert trace file module and hence
is in a consistent position with similar tools (phred, atqa). The
native output format is now ZTR.
- The ticks and crosses are now [x] and [ ] respectively. The module
name is also appropriately highlighted. This makes it easier to see
which modules are enabled while also solving a problem with the
dingbats fonts on some systems.
- Removed the "Select parameters to save" buttons from each module.
Please tell us if you really need this feature. (It caused too
much confusion.)
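As a rough illustration of the trace quantisation mentioned above for the convert trace file format module, the sketch below scales sample values down so that they fit in 8 bits. How the module itself scales values is not stated here, so the linear scaling to the peak value and the function name quantise are assumptions.

```python
def quantise(samples, bits=8):
    """Scale non-negative integer trace samples into the range 0 .. 2**bits - 1."""
    top = (1 << bits) - 1
    peak = max(samples, default=0)
    if peak <= top:
        return list(samples)  # already fits; leave values unchanged
    # Linear down-scaling: the largest sample maps to the top of the range.
    return [s * top // peak for s in samples]


print(quantise([0, 512, 1024, 2048]))
```

Down-scaling loses some amplitude resolution in exchange for smaller files, which matters most when traces are stored in bulk.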
Bug fixes
- Blast module now handles very long filenames much better.
- Fixed a problem in the RepeatMasker, Crossmatch and Blast modules when
dealing with sequences containing lowercase letters.
- Some parameters of the Extract Sequence module were not saved when
"Save All Parameters" was used.
- Better handling of duplicate sample names.
- Fixed problems in the phred module when dealing with filenames
containing odd characters (regexp meta-characters).
- The enter assembly module now sets maxseq and maxdb correctly. It also
now removes the BUSY file when an error occurs.
- Solved a problem when loading files of filenames containing partially
relative pathnames (eg D:dir/file).
- Fixed a bug in the history tracking of files for the convert trace
module. It was setting the file_orig_name to be file_id(name) instead
of just name. Hence the pregap.log file contained incorrect values.
- Screen_seq reported match positions using the coordinates from the
wrong sequences when writing the tags i.e. the reading positions were
from the vector and vice versa.
- The cross_match pregap4 module can now handle spaces in filenames
and has improved error handling for when cross_match fails.
- The blast screen pregap4 module had problems with spaces in database
pathnames. It now works with spaces in the directory components
(achieved using the BLASTDB environment variable instead of a .ncbirc
file), but the filename component still fails with spaces.
- Fixed a bug in exp2read function which cleared the base positions
regardless of whether the experiment file overrode them using ON
records.
- Exiting pregap4 when trev is open no longer produces trev error
messages.
Vector_clip
Changes
- No longer writes out primer type (PR line) information when this
already exists in an experiment file.
Bug fixes
- Fixed out-by-one positioning of matches on the reverse strand.
- In vector_primer mode, the default left-clip value was incorrectly
applied.
- In vector_primer mode, if a match is found at 3' only then the SF line
(vector filename) was incorrectly set.
Trace file handling (io_lib)
Changes
- New formats supported: CTF and ZTR (highly compressed).
A new program 'convert_trace' may be used to switch between the
formats. Convert_trace subsumes the previous functionality of init_exp
and makeSCF, although both still exist for backwards compatibility.
- A new program get_comment is now used to extract the text fields from
all formats of trace files. This replaces the getABI* and
get_scf_field programs. Get_comment is used instead of these in
pregap4.
- Support for on-the-fly decompression via szip.
- Traces can be read directly from tar files. This requires RAWDATA to be
set to "TAR=filename.tar".
Bug fixes
- Fixed a minor problem dealing with space characters in plain text
files.
- Wrong timestamps were sometimes reported for ABI files.
Trev
Changes
- No longer needs a licence; can freely view any file.
Bug fixes
- More robust when dealing with partially complete traces; such as
traces missing confidence values, base positions or even sequence.
- The trace widget "sequence" method crashed if a trace wasn't loaded.
This could be called from trev when bringing up the search box and
then switching to a malformed trace without cancelling the search box.
- Fixed a crash in trev where switching "next" to a trace which is
malformed or does not exist caused the postscript printing options
to be freed multiple times.
- Allow deletion of the last base.
- Fixed the trace printing code for traces with no trace data.
Misc
Changes
- GUI updated to use tk8.3.x instead of 8.0.
- Demonstration mode (ie no licence present) will now allow the course
data to be used; processed via pregap4 and assembled and edited within
Gap4.
- Improved Unix package initialisation. The staden.login and
staden.profile files are now optional when simply running the main
(graphical) tools.
- The licence file may now be pointed to by the STADLICENCE environment
variable. Licences for multiple system types (unix/windows) and
multiple package releases may now be mixed in the same file.
- The demo data, as well as the course data, can be assembled and
checked in demo mode.
Bug fixes
- Removed a file-descriptor 'leak' in extract_seq.
- Under Linux the MANPATH is now set correctly in the login setup.
- Minor bug-fixes to eba (estimate base accuracies), potentially causing
crashes (not observed).
- Redirecting output from the text-output window could cause crashes on
some systems (typically Linux).
- Many fixes relating to the differences between the way MS Windows and
UNIX window managers function. For example which window is "fronted".
- Many fixes relating to the fact that MS frequently uses spaces (" ")
in filenames.
- Set the line width for spin plots to 0 (was 2). This greatly speeds up
drawing on Windows.
- Program colours under CDE and certain X windows emulators now appear
correctly. We override the CDE system defaults. If you wish to change
the Staden Package colours use the Set Colours command in the Options
menu.