Package Version 1996.0

1st February 1996

Package Version 1996.0 released.

One of the major changes (and the one that took most time to produce) in this release is that GAP4 and TREV now have online help and we have created our own WWW pages. The help can be browsed from within the programs using Netscape or the simple inbuilt WWW browser included with the package. Now that this information is available we give less detail in these Release Notes as the online help reflects the current status of the programs.

Apart from the online help, the main changes in this release are to GAP4, but it also includes improvements, bug fixes and additions to other components of the package. (For those who missed it, the GAP4 paper has been published: Bonfield, J.K., Smith, K.F. and Staden, R. A New DNA Sequence Assembly Program. Nucleic Acids Res. 23, 4992-4999 (1995).) Several changes have been made in response to requests by users, and we encourage more groups to contact us with suggestions for improvements and with bug reports.

As an example of this, the Genome Sequencing Centre in St Louis asked if we could reduce the size of SCF files and so reduce the cost of disk storage for large projects such as theirs. We decided to change the way the different types of data are stored in SCF files so that compression programs such as gzip work more effectively on them. These new-style SCF files (SCF version 3.00) can be compressed to around 40% of their original size. Programs like TREV and GAP4 can read the files in compressed or uncompressed form (as well as all the older styles of SCF format). All the new code for this purpose is in our io_lib directory, and the documentation for this useful library of functions has also been improved. Other io_lib changes include two new programs: "extract_seq" extracts only the sequence component of either a trace or experiment file, and "scf_update" converts between SCF formats 2 and 3. In addition, the RAWDATA environment variable is now used as a list of directories when looking for a trace file to load (from an experiment file), and a few minor bugs have been fixed.
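The effect of data layout on compression can be illustrated with a small sketch. It assumes, as we understand the SCF 3.00 layout, that the trace samples for each base type are stored contiguously as second-order differences rather than interleaved as raw values. The synthetic signal and all function names here are purely illustrative and are not part of io_lib.

```python
import math
import struct
import zlib

def make_trace(n, phase):
    # Smooth synthetic signal standing in for one dye channel (illustrative only).
    return [int(500 + 300 * math.sin(i / 7.0 + phase)
                + 150 * math.sin(i / 11.3 + 2 * phase)) for i in range(n)]

def delta_delta(samples):
    # Replace each sample by its second-order difference, modulo 2**16.
    out = []
    prev1 = prev2 = 0
    for s in samples:
        out.append((s - 2 * prev1 + prev2) & 0xFFFF)
        prev2, prev1 = prev1, s
    return out

def undo_delta_delta(deltas):
    # Inverse transform: recover the original 16-bit samples.
    out = []
    prev1 = prev2 = 0
    for d in deltas:
        s = (d + 2 * prev1 - prev2) & 0xFFFF
        out.append(s)
        prev2, prev1 = prev1, s
    return out

def pack(values):
    # Big-endian unsigned 16-bit values.
    return struct.pack(">%dH" % len(values), *values)

channels = [make_trace(4000, p) for p in (0.0, 1.0, 2.0, 3.0)]

# Old-style layout (assumed): four channels interleaved sample by sample.
interleaved = [v for quad in zip(*channels) for v in quad]
old_size = len(zlib.compress(pack(interleaved), 9))

# New-style layout (assumed): each channel contiguous, as second differences.
separated = [d for ch in channels for d in delta_delta(ch)]
new_size = len(zlib.compress(pack(separated), 9))

print("interleaved raw:", old_size, "bytes; separated deltas:", new_size, "bytes")
```

Because neighbouring samples of a smooth trace differ only slightly, the differenced values cluster near zero and a general-purpose compressor finds far more redundancy in them. The exact ratio depends on the data; the 40% figure quoted above refers to real SCF files, not to this toy signal.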

The Sanger Centre said they wanted to treat readings produced using dye terminators as being equivalent to having a reading from both strands of the sequence. That is, they wanted all of the functions that check for coverage of both strands of the sequence, for example the Experiment Suggestion functions or the Quality Plot, to treat single-stranded segments that are covered by dye terminator readings as double-stranded. This has been implemented by introducing a new Experiment file record type, the CH or "Special Chemistry" record, and a new entry in the GAP4 database for storing "Reading Flags". If the CH record is present and set to 1, the reading can be treated as equivalent to two readings, one from each strand. Within GAP4 users can choose whether they want such readings to be treated in this way for any of the functions that calculate a consensus sequence.
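The rule for the CH record can be sketched as follows. This is a minimal illustration, assuming Experiment file lines consist of a two-letter record identifier followed by spaces and a value; the function names and the "+"/"-" strand labels are hypothetical and do not reflect the actual GAP4 code.

```python
def parse_experiment_records(text):
    # Assumed line layout: two-letter record ID, then whitespace, then the value.
    records = {}
    for line in text.splitlines():
        if len(line) < 2:
            continue
        key, value = line[:2], line[2:].strip()
        records.setdefault(key, []).append(value)
    return records

def counts_as_both_strands(records):
    # CH present and set to 1: treat the reading as one from each strand.
    return records.get("CH", [""])[-1] == "1"

def strands_covered(readings, honour_special_chemistry=True):
    # readings: list of (strand, records) pairs, strand being "+" or "-".
    covered = set()
    for strand, records in readings:
        covered.add(strand)
        if honour_special_chemistry and counts_as_both_strands(records):
            covered.update("+-")
    return covered

terminator = parse_experiment_records("ID   read1\nCH   1\n")
primer = parse_experiment_records("ID   read2\n")

print(strands_covered([("+", terminator)]))
print(strands_covered([("+", terminator)], honour_special_chemistry=False))
print(strands_covered([("+", primer)]))
```

The honour_special_chemistry switch stands in for the user's choice within GAP4 of whether such readings should count as double stranded.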

We made another change to the Experiment file format. The terminology surrounding the direction of a reading on a template, the reading's sense, its strand and orientation is confusing. In an attempt to simplify it we have extended the Primer Type (PR record) definition by adding the type 4. Now 0 means unknown, 1 means forward from universal primer, 2 means reverse from universal reverse primer, 3 means forward custom primer (previously it was any custom primer), and 4 means reverse custom primer. We hope this makes it easier to create correct Experiment files and means that the Direction of Read (DR record) is no longer required (although we will continue to support it for now).
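The extended PR definition amounts to a small lookup table from which the read direction can be derived directly. The sketch below is illustrative only: the function names are hypothetical, and the use of "+" and "-" as DR values in the fallback is an assumption.

```python
# PR value -> (direction, primer kind), as listed in the text above.
PRIMER_TYPES = {
    0: ("unknown", "unknown"),
    1: ("forward", "universal"),
    2: ("reverse", "universal"),
    3: ("forward", "custom"),
    4: ("reverse", "custom"),
}

def describe_primer(pr_value):
    # Unparseable or out-of-range values are treated as unknown.
    try:
        code = int(pr_value)
    except (TypeError, ValueError):
        code = 0
    return PRIMER_TYPES.get(code, PRIMER_TYPES[0])

def read_direction(pr_value, dr_value=None):
    # Prefer the PR record; fall back to DR only when PR is uninformative.
    direction, _kind = describe_primer(pr_value)
    if direction == "unknown" and dr_value is not None:
        # Assumed DR encoding: "+" for forward, "-" for reverse.
        direction = {"+": "forward", "-": "reverse"}.get(dr_value, "unknown")
    return direction

print(read_direction("4"))
print(read_direction("0", dr_value="-"))
```

With types 3 and 4 distinguishing forward from reverse custom primers, every known PR value carries a direction, which is why the DR record becomes redundant.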

While looking at GAP4 databases from external laboratories we have discovered that many contain missing or conflicting records, particularly relating to template information, and have tracked this down to errors or omissions in their Experiment files. In some cases this was due to the use of external (buggy) programs for setting up the Experiment files, but it has also emphasised the need for us to help people to use PREGAP. To this end we have updated the PREGAP documentation and added some template configuration files. It is important to realise that to get the best from GAP4 it is necessary to give it complete and correct data about all the readings it assembles.

Changes to GAP4 include a modified Quality Plot that makes problems more apparent, and a new "Independent Assembly" function in which a batch of readings can be assembled as though they were the only readings in the database, i.e. they will only be compared with one another. Command line arguments for the maximum consensus length and the maximum number of records in the database are now available. For long-running tasks, like assembly, results are now written to the Output Window while the function is running, rather than being buffered until the task has finished. Several functions have been greatly speeded up. Each padding character is now given an accuracy estimate that is the mean of the estimates for the characters adjacent to it. The code for checking Read Pairs and for plotting readings and templates in the Template Display has been greatly improved (including bug fixes) and the listed output adjusted accordingly.

The following GAP4 bugs were fixed: a bug in assembly that allowed reading names that were not the same as the Experiment file name to be entered more than once; a bug caused by reading names of 16 characters (thanks to colleagues in Japan); a bug that sometimes gave incorrect consensus sequences in Find Internal Joins; consensus and quality cutoff figures that were previously often not used; and a consensus tag corruption that occurred in some specific joining cases. Extract Readings now outputs correct TN lines and is more robust with very long sequences. Large numbers of less serious bugs were also fixed.
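The accuracy estimate for padding characters mentioned above can be sketched as follows. This is a minimal illustration, not the GAP4 code: it assumes "*" is the padding character, and that for a run of pads the nearest non-pad character on each side is used. The function name is hypothetical.

```python
def pad_accuracies(sequence, accuracy, pad="*"):
    """Return a copy of accuracy in which each pad gets the mean of its neighbours."""
    result = list(accuracy)
    n = len(sequence)
    for i, base in enumerate(sequence):
        if base != pad:
            continue
        # Nearest non-pad character to the left and to the right (assumption).
        left = next((accuracy[j] for j in range(i - 1, -1, -1)
                     if sequence[j] != pad), None)
        right = next((accuracy[j] for j in range(i + 1, n)
                      if sequence[j] != pad), None)
        neighbours = [v for v in (left, right) if v is not None]
        # A pad with no real neighbour at all keeps an estimate of zero.
        result[i] = sum(neighbours) / len(neighbours) if neighbours else 0
    return result

print(pad_accuracies("AC*GT", [40, 30, 0, 50, 20]))
```

At the ends of a reading only one neighbour exists, so in this sketch the pad simply inherits that neighbour's estimate.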

TREV can read SCF files via their Experiment file. All edits are saved to Experiment files, rather than to the SCF file. Several small bugs have been fixed in TREV. ALFSPLIT and CONVERT have also been improved. People have started to use REPE for sequence families other than Alu and in doing so have uncovered a number of Alu-specific assumptions, which have now been removed.

Two bugs have been fixed in the pattern search routines in NIP and NIPL. Bernard Caudron at the Pasteur pointed out that the CODATA version of PIR files had changed and that this had broken our sequence library index creation programs and our reading routines. He sent fixes, which are included in this release.

Silicon Graphics have now fixed the bug in their Fortran that was breaking our sequence library access routines, so libraries can once again be read on SGI machines. This bug fix is available as a patch from SGI for current systems, and is fixed as standard in the forthcoming Irix 6.2 release.

In summary, the major changes have been the addition of online help to GAP4 and TREV, numerous bug fixes and speedups to GAP4, and changes to the SCF and Experiment file formats. Feedback is welcome.