
Version-1.6.0 Release Notes, October 2005

The main changes in this release, in addition to the usual round of bug fixes, are support in Gap4 for very large databases, for traces from the "454" instrument and for a new SNP Candidates plot. In addition to these, substantial speed improvements have been made.

64-bit file size support

With databases getting ever bigger, we finally reached the point where the Gap4 database files needed to be larger than 2GB in size. This posed a problem for the old 32-bit file offset code (mandated by the format of the .aux file), so we have made Gap4 64-bit aware.
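The 2GB ceiling follows from simple arithmetic. The sketch below assumes the old offsets were signed 32-bit integers, which the notes do not state explicitly; the function name is purely illustrative and is not part of Gap4.

```python
# Largest values representable in signed 32-bit and 64-bit integers.
# Assumption: the old Gap4 file offsets were signed, which matches a 2GB limit.
MAX_OFFSET_32 = 2**31 - 1   # 2,147,483,647 bytes, just under 2GB
MAX_OFFSET_64 = 2**63 - 1


def fits_in_32_bits(file_size_bytes: int) -> bool:
    """Return True if a file size (and so its end-of-file offset)
    can be stored in a signed 32-bit integer."""
    return 0 <= file_size_bytes <= MAX_OFFSET_32


if __name__ == "__main__":
    for size in (1_500_000_000, 2**31 - 1, 2**31, 5_000_000_000):
        print(size, fits_in_32_bits(size))
```

Any database file that fails this check needs the 64-bit format described in these notes.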

Firstly, note that Gap4 still uses 32-bit file sizes by default and so is fully backwards compatible. To create a 64-bit database, specify the "-bitsize 64" command line option. Gap4 automatically detects the format on subsequent opens, so this option only needs to be specified at creation time. However, note that 64-bit databases are incompatible with the old 32-bit ones and so cannot be read by older versions of Gap4. Fortunately copy_db has also been modified to support conversion from 32-bit to 64-bit and vice versa (assuming the database isn't too large to fit in 32 bits).

Dealing with such massive databases also revealed a number of areas where Gap4 was too slow. We've made considerable improvements in several places, and these changes should speed up normal 32-bit usage too.

The SNP Candidates plot

This new plot has been designed to show graphically the locations where a consensus column has a strong chance of being made up of multiple sequences, e.g. due to it being a SNP or a variation in a collapsed repeat.
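To make the idea of a column "made up of multiple sequences" concrete, here is a toy sketch. It is not Gap4's algorithm: the input representation (equal-length aligned strings, with a space where a read does not cover a column), the function name and both thresholds are assumptions chosen for illustration. It simply flags columns where the second most common base is well supported.

```python
from collections import Counter


def candidate_columns(reads, min_minor_fraction=0.25, min_depth=4):
    """Return the column indices where the second most common base
    accounts for at least min_minor_fraction of the covering reads.

    reads: equal-length aligned strings; ' ' marks "no coverage here".
    This is a simplified illustration, not the scoring Gap4 uses.
    """
    if not reads:
        return []
    columns = []
    for pos in range(len(reads[0])):
        counts = Counter(r[pos] for r in reads if r[pos] != " ")
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything about this column
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[1][1] / depth >= min_minor_fraction:
            columns.append(pos)
    return columns


if __name__ == "__main__":
    aligned = ["ACGT", "ACGT", "ATGT", "ATGT"]
    print(candidate_columns(aligned))
```

A real implementation would also need to weigh confidence values, so that a single low-quality base call does not count the same as a well-supported difference.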

Stage 2 of this plot is to cluster the differences in an attempt to pull apart the sequences into sets, where each set has few or no internal differences. These sets may then be viewed directly in the contig editor by colour coding and sorting the sequences. The editor also has manual control over moving sequences from one set to another, but saving edits made to this data is not currently supported.

Finally the consensus sequence for each set may be saved or alternatively the sets may be split apart to form multiple contigs.

The 454 sequencing machine

This new instrument by 454 Life Sciences is based on pyrosequencing techniques and so produces a very different style of trace to traditional Sanger sequencing. The 454 machine SFF format is now natively supported, as is a prototype for the TVF archive (although neither of these formats has been finalised yet and so both may be subject to change). Trev and Gap4 support viewing of "454 flowgrams".

The new flowgrams are substantially smaller than traditional traces, which gives rise to a number of interesting issues with file management. To compensate for this, there are now a variety of tools for packing multiple small traces into a single large archive. This can be either a Unix tar file (with a faster indexing method than before) or a new ZTA (ZTR Archive) file. See the io_lib package for more details and related tools.
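The benefit of indexing a tar archive can be sketched with Python's standard tarfile module. This is not the io_lib index format or its tools. The file names and helper functions below are hypothetical, and the example only shows the general idea: record where each member's data starts once, then fetch a trace by seeking straight to it rather than scanning the archive.

```python
import io
import tarfile


def pack(traces):
    """Pack a {name: bytes} mapping into an uncompressed tar, in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in traces.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_tar_index(tar_bytes):
    """Scan the archive once and map each member name to (data offset, size)."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_bytes, index, name):
    """Fetch one member's data directly, without rescanning the archive."""
    offset, size = index[name]
    return tar_bytes[offset:offset + size]


if __name__ == "__main__":
    # Hypothetical trace names and contents, for illustration only.
    archive = pack({"read1.ztr": b"trace-one", "read2.ztr": b"trace-two!"})
    index = build_tar_index(archive)
    print(read_member(archive, index, "read2.ztr"))
```

With an on-disk archive the same index would be used with a file seek and read, so the cost of fetching one trace no longer grows with the number of traces stored before it.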

We are aware that there are still a number of issues to address in combining pyrosequencing data with traditional Sanger sequencing data. There is a new replacement for Shuffle Pads which greatly improves alignments of 454 data produced by Gap4 or Phrap (it isn't needed if you use 454's own assembler), but there are still issues to resolve with the use of confidence values. We expect to be concentrating on this in the coming months.







Bug fixes
