Methods

Lots of data in little time, with big results: ZOOM

Speed
Several techniques are employed to increase speed while maintaining high sensitivity.
The secret is spaced seeds, first developed to reduce the time for DNA similarity searches in ZOOM's predecessor, PatternHunter [1].

A spaced seed is a fixed pattern such as 11*1**11111*1**1111. The number of 1-positions is the weight of the seed (13 in this example), while each * is a wildcard that does not require a match. Two DNA segments produce a hit when they agree at every 1-position of the pattern, whether or not they agree at the * positions; each hit marks a candidate region of similarity.
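
The matching rule can be illustrated with a minimal Python sketch. The function names and the sample windows below are made up for illustration and are not part of ZOOM or PatternHunter; only the seed pattern comes from the text.

```python
SEED = "11*1**11111*1**1111"  # example pattern from the text


def seed_weight(seed: str) -> int:
    """Weight of a seed: the number of 1-positions (required matches)."""
    return seed.count("1")


def seed_hits(seed: str, a: str, b: str) -> bool:
    """True if windows a and b agree at every 1-position of the seed.

    Positions marked '*' are wildcards and are not compared.
    """
    if len(a) != len(seed) or len(b) != len(seed):
        raise ValueError("both windows must be as long as the seed")
    return all(x == y for s, x, y in zip(seed, a, b) if s == "1")


# Hypothetical 19-base windows, chosen only to exercise the rule.
window_a = "ACGTACGTACGTACGTACG"
window_b = "ACTTAAGTACGTAGGTACG"  # differs only at wildcard positions 2, 5, 13
window_c = "TCGTACGTACGTACGTACG"  # differs at position 0, a 1-position

print(seed_weight(SEED))                    # 13
print(seed_hits(SEED, window_a, window_b))  # True
print(seed_hits(SEED, window_a, window_c))  # False
```

Three substitutions at wildcard positions leave the hit intact, while a single substitution at a 1-position removes it.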

Different spaced seeds have different hit probabilities on any given sample, so ZOOM and PatternHunter alike use several optimized spaced seeds together. To find all high-scoring local alignments between two long DNA sequences, this method identifies all hits and performs extensions only near these hits. Most low-scoring local alignments produce no hit and therefore cost no extension time, which greatly improves speed [2].
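
As a rough illustration of the hit-then-extend idea, the Python sketch below indexes one sequence under several seeds, looks up the other sequence, and scores a simple ungapped extension from a hit. The seeds, the scoring values and the X-drop cutoff are assumptions chosen for the example; they are not the optimized seeds or the extension procedure of PatternHunter or ZOOM.

```python
from collections import defaultdict


def seed_key(seed, seq, start):
    """Letters of seq at the 1-positions of the seed placed at `start`."""
    return "".join(seq[start + i] for i, s in enumerate(seed) if s == "1")


def find_hits(seeds, query, target):
    """Sorted (query_pos, target_pos) pairs hit by at least one seed."""
    hits = set()
    for seed in seeds:
        span = len(seed)
        index = defaultdict(list)
        for q in range(len(query) - span + 1):
            index[seed_key(seed, query, q)].append(q)
        for t in range(len(target) - span + 1):
            for q in index.get(seed_key(seed, target, t), ()):
                hits.add((q, t))
    return sorted(hits)


def extend_right(query, target, q, t, match=1, mismatch=-1, xdrop=2):
    """Best ungapped extension to the right of (q, t).

    Returns (best_score, length). Extension stops once the running score
    falls more than `xdrop` below the best score seen so far.
    """
    best = score = best_len = 0
    i = 0
    while q + i < len(query) and t + i < len(target):
        score += match if query[q + i] == target[t + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > xdrop:
            break
    return best, best_len


# Two toy seeds find different hits on the same pair of sequences.
print(find_hits(["1*1"], "ACGT", "AGGT"))        # [(0, 0)]
print(find_hits(["11"], "ACGT", "AGGT"))         # [(2, 2)]
print(find_hits(["1*1", "11"], "ACGT", "AGGT"))  # [(0, 0), (2, 2)]
print(extend_right("ACGTTCGT", "ACGTACGT", 0, 0))  # (6, 8)
```

Pairs of positions that share no seed key are never extended, which is where the time saving comes from.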

Accuracy
ZOOM extends the spaced seed strategy to short-read mapping and adds optimizations specific to next-generation sequencing by using different spaced seeds at several designated positions of the read. Its key objectives are low memory usage and high throughput. Other optimizations for next-generation sequencing include 100% sensitivity over a wide range of read lengths and mismatch counts.

Because each seed is applied at a designated position of the read, a spaced seed in ZOOM is the combination of its pattern and the read position at which it is used. For mapping accuracy, ZOOM's spaced seed weight is set so that the seeds achieve full sensitivity for a given read length X and mismatch bound Y: every read within Y mismatches of its true location is hit by at least one seed.
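
Full sensitivity can be checked by brute force for small cases. The Python sketch below is an illustration only: it enumerates every placement of Y mismatches in a read of length X and tests whether some positioned seed avoids all of them. The seed sets in the example are toy values, not ZOOM's actual seeds, and the function names are hypothetical.

```python
from itertools import combinations


def required_positions(pattern, offset):
    """Read positions that must match for a seed placed at `offset`."""
    return frozenset(offset + i for i, c in enumerate(pattern) if c == "1")


def is_fully_sensitive(positioned_seeds, read_length, max_mismatches):
    """True if every placement of `max_mismatches` mismatches is hit.

    A positioned seed hits when none of its required positions is a
    mismatch. Checking exactly `max_mismatches` mismatches is enough:
    a seed that avoids a set of positions also avoids every subset.
    """
    seed_sets = [required_positions(p, o) for p, o in positioned_seeds]
    if any(not s or max(s) >= read_length for s in seed_sets):
        raise ValueError("each seed needs a 1-position and must fit the read")
    for mismatches in combinations(range(read_length), max_mismatches):
        bad = set(mismatches)
        if not any(s.isdisjoint(bad) for s in seed_sets):
            return False
    return True


two_halves = [("111", 0), ("111", 3)]
three_parts = [("11", 0), ("11", 2), ("11", 4)]

print(is_fully_sensitive(two_halves, 6, 1))   # True
print(is_fully_sensitive(two_halves, 6, 2))   # False
print(is_fully_sensitive(three_parts, 6, 2))  # True
```

The toy example also shows the cost of tolerating more mismatches: going from one to two mismatches on a length-6 read needs more seeds of lower weight. Exhaustive enumeration is only practical for short reads and small mismatch bounds.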

This creates a trade-off. If the seed weight is too small, the seeds produce many spurious hits and the mapping process becomes extremely slow. If the seed weight is too high, a very large number of seeds is required to achieve full sensitivity, which requires more memory and slows down the mapping process as well.

Solution: ZOOM produces a tight lower bound on the number of spaced seeds to be used. The benchmarking section presents this scenario with a seed weight of 15 for a read length of 33 with two mismatches.

Insertions and Deletions
The benchmarking uses a basic model in which only mismatches are considered, but what about insertions and deletions between the reads and the reference genome? To treat the input read set as a whole instead of mapping reads one by one, ZOOM builds hash tables for the read set using the spaced seeds. For a given seed, the reads sharing the same letters at the 1-positions of the seed are grouped in the same entry of the hash table.
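
The grouping step might look like the following Python sketch, which uses a plain dictionary as the hash table. The pattern, the offset and the reads are invented for the example, and ZOOM's actual data structures are not described here.

```python
from collections import defaultdict


def seed_key(pattern, offset, read):
    """Letters of the read at the seed's 1-positions, seed placed at `offset`."""
    return "".join(read[offset + i] for i, c in enumerate(pattern) if c == "1")


def build_read_table(reads, pattern, offset=0):
    """Group read indices by their key under one positioned seed."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[seed_key(pattern, offset, read)].append(read_id)
    return dict(table)


reads = ["ACGTAC", "ATGTAC", "ACGTTC", "TTTTTT"]
table = build_read_table(reads, "1*11")
print(table)  # {'AGT': [0, 1, 2], 'TTT': [3]}
```

Reads 0, 1 and 2 differ only at the wildcard position or outside the seed, so they share one entry.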

ZOOM then scans the reference genome, finds the read candidates that have a hit at the current genome position, and verifies only these candidates. With these spaced seeds, the filtering step is designed not to miss any read that is within the mismatch threshold in the target region. Furthermore, because the reference genome is not indexed, ZOOM remains very memory efficient.
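
A simplified version of the scan-and-verify loop is sketched below in Python. It assumes that all reads have the same length, that verification is a plain mismatch count, and that only the forward strand is searched; the seeds, genome and reads are toy values, and none of the names correspond to ZOOM's real interface.

```python
from collections import defaultdict


def seed_key(pattern, text, start):
    """Letters of text at the seed's 1-positions, seed placed at `start`."""
    return "".join(text[start + i] for i, c in enumerate(pattern) if c == "1")


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, genome, positioned_seeds, max_mismatches):
    """Sorted (read_id, genome_pos, mismatches) for all verified candidates."""
    read_len = len(reads[0])  # assumption: all reads share one length
    # Index the reads, not the genome: one table per positioned seed.
    tables = []
    for pattern, offset in positioned_seeds:
        table = defaultdict(list)
        for read_id, read in enumerate(reads):
            table[seed_key(pattern, read, offset)].append(read_id)
        tables.append((pattern, offset, table))
    found = set()
    for g in range(len(genome) - read_len + 1):
        window = genome[g:g + read_len]
        for pattern, offset, table in tables:
            # Only reads sharing this window's key are verified.
            for read_id in table.get(seed_key(pattern, window, offset), ()):
                d = hamming(reads[read_id], window)
                if d <= max_mismatches:
                    found.add((read_id, g, d))
    return sorted(found)


genome = "TTGACCTAGGCATT"
reads = ["GACCTA", "TAGGTA", "CCCCCC"]   # read 1 has one mismatch at position 6
seeds = [("111", 0), ("111", 3)]         # toy set, fully sensitive for 1 mismatch
print(map_reads(reads, genome, seeds, max_mismatches=1))
# [(0, 2, 0), (1, 6, 1)]
```

Memory use in this sketch grows with the number of reads and seeds rather than with the genome size, which reflects the design choice of not indexing the reference.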

Single-end & Paired-end Read Mapping
ZOOM supports mapping of both single-end and paired-end reads. When the mapping distance between two paired reads falls within a user-defined range, the mapping information is collected. Experiments have shown that paired-end information helps identify true mapping positions and contributes significantly to mapping accuracy.
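
The distance filter can be pictured with a short Python sketch. It assumes that each mate already has a list of candidate positions and that the distance is simply the difference between the two positions; strand and orientation are ignored, and the function name and numbers are hypothetical.

```python
def pair_candidates(hits_1, hits_2, min_dist, max_dist):
    """Pairs (pos1, pos2) whose distance pos2 - pos1 lies in the allowed range."""
    return [
        (p1, p2)
        for p1 in sorted(hits_1)
        for p2 in sorted(hits_2)
        if min_dist <= p2 - p1 <= max_dist
    ]


# Each mate maps to two places on its own; only one combination is consistent.
mate_1_hits = [100, 5000]
mate_2_hits = [350, 9000]
print(pair_candidates(mate_1_hits, mate_2_hits, min_dist=200, max_dist=300))
# [(100, 350)]
```

In this toy case each mate is ambiguous alone, but the distance constraint leaves a single consistent placement.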

Scoring
ZOOM uses the quality scores of Illumina reads. Low-quality Illumina reads are recognized, and reads are mapped relying only on high-quality bases. For AB SOLiD data, sequencing errors in color space can be corrected, and base-space polymorphisms and sequencing errors are marked separately.
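
One way to picture mapping on high-quality bases only is to ignore low-quality positions when counting mismatches, as in the Python sketch below. The threshold of 20 and the use of integer quality scores are assumptions made for the example; the source does not say how ZOOM defines a high-quality base.

```python
def high_quality_mismatches(read, reference, qualities, min_quality=20):
    """Count mismatches only at positions whose quality is >= min_quality.

    `qualities` holds one integer score per base; `min_quality` is a
    hypothetical cutoff, not a documented ZOOM parameter.
    """
    if not (len(read) == len(reference) == len(qualities)):
        raise ValueError("read, reference and qualities must have equal length")
    return sum(
        1
        for r, g, q in zip(read, reference, qualities)
        if q >= min_quality and r != g
    )


read = "ACGTAC"
reference = "ACGAAT"
qualities = [30, 30, 30, 5, 30, 30]
# Position 3 mismatches but is low quality; position 5 mismatches at high quality.
print(high_quality_mismatches(read, reference, qualities))  # 1
```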

Footnotes:

  1. Ma B, Tromp J, Li M. PatternHunter: Faster and More Sensitive Homology Search. Bioinformatics. 2002 Mar;18(3):440-5.
  2. Lin H, Zhang Z, Zhang MQ, Ma B, Li M. ZOOM! Zillions Of Oligos Mapped. Bioinformatics. 2008 Nov 1;24(21):2431-7.