Method


BLAST, published in 1990, was designed to speed up homology searching over the Smith-Waterman algorithm (published in 1981) by sacrificing sensitivity for speed. Today, neither BLAST nor Smith-Waterman can keep pace with the exponential growth of genomics data. PatternHunter uses modern homology search technology invented by BSI. [1]

One such technology is optimized multiple spaced seeds. With new algorithms and ideas, PatternHunter challenges the conventions of homology searching and offers a more robust solution: there is no need to sacrifice sensitivity for speed. PatternHunter approaches the sensitivity of the Smith-Waterman method, long the standard for sensitive searching, while running thousands of times faster.

Spaced Seeds

Comparing every position in a query with every position in the database is prohibitively expensive, given the size of today's databases and queries. A more time-efficient solution is to use a heuristic method for homology searching.

BLAST uses a heuristic method in which a short, continuous sequence of letters serves as a "seed". An exact match of this seed suggests to the algorithm that a longer match may be found nearby. Unfortunately, this means BLAST only looks for homologs in regions that contain such a hit. PatternHunter also uses seeds; the difference is that its seeds can be made up of a discontinuous sequence of letters. By adjusting the relative positions of the letters in the discontinuous sequence, we can optimize the seed to increase sensitivity.

The relative positions of the letters are encoded as a binary string. For example, in the seed model "111010010100110111", a "1" means the letter at that position is required to match, and a "0" means the letter at that position is not required to match. The number of 1s is called the weight of the seed.
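The seed model can be made concrete with a short sketch. This is an illustration only, not PatternHunter's implementation: the function names and the two example sequences are made up here, and the scan simply slides the seed along one ungapped alignment.

```python
SEED = "111010010100110111"  # seed model quoted in the text


def seed_weight(seed: str) -> int:
    """Weight = number of positions that are required to match."""
    return seed.count("1")


def is_hit(seed: str, a: str, b: str) -> bool:
    """True if windows a and b agree at every '1' position of the seed."""
    if not (len(a) == len(b) == len(seed)):
        raise ValueError("windows must have the same length as the seed")
    return all(x == y for s, x, y in zip(seed, a, b) if s == "1")


def has_hit(seed: str, a: str, b: str) -> bool:
    """Slide the seed along two equal-length, ungapped aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    k = len(seed)
    return any(is_hit(seed, a[i:i + k], b[i:i + k])
               for i in range(len(a) - k + 1))


# Made-up example: b differs from a exactly at the seed's "0" positions.
a = "ACGTACGTACGTACGTAC"
b = "".join(x if s == "1" else "N" for s, x in zip(SEED, a))

print(seed_weight(SEED))       # 11
print(has_hit(SEED, a, b))     # True: every required position matches
print(has_hit("1" * 11, a, b)) # False: no run of 11 consecutive matches
```

In this made-up pair the mismatches fall only on "don't care" positions, so the spaced seed hits while a consecutive seed of the same weight does not.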

For example, the following homology can be "hit" (detected) by the above-mentioned spaced seed.

[Figure: PatternHunter (spacedseeds)]

Consecutive Seeds

There are two factors that affect the performance of a seed: the selectivity and the sensitivity.

Selectivity determines the search speed: the more required matches (1s) a seed contains, the fewer hits it generates, and thus the faster the search.
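As a rough illustration of why weight drives selectivity, the sketch below estimates the number of chance hits between unrelated sequences. It assumes independent, uniformly distributed letters over a four-letter DNA alphabet and ignores edge effects; these are simplifying assumptions, and the function names are hypothetical.

```python
def random_hit_probability(weight: int, alphabet_size: int = 4) -> float:
    """Chance that one random pair of positions satisfies a seed of this
    weight, assuming independent, uniformly distributed letters."""
    return (1.0 / alphabet_size) ** weight


def expected_random_hits(weight: int, query_len: int, db_len: int,
                         alphabet_size: int = 4) -> float:
    """Approximate count of chance hits (edge effects ignored)."""
    return query_len * db_len * random_hit_probability(weight, alphabet_size)


# Each extra required match cuts the expected chance hits by a factor of 4.
ratio = expected_random_hits(11, 1000, 10**6) / expected_random_hits(12, 1000, 10**6)
print(ratio)
```

Under this model the estimate depends only on the weight, not on where the 1s are placed.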

Sensitivity determines the search quality: not all homologs can be hit by a given seed. For example, the seed 11111111111 cannot hit the above-mentioned alignment. The goal is to optimize the seed so that it hits as many homologs as possible.

Two seeds with the same weight will generate approximately the same number of hits. A spaced seed and a consecutive seed with the same weight will therefore have very similar selectivity; however, the spaced seed will have better sensitivity. This is because when a consecutive seed finds a hit, a second hit at the next position of the homology is very likely, as it requires only one more letter match. The second hit is redundant because only one hit is required to find the homology. [1]

[Figure: PatternHunter (spacedseeds 2)]

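Hits of a spaced seed at neighboring positions are more independent of each other: shifting a spaced seed by one position requires several additional letter matches rather than just one. It is therefore less likely that a single homology produces more than one hit.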

[Figure: PatternHunter (spacedseeds 3)]
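The difference in redundancy can be counted directly. The sketch below, with a hypothetical helper name, computes how many additional letter matches are needed for a second hit one position further along, given that the seed already hits at the current position.

```python
def new_matches_for_shifted_hit(seed: str, shift: int = 1) -> int:
    """Required positions of the shifted seed that the original hit
    does not already guarantee to match."""
    required = {i for i, c in enumerate(seed) if c == "1"}
    shifted = {i + shift for i in required}
    return len(shifted - required)


print(new_matches_for_shifted_hit("1" * 11))              # 1
print(new_matches_for_shifted_hit("111010010100110111"))  # 6
```

Both seeds have weight 11, but a neighboring hit of the consecutive seed needs only one new match, while the spaced seed needs six.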

Therefore, with approximately the same number of hits, a spaced seed will detect more homologies.
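This claim can be checked with a small Monte Carlo sketch. The model is an assumption made for illustration: an ungapped homologous region of 64 columns in which each column matches independently with probability 0.7. The function name, trial count, and random seed are likewise arbitrary choices, not part of PatternHunter.

```python
import random


def hit_probability(seed: str, length: int = 64, similarity: float = 0.7,
                    trials: int = 10000, rng_seed: int = 0) -> float:
    """Estimate the chance that `seed` hits a random ungapped region.

    Each column of the region is a match (bit 1) with probability
    `similarity`, independently of the others (a simplifying assumption).
    """
    rng = random.Random(rng_seed)  # same rng_seed gives the same regions
    mask = int(seed, 2)            # required positions as a bitmask
    span = len(seed)
    hits = 0
    for _ in range(trials):
        region = 0
        for _ in range(length):
            region = (region << 1) | (rng.random() < similarity)
        # The seed hits at offset s if all required bits are set there.
        if any((region >> s) & mask == mask
               for s in range(length - span + 1)):
            hits += 1
    return hits / trials


spaced = hit_probability("111010010100110111")
consecutive = hit_probability("1" * 11)
print(f"spaced: {spaced:.3f}  consecutive: {consecutive:.3f}")
```

Because both calls use the same random seed, the two seeds are evaluated on the same simulated regions, which makes the comparison less noisy.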

Multiple Seeds Increase Sensitivity

Any given seed may fail to detect some homologs. Because different seeds tend to fail on different homologs, using several seeds simultaneously can significantly improve the success rate. It is very important to optimize the combination of seeds so that their detection abilities complement each other.
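A toy sketch of the idea follows. The two short weight-3 seeds and the two match strings are invented for illustration and are not PatternHunter's optimized seeds. An alignment is written as a string of "1" (match) and "0" (mismatch) columns.

```python
def seed_hits(seed: str, matches: str) -> bool:
    """True if the seed hits somewhere in the match/mismatch string."""
    k = len(seed)
    return any(
        all(m == "1" for s, m in zip(seed, matches[i:i + k]) if s == "1")
        for i in range(len(matches) - k + 1)
    )


def any_seed_hits(seeds: list[str], matches: str) -> bool:
    """With multiple seeds, one hit from any seed is enough."""
    return any(seed_hits(seed, matches) for seed in seeds)


seeds = ["1101", "1011"]       # hypothetical toy seeds
alignment_x = "110100"         # hit by "1101" only
alignment_y = "001011"         # hit by "1011" only

for alignment in (alignment_x, alignment_y):
    print([seed_hits(s, alignment) for s in seeds],
          any_seed_hits(seeds, alignment))
```

Each toy seed misses one of the two alignments, yet the pair detects both, which is the complementary behavior the seed combination should be optimized for.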

 

Footnote:

  1. Ma B, Tromp J, Li M. PatternHunter: Faster and More Sensitive Homology Search. Bioinformatics. 2002 Mar;18(3):440-5.