Authors:
Anna Gambin 1; Sławomir Lasota 2; Michał Startek 2; Maciej Sykulski 2; Laurent Noé 3 and Gregory Kucherov 3
Affiliations:
1 University of Warsaw and Mossakowski Medical Research Centre Polish Academy of Sciences, Poland; 2 University of Warsaw, Poland; 3 LIFL/CNRS/INRIA, France
Keyword(s):
Sequence alignment, Protein BLAST, Subset seed, DFA, Genetic algorithm.
Related Ontology Subjects/Areas/Topics:
Algorithms and Software Tools; Bioinformatics; Biomedical Engineering; Sequence Analysis
Abstract:
The seeding technique has become central to the theory of sequence alignment, and several efficient tools apply seeds to DNA homology search. Recently, the concept of subset seeds was proposed for similarity search in protein sequences. We experimentally evaluate the applicability of subset seeds to protein homology search. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds that are specifically designed for a given protein family. We develop a representation of seeds by deterministic finite automata (DFAs) and build it into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original NCBI-BLAST on the GPCR protein family. Our results demonstrate a clear superiority of SeedBLAST in terms of efficiency, especially in the case of twilight zone hits. SeedBLAST is open-source software, freely available at http://bioputer.mimuw.edu.pl/papers/sblast. Supplementary material and a user manual are also provided.