Authors:
Bráulio Roberto Gonçalves Marinho Couto (1); Macelo Matos Santoro (2) and Marcos Augusto dos Santos (2)
Affiliations:
(1) Centro Universitário de Belo Horizonte / UNI-BH, Brazil; (2) UFMG, Brazil
Keyword(s):
Genomics, Matrix analysis, BLAST, SVD.
Related Ontology Subjects/Areas/Topics:
Algorithms and Software Tools; Bioinformatics; Biomedical Engineering; Pattern Recognition, Clustering and Classification; Sequence Analysis
Abstract:
The dominant methods for finding relevant patterns in protein sequences rely on character-by-character matching, as performed by the BLAST software. In this paper, sequences are recoded as a p-peptide frequency matrix, which is then reduced by singular value decomposition (SVD). The objective is to evaluate the association between the statistics used by BLAST and the similarity metrics used with SVD (Euclidean distance and cosine). We chose BLAST as the standard because this string-matching program is widely used for searching nucleotide and protein databases. Three datasets were used: mitochondrial-gene sequences, non-identical PDB sequences and a Swiss-Prot protein collection. We built scatter plots and calculated the Spearman correlation (ρ) between the metrics produced by BLAST and SVD. Euclidean distance was negatively correlated with bit score (>-0.6) and positively correlated with E value (>+0.7). Cosine was negatively correlated with E value (>-0.7) and positively correlated with bit score (>+0.8). We also ran agreement tests between SVD and BLAST in classifying protein families. For the mitochondrial gene database, we achieved a kappa coefficient of 1.0. For the Swiss-Prot sample, agreement was higher than 80%. The strong correlation between SVD and BLAST results suggests that SVD could serve as a core technique within a broader algorithm.
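The recoding and reduction steps described above can be illustrated with a minimal Python sketch using NumPy. It is not the authors' implementation: the abstract does not state the value of p, the rank kept after SVD, or how the reduced vectors are formed, so the choices below (p = 2, k = 2, sequence coordinates taken from the rows of V_k scaled by the singular values) are assumptions, as are the function names and the three toy sequences.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def peptide_frequency_matrix(sequences, p=2):
    """Count overlapping p-peptides: rows are the 20**p peptides, columns are sequences."""
    index = {"".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=p))}
    matrix = np.zeros((len(index), len(sequences)))
    for col, seq in enumerate(sequences):
        for start in range(len(seq) - p + 1):
            row = index.get(seq[start:start + p])
            if row is not None:  # skip windows containing non-standard residues
                matrix[row, col] += 1
    return matrix


def reduce_by_svd(matrix, k):
    """Return one k-dimensional vector per sequence (assumed convention: S_k V_k^T)."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy data: the first two sequences differ by one residue,
# the third shares no dipeptide with either of them.
sequences = ["MKVLAAGIVGLLLAQ", "MKVLAAGIVGLLMAQ", "GSHWPYTTCCDDEEF"]
matrix = peptide_frequency_matrix(sequences, p=2)
reduced = reduce_by_svd(matrix, k=2)

print("distance 0-1:", euclidean(reduced[0], reduced[1]))
print("distance 0-2:", euclidean(reduced[0], reduced[2]))
print("cosine   0-1:", cosine(reduced[0], reduced[1]))
print("cosine   0-2:", cosine(reduced[0], reduced[2]))
```

In this toy case the near-identical pair ends up closer in Euclidean distance and higher in cosine than the unrelated pair, which is the direction of association with BLAST scores that the abstract reports. Real datasets would need a sparse matrix representation for larger p.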