peptides will lead to problems during the similarity
search step. According to Stuart et al. (2002),
tripeptides may prove useful with highly diverged
sequences and tetrapeptides with highly related
proteins. On the other hand, larger peptides will
leave real similarities undetected, even between
highly related proteins.
Representing proteins as frequency vectors of
p-peptides has the limitation that it does not consider
the order in which the p-peptides occur in the sequence.
Despite this possible ambiguity, several studies have
shown that this approach is surprisingly effective in
discriminatory analysis of protein sequences (Vinga
and Almeida, 2003). Nevertheless, before using this
protein vector representation, we analyzed its
ambiguity rate as a function of the number of
amino acids (p) per peptide in the protein-peptide
frequency matrix. We compared 26,675 non-identical
proteins longer than 100 amino acids, selected from the
PDB dataset. To identify ambiguities during vector
recoding, we compared all 355,764,475 sequence
pairs. The percentage of ambiguity fell from about
4% when single amino acids were used in the frequency
matrix (p=1) to less than 0.5% when peptides of
two or more amino acids were used. The percentage of
ambiguity was calculated as the number of different
sequences sharing the same vector coding relative to all
sequences that were compared pairwise (26,675).
It is noteworthy that in all pairs with identical vector
coding, even among the 1,267 pairs with p=1, the
two sequences corresponded to the same protein, with
minor amino acid changes at some positions. This
happened because, before the analysis, we removed
from the PDB database only sequences with 100%
identity. Ambiguity is therefore a theoretical
possibility, but it did not arise in practice.
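The ambiguity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: p-peptides are counted with overlapping windows, the three toy sequences are invented, and the function names (peptide_vector, ambiguous_pairs, pair_count) are hypothetical rather than taken from the original analysis.

```python
from collections import Counter, defaultdict
from itertools import combinations


def peptide_vector(sequence, p):
    """Count overlapping p-peptides; the order of occurrence is discarded."""
    counts = Counter(sequence[i:i + p] for i in range(len(sequence) - p + 1))
    # A sorted tuple of (peptide, count) items is hashable and order-free.
    return tuple(sorted(counts.items()))


def ambiguous_pairs(sequences, p):
    """Return index pairs of sequences that share the same frequency vector."""
    groups = defaultdict(list)
    for index, seq in enumerate(sequences):
        groups[peptide_vector(seq, p)].append(index)
    return [pair for members in groups.values()
            for pair in combinations(members, 2)]


def pair_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


# Invented toy sequences: the first two have the same composition.
toy = ["ACDA", "ADCA", "ACDC"]
print(ambiguous_pairs(toy, 1))  # pairs that are ambiguous with p=1
print(ambiguous_pairs(toy, 2))  # pairs that are ambiguous with p=2
print(pair_count(26675))        # pairs among the 26,675 PDB sequences
```

Grouping sequences by their vector avoids an explicit loop over every pair, which matters when the number of pairs reaches hundreds of millions.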
2.3 Singular Value Decomposition
After the generation of the p-peptide frequency
matrix (M) representing each dataset with n
sequences, the matrix itself is subjected to SVD
(Deerwester et al., 1990; Berry et al., 1995) and
factorized as $M = U S V^T$, where U is the p x p
orthogonal matrix having the left singular vectors of
M as its columns, V is the n x n orthogonal matrix
having the right singular vectors of M as its
columns, and S is the p x n diagonal matrix with the
singular values $\sigma_1 \geq \sigma_2 \geq \sigma_3 \geq \dots \geq \sigma_r$
of M in order along its diagonal (r is the rank of M,
that is, the number of linearly independent columns or
rows of M). Many software packages perform this
factorization, including MATLAB
(The Mathworks, 1996), used in this work. The
matrix (U) is related to the p-peptides of the dataset,
whilst (V) is associated with the proteins studied.
The central matrix (S) contains the singular values
of (M) in decreasing order. These singular values are
directly related to independent characteristics
within the dataset. The largest values of (S)
capture the most significant features of the peptides
and proteins in the matrix (M), whereas the smaller
singular values identify less significant aspects and
the noise within the dataset (Eldén, 2006). The number
of significant singular values from the SVD analysis
indicates how many processes or groups may be hidden
in the database.
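As a minimal sketch of this factorization, the following Python fragment uses NumPy instead of the MATLAB routines used in this work. The 4 x 3 count matrix is invented for illustration, with peptides as rows and proteins as columns.

```python
import numpy as np

# Invented toy frequency matrix: 4 p-peptides (rows) x 3 proteins (columns).
M = np.array([[2.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 1.0]])

# full_matrices=True gives square U (p x p) and V^T (n x n);
# s holds the singular values in non-increasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Rebuild the rectangular p x n matrix S with s on its diagonal.
S = np.zeros(M.shape)
S[:len(s), :len(s)] = np.diag(s)

print(U.shape, S.shape, Vt.shape)
print(s)                            # singular values, largest first
print(np.allclose(U @ S @ Vt, M))   # checks that M = U S V^T
```

Rows of U correspond to the p-peptides and rows of V (columns of Vt) to the proteins, matching the roles described above.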
For the sequence similarities analysis, instead of
using the original matrix M, a rank reduction of M is
done by using the k-largest singular values of M, or
k-largest singular triplet $U_k$, $S_k$, $V_k$, where k < r. The
truncated matrix $M_k = U_k S_k (V_k)^T$ has two main
advantages. Reduced dimensionality makes the
problem computationally approachable, which is
crucial in whole genome analysis. More importantly,
the rank reduction improves the accuracy of the
protein matrix by discarding noise and reducing the
variability in p-peptide usage for the same protein
family (Couto et al., 2007). The choice of k, the
number of singular values that must be used in the
reconstruction of the protein matrix after SVD, is
critical and normally empirically decided. Ideally,
the k factor or matrix dimension must be large
enough to fit all the real structure in the data, and
small enough not to fit the sampling error or
unimportant details. In this work we used the
method proposed by Everitt and Dunn, which
recommends analyzing the relative variance of each
singular value. Singular values whose relative
variance is less than 0.7/n, where n is the number of
proteins in the document-term (here, peptide-protein)
matrix, must be ignored (Everitt and Dunn, 2001).
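The rank reduction and the 0.7/n rule can be sketched as follows. Two points are assumptions of this sketch and are not stated in the text: the relative variance of a singular value is taken to be its square divided by the sum of all squared singular values, and the function names choose_k and truncate are hypothetical. The toy matrix is invented.

```python
import numpy as np


def choose_k(singular_values, n_proteins):
    """Count singular values whose relative variance is at least 0.7/n.

    Assumption: relative variance = sigma_i**2 / sum_j sigma_j**2.
    """
    variances = np.asarray(singular_values, dtype=float) ** 2
    relative = variances / variances.sum()
    return int(np.count_nonzero(relative >= 0.7 / n_proteins))


def truncate(M, k):
    """Return the rank-k matrix M_k = U_k S_k (V_k)^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Invented toy frequency matrix: 4 p-peptides (rows) x 3 proteins (columns).
M = np.array([[2.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 1.0]])

s = np.linalg.svd(M, compute_uv=False)
k = choose_k(s, n_proteins=M.shape[1])
M_k = truncate(M, k)
print(k, np.linalg.matrix_rank(M_k))
```

Because the singular values are sorted in decreasing order, the values that pass the threshold always form a leading block, so counting them gives k directly.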
3 RESULTS
First, we analyzed 620 sequences randomly
selected from the first database with mitochondrial
gene families. BLAST, specifically the bl2seq.exe
program with default parameters, was used to compare
each pair of sequences, totalling 191,890
comparisons. The same proteins were recoded as
vectors in a high-dimensional space that was
reduced by SVD and analyzed according to the
methods described by Couto (Couto et al., 2007).
Scatter plots were built and suggested that Euclidean
distance is negatively related to bit score, but
positively correlated with E value. For the cosine we
found a negative association with E value and a
SINGULAR VALUE DECOMPOSITION (SVD) AND BLAST - Quite Different Methods Achieving Similar Results