components, the number of processes that exist
hidden in the database. More specifically, as an
application of SVD, we want to show that the
number of the most significant singular values is
associated with the number of protein families in a
sequence database. Such a prediction can be used in
phylogenetic inference, data mining, clustering, etc.,
making experimental tests more efficient and
avoiding random determination of possible
outcomes.
2 SYSTEM AND METHODS
Programs implemented for this analysis were written
in MATLAB (The Mathworks, 1996), using its
built-in functions (SVD, sparse matrix manipulation
subroutines, etc.). Four datasets were used in this
paper. The first database evaluated had 64 vertebrate
mitochondrial genomes composed of 832 proteins
from 13 known gene families (ATP6, ATP8, COX1,
COX2, COX3, CYTB, ND1, ND2, ND3, ND4,
ND4L, ND5 and ND6). This curated protein
database was downloaded from the online
information provided with the paper by Stuart et al.
(2002). The second database was composed of
protein sequences retrieved from GenBank on
19/04/2006. It is a random sample of 100 sequences
of each protein
type: globin, cytochrome, histone, cyclohydrolase,
pyrophosphatase, ferredoxin, keratin and collagen,
plus 200 other proteins, totalling 1,000 sequences
from ten different types of genes. The third database
was the file "pdb_seqres.txt.gz", located at
http://bioserv.rpbs.jussieu.fr/PDB/. This file has
121,556 redundant protein sequences from the PDB
(Protein Data Bank), which were reduced to 37,561
non-identical sequences. From this file we recovered
all sequences related to six types of enzymes:
Ligase, Isomerase, Lyase, Hydrolase, Transferase
and Oxidoreductase, which totalled 10,915 proteins.
We also recovered a sample of 219 globins from the
PDB file, which was used as another test set. In
addition, we extracted 86 sequences of haemoglobin
alpha-chain and a sample from the PDB file with all
sequences longer than 47 amino acids (31,906
proteins from several types of genes). Each of the
above sequence files was analyzed by MATLAB
subroutines that generate twelve tripeptide sparse
matrices as described by Stuart et al. (2002) and
adapted by Couto et al. (2007).
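The two filtering steps mentioned above (reducing a redundant file to non-identical sequences, and keeping only sequences longer than 47 amino acids) can be sketched in Python as follows. This is only an illustration, not the MATLAB code used in this work: the function names are hypothetical, and treating sequences as identical after trimming whitespace and ignoring case is an assumption.

```python
def unique_sequences(sequences):
    """Drop exact duplicates while keeping first-seen order.

    Assumption: two sequences are 'identical' if they match after
    stripping whitespace and upper-casing.
    """
    seen = set()
    unique = []
    for seq in sequences:
        key = seq.strip().upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def longer_than(sequences, min_length=47):
    """Keep only sequences with more than min_length residues."""
    return [seq for seq in sequences if len(seq) > min_length]
```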
All sequences were recoded as tripeptide
frequency values using all possible overlapping
tripeptide windows. With 20 amino acids, this
generates a matrix M (8,000 x n), where n is the
number of proteins to be analyzed. After the
generation of the tripeptide frequency matrix (M),
the matrix itself is subjected to SVD (Deerwester et
al., 1990; Berry et al., 1995) and factorized as
$M = USV^T$, where U is the p x p orthogonal matrix
having the left singular vectors of M as its columns,
V is the n x n orthogonal matrix having the right
singular vectors of M as its columns, and S is the p x
n diagonal matrix with the singular values
$\sigma_1 \geq \sigma_2 \geq \sigma_3 \geq \dots \geq \sigma_r$
of M in order along its diagonal (r is the
rank of M or the number of linearly independent
columns or rows of M). These singular values are
directly related to independent characteristics within
the dataset. In fact, the largest values of (S)
account for the most important information about
the peptides and proteins in the matrix (M). On the
other hand, the smaller singular values identify less
significant aspects and the noise inside the dataset
(Eldén, 2006).
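As an illustration of the recoding and factorization described above, the following Python/NumPy sketch builds a small tripeptide count matrix and decomposes it. It is not the MATLAB code used in this work. The names `AMINO_ACIDS`, `TRIPEPTIDE_INDEX`, `tripeptide_matrix` and the toy sequences are hypothetical, raw counts stand in for the frequency values, a dense array replaces the sparse matrices, and the thin SVD (`full_matrices=False`) is used, so the returned U is 8,000 x n rather than p x p.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Map each of the 20^3 = 8,000 possible tripeptides to a row index.
TRIPEPTIDE_INDEX = {
    "".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=3))
}


def tripeptide_matrix(sequences):
    """Return an 8,000 x n matrix of overlapping tripeptide counts."""
    M = np.zeros((len(TRIPEPTIDE_INDEX), len(sequences)))
    for col, seq in enumerate(sequences):
        for start in range(len(seq) - 2):
            row = TRIPEPTIDE_INDEX.get(seq[start:start + 3])
            if row is not None:  # skip windows with non-standard residues
                M[row, col] += 1
    return M


# Toy sequences invented for this example only.
toy_sequences = ["MVLSPADKTNVKAAW", "MVLSGEDKSNIKAAW", "MKTAYIAKQRQISFV"]
M = tripeptide_matrix(toy_sequences)

# Thin SVD: s holds the singular values in non-increasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
```

Here `s` is the singular value spectrum of the toy matrix, and `(U * s) @ Vt` reconstructs `M` up to rounding error.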
In this work our focus is only on the matrix (S)
and its diagonal values ($s_i$), which make up the
singular value spectrum. The magnitude of any
singular value is indicative of its importance in
explaining the data (Wall et al., 2003). The objective
here is therefore to visualize the singular value
spectrum as plots that help biologists to discover the
main components, the processes, and the groups
hidden in the database. Two graphs were built:
a) the scree plot, with the 25 largest singular
values for each database;
b) the cumulative relative variance ($V_i$)
captured by the ith singular value:
$V_i = 1 - S_i^2 / \sum_k S_k^2$, where $S_i$ is the
ith singular value and k = 1, 2, …, n.
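The quantities behind the second graph can be sketched in plain Python as follows. This is a hedged illustration with hypothetical function names: `relative_variance` returns the ratio $S_i^2 / \sum_k S_k^2$ that appears in the expression above, `v_as_printed` evaluates that expression literally, and `cumulative_variance` is an assumption about how a running total over the spectrum would be formed.

```python
def relative_variance(singular_values):
    """Fraction S_i^2 / sum_k S_k^2 for each singular value."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    return [sq / total for sq in squares]


def v_as_printed(singular_values):
    """Literal evaluation of V_i = 1 - S_i^2 / sum_k S_k^2."""
    return [1.0 - f for f in relative_variance(singular_values)]


def cumulative_variance(singular_values):
    """Running sum of the relative variance (assumed definition)."""
    running = 0.0
    cumulative = []
    for fraction in relative_variance(singular_values):
        running += fraction
        cumulative.append(running)
    return cumulative
```

For a toy spectrum such as `[3.0, 1.0]`, the first singular value accounts for 9/10 of the total squared magnitude.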
The visual examination of the scree plot looks
for a “gap” or an “elbow” that indicates how many
significant singular values exist in the database.
After the “gap” there is no significant value. The
second graph helps to understand how much
variance is explained by each singular value. Despite
efforts towards automatic analysis, visual inspection
of graphs is still one of the most commonly used
methods in practice for dimensionality selection
(Zhu and Ghodsi, 2006).
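The analysis in this work relies on visual inspection. Purely as an illustration of what the eye looks for, the sketch below marks the position of the largest drop between consecutive singular values among the 25 largest. The function name `largest_gap` and the largest-drop rule are assumptions made for this example; they are not the procedure used here, nor the method of Zhu and Ghodsi (2006).

```python
def largest_gap(singular_values, top=25):
    """Number of singular values before the largest consecutive drop.

    Only the `top` largest values are considered, mirroring the
    25 values shown in the scree plot.
    """
    values = sorted(singular_values, reverse=True)[:top]
    drops = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    if not drops:  # zero or one value: nothing to compare
        return len(values)
    return drops.index(max(drops)) + 1
```

A spectrum with three large values followed by small ones, for example `[10.0, 9.5, 9.0, 2.0, 1.8, 1.5]`, has its largest drop after the third value.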
3 RESULTS
When there is only one specific type of protein in
the database, such as haemoglobin alpha-chain, the
singular value spectrum obtained shows a “big gap”
after the first singular value (Figure 1). This result is
confirmed by the second graph (Figure 2), which
indicates that more than 90% of the variance is
explained by the first singular value, which is
compatible with the database itself.
BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms