testing dataset. Each family was divided into a
training and a testing set. Specifically, 75% of the
members of each family were used as training data for
the individual models, while the remaining 25% of each
family was compiled into a classification testing set.
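The per-family split described above can be sketched as follows. This is only an illustration: the text does not say how the 75% of members were selected, so the random shuffle, the `seed` parameter and the function name `split_families` are assumptions.

```python
import random


def split_families(families, train_fraction=0.75, seed=0):
    """Split each family into a training part and a pooled testing part.

    `families` maps a family identifier to a list of sequences.
    Returns (train, test), where `train` maps each family to its training
    sequences and `test` is one pooled list of (family, sequence) pairs.
    """
    rng = random.Random(seed)  # assumption: members are chosen at random
    train, test = {}, []
    for family, members in families.items():
        shuffled = list(members)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train[family] = shuffled[:cut]
        # The held-out members of all families go into one testing set.
        test.extend((family, seq) for seq in shuffled[cut:])
    return train, test
```

Each family keeps its own training list because one model is trained per family, while the held-out members are pooled so that every test sequence can be scored against all models.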
Table 1: Accuracy (%) of protein classification for individual
protein families from PROSITE and overall accuracy.

Class       cswHMM   VOGUE   HMMER   HMM
PDOC00662   81,82    81,82   72,73   27,27
PDOC00670   85,71    80,36   73,21   71,4
PDOC00561   90,48    95,24   42,86   61,9
PDOC00064   85,71    85,71   85,71   85,71
PDOC00154   71,88    71,88   71,88   59,38
PDOC00224   91,67    87,5    100     79,17
PDOC00271   91,89    89,19   100     64,86
PDOC00343   92,85    89,29   96,43   71,43
PDOC00397   80       100     40      60
PDOC00443   85,71    100     85,71   85,71
Average     86,38    85,11   80,43   67,66

We trained the cswHMM with different values of the
following parameters: maximal pattern length, maximal
gap length, minimal pattern frequency, similarity
function coefficients and the switching probability.
The models that gave the best performance were
chosen. The optimal maximal pattern length was found
to be 6 to 7. The maximal gap length did not
significantly influence the results when it was
higher than 5. The average probability of switching
between sub-models was 0,86, which means that most
of the modelled patterns were gapped.
Table 1 compares the results of the different models
on the individual datasets and their overall
accuracy. cswHMM is comparable with the other
methods and even slightly surpasses them in overall
accuracy, but this type of application should not be
the primary function of cswHMM. We use it rather as
a validation of our proposed model, whose main
purpose is to improve the analysis of sequences with
mixed contexts and to identify those contexts. To
this end, we are currently developing a tool to
analyze individual sub-models and their performance
during sequence analysis.
3.2 Loop Modelling
We analyzed eight arbitrarily selected protein
families defined in the Pfam database (Finn, 2010),
using the alignment of the seed sequences of each
family to determine conserved (protein core) and
variable (loop) regions. In each alignment we
identified possible loop positions as positions where
more than 30% of aligned sequences had a gap.
Amino acids at these positions were used to create a
database of short sequences that represent the
possible loops.
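The loop-position rule above (more than 30% of aligned sequences have a gap) can be sketched as follows. The function names, the gap characters `-` and `.`, and the grouping of adjacent loop columns into one fragment are assumptions made for illustration; the text only states the 30% threshold.

```python
def loop_columns(alignment, gap_threshold=0.30, gap_chars="-."):
    """Return indices of columns whose gap fraction exceeds the threshold."""
    n_seqs = len(alignment)
    width = len(alignment[0])
    columns = []
    for col in range(width):
        gaps = sum(1 for row in alignment if row[col] in gap_chars)
        if gaps / n_seqs > gap_threshold:
            columns.append(col)
    return columns


def loop_fragments(alignment, columns, gap_chars="-."):
    """Collect the residues found in each contiguous run of loop columns."""
    # Assumption: adjacent loop columns belong to the same loop.
    runs = []
    for col in columns:
        if runs and col == runs[-1][-1] + 1:
            runs[-1].append(col)
        else:
            runs.append([col])
    fragments = []
    for row in alignment:
        for run in runs:
            fragment = "".join(row[c] for c in run if row[c] not in gap_chars)
            if fragment:  # skip sequences that are all gaps in this loop
                fragments.append(fragment)
    return fragments
```

For a toy alignment such as `["ACD-EF", "AC--EF", "ACDGEF", "AC--EF"]`, columns 2 and 3 exceed the threshold, and the resulting database of short sequences holds the residues that the individual sequences place there.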
Table 2: Logarithmic probabilities for combined HMM,
classic profile HMM and their comparison.

Pfam code   cswHMM   pHMM     difference   %
PF00078     -3,738   -3,670   0,07         1,84
PF00024     -3,856   -3,756   0,10         2,66
PF00117     -3,722   -3,709   0,01         0,34
PF00171     -3,834   -3,798   0,04         0,95
PF00227     -3,699   -3,685   0,01         0,39
PF00246     -3,786   -3,695   0,09         2,46
PF03129     -3,856   -3,752   0,10         2,77
PF01436     -3,864   -3,812   0,05         1,36
Average     -3,794   -3,735   0,060        1,60

We trained a four-state HMM on this database to
create a simple loop model for each family.
Subsequently, we combined the core profile HMM of
each family with the corresponding loop model to
create a simple cswHMM. We computed the logarithmic
probabilities of generating the individual family
sequences, without insertion states, with the new
model and compared them with the logarithmic
probabilities for the classical pHMM. The results
are shown in Table 2 and show a slight improvement
of the generating probability with cswHMM.
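For readers unfamiliar with how a logarithmic probability of generating a sequence is obtained from a small HMM, a minimal sketch of the standard forward algorithm in log space follows. It is not the cswHMM itself and does not model switching between sub-models; the parameter layout (`log_start`, `log_trans`, `log_emit`) and the function names are assumptions chosen for the example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_prob(seq, log_start, log_trans, log_emit):
    """Return log P(seq) under an HMM using the forward algorithm.

    log_start[i]    : log probability of starting in state i
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[i][a]  : log probability that state i emits symbol a
    """
    if not seq:
        return 0.0  # the empty sequence has probability 1
    n = len(log_start)
    alpha = [log_start[i] + log_emit[i][seq[0]] for i in range(n)]
    for symbol in seq[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_emit[j][symbol]
            for j in range(n)
        ]
    return logsumexp(alpha)
```

With four states that all emit the 20 amino acids uniformly, the result reduces to the sequence length times log(1/20), which is a convenient sanity check before plugging in trained parameters.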
The loop modelling experiment has shown that the
cswHMM approach may find use in protein sequence
analysis where blocks of amino acids interfacing with
different environments (protein core, membrane or
cytoplasm) are interspersed with contrasting blocks.
The numerical treatment corresponding to the proposed
cswHMM provided an average 1,6% improvement in
sequence description and was consistently better for
all studied families (Table 2). Models used within
the cswHMM framework could represent different
building blocks, where at least one of the block
types follows a predetermined order of amino acids
and where the mixing of the blocks is variable or
optional.
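The average of 1,6% quoted from Table 2 can be reproduced with a few lines of arithmetic. The text does not state how the percentage column was computed; the sketch below assumes it is the absolute difference of the two logarithmic probabilities divided by the absolute pHMM value, which matches the printed figures to within rounding.

```python
# Logarithmic probabilities from Table 2 as (cswHMM, pHMM) pairs.
TABLE2 = {
    "PF00078": (-3.738, -3.670),
    "PF00024": (-3.856, -3.756),
    "PF00117": (-3.722, -3.709),
    "PF00171": (-3.834, -3.798),
    "PF00227": (-3.699, -3.685),
    "PF00246": (-3.786, -3.695),
    "PF03129": (-3.856, -3.752),
    "PF01436": (-3.864, -3.812),
}


def relative_difference(csw, phmm):
    """Return (absolute difference, difference as % of |pHMM|).

    Assumption: this formula is inferred from the table values, not
    taken from the text.
    """
    diff = abs(csw - phmm)
    return diff, 100.0 * diff / abs(phmm)


rows = {code: relative_difference(*vals) for code, vals in TABLE2.items()}
mean_pct = sum(pct for _, pct in rows.values()) / len(rows)

for code, (diff, pct) in rows.items():
    print(f"{code}  {diff:.2f}  {pct:.2f}")
print(f"Average percentage: {mean_pct:.2f}")
```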
4 CONCLUSIONS AND FUTURE WORK
In this paper we have presented a new model that is
a type of variable-order hidden Markov model with
the ability to analyze mixed contexts in sequences. We
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms