Figure 1: (a) Illustration of structure-based sequence alignment and hidden state paths. In Sequences 1 and 2,
uppercase and lowercase letters represent aligned core blocks and unaligned regions, respectively. Secondary structure (ss)
types (helix, ‘h’; strand, ‘e’; coil, ‘c’) are shown for Sequence 1. The hidden state paths for three models are shown below
the amino acid sequences. (b) Model structure of HMM_1_1_0. (c) Model structure of HMM_1_1_1. (d) Model structure of
HMM_1_3_1. This illustration was taken from Pei and Grishin (2006).
Table 1: Compressed alphabets evaluated in this study. The
first column is the alphabet name. The number at the end of
the name indicates the number of classes for the alphabet.
The second column lists the classes, that is, how the
amino acids are grouped in the alphabet.
Alphabet Classes
Dayhoff(6) AGPST,C,DENQ,FWY,HKR,ILMV
SE-B(6) AST,CP,DEHKNQR,FWY,G,ILMV
SE-B(8) AST,C,DHN,EKQR,FWY,G,ILMV,P
Li-A(10) AC,DE,FWY,G,HN,IV,KQR,LM,P,ST
Li-B(10) AST,C,DEQ,FWY,G,HN,IV,KR,LM,P
Murphy(10) A,C,DENQ,FWY,G,H,ILMV,KR,P,ST
SE-B(10) AST,C,DN,EQ,FY,G,HW,ILMV,KR,P
SE-V(10) AST,C,DEN,FY,G,H,ILMV,KQR,P,W
Solis-D(10) AM,C,DNS,EKQR,F,GP,HT,IV,LY,W
Solis-G(10) AEFIKLMQRVW,C,D,G,H,N,P,S,T,Y
SE-B(14) A,C,D,EQ,FY,G,H,IV,KR,LM,N,P,ST,W
hoff(6). Note that the classes are named A to F in
the order presented in Table 1. For example, the
amino acid M was converted to class F, D to class C,
and P to class A.
Original: MDPFLVLLHSVSSSLSSSELTELKYLCL
Converted: FCADFFFFEAFAAAFAAACFACFEDFBF
In this example, the first substring (with k = 6) is
FCADFF and the second is CADFFF.
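The conversion and substring extraction above can be reproduced with a short sketch. It assumes only the Dayhoff(6) classes from Table 1 and the A to F labelling described in the text; the function names (build_mapping, compress, kmers) are hypothetical and are not taken from the MUMMALS source code.

```python
# Dayhoff(6) classes from Table 1, labelled A to F in the order listed.
DAYHOFF6 = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]


def build_mapping(classes):
    """Map each amino acid to its class letter (A, B, C, ...)."""
    mapping = {}
    for index, members in enumerate(classes):
        label = chr(ord("A") + index)
        for amino_acid in members:
            mapping[amino_acid] = label
    return mapping


def compress(sequence, mapping):
    """Rewrite a protein sequence in the compressed alphabet.

    Residues outside the 20 standard amino acids raise KeyError;
    how such residues should be handled is left open here.
    """
    return "".join(mapping[residue] for residue in sequence)


def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


mapping = build_mapping(DAYHOFF6)
converted = compress("MDPFLVLLHSVSSSLSSSELTELKYLCL", mapping)
print(converted)               # FCADFFFFEAFAAAFAAACFACFEDFBF
print(kmers(converted, 6)[:2])  # ['FCADFF', 'CADFFF']
```

Any other alphabet in Table 1 can be tried by passing its class list to build_mapping in place of DAYHOFF6.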
3 IMPLEMENTED CHANGES
The core of the MUMMALS algorithm is based on
ProbCons (Do et al., 2005) and its probabilistic
consistency measure. MUMMALS defines more complex
and sophisticated hidden Markov models and employs
a k-mer count method similar to those of MUSCLE
(Edgar, 2004b) and MAFFT (Katoh et al., 2005).
In these works, it is unclear how the parameters were
chosen. Therefore, in our work, we systematically
evaluated the parameters of the k-mer count method.
We performed three distinct evaluations, each
changing some aspect of the original MUMMALS
algorithm. In two of them, we evaluated different
options for the k-mer count method. In the third
experiment, we replaced the k-mer count method with
a standard distance matrix computation in order to
compare the two approaches.
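As an illustration of what a k-mer count method computes, the sketch below scores two sequences by the fraction of k-mers they share, following the definition published for MUSCLE. Whether MUMMALS uses exactly this formula is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a (compressed) sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_similarity(seq_x, seq_y, k):
    """Fraction of k-mers shared by two sequences, in [0, 1].

    Shared k-mers are counted with multiplicity (minimum of the two
    counts) and normalised by the number of k-mers in the shorter
    sequence.
    """
    shorter = min(len(seq_x), len(seq_y))
    if shorter < k:
        return 0.0  # no k-mer fits in the shorter sequence
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    # Counter intersection keeps the minimum count of each k-mer.
    shared = sum((counts_x & counts_y).values())
    return shared / (shorter - k + 1)


def kmer_distance(seq_x, seq_y, k):
    """Turn the similarity into a distance for a distance matrix."""
    return 1.0 - kmer_similarity(seq_x, seq_y, k)
```

Because the score needs no alignment, it is much cheaper to compute than a distance derived from pairwise alignments, which is the trade-off the third experiment examines.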
During the planning phase of the tests, some
questions arose, such as: “Would a change in k lead to
considerable variation in the MSA score?” or “Would a
change in k affect the runtime of the algorithm?”.
Initially, we evaluated a version with the k value
ranging from 3 to 14. The lower limit of 3 was chosen
because substrings shorter than this are of no
significance, and the upper limit of 14 was chosen
because of the time/result ratio. The objective was to
observe the effect of altering the length of the
substrings. As we will see in Section 4, the answer to
both initial questions was positive.
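One way to build intuition for the range of k is to compare the number of distinct k-mers an alphabet allows with the number of k-mers a sequence actually contains. The sketch below does this for a six-class alphabet such as Dayhoff(6). The sequence length of 300 is a hypothetical value chosen for illustration, and this arithmetic is offered as a possible reading of the limits, not as the rationale used in this study.

```python
ALPHABET_SIZE = 6      # number of classes, e.g. Dayhoff(6)
SEQUENCE_LENGTH = 300  # hypothetical protein length, for illustration only


def possible_kmers(alphabet_size, k):
    """Number of distinct k-mers that the alphabet can form."""
    return alphabet_size ** k


def kmers_in_sequence(length, k):
    """Number of overlapping k-mers in a sequence of the given length."""
    return max(length - k + 1, 0)


# Range of k used in the first experiment: 3 to 14 inclusive.
for k in range(3, 15):
    print(k,
          possible_kmers(ALPHABET_SIZE, k),
          kmers_in_sequence(SEQUENCE_LENGTH, k))
```

For very small k, the sequence holds more k-mers than the alphabet can distinguish, so unrelated sequences share many of them by chance; as k grows, the space of possible k-mers far exceeds the sequence length and shared k-mers become rare.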
In the second experiment, we evaluated alternative
compressed alphabets: SE-B(6), SE-B(8), Li-A(10),
Li-B(10), Murphy(10), SE-B(10), SE-V(10),
Solis-D(10), Solis-G(10), and SE-B(14), whose
classes are shown in Table 1. In this experiment, we
chose to range k from 6 to 10 because these were the
values with the best time/result ratio in our previous
experiment, which varied k from 3 to 14. For more
information about the alphabets, consult the study
by Robert C. Edgar (Edgar,
BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms