Search for Latent Periodicity in Amino Acid Sequences with
Insertions and Deletions
Valentina Pugacheva¹, Alexander Korotkov² and Eugene Korotkov¹˒²
¹ Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Leninsky Ave. 33 bld. 2, 119071, Moscow, Russia
² National Research Nuclear University “MEPhI”, Kashurskoe shosse, 31. Moscow 115409, Russia
Keywords: Genetic Algorithm, Latent Periodicity, Dynamic Programming, Amino Acid Sequences.
Abstract: The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids in unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming, and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. The method directly optimizes the position-weight matrix for multiple sequence alignment without using pairwise alignments. The developed algorithm was applied to the amino acid sequences of a small number of proteins. This study showed the presence of latent periodicity with insertions and deletions in the amino acid sequences of proteins for which latent periodicity was not previously known. The origin of latent periodicity with insertions and deletions is discussed.
1 INTRODUCTION
The development and application of mathematical methods for the study of symbolic sequences has become particularly important given the great success achieved in sequencing various genomes and the resulting accumulation of information about the complete genomes of many species (Ekblom and Wolf, 2014). Without mathematical methods, a large part of the known nucleic acid and amino acid sequences will remain stored in computer data banks without significant use. This is especially true for eukaryotic genomes. The task of developing new mathematical methods entails finding new mathematical laws that explain sequence organization and relating these laws to the biological functions of various parts of the genome (Almirantis et al., 2014). Such studies reveal the relationship between mathematical regularities observed in sequences and their biological properties.
Latent periodicity is one of the structural regularities of sequences and is widely represented in amino acid and DNA sequences (Korotkov et al., 2003a, 2003b). A periodicity is considered latent if the similarity between any two periods is not statistically significant or if it belongs to the twilight zone (Durbin et al., 1998). Perfect periodicity can become latent if the studied sequence accumulates more than 1.0 mutations per amino acid (Suvorova et al., 2014). The distinctive property of latent periodicity is that it cannot be detected by pairwise comparisons of amino acid sequences (Turutina et al., 2006). However, latent periodicity can be found by a mathematical method that constructs the multiple alignment of amino acid sequences directly, without building pairwise alignments. The periods of a sequence with latent periodicity are the sequences to be multiply aligned, and the multiple alignment can be statistically significant. The goal of this study was to find multiple alignments of amino acid sequences (periods) in the absence of statistically significant pairwise alignments.
There is a significant gap in the mathematical approaches presently used to search for latent periodicities in symbolic and numeric sequences. Spectral approaches enable the discovery of fairly "fuzzy" periodicity in protein sequences without insertions or deletions of amino acids. The Fourier transform, the wavelet transform, information decomposition and some other methods belong to the spectral methods (Tiwari et al., 1997; Lobzin and Chechetkin, 2000; Kravatskaya et al., 2011; Korotkov et al., 2003a; de Sousa Vieira, 1999; Meng et al., 2013; Suvorova et al., 2014; Sosa et al., 2013; Kumar et al., 2006). However, these approaches have a significant limitation: they do not allow the detection of periodicity with insertions and deletions.
Pugacheva, V., Korotkov, A. and Korotkov, E.
Search for Latent Periodicity in Amino Acid Sequences with Insertions and Deletions.
DOI: 10.5220/0005630401170127
In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 117-127
ISBN: 978-989-758-170-0
Copyright © 2016 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
On the other hand, methods based on dynamic programming can accurately find insertions and deletions (Pellegrini, 2015). However, such methods cannot detect latent periodicity when the statistical significance of the similarity between any two periods is small (Korotkov et al., 2003; Turutina et al., 2006). This is because these methods detect the periodicity of amino acid sequences (with the number of periods greater than or equal to 4) by pairwise alignment between periods. In the absence of statistically significant pairwise alignments, these approaches are incapable of finding latent periodicity. This concerns, first of all, algorithms and programs such as REP (Andrade et al., 2000), Internal Repeat Finder (Marcotte et al., 1999), Prospero (Mott, 1999), RADAR (Heger & Holm, 2000), REPRO (Heringa & Argos, 1993), TRUST (Szklarczyk & Heringa, 2004) and PTRStalker (Pellegrini et al., 2012). It is also difficult to detect latent periodicity with the programs XSTREAM (Newman & Cooper, 2007) and T-REKS (Jorda & Kajava, 2009), because the similarity between different periods is very low in the case of latent periodicity, which leads to a lack of seeds and identical short strings. Markov models and neural networks are inefficient for finding latent periodicity, since there are no training samples. Such approaches were used in previous studies in the programs HHrep (Söding et al., 2006) and HHRepID (Biegert & Söding, 2008) and in the works of Palidwor et al. (2009) and Rubinson & Eichman (2012).
Therefore, this study proposes a mathematical method that addresses this gap and finds the latent periodicity of any symbolic sequence in the presence of insertions and deletions (in unknown positions of the analyzed sequence) and in the absence of a known position-weight matrix.
Any periodicity of a sequence $S$ of length $N$ can be characterized either by a frequency matrix (Korotkov et al., 2003b) or by a position-weight matrix $M$ (Shelenkov et al., 2006) calculated from the frequency matrix. The rows of this matrix correspond to amino acids and the columns to positions of the period. The element $m(i,j)$ is the weight of amino acid $i$ at position $j$ of the period, where the positions run from 1 to $n$. A sequence $S_1$ of length $N$ was introduced, which is the artificial periodic sequence 1,2,...,n repeated up to length $N$. Here, the numbers are treated as symbols, and the columns of the matrix $M$ correspond to them. For a period equal to $n$, the sequence $S$ corresponds to a certain frequency matrix and weight matrix $M(20,n)$. The problem was formulated as follows. Given a sequence $S$ of length $N$, it is necessary to find the optimal weight matrix $M_0$ for which the local alignment of the sequences $S_1$ and $S$ has the greatest statistical significance. Statistical significance is measured by the probability $P$ that $F_r > F_{max}$, where $F_{max}$ is the maximum weight of a local alignment of the sequences $S$ and $S_1$ obtained using the optimal matrix $M_0$. Here, $F_r$ is the maximum weight of a local alignment of a randomly shuffled sequence $S_r$ and $S_1$ obtained using its own optimal matrix $M_r$. We search for the matrix $M_0$ that gives the lowest probability $P$. A threshold level $P_0$ can always be set: if the probability $P(F_r > F_{max})$ is less than $P_0$, then the local alignment of the sequences $S$ and $S_1$ found using the optimal matrix $M_0$ can be considered statistically significant.
When the weight matrix is known, the local alignment algorithm (Smith and Waterman, 1981) can be used to align the amino acid sequence $S$ with the artificial periodic sequence $S_1$. The remaining problem is to find the optimal weight matrix $M_0$. The objective of this study was to develop a mathematical approach for finding the matrix $M_0$, as well as a method for assessing the probability $P$. A genetic algorithm combined with a local alignment algorithm was used to find the optimal weight matrix. The Monte Carlo method was used to estimate the probability $P$.
A mathematical method was developed in this paper to find regions with more than 3 tandem repeats in amino acid sequences. The method directly optimizes the position-weight matrix for multiple sequence alignment without using pairwise alignments. This means that for each $n$, a matrix $M_0$ is found, the probability $P$ is estimated, and the alignment of the sequences $S$ and $S_1$ is built using the matrix $M_0$. It is not the goal of this study to analyze all the known amino acid sequences, since the developed method requires very large computer resources. The developed algorithm was applied to search for latent periodicity with insertions and deletions in the amino acid sequences of a small number of proteins. This study showed the presence of latent periodicity with insertions and deletions in the amino acid sequences of proteins for which the presence of latent periodicity was not previously known.
BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms
2 MATHEMATICAL METHODS
AND ALGORITHMS
A genetic algorithm was used to search for the optimal weight matrix $M_0$ for period $n$. A genetic algorithm is a heuristic search algorithm for solving optimization problems and is a form of direct random search (Mitchell, 1998). It is often used to optimize functions of several variables. The general scheme of the algorithm is shown in Figure 1. Usually, the problem is formalized so that a solution can be represented as a vector, where each element can be a bit, a number, or some other object. This vector is regarded as an "organism." Usually, a set of initial organisms is created randomly (Gondro and Kinghorn, 2007). Each organism is evaluated using an objective function, referred to as the "fitness function." As a result, every organism is assigned a fitness value, which determines how well the organism solves the problem. Organisms are selected from this set (called a "generation") according to their fitness, and the "genetic operators" ("crossing" and "mutation") are applied to them to produce new organisms. Fitness is then calculated for the new organisms, and the best organisms are selected for the next generation. This set of actions is repeated iteratively, thereby simulating an "evolutionary process." The process continues for several life cycles (generations) until the stopping criterion of the algorithm is met. Such a criterion can be either finding a global or suboptimal solution or exhausting the number of generations allotted for evolution. In this study, the organisms are weight matrices of the periodicity. The set of these matrices is called the population $Q_n$. Each matrix has 20 rows and $n$ columns. The matrix elements $m(i,j)$ are numbers that give the weight of amino acid $i$ in column $j$. A larger weight $m(i,j)$ corresponds to a higher probability of finding amino acid $i$ at position $j$ of the period. The fitness (objective function) of an organism (weight matrix $M$) was taken to be the maximum value $F_{max}$ of the similarity function of the local alignment (Altschul et al., 1990). To calculate the objective function, a local alignment was built between the sequences $S_1$ and $S$ using the weight matrix $M$. $F_{max}$ was calculated for each organism (weight matrix $M$). The process was repeated after applying the genetic operators to the organisms and was stopped once a stable population was achieved, that is, once $F_{max}$ stopped increasing. As a result, the matrix $M_0$ with the greatest $F_{max}$ was obtained for period length $n$. The alignment of the sequences $S_1$ and $S$ was then built using the matrix $M_0$. The algorithm is shown in Figure 1. The algorithm was repeated for $n$ from 2 to 100.
Figure 1: The main stages of the genetic algorithm used in
the study.
2.1 Initialization
The first step of the algorithm is to create a zero generation of organisms (weight matrices in our case) for the local alignment. A random population of organisms was used as the zero generation. The zero generation of organisms must be maximally diverse
in order to reach a stable population more quickly and to find the matrix $M_0$ with the maximum $F_{max}$. Organisms (matrices of size 20×n) can be viewed as points in a space of dimension 20×n. Maximum diversity of organisms can be achieved if the points in this space are chosen so that they are separated by a distance $D > D_0$. The coordinates of these points are the initial matrices (organisms). The distance between two matrices (organisms) was taken to be the Euclidean distance between the two corresponding points in the 20×n space:
$$D = \left( \sum_{i=1}^{20} \sum_{j=1}^{n} \left( m_1(i,j) - m_2(i,j) \right)^2 \right)^{0.5} \quad (1)$$
where $m_1(i,j)$ and $m_2(i,j)$ are elements of the two matrices ($M_1$ and $M_2$) being compared. The population size should be large enough. In small populations, organisms with high fitness spread too quickly. The population becomes homogeneous and the probability that evolution continues becomes very small. This means that with a small population the algorithm may find a local rather than the global maximum of $F_{max}$. At the same time, descendants produced in large populations are likely to be more varied, although $F_{max}$ increases much more slowly. A population size of $10^4$ was used for all the results presented here. These $10^4$ weight matrices were chosen so as to cover the 20×n space as fully as possible. Each matrix $M$ (organism) was created by comparing the sequence $S_1$ with a random sequence of length $N$. The random sequence $Sr_i$ was obtained by shuffling the original sequence $S$, with $i$ running from 1 to $10^4$. The frequency matrix $V(20,n)$ was filled as follows. A value of 1 was added to the element $v(sr(k), s_1(k))$ for every $k$ from 1 to $N$, where $sr(k)$ is an element of the sequence $Sr_i$ and $s_1(k)$ is an element of the sequence $S_1$. Then, based on the matrix $V$, the weight matrix $M(20,n)$ was calculated as:
$$m(i,j) = \frac{v(i,j) - N p(i,j)}{\sqrt{N p(i,j) \left( 1 - p(i,j) \right)}} \quad (2)$$
where the partial sums over rows are $x(i) = \sum_{j} v(i,j)$ and over columns are $y(j) = \sum_{i} v(i,j)$, $N = \sum_{i,j} v(i,j)$, and the probabilities are $p(i,j) = x(i) y(j) / N^2$. If this
matrix is the first to be created, it is automatically included in the initial population. Otherwise, it is compared with all the matrices (organisms) already included in the population, and its distance from each of them is calculated using Formula 1. If all the distances are greater than $D_0$, the matrix is included in the initial population. Otherwise, the matrix is rejected and a new matrix is created. The level $D_0$ was chosen so that between $10^4$ and $1.05×10^4$ of $5×10^5$ random matrices are accepted into the initial population. We call this population of organisms (matrices) $Q_n$.
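The initialization step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the 20-letter alphabet string, the zero weight assigned when $p(i,j)$ is 0 or 1, and the `max_tries` cap are not from the paper, and the tiny population in the usage note replaces the $10^4$ matrices used in the study.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def weight_matrix(seq, n, rng):
    """Build one random 20 x n weight matrix from a shuffled copy of seq (Formula 2)."""
    shuffled = list(seq)
    rng.shuffle(shuffled)
    total = len(shuffled)
    # Frequency matrix V: amino acid at position k meets template symbol k mod n.
    v = {a: [0] * n for a in AMINO_ACIDS}
    for k, a in enumerate(shuffled):
        v[a][k % n] += 1
    x = {a: sum(v[a]) for a in AMINO_ACIDS}                    # row sums x(i)
    y = [sum(v[a][j] for a in AMINO_ACIDS) for j in range(n)]  # column sums y(j)
    matrix = []
    for a in AMINO_ACIDS:
        row = []
        for j in range(n):
            p = x[a] * y[j] / total ** 2
            if 0.0 < p < 1.0:
                row.append((v[a][j] - total * p) / math.sqrt(total * p * (1.0 - p)))
            else:
                row.append(0.0)  # assumption: undefined weights are set to zero
        matrix.append(row)
    return matrix


def distance(m1, m2):
    """Euclidean distance between two matrices (Formula 1)."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(m1, m2)
                         for a, b in zip(r1, r2)))


def initial_population(seq, n, size, d0, max_tries, seed=0):
    """Accept random matrices only if they lie farther than d0 from all accepted ones."""
    rng = random.Random(seed)
    population = []
    for _ in range(max_tries):
        if len(population) >= size:
            break
        candidate = weight_matrix(seq, n, rng)
        if all(distance(candidate, m) > d0 for m in population):
            population.append(candidate)
    return population
```

For example, `initial_population("ACDEFGHIKLMNPQRSTVWY" * 5, 4, 5, 1.0, 200)` builds five mutually distant 20×4 matrices. The pairwise check is quadratic in the population size, so a production version would need a faster neighbor search.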
2.2 Calculation of Fitness and
Statistical Significance of the
Organism
After the birth of a new organism (creation of a new matrix $M$), the first step is to assess its fitness, that is, to determine $F_{max}$ of the local alignment (Smith and Waterman, 1981) of the sequences $S_1$ and $S$ using the weight matrix $M$. A higher value of $F_{max}$ corresponds to a better alignment and to a lower probability $P(F_r > F_{max})$. The fitness of the organism (Section 2) is higher for larger values of $F_{max}$. The construction of the local alignment is discussed in more detail in Section 2.6. After completion of the genetic algorithm, the argument of the normal distribution for the organism $M_{max}$ (which has the highest $F_{max}$) was calculated using the formula:
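$$Z_{mk} = \frac{F_{max} - \overline{F}_{max}}{\sqrt{D(\vec{F}_{max})}} \quad (3)$$
where $\vec{F}_{max} = \{F^1_{max}, F^2_{max}, \ldots, F^{N_R}_{max}\}$ are the maximum weights of the local alignments between the random sequences $Sr_i$ and the sequence $S_1$, determined using the best weight matrix $M_{max}$, and $\overline{F}_{max}$ and $D(\vec{F}_{max})$ are the mean and variance of these values.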
The vector $\vec{F}_{max}$ was calculated using random sequences $Sr_i$ derived from the amino acid sequence $S$ by random shuffling. In total, 200 random sequences were created ($N_R = 200$). $Z_{mk}$ is used instead of $F_{max}$ at the end of the calculation because direct calculation of the probability $P(F_r > F_{max})$ requires a very large amount of computation. Furthermore, as the probability $P(F_r > F_{max})$ decreases, the amount of computation grows very quickly, and for a well-pronounced periodicity in the sequence $S$ the calculations cannot be completed within a reasonable time. Therefore, it is convenient to use $Z_{mk}$ as a measure of the statistical significance of $F_{max}$ for the matrix $M_{max}$. A similar calculation was performed for every investigated period length $n$, and the dependence $Z_{mk}(n)$ was obtained.
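The Monte Carlo estimate of $Z_{mk}$ can be sketched as below. The names `z_score` and `z_mk` and the `score` callable (which stands in for the local alignment score obtained with the fixed matrix $M_{max}$) are hypothetical, and the use of the population variance rather than the sample variance is an assumption.

```python
import random
import statistics


def z_score(f_max, random_scores):
    """Formula 3: distance of f_max from the random mean in standard deviations."""
    mean = statistics.fmean(random_scores)
    sd = statistics.pstdev(random_scores)  # assumption: population variance
    return (f_max - mean) / sd


def z_mk(f_max, score, seq, n_random=200, seed=0):
    """Score n_random shuffled copies of seq and return the Z value of f_max.

    score is a hypothetical callable mapping a sequence to its best
    local alignment weight for a fixed weight matrix.
    """
    rng = random.Random(seed)
    random_scores = []
    for _ in range(n_random):
        letters = list(seq)
        rng.shuffle(letters)
        random_scores.append(score("".join(letters)))
    return z_score(f_max, random_scores)
```

The split into two functions keeps the statistic itself separate from the shuffling, so the same `z_score` can be reused when the random scores are computed elsewhere, for example on a cluster.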
2.3 Completion of the Genetic
Algorithm
There is currently no proof that a genetic algorithm necessarily reaches the global optimum, even for an infinite number of iterations (Mitchell, 1998). The decision to stop the algorithm therefore has to rely on heuristic criteria. We used a combination of the two most common stopping criteria for genetic algorithms (Banzhaf et al., 1998). The evolutionary process continued until the best organism (the matrix with the highest $F_{max}$) remained unchanged for several generations or until the limit on the number of iterations ($10^4$) was reached. In this paper, the resulting solution is treated as the global optimum. Figure 2 shows an example of the growth of $F_{max}$ for the best organism in the population.
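A minimal sketch of the combined stopping rule is given below. The function name and the `patience` parameter are hypothetical; the paper says only that the best organism must repeat "for several generations" and gives the iteration limit as $10^4$.

```python
def should_stop(best_history, patience, max_iterations=10_000):
    """Return True when evolution should stop.

    best_history[t] is the best F_max in the population after iteration t.
    Stops when the iteration limit is reached or when the best value has
    not improved during the last `patience` iterations.
    """
    if len(best_history) >= max_iterations:
        return True
    if len(best_history) > patience:
        recent = best_history[-(patience + 1):]
        return max(recent) <= recent[0]
    return False
```

With `patience=3`, the history `[1, 2, 3, 3, 3, 3]` triggers a stop, while `[1, 2, 3, 3, 3, 4]` does not.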
Figure 2: Growth of the fitness $F_{max}$ of the best individual in the population during evolution.
2.4 Choice of Parents in the Genetic
Algorithm
The choice of parents was made using a combination of two approaches: elite selection and fitness proportionate selection, also known as roulette-wheel selection (Bäck, 1996). First, all organisms were sorted in order of increasing $F_{max}$, and the 20% of organisms with the highest $F_{max}$ were selected. Two parents were then selected among them with a probability that depends on $F_{max}$. If $F^i_{max}$ is the fitness of organism $i$ in the population, then the probability of selecting this organism is
$$P_i = F^i_{max} \Big/ \sum_{k=1}^{K} F^k_{max}$$
where $K$ is the size of the population. With this approach, better adapted organisms are more likely to be selected as parents. However, less fit individuals still have a chance of being selected for reproduction and of surviving during evolution. This is an advantage over a purely elite strategy: even a poorly adapted organism (weight matrix) can contain successful portions (successful matrix elements). These properties of the organism can then be taken up by evolution and contribute to reaching the global maximum.
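The two-stage selection can be sketched as follows. The function name is hypothetical, and two choices are assumptions rather than statements from the paper: the two parents are forced to be distinct, and all fitness values are taken to be positive so that they can serve as roulette-wheel weights.

```python
import random


def select_parents(fitness, rng, elite_fraction=0.2):
    """Return the indices of two distinct parents.

    Stage 1 (elite): keep the top elite_fraction of organisms by F_max.
    Stage 2 (roulette wheel): draw parents with probability proportional
    to F_max among the elite. Assumes all fitness values are positive.
    """
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    elite = order[:max(2, int(len(order) * elite_fraction))]
    first = rng.choices(elite, weights=[fitness[i] for i in elite], k=1)[0]
    rest = [i for i in elite if i != first]
    second = rng.choices(rest, weights=[fitness[i] for i in rest], k=1)[0]
    return first, second
```

For ten organisms with distinct fitness values, the elite consists of the two fittest, so both are always returned; with a population of $10^4$ the roulette wheel operates on 2000 candidates.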
2.5 Reproduction of Organisms
The recombination operator was used immediately
after the selection of parents for the creation of
descendants. The essence of recombination is that
created descendants should inherit genetic
information from both parents. Then, the mutation
operator was applied for each descendant.
2.5.1 Recombination of Organisms and the
Creation of Descendants
A combination of two-point crossover and differential crossing was used to create descendants. For this purpose, each organism (matrix) was treated as a linear vector, with the matrix rows placed one after another in a line. Each vector was then closed into a ring by joining its ends. Two points on the ring were selected at random, and the segment between them in one ring was replaced with the corresponding segment of the other ring (Fogel, 1998; Fogel, 2010). Two-point crossover performs better than single-point crossover. Adding further crossover points impairs the genetic algorithm, because organisms are disrupted more often and the evolutionary process slows down (Spears and De Jong, 1991; Sywerda, 1989).
Afterwards, intermediate recombination was used. In intermediate recombination, the "genes" of the descendant (weight matrix elements) take values other than those of the parental "genes". This leads to the emergence of new organisms whose fitness can be better than that of the parents. This recombination operator is sometimes called differential crossing in the literature. If $x$ and $y$ are two organisms in a population (two weight matrices with elements $x(i,j)$ and $y(i,j)$), then the descendant is calculated by the formula (Radcliffe, 1991): $z_{ij} = x_{ij} + \alpha (x_{ij} - y_{ij})$, where i=1,2,...,20, j=1,2,...,n and $\alpha \in [0,1]$ are random values with a uniform distribution. Here, the matrix of weights (organism) was considered as a vector. To create
descendants after the two-point crossover, two parents were involved. Then, two descendants ($w$ and $v$) were formed using Formulae 4 and 5:
$$w(i,j) = \alpha\, x(i,j) + (1 - \alpha)\, y(i,j) \quad (4)$$
$$v(i,j) = (1 - \alpha)\, x(i,j) + \alpha\, y(i,j) \quad (5)$$
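The two recombination steps can be sketched on flattened matrices as follows. The function names are hypothetical, and drawing a separate random coefficient for every element (shared by both descendants) is an assumption; the paper does not say whether the coefficient is per element or per pair of parents.

```python
import random


def two_point_ring_crossover(x, y, rng):
    """Swap one ring segment between two flattened matrices of equal length."""
    size = len(x)
    start, stop = rng.randrange(size), rng.randrange(size)
    child_x, child_y = list(x), list(y)
    k = start
    while k != stop:                      # walk around the ring from start to stop
        child_x[k], child_y[k] = y[k], x[k]
        k = (k + 1) % size
    return child_x, child_y


def intermediate_recombination(x, y, rng):
    """Formulae 4 and 5 with one uniform coefficient per element (assumption)."""
    w, v = [], []
    for xi, yi in zip(x, y):
        alpha = rng.random()
        w.append(alpha * xi + (1.0 - alpha) * yi)
        v.append((1.0 - alpha) * xi + alpha * yi)
    return w, v
```

Because the ring has no distinguished end, the swapped segment may wrap from the last matrix row back to the first, which a linear two-point crossover cannot do.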
2.5.2 Creation of Mutations
Mutations were introduced into the descendants $W$ and $V$ by one of two methods, chosen at random with probability 0.5 each. The first method replaced a randomly selected element of the weight matrix with a random number uniformly distributed in the range from -1 to 1. The probability of replacement $p_1$ is equal to 0.01. Every element of every descendant was subject to this random replacement. In the second method, the whole matrix (all of its values) was changed by a small amount. The intensity of this whole-matrix mutation was determined by the value $p_2$, which was randomly selected from the range 0.001 to 0.03. Each element $w(i,j)$ of the descendant matrix $W$ was replaced with a new element calculated according to the formula $v(i,j) = w(i,j) + p_2\, w(i,j)$, where i=1,2,...,20 and j=1,2,...,n. After the mutations were made, the fitness of the descendants ($W$ and $V$) was evaluated, that is, $F_{max}$ was calculated for them. The descendant with the larger $F_{max}$ was added to the population $Q_n$. At the same time, the organism with the smallest $F_{max}$ was removed from the population $Q_n$. This method of replacing organisms keeps the population size constant.
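The mutation and replacement steps can be sketched as below. The function names are hypothetical, and the positive sign of the whole-matrix change in the second method is an assumption.

```python
import random


def mutate(matrix, rng, p1=0.01):
    """Apply one of the two mutation methods, each chosen with probability 0.5."""
    if rng.random() < 0.5:
        # Method 1: each element is replaced with probability p1
        # by a number drawn uniformly from [-1, 1].
        return [[rng.uniform(-1.0, 1.0) if rng.random() < p1 else w for w in row]
                for row in matrix]
    # Method 2: every element is shifted by a small fraction p2 of itself.
    p2 = rng.uniform(0.001, 0.03)
    return [[w + p2 * w for w in row] for row in matrix]


def replace_worst(population, fitness, child, child_fitness):
    """Put the child in place of the least fit organism; the size stays constant."""
    worst = min(range(len(fitness)), key=fitness.__getitem__)
    population[worst] = child
    fitness[worst] = child_fitness
```

Replacing a single organism per iteration makes this a steady-state scheme: the population changes gradually instead of being rebuilt generation by generation.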
2.6 Construction of the Alignment and
Choice of Weight for Deletion
2.6.1 Alignment of Amino Acid Sequence
using the Random Matrices
A local alignment of the sequences $S_1$ and $S$ was built using the weight matrices (organisms) and an affine penalty function for insertions and deletions in order to find $F_{max}$ and the matrix $M_0$ (Durbin et al., 1998). To construct the alignment, the matrices of the similarity functions $F$, $F_1$ and $F_2$ were filled for each matrix $M$ from the population (set $Q_n$). The matrix $M$ was first transformed into a matrix $M'$, as described later in this section.
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + m'(s_1(i), s(j)) \\ F_1(i-1,j-1) - d \\ F_2(i-1,j-1) - d \end{cases}$$
$$F_1(i,j) = \max \begin{cases} F(i-1,j) - d \\ F_1(i-1,j) - e \end{cases}$$
$$F_2(i,j) = \max \begin{cases} F(i,j-1) - d \\ F_2(i,j-1) - e \end{cases} \quad (6)$$
where $s_1(i)$ and $s(j)$ are letters of the sequences $S_1$ and $S$, $d$ is the penalty for opening an insertion or deletion in the sequences $S_1$ and $S$, and $e$ is the penalty for extending an insertion or deletion. Here, $i$ and $j$ run from 1 to $N$. The matrices $F$, $F_1$ and $F_2$ have dimension N×N, where $N$ is the length of the sequences $S_1$ and $S$. $F_{max}$ was taken as the maximum element of the matrix $F$. The coordinates of this element are $i_m$ and $j_m$.
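The calculation of $F_{max}$ can be illustrated with the following sketch. It uses the standard three-state affine-gap recurrence described by Durbin et al. (1998), which may differ in detail from Formula 6, and it returns only the score without the traceback. The function name and the dictionary layout of `weight` (amino acid letter mapped to a list of $n$ position weights) are assumptions.

```python
def local_affine_score(seq, n, weight, d, e):
    """Best local alignment weight of seq against the periodic template 1..n.

    weight[a][p] is the weight of amino acid a at period position p (0-based).
    d is the gap opening penalty and e the gap extension penalty.
    """
    size = len(seq)
    neg = float("-inf")
    prev_f = [0.0] * (size + 1)   # alignments ending in a match
    prev_x = [neg] * (size + 1)   # ending with a template position skipped
    prev_y = [neg] * (size + 1)   # ending with a sequence letter skipped
    best = 0.0
    for i in range(1, size + 1):              # i runs over the template S_1
        pos = (i - 1) % n
        cur_f = [0.0] * (size + 1)
        cur_x = [neg] * (size + 1)
        cur_y = [neg] * (size + 1)
        for j in range(1, size + 1):          # j runs over the sequence S
            s = weight[seq[j - 1]][pos]
            diag = max(prev_f[j - 1], prev_x[j - 1], prev_y[j - 1])
            cur_f[j] = max(0.0, diag + s)
            cur_x[j] = max(prev_f[j] - d, prev_x[j] - e)
            cur_y[j] = max(cur_f[j - 1] - d, cur_y[j - 1] - e)
            best = max(best, cur_f[j])
        prev_f, prev_x, prev_y = cur_f, cur_x, cur_y
    return best
```

For a toy two-letter alphabet with `weight = {"A": [1.0, -1.0], "B": [-1.0, 1.0]}`, the call `local_affine_score("ABABAB", 2, weight, 3.0, 1.0)` gives 6.0, since every letter matches its period position. Only two rows are kept in memory, so the sketch needs O(N) space but still O(N²) time per matrix, which is consistent with the heavy computational cost reported in the paper.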
Simultaneously with the calculation of the matrices $F$, $F_1$ and $F_2$, the inverse transition matrix $F'$ (of the same dimensions as the matrix $F$) was filled. Each element $F'(i,j)$ contains the number of the matrix (1 for $F$, 2 for $F_1$ and 3 for $F_2$) and the index of the element of $F$, $F_1$ or $F_2$ that gives the maximum value in Formula 6. Using the inverse transition matrix $F'$, the alignment of the sequences $S_1$ and $S$ was built. The path in the matrix $F'$ from the point $(i_m, j_m)$ to the point $(i_0, j_0)$ corresponds to the alignment. The point $(i_0, j_0)$ is the first point on this path at which $F$ is equal to zero, and it serves as the beginning of the alignment. The matrices $M$ (organisms) from the set $Q_n$ (population) were used to create the alignment of the sequences $S_1$ and $S$. For every matrix $M$ from the set $Q_n$, the values $R$ and $K_d$ were calculated before carrying out the alignment as:
$$R^2 = \sum_{i=1}^{20} \sum_{j=1}^{n} m(i,j)^2 \quad (7)$$
$$K_d = \sum_{i=1}^{20} \sum_{j=1}^{n} m(i,j)\, f(i)\, t(j) \quad (8)$$
where $f(i) = b(i)/N$, $b(i)$ is the number of amino acids of type $i$ in the sequence $S$, $t(j) = 1/n$, and $N$ is the total number of amino acids in the sequence $S$. To be used in the alignment, the transformed matrix $M'$ has to satisfy two conditions. The first condition is that $R$ is identical for all matrices $M'$ with the same period length $n$ and is equal to $5(20n)^{1/2}$. The dependence $R \sim n^{1/2}$ allows a similar distribution of $F_{max}$ to be obtained when studying the different
random sequences $Sr_i$. These random sequences were obtained by shuffling the original sequence $S$. The second condition is that the distribution functions of $F_{max}$ for the matrices from the set $Q_n$ should be close to each other. Such a distribution function can be determined for each matrix from the set $Q_n$ by using this matrix to calculate the alignments of the sequence $S_1$ with each random sequence from the set $Sr_i$. $K_d$ was selected for each matrix from the set $Q_n$ so as to provide the maximum identity of these distributions, which was assessed by the average length $\overline{l}$ of a random alignment (see below).
These two conditions allow the matrix $M$ to be replaced by a matrix that satisfies Equations 7 and 8. Equation 7 is the equation of a sphere in the 20×n space, and Equation 8 is the equation of a plane. If a matrix satisfies both conditions, it lies on the circle $C$ formed by the intersection of the sphere (Equation 7) with the plane (Equation 8). The matrix $M$ was considered as a point in the 20×n space, and the point on the circle $C$ nearest to it was taken. The coordinates of this point are the desired matrix $M'$. Equations 7 and 8 can thus be used to calculate the matrix $M'$: given the constants $R$ and $K_d$, the matrix $M$ and the frequencies $f(i)$ calculated for the sequence $S$, the matrix $M'$ is uniquely defined (provided the circle $C$ exists). The matrix $M'$ is used in Equation 6.
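The nearest point on the circle $C$ can be found in closed form, as the sketch below shows. The function name is hypothetical, and the derivation is a standard geometric argument rather than the authors' stated procedure: the component of $M$ along the normal of the plane is fixed by Equation 8, and the remaining orthogonal component is rescaled so that Equation 7 holds.

```python
import math


def project_to_circle(m, f, n, radius, k_d):
    """Return the matrix M' nearest to m that satisfies Equations 7 and 8.

    m is a 20 x n matrix (list of rows), f holds the 20 amino acid
    frequencies f(i), and t(j) = 1/n for every period position.
    """
    t = 1.0 / n
    c = [[fi * t for _ in range(n)] for fi in f]          # normal of the plane
    cc = sum(v * v for row in c for v in row)
    mc = sum(a * b for rm, rc in zip(m, c) for a, b in zip(rm, rc))
    # Component of m orthogonal to the normal of the plane.
    u = [[a - (mc / cc) * b for a, b in zip(rm, rc)] for rm, rc in zip(m, c)]
    u_norm = math.sqrt(sum(v * v for row in u for v in row))
    h2 = k_d * k_d / cc                                   # squared distance of the plane from the origin
    if h2 > radius ** 2 or u_norm == 0.0:
        raise ValueError("the circle C does not exist or the nearest point is not unique")
    r = math.sqrt(radius ** 2 - h2)                       # radius of the circle C
    return [[(k_d / cc) * b + (r / u_norm) * a for a, b in zip(ru, rc)]
            for ru, rc in zip(u, c)]
```

The `ValueError` corresponds to the caveat in the text that $M'$ is defined only if the circle $C$ exists, that is, if the plane actually cuts the sphere.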
The next task was to choose the constant $K_d$ for each matrix from the set $Q_n$ so as to provide the maximum identity of the distribution functions of $F_{max}$. For each matrix from the set $Q_n$, the average length $\overline{l}$ of a random alignment was calculated as the average of the difference $(j_m - j_0)$, together with the distribution function of $F_{max}$. Here, $j_m$ is the coordinate of $F_{max}$ in the sequence $S$ and $j_0$ is the coordinate at which $F = 0.0$ in the calculation of the alignment (the coordinate of the beginning of the alignment in the sequence $S$). The average length of a random alignment was chosen to be $N/5$. This value provides the best agreement between the determined and the actual alignment boundaries for model sequences of length $N$. The model sequences were random sequences into which a region with periodicity was inserted, with $Z > 10.0$ (Korotkov et al., 2003a) and a length from $N/10$ to $N/2$.
The constant $K_d$ was selected iteratively. The value of $K_d$ that makes $\overline{l}$ approximately equal to $N/5$ lies in the range from $K_1 = 0$ to $K_2 = -20$. The middle of this interval was taken as $K_d$. If $\overline{l}$ was greater than $N/5$, then $K_1 = (K_1+K_2)/2$ was set; if $\overline{l}$ was less than $N/5$, then $K_2 = (K_1+K_2)/2$ was set, and the process was repeated. The selection of the constant $K_d$ stopped upon reaching $\overline{l} = N/5 \pm 20$.
Random sequences were created by the following algorithm. A sequence of numbers of the same length as the amino acid sequence was generated using a random number generator. The sequence of random numbers was then sorted in ascending order, and the permutation that achieved this ordering was recorded. The same permutation was applied to the amino acid sequence. This algorithm produces random amino acid sequences of good quality.
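The shuffling procedure described above can be written in a few lines. The function name is hypothetical; the sketch draws one random key per residue, sorts the keys, and applies the resulting permutation to the sequence.

```python
import random


def shuffle_by_sort(seq, rng):
    """Shuffle seq by sorting random keys and applying the same permutation."""
    keys = [rng.random() for _ in seq]
    order = sorted(range(len(seq)), key=keys.__getitem__)
    return "".join(seq[k] for k in order)
```

The result has exactly the same amino acid composition as the input, which is what makes shuffled copies a suitable null model for the statistics in this section.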
2.6.2 Weights of the Deletions and Other
Constants
The constant $d$ was determined separately for each period $n$. The constant $e$ was set to $0.25d$. For each period $n$, a total of 100 test sequences were analyzed, which were created as follows. Artificial sequences with a length of 1000 amino acids containing a period $n$ were created. The statistical significance of this periodicity, $Z(n)$, determined by the information decomposition method, was equal to 7.0 (Korotkov et al., 2003a). Insertions or deletions were introduced at random positions at a rate of one per 50 amino acids. The constant $d$ that gave the greatest value of $Z_{mk}$ (Formula 3) was chosen. This value was applied for alignments using weight matrices from the set $Q_n$.
2.7 Selection of the Threshold $Z_0$
First, the threshold $Z_0$ for $Z_{mk}(n)$ was estimated in order to cut off the influence of statistical noise. Because the method of this study was used to analyze 300 amino acid sequences, $Z_0$ was estimated using 300 artificial amino acid sequences. Each sequence had a length of 600 amino acids and a period of 19 amino acids with 1.5 random substitutions per amino acid. To create a mutation, a random position was chosen in the sequence, and the amino acid at this position was replaced by a randomly chosen amino acid (with equal probability for all amino acids). This was done 900 times for each sequence. From 4 to 15 insertions, each one amino acid long, were then added to each sequence at random locations. This set was called $Q_{19}$. The ability of the developed approach to detect periodicity in the set $Q_{19}$ was tested. The results showed that the periodicity can be detected in 93% of cases. We believe that a 100% result can be achieved, but the number of iterations would have to be increased to approximately $10^5$ (see Section 2.3). Then, these 300 sequences were analyzed and
$Z_{mk}(19)$ was calculated for each of them. Next, each of these 300 sequences was shuffled to obtain a random sequence. The random sequences were analyzed in the same way, and a set of values $Z^R_{mk}(19)$ was obtained. $Z_0$ equal to 10.0 was chosen because $N_{random}(10.0)/N_{real}(10.0) < 5\%$, which means that the proportion of errors of the first kind is less than 0.05. Here, $N_{random}(10.0)$ is the number of values $Z^R_{mk}(19)$ equal to or greater than 10.0, and $N_{real}(10.0)$ is the number of values $Z_{mk}(19)$ equal to or greater than 10.0. The level of 10.0 was used for all $n$.
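The threshold criterion can be sketched as follows. The function names and the idea of scanning a list of candidate thresholds are assumptions; the paper reports only the ratio test and the chosen value of 10.0.

```python
def false_positive_ratio(z_real, z_random, z0):
    """N_random(z0) / N_real(z0): counts of Z values at or above the threshold."""
    n_real = sum(z >= z0 for z in z_real)
    n_random = sum(z >= z0 for z in z_random)
    return n_random / n_real if n_real else float("inf")


def choose_threshold(z_real, z_random, candidates, max_ratio=0.05):
    """Return the smallest candidate threshold whose ratio is below max_ratio."""
    for z0 in sorted(candidates):
        if false_positive_ratio(z_real, z_random, z0) < max_ratio:
            return z0
    return None
```

Taking the smallest threshold that passes keeps as many true detections as possible while holding the share of chance hits under 5%.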
The computational complexity of the algorithm is the reason
why only 300 amino acid sequences were analyzed. An analysis
of 300 sequences required about 6 months of calculations on
a computer cluster with 10 AMD FX-8350 processors.
Therefore, the entire Swiss-Prot database was not analyzed,
because this would require substantial computing
resources. The intention of the authors was to show
that periodicity exists in amino acid sequences with
many substitutions as well as where there are amino
acid insertions and deletions. This periodicity can be
detected by the approach developed in this study,
despite being combined with other methods. The
300 amino acid sequences are enough to solve this
problem.
3 EXAMPLES OF AMINO ACID SEQUENCES
In total, 300 amino acid sequences randomly selected from
the Swiss-Prot data bank (Boeckmann et al., 2003) were
studied. During selection, any sequence with already known
amino acid repeats or repetitive domains (Kajava, 2012) was
excluded from the set. As a result, our algorithm detected
regions with periodicity of various lengths (at least one
Z(n) > 10.0) in 71 sequences. The regions with periodicity
are longer than 40 amino acids and contain more than 3
periods. Three typical examples of sequences having
insertions and deletions were considered and were found to
have latent periodicity.
Figure 4 shows a second example of the spectrum Z(n), for
the sequence Q1D823 (Yang et al., 2004), the
adventurous-gliding motility protein. The region from amino
acid 35 to 1373 contains periodicity with a length of 7
amino acids, which can be revealed only if deletions and
insertions are allowed. The Z(7) of this region is the
maximum over all period lengths and is equal to 15.6. This
region contains 4 extended coiled coil regions. The
alignment contains 20 deletions and insertions of different
lengths; that is, the average distance between the
insertions and deletions is about 67 amino acids.
Periodicity equal to 7 amino acids is typical for coiled
coil regions. This periodicity has the form HPPHCPC, where
the positions of the heptad repeat are commonly denoted by
the lowercase letters a through g. Here, H represents
hydrophobic residues, C represents typically charged
residues, and P represents polar (and therefore hydrophilic)
residues. These motifs are the basis for most coiled coils,
particularly leucine zippers, which have predominantly
leucine in the d position of the heptad repeat. The
periodicity observed in sequence Q1D823 is different from
the periodicity specific for the coiled coil. It can be
assumed that there are different heptad repeats capable of
forming a coiled coil. It is also likely that such a
difference is due to insertions or deletions of amino acids.
The findings of the present work indicate that the resulting
matrix can probably be used to locate regions with long
coiled coils.
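To make the HPPHCPC notation concrete, the following Python sketch scores how well a sequence follows the heptad pattern in a fixed register, and shows how a single inserted residue breaks that register. The residue classes below are a simplified assumption made for illustration (they are not taken from the paper), and the function names are hypothetical. This is not the position-weight-matrix method of this study.

```python
# Assumed, simplified residue classes (an illustration, not from the paper).
HYDROPHOBIC = set("AILMFVW")
CHARGED = set("DEKR")
PATTERN = "HPPHCPC"  # heptad positions a..g as given in the text


def residue_class(aa):
    """Map a residue to H (hydrophobic), C (charged) or P (polar)."""
    if aa in HYDROPHOBIC:
        return "H"
    if aa in CHARGED:
        return "C"
    return "P"  # every other residue is treated as polar here


def heptad_match_fraction(seq, phase=0):
    """Fraction of residues whose class agrees with PATTERN in one register."""
    hits = sum(residue_class(aa) == PATTERN[(i + phase) % 7]
               for i, aa in enumerate(seq))
    return hits / len(seq)


ideal = "LSQLESK" * 4                        # classes spell HPPHCPC exactly
with_insert = ideal[:14] + "G" + ideal[14:]  # one inserted residue

print(heptad_match_fraction(ideal))
print(max(heptad_match_fraction(with_insert, p) for p in range(7)))
```

No single register fits the sequence with the insertion, because the two halves need different phases. This is the situation in which an alignment that allows insertions and deletions becomes necessary.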
Figure 3: Spectrum Z(n) for the sequence O42918.
Figure 4: Spectrum Z(n) for the sequence Q1D823.
BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms
Figure 5 shows Z(n) for the amino acid sequence P48681
(Dahlstrand et al., 1992) in the region from amino acid 182
to 1248. The period equal to 11 amino acids is clearly
visible. The periods of 22 and 33 amino acids are induced by
the main period of 11 amino acids. The region with
periodicity includes some coil regions and the tail. The
periodicity was discovered only in the presence of 24 amino
acid insertions or deletions of various lengths. Without
insertions and deletions, this periodicity is not
detectable.
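The way a main period induces signals at its multiples can be seen in a small Python sketch. It uses a deliberately simple self-match score as a stand-in for Z(n), not the statistic of this study. The repeat unit, the substitution rate, the seed and the function names are all assumptions chosen for illustration, and no insertions or deletions are modeled.

```python
import random


def self_match_score(seq, lag):
    """Fraction of positions i with seq[i] == seq[i + lag]."""
    pairs = len(seq) - lag
    return sum(seq[i] == seq[i + lag] for i in range(pairs)) / pairs


def noisy_repeat(unit, copies, substitution_rate, rng,
                 alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Tandem repeat of unit with random substitutions (no indels)."""
    seq = list(unit * copies)
    for i in range(len(seq)):
        if rng.random() < substitution_rate:
            seq[i] = rng.choice(alphabet)
    return "".join(seq)


rng = random.Random(1)
# Hypothetical 11-residue unit with 11 distinct letters.
seq = noisy_repeat("ACDEFGHIKLM", copies=60, substitution_rate=0.3, rng=rng)

spectrum = {lag: self_match_score(seq, lag) for lag in range(2, 41)}
top = sorted(spectrum, key=spectrum.get, reverse=True)[:3]
print(sorted(top))
```

In this toy spectrum the three strongest lags are the main period and its first two multiples, which mirrors the peaks at 22 and 33 described for P48681.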
We analyzed the 71 amino acid sequences with the programs
REP (Andrade et al., 2000), Internal Repeat Finder
(Marcotte et al., 1999), Prospero (Mott, 1999), RADAR
(Heger & Holm, 2000), REPRO (Heringa & Argos, 1993), TRUST
(Szklarczyk & Heringa, 2004) and PTRStalker (Pellegrini et
al., 2012). These programs found periodicity in these
sequences if Z was greater than 18.2. If Z was in the
interval from 15.5 to 18.2, these programs found only ~34%
of our results. If 10.0<Z<15.5, these methods found
nothing. In total, these methods found 6 of the 71 regions
with latent periodicity found in this work. As noted above
(see Section 1), this is a consequence of using pairwise
alignments between periods for the detection of latent
periodicity (number of periods is more than 3).
Figure 5: Spectrum Z(n) for the sequence P48681.
The question arises about the role of the observed
periodicity in the structure and functions of proteins. Two
assumptions were put forward about the functional role of
the detected periodicity. Firstly, the periodicity found
could be a property which provides a certain secondary
structure (Jernigan and Bordenstein, 2015). This assumption
has been expressed for the amino acid repeats which were
found earlier (Jorda et al., 2010; Kajava, 2012). In this
study, there are periods of length 6 and 7 amino acids
which may participate in the formation of α-helices.
Secondly, the periodicity found may reflect a certain
spatial repeatability of protein parts belonging to 3D
structures. For known repeats, this can be observed for the
Zn-finger domains (Lee et al., 1989), Ig-domains (Sawaya et
al., 2008) and the human matrix metalloproteinase (Elkins
et al., 2002). In the work of Kajava (2012), "the
structural classification of the repetitive proteins based
on the length of their repeats" provides additional
information.
The origin of multiple tandem repeats in proteins can be
associated with multiple tandem duplications in DNA (De
Grassi and Ciccarelli, 2009). This may lead to the
formation of new proteins (Björklund et al., 2006). Further
evolution and accumulation of mutations (amino acid
substitutions, deletions and insertions) could lead to the
creation of latent periodicity with many amino acid
substitutions, insertions and deletions. Such periodicity
was detected in the present work.
In the future, the computation time of this algorithm can
be reduced, and all known amino acid sequences accumulated
in the Swiss-Prot database will be analyzed again.
Performance can be increased either by using other methods
instead of a genetic algorithm to optimize the weight
matrix M or by running the calculations on large computing
clusters.
ACKNOWLEDGEMENTS
This work was supported by the grant 2014-04-
00164 of the Russian Fund of Fundamental
Research.
REFERENCES
Almirantis, Y. et al., 2014. Editorial: Complexity in
genomes. Computational biology and chemistry, 53 Pt
A, pp.1–4.
Altschul, S.F. et al., 1990. Basic local alignment search
tool. Journal of molecular biology, 215(3), pp.403–
410.
Andrade, M.A. et al., 2000. Homology-based method for
identification of protein repeats using statistical
significance estimates. Journal of molecular biology,
298(3), pp.521–537.
Bäck, T., 1996. Evolutionary Algorithms in Theory and
Practice: Evolution Strategies, Evolutionary
Programming, Genetic Algorithms, Oxford University
Press.
Banzhaf, W. et al., 1998. Genetic programming: an
introduction: on the automatic evolution of computer
programs and its applications.
Biegert, A. & Söding, J., 2008. De novo identification of
highly diverged protein repeats by probabilistic
consistency. Bioinformatics (Oxford, England), 24(6),
pp.807–14.
Björklund, A.K., Ekman, D. & Elofsson, A., 2006.
Expansion of protein domain repeats. PLoS
computational biology, 2(8), p.e114.
Boeckmann, B. et al., 2003. The SWISS-PROT protein
knowledgebase and its supplement TrEMBL in 2003.
Nucleic acids research, 31(1), pp.365–370.
Custer, M. et al., 1997. Identification of a new gene
product (diphor-1) regulated by dietary phosphate. The
American journal of physiology, 273(5 Pt 2), pp.F801–
F806.
Dahlstrand, J. et al., 1992. Characterization of the human
nestin gene reveals a close evolutionary relationship to
neurofilaments. Journal of cell science, 103(Pt 2),
pp.589–97.
Durbin, R. et al., 1998. Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids,
Cambridge University Press.
Ekblom, R. & Wolf, J.B.W., 2014. A field guide to whole-
genome sequencing, assembly and annotation.
Evolutionary Applications, 7(9), pp.1026–1042.
Elkins, P.A. et al., 2002. Structure of the C-terminally
truncated human ProMMP9, a gelatin-binding matrix
metalloproteinase. Acta crystallographica. Section D,
Biological crystallography, 58(Pt 7), pp.1182–92.
Fogel, D.B., 2010. Evolutionary Computation: Toward a
New Philosophy of Machine Intelligence.
Fogel, D.B., 1998. Evolutionary Computation: The Fossil
Record.
Gondro, C. & Kinghorn, B.P., 2007. A simple genetic
algorithm for multiple sequence alignment. Genetics
and molecular research : GMR, 6(4), pp.964–82.
De Grassi, A. & Ciccarelli, F.D., 2009. Tandem repeats
modify the structure of human genes hosted in
segmental duplications. Genome biology, 10(12),
p.R137.
Heger, A. & Holm, L., 2000. Rapid automatic detection
and alignment of repeats in protein sequences.
Proteins: Structure, Function and Genetics, 41(2),
pp.224–237.
Heringa, J. & Argos, P., 1993. A method to recognize
distant repeats in protein sequences. Proteins, 17(4),
pp.391–411.
Jernigan, K.K. & Bordenstein, S.R., 2015. Tandem-repeat
protein domains across the tree of life. PeerJ, 3,
p.e732.
Jorda, J. et al., 2010. Protein tandem repeats - the more
perfect, the less structured. The FEBS journal,
277(12), pp.2673–82.
Jorda, J. & Kajava, A. V, 2009. T-REKS: identification of
Tandem REpeats in sequences with a K-meanS based
algorithm. Bioinformatics (Oxford, England), 25(20),
pp.2632–8.
Kajava, A. V, 2012. Tandem repeats in proteins: from
sequence to structure. Journal of structural biology,
179(3), pp.279–88.
Korotkov, E.V., Korotkova, M.A. & Kudryashov, N.A.,
2003. The informational concept of searching for
periodicity in symbol sequences. Molekuliarnaia
Biologiia, 37(3), pp.436–451.
Korotkov, E.V., Korotkova, M.A. & Kudryashov, N.A., 2003. Information
decomposition method to analyze symbolical
sequences. Physics Letters, Section A: General,
Atomic and Solid State Physics, 312(3-4), pp.198–210.
Kravatskaya, G.I. et al., 2011. Coexistence of different
base periodicities in prokaryotic genomes as related to
DNA curvature, supercoiling, and transcription.
Genomics, 98(3), pp.223–231.
Kumar, L., Futschik, M. & Herzel, H., 2006. DNA motifs
and sequence periodicities. In silico biology, 6(1-2),
pp.71–8.
Lee, M.S. et al., 1989. Three-dimensional solution
structure of a single zinc finger DNA-binding domain.
Science (New York, N.Y.), 245(4918), pp.635–7.
Lobzin, V. V. & Chechetkin, V.R., 2000. Order and
correlations in genomic DNA sequences. The spectral
approach. Uspekhi Fizicheskih Nauk, 170(1), p.57.
Marcotte, E.M. et al., 1999. A census of protein repeats.
Journal of molecular biology, 293(1), pp.151–160.
Meng, T. et al., 2013. Wavelet analysis in current cancer
genome research: a survey. IEEE/ACM transactions
on computational biology and bioinformatics / IEEE,
ACM, 10(6), pp.1442–59.
Mitchell, M., 1998. An Introduction to Genetic
Algorithms.
Mott, R., 1999. Local sequence alignments with
monotonic gap penalties. Bioinformatics (Oxford,
England), 15(6), pp.455–62.
Newman, A.M. & Cooper, J.B., 2007. XSTREAM: a
practical algorithm for identification and architecture
modeling of tandem repeats in protein sequences.
BMC bioinformatics, 8, p.382.
Palidwor, G.A. et al., 2009. Detection of alpha-rod protein
repeats using a neural network and application to
huntingtin. PLoS computational biology, 5(3),
p.e1000304.
Pellegrini, M., 2015. Tandem Repeats in Proteins:
Prediction Algorithms and Biological Role. Frontiers
in bioengineering and biotechnology, 3, p.143.
Pellegrini, M., Renda, M.E. & Vecchio, A., 2012. Ab
initio detection of fuzzy amino acid tandem repeats in
protein sequences. BMC Bioinformatics, 13, p.S8.
Radcliffe, N.J., 1991. Equivalence Class Analysis of
Genetic Algorithms. Complex Systems, 5(2), pp.183–
205.
Rubinson, E.H. & Eichman, B.F., 2012. Nucleic acid
recognition by tandem helical repeats. Current opinion
in structural biology, 22(1), pp.101–9.
Sawaya, M.R. et al., 2008. A double S shape provides the
structural basis for the extraordinary binding
specificity of Dscam isoforms. Cell, 134(6), pp.1007–
18.
Shelenkov, A., Skryabin, K. & Korotkov, E., 2006. Search
and classification of potential minisatellite sequences
from bacterial genomes. DNA research : an
international journal for rapid publication of reports
on genes and genomes, 13(3), pp.89–102.
Smith, T.F. & Waterman, M.S., 1981. Identification of
common molecular subsequences. Journal of
Molecular Biology, 147, pp.195–197.
Söding, J., Remmert, M. & Biegert, A., 2006. HHrep: de
novo protein repeat detection and the origin of TIM
barrels. Nucleic acids research, 34(Web Server issue),
pp.W137–42.
Sosa, D. et al., 2013. Periodic distribution of a putative
nucleosome positioning motif in human, nonhuman
primates, and archaea: mutual information analysis.
International journal of genomics, 2013, p.963956.
de Sousa Vieira, M., 1999. Statistics of DNA sequences: a
low-frequency analysis. Physical review. E, Statistical
physics, plasmas, fluids, and related interdisciplinary
topics, 60(5 Pt B), pp.5932–5937.
Spears, W.M. & De Jong, K.D., 1991. On the Virtues of
Parameterized Uniform Crossover. Proceedings of the
Fourth International Conference on Genetic
Algorithms, Morgan Kaufmann Publishers Inc. San
Francisco, CA, USA, pp.230–236.
Suvorova, Y.M., Korotkova, M.A. & Korotkov, E. V,
2014. Comparative analysis of periodicity search
methods in DNA sequences. Computational biology
and chemistry, 53 Pt A, pp.43–48.
Sywerda, G., 1989. Uniform crossover in genetic
algorithms. Proceedings of the third international
conference on Genetic algorithms, Morgan Kaufmann
Publishers Inc. San Francisco, CA, USA, pp.2–9.
Szklarczyk, R. & Heringa, J., 2004. Tracking repeats using
significance and transitivity. Bioinformatics (Oxford,
England), 20 Suppl 1, pp.i311–7.
Tiwari, S. et al., 1997. Prediction of probable genes by
Fourier analysis of genomic sequences. Computer
applications in the biosciences CABIOS, 13(3),
pp.263–270.
Turutina, V.P. et al., 2006. Identification of Amino Acid
Latent Periodicity within 94 Protein Families. Journal
of Computational Biology, 13(4), pp.946–964.
Yang, R. et al., 2004. AglZ is a filament-forming coiled-
coil protein required for adventurous gliding motility
of Myxococcus xanthus. Journal of bacteriology,
186(18), pp.6168–78.