4.2 Window Size Definition
The size of the nucleotide sequence used in training
has a direct influence on the quality of the prediction
model (Silva et al., 2011; Liu and Wong, 2003).
Extraction windows can be symmetric, with the same
number of nucleotides in the upstream region (the part
of the sequence before the TIS) and the downstream
region (the part after the TIS), or asymmetric, with a
different number of nucleotides in each region. Preliminary
studies indicate that asymmetric windows provide
greater accuracy (Silva et al., 2011). In this work we
adopt asymmetric windows, with the upstream region
holding the smaller number of nucleotides.
To define the number of nucleotides in the
upstream region, we use the ribosome scanning
model and the Kozak consensus (Kozak,
1984), which identifies a conserved pattern at the
−6, −5, −4, −3, −2, −1, 1, 2, 3, 4 positions with the
sequence GCC[A/G]CCAUGG, where the nucleotides
[A/G] and G predominate at positions −3 and 4,
respectively. A larger upstream region was used by
(Tzanis et al., 2007), where conservation at position
−7 was also identified. For the experiments in this
work, we adopted a window with 9 nucleotides in
the upstream region, since ribosome scanning of the
mRNA proceeds codon by codon and, in addition, this
guarantees coverage of the conserved positions identified
in other works.
For the downstream region, (Pinto et al., 2017)
found that the larger the region, the greater the
accuracy achieved by the SVM classifier; we therefore
adopted a size of 1081 nucleotides in the downstream
region to obtain a better fit of the classifier. For the
extraction of the conservative rules, we used a
downstream size of 20 nucleotides to reduce extraction
time, given the high computational cost of the
rule-extraction algorithm.
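The asymmetric window extraction described above can be sketched as follows. The function name, the `'N'` padding strategy for windows that run past the sequence ends, and the example sequence are illustrative assumptions, not details from the paper:

```python
def extract_window(sequence, tis_index, upstream=9, downstream=1081):
    """Extract an asymmetric window around a candidate TIS.

    `tis_index` is the position of the 'A' of the candidate AUG codon;
    the window spans `upstream` nucleotides before the TIS and
    `downstream` nucleotides from the TIS onward.  Padding with 'N'
    when the window exceeds the sequence is an assumption made here
    for illustration.
    """
    start = tis_index - upstream
    end = tis_index + downstream
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad

# Example with a short hypothetical mRNA and a 9/12 window:
seq = "GCCGCCACCAUGGCGGCGGCGGCG"
window = extract_window(seq, seq.find("AUG"), upstream=9, downstream=12)
```

For rule extraction the same function would be called with `downstream=20`, matching the reduced window size used to keep extraction tractable.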
4.3 Extraction of Conservative
Characteristics
In this work, we used the 'Find Implications' algorithm
proposed by (Carpineto et al., 1999). This algorithm
extracts implications using formal concept analysis.
Given a formal concept (X, Y) in a concept lattice, the
algorithm looks for implications P → Q, with
P ∩ Q = ∅, P ⊂ Y and Q = Y − P, such that the
implication cannot be obtained from another concept
(W, Z). The algorithm requires a large computational
effort, since its complexity is proportional to
O(|C| k² |M| q), where |C| is the number of concepts,
k is the largest number of attributes in
a premise, |M| is the number of attributes and q is
the largest number of relations per concept.
Since the
database was very large, entailing a great computational
effort, we divided it into groups of 500
sequences and, at the end, computed the intersection
of the implications generated for each analyzed
organism.
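The batching-and-intersection step can be sketched as below. Representing each implication as a `(premise, conclusion)` pair of hashable values is an assumption made here; the paper does not specify the rule encoding:

```python
from functools import reduce

def intersect_rule_sets(rule_sets):
    """Keep only the implications present in every batch/organism.

    Each element of `rule_sets` is a set of hashable rules, e.g.
    (premise, conclusion) tuples where the premise is a frozenset of
    (position, nucleotide) pairs.  This encoding is illustrative.
    """
    return reduce(lambda a, b: a & b, rule_sets)

# Two hypothetical batches of extracted implications:
batch1 = {(frozenset({("-3", "A")}), "TIS"),
          (frozenset({("+4", "G")}), "TIS")}
batch2 = {(frozenset({("-3", "A")}), "TIS")}
common = intersect_rule_sets([batch1, batch2])
```

Only rules surviving every intersection are carried forward as candidate conservative characteristics.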
Since the TIS attribute is common to all positive
sequences, it was possible to observe rules where TIS
is the conclusion of a certain premise, as exemplified
in Section 2.4. After obtaining the implications
of each organism, the rules common to all organisms
were collected by intersecting the sets of rules
acquired from each organism. We also consider
that rules with support greater than or equal to
30% across all bases should be used as conservative
characteristics to increase classifier performance
(25% would be the random expectation, since there
are four nucleotides).
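The 30% support threshold can be applied as in the sketch below. The function name and the data layout (rules as position-to-nucleotide maps over the extraction window) are illustrative assumptions; only the threshold value comes from the text:

```python
def rule_support(rule_positions, sequences):
    """Fraction of sequences that satisfy a rule.

    `rule_positions` maps a window index to the nucleotide required
    at that index; a sequence satisfies the rule when every required
    position matches.
    """
    hits = sum(all(seq[i] == nt for i, nt in rule_positions.items())
               for seq in sequences)
    return hits / len(sequences)

# Hypothetical window data and rule: position 0 must be 'A'.
seqs = ["ACCAUGG", "GCCAUGG", "ACCAUGC", "UUUAUGG"]
rule = {0: "A"}
keep = rule_support(rule, seqs) >= 0.30  # retain rules with support >= 30%
```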
These characteristics were added to our base as a
support vector G of binary values indicating the
presence of each rule in the sequence. The vector G
is defined by Equation 3:

G(n) = 1 if V(n) = N(n), and 0 otherwise,   (3)

where V is a vector with the values of the sequence at
the positions of the conservative characteristics found,
and N is the vector with the values each position
must have according to the implications found.
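Equation 3 amounts to a position-wise comparison of the two vectors, as in this minimal sketch (function name and example values are hypothetical):

```python
def build_G(V, N):
    """Binary feature vector from Equation 3: G(n) = 1 iff V(n) == N(n).

    V holds the sequence's nucleotides at the conserved positions;
    N holds the nucleotides the implications require there.
    """
    return [1 if v == n else 0 for v, n in zip(V, N)]

# Hypothetical example: the sequence matches 2 of 3 conserved positions.
G = build_G(V=["A", "C", "G"], N=["A", "G", "G"])  # -> [1, 0, 1]
```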
4.4 Support Vector Machines Classifier
SVM is a machine learning technique capable of solving
linear and non-linear classification problems. It
separates examples with a linear decision surface while
maximizing the margin between it and the training
points (Silva et al., 2011).
The efficiency of the SVM classifier depends on the
proper selection of the parameters of the kernel function
used and of the smoothing parameter of the optimal
separating-hyperplane margin, represented by the symbol
C. In this work, the Gaussian RBF (Radial Basis
Function) kernel was used, which acts as a
structural regulator. The RBF function is defined by
Equation 4, and its parameter is represented by the
symbol gamma (γ):
K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)   (4)
To define the parameters C and γ, the 'Grid
Search' method was used, implemented for the class lib-