For H/ACA box snoRNAs, the ML classifiers achieved
equivalent prediction performance whether they used the
full set of known biological characteristics or a reduced
number of features. In the insertion mutation experiment,
performance decreased with increasing mutation levels
for all three ML classifiers.
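As an illustration of the insertion mutation experiment, the following minimal Python sketch generates a mutated copy of a sequence. It assumes that the mutation level is the fraction of the sequence length that is inserted as random nucleotides at uniformly chosen positions. The function name `insert_mutations`, the uniform choice of nucleotides, and this definition of the mutation level are assumptions made for illustration; they are not a description of the procedure used in this study.

```python
import random


def insert_mutations(seq, level, alphabet="ACGU", rng=None):
    """Return a copy of seq with round(level * len(seq)) random insertions.

    Assumption (hypothetical definition): 'level' is the fraction of the
    original sequence length that is added as inserted nucleotides.
    """
    rng = rng or random.Random()
    n_insert = round(level * len(seq))
    mutated = list(seq)
    for _ in range(n_insert):
        # randint is inclusive on both ends, so insertion at the very
        # end of the sequence is possible.
        pos = rng.randint(0, len(mutated))
        mutated.insert(pos, rng.choice(alphabet))
    return "".join(mutated)


if __name__ == "__main__":
    original = "ACGUACGUACGUACGUACGU"
    for level in (0.0, 0.1, 0.2, 0.5):
        print(level, insert_mutations(original, level, rng=random.Random(1)))
```

Because only insertions are applied, the original sequence always remains a subsequence of the mutated one, while spacing between conserved elements is increasingly disturbed as the level grows.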
In summary, our results show that ML methods can be
competitive with traditional homology search methods,
provided that sufficiently large sets of independent
instances are available for training and testing. This
requirement, however, is prohibitive for most practical
applications. We therefore suggest that the careful
production of artificial data is a promising approach that
can be pursued in practice, at least for families of ncRNAs
for which an adequately diverse set of representatives is
available. Our data also indicate that knowledge of a
sufficiently large set of biologically relevant features is
important for the performance of ML-based homology search.
Clearly, the present study is only a first step.
It remains open whether the ML methods can also compete
with more sophisticated methods of homology search such as
Hidden Markov Models (HMMs) (Eddy, 1996) or covariance
models (CMs) (Nawrocki and Eddy, 2013), which, similar to
ML models, also convey information on local and non-local
correlations, respectively.
REFERENCES
Achawanantakun, R., Chen, J., Sun, Y., and Zhang, Y.
(2015). LncRNA-ID: Long non-coding RNA IDentification
using balanced random forests. Bioinformatics,
31(24):3897–3905.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and
Lipman, D. J. (1990). Basic local alignment search
tool. J. Mol. Biol., 215(3):403–410.
Barber, D. (2012). Bayesian Reasoning and Machine
Learning. Cambridge University Press, Cambridge,
UK.
Bartschat, S., Kehr, S., Tafer, H., Stadler, P. F., and Hertel,
J. (2014). snoStrip: a snoRNA annotation pipeline.
Bioinformatics, 30(1):115–116.
Bratkovič, T., Božič, J., and Rogelj, B. (2020). Functional
diversity of small nucleolar RNAs. Nucleic Acids Research,
48(4):1627–1651.
Breiman, L. (2001). Random forests. Machine Learning,
45:5–32.
de Araujo Oliveira, J. V., Costa, F., Backofen, R., Stadler,
P. F., Machado Telles Walter, M. E., and Hertel, J.
(2016). SnoReport 2.0: new features and a refined
Support Vector Machine to improve snoRNA identification.
BMC Bioinformatics, 17 Suppl. 18:464.
Eddy, S. R. (1996). Hidden Markov models. Curr. Opin.
Struct. Biol., 6:361–365.
Falaleeva, M. and Stamm, S. (2013). Processing of snoRNAs
as a new source of regulatory non-coding RNAs:
snoRNA fragments form a new class of functional
RNAs. BioEssays, 35(1):46–54.
Georgakilas, G. K., Grioni, A., Liakos, K. G., Chalupova,
E., Plessas, F. C., and Alexiou, P. (2020). Multi-branch
Convolutional Neural Network for Identification of
Small Non-coding RNA genomic loci. Scientific Reports,
10(1):9486.
Goldschmidt, R. and Passos, E. (2005). Data Mining: A
Practical Guide. Gulf Professional Publishing.
Gruber, A. R., Findeiß, S., Washietl, S., Hofacker, I. L., and
Stadler, P. F. (2010). RNAz 2.0: improved noncoding
RNA detection. Pac. Symp. Biocomput., 15:69–79.
Gulli, A. and Pal, S. (2017). Deep Learning with Keras.
Packt Publishing Ltd, Birmingham, UK.
Haykin, S. (1999). Neural Networks: A Comprehensive
Foundation. Prentice-Hall, Englewood Cliffs.
Lorenz, R., Bernhart, S. H., Höner zu Siederdissen, C.,
Tafer, H., Flamm, C., Stadler, P. F., and Hofacker, I. L.
(2011). ViennaRNA Package 2.0. Alg. Mol. Biol.,
6:26.
Nawrocki, E. P. and Eddy, S. R. (2013). Infernal 1.1:
100-fold faster RNA homology searches. Bioinformatics,
29(22):2933–2935.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J. T., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay,
É. (2011). Scikit-learn: Machine learning in
Python. J. Machine Learning Res., 12:2825–2830.
Rastogi, A. and Gupta, D. (2014). GFF-Ex: a genome feature
extraction package. BMC Research Notes, 7:315.
Russell, S. and Norvig, P. (2010). Artificial Intelligence: A
Modern Approach. Prentice-Hall, Englewood Cliffs,
3rd edition.
Satoh, N. (2003). The ascidian tadpole larva: comparative
molecular development and genomics. Nature Reviews
Genetics, 4(4):285–295.
Waldl, M., Thiel, B., Ochsenreiter, R., Holzenleiter, A.,
de Araujo Oliveira, J. V., Walter, M. E. M. T., Wolfinger,
M. T., and Stadler, P. F. (2018). TERribly difficult:
Searching for telomerase RNAs in Saccharomycetes.
Genes, 9:372.
Zhang, Y., Huang, H., Zhang, D., Qiu, J., Yang, J., Wang,
K., Zhu, L., Fan, J., and Yang, J. (2017). A Review on
Recent Computational Methods for Predicting Non-coding
RNAs. BioMed Res. Intl., 2017:1–14.
Zhang, Y. and Rajapakse, J. C. (2009). Machine Learning
in Bioinformatics. John Wiley & Sons, Hoboken, NJ.
Machine Learning Studies of Non-coding RNAs based on Artificially Constructed Training Data