performance using accuracy, and we also report
macro-precision, macro-recall, and macro-F1
to reduce the effect of class imbalance in the
test dataset. According to all the evaluation metrics
we considered, the proposed approach shows better
performance. One of the strengths of the proposed
method lies in its simplicity. The method learns
domain embeddings using a single-layer neural network.
Because the network is shallow, training is faster
than for multi-layer deep networks. We use a
hierarchical softmax loss function to make
training even faster. Unlike other hierarchical
classification models such as ECPred (Dalkiran et al., 2018)
and DEEPre (Li et al., 2018), the proposed method
learns a single model instead of a separate model
for each class. The method scales to larger
datasets using CUDA-based GPUs. Although the
proposed method performs well, there is still scope for
improvement, especially for level-3 and level-4
predictions. As future work, we plan to improve the
method for more precise predictions and to apply
a similar approach to protein function annotation
using Gene Ontology terms.
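
To illustrate why macro-averaged measures are reported alongside accuracy, the following is a minimal sketch, not the evaluation code used in this work. It assumes that macro-F1 is the unweighted mean of the per-class F1 scores and that a class with an empty denominator scores 0; the function name `macro_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy, macro-precision, macro-recall and macro-F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Assumed convention: an empty denominator gives a score of 0.0.
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    # Macro averaging: every class contributes equally, whatever its size.
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical imbalanced test set: 8 proteins of one class, 2 of another.
y_true = ["1.1"] * 8 + ["2.7"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["1.1"] * 10

acc, macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={acc:.2f} macro-P={macro_p:.2f} "
      f"macro-R={macro_r:.2f} macro-F1={macro_f1:.2f}")
```

On this toy input the always-majority classifier reaches an accuracy of 0.80, but its macro-recall is only 0.50 and its macro-F1 about 0.44, because the minority class is never predicted. This is the effect the macro-averaged measures are meant to expose.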
REFERENCES
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang,
J., Zhang, Z., Miller, W., and Lipman, D. J. (1997).
Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids
Research, 25(17):3389–3402.
Asgari, E. and Mofrad, M. R. (2015). Continuous
distributed representation of biological sequences
for deep proteomics and genomics. PLoS ONE,
10(11):e0141287.
Bakheet, T. M. and Doig, A. J. (2009). Properties and iden-
tification of human protein drug targets. Bioinformat-
ics, 25(4):451–457.
Berger, B., Daniels, N. M., and Yu, Y. W. (2016). Com-
putational biology in the 21st century: Scaling with
compressive algorithms. Commun. ACM, 59(8):72–
80.
Cai, C., Han, L., Ji, Z., and Chen, Y. (2004). Enzyme family
classification by support vector machines. Proteins:
Structure, Function, and Bioinformatics, 55(1):66–76.
Cai, C., Han, L., Ji, Z. L., Chen, X., and Chen, Y. Z.
(2003). SVM-Prot: web-based support vector
machine software for functional classification of a protein
from its primary sequence. Nucleic Acids Research,
31(13):3692–3697.
Cai, Y.-D. and Chou, K.-C. (2005). Predicting enzyme sub-
class by functional domain composition and pseudo
amino acid composition. Journal of Proteome Re-
search, 4(3):967–971.
Chou, K.-C. (2009). Pseudo amino acid composition and its
applications in bioinformatics, proteomics and system
biology. Current Proteomics, 6(4):262–274.
Cornish-Bowden, A. (2014). Current IUBMB recommen-
dations on enzyme nomenclature and kinetics. Per-
spectives in Science, 1(1-6):74–87.
Dalkiran, A., Rifaioglu, A. S., Martin, M. J., Cetin-Atalay,
R., Atalay, V., and Doğan, T. (2018). ECPred: a tool
for the prediction of the enzymatic functions of
protein sequences based on the EC nomenclature. BMC
Bioinformatics, 19(1):334.
des Jardins, M., Karp, P. D., Krummenacker, M., Lee, T. J.,
and Ouzounis, C. A. (1997). Prediction of enzyme
classification from protein sequence without the use
of sequence similarity. In Proc Int Conf Intell Syst
Mol Biol, volume 5, pages 92–99.
Dobson, P. D. and Doig, A. J. (2005). Predicting enzyme
class from protein structure without alignments.
Journal of Molecular Biology, 345(1):187–199.
Finn, R. D., Clements, J., and Eddy, S. R. (2011). HMMER
web server: interactive sequence similarity searching.
Nucleic Acids Research, 39(2):W29–W37.
Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012).
CD-HIT: accelerated for clustering the next-generation
sequencing data. Bioinformatics, 28(23):3150–3152.
Gattiker, A., Michoud, K., Rivoire, C., Auchincloss, A. H.,
Coudert, E., Lima, T., Kersey, P., Pagni, M., Sigrist,
C. J., Lachaize, C., Veuthey, A.-L., Gasteiger, E., and
Bairoch, A. (2003). Automated annotation of micro-
bial proteomes in SWISS-PROT. Computational Bi-
ology and Chemistry, 27(1):49–58.
Huang, W.-L., Chen, H.-M., Hwang, S.-F., and Ho, S.-
Y. (2007). Accurate prediction of enzyme subfam-
ily class using an adaptive fuzzy k-nearest neighbor
method. Biosystems, 90(2):405–413.
Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W.,
McAnulla, C., McWilliam, H., Maslen, J., Mitchell,
A., Nuka, G., et al. (2014). InterProScan 5:
genome-scale protein function classification. Bioinformatics,
30(9):1236–1240.
Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou,
H., and Mikolov, T. (2016). FastText.zip:
Compressing text classification models. arXiv preprint
arXiv:1612.03651.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2017). Bag of tricks for efficient text classification.
In Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, pages 427–431.
Association for Computational Linguistics.
Kimothi, D., Soni, A., Biyani, P., and Hogan, J. M. (2016).
Distributed representations for biological sequence
analysis. arXiv preprint arXiv:1608.05949.
Kretschmann, E., Fleischmann, W., and Apweiler, R.
(2001). Automatic rule generation for protein
annotation with the C4.5 data mining algorithm applied on
SWISS-PROT. Bioinformatics, 17(10):920–926.
Kumar, N. and Skolnick, J. (2012). EFICAz2.5: application
of a high-precision enzyme function predictor to 396
proteomes. Bioinformatics, 28(20):2687–2688.
Functional Annotation of Proteins using Domain Embedding based Sequence Classification