Functional Annotation of Proteins using Domain Embedding based
Sequence Classification
Bishnu Sarker (https://orcid.org/0000-0002-9716-4989), David W. Ritchie and Sabeur Aridhi
University of Lorraine, Inria, Loria, CNRS, F-54000, Nancy, France
Keywords:
Machine Learning, Representation Learning, Protein Function Annotation, Bioinformatics, Domain
Embedding.
Abstract:
Due to recent advances in genomic sequencing technologies, the number of protein sequences in public databases is growing exponentially. The UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. The May 2019 release of UniProtKB contains around 158 million protein sequences. For the complete exploitation of this huge knowledge base, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology terms. However, only about half a million sequences (UniProtKB/Swiss-Prot) have been reviewed and functionally annotated by expert curators using information extracted from the published literature and computational analyses. Manual annotation by experts is expensive, slow and insufficient to fill the gap between the annotated and unannotated protein sequences. In this paper, we present an automatic functional annotation technique using neural network based word embedding that exploits the domain and family information of proteins. Domains are the most conserved regions in protein sequences and constitute the building blocks of 3D protein structures. For our experiments, we used fastText (https://github.com/facebookresearch/fasttext), a library for learning word embeddings and text classification developed by Facebook's AI Research lab. The experimental results show that domain embeddings perform much better than k-mer based word embeddings.
1 INTRODUCTION
1.1 Protein Function Annotation
Proteins are important components of all biological
systems. Understanding protein function is one of
the keys to understanding life at the molecular level,
and is central to understanding disease processes and
guiding drug discovery efforts (Bakheet and Doig,
2009). In this age of rapid and affordable amino-acid
sequencing technologies, the number of protein se-
quences accumulating in public databases is rising at
an increasing rate (Berger et al., 2016). In order to
enrich and exploit this huge quantity of data, these se-
quences should be annotated with functional proper-
ties. The UniProt Knowledgebase (UniProtKB) (The UniProt Consortium, 2015) is currently the largest
public sequence database. It consists of two compo-
nents: (i) the UniProtKB/Swiss-Prot database which
contains protein sequences with reliable information
that has been reviewed by expert bio-curators, and (ii)
the UniProtKB/TrEMBL database that stores unanno-
tated sequences. Thus, for all proteins in UniProtKB,
we have the primary amino acid sequences as well
as some further information such as InterPro domain
definitions which may have been identified from fam-
ilies of similar sequences or 3D protein structures.
Enzymes are usually labelled following the Enzyme Commission (EC) system (Cornish-Bowden, 2014), the widely used numerical enzyme classification scheme. The EC system assigns each enzyme a four-digit number. The challenge, therefore, is to annotate the un-annotated protein sequences automatically with appropriate EC numbers. The EC classification system has a hierarchical structure. The first level, represented by the first digit, consists of the six main enzyme classes: (i) oxidoreductases, (ii) transferases, (iii) hydrolases, (iv) lyases, (v) isomerases and (vi) ligases. Each main class node further extends into several subclass nodes, specifying subclasses of the enzymes, represented by the second digit. Similarly, the third digit indicates the sub-subclass and the fourth digit denotes the sub-sub-subclass. Let us
consider as an example a Type II restriction enzyme,
which is annotated as EC 3.1.21.4. The first digit, 3,
denotes that it is a hydrolase. The second digit, 1, in-
dicates that it acts on ester bonds. The third digit, 21,
shows that it is an endodeoxyribonuclease producing
5-phosphomonoesters. The last digit, 4, specifies that
it is a Type II site-specific deoxyribonuclease.
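As a concrete illustration, the following minimal Python sketch (variable names are ours, for illustration only) decomposes this EC number into the four hierarchical levels described above:

```python
# Decomposing EC 3.1.21.4 (a Type II restriction enzyme) into its four levels.
ec = "3.1.21.4"
digits = ec.split(".")
levels = {
    1: digits[0],             # '3'        -> hydrolase (main class)
    2: ".".join(digits[:2]),  # '3.1'      -> acting on ester bonds (subclass)
    3: ".".join(digits[:3]),  # '3.1.21'   -> endodeoxyribonuclease producing
                              #               5'-phosphomonoesters (sub-subclass)
    4: ec,                    # '3.1.21.4' -> Type II site-specific
                              #               deoxyribonuclease (sub-sub-subclass)
}
```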
1.2 Text Classification
One of the important tasks in natural language pro-
cessing is to classify text into classes such as tags, cat-
egories, labels, and so on. Text classification is widely
used in web search, information retrieval, ranking
and document classification. Due to recent successes, neural network based models have become prevalent in text classification. Although neural networks have greater representation capability, training neural network based deep learning models is computationally expensive due to the presence of non-linear hidden layers (Mikolov et al., 2013; Joulin et al., 2017). On the other hand, linear classifiers are simple and efficient, yet achieve strong performance in the context of text classification. However, a linear classifier such as an SVM does not share parameters among features and classes (Joulin et al., 2017), which eventually limits its generalization power. The fastText (Joulin et al., 2016) text classification tool uses a log-linear model with a shallow neural network to build a simple, fast, and efficient text classifier with word embeddings.
1.3 Text Classification for Protein
Function Annotation
Natural language text possesses a defined linguistic structure containing an array of words delimited by various punctuation marks. In contrast, biological texts such as protein sequences are strings of letters drawn from an alphabet of 20 letters, each representing an amino acid (Kimothi et al., 2016). Essentially, one string stands for a single protein. Unlike natural texts, there is no way of formally defining words or phrases in protein sequences. Therefore, using a text classification model requires further pre-processing of protein sequences. The most common way of pre-processing is to break the sequences into biological words, commonly known as k-mers, which are smaller units of k consecutive letters. The pre-processing can be done in two different ways (both tokenizations are sketched in code after the example below): (1) Overlapping k-mers are obtained by moving a window of size k over the sequence. For example, let us break MAPPSVFSEV into overlapping 3-mers. The corresponding 3-mers are MAP, APP, PPS, PSV, SVF, VFS, FSE, and SEV. Therefore, the biological sequence MAPPSVFSEV is transformed into the following space-delimited text: MAP APP PPS PSV SVF VFS FSE SEV. (2) Using non-overlapping k-mers, k new sequences are generated by splitting the original sequence into non-overlapping words of k consecutive letters, with the starting position moved by one letter for each newly generated sequence (Asgari and Mofrad, 2015; Kimothi et al., 2016). For example, for the sequence MAPPSVFSEV, considering 3-mers, the 3 newly generated space-delimited sequences are as follows:
1. MAP PSV FSE
2. APP SVF SEV
3. PPS VFS
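The following minimal Python sketch reproduces both tokenizations (the function names are ours, for illustration only):

```python
def overlapping_kmers(seq, k=3):
    """Split a sequence into overlapping k-mers by sliding a window of size k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def non_overlapping_kmers(seq, k=3):
    """Generate k shifted sequences of non-overlapping k-mers."""
    return [[seq[i:i + k] for i in range(start, len(seq) - k + 1, k)]
            for start in range(k)]

print(" ".join(overlapping_kmers("MAPPSVFSEV")))
# MAP APP PPS PSV SVF VFS FSE SEV
for shifted in non_overlapping_kmers("MAPPSVFSEV"):
    print(" ".join(shifted))
# MAP PSV FSE
# APP SVF SEV
# PPS VFS
```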
Non-overlapping k-mers have been used for word embedding learning tasks and have been shown to give better prediction accuracy when applied to the protein family classification task (Asgari and Mofrad, 2015). Overlapping k-mers are widely used in homology-based search over large databases of protein sequences, as in BLAST (Altschul et al., 1997).
This paper presents a novel way to tokenize protein sequences for the purpose of functional annotation. Instead of k-mers, we use the domain and family information of proteins in order to learn protein domain embeddings. The rationale behind using domain information is that domains are meaningful units of a protein sequence, conserved across similar sequences. Domains may be considered the natural building blocks of proteins. Through evolution, protein domains may have undergone changes such as duplication, fusion and recombination to produce proteins with distinct structures and functions (Kummerfeld and Teichmann, 2009). On the other hand, k-mer words do not carry any biologically significant meaning by themselves. Thus, tokenizing a protein sequence into a sentence of domains is more informative than a sentence of k-mer words. The experimental results presented in Section 4 confirm the superior performance of domain embeddings.
Two steps are necessary to prepare a training cor-
pus using domain information:
(1) Firstly, for each sequence, identify domain signatures using InterProScan (Jones et al., 2014; Quevillon et al., 2005; Mitchell et al., 2018), a sequence analysis tool that integrates different protein signature recognition methods into one resource. InterProScan provides domain signatures along with their location of appearance in the sequence.
(2) Secondly, the InterProScan output is processed to collect the domain signatures and sort them
according to their location of appearance. The domains are organized in ascending order of their starting position in the sequence to form a domain-sentence. Thus, each line of the final corpus is the list of domains found by InterProScan for a given protein sequence, along with its true EC labels.
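As an illustration of these two steps, the following Python sketch builds a fastText-style training corpus from InterProScan's tab-separated output. The file names and the `ec_labels` mapping are hypothetical, and we assume InterProScan's standard TSV layout (protein accession in column 1, signature accession in column 5, start position in column 7):

```python
import csv
from collections import defaultdict

def build_domain_corpus(interproscan_tsv, ec_labels, out_path):
    """Write one line per protein: its EC label followed by its
    domain-sentence (signatures sorted by start position)."""
    hits = defaultdict(list)  # accession -> [(start, signature), ...]
    with open(interproscan_tsv) as f:
        for row in csv.reader(f, delimiter="\t"):
            accession, signature, start = row[0], row[4], int(row[6])
            hits[accession].append((start, signature))

    with open(out_path, "w") as out:
        for accession, domains in hits.items():
            domains.sort()  # ascending order of start position
            sentence = " ".join(sig for _, sig in domains)
            # fastText expects labels prefixed with __label__
            out.write(f"__label__{ec_labels[accession]} {sentence}\n")
```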
In this paper, we propose an automatic protein function annotation technique that uses a shallow neural network based text classification method built on domain embeddings. To accomplish the task, we used fastText, developed by Facebook's AI Research team, to train a supervised sequence classification model as well as domain embeddings. We show a comparative analysis of overlapping 3-mers, non-overlapping k-mers and domain embeddings for protein function annotation with Enzyme Commission numbers.
The remainder of the paper is organized as follows. Section 2 presents a brief summary of previous work. Section 3 describes the proposed domain
embedding based method for automatic protein func-
tion annotation. Section 4 describes our experimental
setup and the obtained results.
2 PREVIOUS WORK
To annotate protein sequences, the UniProt curators use two systems: (1) UniRule, which uses manually designed if-then rules (Gattiker et al., 2003), and (2) the Statistical Automatic Annotation System (SAAS) (Kretschmann et al., 2001), which uses a decision tree based algorithm to automatically generate UniRule rules. Several approaches for predicting Enzyme Commission (EC) numbers that exploit
ing Enzyme Commission (EC) numbers that exploit
structural and sequence similarities of proteins have
been described previously (Dobson and Doig, 2005;
Yang et al., 2015; Nagao Chioko and Kenji, 2014;
Rahman et al., 2014; Kumar and Skolnick, 2012;
Quester and Schomburg, 2011; Yu et al., 2009). Ad-
ditionally, machine learning methods have also been
used extensively in (des Jardins et al., 1997; Na-
gao Chioko and Kenji, 2014; Li et al., 2016; Huang
et al., 2007; Lu et al., 2007; Nasibov and Kandemir-
Cavas, 2009; Li et al., 2018; Shen and Chou, 2007).
For example, DEEPre (Li et al., 2018) predicts EC
numbers combining multiple tools and techniques in-
cluding PSI-Blast (Altschul et al., 1997), HMMER
(Finn et al., 2011), Convolutional and Recurrent Neu-
ral Networks, and sequence encoding using position
specific scoring matrix (PSSM) to perform dimen-
sionality uniformization, feature selection, and clas-
sification model training. EzyPred (Shen and Chou,
2007) predicts enzyme functional classes and sub-
classes using a top-down approach. EzyPred exploits
functional and evolutionary information of proteins
using pseudo amino acid composition (Chou, 2009)
and functional encoding. Based on these two features,
EzyPred uses a modified K-Nearest Neighbor Classi-
fier called OET-KNN (Optimized Evidence-Theoretic
K-Nearest Neighbour).
SVM-Prot is a support vector machine (SVM)
based classification method detailed in (Cai et al.,
2003; Cai et al., 2004; Cai and Chou, 2005) and later updated in 2016 by adding two more classifiers, K-Nearest Neighbor (KNN) and Probabilistic Neural Networks (PNN), for improved performance (Li et al., 2016). This approach transforms protein sequences into
numerical representations based on various physico-
chemical properties such as polarity, hydrophobicity,
surface tension, charge, normalized van der Waals
volume, polarizability, secondary structure, solvent
accessibility, molecular weight, solubility, and the
numbers of hydrogen bond donors and acceptors in
side chain atoms.
ECPred (Dalkiran et al., 2018) is a hierarchical
prediction approach that starts by assigning the query
sequence as either an enzyme or non-enzyme. If
the query sequence is predicted to be an enzyme,
ECPred predictions follow the hierarchy of the EC
numbering system. Recently published high-performing machine learning based approaches such as DEEPre and ECPred train many models to accomplish their hierarchical predictions, one model for each class. For example, DEEPre uses one model to distinguish enzymes from non-enzymes, then, if the query is an enzyme, another model to predict the main class, six models to predict the second digit of the EC number, and so on. The works in (Kimothi et al., 2016; Asgari and Mofrad, 2015; Matsuda et al., 2005) present unsupervised word embedding based protein classification techniques using the continuous bag of words (CBOW) and skip-gram models proposed by (Mikolov et al., 2013). (Asgari and Mofrad, 2015) explores non-overlapping 3-mer embeddings and applies the method to protein family classification. Although they show improved performance for protein family classification, they did not explore the functional annotation of proteins. In this article, we explore protein domain embedding for the functional classification of proteins. We show that the proposed method performs with higher accuracy than 3-mer embeddings in the functional annotation of proteins with EC numbers. Here we learn a single model to predict the complete EC annotation; thus, the proposed method does not incur the overhead of learning many class-based hierarchical models.
Figure 1: Continuous Bag of Words Model Architecture (Mikolov et al., 2013). [Figure: one-hot context word vectors x1, x2, ..., xC are projected through a shared V×N weight matrix W into an N-dimensional hidden layer, then through an N×V output matrix W' to a softmax layer producing the output y.]
3 METHODS
Our proposed enzyme classification technique is based on a supervised text classification model for natural language processing. The supervised classification model is designed in a similar way to the Continuous Bag of Words (CBOW) architecture described in (Mikolov et al., 2013), where the middle word is replaced by the label. The typical CBOW model architecture is shown in Fig. 1. CBOW employs a very simple neural network with a single hidden layer to learn the projection of individual words. In supervised mode, for N textual documents or sentences, the following negative log-likelihood is minimized over the classes, as described in (Joulin et al., 2017):
$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(W' W x_n)\big),$$

where $W$ is a weight matrix that serves as a look-up table over the words, connecting the input to the hidden layer, and $W'$ is a weight matrix connecting the hidden layer to the output layer, where the softmax function $f$ is applied to compute the probability distribution over the labels. $x_n$ is the normalized bag of words of the $n$-th document and $y_n$ is its label.
The learned word representations are averaged into a text representation, which is then fed into a linear classifier. The model uses stochastic gradient descent based backpropagation to optimize the loss function.
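To make the architecture concrete, here is a minimal NumPy sketch of this forward pass (the dimensions and names are ours, chosen for illustration):

```python
import numpy as np

V, N, L = 10000, 150, 6  # vocabulary size, hidden units, number of labels
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))      # input -> hidden look-up table
W_out = rng.normal(scale=0.1, size=(N, L))  # hidden -> output weights

def predict_proba(word_ids):
    """Average the embeddings of a document's words (normalized bag of
    words), apply the linear output layer, then softmax over the labels."""
    hidden = W[word_ids].mean(axis=0)
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

probs = predict_proba([3, 17, 42])          # toy document of three word ids
```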
To apply this model to enzyme classification, we used the domain and family signatures of proteins as discrete words. The individual domain signatures found by running InterProScan on each protein sequence serve as words. While preparing the training corpus, the domains are arranged in ascending order of their location of appearance in the sequence. For comparison purposes, we also used overlapping and non-overlapping k-mer based biological words processed from the sequence data. The Enzyme Commission (EC) numbers are used as labels without any further processing.
After pre-processing the protein sequences to generate domains and k-mers and associating the appropriate EC labels, we prepared a large corpus of biological texts for the purpose of learning embeddings and classification models. The work-flow for the classification task is shown in Fig. 2. This figure shows the steps involved in training a domain embedding model using fastText supervised learning. The raw sequence data is transformed into domain data using the InterProScan (Jones et al., 2014) sequence analysis tool and then fed into a supervised learning model to learn the embeddings and the classification model. Finally, the test data are fed into the model to predict the EC annotations.
Figure 2: Data preparation and training work-flow for Domain Embedding based Protein Function Annotation. [Figure: protein sequences and labels → pre-processing using InterProScan → protein domain corpus → corpus split into training and test sets → fastText supervised learning → domain embeddings and classification model → predicted annotations.]
4 EXPERIMENTS AND RESULT
ANALYSIS
In this section, we first present the data used. Then, we present our experimental protocol and discuss the obtained results.
4.1 Dataset and Training
To evaluate the method, we used a well-defined dataset of enzyme and non-enzyme proteins curated from UniProtKB (The UniProt Consortium, 2015). This dataset, called "NEW", was published by (Li et al., 2018) and was constructed as follows:
1. The SWISS-PROT (released on September 7,
2016) database was separated into enzymes and
non-enzymes based on their annotation.
2. To guarantee uniqueness and correctness, enzyme
sequences with more than one set of EC num-
bers or incomplete EC number annotation were
excluded.
3. To avoid fragment data, enzyme sequences anno-
tated with ‘fragment’ or with less than 50 amino
acids were excluded. Enzyme sequences with
more than 5000 amino acids were also excluded.
4. Redundancy bias was removed using CD-HIT (Fu et al., 2012) with a 40% similarity threshold to sift the raw dataset, resulting in 22,168 low-homology enzyme sequences.
5. To construct the non-enzyme part, 22,168 non-enzyme protein sequences were randomly collected from the non-enzyme part of SWISS-PROT (released on September 7, 2016); these were also subjected to steps 2-4 above. Thus the original dataset contains 22,168 enzymes and an equal number of non-enzymes.
To build the corpus for learning the embeddings, each sequence underwent three different pre-processing treatments for the three types of sequence tokenization described in Section 1.3. For overlapping k-mers, each sequence was split into overlapping k-mer words. We chose 3-mers for our experiments, meaning a window of size 3 was moved over the sequence to split it into overlapping 3-mers.
For non-overlapping k-mers, each sequence was transformed into 3 sequences of non-overlapping 3-mers. The same label was associated with all three sequences, as they were generated from the same sequence with a single EC label.
In the last case, we used the domain and family signatures of proteins as words to build the training corpus. Each sequence was transformed into a list of domains using InterProScan (Jones et al., 2014) version 5.35-74.0. We built three large corpora using the 3-mers (overlapping and non-overlapping) and the domains, and each sentence was associated with its EC label to train the supervised classification model. Non-enzymes were labelled with "__label__NANZ".
For the implementation, we used fastText (Joulin et al., 2016), a library for learning embeddings and text classification developed by Facebook's AI Research lab.
4.2 Evaluation
To evaluate our method, we used 10-fold cross-validation on the NEW dataset. We split the text corpus into 10 parts, and considered one of them as the testing set with the remaining 9 parts combined as the training set. The cross-validation results show promising performance of the domain embedding based enzyme classification. For each query sequence, we picked the top-ranked annotation only. To validate the performance, we computed the accuracy, macro-precision, macro-recall, and macro-F1-measure at different levels of the EC number. The following formulae (as used in (Li et al., 2018)) were used to compute the evaluation metrics:
$$\mathrm{accuracy}(y, y') = \frac{1}{N}\sum_{i=0}^{N-1} \mathbb{1}(y_i = y'_i),$$

where $y$ and $y'$ are the lists of ground-truth and predicted annotations. The accuracy is computed for each level of EC annotation. As the problem is a multiclass classification problem, we computed the macro-precision, macro-recall, and macro-F1 scores as follows:
$$\mathrm{Macro\,precision}(y, y') = \frac{1}{|M|}\sum_{l \in M} \mathrm{precision}(y_l, y'_l),$$

$$\mathrm{Macro\,recall}(y, y') = \frac{1}{|M|}\sum_{l \in M} \mathrm{recall}(y_l, y'_l),$$

$$\mathrm{Macro\,F1}(y, y') = \frac{1}{|M|}\sum_{l \in M} \mathrm{F1\,measure}(y_l, y'_l),$$

where $y_l$ is the part of $y$ with label $l$, $y'_l$ is the part of $y'$ with label $l$, and $M$ is the set of classes. In general, the precision, recall, and F1-measure are computed as follows for two given sets $A$ and $P$:
$$\mathrm{precision} = \frac{|A \cap P|}{|P|}, \qquad \mathrm{recall} = \frac{|A \cap P|}{|A|}, \qquad \mathrm{F1\,measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
Here, A is the set of ground truths and P is the set of predictions. As EC numbers are hierarchical with four levels, we report level-wise precision, recall and F1-measure.
Table 1: Experimental results of 10-fold cross-validation on the "NEW" dataset.

LEVEL    EMBEDDINGS          ACCURACY  MACRO-PRECISION  MACRO-RECALL  MACRO-F1
LEVEL-0  3-MER NON-OVERLAP   0.741     0.768            0.776         0.772
LEVEL-0  3-MER OVERLAP       0.738     0.816            0.825         0.820
LEVEL-0  DOMAIN              0.943     0.968            0.970         0.969
LEVEL-1  3-MER NON-OVERLAP   0.523     0.243            0.240         0.241
LEVEL-1  3-MER OVERLAP       0.529     0.251            0.245         0.248
LEVEL-1  DOMAIN              0.929     0.918            0.913         0.916
LEVEL-2  3-MER NON-OVERLAP   0.467     0.090            0.089         0.089
LEVEL-2  3-MER OVERLAP       0.470     0.101            0.097         0.099
LEVEL-2  DOMAIN              0.918     0.823            0.812         0.817
LEVEL-3  3-MER NON-OVERLAP   0.452     0.062            0.060         0.061
LEVEL-3  3-MER OVERLAP       0.455     0.073            0.070         0.071
LEVEL-3  DOMAIN              0.909     0.757            0.757         0.757
LEVEL-4  3-MER NON-OVERLAP   0.434     0.037            0.036         0.037
LEVEL-4  3-MER OVERLAP       0.439     0.052            0.050         0.051
LEVEL-4  DOMAIN              0.851     0.620            0.633         0.626
Level-1 denotes the main class, level-2 the sub-class, level-3 the sub-sub-class and level-4 the sub-sub-sub-class. We add level-0 for the step of identifying whether a sequence is an enzyme or a non-enzyme. For each query sequence, we pick the top-most annotation. For evaluation purposes, we split the 4-digit EC annotation into its constituent parts: for level-1 we consider the first digit, for level-2 the first 2 digits, for level-3 the first 3 digits, and finally for level-4 all four digits together.
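A minimal sketch of this level-wise evaluation, assuming scikit-learn is available and that predictions and ground truths are EC strings (with "NANZ" for non-enzymes), might look as follows:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def truncate(label, level):
    """Keep the first `level` digits of an EC number; at level 0 keep only
    the enzyme / non-enzyme distinction. "NANZ" passes through unchanged."""
    if level == 0:
        return "NANZ" if label == "NANZ" else "ENZYME"
    return ".".join(label.split(".")[:level])

def level_metrics(y_true, y_pred, level):
    t = [truncate(y, level) for y in y_true]
    p = [truncate(y, level) for y in y_pred]
    acc = accuracy_score(t, p)
    prec, rec, f1, _ = precision_recall_fscore_support(
        t, p, average="macro", zero_division=0)
    return acc, prec, rec, f1
```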
In our experiments, we trained the model using a learning rate of 0.9, 150 hidden units, a window size of 5, and 200 training epochs. We used hierarchical softmax as the loss function and word n-grams of size 2. The cross-validation results are shown in Table 1.
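Using fastText's Python bindings, this training setup can be reproduced roughly as follows (the corpus file name and the example domain accessions are placeholders):

```python
import fasttext

# Each corpus line: __label__<EC or NANZ> followed by the domain-sentence.
model = fasttext.train_supervised(
    input="domain_corpus.train",
    lr=0.9,          # learning rate
    dim=150,         # 150 hidden units (embedding dimension)
    ws=5,            # window size
    epoch=200,       # training epochs
    loss="hs",       # hierarchical softmax
    wordNgrams=2,    # word n-grams of size 2
)

# Top-ranked prediction for a held-out domain-sentence (accessions illustrative).
labels, probs = model.predict("IPR011335 IPR011856 IPR036397", k=1)
```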
Table 1 shows the accuracy, precision, recall and F1 measure of the cross-validation performed on the NEW dataset. As EC numbers consist of four digits, we present the evaluation metrics for four levels (levels 1-4). To measure the efficiency in classifying enzymes and non-enzymes among the test sequences, we also present level-0 accuracy, precision, recall and F1 measure. The results show that the proposed domain embedding based classification can differentiate enzymes and non-enzymes with an accuracy of 94.3%, which is better than the k-mer based embeddings (73%-74%). Along with accuracy, we also report the macro-precision, recall, and F1 scores averaged over classes. Macro-{precision, recall and F1} scores give a reliable measure for unbalanced data. Domain embeddings achieve very high precision and recall of 97% for the level-0 prediction task. It is interesting to see that the macro-F1 score confirms the higher accuracy achieved by the proposed domain embeddings.
For level-1 predictions, we predict the main class if the protein is an enzyme. There are 6 different main classes. We measure level-1 accuracy as how accurately the method identifies non-enzymes and enzymes with the correct main class. As for level-0, we also present the class-based macro-{precision, recall, and F1} scores, as the test data is class-imbalanced. The performance measures show that domain embedding based classification performs better on all metrics.
In a similar fashion, we also report the accuracy and macro-{precision, recall and F1} scores for level-2, level-3 and level-4 predictions. For all of the levels, our proposed annotation technique outperforms k-mer based embeddings. However, as we go to higher EC levels, the accuracy falls off because a higher EC level is very specific. For example, a level-4 EC number describes an enzyme that is specific to a particular type of substrate molecule. Nevertheless, the domain embedding based classification shows promising performance for level-4 prediction as well.
In summary, from the results shown in Table 1,
it is evident that domain based embeddings perform
noticeably better than k-mer based word embeddings
in all evaluation metrics and also for all levels of EC
hierarchy.
5 CONCLUSION
In this paper, we propose a novel protein function annotation approach using domain embedding based sequence classification instead of k-mer based word embeddings. To show the superior performance of the proposed method, we used 10-fold cross-validation on a benchmark dataset. We measured the annotation
performance using accuracy, and we also report the macro-precision, macro-recall, and macro-F1 measure to reduce the effect of class imbalance in the test dataset. According to all the evaluation metrics we considered, the proposed approach shows better performance. One of the strengths of the proposed method lies in its simplicity. The method learns domain embeddings using a single-layer neural network. Due to the use of a shallow neural network, training is faster than with other multi-layer deep networks. We used the hierarchical softmax loss function to make training even faster. Unlike other hierarchical classification models such as ECPred (Dalkiran et al., 2018) and DEEPre (Li et al., 2018), the proposed method learns a single model instead of learning one model per class. The method is scalable to larger datasets using CUDA based GPU units. Although the proposed method performs well, there is still scope for improvement, especially for level-3 and level-4 predictions. As future work, we plan to improve the method for more precise predictions and to apply a similar approach to protein function annotation using Gene Ontology terms.
REFERENCES
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.
Asgari, E. and Mofrad, M. R. (2015). Continuous
distributed representation of biological sequences
for deep proteomics and genomics. PloS one,
10(11):e0141287.
Bakheet, T. M. and Doig, A. J. (2009). Properties and iden-
tification of human protein drug targets. Bioinformat-
ics, 25(4):451–457.
Berger, B., Daniels, N. M., and Yu, Y. W. (2016). Com-
putational biology in the 21st century: Scaling with
compressive algorithms. Commun. ACM, 59(8):72–
80.
Cai, C., Han, L., Ji, Z., and Chen, Y. (2004). Enzyme family
classification by support vector machines. Proteins:
Structure, Function, and Bioinformatics, 55(1):66–76.
Cai, C., Han, L., Ji, Z. L., Chen, X., and Chen, Y. Z.
(2003). Svm-prot: web-based support vector ma-
chine software for functional classification of a protein
from its primary sequence. Nucleic acids research,
31(13):3692–3697.
Cai, Y.-D. and Chou, K.-C. (2005). Predicting enzyme sub-
class by functional domain composition and pseudo
amino acid composition. Journal of Proteome Re-
search, 4(3):967–971.
Chou, K.-C. (2009). Pseudo amino acid composition and its
applications in bioinformatics, proteomics and system
biology. Current Proteomics, 6(4):262–274.
Cornish-Bowden, A. (2014). Current IUBMB recommen-
dations on enzyme nomenclature and kinetics. Per-
spectives in Science, 1(1-6):74–87.
Dalkiran, A., Rifaioglu, A. S., Martin, M. J., Cetin-Atalay, R., Atalay, V., and Doğan, T. (2018). ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics, 19(1):334.
des Jardins, M., Karp, P. D., Krummenacker, M., Lee, T. J.,
and Ouzounis, C. A. (1997). Prediction of enzyme
classification from protein sequence without the use
of sequence similarity. In Proc Int Conf Intell Syst
Mol Biol, volume 5, pages 92–99.
Dobson, P. D. and Doig, A. J. (2005). Predicting enzyme
class from protein structure without alignments. Jour-
nal of molecular biology, 345(1):187–199.
Finn, R. D., Clements, J., and Eddy, S. R. (2011). HMMER
web server: interactive sequence similarity searching.
Nucleic Acids Research, 39(2):W29–W37.
Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). Cd-
hit: accelerated for clustering the next-generation se-
quencing data. Bioinformatics, 28(23):3150–3152.
Gattiker, A., Michoud, K., Rivoire, C., Auchincloss, A. H.,
Coudert, E., Lima, T., Kersey, P., Pagni, M., Sigrist,
C. J., Lachaize, C., Veuthey, A.-L., Gasteiger, E., and
Bairoch, A. (2003). Automated annotation of micro-
bial proteomes in SWISS-PROT. Computational Bi-
ology and Chemistry, 27(1):49–58.
Huang, W.-L., Chen, H.-M., Hwang, S.-F., and Ho, S.-
Y. (2007). Accurate prediction of enzyme subfam-
ily class using an adaptive fuzzy k-nearest neighbor
method. Biosystems, 90(2):405–413.
Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W.,
McAnulla, C., McWilliam, H., Maslen, J., Mitchell,
A., Nuka, G., et al. (2014). Interproscan 5: genome-
scale protein function classification. Bioinformatics,
30(9):1236–1240.
Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., and Mikolov, T. (2016). Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2017). Bag of tricks for efficient text classification.
In Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, pages 427–431.
Association for Computational Linguistics.
Kimothi, D., Soni, A., Biyani, P., and Hogan, J. M. (2016).
Distributed representations for biological sequence
analysis. arXiv preprint arXiv:1608.05949.
Kretschmann, E., Fleischmann, W., and Apweiler, R.
(2001). Automatic rule generation for protein anno-
tation with the C4.5 data mining algorithm applied on
SWISS-PROT. Bioinformatics, 17 10:920–6.
Kumar, N. and Skolnick, J. (2012). Eficaz2. 5: application
of a high-precision enzyme function predictor to 396
proteomes. Bioinformatics, 28(20):2687–2688.
Kummerfeld, S. K. and Teichmann, S. A. (2009). Protein
domain organisation: adding order. BMC Bioinfor-
matics, 10(1):39.
Li, Y., Wang, S., Umarov, R., Xie, B., Fan, M., Li, L., and
Gao, X. (2018). DEEPre: sequence-based enzyme EC
number prediction by deep learning. Bioinformatics,
34(5):760–769.
Li, Y. H., Xu, J. Y., Tao, L., Li, X. F., Li, S., Zeng,
X., Chen, S. Y., Zhang, P., Qin, C., Zhang, C.,
et al. (2016). Svm-prot 2016: a web-server for ma-
chine learning prediction of protein functional fami-
lies from sequence irrespective of similarity. PloS one,
11(8):e0155290.
Lu, L., Qian, Z., Cai, Y.-D., and Li, Y. (2007). ECS: an au-
tomatic enzyme classifier based on functional domain
composition. Computational Biology and Chemistry,
31(3):226–232.
Matsuda, S., Vert, J.-P., Saigo, H., Ueda, N., Toh, H.,
and Akutsu, T. (2005). A novel representation of
protein sequences for prediction of subcellular loca-
tion using support vector machines. Protein Science,
14(11):2804–2813.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
Mitchell, A. L., Attwood, T. K., Babbitt, P. C., Blum, M.,
Bork, P., Bridge, A., Brown, S. D., Chang, H.-Y., El-
Gebali, S., Fraser, M. I., et al. (2018). Interpro in
2019: improving coverage, classification and access
to protein sequence annotations. Nucleic acids re-
search, 47(D1):D351–D360.
Nagao Chioko, N. N. and Kenji, M. (2014). Prediction of
detailed enzyme functions and identification of speci-
ficity determining residues by random forests. PLoS
One, 9(1).
Nasibov, E. and Kandemir-Cavas, C. (2009). Efficiency
analysis of KNN and minimum distance-based clas-
sifiers in enzyme family prediction. Computational
Biology and Chemistry, 33(6):461–464.
Quester, S. and Schomburg, D. (2011). EnzymeDetec-
tor: an integrated enzyme function prediction tool and
database. BMC Bioinformatics, 12(1):376.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., and Lopez, R. (2005). InterProScan: protein domains identifier. Nucleic Acids Research, 33(suppl_2):W116–W120.
Rahman, S. A., Cuesta, S. M., Furnham, N., Holliday, G. L.,
and Thornton, J. M. (2014). EC-BLAST: a tool to
automatically search and compare enzyme reactions.
Nature Methods, 11(2):171.
Shen, H.-B. and Chou, K.-C. (2007). Ezypred: a top–down
approach for predicting enzyme functional classes and
subclasses. Biochemical and biophysical research
communications, 364(1):53–59.
The UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Research, 43(D1):D204–D212.
Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang,
Y. (2015). The i-tasser suite: protein structure and
function prediction. Nature methods, 12(1):7.
Yu, C., Zavaljevski, N., Desai, V., and Reifman, J. (2009).
Genome-wide enzyme annotation with precision con-
trol: Catalytic families (CatFam) databases. Proteins:
Structure, Function, and Bioinformatics, 74(2):449–
460.