Deep Neural Network (DNN) in this study. A DNN is a
multilayer neural network that improves its
representational power by combining the features
extracted in each layer. Training can suffer from
local minima and overfitting. However, using newer
activation functions such as ReLU and adopting
dropout to avoid overfitting allow a DNN to achieve
high performance in classification problems.
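As a concrete illustration, a minimal sketch of such a
network in Chainer (the framework used in Section 2.3)
might look as follows; the layer width, depth, and
dropout ratio here are illustrative assumptions, not
the settings of this study.

    import chainer
    import chainer.functions as F
    import chainer.links as L

    class MLP(chainer.Chain):
        """Three fully connected layers with ReLU activations and dropout."""
        def __init__(self, n_units=256, n_out=13):  # 13 subcellular locations
            super(MLP, self).__init__()
            with self.init_scope():
                self.l1 = L.Linear(None, n_units)  # input size inferred on first call
                self.l2 = L.Linear(None, n_units)
                self.l3 = L.Linear(None, n_out)

        def __call__(self, x):
            h = F.dropout(F.relu(self.l1(x)), ratio=0.5)  # dropout counters overfitting
            h = F.dropout(F.relu(self.l2(h)), ratio=0.5)
            return self.l3(h)  # raw scores; a softmax cross-entropy loss is applied outside

ReLU avoids the vanishing gradients of sigmoid-type
activations, while dropout randomly silences units
during training so that the network cannot rely on any
single co-adapted feature.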
2.2 Dataset
In this study, we used annotated human nuclear
protein databases described in Goldberg’s thesis
(Goldberg, 2016). Among them, we downloaded
HPRD (Prasad et al., 2009), NMPdb (Mika and
Rost, 2005), NPD (Dellaire et al., 2003), and
UniProt (The UniProt Consortium, 2017). Since we
could not access NOPdb and NSort/D, we did not use
them. These databases contain 4,111 sequences in
FASTA format. To eliminate homologous sequences, we
used the UniqueProt software (Mika and Rost, 2003).
Applying the threshold HVAL < 0, 319 samples
(protein sequences) were selected. The breakdown of
the sequences is shown in Table 1.
Among the 13 classes of subcellular location, the
largest class (Nucleolus) contains 117 samples. In
contrast, only three samples belong to the smallest
class (Nuclear pore complex). This means that the
data used in this study are highly class-imbalanced.
It is well known that, on class-imbalanced data, a
classifier tends to predict the majority-class label
too often, so classification performance is degraded
by the imbalance.
Table 1: The number of samples in each subcellular
location.

Subcellular location          Number of samples
Cajal bodies                                 11
Chromatin                                    66
Nuclear envelope                             45
Nuclear lamina                               14
Nuclear matrix                               47
Nuclear pore complex                          3
Nuclear speckles                             30
Nucleolus                                   117
Nucleoplasm                                  16
Perinucleolar compartment                     4
PML bodies                                    7
Kinetochore                                   5
Spindle apparatus                            26
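A common mitigation of such imbalance, shown here
only as a hedged sketch and not necessarily the
remedy adopted in this study, is to weight each class
inversely to its frequency, e.g. with scikit-learn:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Toy labels mimicking the imbalance in Table 1 (117 vs. 3 samples).
    y = np.array(["Nucleolus"] * 117 + ["Nuclear pore complex"] * 3)
    classes = np.unique(y)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
    print(dict(zip(classes, weights)))
    # {'Nuclear pore complex': 20.0, 'Nucleolus': 0.51...}: rare classes get large weights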
In the field of machine learning, a sample is
typically represented as a tuple of numerical values
called a feature vector, so that it can be handled by
regression, classification, and clustering
algorithms. For protein sequence classification,
several popular methods exist for computing such
feature values. Using the protr package for R
(Xiao et al., 2015) together with the PROFEAT web
service (Li et al., 2006), we executed the following
17 methods and generated the features that
characterize the human nuclear protein sequences
described above (a sketch of the simplest descriptor
follows the list).
Amino Acid Composition Descriptor (AAC)
Dipeptide Composition Descriptor (DC)
Tripeptide Composition Descriptor (TC)
Amino Acid/Dipeptide/Tripeptide (ADT)
Normalized Moreau-Broto Autocorrelation Descriptors (MoreauBroto)
Moran Autocorrelation Descriptors (Moran)
Geary Autocorrelation Descriptors (Geary)
Composition (CTDC)
Transition (CTDT)
Distribution (CTDD)
Conjoint Triad Descriptors (CTriad)
Sequence-Order-Coupling Number (SOCN)
Quasi-Sequence-Order Descriptors (QSO)
Pseudo-Amino Acid Composition (PAAC)
Amphiphilic Pseudo-Amino Acid Composition (APAAC)
Composition/Transition/Distribution (CTD)
Total Amino Acid Properties (TAAP)
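To make these descriptors concrete: the simplest one,
AAC, is simply the relative frequency of each of the
20 standard amino acids in a sequence. A minimal
pure-Python sketch follows; the actual descriptors in
this study were computed with protr and PROFEAT.

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

    def aac(sequence):
        """Amino Acid Composition: 20 relative frequencies of the residues."""
        sequence = sequence.upper()
        n = len(sequence)
        return [sequence.count(aa) / n for aa in AMINO_ACIDS]

    print(aac("MKVLAAGK"))  # toy peptide: 'A' occurs 2/8 = 0.25, etc.

The other descriptors extend this idea to residue
pairs and triples (DC, TC), physicochemical property
distributions (CTD), and sequence-order effects
(SOCN, QSO, PAAC, APAAC).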
2.3 Prediction and Performance Evaluation
In the experiment, we used Chainer (Tokui et al.,
2015), a Python-based deep learning framework, to
implement the classifier. For comparison, we also
implemented an SVM using scikit-learn, a popular
machine learning library for Python. To evaluate the
performance of a trained model, we used two
evaluation methods: leave-one-out cross-validation
and nested cross-validation.
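For reference, such an SVM baseline can be set up in
a few lines of scikit-learn; the kernel and C below
are placeholder assumptions, not the values used in
the paper, and the feature matrix is a dummy
stand-in.

    import numpy as np
    from sklearn.svm import SVC

    # X: a (319, d) matrix of descriptor features; y: the 13 location labels.
    X = np.random.rand(319, 20)             # dummy stand-in features
    y = np.random.randint(0, 13, size=319)  # dummy stand-in labels
    clf = SVC(kernel="rbf", C=1.0)          # hyperparameters are assumptions
    clf.fit(X, y)
    print(clf.predict(X[:5]))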
Leave-one-out cross-validation divides the dataset
into the smallest possible parts (i.e., each part
consists of a single sample). All parts except the
one held out for testing are merged and used for
training, and the performance (accuracy, in this
study) is then evaluated on the test set containing
that single sample. After this process is repeated as
many times as there are samples (i.e., 319 times),
the final performance is calculated.
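This procedure maps directly onto scikit-learn; a
minimal sketch, again with dummy data in place of the
real feature matrix:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    X = np.random.rand(319, 20)             # dummy features, as in the previous sketch
    y = np.random.randint(0, 13, size=319)  # dummy labels
    # 319 folds: each sample is held out exactly once as the test set.
    scores = cross_val_score(SVC(), X, y, cv=LeaveOneOut(), scoring="accuracy")
    print(scores.mean())  # fraction of held-out samples predicted correctly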
In nested cross-validation (also called double
cross-validation), each training set of the outer
cross-validation is further cross-validated, mainly
to tune the parameters of a classifier. In this
experiment, we adopted 3-fold and