
Table 1: Details of the various datasets.
Integrated Dataset
Database                  hERG inhibitors   Inactive compounds
ChEMBL (version 22)       4,793             5,275
GOSTAR                    3,260             3,509
NCGC                      232               1,234
hERGCentral               4,321             274,536
hERG integrated dataset   9,890             281,329
on 3,721 compounds measured in a binding assay
and 765 compounds measured in a functional assay
collected from ChEMBL. The prediction models
were constructed separately from each data set, and
showed prediction accuracies of 79.7%–80.1% and
69.2%–90.7%, respectively.
Wang et al. (Wang, S. et al, 2012) developed
hERG classification models using naive Bayesian
classification and recursive partitioning based on
molecular properties and the ECFP_8 fingerprints, and
recorded 85% accuracy (Wang, S. et al, 2016).
Schyman et al. (Schyman. P, 2016) combined a 3D
similarity conformation approach (David C. Kombo et
al, 2012) with a 2D similarity ensemble approach, and
achieved 69% sensitivity and 95% specificity on an
independent external data set.
Recently, Keiji Ogura, Tomohiro Sato, Hitomi
Yuki and Teruki Honma (Keiji Ogura et al, 2019) used
Support Vector Machines (SVMs) on an integrated
hERG database of more than 291,000 structurally
diverse compounds. They achieved a kappa statistic of
0.733 and an accuracy of 98.4%, and have made the
dataset publicly available for research purposes.
Table 1 provides a summary of various datasets used
across existing works.
1.2 Contributions of this Work
Most of the works we have reviewed (Wang, S. et
al, 2012), (Wang, S. et al, 2016), (Schyman. P, 2016),
(Keiji Ogura et al, 2019) have used either descriptors
(2D, 3D and 4D) or fingerprints. In contrast, we
have used only 2D descriptors for our classification
model. 2D descriptors capture the molecular topology
of the compounds, i.e. topological indices and
fragment counts, and encode valuable chemical
information such as size, degree of branching and
flexibility. Generating 2D descriptors from the SMILES
representations of the compounds usually takes less
time than generating 3D descriptors. Even with just
2D descriptors, we demonstrate that the proposed
ensemble model achieves very good performance.
In this study, we have developed an ensemble
model consisting of two Random Forest Classifiers,
two Support Vector Machines and one Dense Neural
Network, which achieved an AUC score (Area Under the
ROC Curve) of 0.96 and a Cohen’s Kappa of 0.9195.
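To make the ensemble and the kappa metric concrete, the following is a minimal sketch only. It assumes the five models are combined by averaging their predicted inhibitor probabilities (soft voting); the combination rule, the function names `soft_vote` and `cohen_kappa`, and the toy probabilities are illustrative assumptions, not the implementation used in this work. Cohen's Kappa is computed from its standard definition, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from typing import List, Sequence


def soft_vote(prob_lists: Sequence[Sequence[float]],
              threshold: float = 0.5) -> List[int]:
    """Average each model's P(inhibitor) per compound, then threshold.

    ASSUMPTION: soft voting is used here only for illustration.
    """
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    labels = []
    for i in range(n_samples):
        mean_p = sum(model[i] for model in prob_lists) / n_models
        labels.append(1 if mean_p >= threshold else 0)
    return labels


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's kappa for binary labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of compounds where both labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the class frequencies, summed over classes.
    p_e = 0.0
    for c in (0, 1):
        p_e += (sum(t == c for t in y_true) / n) * (sum(p == c for p in y_pred) / n)
    if p_e == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical probabilities from two RFs, two SVMs and one DNN
    # for four compounds (toy numbers, not results from the paper).
    probs = [
        [0.90, 0.20, 0.60, 0.10],
        [0.80, 0.30, 0.40, 0.20],
        [0.70, 0.10, 0.70, 0.30],
        [0.90, 0.40, 0.30, 0.10],
        [0.95, 0.20, 0.80, 0.40],
    ]
    predictions = soft_vote(probs)
    print(predictions, cohen_kappa([1, 0, 0, 0], predictions))
```

Unlike plain accuracy, kappa discounts agreement that a classifier would reach by chance, which matters on class-imbalanced hERG data.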
Most of the existing models have used Sup-
port Vector Machines, Random Forest Classifiers and
Naive Bayesian Classifiers for prediction. However,
in addition to these models, we have also used Deep
Neural Networks and two different Ensemble classi-
fiers for the task. We have found that the Deep Neural
Networks and the Ensemble classifier yield the high-
est performance.
We have also worked with data augmentation for
our class-imbalanced dataset, using SMOTE
(Synthetic Minority Oversampling Technique) (N. V.
Chawla, et al, 2011). Data augmentation is a useful
procedure because the data that it generates is
similar to the training data. For some diseases, very
few drugs are available, and prediction from few data
points can lead to over-fitting. Data augmentation
can prove useful not only by creating new data; it
may also help in understanding the underlying
distribution of each property (descriptors or
fingerprints) of drugs.
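The core idea of SMOTE can be sketched in a few lines. The version below is a simplified illustration, not the configuration used in this work: it picks a random minority sample, chooses one of its k nearest minority neighbours, and creates a synthetic point on the line segment between them. The function name `smote`, the random choice of base samples, and the default `k` are assumptions made for this sketch.

```python
import math
import random
from typing import List, Sequence


def smote(minority: Sequence[Sequence[float]], n_new: int,
          k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic minority samples by interpolation.

    Simplified sketch: base samples are drawn at random rather than
    oversampling every minority point by a fixed percentage.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest (Euclidean) first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate a random fraction of the way towards the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Toy 2D descriptor vectors for the minority class (hypothetical values).
    inhibitors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(inhibitors, n_new=3, k=2, seed=42):
        print(point)
```

Because every synthetic point lies between two real minority samples, the generated data stays within the region already occupied by the minority class, which is why it resembles the training data.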
Unlike most existing works, we also suggest
an automatic approach based on information
gain/entropy to shortlist (or select) features from a
larger set. To our knowledge, the only existing
method that considers such a selection is the work of
(Keiji Ogura et al, 2019), which uses NSGA-II
(Non-dominated Sorting Genetic Algorithm-II) for
descriptor selection.
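As an illustration of entropy-based shortlisting, the sketch below ranks descriptors by information gain, i.e. the reduction in label entropy obtained by splitting the compounds on a descriptor. Splitting each descriptor at its median, the function names, and the toy descriptor names `desc_a` and `desc_b` are assumptions of this sketch; the exact procedure used in this work may differ.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[int],
                     threshold: float) -> float:
    """Entropy reduction from splitting the labels at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


def rank_features(columns: Dict[str, Sequence[float]],
                  labels: Sequence[int], top_k: int) -> List[str]:
    """Return the top_k descriptor names by information gain.

    ASSUMPTION: each descriptor is split at its median value.
    """
    scores = {
        name: information_gain(values, labels, statistics.median(values))
        for name, values in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Hypothetical descriptor columns for four compounds.
    descriptors = {"desc_a": [1.0, 2.0, 3.0, 4.0],
                   "desc_b": [1.0, 4.0, 2.0, 3.0]}
    labels = [0, 0, 1, 1]
    print(rank_features(descriptors, labels, top_k=1))
```

A descriptor whose split separates inhibitors from inactive compounds cleanly receives a gain close to the full label entropy, while an uninformative descriptor receives a gain near zero and is dropped from the shortlist.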
Finally, towards the end of the paper, we pro-
vide a consolidated summary of the various works in
this domain, the datasets used, the feature descriptors
and methods employed, and their performance across
several metrics. We note that although this work is
not analysing imaging data, it involves core machine
learning on an important problem in biology. Con-
sidering that this is a relatively recent application do-
main, such an overview provides a good perspective
of the trade-offs of the approaches and paves the way
for more standardized benchmarking and extensions
in this area.
2 DATASET AND DESCRIPTORS
In this work, we have used a dataset provided
in one of the competitions of the Drug Discovery
Hackathon, organized by the Govt. of India. The
dataset was prepared by Dr Kunal Roy, professor
in the Department of Pharmaceutical Technology,
Jadavpur University. http://sites.google.com/
Virtual Screening of Pharmaceutical Compounds with hERG Inhibitory Activity (Cardiotoxicity) using Ensemble Learning