PREDICTING NEW HUMAN DRUG TARGETS BY USING

FEATURE SELECTION TECHNIQUES

Eduardo Campos dos Santos, Braulio Roberto Gonçalves Marinho Couto, Marcos A. dos Santos and Julio Cesar Dias Lopes

Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais / UFMG

Av. Antônio Carlos 6627, 31270-901, Belo Horizonte, Brazil

Centro Universitário de Belo Horizonte / UNI-BH, Av. Professor Mário Werneck 1685, 30455-610, Belo Horizonte, Brazil

Departamento de Ciência da Computação, UFMG, Av. Antônio Carlos 6627, 31270-901, Belo Horizonte, Brazil

Departamento de Química, UFMG, Av. Antônio Carlos 6627, 31270-901, Belo Horizonte, Brazil

Keywords: Human drug target, Logistic regression, Case-control study, Prediction models.

Abstract: Drug target identification and validation are critical steps in the drug discovery pipeline. Hence, predicting

potential “druggable targets”, or targets that can be modulated by some drug, is very relevant to drug

discovery. Approaches using structural bioinformatics to predict “druggable domains” have been proposed,

but they have only been applied to proteins that have solved structures or that have a reliable model

predicted by homology. We show that available protein annotation terms may be used to explore semantic-

based measures to provide target similarity searching and develop a tool for potential drug target prediction.

We analysed 1,541 human protein drug targets and 29,580 human proteins not validated as drug targets but

which share some InterPro annotations with a known drug target. We developed a semantic-based similarity

measure by using singular value decomposition over InterPro terms associated with drug targets, performed

statistical analyses and built logistic regression models. We present a probabilistic model summarised in a

closed mathematical formula that allows human protein drug targets to be predicted with a sensitivity of

89% and a specificity of 67%.

1 INTRODUCTION

The identification and validation of drug targets are

critical steps in the drug discovery pipeline. Thus, it

is important to improve the discovery of hidden

target similarities or off-target similarities that can

help select “druggable targets”. Here, we consider

“druggable targets” to be those human or pathogen

proteins that may be modulated by some orally

bioavailable compound. Conversely, “undruggable

targets” are those proteins that are considered too

difficult to be modulated by some drug. Even

“undruggable targets” have been addressed, in

particular in oncology studies (Verdine and Walensky, 2007; Schreiber, 2009), but it is

valuable to distinguish the “more-druggable” and the

“less-druggable” targets before incurring substantial

expenditure and effort (Cheng et al., 2007). To

identify “druggable” and “undruggable” proteins,

some researchers have been developing structure-

based approaches to identify “druggable” and

“undruggable” binding sites and cavities (Haupt and

Schroeder, 2011; Moriaud et al., 2011; Gao et al.,

2008). However, because the majority of drug targets for small-molecule therapeutics are proteins with unsolved three-dimensional structures, structure-based design is not possible for them. Therefore, sequence similarity plays an important role in finding novel “druggable” targets. Indeed, current

public resources containing drug target information

like the Therapeutic Target Database – TTD (Zhu et

al., 2010) and DrugBank (Wishart et al., 2008)

provide target similarity searching based only on the

BLAST algorithm.

Although high sequence similarity is a good

initial guide, it is known that there are also important

structural similarities and other correlations even for

proteins with low sequence similarity (Vidovic and Schürer, 2009; Krissinel, 2007; Gan et al., 2002; Betts et al., 2001). Knowledge-based approaches

may help develop a classification program. Indeed,

in an influential paper, Hopkins and Groom (2002)


Campos dos Santos E., Gonçalves Marinho Couto B., A. dos Santos M. and Dias Lopes J..

PREDICTING NEW HUMAN DRUG TARGETS BY USING FEATURE SELECTION TECHNIQUES.

DOI: 10.5220/0003734501370142

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 137-142

ISBN: 978-989-8425-90-4

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

proposed 130 InterPro entries as sufficient to predict

all the druggable human proteins. This proposal,

together with studies that evaluate semantic

similarity measures based on Gene Ontology

annotations (Lord et al., 2003; Chagoyen et al., 2006), suggests an approach based on investigating

semantic similarity measures of protein targets based

on their InterPro annotations.

In this paper, we explore semantic similarity

across InterPro entries annotated to known drug

targets as an alternative to sequence similarity for

target similarity prediction. The validated targets were collected from TTD (Zhu et al., 2010), DrugBank (Wishart et al., 2008) and KEGG-Drug (Kanehisa et al., 2010). We represented the targets in

a vector space model (VSM) in which targets were

recoded as column vectors and the descriptors

(rows) were, initially, all the InterPro terms that

occur in the target set. The next step was to reduce

the dimensionality of the problem. The goal is to

select those descriptors that result in the “best”

model. The rationale for minimising the number of

descriptors in the model is that the resultant model

tends to be more efficient when redundant and

irrelevant attributes are eliminated (Hosmer and

Lemeshow, 2000; Chen et al., 2008). We applied the

cosine similarity measure as described by Chagoyen

et al. (2006) to compute the pairwise similarity

among the targets represented in a transformed

feature space reduced by using Singular Value

Decomposition (SVD). We showed that our

annotation-based similarity metric is consistent with

BLAST and results in better discrimination of the

target clusters. Afterwards, we projected other

human proteins in the reduced space, calculated the similarity coefficient between each new protein and each validated target and then constructed a control

set approximately five times the size of the original

validated group. This control set was used in a case-

control study. It was constructed by selecting a

subset of the proteins that resulted in lower

maximum similarity coefficients against the drug

target set. Then, we applied regression models to

minimise the number of the descriptors from the

original full data set.

2 MATERIAL AND METHODS

We constructed a matrix with 1,541 binary vectors that represent known protein drug targets retrieved from public databases: TTD (Zhu et al., 2010), DrugBank (Wishart et al., 2008) and KEGG-Drug (Kanehisa et al., 2010). Each protein-representing vector is a set of 2,700 binary descriptors, each of them representing an InterPro annotation. Therefore, protein drug targets were recoded as vectors in $\mathbb{R}^m$, where m is the number of InterPro descriptors analysed (2,700). In this way, the target database is a sparse matrix M, with dimension m × n, where n is the number of proteins in the data set and each entry of a binary vector indicates the presence or

absence of an InterPro annotation:

$$M = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

where $x_{ij}$ is the presence (1) or absence (0) of the InterPro descriptor i on the protein drug target vector j. The matrix M was decomposed by using SVD (Golub and Kahan, 1965) and factorised as $M = USV^T$. The singular values placed in decreasing

order along the main diagonal of S are directly

related to the independent characteristics within the

dataset (Deerwester et al., 1990; Berry et al., 1995;

Eldén, 2006). To transform the matrix $M_{2700 \times 1541}$ into an information retrieval system, it was necessary to determine the best low-rank approximation $M_k$ in a reduced space:

$$M \approx M_k = U_k S_k V_k^T = \sum_{e=1}^{k} u_e s_e v_e^T \qquad (1)$$

where $u_e$ and $v_e^T$ are, respectively, the column vectors of U and the row vectors of $V^T$, both related to the e-th singular value $s_e$ in decreasing order, and k is the rank of the matrix $M_k$. We selected k = 320 factors by applying the scree test (Cattell, 1966) to determine the low-rank approximation $M_k$ (Figure 1). The factorisation provided a reduced

dimensionality space in which relationships among

the drug targets could be established. The similarity

between any pair of drug targets was calculated as

the cosine of the angle between the respective target

representing vectors on the reduced space.

Therefore, the similarity measure of a pair of targets

is equivalent to the dot product between the

respective rows of the matrix V.
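The decomposition and similarity steps described above can be sketched with NumPy. This is a minimal illustration rather than the authors' implementation: the toy matrix, the function names `reduce_targets` and `cosine_matrix`, and the choice to scale the rows of V_k by the singular values before taking cosines are all assumptions made for the example.

```python
import numpy as np

def reduce_targets(M, k):
    """Truncated SVD of an (m descriptors x n targets) binary matrix.

    Returns U_k, the k largest singular values, and one k-dimensional
    coordinate row per target (rows of V_k scaled by the singular values;
    the scaling is an assumption of this sketch).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # s is in decreasing order
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T
    return U_k, s_k, V_k * s_k

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return unit @ unit.T

# Toy data: 5 hypothetical InterPro descriptors (rows) x 4 targets (columns).
M = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

U_k, s_k, coords = reduce_targets(M, k=2)
sim = cosine_matrix(coords)
print(np.round(sim, 3))
```

In this toy case targets 0 and 1 share descriptors, as do targets 2 and 3, so the reduced space places each pair on its own axis and the cross-pair similarities vanish.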

To validate our semantic-based similarity metric,

we compared our results with those given by the

BLAST algorithm. Figure 2 shows the scatter plot of

a distance-like coefficient given by our methodology


versus the bit score given by BLAST. To convert the

pairwise cosine coefficients into pairwise distance

coefficients, we applied the transformation formula

proposed by Stuart et al. (2002): $d_{ij} = -\ln((1 + \cos\theta_{ij})/2)$. The exponential rate of the

correlation scatter plot illustrates the known

characteristic of SVD as distances become more

discriminated – similar entities become more similar

and dissimilar entities become more dissimilar in the

reduced space vector.
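The cosine-to-distance transformation quoted above is simple enough to check numerically. The sketch below uses only the standard library; the function name `cosine_to_distance` is a hypothetical label chosen for this example.

```python
import math

def cosine_to_distance(cos_theta):
    """Distance-like coefficient d = -ln((1 + cos(theta)) / 2)."""
    if not -1.0 < cos_theta <= 1.0:
        # cos(theta) = -1 would require log(0), which is undefined
        raise ValueError("cosine must lie in the interval (-1, 1]")
    return -math.log((1.0 + cos_theta) / 2.0)

# Identical directions give distance 0; orthogonal vectors give ln(2).
for c in (1.0, 0.5, 0.0, -0.5):
    print(c, round(cosine_to_distance(c), 4))
```

The distance grows without bound as the cosine approaches -1, which is one way to see why dissimilar pairs are spread out more strongly than similar ones.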

Figure 1: Singular values of M (as obtained by SVD

factorisation) plotted in decreasing order. The X axis

corresponds to the singular value index. The first k = 320 largest singular values were selected by the scree test.

The second step was to select the control group,

i.e., protein sequences classified as undruggable

targets. To do so, we collected from UniProt (The UniProt Consortium, 2010) 29,580 human proteins that are not validated as drug targets but share at least one InterPro annotation with one of the 1,541 drug targets. Each of the 29,580 non-target candidate sequences was recoded as a vector in $\mathbb{R}^{2700}$, where

the space dimensionality (2,700) is given by the

number of InterPro descriptors considered to

generate the query vectors (q). Thus, each query

vector was projected into reduced space obtained by

SVD; formally, q*=q

. Afterwards, we computed the pairwise distance coefficient between each reduced query vector (q*) and all drug target vectors in the reduced space ($M_k$), which generated 1,541 pairwise distances for each of the 29,580 non-target candidate sequences. The maximum pairwise distance of each candidate sequence was selected, and the 75th percentile ($p_{75}$) of these maximum distances was chosen as the cut-off value to classify a candidate sequence as a non-target protein. All sequences whose query vector (q*) had a maximum pairwise distance less than 1.2821 ($p_{75}$) were

classified as non-target sequences, becoming the

control group, totalling 7,830 proteins.
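The projection and percentile steps can be sketched as follows. Because the projection formula is not fully legible in the text, this example assumes the common latent semantic indexing fold-in q* = qᵀU_k, which gives query coordinates on the same scale as the target coordinates MᵀU_k; the toy matrices and the function names `unit_rows` and `max_distances` are likewise assumptions.

```python
import numpy as np

def unit_rows(X):
    """Scale each row of X to unit length (zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def max_distances(Q, M, k):
    """Maximum distance from each candidate (row of Q) to any target (column of M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k = U[:, :k]
    targets = M.T @ U_k   # targets in the reduced space
    queries = Q @ U_k     # ASSUMED fold-in: q* = q^T U_k
    cos = unit_rows(queries) @ unit_rows(targets).T
    cos = np.clip(cos, -1.0 + 1e-12, 1.0)       # keep the logarithm finite
    dist = -np.log((1.0 + cos) / 2.0)           # d = -ln((1 + cos) / 2)
    return dist.max(axis=1)

# Toy data: 5 hypothetical descriptors x 4 targets, and 3 candidate proteins.
M = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)
Q = np.array([
    [1, 1, 0, 0, 0],   # same annotations as target 0
    [0, 0, 0, 1, 1],   # same annotations as target 2
    [1, 0, 0, 1, 0],   # mixes annotations from both groups
], dtype=float)

max_d = max_distances(Q, M, k=2)
cutoff = np.percentile(max_d, 75)   # 75th percentile of the maximum distances
print(np.round(max_d, 4), round(float(cutoff), 4))
```

Candidates would then be compared against `cutoff` to form the control group, following the selection rule stated in the text.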

Figure 2: Correlation scatter plot of the pairwise distance $d_{ij}$ between protein vectors ($d_{ij} = -\ln((1 + \cos\theta_{ij})/2)$) and BLAST bit score. The exponential rate may be explained by the known characteristic of SVD that distances become more discriminated: similar entities become more similar and dissimilar entities become more dissimilar in the reduced vector space. A negative correlation was expected because the higher the similarity between two proteins, the smaller the related distance and the higher the bit score.

The third step of this study was to build a model

to predict new human druggable target proteins. This

was done by performing a case-control study

(Schlesselman, 1982). Approximately 20% of the

1,541 targets (384 sequences) were extracted

randomly for validation, and the remaining 1,157

were used as the case set. For the control group, the 7,830 non-target sequences were randomly assigned either to the model-building set (5,821 sequences) or to model validation (2,009 sequences). Thus, the final sample size was 6,978 (5,821 + 1,157). All InterPro

annotations were considered as variable candidates

for the model. During the SVD analysis, we used

2,700 InterPro annotations of five types: Family (F),

Domain (D), Region (R), Active Site (A) and

Binding Site (B). However, to avoid redundancies, we considered only InterPro annotations of F, D or R types during the predictive model construction. Thus, only 2,390 InterPro annotations were considered in the model analysis.

A logistic regression model was developed for

the case-control study, allowing feature selection. In

addition to feature selection, the logistic model can

also be used to predict the probability ($\pi$) that a sequence is a druggable target based on a combination of the k InterPro annotations selected in the model:

$$\pi = \frac{\exp\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right)} \qquad (2)$$

In Equation (2), $\pi$ is the probability of a sequence belonging to the drug target group, k is the number of explanatory features (InterPro annotations) significantly selected for the model, $x_i$ indicates the presence (1) or absence (0) of the i-th selected InterPro annotation, $\beta_0$ is the intercept and $\beta_i$ is the regression coefficient for each InterPro annotation (i = 1, 2, 3, …, k). The model-building strategy for the

feature selection was an automatic forward stepwise

logistic regression performed by SPSS - Statistical

Package for the Social Sciences (SPSS Inc., 2008).

Before performing the logistic regression, a univariate analysis using Fisher’s exact test was performed to pre-select the subset of the 2,390 InterPro annotations to be used in the stepwise logistic regression (Altman, 1991). Only InterPro annotations with a p-value less than or equal to 0.05 by Fisher’s exact test (univariate analysis) were used in the multivariate analysis. This stringent cut-off was chosen because of the excessive number of candidate features (2,390).
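As an illustration of the univariate filter, the sketch below computes a two-sided Fisher's exact test with the standard library only. The counts are those reported for IPR001828 and IPR016175 in the case-control sample; the function name `fisher_exact_two_sided` and the "sum all tables no more probable than the observed one" definition of the two-sided p-value are choices made for this example, not a description of the SPSS procedure the authors used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: annotation present / absent. Columns: target / non-target.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))

# IPR001828: present in 21 sequences (19 targets); absent in 6,957 (1,138 targets).
p_pos = fisher_exact_two_sided(19, 2, 1138, 5819)
# IPR016175: present in 230 sequences (0 targets); absent in 6,748 (1,157 targets).
p_neg = fisher_exact_two_sided(0, 230, 1157, 5591)
print(p_pos <= 0.05, p_neg <= 0.05)
```

Both annotations pass the 0.05 filter, one because it is enriched among targets and the other because it is depleted.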

After model building, the best cut-off for the probabilities calculated by the logistic model, used to classify a new sequence as a potential drug target, was defined by receiver operating characteristic (ROC) curve analysis (Altman, 1991).

3 RESULTS

The sample size used in the case-control study was

composed of 1,157 targets (cases) and 5,821 non-

target sequences (controls), totaling 6,978 proteins.

Univariate analysis performed by Fisher’s exact test

selected 587 InterPro entries from 2,390 annotations

initially involved in the study. Some InterPro annotations were selected because their presence increases the chance that a sequence is a druggable target (as is the case for IPR001828, Table 1). Other InterPro annotations were selected because their presence reduces the chance that a sequence is a druggable target (for example IPR016175, Table 2).

Of the 587 InterPro entries selected by the univariate analysis and submitted to the automatic forward stepwise logistic regression, 66 were identified as independently associated with the drug target status. Table 3 presents the InterPro annotations identified and the β parameters from Equation (2) estimated for the logistic regression model to predict drug target sequences. If the beta value is negative, the presence of the InterPro annotation reduces the chance that a sequence is a

druggable target. On the other hand, if the beta value

is positive, the presence of the InterPro annotation

increases the chance that a sequence is a druggable

target.

Table 1: Univariate analysis for InterPro IPR001828 – its presence increases the chance that a sequence is a druggable target.

| InterPro IPR001828 | Sample size | Number of target sequences | Percent of target sequences | p-value |
|---|---|---|---|---|
| Presence | 21 | 19 | 90% | < 0.001 |
| Absence | 6,957 | 1,138 | 16% | |
| Total | 6,978 | 1,157 | 17% | |

Table 2: Univariate analysis for InterPro IPR016175 – its presence reduces the chance that a sequence is a druggable target.

| InterPro IPR016175 | Sample size | Number of target sequences | Percent of target sequences | p-value |
|---|---|---|---|---|
| Presence | 230 | 0 | 0% | < 0.001 |
| Absence | 6,748 | 1,157 | 17% | |
| Total | 6,978 | 1,157 | 17% | |

Because the logistic model in Equation (2) provides a probability value ranging from 0.0 to

1.0, we need to choose a cut-off value to define if a

sequence is in the drug target group. Actually,

logistic regression allows us to distinguish those

sequences likely or unlikely to be a druggable target,

providing a probability value. Usually the cut-off is

0.50, meaning that if the probability that the

sequence is in the drug target group is higher than

0.50, then the sequence is classified as a potential

druggable target. However, other cut-offs can be

used according to the ROC analysis (Figure 3). The best probability cut-off is 0.25, which jointly maximises sensitivity and specificity, corresponding to the point nearest the top left-hand corner of the ROC curve.
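The "nearest the top left-hand corner" criterion can be illustrated with a short sketch. The probabilities and labels below are invented toy values, and the function name `best_cutoff` is hypothetical; only the selection rule itself (minimum Euclidean distance to the point sensitivity = 1, specificity = 1) follows the text.

```python
import math

def best_cutoff(probs, labels, cutoffs):
    """Return (cutoff, sensitivity, specificity) nearest the top-left ROC corner.

    Assumes `labels` contains at least one positive (1) and one negative (0).
    """
    best = None
    for c in cutoffs:
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > c)
        fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p <= c)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p <= c)
        fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p > c)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        # Distance from the ROC point (1 - spec, sens) to the ideal corner (0, 1).
        dist = math.hypot(1.0 - sens, 1.0 - spec)
        if best is None or dist < best[0]:
            best = (dist, c, sens, spec)
    return best[1:]

# Toy model outputs: 4 known targets (label 1) and 4 non-targets (label 0).
probs = [0.9, 0.6, 0.3, 0.2, 0.4, 0.1, 0.05, 0.15]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoff, sens, spec = best_cutoff(probs, labels, cutoffs=[0.1, 0.25, 0.5])
print(cutoff, sens, spec)
```

With these toy values the rule prefers the intermediate cut-off, which trades a little sensitivity for a larger gain in specificity.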

To validate the model, we reserved 384 known

targets and 2,009 control sequences, totalling 2,393

proteins. Classification quality of these sample

queries is summarised in Table 4. The sensitivity of

classifying unknown sequences was 89%, and the

specificity was 67%. Because we used 0.25 as the cut-off, a query sequence is classified as a potential druggable target if its model probability is higher than 0.25.
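The reported sensitivity and specificity follow directly from the validation counts in Table 4. A minimal check, with variable names chosen for this example:

```python
# Validation counts at cut-off 0.25 (Table 4).
true_pos, false_neg = 340, 44      # 384 known drug targets
false_pos, true_neg = 661, 1348    # 2,009 non-target controls

sensitivity = true_pos / (true_pos + false_neg)   # share of targets recovered
specificity = true_neg / (true_neg + false_pos)   # share of controls rejected

print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```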


Table 3: Logistic regression model built for predicting if a sequence is a druggable target.

| InterPro | β | p-value | InterPro | β | p-value |
|---|---|---|---|---|---|
| IPR016175 | -7.3 | 0.067 | IPR001023 | -1.9 | 0.009 |
| IPR012677 | -4.8 | 0.000 | IPR020685 | -1.9 | 0.000 |
| IPR010993 | -4.5 | 0.004 | IPR003593 | -1.6 | 0.000 |
| IPR004000 | -3.7 | 0.000 | IPR003596 | -1.4 | 0.034 |
| IPR000883 | -3.4 | 0.001 | IPR016040 | -0.6 | 0.001 |
| IPR008973 | -3.3 | 0.000 | IPR001452 | 1.3 | 0.032 |
| IPR001173 | -2.8 | 0.006 | IPR020683 | 1.3 | 0.045 |
| IPR016137 | -2.6 | 0.011 | IPR013099 | 1.4 | 0.065 |
| IPR013783 | -2.6 | 0.000 | IPR000980 | 1.5 | 0.016 |
| IPR013766 | -2.5 | 0.012 | IPR015421 | 1.6 | 0.000 |
| IPR002213 | -2.4 | 0.001 | IPR011029 | 1.6 | 0.006 |
| IPR011009 | -2.4 | 0.000 | IPR000472 | 1.8 | 0.030 |
| IPR000873 | -2.1 | 0.003 | IPR013816 | 1.8 | 0.041 |
| IPR000010 | -2.1 | 0.040 | IPR000889 | 1.8 | 0.031 |
| IPR003597 | -2.1 | 0.001 | IPR011348 | 2.2 | 0.080 |
| IPR008753 | -2.0 | 0.052 | IPR007698 | 2.2 | 0.080 |
| IPR001353 | -1.9 | 0.008 | IPR014756 | 2.2 | 0.004 |
| IPR011497 | 2.2 | 0.074 | IPR015741 | 3.3 | 0.001 |
| IPR005225 | 2.3 | 0.013 | IPR017193 | 3.3 | 0.023 |
| IPR001251 | 2.3 | 0.028 | IPR000626 | 3.3 | 0.023 |
| IPR002035 | 2.4 | 0.000 | IPR020663 | 3.3 | 0.000 |
| IPR001841 | 2.5 | 0.028 | IPR008979 | 3.3 | 0.002 |
| IPR011304 | 2.6 | 0.028 | IPR009130 | 3.6 | 0.018 |
| IPR000157 | 2.6 | 0.012 | IPR014729 | 3.8 | 0.000 |
| IPR013027 | 2.7 | 0.014 | IPR001828 | 3.8 | 0.000 |
| IPR002314 | 2.8 | 0.034 | IPR003116 | 3.9 | 0.002 |
| IPR008957 | 2.9 | 0.000 | IPR020722 | 4.0 | 0.039 |
| IPR015015 | 2.9 | 0.020 | IPR020727 | 4.0 | 0.023 |
| IPR011992 | 3.0 | 0.000 | IPR009134 | 5.0 | 0.007 |
| IPR005834 | 3.1 | 0.010 | IPR002126 | 5.2 | 0.001 |
| IPR009030 | 3.2 | 0.001 | IPR008424 | 5.2 | 0.000 |
| IPR005821 | 3.2 | 0.000 | IPR000353 | 5.6 | 0.000 |
| IPR000001 | 3.2 | 0.030 | IPR016243 | 7.7 | 0.000 |

Obs.: $\beta_0$ = -0.9
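To show how the coefficients in Table 3 translate into a prediction, the sketch below evaluates the standard logistic form π = 1 / (1 + exp(-(β₀ + Σ βᵢxᵢ))) for a protein described by its InterPro annotations. It assumes that this standard form is what Equation (2) denotes, and it loads only three of the 66 coefficients for brevity, so the printed probabilities are illustrative rather than full model outputs. The names `target_probability` and `is_potential_target` are hypothetical.

```python
import math

BETA_0 = -0.9  # intercept reported with Table 3
# Illustrative SUBSET of the 66 coefficients in Table 3.
BETA = {
    "IPR001828": 3.8,
    "IPR016175": -7.3,
    "IPR016040": -0.6,
}

def target_probability(annotations, beta=BETA, beta_0=BETA_0):
    """Logistic probability that a protein with these annotations is a drug target."""
    # Annotations that are not in the model contribute nothing to the logit.
    logit = beta_0 + sum(beta.get(a, 0.0) for a in set(annotations))
    return 1.0 / (1.0 + math.exp(-logit))

def is_potential_target(annotations, cutoff=0.25):
    """Apply the ROC-derived cut-off of 0.25 to the model probability."""
    return target_probability(annotations) > cutoff

for example in (["IPR001828"], ["IPR016175"], []):
    print(example, round(target_probability(example), 4), is_potential_target(example))
```

Under this form, a protein carrying none of the selected annotations has a probability of about 0.29 from the intercept alone, which is above the 0.25 cut-off; this is in line with the model favouring sensitivity over specificity.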

Figure 3: ROC curve analysis for predicting a druggable

target. The best cut-off for maximum sensitivity and

specificity is a probability higher than 0.25 (area under the

curve = 0.828).

Table 4: Classification quality of sample queries with the logistic regression model for predicting drug targets.

| Group | Classified (+) at cut-off = 0.25 in logistic probability | Classified (-) at cut-off = 0.25 in logistic probability | Total |
|---|---|---|---|
| Drug target | 340 | 44 | 384 |
| Non-target | 661 | 1,348 | 2,009 |
| Total | 1,001 | 1,392 | 2,393 |

4 CONCLUSIONS

We identified 66 features (InterPro entries) that

allow retrieval of protein drug targets with a

sensitivity of 89% and a specificity of 67%.

The model provided a statistical evaluation over

the current protein annotation to predict potential

drug targets or, at least, potential “druggable

targets”, meaning proteins that potentially can be

modulated by an orally bioavailable drug. The

model gives us a closed formula to calculate the

probability that a given sequence, described by its

biological annotations, is druggable.

Though “druggable targets” are different from

“therapeutic drug targets”, their prediction is a good

contribution to drug development focusing on drug

target research.

Our model differs from the approach of Hopkins

and Groom (2002) by including not only InterPro

annotations that contribute positively to classifying a

protein as druggable, but also by including those

annotations that contribute negatively. Our model is

more restrictive and gives results closer to the

proteins that actually are therapeutic drug targets.

REFERENCES

Altman, D. G. (1991). Practical Statistics for Medical

Research. Chapman & Hall.

Betts, M. J., Guigó, R., Agarwal, P., Russell, R. B. (2001).

Exon structure conservation despite low sequence

similarity: a relic of dramatic events in evolution? The EMBO Journal, 20(19), 5354–5360.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276.

Chagoyen, M., Carmona-Saez, P., Gil, C., Carazo, J. M.,

Pascual-Montano, A. (2006). A literature-based

similarity metric for biological processes. BMC

Bioinformatics, 7, 363–375.


Chen, M.-C., Chen, L.-S., Hsu, C.-C., Zeng, W.-R.

(2008). An information granulation based data mining

approach for classifying imbalanced data. Information

Sciences, 178, 3214–3227.

Cheng, A. C., Coleman, R. G., Smyth, K. T., Cao, Q.,

Soulard, P., Caffrey, D. R., Salzberg, A. C., Huang, E.

S. (2007). Structure-based maximal affinity model

predicts small-molecule druggability. Nature

Biotechnology, 25(1), 71–75.

Berry, M. W. et al. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37, 573–595.

Deerwester, S. et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1–13.

Eldén, L. (2006). Numerical linear algebra in data mining. Acta Numerica, 327–384.

Gan, H. H., Perlow, R. A., Roy, S., Ko, J., Wu, M.,

Huang, J., Yan, S., Nicoletta, A., Vafai, J., Sun, D.,

Wang, L., Noah, J. E., Pasquali, S., Schlick, T. (2002).

Analysis of protein sequence/structure similarity

relationships. Biophysical Journal, 83, 2781–2791.

Gao, Z., Li, H., Zhang, H., Liu, X., Kang, L., Luo, X.,

Zhu, W., Chen, K., Wang, X., and Jiang, H. (2008).

PDTD: a web-accessible protein database for drug target

identification. BMC Bioinformatics, 9(1), 104.

Golub, G. and Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal. Ser. B, 2(2), 205–224.

Haupt, V. J. and Schroeder, M. (2011). Old friends in new

guise: repositioning of known drugs with structural

bioinformatics. Briefings in Bioinformatics.

Hopkins, A. L., Groom, C. R. (2002). The druggable

genome. Nature reviews. Drug discovery, 1(9), 727–

730.

Hosmer, D. W. and Lemeshow, S. (2000). Applied logistic

regression (Wiley Series in probability and statistics).

Wiley-Interscience Publication.

Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M.,

Hirakawa, M. (2010). KEGG for representation and

analysis of molecular networks involving diseases and

drugs. Nucleic Acids Research, 38, D355–D360.

Krissinel, E. (2007). On the relationship between sequence

and structure similarities in proteomics.

Bioinformatics, 23(6), 717–723.

Liu, T., Chen, Z., Zhang, B., Ma, W.-y., Wu, G. (2004).

Improving text classification using local latent

semantic indexing. In Proceedings of the Fourth IEEE

International Conference on Data Mining, ICDM ’04,

pages 162–169, Washington, DC, USA. IEEE

Computer Society.

Lord, P., Stevens, R., Brass, A., Goble, C. (2003).

Investigating semantic similarity measures across the

gene ontology: the relationship between sequence and

annotation. Bioinformatics, 19, 1275–1283.

Moriaud, F., Richard, S. B., Adcock, S. A., Chanas-

Martin, L., Surgand, J.-S., Ben Jelloul, M., and

Delfaud, F. (2011). Identify drug repurposing

candidates by mining the protein data bank. Briefings

in Bioinformatics.

Schlesselman, J. J. (1982). Case-Control Studies. Oxford

U. Press.

The UniProt Consortium. (2010). The Universal Protein

Resource (UniProt) in 2010. Nucleic Acids Research, 38(suppl 1), D142–D148.

Schreiber, S. L. (2009). Organic chemistry: Molecular

diversity by design. Nature, 457, 153–154.

SPSS Inc. (2008). Statistical Package for the Social Sciences (SPSS) for Windows.

Stuart, G. W., Moffett, K., Leader, J. J. (2002). A

comprehensive vertebrate phylogeny using vector

representations of protein sequences from whole

genomes. Mol Biol Evol, 19(4), 554–562.

Verdine, G. L., Walensky, L. D. (2007). The challenge of

drugging undruggable targets in cancer: Lessons

learned from targeting bcl-2 family members. Clinical

Cancer Research, 13(24), 7264–7270.

Vidovic, D., Schürer, S. C. (2009). Knowledge-based

characterization of similarity relationships in the

human protein – tyrosine phosphatase family for

rational inhibitor design. Journal of Medicinal

Chemistry, 52(21), 6649–6659.

Wishart, D. S., Knox, C., Guo, A. C., Cheng, D.,

Shrivastava, S., Tzur, D., Gautam, B., Hassanali, M.

(2008). DrugBank: a knowledge base for drugs, drug

actions and drug targets. Nucleic Acids Research –

Database issue, 36, D901–D906.

Zhu, F., Han, B., Kumar, P., Liu, X., Ma, X., Wei, X.,

Huang, L., Guo, Y., Han, L., Zheng, C., and Chen, Y.

(2010). Update of TTD: Therapeutic target database.

Nucleic Acids Research – Database issue, 38, D787–

D791.
