prospect for similar terms in the following
databases: Protein Information Resource (PIR, 2014),
The Open Biological and Biomedical Ontologies
(OBO, 2014), Protein Data Bank (PDB, 2014) and
UniProt Consortium (UniProt, 2013). We will
employ methods for forming domains and
associating terms, such as controlled vocabularies
(Lancaster, 1986), descriptors and disjoint sets
(Swanson, 2006) and information management
(Berners-Lee, 1990), among others.
We will obtain terms that define an ontological
representation (Campos et al., 2009) of proteins and
their explicit relationships, from which distinct classes
can be derived and displayed by agglomerative clustering methods.
The underlying assumption is that clustered
information can establish relevant relationships
within the groups to be formed, making it possible
to assess the degree of similarity between protein
structures, functions and names.
4 RESEARCH METHODOLOGY
The project is divided into two phases. Phase I
defines the programming framework and the terms
used to search for and retrieve abstracts of articles
deposited in the PubMed database, followed by text
mining of those abstracts. In Phase II, we will
inspect the full texts of the previously selected
articles for protein names and search biological
databases for proteins with similar structure or
function. Finally, we will suggest candidates for
drug repositioning.
4.1 Phase I
The methodology will be implemented in the R
programming language and environment (R
Foundation, 2002), free software designed to
manipulate large amounts of data and optimized for
computation and for presenting results graphically.
The data source will be PubMed. Currently, this
database contains more than 23 million citations of
biomedical literature from MEDLINE, scientific
journals and online books. Some citations include
links to the full-text content in PubMed Central and
on publisher sites (PubMed, 2014).
The chosen terms will comprise keywords, words
correlated with the topic, and subjects related to
the query. Here, we will use the following terms:
dengue, Chagas disease, malaria, leishmaniasis,
plasmodium and trypanosome.
The inputs will consist of abstracts collected from
PubMed in its standard format (NLM, 2014) and
converted to semi-structured data in the R
programming language and environment (Feinerer, 2014).
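As a minimal sketch of this collection step, the abstracts could be retrieved through the NCBI E-utilities, for example with the RISmed package; the package choice, the query string and the retmax limit below are illustrative assumptions, not part of the original methodology.

    # Illustrative retrieval of PubMed abstracts via RISmed (assumed tooling)
    library(RISmed)

    terms <- c("dengue", "Chagas disease", "malaria",
               "leishmaniasis", "plasmodium", "trypanosome")
    query <- paste(terms, collapse = " OR ")

    # Search PubMed and download the matching records;
    # retmax = 100 is an arbitrary cap for this sketch
    res     <- EUtilsSummary(query, type = "esearch", db = "pubmed",
                             retmax = 100)
    records <- EUtilsGet(res)

    # Character vector with one abstract per retrieved citation
    abstracts <- AbstractText(records)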
The semi-structured data will be referred to as a
textual corpus, or simply corpus.
The digital text documents in their raw
format, i.e., XML records carrying metadata, will
require treatment to form a textual corpus
(Feinerer, 2008), which must be modified so that
only the words relevant to the proposed topic
remain in its content.
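A minimal sketch of the corpus construction with the tm package (Feinerer, 2008) is shown below, assuming the abstracts have already been extracted into a character vector as above.

    # Build a corpus in which each abstract becomes one plain-text document
    library(tm)

    corpus <- VCorpus(VectorSource(abstracts))
    inspect(corpus[1:2])  # examine the first two documents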
Preprocessing should be understood as the
initial phase of text mining. In the first step, spurious
words that do not reflect the central theme are
removed; the objective is to extract a set of words
representing the whole textual body submitted to
natural language processing. The second step builds
a term-versus-document matrix following the model
of a vector space (Salton, 1975), whose purpose is to
obtain the set of documents, their terms and the
respective frequencies. The third step is the analysis
and visualization of the data by means of clusters,
dendrograms and word clouds, among other
techniques and functions.
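The three steps could be realized with tm and the wordcloud package roughly as follows; the cleaning transformations, the 0.90 sparsity threshold and the clustering method are illustrative choices, not prescribed by the methodology.

    library(tm)
    library(wordcloud)

    # Step 1: cleaning -- lower-casing, punctuation, numbers,
    # standard English stop words and excess whitespace
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)

    # Step 2: term-versus-document matrix in the vector space model
    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    # Step 3: visualization -- dendrogram of the denser terms, word cloud
    m  <- as.matrix(removeSparseTerms(tdm, sparse = 0.90))
    hc <- hclust(dist(scale(m)), method = "complete")
    plot(hc)
    wordcloud(names(freq), freq, max.words = 50)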
Spurious data and stop words are terms that do
not convey the central theme of the text, such as
prepositions, articles, country names, slang, etc.
Consequently, they will be eliminated to obtain a
concise textual body, which will facilitate the
execution of the subsequent procedures. It is
necessary to remove: a) words previously recorded
in the dictionary; b) names of countries, continents
and nationalities; c) prefixes, suffixes and verbs; d)
measurement units; e) terms identified throughout
the processing that do not agree with the results
obtained in the following phases. Together these
form a new group of spurious terms, which will be
verified and, if needed, registered in the word
dictionary.
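In tm, such project-specific lists can be applied in the same way as standard stop words; the dictionary file name below is hypothetical.

    # Remove project-specific spurious terms kept in a plain-text
    # dictionary, one term per line (file name is hypothetical)
    spurious <- readLines("spurious_terms.txt")
    corpus   <- tm_map(corpus, removeWords, spurious)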
In the present project, indexing and normalizing
the textual body will consist of disambiguating
words to reduce variability. The goal is to reduce a
set of words that share the same sense or meaning
to a single common term.
Term extraction will then yield, after processing of
the textual corpus, a set of words in indexed and
normalized form.
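One common way to obtain such normalized forms, assuming stemming is an acceptable approximation of this normalization, is the Snowball stemmer available through tm:

    # Reduce inflected variants to a common stem (e.g. "infections",
    # "infected" -> "infect"); requires the SnowballC package
    library(SnowballC)

    corpus <- tm_map(corpus, stemDocument)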
Finally, the terms obtained in the extraction
process will be analysed to identify which abstracts
best represent the central topic and are therefore
representative of their corresponding full texts.
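As an illustrative sketch only, such abstracts could be ranked by how often they contain the extracted key terms; the scoring rule and the term subset below are assumptions, not the paper's stated procedure.

    # Score each document by the summed frequency of selected key terms
    # and keep the ten highest-scoring abstracts (all choices illustrative)
    m <- as.matrix(TermDocumentMatrix(corpus))

    key_terms <- c("dengue", "malaria")  # hypothetical subset of terms
    scores <- colSums(m[rownames(m) %in% key_terms, , drop = FALSE])
    top    <- order(scores, decreasing = TRUE)[1:10]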