prospect for similar terms in the following
databases: Protein Information Resource (PIR, 2014),
The Open Biological and Biomedical Ontologies
(OBO, 2014), Protein Data Bank (PDB, 2014) and
UniProt Consortium (UniProt, 2013). We will
employ methods for forming domains and
associating terms, such as controlled vocabularies
(Lancaster, 1986), descriptors and disjoint sets
(Swanson, 2006) and information management
(Berners-Lee, 1990), among others.
We will obtain terms that define an ontological
representation (Campos et al., 2009) of proteins and
their explicit relationships, from which distinct classes
can be derived and displayed by agglomerative clustering methods.
The underlying assumption is that clustered
information can establish relevant relationships
within the groups to be formed, making it possible
to assess the degree of similarity between protein
structures, functions and names.
4 RESEARCH METHODOLOGY
The project is divided into two phases. Phase I
defines the programming framework and the terms
used to search for and retrieve abstracts of articles
deposited in the PubMed database, followed by text
mining of those abstracts. In Phase II, we will
inspect the full texts of the previously selected
articles for protein names and search biological
databases for proteins with similar structure or
function. Finally, we will suggest candidates for
drug repositioning.
4.1 Phase I
The methodology will be implemented in the R
programming language and environment (R
Foundation, 2002), free software designed to
manipulate large amounts of data and optimized for
computation and for presenting results graphically.
The data source will be PubMed. Currently, this
database contains more than 23 million citations of
biomedical literature from MEDLINE, scientific
journals and online books. Some citations include
links to the full-text content in PubMed Central and
on publisher sites (PubMed, 2014).
The chosen terms will comprise keywords, words
correlated with the topic, and subjects related to
the query. Here, we will use the following terms:
dengue, Chagas disease, malaria, leishmaniasis,
plasmodium and trypanosome.
The inputs will consist of abstracts collected from
PubMed in its standard format (NLM, 2014) and
converted to semi-structured data in the R
programming language and environment (Feinerer, 2014).
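As a minimal sketch of this collection step, the abstracts could be retrieved through the NCBI E-utilities, for example with the RISmed package; the package choice, the query string and the retmax limit below are illustrative assumptions, not part of the original methodology.

    # Illustrative retrieval of PubMed abstracts via RISmed (assumed tooling)
    library(RISmed)

    terms <- c("dengue", "Chagas disease", "malaria",
               "leishmaniasis", "plasmodium", "trypanosome")
    query <- paste(terms, collapse = " OR ")

    # Search PubMed and download the matching records;
    # retmax = 100 is an arbitrary cap for this sketch
    res     <- EUtilsSummary(query, type = "esearch", db = "pubmed",
                             retmax = 100)
    records <- EUtilsGet(res)

    # Character vector with one abstract per retrieved citation
    abstracts <- AbstractText(records)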
The semi-structured data will be referred to as a
textual corpus, or simply corpus.
The digital text documents in their raw
format, i.e., XML records carrying metadata, will
require treatment to form a textual corpus
(Feinerer, 2008), which must be modified so that
only the words relevant to the proposed topic
remain in its content.
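A minimal sketch of the corpus construction with the tm package (Feinerer, 2008) is shown below, assuming the abstracts have already been extracted into a character vector as above.

    # Build a corpus in which each abstract becomes one plain-text document
    library(tm)

    corpus <- VCorpus(VectorSource(abstracts))
    inspect(corpus[1:2])  # examine the first two documents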
Preprocessing should be understood as the
initial phase of text mining. In the first step, spurious
words that do not reflect the central theme are
removed; the objective is to extract a set of words
representing the whole textual body submitted to
natural language processing. The second step builds
a term-versus-document matrix following the model
of a vector space (Salton, 1975), whose purpose is to
obtain the set of documents, their terms and the
respective frequencies. The third step is the analysis
and visualization of the data by means of clusters,
dendrograms and word clouds, among other
techniques and functions.
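The three steps could be realized with tm and the wordcloud package roughly as follows; the cleaning transformations, the 0.90 sparsity threshold and the clustering method are illustrative choices, not prescribed by the methodology.

    library(tm)
    library(wordcloud)

    # Step 1: cleaning -- lower-casing, punctuation, numbers,
    # standard English stop words and excess whitespace
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)

    # Step 2: term-versus-document matrix in the vector space model
    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    # Step 3: visualization -- dendrogram of the denser terms, word cloud
    m  <- as.matrix(removeSparseTerms(tdm, sparse = 0.90))
    hc <- hclust(dist(scale(m)), method = "complete")
    plot(hc)
    wordcloud(names(freq), freq, max.words = 50)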
Spurious data and stop words are terms that do
not convey the central theme of the text, such as
prepositions, articles, country names, slang, etc.
Consequently, they will be eliminated to obtain a
concise textual body, which will facilitate the
execution of the subsequent procedures. It is
necessary to remove: a) words previously recorded
in the dictionary; b) names of countries, continents
and nationalities; c) prefixes, suffixes and verbs; d)
measurement units; e) terms identified throughout
the processing that do not agree with the results
obtained in the following phases. Together these
form a new group of spurious terms, which will be
verified and, if needed, registered in the word
dictionary.
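In tm, such project-specific lists can be applied in the same way as standard stop words; the dictionary file name below is hypothetical.

    # Remove project-specific spurious terms kept in a plain-text
    # dictionary, one term per line (file name is hypothetical)
    spurious <- readLines("spurious_terms.txt")
    corpus   <- tm_map(corpus, removeWords, spurious)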
In the present project, indexing and normalizing
the textual body will consist of disambiguating
words to reduce variability. The goal is to reduce a
set of words that share the same sense or meaning
to a single common term.
Term extraction will then yield, after processing of
the textual corpus, a set of words in indexed and
normalized form.
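One common way to obtain such normalized forms, assuming stemming is an acceptable approximation of this normalization, is the Snowball stemmer available through tm:

    # Reduce inflected variants to a common stem (e.g. "infections",
    # "infected" -> "infect"); requires the SnowballC package
    library(SnowballC)

    corpus <- tm_map(corpus, stemDocument)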
Finally, the terms obtained in the extraction
process will be analysed to identify which abstracts
best represent the central topic and are therefore
representative of their corresponding full texts.
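As an illustrative sketch only, such abstracts could be ranked by how often they contain the extracted key terms; the scoring rule and the term subset below are assumptions, not the paper's stated procedure.

    # Score each document by the summed frequency of selected key terms
    # and keep the ten highest-scoring abstracts (all choices illustrative)
    m <- as.matrix(TermDocumentMatrix(corpus))

    key_terms <- c("dengue", "malaria")  # hypothetical subset of terms
    scores <- colSums(m[rownames(m) %in% key_terms, , drop = FALSE])
    top    <- order(scores, decreasing = TRUE)[1:10]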