
cluded in the articles (Shah et al., 2003).
What elements should be extracted from the articles
and what is the best way to extract them? There
are different strategies to approach the extraction of
information from articles. One possibility is to ex-
tract facts from the text by applying Biomedical Lan-
guage Processing methods (Weeber, 2005). There are
some important difficulties to be taken into account
(Hunter, 2006). Words can be ambiguous, allowing
multiple interpretations of a string, as with genes
whose names are identical to prepositions, acronyms
or abbreviations. Another difficulty is the polysemy
of gene symbols, when a single name or symbol
can refer to more than one gene. Metonymy can
also occur, when a string refers to more than one
related concept, such as a gene and its proteins.
This is the case of p53, described in the article by
Lakoff and Johnson (Lakoff and Johnson, 1980).
Named Entity Recognizers are tools that help to
extract biomedical terms such as genes and proteins
correctly. Increasing attention is now being paid to
syntax analyzers (Zweigenbaum, 2007). Their task is
computationally complex, but growing computing
power and faster algorithms make it more feasible
every day.
In addition, metadata associated with the articles
can also be retrieved. In particular, the MeSH
terms are useful. They are biomedical terms that
were originally intended to index the articles so
that they can be found easily in the databases.
MeSH terms are also useful in text mining because
they summarize the relevant contents, especially
genes and diseases. Other information can also be
included in the analysis. For example, chromosome
location (Hristovski et al., 2005; Perez-Iratxeta,
2002) can be used to better identify the desired
elements in the articles. Other biological
information can also be used to improve the
knowledge base, as in the tool developed by
Stein Aerts et al. (Aerts, 2006).
Once the desired elements of the text have been
retrieved, synonymous terms must be unified into
a single concept (Bard and Rhee, 2004). Initiatives
called ontologies build representation systems for
biomedical knowledge that unify terms into
concepts. In addition, the concepts are organized
semantically: each concept has an associated
semantic category, and concepts are linked to one
another by semantic relationships. Some widely
used ontologies are Gene Ontology (GO), Open
Biological Ontologies (OBO) and the Unified
Medical Language System (UMLS).
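The unification step can be illustrated with a minimal sketch. The
synonym table and the concept identifiers below are hypothetical
placeholders, not real UMLS identifiers, and the function name
unify_terms is an assumption made for illustration only.

```python
# Hypothetical synonym table: keys are normalized surface terms,
# values are placeholder concept identifiers (NOT real UMLS CUIs).
SYNONYM_TO_CONCEPT = {
    "p53": "CONCEPT_TP53",
    "tp53": "CONCEPT_TP53",
    "tumor protein p53": "CONCEPT_TP53",
    "diabetes mellitus": "CONCEPT_DIABETES",
    "diabetes": "CONCEPT_DIABETES",
}


def unify_terms(terms, table=SYNONYM_TO_CONCEPT):
    """Map extracted terms to concept identifiers.

    Terms are normalized (lower case, collapsed whitespace) before
    lookup. Unknown terms are dropped and each concept is returned
    once, in order of first appearance.
    """
    concepts = []
    for term in terms:
        key = " ".join(term.lower().split())
        concept = table.get(key)
        if concept is not None and concept not in concepts:
            concepts.append(concept)
    return concepts


if __name__ == "__main__":
    print(unify_terms(["P53", "TP53", "Diabetes  mellitus", "unknown term"]))
```

A real system would replace the dictionary with lookups against an
ontology such as UMLS; the sketch only shows why several surface
forms collapse into one concept before any co-citation is counted.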
How will the extracted elements be analyzed? The
co-citation of concepts can be analyzed in many
different ways. For example, FACTA (Tsuruoka
et al., 2008) uses Pointwise Mutual Information,
PubGene uses a network of co-citations (Jenssen
et al., 2001), PolySearch (Cheng et al., 2008) uses
Association Rules, Anni 2.0 (Jelier et al., 2008) uses
Concept Profiles, the tool of Varun K. Gajendrana
et al. (Gajendrana et al., 2007) uses a model based
on Zipf's law, Seki et al. (Seki and Mostafa, 2007)
use a probabilistic network, and Bitola (Hristovski
et al., 2005) uses Association Rules. The last four
use these methods both to extract relationships and
to discover new relationships that are not explicitly
written in the articles, while the others apply their
methods only to extract explicit relationships.
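As an illustration of one of these co-citation measures, the
following sketch computes Pointwise Mutual Information from article
counts using the standard definition log2(P(x, y) / (P(x) P(y))).
It is not the FACTA implementation; the function name and the
counting scheme (one count per article) are assumptions.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information, in bits, of two concepts.

    n_xy    -- articles citing both concepts
    n_x     -- articles citing the first concept
    n_y     -- articles citing the second concept
    n_total -- articles in the collection
    """
    if n_x <= 0 or n_y <= 0 or n_total <= 0:
        raise ValueError("marginal counts and total must be positive")
    if n_xy == 0:
        # The concepts never co-occur: the PMI tends to minus infinity.
        return float("-inf")
    # P(x, y) / (P(x) * P(y)) simplifies to n_xy * n_total / (n_x * n_y).
    return math.log2((n_xy * n_total) / (n_x * n_y))


if __name__ == "__main__":
    # Co-cited ten times more often than expected under independence.
    print(pmi(10, 20, 50, 1000))
```

A positive value means the two concepts are co-cited more often than
independence would predict, a value near zero means no association,
and a negative value means they co-occur less often than expected.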
SalamboMiner is a text-mining tool that aims to
obtain a prioritized list of relationships between
genes and diseases. The relationships obtained are
of two kinds: those based on explicit information
and those based on implicit relationships found by
literature discovery. It extracts the relevant
concepts from the title, the abstract and the MeSH
terms of PubMed articles. We define as relevant
the concepts that represent genes, proteins and
diseases. To extract genes and proteins, we use the
BNER Biotagger (McDonald and Pereira, 2005), and
we obtain the diseases from the MeSH terms. It is
important to unify the elements into their respective
concepts because this helps to manage the
extracted terms. In SalamboMiner we use UMLS
because it gives relatively good coverage of gene
and protein concepts (Mary, 2004), although it has
some problems in semantic assignment, as described
by Huanying Gu et al. (Gu, 2007).
In our project, we use the Bayes Factor measure
to extract relevant relationships among concepts.
To discover new relationships, we construct a
Bayesian Network with the concepts as nodes and
their co-citations as edges, and we use the Bayes
Factor again to infer new relationships.
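To make the idea of a Bayes Factor over co-citations concrete, the
sketch below compares two models for a 2x2 co-citation table: a
dependence model, in which the disease is cited at different rates
in articles with and without the gene, and an independence model
with a single shared rate. The exact formulation used in
SalamboMiner is not given in this section, so the uniform Beta(1, 1)
priors, the table layout and the function names are all assumptions.

```python
from math import lgamma


def log_beta(a, b):
    """Natural logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(n_both, n_gene_only, n_disease_only, n_neither):
    """Log Bayes factor of dependence versus independence.

    Assumed models (illustrative, not taken from the paper):
      H1: the disease rate differs between articles that cite the
          gene and those that do not, each with a Beta(1, 1) prior.
      H0: a single disease rate with a Beta(1, 1) prior.
    Binomial coefficients are identical under both models and cancel.
    """
    # Articles citing the gene: k1 of n1 also cite the disease.
    k1, n1 = n_both, n_both + n_gene_only
    # Articles not citing the gene: k2 of n2 cite the disease.
    k2, n2 = n_disease_only, n_disease_only + n_neither
    log_m1 = log_beta(k1 + 1, n1 - k1 + 1) + log_beta(k2 + 1, n2 - k2 + 1)
    log_m0 = log_beta(k1 + k2 + 1, n1 + n2 - k1 - k2 + 1)
    return log_m1 - log_m0


if __name__ == "__main__":
    # Gene and disease are co-cited far more often than by chance.
    print(log_bayes_factor(40, 10, 10, 940))
    # The disease is cited at the same 10% rate with or without the gene.
    print(log_bayes_factor(5, 45, 95, 855))
```

Under these assumptions a positive log Bayes factor favors a
relationship between the two concepts and a negative one favors
independence, which gives a natural score for ranking pairs.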
We have tested SalamboMiner with a data set
comprising articles from the Online Mendelian
Inheritance in Man (OMIM) database that are
related to thrombosis and diabetes.
2 METHODS
SalamboMiner has three consecutive modules:
• the first collects the desired articles,
• the second extracts the desired concepts from
the articles, and
• the third organizes the extracted concepts so
that relationships among them can be inferred.
BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms