a list of ontology features for each database.
Ontologies capture domain knowledge: they provide a
single identifier for each concept or entity in a
domain and connect concepts with related meanings.
Ontologies can therefore drive data annotation and
data integration.
Some databases have adopted the ontology concept
and provide access to a library of biomedical
ontologies and terminologies. Examples include the
Gene Ontology database (The Gene Ontology
Consortium; Ashburner, 2000), BRENDA
(Schomburg, 2004), TAIR (The Arabidopsis
Information Resource; Swarbreck, 2008), and the
NCBO’s BioPortal (Musen, 2012). The ontology of
databases can be described as the relational schema
of their tagged corpora. To classify the mined
databases in BioMetaDB, we constructed our own
ontology list, in which the specification and
conceptualization define the purpose of the ontology
and provide the vocabulary, relationships, and
concepts for its design. Ontological hierarchies and
child-parent relationships (PART_OF/IS_A) were
established to develop the domain ontology and
sub-ontologies for later use in the implementation.
In addition to the database content, we also drew on
ontologies from other groups, such as the Gene
Ontology database, BioPortal, the Open Biological
and Biomedical Ontologies, the Proteomics Standards
Initiative (Orchard, 2003), and the Consultative
Group on International Agricultural Research. The
relevance
among the databases was calculated from their
ontology features, and the databases were then
grouped into categories. In BioMetaDB, each species
is identified by its NCBI Taxonomy identifier
(taxid). To support search in large, open, and
heterogeneous repositories of unstructured
biomedical information, we needed to exploit not
only deep levels of conceptualization of these
databases but also their corresponding publications
and website contents.
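As a rough illustration of how child-parent relationships such as IS_A and
PART_OF could be stored and traversed, the following minimal Python sketch
walks from a term up to all of its ancestors. The term names, the `PARENTS`
table, and the `ancestors` function are hypothetical and chosen only for
illustration; they are not taken from the BioMetaDB ontology list.

```python
from collections import deque

# Hypothetical child -> [(relation, parent)] edges. The terms are
# illustrative only and are not the actual BioMetaDB ontology.
PARENTS = {
    "human": [("IS_A", "species")],
    "yeast": [("IS_A", "species")],
    "transcription factor binding site": [("PART_OF", "regulatory region")],
    "regulatory region": [("IS_A", "sequence feature")],
}


def ancestors(term):
    """Return every term reachable from `term` via IS_A or PART_OF edges."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relation, parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("transcription factor binding site")))
# ['regulatory region', 'sequence feature']
```

A database tagged with a child term could then also be listed under each
ancestor category, which is one possible way to obtain category and
sub-category views from a single feature list.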
2.2 Relevance Measurement for Classification of Databases
We adapted hierarchical classification and
relevance measurement to categorize the databases.
First, we indexed each database by its features,
which were then used to evaluate the relevance
between databases. The feature index of each
database also helped us to classify it. For example,
databases A, B, and C can be indexed as {human,
transcription factor, sequence}, {yeast,
transcription factor}, and {human, transcription
factor binding site}, respectively. Databases A and
C belong to the “human” category, and database B
belongs to the “yeast” category. A query for
“human” returns databases A and C. A query for
“human” plus “transcription factor” returns only
database A. The goal
of the present work is to determine the relevance
between each pair of databases, where each database
carries multiple biological features, for example,
the species studied and the biological issue in
focus. In the bag-of-indexes model, each database
is represented by a vector in N-dimensional space,
where N is the total number of feature indexes. For
the relevance calculation, we drew on a previous
database classification method (Wu, 2005). If two
databases share a significant number of feature
items, they are considered relevant to each other.
For example, suppose that the feature items
extracted from individual databases are denoted A,
B, C, and so on, and that two databases are
represented as D1 = {A, B, C} and D2 = {A, C, D, E}.
The similarity S between the items of the two
databases can be defined as
S = |Item(D1) ∩ Item(D2)| / |Item(D1) ∪ Item(D2)|
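Read as a ratio of set sizes, this similarity is the Jaccard index: the
number of shared feature items divided by the number of distinct feature
items in either database. The short Python sketch below evaluates it for
the D1 and D2 example from the text. The function name `relevance` and the
handling of two empty feature sets are assumptions made for illustration.

```python
def relevance(items_1, items_2):
    """Shared feature items divided by all distinct feature items."""
    set_1, set_2 = set(items_1), set(items_2)
    union = set_1 | set_2
    if not union:
        # Assumption: two databases with no features count as unrelated.
        return 0.0
    return len(set_1 & set_2) / len(union)


d1 = {"A", "B", "C"}
d2 = {"A", "C", "D", "E"}

# Shared items: {A, C} (2); distinct items overall: {A, B, C, D, E} (5).
print(relevance(d1, d2))  # 0.4
```

S ranges from 0 (no shared feature items) to 1 (identical feature sets).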
Thus the relevance among various biomedical
databases can be measured. A higher S value
indicates higher relevance between two databases.
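The keyword lookup over feature indexes described at the start of this
section, together with the N-dimensional vector representation, can be
sketched in Python as follows. The feature sets of databases A, B, and C
come from the example in the text. The names `FEATURE_INDEX`, `search`, and
`to_vector`, the exact matching of feature terms, and the 0/1 encoding of
the vectors are assumptions made for illustration.

```python
# Feature indexes of the three example databases from the text.
FEATURE_INDEX = {
    "A": {"human", "transcription factor", "sequence"},
    "B": {"yeast", "transcription factor"},
    "C": {"human", "transcription factor binding site"},
}


def search(index, *keywords):
    """Return names of databases whose feature set contains every keyword."""
    wanted = set(keywords)
    return sorted(name for name, features in index.items() if wanted <= features)


def to_vector(features, vocabulary):
    """Encode a feature set as a 0/1 vector over an ordered vocabulary."""
    return [1 if term in features else 0 for term in vocabulary]


# N is the number of distinct feature indexes across all databases.
VOCABULARY = sorted(set().union(*FEATURE_INDEX.values()))

print(search(FEATURE_INDEX, "human"))                          # ['A', 'C']
print(search(FEATURE_INDEX, "human", "transcription factor"))  # ['A']
print(len(VOCABULARY))                                         # 5
```

With exact matching, “transcription factor” and “transcription factor
binding site” are distinct features, which is why the two-keyword query
returns only database A.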
2.3 Database and Query Implementation
We present an ontology-based multi-database
classification and extension. BioMetaDB is curated
by the authors and regularly updated. Figure 1 shows
the workflow used to establish BioMetaDB. Web pages
are generated with the PHP server-side scripting
language, which obtains the data and maintains
sessions between web pages. The MySQL relational
database management system stores the biodatabase
information in a structured manner.
BioMetaDB (http://cbs.ym.edu.tw/services/BMdb/)
provides versatile search functions, with
multi-source, multi-category searching through
ontologies and through researchers’ own keywords.
Searches can cover the Web and dedicated
collections, and the query results can be retrieved.
A range of ontologies can be used without assuming
that the databases are annotated. BioMetaDB also
offers a database analysis function through online
query-biased summarization of individual databases
and category sets. The summarization criteria can be
changed flexibly.
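To make the relational storage and multi-category search more concrete,
here is a minimal sketch of one possible layout. It is not the BioMetaDB
schema: the table names `biodatabase` and `feature`, the column names, and
the function `find_databases` are hypothetical, and Python's built-in
`sqlite3` module is used as a stand-in for the PHP and MySQL stack named in
the text so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE biodatabase (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE feature (
        database_id INTEGER NOT NULL REFERENCES biodatabase(id),
        term        TEXT NOT NULL,
        PRIMARY KEY (database_id, term)
    );
    """
)
conn.executemany(
    "INSERT INTO biodatabase (id, name) VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C")],
)
conn.executemany(
    "INSERT INTO feature (database_id, term) VALUES (?, ?)",
    [
        (1, "human"), (1, "transcription factor"), (1, "sequence"),
        (2, "yeast"), (2, "transcription factor"),
        (3, "human"), (3, "transcription factor binding site"),
    ],
)


def find_databases(connection, terms):
    """Return names of databases annotated with every term in `terms`."""
    terms = sorted(set(terms))
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    query = f"""
        SELECT d.name
        FROM biodatabase AS d
        JOIN feature AS f ON f.database_id = d.id
        WHERE f.term IN ({placeholders})
        GROUP BY d.id, d.name
        HAVING COUNT(DISTINCT f.term) = ?
        ORDER BY d.name
    """
    rows = connection.execute(query, (*terms, len(terms))).fetchall()
    return [name for (name,) in rows]


print(find_databases(conn, ["human"]))                          # ['A', 'C']
print(find_databases(conn, ["human", "transcription factor"]))  # ['A']
```

Keeping one row per database and feature term lets a multi-category query
be expressed as a single grouped join, and only the placeholders are built
dynamically, so the search terms themselves are always passed as bound
parameters.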