A TOOL BASED ON WEB SERVICES TO QUERY BIODIVERSITY INFORMATION

Joana G. Malaverri, Bruno S. C. M. Vilar and Claudia B. Medeiros

Institute of Computing, University of Campinas, 13083-970, Campinas, SP, Brazil

Keywords:

Biodiversity Systems, Ontologies, Query processing, Web services.

Abstract: Biodiversity Information Systems are complex software systems that present data management solutions to

allow researchers to analyze species and their interactions. The complexity of these systems varies with the

data handled, users targeted and environment in which they are executed. Open problems to be faced,
especially in a Web environment, are data heterogeneity and the diversity of user vocabularies and needs, both of which
hamper query processing. This paper presents a tool based on Web services to expand and process biodiversity

queries using ontology information. This solution relies on a new database organization, also described here,

which combines in a single model data collected in the ﬁeld with data found in archival sources. This tool is

being tested using real case studies, within a large Web-based biodiversity system.

1 INTRODUCTION

Biodiversity studies cover a wide variety of data, in-

cluding species occurrence records, spatial, ecologi-

cal, socio-economic data and others. The large vol-

ume of information on species and their habitats re-

quires new solutions for managing and analyzing the

characteristics of species and their interactions.

Biodiversity Information Systems emerged with

this objective. The scope of these systems includes

the recovery of textual information, such as literal

descriptions, and of the spatial distribution of one

or more species. Typically, they provide support to

queries on traditional database systems, and users are

limited in query ﬂexibility. Moreover, there is a need

for new tools to process biodiversity data on the Web.

This paper discusses our proposed solution to this problem

– a tool based on a set of Web services that processes

queries, extending them with semantic information.

This proposal is being tested on the BioCORE project,

a Web biodiversity system that is being developed in

a joint effort between computer scientists and biologists. Our queries are centered on two kinds of
biodiversity data: occurrence records, containing observations recorded and collected during field trips;

and catalog records, containing information on (pre-

served) species in museums. This combination of data

sources is itself a contribution, since most biodiver-

sity systems consider either one or the other, but not

both. Our solution combines Web services, query ex-

pansion mechanisms based on ontologies, and a novel

biodiversity database model.

2 RELATED WORK

2.1 Managing Biodiversity Information

and Standards

There are a large number of projects that aim to de-

velop mechanisms to publish and manage biodiver-

sity data on the Web. Data heterogeneity is one of

the most important problems considered. Many of

these projects were proposed in order to manage collections of natural history
museums and herbaria. SpeciesLink (CRIA, 2001), for example, is a

Web system that allows integration of information on

biodiversity records available in museums, herbaria

and microbiological collections by publishing them

on the Internet. Another example is Specify (Beach,

2007), a project that aims to provide a platform that

uses Web services as a support for the management

of data collections. It also considers operations that

should be performed on the collections, such as loans,

exchanges, and donations.

On the other hand, the Biota project (Colwell,

1996) was one of the ﬁrst projects interested in occur-

rence records – those that register observations made

by biologists in the ﬁeld.

In parallel, projects like GBIF (Global Biodiver-

sity Information Facility) (GBIF, 2004), ITIS (Inte-

Malaverri J., Vilar B. and Medeiros C.

A TOOL BASED ON WEB SERVICES TO QUERY BIODIVERSITY INFORMATION.

DOI: 10.5220/0001836103050310

In Proceedings of the Fifth International Conference on Web Information Systems and Technologies (WEBIST 2009), page

ISBN: 978-989-8111-81-4

grated Taxonomic Information System) (ITIS, 2007),

or TDWG (Taxonomic Database Working Group)

(TDWG, 1994), are directing efforts to establish stan-

dards and infrastructure for integration and interope-

rability of data from biological collections, making

them available on the Web. Another considerable set

of biodiversity applications deals with the manage-

ment of taxonomic information and geographic dis-

tribution of species – e.g., Tree of Life (Maddison and

Schulz, 2007).

Most of these projects use data standards to facil-

itate access and dissemination of information on the

Internet. The two standards that are most commonly

adopted are Darwin Core and ABCD (Access to Biological Collection Data) (TDWG, 1994).
The main objective of Darwin Core is to facilitate the exchange of

information on species. Among its core attributes, it

speciﬁes the name of the organism and where, when

and who collected it. ABCD brings additional elements to those provided by Darwin Core.
It is a common data schema for structuring and specifying units of biological collections.

Data transfer protocols like DiGIR (SourceForge.NET, 1999) and BioCase (BioCase, 2005) were

developed for these standards. DiGIR is a protocol

that provides a single access point to distributed data

sources, and uses the Darwin Core standard. Bio-

Case was developed to provide connectivity between

databases of biological collections. This protocol is

based on HTTP and XML and uses the ABCD standard to transmit data over the BioCase network. A
newer approach, TAPIR (TDWG Access Protocol for Information Retrieval) (TDWG, 1994), is being
promoted by GBIF to unify the DiGIR and BioCase protocols and to improve interoperability among
biodiversity tools and data. TAPIR specifies a standard protocol based on XML Schema and Web services.
Several of these efforts are beginning to consider ontologies as a means to enhance interoperability
in Biodiversity Systems.

2.2 Ontologies

An ontology is a speciﬁcation of a conceptualization

(Gruber, 1993). Ontologies can capture the semantics

of a domain by defining concepts and their relationships. Ontologies also have more specific
applications, such as describing resources and services to automate processes, controlling
vocabularies, and contextualizing and inferring information.

Particularly in biodiversity information systems,

it is possible to ﬁnd different uses for ontologies.

SEEK (Michener et al., 2007) and Aondê (Daltio and

Medeiros, 2008) use ontologies to enable query and

analysis of the data in multiple and heterogeneous in-

formation sources.

2.3 Query Processing Issues

Our work concerns combining the ﬂexibility of Web

services with mechanisms for modification of biodiversity queries to enhance their semantics.
Different query modification techniques can be found in the literature, such as reformulation,
expansion, substitution, enrichment and relaxation (e.g., Florescu et al., 1996; Lian et al., 2007).
The goals of these techniques include:

• Better performance - e.g., less time or fewer re-

sources needed in query execution;

• Better precision in the results through the modiﬁ-

cation of a query that originally does not retrieve

all relevant results.

Query expansion/rewriting, the technique we adopted, is the process of augmenting a user query
with additional terms to improve its results. The techniques and resources used to expand queries
include ontologies and probabilistic methods (Andreou, 2005), as well as term extraction from a
set of retrieved documents or from query logs. The use of ontologies corresponds to so-called
Semantic Query Optimization, which reformulates a query into a semantically equivalent one that
provides the same answer but can be processed more efficiently (Necib and Freytag, 2004).

3 QUERYING BIODIVERSITY

DATA

3.1 The BioCORE Project

BioCORE (Bio-CORE, 2008) is a Web-based project developed in a collaboration between researchers
in Computer Science and Biology. It aims to help scientists and researchers in biodiversity perform
multimodal and exploratory queries over heterogeneous biodiversity data sources. Its architecture,
presented in Figure 1, is based on Web services.

The architecture covers a client application,

which supplies an interface between the users and the

provided services. Services are categorized as stora-

ge, support and advanced. The ﬁrst group provides

basic data access facilities, encapsulating data repo-

sitories at the storage level. Supporting services in-

clude: content based image retrieval, and manage-

ment of collection data, metadata, geographic data

Figure 1: BioCORE Architecture.

and ontologies. Advanced services comprise more

complex services which invoke compositions of sup-

porting services. This paper concerns the Collection

Service, outlined in the ﬁgure.

Repositories contain information on images,

maps, collections and ontologies. The latter, stored

in a Semantic Repository, are used by our Collection

service. Also, each repository maintains a set of meta-

data to aid information management and retrieval.

3.2 The Collection Service

The Collection Service is a tool based on Web ser-

vices to query biodiversity records. Its queries are

performed on data stored in Collection Repositories,

and extended by ontologies in Semantic Repositories

(see Figure 1).

It is composed of two main elements: (i) a basic

query Web service that receives and executes query

requests from a Client Application and (ii) a query expansion Web service that rewrites a query
expression using ontologies delivered by our Ontology Service (Daltio and Medeiros, 2008).

The main features of this tool are: (i) use of two

different Web services, to separate query processing

tasks between basic processing and expansion; (ii)

use of domain ontologies to ﬁnd alternative ways to

rewrite a query and (iii) adoption of biodiversity standards to improve sharing and exchanging
of information on the Web among different research groups. Ontology management is performed by
Aondê's Ontology Web Service. It provides a wide range of operations to store, manage, search,
rank, analyze and integrate ontologies.

Figure 2 shows a high level view of the Collection

Service and its components. Query processing works

as follows: a Client Application sends a user request to

the Collection service (1) with or without request for

expansion. This service encapsulates a Basic Query

module and a Query Expansion service. The Basic

Query module provides a connection with the Collec-

tions Repository, requests query execution (2) and re-

ceives a result (3). If the client did not request query

expansion, the data is returned to the client application

(12). Otherwise, the Basic Query module forwards

a request to the Query Expansion service (4). This

service makes a request to our Ontology Web Ser-

vice (5,6) for ontologies that are related to the query.

These ontologies are returned to the Query Expansion

service (7,8) where they are processed to rewrite the

query. The expanded query is sent back to the Basic

Query module (9) which runs it (10) and returns the

result to the client (11,12). The development of these

services is guided by the openness, accessibility and

interoperability provided by open source software and

Web service technologies.
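
The message flow described above can be summarized in a short Python sketch. It is an illustration only, under stated assumptions: the names `collection_service`, `run_query` and `expand_query` are hypothetical stand-ins for the Basic Query module and the Query Expansion service, whose actual Web service interfaces are not specified here. The final union of both results follows the behavior described in Section 3.4.1.

```python
from typing import Callable, List, Tuple

Row = Tuple  # one result row from the Collections Repository


def collection_service(
    query: str,
    expand: bool,
    run_query: Callable[[str], List[Row]],  # hypothetical stand-in for the Basic Query module
    expand_query: Callable[[str], str],     # hypothetical stand-in for the Query Expansion service
) -> List[Row]:
    """Process a client request (1), with or without query expansion."""
    rows = run_query(query)              # steps (2)-(3): execute the original query
    if not expand:
        return rows                      # step (12): basic result goes back to the client
    expanded = expand_query(query)       # steps (4)-(9): rewrite the query using ontologies
    extra = run_query(expanded)          # step (10): execute the expanded query
    # steps (11)-(12): union of both results, keeping order and dropping duplicates
    result, seen = list(rows), set(rows)
    for row in extra:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

Passing the two services as callables keeps the sketch independent of any particular Web service stack; in a deployment they would wrap the corresponding remote calls.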

Figure 2: Architecture of the Collection Service.

A query without expansion (Basic query) is a stan-

dard SQL query on the Collections Repository (which

contains ﬁeld observation and catalog records). The

client application, in such a case, must know the

database schema in order to express the query. The

only acceptable predicates are those that involve fields

that appear in the schema. For instance, the query

“Return all species recorded in the museum catalog

that belong to the family “Amphiuridae” ” will only

work if the database schema has an attribute called

“family”. Table 1 shows an example of a partial re-

sult of this type of query, run on our database.

Table 1: Table with partial query results - basic query.
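
The dependence of a basic query on the schema can be illustrated with a small, self-contained sketch. It makes several assumptions: an in-memory SQLite database stands in for the PostgreSQL repository, the `Catalog` table is reduced to three hypothetical columns, and the two rows are illustrative sample data rather than BioCORE records.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Collections Repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Catalog (id INTEGER PRIMARY KEY, family TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO Catalog (family, species) VALUES (?, ?)",
    [("Amphiuridae", "Amphiura filiformis"), ("Ophiotrichidae", "Ophiothrix fragilis")],
)


def basic_query(sql, params=()):
    """Run a query as-is, with no expansion."""
    return conn.execute(sql, params).fetchall()


# Works because the schema has an attribute called "family".
rows = basic_query(
    "SELECT species FROM Catalog WHERE family = ? ORDER BY species", ("Amphiuridae",)
)

# A predicate on an attribute that is absent from the schema is rejected.
try:
    basic_query("SELECT species FROM Catalog WHERE commonName = ?", ("brittle star",))
    schema_error = None
except sqlite3.OperationalError as exc:
    schema_error = str(exc)
```

The second query fails at the database level, which is the situation that query expansion is meant to alleviate.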

3.3 The Collections Repository

An important part of our work was the design of the Collections Repository. It is a database
containing information on field observations and catalog records. It

has been implemented using the PostgreSQL database

system (PostgreSQL, 1996). One relevant issue in

the development of the data model is that it should

be general, allowing the exchange of information bet-

ween different research groups.

For this purpose, we decided to use the data model

elements that are part of the Darwin Core standard

(TDWG, 1994). This means that, in the future, our

work can interoperate with other projects, because it relies on Web services and on this
worldwide data standard. We started by defining the subset of interest in Darwin Core, and
added other relevant specific fields, specified by our end users.

The entire work was conducted in cooperation with these end users: biologists from two distinct
research fields, ecology and marine biology. While the former perform field trips to collect data
on interactions among insects and plants, the latter collect small sea animals. They are moreover
in charge of a large project to reorganize the university's zoology museum, and are thus conversant
with the needs and methods of management of species catalog records.

Thus, our database model reﬂects a dual view of

biodiversity data management. On one side, we sup-

port storage and handling of data on species obser-

vations and ﬁeld trip collections. On the other side,

we also cater to the needs of museum catalogs, which

are closer to those of (digital) librarians. As far as we

know, there is no other unifying database model pro-

posal of the same kind - biodiversity databases are ei-

ther concerned with ﬁeld trip records or with museum

catalog records.

Figure 3 shows a high level view of the database

entity-relationship diagram. This multipurpose

database naturally supports a wider spectrum of

queries. This includes for instance queries that trace a

museum record entry back to its ﬁeld origins, without

losing any of the original annotations.
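
A query of this kind, tracing a catalog entry back to the field sample it came from, might look like the sketch below. It rests on loud assumptions: only the entity names Sample, HomogeneousSet and Catalog come from our model, while every column name, key and data value is hypothetical, and SQLite is used instead of PostgreSQL to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified version of the three central entities.
conn.executescript("""
CREATE TABLE Sample (
    idSample INTEGER PRIMARY KEY, collector TEXT, collectDate TEXT, locality TEXT);
CREATE TABLE HomogeneousSet (
    idSet INTEGER PRIMARY KEY, idSample INTEGER REFERENCES Sample(idSample), taxon TEXT);
CREATE TABLE Catalog (
    idCatalog INTEGER PRIMARY KEY, idSet INTEGER REFERENCES HomogeneousSet(idSet),
    catalogNumber TEXT);
INSERT INTO Sample VALUES (1, 'collector A', '2008-03-01', 'site X');
INSERT INTO HomogeneousSet VALUES (10, 1, 'lepidoptera');
INSERT INTO Catalog VALUES (100, 10, 'CAT-0001');
""")


def trace_to_field(catalog_number):
    """Return who/when/where of the field sample behind a museum catalog entry."""
    return conn.execute(
        """
        SELECT s.collector, s.collectDate, s.locality
        FROM Catalog c
        JOIN HomogeneousSet h ON h.idSet = c.idSet
        JOIN Sample s ON s.idSample = h.idSample
        WHERE c.catalogNumber = ?
        """,
        (catalog_number,),
    ).fetchall()
```

Because both kinds of record live in the same database, the trace is a plain join rather than a cross-system lookup.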

The central entities of the database model are

Sample (corresponding to ﬁeld observation/collection

records), HomogeneousSet (records on sets of homo-

geneous species extracted from ﬁeld collections) and

Catalog (museum records). Sample, HomogeneousSet and Catalog records have to answer the same kind
of query: What (species identification), How (it was collected, preserved, catalogued), by Whom,
When, Where. The answer to these queries needs a context (e.g., does the query concern field
observations, catalog entries, or their interconnection?). Moreover, the What (taxonomic
information) is often incomplete, and may evolve. Location (Where) can be erroneous or imprecise
when coordinates are unavailable. For more details on data incompleteness in biodiversity
databases, we refer the reader to (Daltio and Medeiros, 2008). For more on the Collections
Repository, we refer the reader to (Malaverri, 2008).

3.4 The Query Expansion Service

The Collection service receives a query as a parameter,
analyzes its predicates, and optionally invokes the

Query Expansion service. The use of ontologies in

query processing allows the Query Expansion service

to expand a query expression to incorporate terms and

concepts that are not in the collection database, but are

part of the biologists’ conceptual view of the world.

This section presents examples of typical queries,

with invocation of the Expansion Service.

3.4.1 The Use of Subclasses (Hyponyms)

Consider the natural language query:

Return insects of the order lepidoptera that

were collected in the adult life stage.

This query can be represented in SQL (Structured

Query Language) as:

    SELECT * FROM Taxonomy t, Catalog c
    WHERE t.class = 'insecta' AND t.order = 'lepidoptera'
      AND c.lifestage = 'adult' AND t.idTaxa = c.idTaxa

Suppose the query is posed on Table 2, extracted

from our Catalog Table. In particular, our database

records have many nulls. Hence, records 1, 2 and

3 have the Order identiﬁed while 4, 5 and 6 contain

SuperFamily information. The query can be directly

applied to the table, since it contains all needed at-

tributes.

Since the Order attribute is not present directly in

records 4, 5 and 6 these records would not be con-

sidered. However, it is possible to expand the query

using an ontology that represents taxonomic informa-

tion. This ontology is partially depicted in Figure 4.

Using the inheritance relation between the con-

cepts, it is possible to recognize that gracillarioidea,

hesperioidea, micropterigoidea, and papilionoidea

are ontological sub-classes of order lepidoptera. The

query can be rewritten as follows:

    SELECT * FROM Taxonomy t, Catalog c
    WHERE t.class = 'insecta'
      AND t.superfamily IN ('gracillarioidea', 'hesperioidea',
                            'micropterigoidea', 'papilionoidea')
      AND c.lifestage = 'adult' AND t.idTaxa = c.idTaxa

The user needs to define whether the query is to be processed with or without expansion.
Without expansion, the query will process only the contents of records 1,

Figure 3: Entity-Relationship Diagram based on Darwin Core standard version 2.

Table 2: Table with records about insects.

Id  Class    Order        SuperFamily      LifeStage  commomName
1   insecta  lepidoptera  (null)           adult      butterfly
2   insecta  lepidoptera  (null)           larval     moth
3   insecta  coleoptera   (null)           adult      beetle
4   insecta  (null)       hesperioidea     adult      (null)
5   insecta  (null)       lepidotrichidae  larval     (null)
6   insecta  (null)       chrysomeloidea   adult      (null)

2, and 3. With expansion, the Query Expansion service is invoked to reformulate the query,
which will also process records 4, 5, and 6. The result is the union of the results of the
expanded and non-expanded queries.
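
The subclass-based expansion can be sketched in Python as follows. This is a simplified illustration, not the service implementation: the ontology is reduced to a dictionary holding the four subclasses named in the text, the data of Table 2 are loaded into a single hypothetical flat table called `Table2` (instead of the Taxonomy and Catalog tables), SQLite stands in for PostgreSQL, and the union of both queries is expressed as one disjunctive predicate.

```python
import sqlite3

# Subclass relation taken from the text: superfamilies under order lepidoptera.
SUBCLASSES = {
    "lepidoptera": ["gracillarioidea", "hesperioidea", "micropterigoidea", "papilionoidea"],
}

# Contents of Table 2; None marks a null attribute.
ROWS = [
    (1, "insecta", "lepidoptera", None, "adult", "butterfly"),
    (2, "insecta", "lepidoptera", None, "larval", "moth"),
    (3, "insecta", "coleoptera", None, "adult", "beetle"),
    (4, "insecta", None, "hesperioidea", "adult", None),
    (5, "insecta", None, "lepidotrichidae", "larval", None),
    (6, "insecta", None, "chrysomeloidea", "adult", None),
]

conn = sqlite3.connect(":memory:")
# "order" is a reserved word in SQL, so the column name is quoted.
conn.execute(
    'CREATE TABLE Table2 (id INTEGER, class TEXT, "order" TEXT, '
    "superfamily TEXT, lifestage TEXT, commomName TEXT)"
)
conn.executemany("INSERT INTO Table2 VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def query_by_order(order, lifestage, expand=False):
    """Return ids of insects of the given order and life stage."""
    sql = 'SELECT id FROM Table2 WHERE class = ? AND lifestage = ? AND ("order" = ?'
    params = ["insecta", lifestage, order]
    subclasses = SUBCLASSES.get(order, []) if expand else []
    if subclasses:
        # Expansion: also accept records identified only by a subclass.
        placeholders = ", ".join("?" for _ in subclasses)
        sql += f" OR superfamily IN ({placeholders})"
        params += subclasses
    sql += ") ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]
```

Calling `query_by_order("lepidoptera", "adult")` matches only on the Order attribute, while passing `expand=True` additionally reaches records whose Order is null but whose SuperFamily is a known subclass.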

Figure 4: Partial ontology for the Insecta class.

3.4.2 The Use of Equivalence (Synonyms)

Consider the natural language query: “Return data on
butterflies”. This can be represented in SQL as:

    SELECT * FROM Catalog WHERE commomName = 'butterfly'

Consider again the data in Table 2. The specified criterion can be applied directly only to
records 1, 2 and 3. To consider data from records 4, 5 and 6, where the common name is not
present, the query needs to be reformulated.

Again, it is possible to use the information from an ontology to specify an alternative way
to verify whether an insect is a butterfly. The ontology in Figure 4 also includes an
alternative concept that defines this common name, defined as equivalent to order Lepidoptera.
However, this equivalence is restricted to Hesperioidea and Papilionoidea (see Figure 4).
From this information, the SQL query can be reformulated as follows:

    SELECT * FROM Catalog WHERE superfamily IN ('hesperioidea', 'papilionoidea')

Again, if the user does not demand query expansion, the original query will be processed and
return record 1. The expanded query will return record 4. The result is the union of the
results of both queries.
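
The equivalence-based rewriting can be sketched as a small function that turns an equality predicate into a disjunction. The sketch is hedged in the same way as before: `EQUIVALENTS` and `expand_equality` are hypothetical names, the equivalence is hard-coded from the text instead of being fetched from the Ontology Web Service, and a reduced SQLite table stands in for the Catalog table.

```python
import sqlite3

# Equivalence taken from the text: the common name "butterfly" corresponds to
# the superfamilies hesperioidea and papilionoidea.
EQUIVALENTS = {
    ("commomName", "butterfly"): ("superfamily", ["hesperioidea", "papilionoidea"]),
}


def expand_equality(column, value):
    """Return (where_clause, params) for `column = value` plus its equivalent, if any.

    Column names must come from the trusted schema; values are bound as parameters.
    """
    clause, params = f"{column} = ?", [value]
    if (column, value) in EQUIVALENTS:
        alt_column, alt_values = EQUIVALENTS[(column, value)]
        marks = ", ".join("?" for _ in alt_values)
        clause = f"({clause} OR {alt_column} IN ({marks}))"
        params += alt_values
    return clause, params


# Reduced version of Table 2 with only the attributes this query needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Catalog (id INTEGER, superfamily TEXT, commomName TEXT)")
conn.executemany(
    "INSERT INTO Catalog VALUES (?, ?, ?)",
    [
        (1, None, "butterfly"), (2, None, "moth"), (3, None, "beetle"),
        (4, "hesperioidea", None), (5, "lepidotrichidae", None), (6, "chrysomeloidea", None),
    ],
)

clause, params = expand_equality("commomName", "butterfly")
ids = [row[0] for row in conn.execute(f"SELECT id FROM Catalog WHERE {clause} ORDER BY id", params)]
```

Predicates with no known equivalent are returned unchanged, so the same function can be applied to every equality predicate of a query.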

3.4.3 Other Uses of Ontologies

The previous examples use the information on sub-

class and equivalence relationships to obtain differ-

ent speciﬁcations about a concept. Ontologies have

an additional set of resources that can be adopted to

rewrite queries.

Query expansion can consider identity (syn-

tactic identity) and equivalence concepts. Sub-

class/superclass relationships can be exploited in one

or more levels. In particular, super/subclasses can be

suggested to users when a query does not return the

desired answers. Moreover, if a concept consists of

an intersection of others, the query can be speciﬁed

utilizing the concepts and restrictions applied to the

intersection.

Additional relationships and properties should be

considered in query expansion. They include applications of transitivity and symmetry.
A particularly interesting ontological rewriting possibility involves part-whole relationships.
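
As an illustration of exploiting subclass relationships over one or more levels, the following sketch computes the transitive subclasses of a concept with an optional depth limit. The `CHILDREN` dictionary is a toy hierarchy assumed for the example (the family level under papilionoidea is added here and is not part of the examples above), and `descendants` is a hypothetical helper, not an operation of the Ontology Web Service.

```python
from collections import deque
from typing import Dict, List, Optional

# Toy subclass hierarchy, parent -> direct subclasses (illustrative only).
CHILDREN: Dict[str, List[str]] = {
    "insecta": ["lepidoptera", "coleoptera"],
    "lepidoptera": ["hesperioidea", "papilionoidea"],
    "papilionoidea": ["papilionidae"],
}


def descendants(concept: str, max_levels: Optional[int] = None) -> List[str]:
    """Breadth-first list of subclasses of `concept`, down to `max_levels` levels.

    With max_levels=None the full transitive closure is returned.
    """
    found: List[str] = []
    seen = {concept}
    queue = deque([(concept, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_levels is not None and depth >= max_levels:
            continue  # do not descend below the requested level
        for child in CHILDREN.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found
```

A depth limit of one reproduces the single-level expansion of Section 3.4.1, while larger limits let the expansion service trade result coverage against query size.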

4 CONCLUSIONS

This work presented a tool to support research on bio-

diversity. It uses metadata standards and Web services

to exchange and share data, and applies a query ex-

pansion technique to adapt user queries to the data

sources. Query expansion relies on the use of ontolo-

gies, which are served by a Web service.

The design and test of database and tool are being

conducted with participation of biology experts. The

database has been created using PostgreSQL. The biologists'
distinct archived files are now being migrated

into the database. We are conducting tests with and

without query expansion, to validate database design

choices. These tests are being executed directly in

SQL. The Collection service has already been speci-

ﬁed and we are now ﬁnishing the speciﬁcation of the

Expansion Service to support all expansion techniques
of Section 3.4.3. All services are being built using

Apache Axis.

Future work involves many issues. The ﬁrst is

to use the TAPIR protocol, used by large biodiver-

sity projects, as a mechanism to transfer information.

Another issue will involve distinct kinds of user inter-

action modes, and other kinds of interaction mecha-

nisms – e.g., clicking on maps.

ACKNOWLEDGEMENTS

This work was partially financed by grants from Brazilian funding agencies CNPq (including
project BioCORE), CAPES and FAPESP.

REFERENCES

Andreou, A. (2005). Ontologies and Query expansion.

Master’s thesis, University of Edinburgh.

Beach, J. (2007). Specify Biodiversity Collections Soft-

ware. http://www.specifysoftware.org/Specify/. (ac-

cess Oct, 2008).

Bio-CORE (2008). Tools, models and tech-

niques to support research in biodiversity.

http://www.lis.ic.unicamp.br/projects/bio-core/.

BioCase (2005). Biological Collection Access Services for

Europe. http://www.biocase.org/index.shtml. (access

Oct, 2008).

Colwell, R. (1996). Biota: The Biodiversity Database Man-

ager. Sinauer Associates.

CRIA (2001). Centro de Referência em Informação Ambiental.
http://splink.cria.org.br. (access Oct, 2008).

Daltio, J. and Medeiros, C. B. (2008). Aondê: An ontology web service for interoperability
across biodiversity applications. Information Systems, 33. Accepted for publication.

Florescu, D., Raschid, L., and Valduriez, P. (1996). A methodology for query reformulation
in CIS using semantic knowledge. Int. J. Cooperative Inf. Syst., 5(4):431–468.

GBIF (2004). Global Biodiversity Information Facility.

URL: http://www.gbif.org. (access Oct, 2008).

Gruber, T. (1993). Toward principles for the design of ontologies used for knowledge sharing.
In Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer Academic Publishers.

ITIS (2007). Integrated taxonomic information system.

http://www.itis.gov/. (access Oct, 2008).

Lian, L., Ma, J., Lei, J., Song, L., and Zhang, D. (2007).

Query relaxing based on ontology and users’ behavior

in service discovery. In Fourth International Confer-

ence on Fuzzy Systems and Knowledge Discovery.

Maddison, D. and Schulz, K. (2007). The tree of life web

project. Zootaxa, 1668.

Malaverri, J. G. (2008). Serving Biodiversity Data on the

Web. Master’s thesis. Defence April 2009.

Michener, W., Beach, J., Jones, M., Ludäscher, B., Pennington, D., Pereira, R. S.,
Rajasekar, A., and Schildhauer, M. (2007). A knowledge environment for the biodiversity
and ecological sciences. J. Intell. Inf. Syst., 29(1):111–126.

Necib, C. B. and Freytag, J. C. (2004). Using ontologies

for database query reformulation. In ADBIS (Local

Proceedings).

PostgreSQL (1996). Postgresql.

http://www.postgresql.org/. (access Oct, 2008).

SourceForge.NET (1999). Distributed generic information

retrieval (digir). http://sourceforge.net/projects/digir.

(access Oct, 2008).

TDWG (1994). Biodiversity information standards - tdwg.

http://www.tdwg.org/. (access Oct, 2008).
