A Virtual Document Approach for Keyword Search in Databases

Jaime I. Lopez-Veyna, Victor J. Sosa-Sosa and Ivan Lopez-Arevalo

Information Technology Laboratory, Center of Research and Advanced Studies of the National Polytechnic Institute

(CINVESTAV), Cd. Victoria, Tamaulipas, Mexico

Keywords:

Keyword Search, Indexing, Databases, Top-k, Virtual Documents.

Abstract:

It is clear that in recent years the amount of information available in a variety of data sources, like those

found on the Web, has presented an accelerated growth. This information can be classiﬁed based on its

structure in three different forms: unstructured (free text documents), semi-structured (XML documents) and

structured (a relational database or XML database). A search technique that has gained wide acceptance for

use in massive data sources, such as the Web, is the keyword based search, which is simple to people who

are familiar with the use of Web search engines. Keyword search has become an alternative to users without

any knowledge about formal query languages and schema used in structured data. There are some traditional

approaches to perform keyword search over relational databases such as Steiner Trees, Candidate Networks

and recently Tuple Units. Nevertheless these methods have some limitations. In this paper we propose a

Virtual Document (VD) approach for keyword search in databases. We represent the structured information

as graphs and propose the use of an index that captures the structural relationships of the information. This

approach produce fast and accuracy results in search responses. We have conducted extensive experiments on

large-scale real databases and the results demonstrates that our approach achieves high search efﬁciency and

high accuracy for keyword search in databases.

1 INTRODUCTION

In recent years we have been witnesses to an expo-

nential increase of the quantity of information avail-

able on the World Wide Web. The information search

has become an indispensable component in our lives.

Web search engines such as Google or Yahoo are the

primary way to access massive information. Most of

data in the Web can be found ina textual format, and it

is also common to ﬁnd a huge amount of information

stored in relational databases, XML documents, and

other data sources. This information could be classi-

ﬁed based on its structure such as: a) unstructured

data, or free-text documents (emails, news, HTML

documents) that are written in natural language, b)

semi-structured, such as posting on newsgroup (e.g.

apartment rentals), medical records, equipment main-

tenance logs; XML documents are semi-structured

because the schema information is mixed with the

data values. HTML pages on the Web are considered

semi-structured, since the embedded data are often

rendered regularly via the use of HTML tags. And c)

structured, such as those found in relational databases.

The search engines have used a very simple and

widely accepted mechanism for querying textual doc-

uments known as keyword search. This mecha-

nism becomes an alternative of querying over re-

lational databases and XML documents, because it

is simple to people who are familiar with the use

of Web search engines. Recently the database re-

search community has recognized the beneﬁts of

keyword search and has been introducing keyword

search capability into relational databases (Li and

Feng, 2009),(Agrawal and Chaudhuri, 2002),(Ding

and Xu, 2007),(Hristidis and Papakonstantinou,

2002),(Park and goo Lee, 2011),(Feng and Li,

2011),(Bhalotia and Hulgeri, 2002), XML databases

(He and Wang, 2007),(V. Hristidis, 2006),(Hris-

tidis and Papakonstantinou, 2003), (Bao and Lu,

2010), graph databases (Kimelfeld and Sagiv,

2008),(Achiezra and Golenberg, 2010),(Kacholia and

Pandit, 2005),(Zhong and Liu, 2009),(He and Wang,

2007), and heterogeneous data sources (Dong and

Halevy, 2007),(Li and Feng, 2008a),(Franklin and

Halevy, 2005). One important advantage of keyword

search is that it enables users to search for information

without having a detailed knowledge of the schema of

the database or XML document and without the need

to learn some formal languages like SQL or XQuery.

Although keyword search has been proven to be

I. Lopez-Veyna J., J. Sosa-Sosa V. and Lopez-Arevalo I..

A Virtual Document Approach for Keyword Search in Databases.

DOI: 10.5220/0004048700390048

In Proceedings of the International Conference on Data Technologies and Applications (DATA-2012), pages 39-48

ISBN: 978-989-8565-18-1

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

effective for textual documents, it presents some limi-

tations on structured and semi-structured data that are

not easy to carry out. This situation is because cur-

rent Information Retrieval (IR) techniques applied on

search engines have not been designed for this type

of data sources (M. Karnstedt, 2008). They have

always employed the inverted index to process key-

word queries, which is effective for unstructured data

but it is inefﬁcient for semi-structured and structured

data (Franklin and Halevy, 2005). Additionally, these

techniques ignore the information structure (hyper-

links in the case of unstructured data, parent child re-

lationships or IDRefs in semi-structured data and pri-

mary and foreign keys in the case of structured data)

that can be extracted from the data sources for answer-

ing keyword queries.

We propose a novel virtual document approach as

an attempt to solve the problem of keyword search

in databases. This approach can be applied to un-

structured and semi-structured information. Our ap-

proach will improve the keyword search in databases

and make an efﬁcient processing using a score based

on classical information retrieval that takes into ac-

count the meta-information included in structured in-

formation.

The rest of this paper is organized as follows. Sec-

tion 2 shows the Related Work. We introduce the

Problem in Section 3. Section 4 presents our ap-

proach. Section 5 shows some of our preliminary re-

sults. Conclusions and Future Work are given in sec-

tion 6.

2 RELATED WORK

The ﬁrst area of research related to our work is the

keyword-based search over structured data using the

IR techniques. Inverted index has been proven to be

one of the most effective techniques to organize un-

structured data. In recent years there are applications

that need to integrate structured data and text docu-

ments. Due to this situation, different approaches to

keyword search over structured data have been pro-

posed. The existing approaches can be classiﬁed into

three main methods: Steiner Tree, Candidate Net-

works and Tuple Units.

The Steiner-Tree-based methods use a data graph

structure to compute relevant answers. A Steiner Tree

can be deﬁned as: given N points in a plane, it is

necessary to connect them by lines of minimum to-

tal length in such a way that any two points may be

interconnected(Du and Hu, 2008). The authors in

(Li and Feng, 2009),(He and Wang, 2007),(Li and

Feng, 2008a),(Dong and Halevy, 2007) and (Bhalo-

tia and Hulgeri, 2002) ﬁrst model the tuples in a re-

lational database as a data graph, where nodes are tu-

ples and edges are the primary-foreign-key relation-

ships, and they identify the steiner trees that contain

all or some of the input keywords to answer a query.

These methods compute answers on the ﬂy by travers-

ing the database graph to discover structural relation-

ships. Since minimum Steiner tree problem is known

to be NP-hard (Fang and Clement, 2006), a Steiner

Tree based approach can be inefﬁcient, demanding

new studies to ﬁnd polynomial time solutions that can

effectively approximate the minimum Steiner Tree

problem (Li and Feng, 2009). For example, the EASE

approach proposed by (Li and Feng, 2008a) was re-

viewed and we found that the datagraph generated

from the IMDB Dataset

, contained 1,000,209 nodes

and 200,000 vertex. After following the process de-

scribed in this approach, we found 1,007,826 min-

imal Steiner Trees, this is a big number, closer to

the same number of nodes of the datagraph, requir-

ing an important computational effort. Another issue

found in the EASE approach is the need for comput-

ing a score DB, which requires to analyze all possible

paths between two nodes belonging to a Steiner Tree.

These issues imply a very high computational effort

and time consuming process, especially in the cases

where a Minimal Steiner Graph contains hundred

of nodes and arcs. BLINKS(He and Wang, 2007),

BANKS(Bhalotia and Hulgeri, 2002), EASE(Li and

Feng, 2008a) and CSTREE(Li and Feng, 2009) are

examples of this approach.

The Candidate Networks based methods use the

database or XML schema to identify answers. A Can-

didate Network is a subtree that involves tuple sets

with occurrences of the query keywords, where the

keywords can be leafs nodes or the root of the sub-

tree. They ﬁrst generate candidate networks follow-

ing the primary-foreign-key relationships, and then

compute answers composed of relevant tuples based

on Candidate Networks. Each Candidate Network

would be generated and evaluated as a good answer.

The problem of evaluating all Candidate Networks

is a multi-query optimization problem and has two

main issues: (1) How to share common subexpres-

sions among Candidate Networks generated in order

to reduce computational cost evaluation and (2) How

to ﬁnd a proper join order to fast evaluate all Can-

didate Networks(Xu and Qui, 2010), this is because

for a keyword query, the number of Candidate Net-

works can be very large. DISCOVER(Hristidis and

Papakonstantinou, 2002) and SPARK(Luo and Lin,

2007) are examples of this approach.

http://www.imdb.com/interfaces (1,000,000 Data Set

(475 MB)).

DATA2012-InternationalConferenceonDataTechnologiesandApplications

These ﬁrst two methods discover the connec-

tions (primary/foreign keys) between tuples on the

ﬂy. As there are usually large numbers of tuples in

a database, these methods are rather expensive to ﬁnd

answers and neglect that relevant tuples can be iden-

tiﬁed, pre-computed, materialized and indexed off-

line(Li and Feng, 2008b).

The Tuple Units are proposed by Li et. al. in

(Li and Feng, 2008b). A Tuple Unit is composed

of relevant tuples connected by primary-foreign-key

relationships. It is a set of highly relevant tuples

which contain query keywords. The tuple units are

generated and materialized off-line in order to ac-

celerate the on-line processing of keyword queries.

EKSO(Su and Widom, 2005) and RETUNE(Li and

Feng, 2008b) are examples of this approach. Fol-

lowing the RETUNE approach proposed by (Li and

Feng, 2008a). We tried to make our own implemen-

tation in order to make a real comparison with this

approach. Unfortunately, it was not possible to cre-

ate the score table that contains the posting list that

the authors created for a single keywords. They men-

tioned the use of MYSQL in this process. However

MYSQL is limited to 4096 columns per table or ev-

ery table has a maximum row size of 65,535 bytes

In this process, at leas 4170 columns to store terms

and 126,505 columns to store keyword pairs were re-

quired. This situation made impossible to reproduce

those experiments.

3 PROBLEM

3.1 Motivation

Nowadays information becomes the most critical and

valuable asset for business. Providing a uniform in-

terface for querying and retrieving information re-

gardless of its structure, is not a trivial task. Hav-

ing access to a large set of structured data or het-

erogeneous data has been one of the most impor-

tant challenges facing the current database and in-

formation recovery/retrieval projects (Li and Feng,

2008a),(Chaudhuri and Ramakrishnan, 2005),(Abite-

boul and Allard, 2008). It Is very common to ﬁnd

information on the web by typing some keywords

in a search engine (using IR techniques). However,

some times the returning results may not contain rel-

evant data. Current search engines are mainly fo-

cused on exploring unstructured information (docu-

ments). In this way, they are losing relevant informa-

http://dev.mysql.com/doc/refman/5.0/en/column-

count-limit.html

tion that could be found in different data sources like

semi-structured or structured. To have access to struc-

tured or semistructured data, users need to know the

schema of the database or to learn some formal lan-

guage like SQL, XQuery (Li and Feng, 2009). Fur-

thermore, the IR-style ranking model implemented in

current search engines ignores the information struc-

ture (meta-information) that can be extracted from

different data sources. This meta-information can be

useful for determining the relevance of that informa-

tion(Li and Feng, 2008a). The structure information

awareness facilitates the assembling of pieces of data

that are interconnected at different relations or XML

documents. These pieces of data can represent more

relevant answers that could be provided by a search

engine.

The mentioned situation has motivated us to pro-

pose a novel Virtual Document Approach for Keyword

Search in Databases. This approach makes an efﬁ-

cient query processing and uses a score based on the

classical Information Retrieval metrics. Furthermore

it takes into account the structure extracted from the

structured information. Our approach identiﬁes the

most relevant and meaningful tuples to answer key-

word queries. We model structured data as graphs,

where nodes can be tuples, and edges can be primary-

foreign-key relationships. A data index enables efﬁ-

cient keyword search over databases using an IR style

that captures the structural relationships of the data.

This structure is kept implicitly in what is called Vir-

tual Documents..

The main contributions of this paper are:

• A novel and efﬁcient keyword search method for

structured information.

• A new method for the construction of virtual doc-

uments from structured data.

• A deﬁnition of a data index that stores the struc-

tural relationships taken from the data sources.

• The analysis of the issues related to indexing

and ranking, and a simple and efﬁcient indexing

mechanism to index the structural relationships

between the transformed data.

• The development of an effective ranking tech-

nique to rank the virtual documents by taking into

account both the structural relationships included

in the relational and XML databases and the tex-

tual relevancy using a text retrieval approach.

4 OUR APPROACH

Our Virtual Document Approach is a novel approach

for keyword search on databases that takes some ideas

AVirtualDocumentApproachforKeywordSearchinDatabases

from the RETUNE project (Li and Feng, 2008b) but

implementing important differences. The ﬁrst differ-

ence is that we manipulated the information in a data

graph, instead of using relational tables. The second

difference is that RETUNE uses SQL to extract the tu-

ple units and our approach uses the data graph to iden-

tify the Tuple Units that are converted into something

called Virtual Documents. These documents are used

for creating a data index that takes into account the

meta-information included in structured data without

including redundant information.

4.1 Notations

This section introduces some notations for clarity of

this paper. In our approach we model structured data

as an undirected graph, where the nodes are tuples,

and the edges are primary-foreign-key relationships.

A data Graph can be deﬁned as follows:

Deﬁnition 1. (Data Graph) A Data Graph G is a pair

(V, E), where V is a set of vertices, and E is a set of

edges between the vertices E ⊆ {(u, v)|u, v ∈ V}

We need a database to identify Tuple Units and

transform them into Virtual Documents. Given a

database D with n tables, R

, R

..., R

, R

−→

denote

that R

has a foreign key k which refers to the pri-

mary key of R

. If two relational tables R

and R

are

connected, are denoted as, R

⇔ R

, if i) R

−→

; or

ii) R

−→

; or iii) ∃R

, R

⇔ R

and R

⇔ R

. In this

approach, we suppose that each table is pointed out

by other relations that act as link table. A relational

table R

is called a link table if there is not relation

table R

, such that, R

−→

. That is, R

only con-

tains foreign keys to connect other relational tables

but it does not contain any primary key (Li and Feng,

2008b). For example, consider the database with

three tables in the ﬁgure 1, we have Rating

UID

−→

User

and Rating

MID

−→

Movie; and User ← Rating → Movie.

Rating is a link relational table as it has no primary

key.

4.2 Tuple Units

Since the database is modeled as a data graph, we can

identify a Tuple Unit following the primary and for-

eign keys into the data graph. We have to choose a

root node of the data graph and the set of neighbors or

adjacent nodes connected by primary or foreign key,

will be called a Tuple Unit. Next this Tuple Unit is

transformed in a set of Virtual Documents. Our ver-

sion of Tuple Units is deﬁned as follows:

Deﬁnition 2. (Tuple Unit (TU)) Given a node v

∈ V

a Tuple Unit from the node v

, are composed of the

Figure 1: An example database.

Figure 2: The graph model for the example database in Fig-

ure 1.

set of neighbors of the node v

, and can be deﬁned as

TU(v

) = {u|u ∈ neighbors(v

) and u ∈ V}. The node

TU(v

) is called the root node from the Tuple Unit of

To better understand the concept of tuple Tuple

Units (TU), we give the Example 1.

Example 1. Consider the database in the ﬁgure 1

where Rating is a link table. First we model as a data

graph following the primary and foreign keys as illus-

trated in ﬁgure 2. With this graph we can iteratively

get each node in the graph and obtain the TU. For

example, consider the nodes u1, u3 and m2. We can

get the Tuple Units Tu1, Tu3 and Tm2 respectively, as

shown in ﬁgure 3

The Tuple Unit based method has the follow-

ing features: (1) TU is effective to answer keyword

Figure 3: Tuple units from the nodes u1, u3 and m2.

DATA2012-InternationalConferenceonDataTechnologiesandApplications

Figure 4: Virtual documents from the nodes u1 and m2.

queries as they capture structures and can represent

a meaningful and integral information unit. (2) The

relationships between tuples connected through pri-

mary/foreign keys can be identiﬁed and indexed, so

we can efﬁciently answer keyword queries by using

such indexed structural information. (3) The number

of TU will not be large, which is not larger than of the

total tuples in the underlying database(Ding and Xu,

2007).

Once we identify the TU for each node in the

graph, we need to transform it into a Virtual Docu-

ment by dividing the TU into a set of documents. We

indeed create a new document for each neighbors of

the node, and all these documents together compose a

single Virtual Document. The new Virtual Document

only contains the textual attributes from TU, the idea

is to reduce the size of virtual documents by avoiding

redundant information. We can also include non tex-

tual attributes, such as numbers and dates, taking their

string representation. A similar approach to virtual

documents was proposed by (Su and Widom, 2005).

The set of virtual documents generated will feed the

index. The ﬁgure 4 shows the virtual documents gen-

erated from the Tuple Units Tu1 and Tm2.

4.3 Ranking

Section 4.2 described how Tuple Units were identi-

ﬁed and transformed into Virtual Documents(VD). In

this section, we ﬁrst discuss how to meaningfully in-

dex and rank the Virtual Documents and then to iden-

tify the top-k answer based on existing proposals. The

generated VDs include the structural relationships be-

tween input keywords with respect to a Tuple Unit.

The simplest way to score and rank the virtual

documents is by the use of the TF · IDF metric.

We can take the terms in the virtual documents as

keywords, following a similar technique of indexing

and ranking used in DISCOVER2 (Hristidis and Gra-

vano, 2003), SPARK (Luo and Lin, 2007), EASE (Li

and Feng, 2008a) and by (Fang and Clement, 2006),

where the terms can be used to answer keyword-based

queries over Virtual Documents. The idea is ﬁrst as-

signing to each Virtual Document a score using a stan-

dard IR-ranking formula, and then to combine the in-

dividual scores using a score aggregation function,

such as SUM, to obtain the ﬁnal score.

Assuming that we have calculated the total set of

the virtual documents denoted as D, there are N dif-

ferent virtual documents and K keywords in D. Given

a virtual document denoted as d ∈ D and a keyword

contained in d. We set nt f(k

, d) as the normalized

term frequency of k

in a document d, and it is deﬁned

in Equation 1. Where t f (k

, d) represents the number

of occurrences of the term k

in a document d.

nt f(k

) = 1+ ln(1+t f(k

, d)) (1)

We set id f(k

) as the inverse document frequency

of k

deﬁned in equation 2. Where d f

is the number

of virtual documents that a term k

occurs in D. Id f

is normalized by dividing the total number of Virtual

Documents N over (d f

+1) and then applying the ln

function.

id f(k

) = ln

d f

+ 1

(2)

Document Normalized Length ndl(d) is deﬁned

in Equation 3, and represents the normalized docu-

ment length, that is the number of terms in the doc-

ument d over the average of terms in the set of doc-

uments D. ndl is used to reduce the term weights in

long documents. |d| denotes the number of terms in

a document d. And s is a constant taken from IR lit-

erature(Fang and Clement, 2006) and is usually set to

0.2.

ndl(d) = (1− s) + s∗

|d|

∑

′

∈D

′

(3)

In the IR literature, ranking methods usually com-

bine the three metrics (tf, idf and ndl) in a TF · IDF

for a keyword k

in a Virtual Document d as illustrated

in equations 4.

TF · IDF(k

, d) =

nt f(k

)

ndl(d)

∗ id f(k

) (4)

Although the TF · IDF-based ranking methods are

efﬁcient for textual documents, they are inefﬁcient for

semi-structured and structured data and do not take

into account the implicit structured information. In

our approach, Virtual Documents capture some struc-

tural information, which is easy to ﬁnd in relational

databases and XML documents. Representing the

rich structural relationships by the Virtual Documents

should be at least as important as discovering more

keywords, and in some cases, even more crucial.

Therefore, given a keyword query K = k

, k

, ...k

and a Virtual Document d we can compute the score

of a virtual document d with respect to a keyword

query K by summing the TF · IDF for each keyword

∈ K as it is deﬁned in equation 5.

SCORE(K, d) =

∑

k=1

TF · IDF(k

, d) (5)

AVirtualDocumentApproachforKeywordSearchinDatabases

With the compute of the SCORE, we are allowed

to answer queries and rank the relevant virtual doc-

uments, taking into consideration the classical Infor-

mation Retrieval metrics and including in this score

the structural information extracted from the relation-

ships of a Tuple Unit.

4.4 Indexing

To efﬁciently retrieve scores of the documents, we

propose a virtual document inverted index (VDII),

which is similar to the traditional inverted index. The

entries of VDII are also the keywords that are con-

tained in the virtual documents. VDII is similar to

inverted indexes since each entry k

keeps the virtual

documents that directly contain the keyword, and the

corresponding score. The virtual documents with re-

spect to the entry k

are sorted by SCORE(k

, d) in

descending order, where d ∈ {d

contains the key-

word k

}. We indexingall the virtual documents using

the Apache Hadoop Framework

. Hadoop is an open

source framework for writing and running distributed

applications that process large amounts of data using

commodity hardware(Lam, 2011). Hadoop uses an

implementation of Map-Reduce programming model.

4.5 Query Processing

With the VDII index, we can answer a keyword query

as follows: given a keyword query K = k

, k

, ...k

we ﬁrst retrieve the inverted lists of I

(1 ≤ i ≤ n),

which is composed of the virtual documents that con-

tain keyword k

. Then, we compute the score of each

relevant virtual document adding the SCORE of each

keyword, next we apply a ranking function to ﬁnally

return the top− k answers with the highest scores.

5 RESULTS

5.1 Experimental Scenario

In order to evaluate the effectiveness and the efﬁ-

ciency of our proposal, we have conducted extensive

experiments on real datasets. We compared search

efﬁciency and results quality with existing state of

the art algorithms BLINKS(He and Wang, 2007),

EASE(Li and Feng, 2008a) and RETUNE(Li and

Feng, 2008b). We used the Internet Movie DataBase

(IMDB) to evaluate our approach. IMDB contains ap-

proximately one million anonymous ratings of 3883

movies made by 6040 users, each user has rated at

http://hadoop.apache.org/

Table 1: IMDB dataset.

Tables Attributes No. of Records

User UID, UserName,

Gender, OID

6,040

Movie MID, title, genres 3,883

Occupation OID, Occupation 30

Rating UID, MID, Score 1,000,209

least 20 movies. IMDB dataset are translated into four

tables as illustrated in Table 1.

We model the data set as undirected graph.

This dataset generate about 1,010,153 vertex and

2,006,458 arcs. The elapsed time to indexing IMDB

and build the VDII index is 741 seconds approxi-

mately and the size of the index is close to 355 MB.

The VDII index includes 4,170 different keywords of

the dataset (without stop words) appearing in at least

one Virtual Document and 13,766,755 postings, with

an average of 3301 posting per keyword, and a max-

imal of 713,160 postings for a keyword. The elapsed

time required to load the index in memory and start to

query it requires only 108 seconds approximately.

All the algorithms were coded in Java. All the

experiments were conducted on a computer with an

Intel(R) Core (TM) 2@2.33 GHz, CPU, 4 GM RAM

running Ubuntu 11.10.

5.2 Search Efﬁciency

This section evaluates the search efﬁciency of vari-

ous algorithms. We selected one hundred keyword

queries with several number of keywords (2 to 6). To

form a query, we ﬁrst selected a random number of

keywords from the movietitle, next we selected a ran-

dom number of keywords from the Movie’s genre. Fi-

nally, we selected a random number of keywords from

occupation of the evaluator of movies. We group the

queries that have the same number of keywords and

then compute the average of the elapsed time. Table

2 shows various sample queries. The ﬁgure 5 illus-

trates the elapsed time from the queries. We can note

that the average of the elapsed time increases with re-

spect to the value of the number of keywords. This

is because we need to ﬁnd more virtual documents to

calculate the ﬁnal score. For example, in a query with

four keywordswe get 132 ms to identify the top−100

results and compute the ﬁnal score. We can also ob-

serve that our algorithm produces better results than

the existing methods BLINKS, EASE and RETUNE.

VDII clearly reach high efﬁciency in the search, this

is because our approach does not need to identify an-

swers by discovering the relationships between tuples

that appears in different relational tables on the ﬂy, or

DATA2012-InternationalConferenceonDataTechnologiesandApplications

Table 2: Several sample queries employed in the experi-

ments.

Examples of queries

herbie bananas

super mario 1993

toy story 2 animation

back future 1990 western writer

star wars iv hope adventure retired

500

1000

1500

2000

2 3 4 5 6

Elapsed Time (ms)

No. of keywords

EASE

BLINKS

RETUNE DB+IR

VDII

Figure 5: Search efﬁciency using 2 to 6 keywords.

using SQL, since we include that relationships in the

virtual documents in the VDII index. VDII is faster

than EASE, this is because EASE spend some time

extracting the Steiner Graphs and eliminating the non-

Steiner nodes. The elapsed time of BLINKS increases

considerably when adding more keywords, compared

with the elapsed time of VDII that varies a few ms,

achieving higher search efﬁciency. This difference is

due to the use of VDII index to identify the answers in

our approach. VDII takes less than 320 ms to answer

keyword queries with six keywords; EASE takes 420

ms and BLINKS uses more than 1,200 ms. Clearly

VDII outperforms BLINKS and is also signiﬁcantly

faster than EASE.

To better evaluate performance of our approach,

we identiﬁed the top-k answers with different values

of k and compared the corresponding elapsed time.

The ﬁgures 6 and 7 show the experimental results.

We can see that our approach outperforms the existing

methods BLINKS and EASE signiﬁcantly. For exam-

ple in the Figure 6 VDII takes 121 ms to identify the

top 10 answers, while BLINKS and EASE take more

than 200 ms and 1400 ms respectively. Furthermore,

with the increase of the number of top − k answers

the elapsed time of VDII varies slightly and always

100

1000

10000

top−1 top−5 top−10 top−20 top−50 top−100

Elapsed Time (ms)

Top k

BLINKS EASE VDII

Figure 6: Top-k search efﬁciency using 4 keywords.

100

1000

10000

top−1 top−5 top−10 top−20 top−50 top−100

Elapsed Time (ms)

Top k

BLINKS EASE VDII

Figure 7: Top-k search efﬁciency using 5 keywords.

achieves high efﬁciency on the search.

5.3 Search Accuracy

This section evaluate the search accuracy using stan-

dard precision. The lack of a benchmark that allows

us to verify the precision of the obtained results for

a set of queries motivated us to deﬁne a basic strat-

egy for measuring the accuracy. Document relevancy

is measured by using two constraints: First, a docu-

ment is relevant if the query keywords appear in some

of its textual attributes. Second, it is also relevant

if these keywords are located in a semantically re-

lated attribute. This is not an automatic validation,

for example, if we wanted to ﬁnd the rates given by

AVirtualDocumentApproachforKeywordSearchinDatabases

100

2 3 4 5 6

Top−k Precision (%)

BLINKS

EASE

RETUNE DB+IR

VDII

Figure 8: Top-k search precision.

Writers to the movie Doctor Dolittle (keywords: Doc-

tor Dolittle Writer), a document would be relevant

if the keywords Doctor Dolittle are appearing in the

movie name attribute and the keyword Writer appears

in the occupation attribute. In this case, it is easy to

see that false positive documents could be those in-

cluding movies that were rated by Doctors. So far,

this validation is carried out manually. We also used

the metric top-k precision, which measures the ratio

of the number of relevant answers among the ﬁrst

k answers with the highest scores. Figure 8 shows

the experimental results. In this ﬁgure we can ob-

serve that our approach obtains higher search accu-

racy and outperforms the existing methods. VDII is

better than BLINKS in various queries with different

number of keywords. This is because our approach

includes structural information in the index construc-

tion when applying the TF · IDF based IR technique.

VDII also outperforms EASE and RETUNE in terms

of precision.

Figures 9 and 10 illustrate the experimental re-

sults, which show that VDII achieves higher preci-

sion than BLINKS and EASE on different values of

k. We can see that with the increase in the k values,

the top−k precision of BLINKS falls while VDII can

always achieve high precision. This conﬁrm the efﬁ-

ciency of our method.

We need to make other tests with VDII to compare

the results with other approaches, however, we can

observe that VDII achieves high search efﬁciency and

quality for keyword search.

100

top−1 top−5 top−10 top−20 top−50 top−100

Top−k Precision (%)

BLINKS EASE VDII

Figure 9: Top-k search precision using 4 keywords.

100

top−1 top−5 top−10 top−20 top−50 top−100

Top−k Precision (%)

BLINKS EASE VDII

Figure 10: Top-k search precision using 5 keywords.

6 CONCLUSIONS AND FUTURE

WORK

Searching information has become an indispensable

component in our lives. There is a need to ﬁnd infor-

mation in data sources with heterogeneous structure

including databases. In this paper, we have studied

the problem of effective keyword search over struc-

tured data and present a Virtual Document Approach

for Keyword Search in Databases to make an efﬁ-

cient processing of keyword queries. Our approach is

based on the classical Information Retrieval metrics

taking into account the meta-information extracted

from structured information. We model structured

DATA2012-InternationalConferenceonDataTechnologiesandApplications

data as graph and have proposed the use of a data in-

dex (VDII) to capture the structural relationships for

fast and accuracy response. Finally, we have con-

ducted some experiments to evaluate the efﬁciency

and effectiveness of our approach using real data sets.

This experiments show that our approach achieves

high search efﬁciency and quality for keyword search

and is capable to scale with databases with tens of

millions of tuples.

Our future work includes the need to continue the

testing of our approach with other datasets including

semi-structurated and unstructured data. We also have

to continue working on some strategies to reduce of

the size of the index for example by the using of Mu-

tual Information.

ACKNOWLEDGEMENTS

This research was partially supported by project num-

ber 187325 from Fondo Mixto Conacyt-Gobierno del

Estado de Tamaulipas.

REFERENCES

Abiteboul, S. and Allard, T. (2008). Webcontent: Efﬁcient

p2p warehousing of web data.

Achiezra, H. and Golenberg, K. (2010). Exploratory key-

word search on data graphs. In Proceedings of the

2010 international conference on Management of data

(SIGMOD), pages 1163–1166. ACM.

Agrawal, S. and Chaudhuri, S. (2002). Dbxplorer: A system

for keyword-based search over relational databases. In

Proceedings of the 18th International Conference on

Data Engineering, ICDE ’02. IEEE Computer Soci-

ety.

Bao, Z. and Lu, J. (2010). Towards an effective xml key-

word search. IEEE Transactions on Knowledge and

Data Engineering, 22(8):1077–1092.

Bhalotia, G. and Hulgeri, A. (2002). Keyword searching

and browsing in databases using banks. In Proceed-

ings of the 18th International Conference on Data En-

gineering, ICDE ’02, pages 431–440.

Chaudhuri, S. and Ramakrishnan, R. (2005). Integrating

db and ir technologies: What is the sound of one

hand clapping. In Innovative Data Systems Research

(CIDR), pages 1–12.

Ding, B. and Xu, J. (2007). Finding top-k min-cost con-

nected trees in databases.

Dong, X. and Halevy, A. (2007). Indexing dataspaces. In

Proceedings of the 2007 ACM SIGMOD international

conference on Management of data, SIGMOD ’07,

pages 43–54. ACM.

Du, D. and Hu, X. (2008). Steiner Tree problems in Com-

puter Communication Networks. World Scientiﬁc

Publishing.

Fang, L. and Clement, Y. (2006). Effective keyword search

in relational databases. In Proceedings of the 2006

ACM SIGMOD international conference on Manage-

ment of data, SIGMOD ’06, pages 563–574. ACM.

Feng, J. and Li, G. (2011). Finding top-k answers in

keyword search over relational databases using tuple

units. IEEE Transactions on Knowledge and Data En-

gineering Volume, 23:1781–1794.

Franklin, M. and Halevy, A. (2005). From databases to

dataspaces: A new abstraction for information man-

agement. SIGMOD Record, 34:27–33.

He, H. and Wang, H. (2007). Blinks: ranked keyword

searches on graphs. In Proceedings of the 2007 ACM

SIGMOD international conference on Management of

data, SIGMOD ’07, pages 305–316. ACM.

Hristidis, V. and Gravano, L. (2003). Efﬁcient ir-style key-

word search over relational databases. In Proceed-

ings of the 29th international conference on Very large

data bases - Volume 29, VLDB ’2003, pages 850–861.

VLDB Endowment.

Hristidis, V. and Papakonstantinou, Y. (2002). Discover:

Keyword search in relational databases. In Proceed-

ings of the 28th international conference on Very

Large Data Bases, pages 670–681. VLDB Endow-

ment.

Hristidis, V. and Papakonstantinou, Y. (2003). Keyword

proximity search on xml graphs. In Proceedings. 19th

International Conference Data Engineering, pages

367–378.

Kacholia, V. and Pandit, S. (2005). Bidirectional expansion

for keyword search on graph databases.

Kimelfeld, B. and Sagiv, Y. (2008). Efﬁciently enumerating

results of keyword search over data graphs. Informa-

tion Systems, 33:335–359.

Lam, C. (2011). Hadoop in Action. Manning Publications

Co.

Li, G. and Feng, J. (2008a). Ease: an effective 3-in-1 key-

word search method for unstructured, semi-structured

and structured data. In Proceedings of the 2008 ACM

SIGMOD international conference on Management of

data(SIGMOD), pages 903–914.

Li, G. and Feng, J. (2008b). Retrieving and materializ-

ing tuple units for effective keyword search over re-

lational databases. In Lecture Notes in Computer Sci-

ence, Conceptual Modeling - ER, pages 469–483.

Li, G. and Feng, J. (2009). Providing built-in keyword

search capabilities in rdbms.

Luo, L. and Lin, X. (2007). Spark: top-k keyword query in

relational databases. In Proceedings of the 2007 ACM

SIGMOD international conference on Management of

data, SIGMOD ’07, pages 115–126. ACM.

M. Karnstedt, K. S. (2008). A dht-based infrastructure for

ad-hoc integration and querying of semantic data. In

Proceedings of the 2008 international symposium on

Database engineering and applications, pages 19–28.

Park, J. and goo Lee, S. (2011). Keyword search in rela-

tional databases. Knowl. Inf. Syst, 26(2):175–193.

Su, Q. and Widom, J. (2005). Indexing relational database

content ofﬂine for efﬁcient keyword-based search. In

AVirtualDocumentApproachforKeywordSearchinDatabases

Proceedings of the 9th International Database Engi-

neering and Application Symposium (IDEAS), pages

297–306.

V. Hristidis, N. K. (2006). Keyword proximity search in

xml trees. IEEE Transactions on Knowledge and Data

Engineering, pages 525–539.

Xu, J. and Qui, L. (2010). Keyword search in relational

databases: A survey. Bulletin of the IEEE Com-

puter Society Technical Comittee on Data Engineer-

ing, 33:67–78.

Zhong, M. and Liu, M. (2009). Efﬁcient keyword prox-

imity search using a frontier-reduce strategy based on

d-distance graph index. In Proceedings of the 2009

International Database Engineering & Applications

Symposium (IDEAS), pages 206–216. ACM.

DATA2012-InternationalConferenceonDataTechnologiesandApplications