Building a Query Engine for a Corpus of Open Data
Mauro Pelucchi, Giuseppe Psaila and Maurizio Toccu
University of Bergamo - DIGIP, Viale Marconi 5, I-24044 Dalmine (BG), Italy
Keywords:
Retrieval of Open Data, Blind Querying, Single Item Extraction.
Abstract:
Public Administrations openly publish many data sets concerning citizens and territories in order to increase
the amount of information made available to people, firms and public administrators. As a result, Open Data
corpora have become so huge that it is impossible to deal with them by hand; consequently, it is necessary
to use tools that include innovative techniques able to query them.
In this paper, we present a technique to select open data sets containing specific pieces of information and
retrieve them from a corpus published by an open data portal. In particular, users can formulate structured
queries blindly submitted to our search engine prototype (i.e., being unaware of the actual structure of data
sets). Our approach reinterprets and mixes several known information retrieval approaches, giving at the same
time a database view of the problem. We implemented this technique within a prototype, which we tested on a
corpus containing more than 2000 data sets. We observed that our technique provides more focused results w.r.t.
the baseline experiments performed with Apache Solr.
1 INTRODUCTION
Public Administrations are publishing more and more
data sets concerning their activities and citizens.
Since these data sets are open, i.e., publicly available
to any interested person or organization, they
are called Open Data. The reader should not confuse
the concept of Open Data for Public Administration
(which is the context of this paper) with the concept
of Linked Open Data, i.e., the ability of describing the
content of web sites linked to each other through RDF
(Resource Description Framework).
Public Administrations extended their web sites
with Open Data Portals; typically (see (Höchtl and
Lampoltshammer, 2016)) they exploit two frameworks
named Socrata and CKAN, which can be viewed
as CMSs (Content Management Systems) to support
the publication of data sets within an Open Data Portal.
Data sets published by Open Data Portals are
structured data sets: in the simplest form, they are
CSV files with a fixed number of fields, with field
names reported in the first line. Alternatively, they are
often represented as vectors of JSON objects without
nesting (they must be represented in a flat way), and
(less frequently) as structured XML documents (without
exploiting the nesting capabilities of XML). In practice,
data sets are always thought of as tables: an item in
the data set (a CSV row, JSON object or XML element)
represents a table row.
Nowadays, the number of data sets published by a
single Open Data Portal is very large and is continu-
ously increasing. Often, a corpus published by one
single publishing system can contain several thou-
sands of data sets.
People (typically analysts) who need to find the
right data sets among the mass of available data sets
in an Open Data Portal have to work hard to find what
they are looking for. They need a powerful search tool to
query such a corpus.
We observed that users do not know, a priori,
the actual structure of the thousands of heterogeneous
data sets collected in the corpus. So, they can only
blindly query the corpus, making some hypotheses
about property or field names. Furthermore, we also
observed that often they are not even interested in an
entire data set, but only in some rows (items) among
those contained in the data set, i.e., only those sat-
isfying a given selection condition. For example, a
user/analyst might want to get information about high
schools located in a given city named “My City”,
being interested in their name, address and reputa-
tion. The query mechanism should be able to focus
the search on those data sets that actually contain the
items of interest, and retrieve them. However, at the
same time, it should be able to extend the search to
those data sets that could potentially contain items of
interest, by considering similar names in the corpus
catalog (i.e., the list of data sets in the corpus and their
fields). For example, it should also consider a data set with field "city name", or simply "city", in place of "cityname".
Therefore, in this paper, we propose a technique
for querying a corpus of Open Data Sets on the basis
of queries that users blindly formulate on the corpus.
The technique mixes a query expansion mechanism,
which relies on string similarity matching, and a key-
word selection mechanism based on the concepts of
informativeness and representativeness.
The paper is organized as follows. Section 2 gives
some preliminary definitions. Section 3 formally de-
fines the problem we are dealing with. In Section 4,
the overall query process is presented. Then, Sec-
tion 5 presents the query assessment step and the key-
word selection step. Section 6 introduces the query
expansion step. Finally, Section 7 presents the steps
concerning open data set retrieval and filtering. The
experimental evaluation is shown in Section 8, while
Section 9 discusses some related work. Section 10
draws the conclusions.
2 PRELIMINARY DEFINITIONS
In order to facilitate the comprehension of the paper, we
give some preliminary definitions.
Definition 1: Data Set. An Open Data Set ods is
described by a tuple
ods: <ods id, dataset name, schema, metadata>
where ods id is a unique identifier, dataset name is
the name of the data set (not unique); schema is a
set of field names; metadata is a set of pairs (label, value),
which are additional meta-data associated
with the data set.
Definition 2: Instance and Items. With Instance
we denote the actual content of a data set. It is a set
of items, where an item is a row of a CSV file or data
table, or a flat object in a JSON vector.
Many open data portals provide open data sets in
several formats. Commonly, they adopt CSV (comma
separated values), but the adoption of JSON (JavaScript
Object Notation) is becoming frequent; XML
is also used, but less frequently.
The adoption of JSON is motivated by the fact
that it has now become the de facto standard for data
sharing in the world of Web Services. In fact, it is easy
to read, it is much lighter than XML, and JavaScript (the
programming language used for developing client-side
web applications) natively interprets JSON data.
However, in the open data world, it is used only
as an alternative representation for flat tabular data: a
CSV file can always be transformed into a vector of
JSON objects, where all objects in the vector have the
same structure, i.e., the same fields, and fields have
only simple values (there are no nested objects).
Consequently, JSON capabilities to describe
nested complex structures, as well as collections of
heterogeneous objects, are not exploited in this con-
text.
To illustrate, consider Figure 1. We report a
fragment of the open data set named Local Authority
Maintained Trees published on the London Data
Store, the official open data portal of the city of London.
Notice how a flat CSV file (in the upper part of
the figure) can be easily transformed into a vector of
JSON objects (lower part of the figure). As the reader
can see, the price to pay is that the JSON version is
more verbose, since field names must be repeated for
each object. However, with a view to integrating the
items extracted from several open data sets, having
different structures, into a unique collection of results,
JSON is the natural representation format for our tool.
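To make the flat-table view concrete, the following short Python sketch (the function name is ours, not part of the prototype) performs the CSV-to-JSON transformation illustrated in Figure 1; for simplicity, all values are kept as strings.

```python
import csv
import io
import json

def csv_to_json_vector(csv_text):
    """Read a flat CSV data set (field names on the first line) and return
    the equivalent vector of flat JSON objects, as in Figure 1."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(rows), indent=2)

sample = (
    "ID,Borough,Species name,Common name,Display name,Ox,Oy\n"
    "1,Barking ,ACER PSEUDOPLATANUS 'BRILLIANTISSIMUM',,Maple,548320.8,189593.7\n"
)
print(csv_to_json_vector(sample))
```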
Definition 3: Global Meta-data. The list of fields
of items in the open data set ods is ods.schema. We
call the triple <dataset name, schema, metadata>
the Global Meta-Data of the data set.
Definition 4: Corpus and Catalog. With C = {ods_1, ods_2, ...} we denote the corpus of Open Data sets.
The catalog of the corpus is the list of descriptions,
i.e., global meta-data, of open data sets in C (see Definition 1).
Definition 5: Query. Given a Data Set Name dn, a
set P of field names (properties) of interest P = {pn_1, pn_2, ...} and a selection condition sc on field values,
a query q is a triple q: <dn, P, sc>.
Definition 6: Query Term. With Query Term (term
for simplicity) we denote a data set name q.dn, a field
name appearing in q.P or in q.sc, or a constant appearing
in q.sc.
3 PROBLEM DEFINITION
Consider an open data corpus published by an Open
Data Portal, i.e., a system that publishes open data
sets.
Queries allow users to discover, in the corpus C,
those data sets, possibly having the specified name
q.dn or a similar one, that contain items (e.g., rows
or JSON objects) with the desired field names q.P (or
similar) and can be actually filtered on the basis of
CSV Version
ID,Borough,Species name,Common name,Display name,Ox,Oy
1,Barking ,ACER PSEUDOPLATANUS ’BRILLIANTISSIMUM’,,Maple,548320.8,189593.7
2,Barking ,TAXUS BACCATA FASTIGIATA,,Other,548297.9,189590.2
JSON Version
[
{
"ID": 1,
"Borough": "Barking ",
"Species name": "ACER PSEUDOPLATANUS ’BRILLIANTISSIMUM’",
"Common name": "",
"Display name": "Maple",
"Ox": 548320.8,
"Oy": 189593.7
},
{
"ID": 2,
"Borough": "Barking ",
"Species name": "TAXUS BACCATA FASTIGIATA",
"Common name": "",
"Display name": "Other",
"Ox": 548297.9,
"Oy": 189590.2
}
]
Figure 1: Fragment from the data set Local Authority Maintained Trees published in the London Data Store.
the selection condition q.sc, so that users can obtain
filtered items. By relying on the definition of a rel-
evance measure for data sets (in Section 7), we can
define the overall problem.
Problem 1: Given a Relevance Measure rm(ods) ∈ [0,1] of a data set ods ∈ C, and a minimum threshold
th ∈ [0,1], a data set ods is relevant if rm(ods) > 0 and rm(ods) ≥ th.
The result set RS = {o_1, o_2, ...} of query q is the set
of items (rows or JSON objects) o_i such that ods_j is a
relevant data set, o_i ∈ Instance(ods_j) and o_i satisfies
the selection condition q.sc.
Example 1: The sample query about schools in Sec-
tion 1 becomes:
q :<dn=Schools,
P= {Name, Address, Reputation},
sc= (City="My City" AND
Type="High School") >.
4 PROCESSING STEPS
The contribution of this paper is a technique to solve
Problem 1.
The proposed technique is built around a query
mechanism based on the Vector Space Model (Manning
et al., 2008), encompassed in a multi-step process
devised to deal with the high heterogeneity of
open data sets and with the blind query approach (the user
does not know the actual schema of data sets in the
corpus). Recall that, in our vision, the user performs
a blind query, but asks for specific field names and
for items (rows or JSON objects) that satisfy a precise
selection condition. The system is fully aware of data
set schemata, but due to their heterogeneity it is not
possible to rely only on exact matching. Thus, it is
necessary to focus the search, identifying those terms
in the query that mostly represent the query essence
and the capability of the query to select items within
data set instances. At the same time, it is necessary to
expand the search space, considering terms which are
similar to those in the query, in order to address the
fact that users are unaware of actual schemata of data
sets.
Moving from the above considerations, we de-
fined the steps performed by our technique.
Step 1: Query Assessment. Terms in the query
are searched within the catalog. If some term t
is missing, t is replaced with term t', the term in
the meta-data catalog with the highest similarity
score w.r.t. t. We obtain the assessed query q'.
Step 2: Keyword Selection. Keywords are selected
from terms in the assessed query q', in order to
find the most representative/informative terms for
finding potentially relevant data sets.
Step 3: Neighbour Queries. For each selected
keyword, a pool of similar terms present in
the meta-data catalog is identified, in
order to derive, from the assessed query q', the
Neighbour Queries, i.e., queries which are similar
to q'. Both q' and the derived neighbour queries are
in the set Q of queries to process. For a neighbour
query, the set of relevant keywords is obtained
from the set of relevant keywords for q' by
replacing original terms with similar terms.
Step 4: VSM Data Set Retrieval. For each query
in Q, the selected keywords are used to retrieve
data sets based on the Vector Space Model (Man-
ning et al., 2008): in this way, the set of possibly
relevant data sets is obtained.
Step 5: Schema Fitting. The full set of field names
in each query in Q is compared with the schema of
each selected data set, in order to compute the rel-
evance measure rm. Data sets whose schema bet-
ter fits the query will be more relevant than other
data sets.
Step 6: Instance Filtering. Instances of relevant
data sets are processed in order to filter out and
keep only the items (rows or JSON objects) that
satisfy the selection condition.
5 QUERY ASSESSMENT AND
KEYWORD SELECTION
In Step 1, query q is assessed against the meta-data
catalog, in order to check for the presence of the terms
of q: if all terms of q are present in the catalog, the
assessed query q' is the same query q; otherwise, the
assessed query q' is derived from q by substituting the
missing terms with the most similar and representative
terms in the catalog. Since the concept of representativeness
is introduced hereafter in this section,
we will formally define the notion of most similar and
representative term in Section 5.2.
5.1 Keyword Selection
Consider the assessed query q'. Terms appearing in q'
are not equally important/informative for the query. If
the set of terms used to retrieve possibly relevant data
sets contains terms that are not important, or much less
important than others, in Step 4 we can retrieve data
sets that may be not actually relevant for the query.
Thus, we adopted a keyword selection technique that
extracts, from the assessed query q', the set of (meaningful)
keywords K(q').
We were inspired by the technique introduced in
(Liu et al., 2006) for the context of classical search
engines. However, we introduced several variations
(structure of the graph and values of scores) to cope
with peculiarities of our problem.
The keyword selection technique proceeds as fol-
lows:
1. The Term Graph is built, where each node corresponds
to a term in the assessed query q' and
an edge connects two terms and weights their relationship
(see Figure 3 for a sample graph discussed
after the keyword selection algorithm reported in Figure 2).
2. An iterative algorithm:
(a) Chooses the most relevant term.
(b) Modifies the weights of not selected terms,
propagating a decreasing reduction of informa-
tiveness to neighbours of the selected term.
3. Only the most informative terms are kept and inserted
into the keyword set K(q').
Term Graph Construction. Each node in the Term
Graph corresponds to a term in the query. There are
three types of nodes: one Data Set Name Node, cor-
responding to the data set name q.dn; Field Nodes,
corresponding to field names in q.P or in q.sc; Value
Nodes, corresponding to constants appearing in com-
parison predicates in q.sc.
Nodes are connected to each other by an edge if
their terms are related to each other. In particular: a field
node is connected with the data set name node; two
field nodes are connected if they are the operands of
the same comparison predicate in q.sc; a field node
and a value node are connected if the field (described
by the field node) is compared with the constant value
(described by the value node) by a comparison predicate
in q.sc.
Each node (term) has two scores: a representative-
ness score, denoted as r score, and an informativeness
score, denoted as i score.
Definition 7: Representativeness Score. The representativeness
score r score denotes the capability of
the term to find a relevant data set, i.e., how much it
represents a specific subset of mapped data sets. In
particular, we consider two components for representativeness:
Intrinsic Representativeness and Extrinsic
Representativeness. The former is related to the presence
of the term in the query; we denote it as irep(t)
(where t is a term). The latter is related to the presence
of the term in the global meta-data of data sets, thus
it measures the actual capability of the term to focus
the search; we denote the extrinsic representativeness
as erep(t). Both irep and erep are defined in the range
[0,1].
The r score of a term is obtained by combining
the intrinsic and the extrinsic representativeness by
taking their average, i.e., r score = (irep(t) + erep(t))/2.
This way, both contribute and balance each other.
Intrinsic Representativeness. Depending on the role
played by a term in the query, we established different
values for the intrinsic representativeness.
- The intrinsic representativeness of the Data Set
Name is irep(t)= 1.0 if there exists a data set q.dn in
the corpus C; otherwise, irep(t)= 0.0.
- The intrinsic representativeness of a Field Name is
irep(t) = 0.5 if there exists at least one data set that has a
field named t; otherwise, irep(t) = 0.0.
- The intrinsic representativeness of string value
nodes and date value nodes is irep(t) = 0.5, while it
is irep(t) = 0.2 for numerical value nodes.
Extrinsic Representativeness. The approach we
adopted to define the extrinsic representativeness is
inspired by the idf (inverse document frequency) measure
(Manning et al., 2008). Its classical definition is
idf = log(D / d(t)), where D is the total number of documents,
while d(t) is the number of documents that
contain term t. The lower the number of documents
that contain t, the higher the value of idf, because term
t is better able to focus the search on relevant documents.
Starting from this idea, we defined the extrinsic
representativeness so that it is in the range [0,1] and
suitable for our context.
Definition 8: Normalized Data Set Frequency.
Consider the corpus C of open data sets and its cardinality
|C|. Consider a term t, the set Matched(t) of
data sets that contain term t in their global meta-data (such
that Matched(t) ⊆ C) and its cardinality |Matched(t)|.
The Normalized Data Set Frequency of term t is
defined as
ndf(t) = log_2(|Matched(t)| / |C| + 1)
which takes values in the range [0,1].
If term t is contained in the global meta-data of all
data sets, ndf(t) = 1 (the argument of the logarithm is
2). If term t is not contained in any global meta-data,
ndf(t) = 0, because the argument of the logarithm is 1.
Definition 9: Extrinsic Representativeness. Given
a term t and its normalized data set frequency ndf(t),
its Extrinsic Representativeness is defined as:
erep(t) = 1 − ndf(t)  if ndf(t) > 0
erep(t) = 0           if ndf(t) = 0
The rationale is the following: if a term is very
frequent, its extrinsic representativeness is close to 0,
because it is not able to focus the search. In contrast,
infrequent terms are able to focus the search, and
their extrinsic representativeness is close to 1.
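As a minimal illustration of Definitions 8 and 9 (the function names are ours), the normalized data set frequency and the extrinsic representativeness can be computed as follows, assuming the number of matching data sets and the corpus size are known:

```python
import math

def ndf(matched_count, corpus_size):
    """Normalized data set frequency: log2(|Matched(t)| / |C| + 1), in [0, 1]."""
    return math.log2(matched_count / corpus_size + 1)

def erep(matched_count, corpus_size):
    """Extrinsic representativeness: 1 - ndf(t) if ndf(t) > 0, else 0."""
    n = ndf(matched_count, corpus_size)
    return 1.0 - n if n > 0 else 0.0

# A term appearing in every data set cannot focus the search ...
print(erep(2000, 2000))            # 0.0
# ... while a rare term is highly representative.
print(round(erep(3, 2000), 3))     # 0.998
```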
Definition 10: Informativeness Score. The infor-
mativeness score i score measures the degree of in-
formation that could be obtained by selecting a term
as keyword. At the beginning, an initial value is set:
we chose i score= 1.0 for data set name nodes and
field nodes that appear in the selection condition q.sc.
We chose i score = 0.8 for value nodes and field nodes
appearing only in q.P. In fact, constants are naturally
less informative than data set names and field names;
furthermore, they are less likely to appear in the meta-data
of data sets, thus the data set retrieval step based
on VSM would be ineffective. The same holds for
field names not appearing in the selection condition:
they are less informative than those appearing in the
selection condition.
During the keyword selection process, the selec-
tion of a term affects the informativeness of neigh-
bours: in fact, selecting two neighbours may be re-
dundant. Consequently, the procedure will modify the
i score values of not selected terms.
Keyword Selection Algorithm. The algorithm is in-
spired by the algorithm reported in (Liu et al., 2006),
but modified because, in our term graph, edges are
neither labeled nor weighted. The main procedure of
the algorithm, named KeywordSel, is reported in Fig-
ure 2.
The algorithm works on the input sets of nodes
and edges; notice that nodes have an additional field
named marked, denoting selected nodes when true.
The main cycle from lines 2 to 13, at each step se-
lects the node not yet selected with the highest value
for i score + r score, provided that the sum is greater
than 1. This is because below this threshold the capa-
bility of retrieving significant data sets becomes too
low. The for loop from line 4 to line 8 looks for the
highest i score + r score value (and the correspond-
ing node) in not marked nodes. Then, the if instruc-
tion at line 9 verifies that the selected node (if any)
has the sum i score + r score greater than 1: if so, the
node is marked as selected, the corresponding term is
inserted into the keyword set K and the procedure Up-
dateIScore is called (line 12) to propagate a reduction
of informativeness.
Procedure KeywordSel
Input: N set of nodes, E set of edges
       a node is n = <t, i score, r score, marked = false>
Output: the set of selected keywords K
begin
1.  K = ∅;  // set of selected keywords
    do
2.    max score = 0.0;   // initial max score
3.    max node = null;   // no node selected
4.    for each node n ∈ N with n.marked = false do
5.      score = n.i score + n.r score;
6.      if score > max score then
7.        max score = score;
8.        max node = n;
        end if
      end for each
9.    if max node ≠ null and max score > 1.0 then
10.     K = K ∪ {max node.t};     // term selected as keyword
11.     max node.marked = true;   // node marked as selected
12.     call procedure UpdateIScore(N, E, max node);
      end if
13. while max score > 1.0;
end procedure
Figure 2: Procedure KeywordSel, that selects possibly relevant keywords.
Procedure UpdateIScore performs the update of
the i score, by propagating from the selected node.
The propagation mechanism can be described in a recursive way.
- Starting from the selected node s, for each node z'
connected with s, the reduction factor is rf(z') = s.r score / 2 / n,
where n is the number of nodes connected with s.
- Consider a node z that has just received a reduction factor
rf(z) and the set Out(z) of nodes connected to z
that have not yet received a reduction factor. For each
node z' ∈ Out(z), the reduction factor is rf(z') = rf(z) / 2
/ |Out(z)|. The propagation continues until all nodes
have a reduction factor.
- For each node z, the i score is decreased by the reduction
factor rf(z), i.e., z.i score = z.i score − rf(z)
(where the lower bound is z.i score = 0).
This way, the representativeness of the selected
node decreases the informativeness of its neighbours; the
reduction becomes lower and lower as the distance
from the selected node increases.
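The following Python sketch is our reading of the procedure in Figure 2 together with the propagation performed by UpdateIScore; the graph encoding (score dictionaries plus an adjacency list) and the function names are ours, not the prototype's.

```python
from collections import deque

def update_i_score(selected, i_score, adj, start_rf):
    """Propagate a decreasing reduction of informativeness (UpdateIScore),
    breadth-first from the selected node; start_rf = s.r_score / 2."""
    rf = {selected: 0.0}
    frontier = deque()
    out = [z for z in adj[selected] if z not in rf]
    for z in out:
        rf[z] = start_rf / len(out)          # rf(z') = s.r_score / 2 / n
        frontier.append(z)
    while frontier:
        z = frontier.popleft()
        out = [w for w in adj[z] if w not in rf]
        for w in out:
            rf[w] = rf[z] / 2 / len(out)     # halved and split among new neighbours
            frontier.append(w)
    for z, r in rf.items():
        i_score[z] = max(0.0, i_score[z] - r)   # lower bound is 0

def keyword_sel(nodes, adj, i_score, r_score):
    """Greedy keyword selection (procedure KeywordSel of Figure 2)."""
    K, marked = set(), set()
    while True:
        best, best_score = None, 0.0
        for n in nodes:
            if n in marked:
                continue
            score = i_score[n] + r_score[n]
            if score > best_score:
                best, best_score = n, score
        if best is None or best_score <= 1.0:   # stop below the relevance threshold
            return K
        K.add(best)
        marked.add(best)
        update_i_score(best, i_score, adj, r_score[best] / 2)
```

In the example of Figure 3, this loop would first select the node City and then propagate the reduction of i score to its neighbours.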
Figure 3.a reports the initial term graph for the (as-
sessed) query in Example 1. Figure 3.b shows how
i score values change, showing the propagation from
the selected node (City) to the other nodes.
5.2 Deriving the Assessed Query
Having defined the concept of extrinsic representativeness
for a term t (denoted as erep(t)), we can now
describe how query q is assessed into query q'.
- The assessed query q' is derived from query q by replacing
each term t not appearing in the meta-data of
data sets with a substituting term t'.
- First of all, we consider the notion of string similarity
metric between two terms t and t', denoted
as sim(t, t') ∈ [0,1]. We adopt the well-known Jaro-Winkler
metric (see (Winkler, 1999), (Jaro, 1989)).
- For a pair of terms t and t', we define the suitability
degree as suitability(t, t') = (sim(t, t') + erep(t'))/2,
i.e., a term t' is more suitable to replace the missing
term t if t' is as similar as possible to t and, at the
same time, it is as representative as possible of data
sets (extrinsic representativeness).
- A missing term t is replaced, in the assessed query q',
by the most suitable term t' in the catalog, i.e., the one having
the greatest suitability(t, t').
The rationale of this choice is simple: we want
to substitute missing terms with those that are more
likely than others to find relevant data sets w.r.t.
the query.
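A possible rendering of the assessment step is sketched below; the sim argument stands for any string similarity in [0,1] (the paper adopts Jaro-Winkler, for which several library implementations exist), erep_of maps a catalog term to its extrinsic representativeness, and the toy prefix-based similarity is used only to make the example self-contained.

```python
def suitability(t, t_prime, sim, erep_of):
    """suitability(t, t') = (sim(t, t') + erep(t')) / 2."""
    return (sim(t, t_prime) + erep_of(t_prime)) / 2.0

def assess_term(t, catalog_terms, sim, erep_of):
    """Keep t if it is in the catalog; otherwise replace it with the most
    suitable catalog term (Section 5.2)."""
    if t in catalog_terms:
        return t
    return max(catalog_terms, key=lambda c: suitability(t, c, sim, erep_of))

# Deliberately naive similarity (shared-prefix ratio), only to show the mechanics.
def prefix_sim(a, b):
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

catalog = {"cityname", "county"}
print(assess_term("city_name", catalog, prefix_sim, lambda t: 0.5))   # -> cityname
```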
6 DERIVING NEIGHBOUR
QUERIES
The assessed query q' could be not enough to find relevant
data sets for the user, because he/she writes the
query blindly w.r.t. data set meta-data. As an example,
if the user wants to query a field named latitude,
he/she is usually interested also in data sets having
a field named lat (a common short-cut for latitude). To make the
query process effective, we derive, from the assessed
query q', the neighbour queries nq_i, i.e., queries that
are close to q'. Both q' and the nq_i are collected into Q, the
set of queries to process.
Definition 11: Pool of Alternative Terms. Consider
a term t ∈ K(q'). We denote as Alt(t) the set of
alternative terms for t.
A term t' ∈ Alt(t) if: 1) t' is in the meta-data catalog;
2) its string similarity with t is greater than
or equal to a minimum similarity threshold th sim,
i.e., sim(t, t') ≥ th sim; 3) given max alt the maximum
number of alternative terms, t' is one of the max alt
top-most similar terms to t.
In other words, Alt(t) contains the max alt terms
which are most similar to t.
Figure 3: Initial state of the Term Graph (a) and after the first keyword selection (b). The triple in each node is (i score, irep, erep).

Definition 12: Potential Neighbour Query. Consider
the assessed query q', with its keyword vector
K(q') = [t_1, ..., t_n] and a vector of weights W(q'), where
W(q')[i] = 1 for each term t_i = K(q')[i]. A potential
neighbour query pnq and its keyword vector K(pnq)
are obtained as follows:
- |K(pnq)| = |K(q')|.
- One or more terms t_i in K(q') are replaced with
one term chosen from the set of alternative terms
t_i^j ∈ Alt(t_i). If t_i is not replaced, K(pnq)[i] = t_i; otherwise,
K(pnq)[i] = t_i^j. Correspondingly, the new potential
neighbour query pnq is obtained by performing
the same replacements in the assessed query q'.
- The vector of weights for the potential neighbour
query is as follows: if t_i is not replaced, W(pnq)[i] = 1;
otherwise, W(pnq)[i] = sim(t_i, t_i^j).
Any query obtained by replacement of one or
more terms is a potential neighbour query. However,
their number could be high and a query could
be too far away from the assessed query;
therefore, we have to prune those potential neighbour
queries that may be too distant, based on the cosine
similarity metric applied to the weight vectors.
Definition 13: Neighbour Queries. Given the assessed
query q' and a potential neighbour query p, p
is considered a neighbour query if CosineSim(W(q'),
W(p)) ≥ th neighbour, where th neighbour ∈ [0,1] is a
minimum threshold.
This way, the set Q of queries to process is populated
with the assessed query q' and all neighbour
queries.
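The generation and pruning of neighbour queries (Definitions 12 and 13) can be sketched as follows; the dictionary alt is assumed to hold the precomputed pools Alt(t), and sim is the same string similarity used for assessment. In practice the small size of the pools (max alt) and the pruning threshold keep the enumeration manageable.

```python
from itertools import product
from math import sqrt

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighbour_queries(keywords, alt, sim, th_neighbour):
    """Enumerate keyword vectors of potential neighbour queries (Definition 12)
    and keep those close enough to the assessed query (Definition 13).
    alt[t] is the pool Alt(t) of alternative catalog terms for keyword t."""
    w_q = [1.0] * len(keywords)                        # W(q'): all weights are 1
    choices = [[t] + list(alt.get(t, [])) for t in keywords]
    result = []
    for combo in product(*choices):
        if list(combo) == keywords:                    # skip q' itself
            continue
        w = [1.0 if c == t else sim(t, c) for t, c in zip(keywords, combo)]
        if cosine_sim(w_q, w) >= th_neighbour:
            result.append(list(combo))
    return result
```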
7 DATA SET RETRIEVAL
Each query nq ∈ Q has associated the vector K(nq) of
keywords selected in Step 2 (for the assessed query)
or in Step 3 (for neighbour queries). In Step 4, the
Vector Space Model (Manning et al., 2008) is applied
to retrieve possibly relevant data sets.
The VSM approach relies on the concept of inverted
index, created on the basis of terms in the
global meta-data of each data set. In our
context, the inverted index IN implements a function
f(t) → {ods id_1, ods id_2, ...} that, given a term t, returns
the set of identifiers of the data sets whose global meta-data
contain term t.
Consider a vector of keywords K(nq) = [k_1, ..., k_n].
Then, given a data set identifier ods id, the
occurrences of keywords associated with ods id in
the inverted index IN are represented by the vector
m(ods id) = [p_1, ..., p_n], where p_j = 1 if the data set
is associated with keyword k_j, p_j = 0 otherwise.
The Keyword-based Relevance Measure krm(ods, nq)
of a data set ods for a query nq ∈ Q is the average
of the p_j values. The maximum value of krm(ods, nq) is 1,
obtained when the data set is associated with all the
keywords, while values less than 1 but greater than 0
denote a partial matching.
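A minimal sketch of Step 4 (the names are ours): given the keyword vector and an inverted index mapping terms of the global meta-data to data set identifiers, krm is the fraction of query keywords matched by each data set.

```python
def krm_scores(keywords, inverted_index):
    """Keyword-based relevance measure: for each data set returned by the
    inverted index, krm is the fraction of query keywords it matches."""
    hits = {}
    for k in keywords:
        for ods_id in inverted_index.get(k, set()):   # f(k) -> {ods_id, ...}
            hits[ods_id] = hits.get(ods_id, 0) + 1
    return {ods_id: count / len(keywords) for ods_id, count in hits.items()}

# Toy inverted index over global meta-data terms (made-up identifiers).
index = {
    "schools": {"ds1", "ds2"},
    "city":    {"ds1", "ds3"},
    "name":    {"ds1"},
}
print(krm_scores(["schools", "city", "name"], index))
# ds1 matches all keywords (krm = 1.0); ds2 and ds3 match one each (krm ~ 0.33)
```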
7.1 Schema Fitting
This step of our technique computes, for each data set
retrieved in step 4, the final relevance measure rm, by
composing the keyword-based relevance measure krm
of a data set with the fitting degree of its schema w.r.t.
the field names in the query. This step is necessary
to identify, among all retrieved data sets, those having
an instance that can be effectively processed by the
selection condition nq.sc.
We adopt an approach based on a weighted inclu-
sion measure, where the weight of each field name
depends on the role played in the query by the field
name.
Definition 14: Schema Fitting Degree. Consider a
data set ods and its schema ods.schema = <pn_1, pn_2, ...>.
Then, consider a query nq ∈ Q and the set of
field names N = {n_1, ..., n_k} that appear in the selection
condition nq.sc and/or in the list of field names
of interest nq.P. Each name n_i ∈ N is associated with a weight
w_i ∈ [0,1] (see Definition 15), depending
on the role played by n_i in the query. The Schema
Fitting Degree sfd(ods, nq) is defined as:
sfd(ods, nq) = ( Σ_{i=1..|N|} w_i × p(n_i) ) / ( Σ_{i=1..|N|} w_i )
where function p(n_i) = 1 if there exists a field name
pn_j = n_i in ods.schema, p(n_i) = 0 otherwise.
The following definition specifies how field names are
weighted.
Definition 15: Weight. The weight w_i of a field name
n_i ∈ N is
w_i = max(pw_i, scw_i)
where pw_i is the weight due to the presence of n_i in
the list nq.P of fields of interest in the query, while
scw_i is the weight due to the presence of n_i in the selection
condition nq.sc.
- pw_i = 0.5 if field name n_i ∈ nq.P (it may be relevant
to have this field in the resulting items); pw_i = 0 otherwise.
- scw_i = 1 if the selection condition nq.sc contains one
single predicate with n_i, or the predicates in the condition
are ANDed with each other; scw_i = 0.7 if the predicates
are ORed with each other.
Finally, we define the relevance measure for a data
set.
Definition 16: Data Set Relevance Measure. Consider
a data set ods and a query nq ∈ Q. Given its
Keyword-based Relevance Measure krm(ods, nq) and
its Schema Fitting Degree sfd(ods, nq), the Relevance
Measure of data set ods w.r.t. query nq is
rm(ods, nq) = (1 − α) × krm(ods, nq) + α × sfd(ods, nq)
where α = 0.6. The data set relevance measure
rm(ods) is
rm(ods) = max(rm(ods, nq_i)), for each nq_i ∈ Q.
The rationale is that the schema fitting degree corrects
the keyword-based relevance measure. The reason
is simple: the keyword-based relevance measure
is obtained by querying the inverted index in a blind
mode, i.e., irrespective of the fact that a keyword is
a data set name, a field name or a word appearing in
meta-data. The keyword selection technique only partially
corrects this blind approach, to discard those data sets
that could turn out to be accidentally relevant.
Moreover, in order to perform item selection from downloaded
data sets, how the field names in the query
match the actual schema of data sets becomes crucial,
because it is impossible to evaluate the selection
condition if the fields in the condition are not present (all
or partially) in the schema of the data sets. This also
motivates the choice of α = 0.6.
Finally (see Section 3), the minimum threshold th
is used to discard less relevant data sets and keep only
those such that rm(ods) ≥ th.
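Putting Definitions 14-16 together, a possible sketch of the schema fitting and relevance computation is the following; the function names and the boolean flag distinguishing ANDed from ORed predicates are ours, not the prototype's.

```python
def weight(name, P, sc_fields, sc_is_and):
    """Weight of a field name (Definition 15): max of its weight in nq.P
    and its weight in the selection condition nq.sc."""
    pw = 0.5 if name in P else 0.0
    scw = 0.0
    if name in sc_fields:
        scw = 1.0 if sc_is_and else 0.7   # single/ANDed predicates vs ORed ones
    return max(pw, scw)

def sfd(schema, P, sc_fields, sc_is_and):
    """Schema fitting degree (Definition 14): weighted fraction of query field
    names actually present in the data set schema."""
    names = set(P) | set(sc_fields)
    num = sum(weight(n, P, sc_fields, sc_is_and) for n in names if n in schema)
    den = sum(weight(n, P, sc_fields, sc_is_and) for n in names)
    return num / den if den else 0.0

def rm(krm_value, sfd_value, alpha=0.6):
    """Relevance measure (Definition 16): (1 - alpha) * krm + alpha * sfd."""
    return (1 - alpha) * krm_value + alpha * sfd_value

# Example 1: the data set has City and Type but calls the name field "SchoolName".
schema = {"SchoolName", "Address", "Reputation", "City", "Type"}
s = sfd(schema, {"Name", "Address", "Reputation"}, {"City", "Type"}, sc_is_and=True)
print(round(s, 2), round(rm(1.0, s), 2))   # 0.86 0.91
```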
8 EXPERIMENTAL EVALUATION
We implemented our technique in a prototype and
evaluated its effectiveness. Here, we
report a short evaluation, due to space limitations.
We performed experiments on a set of 2301 open
data sets published by the Africa Open Data portal
(https://africaopendata.org/).
These open data sets are prepared in a non-homogeneous
way, as far as field names and conventions are
concerned. This fact makes the search more difficult
than with a homogeneous corpus (such as, e.g., the New
York City open data).
We performed 6 different queries. They are re-
ported in Figure 4. Hereafter, we describe them.
In query q_1, we look for statistical information
about malnutrition in the Monbasa, Nairobi or Turkana
counties in Kenya. Apart from counties, we are interested
in fields that describe underweight, stunting and
wasting. In query q_2, we look for information about
civil unions since 2012, where the age of both spouses
is at least 35. The features of interest are year,
month, spouse1age and spouse2age. In query q_3,
we look for information about the availability of water,
a serious problem in Africa. In particular, we look
for distribution points that are functional (note the
condition, where we consider two possible values for
field functional-status: functional and yes). In
query q_4 we look for information about the number of
teachers in public schools, for each county. In query
q_5, we look for data about per capita water consumption
at the date of December 31, 2013 (notice that the
date is written as "2013-12-31t00:00:00", to stress
the similarity search). Finally, query q_6 looks for data
about orchid exports in terms of weight (field kg) and
money.
To perform the evaluation, we downloaded all the
data sets and we explored and labelled every data set
and item as correct or wrong w.r.t. each query (for this task
we used OpenRefine, http://openrefine.org, a powerful
tool to work with messy data).
We performed 9 tests, in order to evaluate the effect
of varying some parameters. The maximum number
of alternative terms max alt was set to 3 for all
tests, and the cosine similarity threshold for neighbour
queries th neighbour was set to 0.9995 for all tests.
Table 1: Evaluation of Experimental Results (n.s.i. means no selected items).

Recall (%) for Data Sets
Query  Test 1  Test 2  Test 3  Test 4  Test 5  Test 6  Test 7  Test 8  Test 9
q_1     13.33   13.33   13.33   13.33   13.33   13.33   80.00   80.00   80.00
q_2    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_3    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_4     66.67   66.67   66.67   66.67   66.67   66.67   66.67   66.67   66.67
q_5    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_6    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00

Precision (%) for Data Sets
Query  Test 1  Test 2  Test 3  Test 4  Test 5  Test 6  Test 7  Test 8  Test 9
q_1     66.67   66.67   66.67   66.67   66.67   66.67    7.79    8.22    8.22
q_2     50.00   50.00   50.00    1.15   50.00    1.15    1.14    1.15    1.15
q_3     75.00   75.00   75.00   50.00   50.00   37.50   50.00   50.00   37.50
q_4     13.33   13.33   13.33   13.33   11.76   13.33   13.33   11.76   13.33
q_5    100.00  100.00  100.00  100.00   50.00  100.00    6.67    6.25    6.67
q_6     20.00   33.33   33.33   20.00   25.00   25.00   20.00   25.00   25.00

Recall (%) for Items
Query  Test 1  Test 2  Test 3  Test 4  Test 5  Test 6  Test 7  Test 8  Test 9
q_1    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_2    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_3      0.00  100.00  100.00    0.00  100.00  100.00    0.00  100.00  100.00
q_4    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_5    100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
q_6      0.00  100.00  100.00    0.00  100.00  100.00    0.00  100.00  100.00

Precision (%) for Items
Query  Test 1  Test 2  Test 3  Test 4  Test 5  Test 6  Test 7  Test 8  Test 9
q_1     50.00   50.00   50.00   50.00   50.00   50.00    0.31    0.31    0.31
q_2     89.91   89.91   54.40   89.91   89.91   54.40   89.91   89.91   54.40
q_3    n.s.i.   51.43   51.43  n.s.i.   48.54   48.54  n.s.i.   48.54   48.54
q_4     10.39   14.39    9.72   10.39    9.72    9.72   10.39    9.72    9.72
q_5    100.00  100.00  100.00  100.00  100.00  100.00   24.73   32.85   49.46
q_6    n.s.i.  100.00   50.00  n.s.i.  100.00   50.00  n.s.i.  100.00   50.00
Instead, we evaluated three different values for
th sim, i.e., the string similarity threshold for alternative
terms (0.9, 0.8 and 0.7), in combination with
three different values for th, i.e., the data set relevance
threshold (0.3, 0.25 and 0.2). The combinations
are: for Test 1, th sim = 0.9 and th = 0.3; for Test 2,
th sim = 0.8 and th = 0.3; for Test 3, th sim = 0.7 and
th = 0.3; for Test 4, th sim = 0.9 and th = 0.25; for Test 5,
th sim = 0.8 and th = 0.25; for Test 6, th sim = 0.7 and
th = 0.25; for Test 7, th sim = 0.9 and th = 0.2; for Test 8,
th sim = 0.8 and th = 0.2; for Test 9, th sim = 0.7 and
th = 0.2.
In the first two sections of Table 1, we report recall
and precision as far as the retrieval of data sets
is concerned, before filtering items. The reader can
observe that we often get very high values of recall,
even 100%, meaning that we are able to get all meaningful
data sets; the data set recall is sensitive to the variation
of the minimum relevance threshold only for query q_1,
where with th = 0.2 the recall significantly improves.
Looking at the precision for data sets, results are more
varied; for query q_2, precision is very sensitive to
the variation of the thresholds, due to the large amount of
false positive downloaded data sets; for query q_6, the
variation of the minimum thresholds generally improves
precision, meaning that we are able to better capture
the data sets of interest.
However, the final goal of our technique is to retrieve
items that satisfy the selection condition. The
third and fourth sections in Table 1 report the evaluation
of recall and precision for selected items. The
reader can notice that in this case things significantly
change: recall values are generally very high, meaning
that the technique is able to retrieve those (typically
few) items of interest, in particular with low
settings for the minimum thresholds. Looking at precision,
we observe that sometimes (query q_4) we get
low values (around 10%), meaning that a large number
of false positive items is selected; this is the effect
of the Jaro-Winkler string similarity measure adopted
to find similar fields.
Table 2: Comparison of Test 2 with the baseline obtained by means of Apache Solr.

              Baseline                  Test 2 - Data Sets        Test 2 - Items
Query  Recall (%)  Precision (%)   Recall (%)  Precision (%)   Recall (%)  Precision (%)
q_1      100.00        22.09          13.33        66.67         100.00        50.00
q_2      100.00         1.96         100.00        50.00         100.00        89.91
q_3      100.00        25.00         100.00        75.00         100.00        51.43
q_4      100.00        33.33          66.67        13.33         100.00        14.39
q_5      100.00       100.00         100.00       100.00         100.00       100.00
q_6      100.00        10.00         100.00        33.33         100.00       100.00
q_1: <dn=nutrition,
      P={county, underweight, stuting, wasting},
      sc=(county = "Monbasa" OR county = "Nairobi" OR county = "Turkana")>

q_2: <dn=civilunions,
      P={year, month, spouse1age, spouse2age},
      sc=(year = 2012 AND (spouse1age >= 35 OR spouse2age >= 35))>

q_3: <dn=wateravailability,
      P={district, location, position, wateravailability},
      sc=(functional-status = "functional" OR functional-status = "yes")>

q_4: <dn=teachers,
      P={county, schooltype, noofteachers},
      sc=(schooltype = "public")>

q_5: <dn=waterconsumption,
      P={city, date, description, consumption per capita},
      sc=(date = "2013-12-31t00:00:00")>

q_6: <dn=national-export,
      P={commodity, kg, money},
      sc=(commodity = "Orchids")>

Figure 4: Queries for the Experimental Evaluation.
The goodness of the approach is shown by the experiments
with query q_6. The high precision for items
obtained in Test 2 (in spite of the low precision for data
sets) is confirmed by looking at the downloaded items: many
of them have value "ORCHIDS/CYMBIDIUM" for field
commodity, instead of the value "Orchids" written in the
query. Thus, our technique proved particularly effective
in the expansion phase driven by th sim.
Our technique obtained the best performance in Test
2, with th sim = 0.8 and th = 0.3. The reason is that
our approach is greedy during the VSM Data Set Retrieval;
then, in the Schema Fitting step the search
space is narrowed and data set instances are downloaded.
The filtering (and last) phase is lazier: so
the average precision for items is above 60%.
To complete the evaluation, we identified a suit-
able baseline. Among other tools, we decided to
adopt Apache Solr (Shahi, 2015), which is an open
source enterprise search engine, developed on top of
Apache Lucene.
We indexed the Africa Open Data corpus with
Apache Solr and executed the same 6 queries, eval-
uating both recall and precision for the retrieved data
sets (Apache Solr retrieves only data sets, while it is
not able to retrieve items).
Table 2 shows the results of our comparison: the
first pair of columns reports recall and precision for
the data sets retrieved by Apache Solr; the second pair
of columns reports recall and precision for the data sets
retrieved by our prototype in Test 2; the third pair
of columns reports recall and precision for the items retrieved
by our prototype in Test 2.
It is possible to see that recall is always 100% for
Apache Solr, while in two cases it is not 100% for
the data sets retrieved in Test 2. In contrast, our prototype
always outperforms Apache Solr as far as precision
is concerned: this is mainly due to the generation
of neighbour queries and to keyword selection.
Finally, when we consider the retrieval of items (not
supported by Apache Solr), our prototype generally
behaves very well, with 100% recall and generally
very high values for precision (except for query q_4).
9 RELATED WORKS
The research area of this paper is certainly young
and, to the best of our knowledge, we are in the pioneering
phase. However, many researchers are reasoning
about Open Data management.
In (Khosro et al., 2014), Khosro et al. present the
state of the art in Linked Open Data (LOD), with its issues
and challenges. Moreover, the authors motivate this
topic by exploiting the projects analysed in the five
major computer science areas (Intelligence, Multimedia,
Sensors, File System and Library), and present
the future trends and directions in LOD.
The Open Data world is related to the Linked
Data world. In fact, standardized proposals are typically
used to describe published Linked Open Data.
The RDF (Resource Description Framework) (Miller,
1998) is widely used for this purpose. Furthermore,
the standard query language for RDF is SPARQL
(SPARQL Protocol and RDF Query Language) (Clark et al., 2008).
W.r.t. our proposal, it is
very general and devoted to retrieving those data sets
with certain features. However, only highly skilled people
in computer science are able to use it. In contrast, our
query technique is very easy for non-skilled people and
closer to the concept of query in information retrieval.
Similar considerations are made in (Auer et al.,
2007), where the authors present DBpedia: the RDF
approach is not suitable for non-expert users, who need
a flexible and simple query language.
In this paper we do not consider Linked Open
Data: we query a corpus of Open Data Sets, in general
not related to each other and thus not linked at all.
The idea of extending our approach to a pool of
federated Open Data corpora is exciting. A pioneering
work on this topic is (Schwarte et al., 2011), but it
still relies on SPARQL as the query language.
Finally, the heterogeneity of Open Data calls for
the capabilities of NoSQL databases. In (Kononenko
et al., 2014), the authors report their experience
with Elasticsearch (a distributed full-text search engine),
highlighting its strengths and weaknesses.
10 CONCLUSION
This paper presents a technique to retrieve items (rows
in CSV files or objects in JSON vectors) contained in
the open data sets published by an open
data portal. Users blindly query the published corpus.
The technique both focuses the search on relevant
terms and expands the search by generating neighbour
queries, by means of a string matching degree.
The experimental evaluation shows that the technique
is promising and effective.
New and more extensive experiments will be performed
in the future, and the technique will
be refined and further improved by adding semantic
information to drive the choice of similar terms.
In particular, as far as this point is concerned, we are
thinking of exploiting dictionaries, such as WordNet, that
provide relationships between words. We think that,
given a term in the query, we could discover its synonyms
and use them to rewrite the query (obtaining
new neighbour queries).
REFERENCES
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak,
R., and Ives, Z. (2007). DBpedia: A nucleus for a web
of open data. In The Semantic Web, pages 722–735.
Springer.
Clark, K. G., Feigenbaum, L., and Torres, E. (2008). SPARQL
protocol for RDF. World Wide Web Consortium (W3C)
Recommendation, 86.
Höchtl, J. and Lampoltshammer, T. J. (2016). Adequate -
analytics and data enrichment to improve the quality
of open data. In Proceedings of the International Conference
for E-Democracy and Open Government CeDEM16, pages 27–32.
Jaro, M. A. (1989). Advances in record-linkage methodology
as applied to matching the 1985 census of Tampa,
Florida. Journal of the American Statistical Association,
84(406):414–420.
Khosro, S. C., Jabeen, F., Mashwani, S., and Alam, I.
(2014). Linked open data: Towards the realization of
semantic web - a review. Indian Journal of Science
and Technology, 7(6):745–764.
Kononenko, O., Baysal, O., Holmes, R., and Godfrey,
M. (2014). Mining modern repositories with Elasticsearch.
In MSR, June 29-30 2014, Hyderabad, India.
Liu, J., Dong, X., and Halevy, A. Y. (2006). Answering
structured queries on unstructured data. In WebDB
2006, Chicago, Illinois, USA, volume 6, pages 25–30.
Citeseer.
Manning, C. D., Raghavan, P., Schütze, H., et al. (2008).
Introduction to Information Retrieval, volume 1. Cambridge
University Press, Cambridge.
Miller, E. (1998). An introduction to the resource descrip-
tion framework. Bulletin of the American Society for
Information Science and Technology, 25(1):15–19.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and
Schmidt, M. (2011). FedX: a federation layer for
distributed query processing on linked open data. In
Extended Semantic Web Conference, pages 481–486.
Springer.
Shahi, D. (2015). Apache solr: An introduction. In Apache
Solr, pages 1–9. Springer.
Winkler, W. E. (1999). The state of record linkage and cur-
rent research problems. In Statistical Research Divi-
sion, US Census Bureau. Citeseer.