3 PROBLEM DEFINITION
We use R to denote a relational entity table that contains
collections of heterogeneous data with non-ID columns
$C = \{c_1, c_2, \ldots, c_k\}$. The terms column and attribute
are used interchangeably in this paper. Each column $c_i$ can take
a text value v from a domain denoted as range($c_i$), and we use
range(R) to represent the duplicate-free set of all values in the
table R, which is also referred to as the Closed Language Model of
the table R, i.e. CLM(R) (Sarkas et al., 2010). Values that are not
from the CLM are said to belong to the Open Language Model (OLM).
A text value contains at least one word, where a word is defined as
a sequence of characters that does not include white space. Note
that numerical attributes are also treated as categorical
attributes in this work, and we do not assume that the ranges of
the columns are mutually exclusive. Each tuple from the entity
table R can be viewed as an entity with k attributes, and an
attribute value of an entity may be empty, i.e. a null value, as in
Table 1.
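The notation above can be made concrete with a minimal sketch. It assumes the table is held in memory as a list of dictionaries; the rows, the helper names `column_range` and `closed_language_model`, and the variable names are hypothetical and are not taken from Table 1 or from the original work.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]  # column name -> text value, or None for a null


def column_range(table: List[Row], column: str) -> Set[str]:
    """range(c): the duplicate-free set of non-null values of one column."""
    return {row[column] for row in table if row.get(column) is not None}


def closed_language_model(table: List[Row], columns: List[str]) -> Set[str]:
    """CLM(R) = range(R): the union of range(c) over all non-ID columns."""
    clm: Set[str] = set()
    for column in columns:
        clm |= column_range(table, column)
    return clm


# Hypothetical rows for illustration only (not the paper's Table 1).
toy_table: List[Row] = [
    {"Sb Nb": "5", "Bu Nb": "12", "Street": "Oxford Street", "Postal Town": "London"},
    {"Sb Nb": None, "Bu Nb": "5", "Street": "High Street", "Postal Town": "Oxford"},
]
columns = ["Sb Nb", "Bu Nb", "Street", "Postal Town"]

clm = closed_language_model(toy_table, columns)

# Column ranges need not be mutually exclusive: "5" occurs in both columns.
overlap = column_range(toy_table, "Sb Nb") & column_range(toy_table, "Bu Nb")

# A value that is absent from CLM(R) is treated as coming from the OLM.
in_olm = "Englad" not in clm
```

In this toy table the null in "Sb Nb" is simply skipped, and numeric values such as "5" are stored as text, which mirrors the treatment of numerical attributes as categorical.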
Definition 3.1. Given a relational entity table R, the problem of
entity search/match over R is to find all tuples in R that refer to
the same entity denoted by the free-text input query q, which is
represented as a list of words $(w_1, w_2, \ldots, w_n)$.
The assumption here is that the entity denoted by the input query
is of the same type as the tuples in R. Following the
categorisation of Pound et al. (2010), the input query in this work
is an Entity Query, where the intention of the query is to find a
particular entity in an Ad-hoc Object Retrieval (AOR) task.
There are three main challenges for entity search/match over RDBs:
• The heterogeneity of the data means that there could be many
representations even for the same entity, and each attribute
value could also have a variety of forms. The complexity of the
entity query grows exponentially with the number of attributes of
the entity.
• Two different attributes could have a large overlap of values,
which means that the same word or words from the query could be
mapped to multiple attributes at the same time. This
characteristic has also been discussed in (Kim et al., 2009).
• A real-world entity query might contain redundant or incorrect
information, which results in misinterpretation of the query.
For example, given a product database where each tuple refers to a
product entity, consider the two input queries “iphone 7” and
“iphone 7 cover”. The first query refers to a “phone” with a
product name equal to “iphone 7”. The second query refers to a
“cover” that is used with an “iphone 7”. Thus, the value “iphone 7”
in the first query should be mapped to a ‘product name’ attribute,
whereas the same value “iphone 7” in the second query should be
mapped to an ‘applicable’ attribute. In this paper, we argue, and
exploit the fact, that the mappings between values and attributes
are not independent of each other.
4 ATTRIBUTE EXTRACTIONS
In this section, we show how to perform a simple but
effective attribute value extraction for an input entity
query. We start by reviewing the query annotation
model in (Sarkas et al., 2010).
Definition 4.1. An annotated value of a query q for a table R is a
pair AV = (v, c) of a value v from q and a column c in R, such that
v ∈ range(c). Note that a value v could be a single word $w_i$ or a
sequence of words $(w_i, \ldots, w_j)$ from the query.
In the example of Table 1, consider the input query “5 Oxford
Street London Englad”. The annotated value AV = (Oxford Street,
Street) denotes that “Oxford Street” is a possible value for the
column “Street”. Intuitively, an annotated value AV determines
which column a value is mapped to.
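As an illustration of Definition 4.1, the following sketch enumerates every annotated value of a query by testing each contiguous word sequence against each column range. The column ranges in `toy_ranges` are assumed for illustration and are not the contents of Table 1; exact, case-sensitive string matching is likewise an assumption, and the function name `annotated_values` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

AnnotatedValue = Tuple[str, str]  # (value v, column c) with v in range(c)


def annotated_values(
    words: List[str], ranges: Dict[str, Set[str]]
) -> List[AnnotatedValue]:
    """Return every pair (v, c) where v is a contiguous word sequence of the
    query and v is in range(c)."""
    found: List[AnnotatedValue] = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            value = " ".join(words[start:end])
            for column, values in ranges.items():
                if value in values:
                    found.append((value, column))
    return found


# Assumed column ranges (hypothetical, not the paper's Table 1).
toy_ranges: Dict[str, Set[str]] = {
    "Sb Nb": {"5"},
    "Bu Nb": {"5"},
    "Street": {"Oxford Street"},
    "Postal Town": {"London", "Oxford", "Street"},
}

avs = annotated_values("5 Oxford Street London Englad".split(), toy_ranges)
```

Under these assumed ranges, `avs` contains (Oxford Street, Street) as well as competing candidates such as (Oxford, Postal Town), while the misspelt word “Englad” produces no annotated value because it lies outside every column range.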
Sarkas et al. proceed to define a segmentation of the query as a
sequence of non-overlapping values that cover the entire query, and
a structured annotation of the query as a set of annotated values
such that the values form a segmentation of the query. When a word
w from the query does not belong to the range of the database, i.e.
w ∉ range(R), AV = (w, OLM) can be used to denote that the word is
from the OLM. For example, there are four different structured
annotations for the input query “5 Oxford Street London Englad”,
which are shown as follows:
$S_1$ = ((5, Sb Nb), (Oxford Street, Street), (London, Postal Town), (Englad, OLM))
$S_2$ = ((5, Sb Nb), (Oxford, Postal Town), (Street, Postal Town), (London, Postal Town), (Englad, OLM))
$S_3$ = ((5, Bu Nb), (Oxford Street, Street), (London, Postal Town), (Englad, OLM))
$S_4$ = ((5, Bu Nb), (Oxford, Postal Town), (Street, Postal Town), (London, Postal Town), (Englad, OLM))
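The enumeration of structured annotations can be sketched as a recursive search over segmentations. This is a minimal illustration rather than the procedure of Sarkas et al.: the column ranges in `toy_ranges` are assumed (they are not the contents of Table 1), matching is exact and case-sensitive, and the OLM rule is read here as applying only to a single word that matches no column. The function name `structured_annotations` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

AnnotatedValue = Tuple[str, str]  # (value, column or "OLM")
Annotation = Tuple[AnnotatedValue, ...]


def structured_annotations(
    words: List[str], ranges: Dict[str, Set[str]]
) -> List[Annotation]:
    """Enumerate all structured annotations: segmentations of the query in
    which every segment is paired with a column whose range contains it."""
    results: List[Annotation] = []

    def extend(start: int, partial: List[AnnotatedValue]) -> None:
        if start == len(words):  # the whole query is covered
            results.append(tuple(partial))
            return
        for end in range(start + 1, len(words) + 1):
            value = " ".join(words[start:end])
            columns = [c for c, values in ranges.items() if value in values]
            # Assumption: a single word outside range(R) is assigned to the OLM.
            if not columns and end == start + 1:
                columns = ["OLM"]
            for column in columns:
                extend(end, partial + [(value, column)])

    extend(0, [])
    return results


# Assumed column ranges (hypothetical, not the paper's Table 1).
toy_ranges: Dict[str, Set[str]] = {
    "Sb Nb": {"5"},
    "Bu Nb": {"5"},
    "Street": {"Oxford Street"},
    "Postal Town": {"London", "Oxford", "Street"},
}

query = "5 Oxford Street London Englad"
annotations = structured_annotations(query.split(), toy_ranges)
```

With these assumed ranges the search yields four annotations, corresponding to the four listed above, although not necessarily in the same order: two choices of column for “5” multiplied by two ways of segmenting “Oxford Street”.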
Intuitively, each structured annotation of the query
is a possible interpretation of the query and every
word from the query is mapped to a single column
in the database. For the running example, structured