and based on that, a randomized algorithm is also
proposed. The other line of work on similarity
joins returns exact answers. Based on various
indexing techniques and filtering principles, several
approaches have been proposed (Sarawagi and Kirpal,
2004; Chaudhuri et al., 2006; Bayardo et al., 2007;
Xiao et al., 2008). (Bayardo et al., 2007) proposes a
principle for quickly accessing inverted lists. (Xiao et
al., 2008) designs a novel technique to index and
process similarity join queries. (Arasu et al., 2006)
divides the records into partitions and hashes them
into signatures. It also employs a post-filtering step
that prunes pairs of records to reduce the number of
candidates.
As mentioned, the original similarity join needs a
user-given threshold, yet setting a suitable threshold
may not be easy without background knowledge of
the given datasets. Therefore, Xiao et al. propose
a variant of similarity joins, i.e. the top-k similarity
join query (Xiao et al., 2009), which returns the k
pairs of records from the two given sets of records
that have the highest similarity values. In (Xiao et al.,
2009), the topk-join approach, to be formally
introduced in the next section, is proposed to deal
with the top-k similarity join query. Its main idea is
to quickly compute the upper bounds of similarity
values related to pairs of records and then prune the
candidate results whose upper bounds are lower
than the similarity value of the current k-th pair of
records. Consider a scenario as follows. A blogger
writes some articles in her/his blog and is interested
in the other blog articles highly related to these
articles. Because blog articles are continuously
generated, they can be regarded as an article data
stream, and the above scenario can be turned into the
problem of continuous top-k similarity joins. Since
users are often more concerned with recent data, we
adopt the sliding window model in this paper. Given
a set of records regarded as a query and a sliding
window over a data stream, the continuous top-k
similarity join query returns the k pairs of records,
each pairing a query record with a record in the
sliding window, that have the highest similarity
values.
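The query just defined can be illustrated with a naive baseline that recomputes the answer from scratch on every arrival. This is only a sketch under stated assumptions: the window is taken to be count-based (the last `window_size` records), Jaccard similarity is used, and the names `continuous_topk`, `window_size`, and `arrival_no` are hypothetical and do not come from the paper.

```python
import heapq
from collections import deque


def jaccard(x, y):
    """Jaccard similarity of two token sets (0.0 if both are empty, by assumption)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def continuous_topk(query, stream, window_size, k):
    """Yield the top-k (similarity, query_index, arrival_no) triples after each arrival.

    Assumes a count-based sliding window holding the last `window_size` records.
    """
    query = [frozenset(q) for q in query]
    window = deque(maxlen=window_size)  # the oldest record is evicted automatically
    for arrival_no, record in enumerate(stream):
        window.append((arrival_no, frozenset(record)))
        # Brute force: compare every query record with every windowed record.
        pairs = (
            (jaccard(q, rec), qi, no)
            for qi, q in enumerate(query)
            for no, rec in window
        )
        yield heapq.nlargest(k, pairs)


if __name__ == "__main__":
    blog_query = [{"a", "b"}]
    article_stream = [{"a", "b"}, {"a"}, {"c"}, {"b", "c"}]
    for result in continuous_topk(blog_query, article_stream, window_size=2, k=1):
        print(result)
```

Each slide costs a full join between the query and the window, which is the inefficiency that motivates incremental solutions.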
To deal with this problem, we can apply the
topk-join approach (Xiao et al., 2009) whenever the
window slides. This solution can clearly be improved,
since most of the data in the current window are
identical to those in the previous window. We first
propose a solution extended from the topk-join
approach, which computes the top-k results
between the query and the newly arrived data as
candidate results and derives the join results from
the candidate set. Moreover, we propose another
algorithm that preprocesses the query in advance,
so that each data record can be compared with all
the records in the query at one time. Our
contributions can be summarized as follows. 1) We
make the first attempt to address the problem of
continuous top-k similarity joins. 2) We propose two
algorithms for solving this problem, one extended
from the topk-join approach proposed in (Xiao et al.,
2009) and the other based on preprocessing the
issued query for parallel comparisons of the records.
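One possible reading of the candidate-based idea is sketched below. It is not the authors' algorithm, whose details are given later in the paper. It assumes a count-based window and Jaccard similarity, and it replaces the topk-join step with a brute-force comparison of the new record against the query. The names `incremental_topk` and `candidates` are hypothetical. The sketch relies on one observation: any pair in the window's top-k must also be among the top-k pairs of the data record it involves, so k candidates per windowed record suffice.

```python
import heapq
from collections import deque


def jaccard(x, y):
    """Jaccard similarity of two token sets (0.0 if both are empty, by assumption)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def incremental_topk(query, stream, window_size, k):
    """Yield top-k (similarity, query_index, arrival_no) triples per arrival.

    Only the newly arrived record is joined with the query; older records
    contribute their stored candidates until they leave the window.
    """
    query = [frozenset(q) for q in query]
    candidates = deque(maxlen=window_size)  # one candidate list per windowed record
    for arrival_no, record in enumerate(stream):
        record = frozenset(record)
        # Top-k pairs between the query and the new record only.
        local = heapq.nlargest(
            k, ((jaccard(q, record), qi, arrival_no) for qi, q in enumerate(query))
        )
        candidates.append(local)  # candidates of the expired record are dropped here
        # Derive the join result from the candidate set.
        yield heapq.nlargest(k, (pair for group in candidates for pair in group))


if __name__ == "__main__":
    q = [{"a", "b"}, {"c"}]
    s = [{"a", "b"}, {"a"}, {"c"}, {"b", "c"}]
    for result in incremental_topk(q, s, window_size=2, k=2):
        print(result)
```

Compared with recomputing the join on every slide, each arrival here costs one pass over the query plus a selection over at most `window_size * k` stored candidates.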
The rest of the paper is organized as follows. The
preliminaries are introduced in Section 2, including
the problem formulation and the topk-join approach
(Xiao et al., 2009). Thereafter, the proposed
solutions are detailed in Section 3. The experimental
results are presented and analyzed in Section 4, and
finally, Section 5 concludes this work.
2 PRELIMINARIES
To deal with the traditional problem of similarity
joins, a user needs to set a similarity threshold to
identify which join results s/he is interested in. In
(Xiao et al., 2009), Xiao et al. instead address a variant
of the similarity join problem, i.e. top-k similarity
joins. In the top-k similarity join problem, the join
results with the k highest similarity values are
returned, with no need to set a threshold. Next, the
problem of top-k similarity joins and the
corresponding solution proposed in (Xiao et al.,
2009) are introduced in Subsection 2.1, followed by
the problem of continuous top-k similarity joins,
formulated in Subsection 2.2.
2.1 Introduction to Top-k Similarity Joins
Let $I = \{W_1, W_2, \ldots, W_{|I|}\}$ be a finite set of symbols
(literals) called tokens. A record is considered as a
set of tokens. Given a similarity function denoted
sim(⋅, ⋅), which returns a similarity value s ∈ [0, 1]
between two records, top-k similarity joins between
two sets of records return k pairs of records that have
the highest similarity values. Note that we focus
on the Jaccard similarity function in this paper;
accordingly, $\mathrm{sim}(x, y) = \frac{|x \cap y|}{|x \cup y|}$,
where x and y are records.
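The definitions above can be made concrete with a small sketch of a top-k similarity join under Jaccard similarity. It also illustrates upper-bound pruning against the current k-th pair, but with a simple size-based bound, min(|x|, |y|) / max(|x|, |y|), used here as a stand-in. This is not the prefix-based bound of the topk-join approach. The function names and the handling of two empty records are assumptions made for illustration.

```python
import heapq
from itertools import product


def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token sets."""
    x, y = set(x), set(y)
    union = len(x | y)
    # Two empty records are given similarity 0.0 here (an assumption).
    return len(x & y) / union if union else 0.0


def size_upper_bound(x, y):
    """Upper bound on jaccard(x, y) that uses only the two set sizes."""
    small, large = sorted((len(set(x)), len(set(y))))
    return small / large if large else 0.0


def topk_join(R, S, k):
    """Return the k triples (similarity, i, j) with the highest Jaccard similarity."""
    if k <= 0:
        return []
    heap = []  # min-heap of the current top-k; heap[0] is the current k-th pair
    for (i, x), (j, y) in product(enumerate(R), enumerate(S)):
        # Skip the pair if even its upper bound cannot beat the current k-th pair.
        if len(heap) == k and size_upper_bound(x, y) <= heap[0][0]:
            continue
        s = jaccard(x, y)
        if len(heap) < k:
            heapq.heappush(heap, (s, i, j))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, i, j))
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    R = [{"a", "b", "c"}, {"a", "b"}]
    S = [{"a", "b", "c"}, {"c", "d"}, {"a"}]
    print(topk_join(R, S, k=2))
```

Ties at the k-th similarity value are broken arbitrarily in this sketch, so a pair whose similarity equals the current k-th value may be skipped.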
A solution to the problem of top-k similarity
joins, proposed in (Xiao et al., 2009), is mainly
based on the concept of prefix filtering (Chaudhuri
et al., 2006; Xiao et al., 2009), described as follows.
Suppose that the tokens of two records x and y are
DATA 2012 - International Conference on Data Technologies and Applications