2 PRELIMINARIES
In this section we present five formal definitions of
the basic concepts required to understand the
foundations of Skyline and Skyline metrics. For
these definitions we assume a space S over a set
of n dimensions $\{d_1, \ldots, d_n\}$, a subspace S' (a
non-empty subset of the space S), and a dataset DS on S.
We also suppose that a tuple $t \in DS$ is represented as
$t = (t_1, \ldots, t_n)$, where $t_i$ is a real number on
dimension $d_i$.
For simplicity, we suppose that higher values are
preferred on every dimension (maximization).
Definition 1 (Dominance). A tuple $t = (t_1, \ldots, t_n) \in DS$
dominates another tuple $u = (u_1, \ldots, u_n) \in DS$ if
$(\forall i \mid 1 \le i \le n : t_i \ge u_i) \wedge (\exists j \mid 1 \le j \le n : t_j > u_j)$.
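As an illustration of Definition 1, the following is a minimal Python sketch of the dominance test under the maximization assumption stated above. The function name `dominates` and the representation of tuples as Python sequences are assumptions made for this example only.

```python
from typing import Sequence


def dominates(t: Sequence[float], u: Sequence[float]) -> bool:
    """Return True if t dominates u when higher values are preferred.

    Hypothetical helper: t must be at least as good as u on every
    dimension and strictly better on at least one.
    """
    if len(t) != len(u):
        raise ValueError("tuples must have the same number of dimensions")
    at_least_as_good = all(ti >= ui for ti, ui in zip(t, u))
    strictly_better = any(ti > ui for ti, ui in zip(t, u))
    return at_least_as_good and strictly_better


print(dominates((3, 2), (2, 2)))  # True
print(dominates((3, 1), (2, 2)))  # False: worse on the second dimension
print(dominates((2, 2), (2, 2)))  # False: equal tuples do not dominate
```

Note that two tuples can be incomparable: neither `(3, 1)` nor `(2, 2)` dominates the other.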
Definition 2 (Skyline). The Skyline of a space S,
denoted as $SKY_S$, is the set of the non-dominated
tuples on S.
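Definition 2 can be sketched with a naive quadratic scan that keeps every tuple not dominated by another one. This is only an illustration, not an efficient Skyline algorithm; the names `dominates` and `skyline` and the sample dataset are assumptions of the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]  # hypothetical alias for a tuple of DS


def dominates(t: Point, u: Point) -> bool:
    """t dominates u under maximization on every dimension."""
    return (all(ti >= ui for ti, ui in zip(t, u))
            and any(ti > ui for ti, ui in zip(t, u)))


def skyline(ds: Sequence[Point]) -> List[Point]:
    """Naive O(|DS|^2) Skyline: keep the tuples no other tuple dominates."""
    return [t for t in ds if not any(dominates(u, t) for u in ds)]


# Illustrative data only (not taken from the paper).
ds = [(1, 9), (5, 5), (9, 1), (4, 4), (2, 3)]
print(skyline(ds))  # [(1, 9), (5, 5), (9, 1)]
```

Here `(4, 4)` and `(2, 3)` are dropped because `(5, 5)` dominates both, while the three remaining tuples are pairwise incomparable.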
Definition 3 (Skycube). The Skycube or lattice is the
set of all the Skylines over the subspaces S' of S, i.e.,
$Skycube = \{\cup\, SKY_{S'} \mid S' \subseteq S\}$.
Definition 4 (Skyline Frequency). The Skyline
Frequency of a tuple $t \in DS$, denoted by sf(t), is the
number of subspaces S' of S in which t is a Skyline
tuple, that is,
$sf(t) = (\sum S' \mid S' \subseteq S \wedge t \in SKY_{S'} : 1)$.
Since the Skyline can be huge (Chan et al.,
2006a), the Skyline needs to be ranked by a score
function to distinguish the top-k tuples in a set of
incomparable ones. A score function of a tuple t,
denoted as f(t), is a function that ranks the tuple t,
inducing a total order on the input dataset DS.
Definition 5 (Top-k Skyline). The Top-k Skyline
tuples of a space S, denoted by $TKS_S$, are the k
Skyline tuples on S such that no other Skyline tuple on S
has a higher score function value than theirs:
$TKS_S = \{t \mid t \in SKY_S \wedge (\exists^{\,k-|SKY_S|}\, u \mid u \in SKY_S : f(u) > f(t))\}$,
where $\exists^{x}$ means that there exist at most x
elements in the set.
The Skyline Frequency may be used as a score
function to rank the Skyline. In (Chan et al., 2006a),
the Top-k Frequent Skyline tuples, denoted here by
TKFS, are defined as the k tuples in DS such that no other
tuple in DS has a larger Skyline Frequency than
theirs:
$TKFS = \{t \mid t \in SKY_S \wedge (\exists^{\,k-|SKY_S|}\, u \mid u \in SKY_S : sf(u) > sf(t))\}$.
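The ranking idea behind the Top-k Skyline can be sketched as follows: compute the Skyline, then keep the k tuples with the highest score. This is a minimal sketch under stated assumptions, not an algorithm from the paper: the names `top_k_skyline`, `skyline` and `dominates` are hypothetical, the Skyline is computed naively, and the score function used in the example (the sum of the values) is chosen only for illustration.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]  # hypothetical alias for a tuple of DS


def dominates(t: Point, u: Point) -> bool:
    """t dominates u under maximization on every dimension."""
    return (all(ti >= ui for ti, ui in zip(t, u))
            and any(ti > ui for ti, ui in zip(t, u)))


def skyline(ds: Sequence[Point]) -> List[Point]:
    """Naive Skyline: tuples that no other tuple dominates."""
    return [t for t in ds if not any(dominates(u, t) for u in ds)]


def top_k_skyline(ds: Sequence[Point], k: int,
                  f: Callable[[Point], float]) -> List[Point]:
    """Return the k Skyline tuples with the highest score f(t).

    Ties in f are resolved by input order, which is an assumption of
    this sketch rather than part of the definition.
    """
    return heapq.nlargest(k, skyline(ds), key=f)


# Illustrative data and score function (sum of values), both assumed.
ds = [(1, 9), (6, 6), (9, 2), (4, 4)]
print(top_k_skyline(ds, 2, f=sum))  # [(6, 6), (9, 2)]
```

Passing a function that returns the Skyline Frequency sf(t) as `f` would give the Top-k Frequent Skyline ranking described above.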
3 SKYLINE METRICS
The three steps to compute the SFM are: 1)
the Skyline for each subspace of the multidimensional
criteria is computed; 2) the SFM of
each tuple t is calculated by counting the
subspaces for which t is a Skyline tuple;
3) the Skyline is sorted by SFM values and the best
k tuples are returned.
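The three steps above can be sketched in Python as follows. This is a brute-force illustration that enumerates every non-empty subspace and computes each Skyline naively; it is not the implementation evaluated in the paper. All function names and the sample data are assumptions, and the tuples of DS are assumed to be distinct.

```python
from itertools import combinations
from typing import Dict, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]  # hypothetical alias for a tuple of DS
Dims = Tuple[int, ...]     # a subspace given as dimension indices


def dominates(t: Point, u: Point, dims: Dims) -> bool:
    """t dominates u on the subspace dims (maximization)."""
    return (all(t[d] >= u[d] for d in dims)
            and any(t[d] > u[d] for d in dims))


def skyline(ds: Sequence[Point], dims: Dims) -> List[Point]:
    """Naive Skyline of ds restricted to the subspace dims."""
    return [t for t in ds if not any(dominates(u, t, dims) for u in ds)]


def subspaces(n: int) -> Iterator[Dims]:
    """Yield every non-empty subset of the dimension indices 0..n-1."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)


def skyline_frequency(ds: Sequence[Point]) -> Dict[Point, int]:
    """Steps 1 and 2: count the subspaces in which each tuple is Skyline."""
    sf = {t: 0 for t in ds}
    for dims in subspaces(len(ds[0])):
        for t in set(skyline(ds, dims)):
            sf[t] += 1
    return sf


def top_k_by_sfm(ds: Sequence[Point], k: int) -> List[Point]:
    """Step 3: sort the full-space Skyline by frequency, keep the best k."""
    sf = skyline_frequency(ds)
    full_space = tuple(range(len(ds[0])))
    ranked = sorted(skyline(ds, full_space), key=lambda t: sf[t], reverse=True)
    return ranked[:k]


# Illustrative data only (not taken from the paper).
ds = [(1, 9), (6, 6), (9, 2), (4, 4)]
print(skyline_frequency(ds))  # {(1, 9): 2, (6, 6): 1, (9, 2): 2, (4, 4): 0}
print(top_k_by_sfm(ds, 2))    # [(1, 9), (9, 2)]
```

In this small example the two tuples holding the best value on a single dimension outrank `(6, 6)`, even though `(6, 6)` is balanced on both dimensions.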
Unfortunately, Skyline Frequency has two
disadvantages. On the one hand, it may require building
a lattice of skylines, one for each non-empty subset of the
multidimensional criteria, that is, $2^d - 1$ skylines
(Chan et al., 2006a). Several solutions
have been introduced to reduce the cost of the lattice
computation. In (Chan et al., 2006a), the authors
propose estimating the Skyline Frequency values
with efficient approximate algorithms. Yuan et al.
(2005) and Pei et al. (2006) define algorithms to
efficiently calculate the Skycube, or lattice of
skylines, by sharing computation among multiple related
Skyline subspaces.
On the other hand, Skyline Frequency benefits
those tuples that have the best value in at least one
dimension. Any tuple with this characteristic will
have a Skyline Frequency of at least
$1 + \sum_{i=1}^{d-1} \binom{d-1}{i}$ when data
are not duplicated. According to Corollary 1 in
(Yuan et al., 2005), a Skyline tuple in a subspace s will
also be a Skyline tuple in all subspaces of which s is a
subset. For this reason, all of these tuples could have
the same Skyline Frequency value (little variability).
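To make the lower bound concrete, the sketch below evaluates it for small d, reading the bound as 1 + Σ_{i=1}^{d-1} C(d-1, i): one for the single dimension on which the tuple is best, plus every larger subspace that contains that dimension. By the binomial theorem this sum equals 2^(d-1), which the code checks numerically. The function name `sf_lower_bound` is an assumption of this example.

```python
from math import comb


def sf_lower_bound(d: int) -> int:
    """Lower bound on sf(t) for a tuple that is best on one dimension.

    Counts the one-dimensional subspace itself (the leading 1) plus the
    subspaces obtained by adding i of the remaining d - 1 dimensions.
    """
    return 1 + sum(comb(d - 1, i) for i in range(1, d))


for d in range(1, 7):
    total = 2 ** d - 1  # number of non-empty subspaces
    print(d, sf_lower_bound(d), total)
    assert sf_lower_bound(d) == 2 ** (d - 1)
```

For d = 4, for instance, the bound is 8 out of 15 subspaces, so every tuple that is best on some dimension already appears in more than half of the lattice.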
To introduce variability into SFM, we propose a
new metric called Top-k Skyline Frequency Metric
(TKSFM). The lattice for TKSFM is built on the
two-dimensional Skylines. Therefore, it does not
benefit those tuples with the best value in at least
one dimension as SFM does. Additionally, our
experimental study shows that our metric is less
expensive than SFM because it does not need to
build the whole Skyline for each subspace.
To exemplify the difference between TKSFM
and SFM, suppose a lattice for 4 dimensions: A, B,
C, and D, as shown in Figure 1. The SFM value of a
tuple t is the number of subspaces of the lattice
whose Skyline contains t. Since the Skyline for each
subspace must be calculated, the Skyline Frequency
computation is very costly (Chan et al., 2006a).
Instead of using the skylines for each subspace of the
lattice, the lattice of the TKSFM is based on Top-k
Skyline subspaces. Thus, the evaluation cost of the
metric may be reduced because the Top-k Skyline is
computed instead of the whole Skyline set
(Goncalves and Vidal, 2009).
ICEIS 2010 - 12th International Conference on Enterprise Information Systems