items in I. We say that a transaction t satisfies X if for all items I_k in X, t[k] = 1. An AR is then an implication of the form X ⇒ I_j, where X is a set of some items in I, and I_j is a single item in I that is not present in X. An example of this type of
rule is: "90% of transactions that purchased bread and butter also purchased milk". The antecedent of this
rule consists of bread and butter and the consequent
consists of milk alone.
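As a minimal illustration of these definitions, the following Python sketch computes the support and confidence of an AR over boolean transactions; the set-based encoding and the toy data are assumptions made purely for the example.

def satisfies(transaction, itemset):
    """A transaction t satisfies X if t contains every item of X."""
    return itemset <= transaction

def support(transactions, itemset):
    """Fraction of transactions that satisfy the itemset."""
    return sum(satisfies(t, itemset) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions satisfying X, the fraction also satisfying I_j."""
    covered = [t for t in transactions if satisfies(t, antecedent)]
    return sum(satisfies(t, consequent) for t in covered) / len(covered)

# The bread-and-butter example: confidence of {bread, butter} => {milk}.
transactions = [{"bread", "butter", "milk"},
                {"bread", "butter", "milk"},
                {"bread", "butter"},
                {"milk"}]
print(confidence(transactions, {"bread", "butter"}, {"milk"}))  # ~0.67 here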
In (Srikant and Agrawal, 1996), where the concept of Quantitative Association Rules (QARs) was first introduced, the authors deal with the fact that the vast majority of relational databases, whether they hold scientific or business information, are not filled with binary datatypes (as required by classical ARs) but with a much richer range of datatypes, both numerical and categorical.
A first approach to tackle this problem consists of mapping the QARs problem into the boolean ARs problem. The key idea is that if all attributes are categorical, or the quantitative attributes have only a few values, this mapping is straightforward. However, this approach generates problems: if the intervals are too large, some rules may not have the required minimum confidence, and if they are too small, some rules may not have the required minimum support. We could also consider the strategy of taking all possible continuous ranges over the values of the quantitative attribute to cover the partitioned intervals (to solve the minimum confidence problem) and increasing the number of intervals (solving the minimum support problem). Unfortunately, two new problems arise. First, if a quantitative attribute has n values (or intervals), there are on average O(n^2) ranges that include a specific value or interval, a fact that blows up the execution time. Second, if a value (or interval) of a quantitative attribute has minimum support, so will any range containing this value/interval, and therefore the number of rules increases dramatically.
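As an illustration of the straightforward mapping, the following Python sketch partitions a quantitative attribute into equi-depth intervals and encodes each record as a set of boolean items; the function names, the two-bin choice and the sample values are assumptions made for the example.

def equi_depth_intervals(values, n_bins):
    """Split the sorted values into n_bins intervals of (roughly) equal depth."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    cuts = [ordered[int(i * step)] for i in range(1, n_bins)]
    lows = [ordered[0]] + cuts
    highs = cuts + [ordered[-1]]
    return list(zip(lows, highs))

def to_boolean_items(value, attribute, intervals):
    """One boolean item per interval (half-open, last one closed on the right)."""
    items = {}
    for i, (lo, hi) in enumerate(intervals):
        last = i == len(intervals) - 1
        inside = lo <= value < hi or (last and value == hi)
        items[f"{attribute} in [{lo},{hi}]"] = inside
    return items

ages = [23, 25, 29, 34, 38, 41, 52, 60]
intervals = equi_depth_intervals(ages, 2)      # [(23, 38), (38, 60)]
print(to_boolean_items(34, "age", intervals))  # exactly one item holds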
The approach taken by (Srikant and Agrawal, 1996) is different: they consider ranges over adjacent values/intervals of quantitative attributes to avoid the minimum support problem. To mitigate the problem of excessive execution time, they restrict the extent to which adjacent values/intervals may be combined by introducing a user-specified maximum support parameter; they stop combining intervals once their combined support exceeds this value. They also introduce a partial completeness measure in order to decide whether to partition a quantitative attribute and, if so, how many partitions there should be. To address the problem of the appearance of too many rules, they propose an interest measure based on the deviation from expectation that helps to prune out the uninteresting rules (an extension of the interest measure already proposed in (Srikant and Agrawal, 1997)). Finally, an algorithm to extract QARs is presented, sharing the same idea as the algorithm for finding ARs over binary data given in (Agrawal and Srikant, 1994), but adapting the implementation in the computational details of how candidates are generated and how their supports are counted.
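The following Python sketch illustrates the interval-combination idea under a user-specified maximum support; it is only a sketch of the general mechanism, with assumed names and data, not the candidate-generation code of (Srikant and Agrawal, 1996).

def combined_ranges(interval_supports, max_support):
    """Yield (i, j, support) for ranges of adjacent base intervals [i..j]
    whose combined support does not exceed max_support."""
    n = len(interval_supports)
    for i in range(n):
        total = 0.0
        for j in range(i, n):
            total += interval_supports[j]
            if total > max_support and j > i:
                break  # stop extending: the combined range grew too popular
            yield (i, j, total)  # single base intervals are always kept

# Supports of four base intervals of some quantitative attribute:
base = [0.10, 0.15, 0.20, 0.05]
for i, j, s in combined_ranges(base, max_support=0.30):
    print(f"intervals {i}..{j}: support {s:.2f}")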
In (Miller and Yang, 1997), the authors pointed out the pitfalls of the equi-depth method and presented several guiding principles for partitioning quantitative attributes. They apply clustering methods to determine sets of dense values in a single attribute, or over a set of attributes that have to be treated as a whole. But although they took the distance among data into account, by clustering a quantitative attribute (or a set of quantitative attributes) alone they did not take the relations with the other attributes into account. Based on this, (Tong et al., 2005) improved the method to take the relations amongst attributes into account.
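As one simple stand-in for such clustering methods, the following Python sketch groups the sorted values of a single attribute into dense clusters wherever consecutive values stay within a gap threshold; the gap-based criterion is an assumption chosen for illustration, not the clustering method of (Miller and Yang, 1997).

def dense_value_clusters(values, max_gap):
    """Group sorted values into clusters of nearby ('dense') values."""
    ordered = sorted(values)
    clusters, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] <= max_gap:
            current.append(v)       # still within the dense region
        else:
            clusters.append(current)
            current = [v]           # a large gap starts a new cluster
    clusters.append(current)
    return clusters

print(dense_value_clusters([1, 2, 2, 3, 10, 11, 25], max_gap=2))
# [[1, 2, 2, 3], [10, 11], [25]]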
Another improvement in the mining of quantitative data is the inclusion of Fuzzy Sets to solve the sharp boundary problem (Kuok et al., 1998). An element belongs to a set (category) with a membership value, but it may also belong to the neighbouring ones.
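A minimal Python sketch of this idea follows, using overlapping triangular membership functions so that a value belongs to two neighbouring categories at once; the category names and centres are illustrative assumptions.

def triangular(x, left, centre, right):
    """Triangular membership function peaking at `centre`."""
    if x <= left or x >= right:
        return 0.0
    if x <= centre:
        return (x - left) / (centre - left)
    return (right - x) / (right - centre)

# Age categories with overlapping supports: no sharp boundary at 40.
categories = {"young": (0, 20, 40), "middle": (20, 40, 60), "old": (40, 60, 80)}
age = 35
for name, (l, c, r) in categories.items():
    print(name, round(triangular(age, l, c, r), 2))
# young 0.25, middle 0.75, old 0.0: age 35 is partly young, mostly middle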
In (Dong and Tjortjis, 2003), a mixed approach was proposed, based on the quantitative approach introduced by (Srikant and Agrawal, 1996), the hash-based technique from the Direct Hashing and Pruning (DHP) algorithm (Park et al., 1995), and the methodology for generating ARs from the apriori algorithm (Agrawal and Srikant, 1994). The experimental results show that this approach accurately reflects the information hidden in the datasets and, on top of that, that it scales up linearly in terms of processing time and memory usage as the dataset grows.
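The hash-based pruning idea of DHP can be sketched as follows in Python: while scanning the transactions, every 2-item pair is hashed into a small table, and a candidate pair is retained only if its bucket count reaches the minimum support count; the bucket size and data are illustrative assumptions.

from itertools import combinations

def dhp_bucket_counts(transactions, n_buckets):
    """Hash every 2-item pair of every transaction into a small table."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, n_buckets, min_count):
    """A pair survives only if its bucket is popular enough (no false negatives,
    but hash collisions may let some infrequent pairs through)."""
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
buckets = dhp_bucket_counts(transactions, n_buckets=8)
print(may_be_frequent(("a", "b"), buckets, 8, min_count=2))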
On the other hand, the work of Aumann and Lindell (Aumann and Lindell, 1999) proposes a new definition for QARs. An example of this kind of rule would be: sex = female ⇒ Wage: mean = $7.90 p/hr (overall mean wage = $9.02). This form of QAR, unlike others, does not require the discretisation of attributes with real number domains as a pre-processing step. Instead, it uses statistical theory and data-driven algorithms to process the data and find regularities that lead to the discovery of ARs.
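A minimal Python sketch of this style of rule follows, comparing a subpopulation mean against the overall mean; the Z-test and the 1.96 cut-off are one simple significance check chosen for illustration, not necessarily the exact test of (Aumann and Lindell, 1999).

import math

def mean_based_rule(rows, condition, target, z_cutoff=1.96):
    """Return the subpopulation mean, overall mean and whether the
    difference passes a large-sample Z-test."""
    sub = [r[target] for r in rows if condition(r)]
    overall = [r[target] for r in rows]
    mu_sub = sum(sub) / len(sub)
    mu_all = sum(overall) / len(overall)
    var = sum((x - mu_all) ** 2 for x in overall) / (len(overall) - 1)
    z = (mu_sub - mu_all) / math.sqrt(var / len(sub))  # large-sample Z score
    return mu_sub, mu_all, abs(z) >= z_cutoff

# Toy data for a rule like: sex = female => Wage: mean.
rows = [{"sex": "female", "wage": 7.9}, {"sex": "female", "wage": 8.1},
        {"sex": "male", "wage": 10.0}, {"sex": "male", "wage": 10.2}]
print(mean_based_rule(rows, lambda r: r["sex"] == "female", "wage"))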
ward in this kind of rules was given by (Okoniewski
et al., 2001). They provide variations of the algorithm
proposed in (Aumann and Lindell, 1999) enhancing
it by using heuristic strategies and advanced database
indexing. The whole methodology is completed with
the proposition of post-processing techniques with the
use of similarity and significance measures.
The motivation of this work is to tackle some of the
drawbacks of the previous techniques. Most of them
require the translation of the original database so that
each non-binary attribute can be regarded as a discrete
set of binary variables over which the existing data