$$\mathrm{score}(d, D_{pos}) = \sum_{x \subseteq d,\; x \in EP_{pos}} \rho(x, D_{pos}) \times \mathrm{sup}(x, D_{pos}) \tag{5}$$
Here, $d$ denotes a transaction, and $EP_{pos}$ denotes the
set of emerging patterns for the pos class. The value
of $\mathrm{score}(d, D_{neg})$ is calculated in the same way, so
each transaction receives a score for each class. The
scores are then normalized by the median for each
class, and the class with the larger normalized score
is the predicted class for the transaction.
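The scoring rule in Equation 5 and the median normalization can be sketched in Python as follows. This is a minimal illustration, not the original CAEP implementation: the dictionary layout, the function names `caep_score` and `predict`, the precomputed per-class medians `base_pos` and `base_neg`, and the "unpredicted" outcome for equal scores are assumptions made for the example, and the ρ and support values are invented.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical layout: each emerging pattern maps to (rho, support) for its class.
Patterns = Dict[FrozenSet[str], Tuple[float, float]]

def caep_score(d: FrozenSet[str], patterns: Patterns) -> float:
    """Sum rho * sup over the patterns contained in transaction d (Equation 5)."""
    return sum(rho * sup for x, (rho, sup) in patterns.items() if x <= d)

def predict(d: FrozenSet[str], ep_pos: Patterns, ep_neg: Patterns,
            base_pos: float, base_neg: float) -> str:
    """Normalize each class score by that class's median and pick the larger.

    base_pos and base_neg are assumed to be positive medians of the class
    scores, computed beforehand (for example, over training transactions).
    """
    norm_pos = caep_score(d, ep_pos) / base_pos
    norm_neg = caep_score(d, ep_neg) / base_neg
    if norm_pos == norm_neg:
        # Assumption: equal scores (e.g. no matching pattern) give no prediction.
        return "unpredicted"
    return "pos" if norm_pos > norm_neg else "neg"

# Invented toy values, for illustration only.
ep_pos = {frozenset({"a", "b"}): (0.9, 0.2), frozenset({"c"}): (0.8, 0.1)}
ep_neg = {frozenset({"d"}): (0.7, 0.3)}
d = frozenset({"a", "b", "c"})
label = predict(d, ep_pos, ep_neg, base_pos=0.2, base_neg=0.2)
```

In this toy case only pos-class patterns are contained in `d`, so the pos score is the larger one. A transaction that contains no pattern of either class receives a score of 0 for both classes and is left unpredicted in this sketch.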
CAEP has been applied to several real-world
problems. Takizawa et al. proposed a
method to categorize crime and safety zones using
various spatial factors (Takizawa et al., 2010). They
compared their method with existing methods such
as decision trees, and showed that CAEP outperforms
them. Morita et al. incorporated item taxonomy into
emerging patterns and extended CAEP using these
patterns (Morita and Hamuro, 2013). They applied
the extended CAEP to real POS (Point Of Sales)
data and reported results that were effective for business.
The principle of CAEP is simple, and, as these
implementations show, it can be useful if effective
emerging patterns are extracted from the data.
However, some problems occur with real business
data. The first problem is caused by the emerging
patterns themselves. As shown in Figure 1, emerging
patterns can exist only in "area A" and "area C".
However, "area C" may contain few patterns in some
difficult cases involving real business data, as
mentioned above. Of course, the size of each area
depends on the minimum support value and the
minimum ρ value, but the nature of the problem does
not change. The emerging patterns in "area A" are
powerful, but they cover few transactions, because
the support value of each pattern is small. As a
result, unless the minimum support value and the
minimum value of ρ are changed, many transactions
remain unpredicted because their scores for both
classes are 0. Conversely, if the minimum support
value is lowered and the minimum value of ρ is
increased, the number of emerging patterns often
increases rapidly, which increases the computational
time. In many such cases, it is difficult to compute
a good classification model in practice. The second
problem is the normalization of the score for each
class. In the original CAEP method, the score for
each class is normalized by the median of each
distribution. Normalization is a good way to compare
the scores for each class, but in some cases the
distribution is biased. We believe that the
normalization method should be changed, because the
existing method is insufficient.
In the next section, we propose a classification
method to solve these problems.
3 PROPOSED METHOD
We propose a new method called Classification by
Aggregating Contrast Patterns (CACP), which uses
contrast patterns instead of emerging patterns. We
use LCM (Uno et al., 2003) to enumerate contrast
patterns, because it is efficient and outputs the
enumerated patterns in descending order of
$df(x, D_{pos})$ or $df(x, D_{neg})$.
After enumerating the contrast patterns, redundant
patterns are pruned. Given two contrast patterns $x$ and
$y$ in the same class, if $\mathrm{sup}(y, D_{pos}) \le \mathrm{sup}(x, D_{pos})$ and
$df(y, D_{pos}) \le df(x, D_{pos})$, then contrast pattern $y$ is
removed and $x$ is kept. In the example given in
Figure 2, $y$ is removed by $x$, but $z$ is not removed by $x$,
because $\mathrm{sup}(x, D_{pos}) \le \mathrm{sup}(z, D_{pos})$. Likewise, $x$ is
not removed by $z$, because $df(z, D_{pos}) \le df(x, D_{pos})$.
In this case, contrast patterns $x$ and $z$ are kept, and $y$ is
removed.
[Figure: patterns x, y, and z plotted by support value for the pos class and support value for the neg class, with the pruning area marked.]
Figure 2: Pruning area.
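The pruning rule can be read as a dominance check between patterns of the same class. The following Python sketch illustrates it under stated assumptions and is not the authors' implementation: the `Pattern` tuple, the function `prune`, and the numeric values are hypothetical, sup and df are taken as precomputed inputs for one class, and keeping the earlier pattern when two patterns tie on both values is a choice made here because the rule in the text does not cover that case.

```python
from typing import List, NamedTuple

class Pattern(NamedTuple):
    name: str
    sup: float  # sup(x, D_pos), assumed precomputed
    df: float   # df(x, D_pos), assumed precomputed

def prune(patterns: List[Pattern]) -> List[Pattern]:
    """Drop y when some other x has sup(y) <= sup(x) and df(y) <= df(x)."""
    kept = []
    for i, y in enumerate(patterns):
        dominated = False
        for j, x in enumerate(patterns):
            if i == j or not (y.sup <= x.sup and y.df <= x.df):
                continue
            # Assumption: on an exact tie, only the later pattern is dropped.
            if (x.sup, x.df) == (y.sup, y.df) and j > i:
                continue
            dominated = True
            break
        if not dominated:
            kept.append(y)
    return kept

# Invented values arranged like the x, y, z example in the text.
x = Pattern("x", sup=0.5, df=0.4)
y = Pattern("y", sup=0.4, df=0.3)
z = Pattern("z", sup=0.7, df=0.2)
kept_names = [p.name for p in prune([x, y, z])]
```

With these values, `y` is dropped because `x` is at least as large on both measures, while `x` and `z` each win on one measure and are both kept. The pairwise loop is quadratic in the number of patterns, which is acceptable for a sketch but may need a sort-based approach for large pattern sets.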
We also change the score of a contrast pattern
from $\rho(x, D_{pos}) \times \mathrm{sup}(x, D_{pos})$ to

$$\mathrm{cpScore}(x, D_{pos}, D_{neg}, \theta) = \sqrt{\theta \cdot (\mathrm{sup}(x, D_{pos}) - 1)^2 + (1 - \theta) \cdot (\mathrm{sup}(x, D_{neg}) - 1)^2}, \tag{6}$$

where $\theta$ ($0 \le \theta \le 1$) denotes a weight that adjusts
the importance of the support value for each class. The
score of transaction $d$ for each class is then defined
by Equation 7,

$$\mathrm{score}(d, D_{pos}, D_{neg}, \theta, CP_{pos}) = \sum_{x \subseteq d,\; x \in CP_{pos}} \mathrm{cpScore}(x, D_{pos}, D_{neg}, \theta), \tag{7}$$
ICEIS 2013 - 15th International Conference on Enterprise Information Systems