2. Compute the fitness of the optimal 2-partition of any set D_t = {x_1, x_2, ..., x_t} of t consecutive points, where 2 ≤ t ≤ n, and find the minimum by

   Fit(D_t, 2) = min_{2 ≤ s ≤ t} {dis(1, s − 1) + dis(s, t)}.
3. Compute the fitness of the optimal L-interval-partition of any set D_t = {x_1, x_2, ..., x_t} of t consecutive points, where L ≤ t < n and 3 ≤ L < K, by using

   Fit(D_t, L) = min_{L ≤ s ≤ t} {Fit(D_{s−1}, L − 1) + dis(s, t)}.
4. Create a new matrix f(t, L) which stores the fitness computed in the above two steps for all optimal L-partitions (1 ≤ L < K) on any set D_t = {x_1, x_2, ..., x_t} of t points, where 1 ≤ t ≤ n:

   f(t, L) = Fit(D_t, L)   if 1 < L < K and L < t,
             dis(1, j)     if L = 1 and 1 ≤ j ≤ t,
             0             if 1 < L < K and L ≥ t.
The optimal K-partition can be recovered from the matrix f(t, L) by finding the index l such that

   f(n, K) = f(l − 1, K − 1) + dis(l, n).

The Kth partition is then {x_l, x_{l+1}, ..., x_n}, and the (K − 1)th partition is {x_{l*}, x_{l*+1}, ..., x_{l−1}}, where

   f(l − 1, K − 1) = f(l* − 1, K − 2) + dis(l*, l − 1),

and so on. A sketch of this dynamic program, including the backtracking step, is given below.
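The following Python sketch illustrates the dynamic program above. It assumes dis(i, j) is the sum of squared deviations of x_i, ..., x_j from their mean, a common choice for Fisher's algorithm; the function names, the distance measure, and the example data are illustrative and not taken from the paper.

# Illustrative sketch of the optimal K-partition dynamic program described
# above. Assumption: dis(i, j) is the sum of squared deviations of the
# points x_i, ..., x_j from their mean (1-based, inclusive); the paper's
# exact distance measure may differ.

def fisher_partition(points, K):
    x = sorted(points)
    n = len(x)

    # Prefix sums so that dis(i, j) can be evaluated in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x, start=1):
        s1[i] = s1[i - 1] + v
        s2[i] = s2[i - 1] + v * v

    def dis(i, j):
        m = j - i + 1
        total = s1[j] - s1[i - 1]
        return (s2[j] - s2[i - 1]) - total * total / m

    # f[t][L]: fitness of the optimal L-partition of D_t = {x_1, ..., x_t}.
    # back[t][L]: start index s of the last interval x_s, ..., x_t.
    INF = float("inf")
    f = [[INF] * (K + 1) for _ in range(n + 1)]
    back = [[0] * (K + 1) for _ in range(n + 1)]
    for t in range(1, n + 1):
        f[t][1] = dis(1, t)                      # base case: one interval
    for L in range(2, K + 1):
        for t in range(L, n + 1):
            for s in range(L, t + 1):
                cand = f[s - 1][L - 1] + dis(s, t)
                if cand < f[t][L]:
                    f[t][L], back[t][L] = cand, s

    # Backtrack: f(n, K) = f(l - 1, K - 1) + dis(l, n), and so on.
    clusters, t = [], n
    for L in range(K, 0, -1):
        l = back[t][L] if L > 1 else 1
        clusters.append(x[l - 1:t])
        t = l - 1
    clusters.reverse()
    return clusters


# Example: split six ordered values into K = 3 intervals.
print(fisher_partition([1, 2, 2, 8, 9, 20], 3))
# [[1, 2, 2], [8, 9], [20]]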
2.3 Automatic Taxonomy Construction
When the number of clusters is large, each cluster can be replaced by its centroid, where the centroid of a cluster C of reals is the average value and is easily computed; these clusters can then be clustered again by applying the algorithm to their centroids. Repeating this procedure, a tree hierarchy of the clusters can be gradually built from the bottom up.
Given a value set V = {V_1, V_2, ..., V_n}, V_i ∈ R, of a feature/attribute A, the procedure of partitional-clustering-based attribute-value taxonomy construction is described below.
1. Let the number of clusters, k, equal the size of the value set V; the leaves of the tree are then {V_i} for each value V_i ∈ V. Call this clustering C.
2. Determine a suitable k which is less than the current number of clusters, and apply Fisher's algorithm to C to find k clusters.
3. Replace each cluster with its centroid, and reset C to be the new k singleton clusters.
4. Go to step 2 until k reaches 2 or the distances between successive centroids are all sufficiently similar. A sketch of this procedure is given below.
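A minimal Python sketch of this bottom-up construction follows. It reuses the fisher_partition function from the earlier sketch; the halving schedule for choosing k and the simplified stopping test are illustrative assumptions, since the paper does not prescribe a particular schedule.

# Illustrative sketch of the bottom-up taxonomy construction. Assumes the
# fisher_partition(values, k) sketch from Section 2; the halving schedule
# for k and the stopping test are illustrative simplifications.

def build_taxonomy(values):
    # Step 1: every distinct value becomes a leaf node.
    nodes = [{"centroid": v, "children": []} for v in sorted(set(values))]

    while len(nodes) > 2:
        # Step 2: choose a smaller k and cluster the current centroids.
        k = max(2, len(nodes) // 2)
        clusters = fisher_partition([node["centroid"] for node in nodes], k)

        # Step 3: replace each cluster by a parent node whose centroid is
        # the average of its children's centroids.
        new_nodes, idx = [], 0
        for cluster in clusters:
            children = nodes[idx:idx + len(cluster)]
            idx += len(cluster)
            centroid = sum(c["centroid"] for c in children) / len(children)
            new_nodes.append({"centroid": centroid, "children": children})

        # Step 4: loop until only two clusters remain (the similarity test
        # on successive centroids is omitted here).
        nodes = new_nodes

    return {"centroid": None, "children": nodes}   # root of the taxonomy


taxonomy = build_taxonomy([1, 2, 2, 8, 9, 20, 21])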
3 CASE STUDY
We conducted a case study to demonstrate the automatic construction of taxonomies for a real-world database. The Adult dataset, extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau, was chosen for the experiment. Once all missing and unknown data are removed, there are 30,162 records of training data and 15,060 records of test data. The distribution of records for the target class is shown in Table 1.
Table 1: Target Class Distribution.

Data set   Target Class       %   Records
Train      ≤ 50K          75.11    22,654
Train      > 50K          24.89     7,508
Test       ≤ 50K          75.43    11,360
Test       > 50K          24.57     3,700
3.1 Data Preprocessing
As described in section 2, Fisher’s algorithm works on
a set of ordered or continuous real values. To cluster
data with nominal attributes, one common approach
is to convert them into numeric attributes, and then
apply a clustering algorithm. This is usually done by
“exploding” the nominal attribute into a set of new bi-
nary numeric attributes, one for each distinct value in
the original attribute. For example, the
sex/gender
attribute can be replaced by two attributes,
Male
and
Female
, both with a numeric domain {0, 1}.
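A minimal sketch of this binary explosion (the helper name explode_nominal and the toy records are illustrative, not from the paper):

# Illustrative sketch: "explode" a nominal attribute into a set of new
# binary (0/1) numeric attributes, one per distinct value.

def explode_nominal(records, attribute):
    values = sorted({r[attribute] for r in records})
    exploded = []
    for r in records:
        new_r = {k: v for k, v in r.items() if k != attribute}
        for v in values:
            new_r[f"{attribute}={v}"] = 1 if r[attribute] == v else 0
        exploded.append(new_r)
    return exploded


people = [{"sex": "Male", "age": 39}, {"sex": "Female", "age": 50}]
print(explode_nominal(people, "sex"))
# [{'age': 39, 'sex=Female': 0, 'sex=Male': 1},
#  {'age': 50, 'sex=Female': 1, 'sex=Male': 0}]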
Another way to perform the transformation is to use distinct numerical (real) values to represent the nominal values. Using the sex/gender attribute again as an example, a numeric domain {1, 0} is a substitute for its nominal domain {Male, Female}. A more general technique, frequency-based analysis, can also be exploited to perform this transformation. For instance, the domain of the attribute race/ethnicity can be transformed from {White, Asian, Black, Indian, other} to {0.56, 0.21, 0.12, 0.09, 0.02}, according to the relative frequency of each value in the data.
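A minimal sketch of such a frequency-based encoding (the helper name frequency_encode and the toy data are illustrative assumptions):

# Illustrative sketch: replace each nominal value with its relative
# frequency in the data, so the numeric code also carries the statistical
# information of the original value.
from collections import Counter

def frequency_encode(records, attribute):
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    mapping = {value: count / total for value, count in counts.items()}
    encoded = [{**r, attribute: mapping[r[attribute]]} for r in records]
    return encoded, mapping


people = [{"race": "White"}, {"race": "White"},
          {"race": "Black"}, {"race": "Asian"}]
encoded, mapping = frequency_encode(people, "race")
print(mapping)   # {'White': 0.5, 'Black': 0.25, 'Asian': 0.25}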
With the Adult data set, the prediction task is usually to identify what kind of person earns more than $50K per year, based on various personal information such as education background and marital status. This prediction/classification is very practical for some government agencies, e.g. the taxation bureau, to detect fraudulent tax refund claims. Thus a frequency-based transformation seems more appropriate for this task, because each numeric value to be transformed also reveals statistical information about its original nominal value. Our transformational