• Υ = T (tree), meaning classes are organized in a tree structure.
• Ψ = SPL (single path of labels), meaning the problems we consider are not hierarchically multi-label.
• Φ = PD (partial depth) labeling, meaning datapoints do not always have a leaf class.
In this paper, we introduce a novel distance measure defined with respect to the label tree. Its purpose is to capture similarity between labels and to penalize errors at high levels of the hierarchy more than errors at lower levels. This distance measure leads to a trade-off between accuracy and the distance of misclassifications. Intuitively, this trade-off makes sense for UNSPSC codes: for example, classifying an apple as a fresh fruit should be penalized less than classifying an apple as toxic waste. Training a classifier for such distance measures is not straightforward; we therefore present a classification method that copes with a distance measure defined between two labels.
The rest of this paper is structured as follows. Section 2 discusses existing HMC approaches in the literature. Section 3 introduces hierarchical classification. In Section 4, we define the properties a hierarchical tree distance measure should comply with, and describe our concrete implementation of these properties. Section 5 details how to embed the distance measure in a hierarchical multiclass classifier. Section 6 experimentally compares this classifier with other classifiers. Section 7 presents our ideas for further research, and Section 8 concludes.
2 RELATED WORK
(Dumais and Chen, 2000) explore hierarchical classification of web content by ordering SVMs in a hierarchical fashion and classifying based on user-specified thresholds. The authors focus on a two-level label hierarchy, as opposed to the four-level UNSPSC hierarchy we utilize in this paper. Assigning an instance to a class requires the posterior probabilities propagated from the SVMs through the hierarchy. The authors conclude that exploiting the hierarchical structure of an underlying problem can, in some cases, produce a better classifier, especially in situations with a large number of labels.
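As an illustration of this kind of propagation (a minimal sketch under assumed two-level posteriors and a single threshold, not the authors' exact procedure), combined scores could be computed as follows:

```python
# Sketch of posterior propagation in a two-level label hierarchy.
# The probabilities below are hypothetical placeholders; Dumais and
# Chen obtain them from per-node SVMs with user-specified thresholds.

def propagate(p_top, p_second_given_top, threshold=0.5):
    """Combine first- and second-level posteriors multiplicatively
    and keep leaf classes whose combined score passes the threshold."""
    scores = {}
    for top, p1 in p_top.items():
        for leaf, p2 in p_second_given_top.get(top, {}).items():
            scores[(top, leaf)] = p1 * p2
    return {c: s for c, s in scores.items() if s >= threshold}

p_top = {"Computers": 0.9, "Recreation": 0.2}
p_second = {"Computers": {"Hardware": 0.8, "Software": 0.3},
            "Recreation": {"Sports": 0.7}}
print(propagate(p_top, p_second))   # {('Computers', 'Hardware'): 0.72}
```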
(Labrou and Finin, 1999) use a global-classifier based system to classify web pages into a two-level DAG-based hierarchy of Yahoo! categories by computing the similarity between documents. The authors conclude that their system is not accurate enough for automatic classification and should be used in conjunction with active learning. This deviates from the method introduced in this paper in that the model we introduce does not support DAGs and can be used without the aid of active learning, with promising results.
(Wang et al., 1999) identify issues in local-approach hierarchical classification and propose a global-classifier based approach aiming for closeness of hierarchy labels. The authors recognize that treating hierarchical predictions as simply correct or wrong is insufficient, and that focusing only on the broader, higher levels discards structure, and thus accuracy. To mitigate these issues, the authors implement a multilabel classifier based on rules from features to classes found during training. These rules minimize a distance measure between two classes and are found deterministically. Their distance measure is application-dependent, and the authors use the shortest distance between two labels. In this paper, we also construct a global classifier which aims to minimize distances between hierarchy labels.
(Weinberger and Chapelle, 2009) introduce a label embedding with respect to the hierarchical structure of the label tree and build a global multiclass classifier based on this embedding. We utilize their method of classification with our novel distance measure.
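As a rough illustration of the idea (not their large-margin formulation), a distance-respecting label embedding can be sketched with classical multidimensional scaling over a pairwise label distance matrix; the toy matrix below is hypothetical:

```python
import numpy as np

# Hedged sketch: embed labels so that Euclidean distances roughly
# reproduce a given label-to-label tree distance matrix, via classical
# multidimensional scaling.

def embed_labels(dist, dim):
    """dist: (m, m) symmetric matrix of pairwise label distances."""
    m = dist.shape[0]
    j = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j               # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(b)
    idx = np.argsort(eigval)[::-1][:dim]         # keep largest eigenvalues
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0))

# Toy distance matrix for three labels.
d = np.array([[0., 1., 2.],
              [1., 0., 2.],
              [2., 2., 0.]])
z = embed_labels(d, dim=2)
print(np.round(np.linalg.norm(z[0] - z[1]), 2))  # ≈ 1.0, matching d[0, 1]
```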
3 HIERARCHICAL CLASSIFICATION
The hierarchical structure among labels allows us to
reason about different degrees of misclassification.
We are concerned with predicting the label of datapoints within a hierarchical taxonomy. We define the input data as a set of tuples, such that a dataset $D$ is defined by
$$D = \{(x, y) \mid x \in X,\ y \in Y\}, \qquad (1)$$
where $x$ is a $q$-dimensional datapoint in feature space $X$ and $y$ is a label in a hierarchically structured set of labels $Y = \{1, 2, \ldots, m\}$.
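As a minimal sketch of this setup, with hypothetical feature vectors and labels:

```python
import numpy as np

# Each datapoint is a q-dimensional feature vector paired with one
# label from the hierarchically structured label set Y = {1, ..., m}.
# The feature values and labels below are hypothetical.

q = 3                                  # feature dimensionality
Y = set(range(1, 8))                   # labels 1..m with m = 7

D = [
    (np.array([0.2, 1.5, -0.3]), 4),   # (x, y) with x in R^q, y in Y
    (np.array([1.1, 0.0,  2.4]), 6),
]

assert all(x.shape == (q,) and y in Y for x, y in D)
```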
Assume we have a datapoint $x$ with label $y = U$ from the label tree in Figure 1. It makes sense that a prediction $\hat{y} = V$ should be penalized less than a prediction $\hat{y}' = Z$, since $V$ is closer to the true label $y$ in the label tree. We capture this notion of distance between any two labels with our hierarchy-embracing distance measure, whose properties are defined in Section 4.
One commonly used distance measure is to count
the number of edges on a path between two labels in
the node hierarchy. We call this method the Edges
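As a concrete sketch of edge counting (on a hypothetical tree standing in for Figure 1, whose exact structure may differ), the distance can be computed from parent pointers via the lowest common ancestor:

```python
# Hedged sketch of the edge-counting distance: the number of edges on
# the path between two labels in the label tree. The parent map below
# encodes a hypothetical tree with siblings U, V under T, and Z under
# the root.

parent = {"V": "T", "U": "T", "T": "root", "Z": "root"}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def edge_distance(a, b):
    """Edges on the path a -> lowest common ancestor -> b."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    common = next(n for n in up_a if n in up_b)   # lowest common ancestor
    return up_a.index(common) + up_b.index(common)

print(edge_distance("U", "V"))  # 2: siblings under T
print(edge_distance("U", "Z"))  # 3: U -> T -> root -> Z
```

On this toy tree, the measure agrees with the intuition above: the prediction $V$ is closer to the true label $U$ than $Z$ is.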