TEXT CLASSIFICATION THROUGH TIME
Efficient Label Propagation in Time-Based Graphs
Shumeet Baluja, Deepak Ravichandran and D. Sivakumar
Google, Inc. 1600 Amphitheatre Parkway, Mountain View, CA, 94043, U.S.A.
Keywords: Text analysis, Text Classification, Machine Learning, Graph Algorithms, Preference Propagation, Semi-supervised Learning, Natural Language Processing, Adsorption.
Abstract: One of the fundamental assumptions for machine-learning based text classification systems is that the
underlying distribution from which the set of labeled-text is drawn is identical to the distribution from
which the text-to-be-labeled is drawn. However, in live news aggregation sites, this assumption is rarely
correct. Instead, the events and topics discussed in news stories dramatically change over time. Rather than
ignoring this phenomenon, we attempt to explicitly model the transitions of news stories and classifications
over time to label stories that may be acquired months after the initial examples are labeled. We test our
system, based on efficiently propagating labels in time-based graphs, with recently published news stories
collected over an eighty-day period. The experiments presented in this paper range from using training labels drawn from stories within the first several days of gathering, to using a single story as the only label.
1 INTRODUCTION
The writing, vocabulary, and topic of news stories
rapidly shift within extremely small periods of time.
In recent years, new events and breaking, “hot”,
stories almost instantaneously dominate the majority
of the press, while older topics just as quickly recede
from popularity (Project for Excellence in
Journalism, 2008). For typical automated news-
classification systems, this can present severe
challenges. For example, the ‘Political’ and
‘Entertainment’ breaking news stories of one week
may have very little in common, in terms of subject
or even vocabulary, with the news stories of the next
week. An automated news classifier that is trained to
accurately recognize the previous day/month/year’s
stories may not have encountered the type of news
story that will arise tomorrow.
Unlike previous work on topic detection and
tracking, we are not attempting to follow a particular
topic over time or to determine when a new topic
has emerged ((http://www.nist.gov/speech/tests/tdt/), (Allen, 2002), (Mori et al., 2006)). Instead, we are
addressing a related problem of immediate interest
to live news aggregation sites: given that a news
story has been published, in which of the site’s
preset categories should it be placed?
The volume of news stories necessitates the use
of an automated classifier. However, one of the
fundamental assumptions in machine learning based
approaches to news classification is that the
underlying distribution from which the set of
labeled-text is drawn is identical to the distribution
from which the text-to-be-labeled is drawn. Because
of the rapidly changing nature of news stories, this
may not hold true. In this paper, we present a graph-
based approach to address the problem of explicitly
capturing both strong and weak similarities within
news stories over time and to use these efficiently
for categorization. Our approach combines the
paradigm of Min-Hashing and label propagation in
graphs in a novel way. While Min-Hashing is well-
understood in information retrieval applications, our
application of it to create a temporal similarity graph
appears to be new. Label propagation is gaining
popularity in the field of machine learning as a
technique for semi-supervised learning. Our
approach to label propagation follows our previous
work (Baluja et al., 2008), where equivalent views
of a basic algorithm termed Adsorption were
established, and the technique was successfully
employed for propagating weak information in
extremely large graphs to create a video
recommendation system for YouTube.
The aims of this paper are to present the
following techniques that we anticipate will have
general applicability for data mining in industrial
settings: formulation of temporal similarities via
graphs created using Min-Hashes, and the
application of label propagation as an off-the-shelf
tool for classification tasks when very little ground
truth is available.
Figure 1: Distribution of Stories Acquired over Testing Period. (Number of stories acquired on each day of the 80-day period, for each of the seven categories: Politics, Internet, Health, Environment, Entertainment, Business, and Sports.)
The next section describes the data collected and
presents a series of experiments to develop strong,
realistic, baselines for performance. Section 3 gives
a detailed description of the Adsorption algorithm.
Section 4 presents the empirical results to establish
the Adsorption baselines for this task. Section 5
presents extensive results with tiny amounts of
labeled data (e.g., a single labeled example). Section
6 concludes the paper and offers avenues for future
exploration.
2 DATA AND INITIAL
EXPERIMENTS
For the experiments conducted in this paper, we
examined 11,014 unique news stories published
over an 80 day period in 2008. The news stories
were manually placed into one of seven categories
(% composition): "Politics" (19.8%), "Internet" (6.0%), "Health" (8.8%), "Environment" (8.3%), "Entertainment" (10.8%), "Business" (31.6%), or "Sports" (14.5%). Figure 1
shows the number of stories gathered each day from
each class. Note that a few of the entries are 0; due
to errors, no stories were gathered on those days.
Although there are numerous methods to pre-
process and represent text ((Pomikalek and
Rehurek, 2007), (McCallum and Nigam, 1998)), we
chose an extremely simple technique for
reproducibility. Alternative, more sophisticated pre-processing techniques will improve all of the results obtained in this paper. For simplicity, we
only generated a binary bag-of-words representation
for each news story by determining the presence (or
absence) of each word in the vocabulary. The
vocabulary consisted of all words in the complete
set of articles, except those words that occurred in
less than 10 news stories (too infrequent) or those
that occurred in more than 15% of the documents
(too frequent); these words were simply discarded.
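A minimal sketch of this representation is given below; it assumes each story has already been tokenized into a list of lowercased words, and all function and variable names are illustrative rather than taken from our system.

```python
from collections import Counter

def build_vocabulary(tokenized_stories, min_df=10, max_df_frac=0.15):
    """Keep words that appear in at least min_df stories and in at most
    max_df_frac of all stories (document-frequency pruning)."""
    n_docs = len(tokenized_stories)
    doc_freq = Counter()
    for tokens in tokenized_stories:
        doc_freq.update(set(tokens))              # count each word once per story
    return {w for w, df in doc_freq.items()
            if df >= min_df and df <= max_df_frac * n_docs}

def to_binary_bow(tokens, vocabulary):
    """Binary bag-of-words: the set of vocabulary words present in one story."""
    return set(tokens) & vocabulary
```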
2.1 Initial Experiments
In the first set of experiments, we examine how two
standard machine learning techniques, support
vector machines ((Cortes and Vapnik, 1995),
(Joachims, 2002)) and k-nearest neighbor, perform
on the standard task of classifying news stories into
the 1-of-7 categories described earlier. This task is
constructed as a standard machine learning
classification task; a total of 3900 news stories are
used (the first 3900 of the set described in Section 2).
In Table 1, we vary the number of labeled
examples between 100 and 500, and label the
examples 500-3900 using an SVM with linear
kernel (Joachims, 2002). Additionally, a full set of experiments was conducted with non-linear kernels, such as Radial Basis Functions. The performance did not improve over the linear kernel; this may be due to the small amount of labeled data provided. Note that because the SVM is a binary
classifier, we train 21 SVM models to distinguish
each class from each other class. The performance
of the SVM-system dramatically improved with
more labeled samples. Additionally, if we continue
to ignore the temporal nature of the task, we can use
the test set as unlabeled data and take advantage of
unlabeled-training methods. We attempted this in
the training process for the SVM through the use of
transductive learning in SVM-Light ((Joachims, 2002), (Joachims, 1999)); however, this did not significantly impact the performance ((Ifrim and Weikum, 2006) reported similar results).
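For readers who wish to reproduce a comparable baseline, a rough sketch of the 1-of-7 setup using pairwise (one-vs-one) linear SVMs is shown below; our experiments used SVM-Light, whereas this sketch uses scikit-learn purely for illustration, with all names being assumptions.

```python
# Illustrative only: our experiments used SVM-Light; scikit-learn's one-vs-one
# wrapper similarly builds 21 pairwise binary SVMs for the 7 categories.
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

def train_and_label(X_labeled, y_labeled, X_unlabeled):
    clf = OneVsOneClassifier(LinearSVC())   # 7 classes -> 21 binary models
    clf.fit(X_labeled, y_labeled)           # X rows: binary bag-of-words vectors
    return clf.predict(X_unlabeled)
```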
Besides the overall performance, to view the
effects of degradation of performance over time, we
also examine the performance of the first (in time)
100 samples classified in the test set compared with
the last 100 samples classified; these results are
shown in the last two columns of Table 1. Note that,
as expected, the unlabeled stories that are classified
close to the period from which the labeled stories
were taken are labeled more accurately than those
that are labeled further away.
2.2 k-Nearest Neighbor
The experiments with k-nearest neighbor (k-NN)
mirror those conducted with SVMs in the previous
section. However, in order to make the k-NN
process efficient, there must be a rapid method to
find the nearest-neighbors. For this, we use a
hashing scheme based on sparse sketches of the
news stories. The sketches are created using a Min-
Hash scheme (Cohen et al., 2001) that is then
looked up using an approximate hashing approach
termed LSH. Previously, this technique has been
successfully applied to the large-scale lookup of
music and images (Baluja and Covell, 2008).
Although a full discussion of these approaches is
beyond the scope of this paper, both will be briefly
described since the distance calculations are also
used as the basis of the weights in the Adsorption
graph.
Table 1: SVM Performance, measured with varying Labeled Samples.

Labeled     Overall          Initial          Later
Examples    Performance      Performance      Performance
            (Samples         (Samples         (Samples
            500-3900)        500-600)         3800-3900)
0-100       58.5             66               41
0-200       76.0             86               68
0-300       81.6             84               72
0-400       85.2             92               81
0-500       86.2             95               83
Min-Hash creates compact fingerprints of sparse binary vectors such that the similarity between two fingerprints provides a reliable estimate of the similarity (Jaccard coefficient) of the two original vectors. Because of the sparseness of the bag-of-
words presence vector that is used for the news
stories, it is an ideal candidate for this procedure.
Min-Hash works as follows: select a random, but
known, reordering of all the vector positions. We
call this a permutation reordering. Then for each
story (for a given permutation reordering), pick the minimum vector-element that is 'on' (in our case, a word that is present in the news story). It is important to note that the probability that two news stories will have the same minimum vector-element is equal to the Jaccard coefficient of the two stories. Hence, to get better
estimates of this value, we repeat this process p
times, with p different permutations to get p
independent projections of the bit vector. Together,
these p values are the signature of the bit vector.
Various values of p were tried. For the remainder of
this paper, we use p=500; this is the signature length
of each vector, and is therefore the length of the
representation of each news story.
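A minimal sketch of the signature computation follows; it assumes each story is represented by the set of its vocabulary word ids and replaces explicit permutation reorderings with random linear hash functions (a common stand-in with the same collision property), so it should be read as an illustration rather than the exact implementation used here.

```python
import random

def make_minhash_fns(p=500, seed=0):
    """One (a, b, m) triple per 'permutation'; (a*x + b) mod m approximates a
    random reordering of word ids (a common stand-in for explicit permutations;
    an implementation choice, not necessarily the one used in our system)."""
    rng = random.Random(seed)
    m = 2_000_003                             # a prime larger than any word id
    return [(rng.randrange(1, m), rng.randrange(0, m), m) for _ in range(p)]

def minhash_signature(word_ids, hash_fns):
    """p-dimensional signature: the minimum permuted word id per permutation."""
    return [min((a * w + b) % m for w in word_ids) for (a, b, m) in hash_fns]

def estimated_jaccard(sig1, sig2):
    """The fraction of matching positions estimates the Jaccard coefficient."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```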
Table 2: k-Nearest Neighbor, with Varying Labeled Samples, Best Value for k given in Column 1.

Labeled        Overall          Initial          Later
Examples       Performance      Performance      Performance
(best k)       (Samples         (Samples         (Samples
               500-3900)        500-600)         3800-3900)
0-100 (10)     81.3             85               79
0-200 (1)      80.9             86               78
0-300 (10)     82.2             90               76
0-400 (10)     83.3             90               79
0-500 (10)     83.4             92               80
Even with the compression afforded with Min-
Hash, efficiently finding near-neighbors in a 500
dimensional space is not a trivial task; naïve
comparisons are not practical. Instead, we use
Locality-Sensitive Hashing (LSH) (Gionis et al.,
1999). In contrast to standard hashing, LSH
performs a series of hashes, each of which examines only a portion of the fingerprint (a sub-fingerprint). The goal is to
partition the feature vectors (in this case the Min-
Hash signatures) into sub-vectors and to hash each
sub-vector into separate hash tables. Each hash table
uses only a single sub-vector as input to the hash
function. Candidate neighbors are those vectors that
have collisions in at least one of the sub-fingerprint hashes; the more collisions, the more similar the vectors are judged to be.
Together with Min-Hash, LSH provides an effective
way to represent and lookup nearest neighbors of
large, sparse binary vectors. The results with the k-
NN system are given in Table 2. In order to make
the baselines as competitive as possible, we searched over a large range of possible k-values for each trial to find the best answer; the best value of k is shown in Table 2. Note that for smaller numbers of training examples,
k-NN outperformed SVMs; as the number of
training examples increased, the performance of k-
NN dropped below SVMs.
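The following sketch illustrates how the Min-Hash signatures can be banded into LSH tables and how colliding candidates can be used for the k-NN vote; the band size and other parameters are illustrative choices, not the settings used in our experiments.

```python
from collections import Counter, defaultdict

def build_lsh_tables(signatures, band_size=4):
    """Split each Min-Hash signature into bands of `band_size` values and hash
    each band into its own table.  `signatures` maps story id -> signature."""
    n_bands = len(next(iter(signatures.values()))) // band_size
    tables = [defaultdict(list) for _ in range(n_bands)]
    for story_id, sig in signatures.items():
        for b in range(n_bands):
            key = tuple(sig[b * band_size:(b + 1) * band_size])
            tables[b][key].append(story_id)
    return tables, band_size

def knn_label(query_sig, tables, band_size, labels, k=10):
    """Candidates are labeled stories colliding in at least one band; rank them
    by number of collisions (more collisions ~ more similar) and take a
    majority vote among the top k."""
    collisions = Counter()
    for b, table in enumerate(tables):
        key = tuple(query_sig[b * band_size:(b + 1) * band_size])
        for story_id in table.get(key, []):
            if story_id in labels:
                collisions[story_id] += 1
    top_k = [sid for sid, _ in collisions.most_common(k)]
    votes = Counter(labels[sid] for sid in top_k)
    return votes.most_common(1)[0][0] if votes else None
```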
3 ADSORPTION
The genesis of the family of algorithms that we
collectively call Adsorption (Baluja et al., 2008) is
the following question: assuming we wish to
classify a node in a graph in terms of class-labels
present on some of the other nodes, what is a
principled way to do it? Perhaps the easiest answer
to this question is to impose a metric on the
underlying graph and classify each node by adopting the labels present on its nearest neighbor. There are
a variety of metrics to choose from (e.g., shortest
distance, commute time or electrical resistance,
etc.), but most of these are expensive to compute,
especially for large graphs. Furthermore,
conceptually simple ones like shortest distance have
undesirable properties; for example, they do not take
into account the number of paths between the
labeled and unlabeled nodes. Adsorption provides
an intuitive, iterative, manner in which to propagate
labels in a graph.
The first step is setting up the problem in terms
of a graph. For the news story classification task, the
embedding is straightforward: each story is a node
in the graph, and the weights of the edges between
nodes represent the similarity between two news
stories. The similarity is computed via the MIN-
HASH/LSH distance described previously; if there
is a collision via the LSH procedure, then an edge exists and its weight is non-zero and positive. In the simplest version of the algorithm, each labeled story carries a single category. The
remaining nodes, those to be labeled, will gather
evidence of belonging to each of the seven classes
as Adsorption is run. At the end of the algorithm, for
each node, the class with the largest accumulated
evidence is assigned to the node (and therefore the
news story). When designing a label propagation
algorithm in this framework, there are several
overarching, intuitive, desiderata we would like to
maintain. First, node v should be labeled l only
when there are short paths, with high weight, to
other nodes labeled l. Second, the more short paths
with high weight that exist, the more evidence there
is for l. Third, paths that go through high-degree
nodes may not be as important as those that do not
(intuitively, if a node is similar to many other nodes,
then it being similar to any particular node may not
be as meaningful). Adsorption is able to capture
these desiderata effectively.
Next, we present Adsorption in its simplest
form: iterated label passing and averaging. We will
also present an equivalent algorithm, termed
Adsorption-RW, that computes the same values, but
is based on random walks in the graphs. Although
not presented in this paper, Adsorption can also be
defined as a system of linear equations in which we
can express the label distribution at each vertex as a
convex combination of the other vertices. Our
presentation follows our prior work (Baluja et al.,
2008), which also includes additional details. These
three interpretations of the Adsorption algorithm
provide insights into the computation and direct us
to important practical findings; a few will be briefly
described in Section 3.3.
Algorithm Adsorption:
  Input: $G = (V, E, w)$, $L$, $V_L$
  repeat
    for each $v \in V \cup \tilde{V}$ do:
      Let $L_v = \sum_u w(u, v) L_u$
    end-for
    Normalize $L_v$ to have unit $L_1$ norm
  until convergence
  Output: Distributions $\{L_v \mid v \in V\}$

Figure 2: Basic adsorption algorithm.
3.1 Adsorption via Averaging
In Adsorption, given a graph where some nodes
have labels, the nodes that carry some labels will
forward the labels to their neighbors, who, in turn,
will forward them to their neighbors, and so on, and
all nodes collect the labels they receive. Thus each
node has two roles, forwarding labels and collecting
labels. The crucial detail is the choice of how to
retain a synopsis that will both preserve the essential
parts of this information as well as guarantee a
stable (or convergent) set of label assignments.
Formally, we are given a graph $G = (V, E, w)$, where $V$ denotes the set of vertices (nodes), $E$ denotes the set of edges, and $w : E \rightarrow \mathbb{R}$ denotes a nonnegative weight function on the edges. Let $L$ denote a set of labels, and assume that each node $v$ in a subset $V_L \subseteq V$ carries a probability distribution $L_v$ on the label set $L$. We often refer to $V_L$ as the set of labeled nodes. For the sake of exposition, we will introduce a pre-processing step, where for each vertex $v \in V_L$, we create a "shadow" vertex $\tilde{v}$ with exactly one out-neighbor, namely $v$, connected via an edge $(\tilde{v}, v)$; furthermore, for each $v \in V_L$, we will re-locate the label distribution $L_v$ from $v$ to $\tilde{v}$, leaving $v$ with no label distribution. Let $\tilde{V}$ denote the set of shadow vertices, $\tilde{V} = \{\tilde{v} \mid v \in V_L\}$. From now on, we will assume that at the beginning of the algorithm, only vertices in $\tilde{V}$ have non-vacuous label distributions. See Figure 2 for the full algorithm.
Some comments on Adsorption: (1) In the
algorithm, we say that convergence has occurred if
the label distribution of none of the nodes changes
in a round. It can be shown that the algorithm
converges to a unique set of label distributions. (2) Upon convergence, each node $v \in V \cup \tilde{V}$ carries a label distribution, provided there is a path from $v$ to some node $u \in V_L$. (3) We do not update the label
distribution in each round; rather, we recompute it
entirely, based on the distributions delivered by the
neighbors. (4) Adsorption has an efficient iterative
computation (similar to PageRank (Brin and Page,
1998)), where, in each iteration, a label distribution
is passed along every edge.
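A compact sketch of this averaging computation (following Figure 2, with each shadow node's label distribution injected into its story node) is given below; the data layout, iteration cap, and convergence tolerance are illustrative choices rather than the settings of our production implementation.

```python
def adsorption_averaging(neighbors, weights, seed_labels, max_iters=50, tol=1e-4):
    """Averaging form of Adsorption (cf. Figure 2).
    neighbors[v]   : list of nodes adjacent to v
    weights[(u,v)] : nonnegative edge weight, stored for both orientations
    seed_labels[v] : fixed {label: weight} distribution held by v's shadow node
    Returns a {label: probability} distribution for every node."""
    dist = {v: {} for v in neighbors}          # initially only shadows carry labels
    for _ in range(max_iters):
        new_dist, max_change = {}, 0.0
        for v in neighbors:
            acc = {}
            for u in neighbors[v]:             # weighted sum of neighbor distributions
                for lab, p in dist[u].items():
                    acc[lab] = acc.get(lab, 0.0) + weights[(u, v)] * p
            for lab, p in seed_labels.get(v, {}).items():   # injection from shadow node
                acc[lab] = acc.get(lab, 0.0) + p
            total = sum(acc.values())
            if total > 0:                      # normalize to unit L1 norm
                acc = {lab: p / total for lab, p in acc.items()}
            keys = set(acc) | set(dist[v])
            max_change = max([max_change] + [abs(acc.get(k, 0.0) - dist[v].get(k, 0.0))
                                             for k in keys])
            new_dist[v] = acc
        dist = new_dist
        if max_change < tol:                   # convergence: no distribution changed
            break
    return dist
```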
Recalling that our goal was to maintain a
synopsis of the labels that are reachable from a
vertex, let us remark that the normalization step
following the step of computing the weighted sum
of the neighbors’ label distribution is crucial to our
algorithm. Labels that are received from multiple (or highly-weighted) neighbors will tend to have higher
mass after this step, so this normalization step
renders the Adsorption algorithm as a classifier in
the traditional machine learning sense. The
algorithm, as presented, is a modification of the
label propagation algorithm of Zhu et al. ((Zhu, 2005), (Zhu et al., 2003)), where they
considered the problem of semi-supervised classifier
design using graphical models. They also note that
their algorithm is different from a random-walk
model proposed by Szummer and Jaakkola
(Szummer and Jaakkola, 2001); in the next section
we will show that there is a very different random
walk algorithm that coincides exactly with the
Adsorption algorithm. The latter fact has also been
noticed independently by Azran (Azran, 2007). This
aspect of the Adsorption algorithm distinguishes it
from the prior works of Zhu et al.; the enhanced random walk model we present generalizes the work of Zhu et al., and presents a broader class of optimization problems that we can address (we thank P. Talukdar, personal communication, November 2008, for pointing this out). The
approach of Zhu et al. is aimed at labeling the
unlabeled nodes while preserving the labels on the
initially labeled nodes and minimizing the “error”
across edges. In Adsorption, there is a subtle, but
vital, difference, the importance attached to
preserving the labels, as well as the importance of
near vs. far neighbors is explicitly controlled
through the use of the injection-label weights and
abandonment probabilities. These will both be
described in detail in Section 3.3. The random walk
equivalence, under the mild conditions of
ergodicity, immediately implies an efficient
algorithm for the problem, a fact not obvious from a
general formulation as minimizing a convex
function. From a broader standpoint, it is interesting
to note that this family of "repeated averaging" algorithms has a long history in the mathematical
literature of differential equations, specifically in the
context of boundary value problems (i.e., estimating
the heat at a point of a laminar surface, given the
boundary temperatures).
3.2 Adsorption via Random Walks
The memoryless property of the Adsorption
algorithm that we alluded to earlier immediately
leads to a closely related interpretation. Let us
“unwind” the execution of the algorithm from the
final round, tracing it backwards. For a vertex $v \in V$, denote by $N_v$ the probability distribution on the set of neighbors of $v$, described by $N_v(u) = w(u, v) / \sum_{u'} w(u', v)$; that is, the probability of $u$ is proportional to the weight on the edge $(u, v)$. The label distribution of a vertex $v$ is simply a convex combination of the label distributions at its neighbors, that is, $L_v = \sum_u N_v(u) L_u$; therefore, if we pick an in-neighbor $u$ of $v$ at random according to $N_v$ and sample a label according to the distribution $L_u$, then for each label $l \in L$, $L_v(l)$ is precisely equal to $\mathrm{Exp}_u[L_u(l)]$, where the expectation arises from the process of picking a neighbor $u$ according to $N_v$. Extending this to neighbors at distance 2, it is easy to see that for each label $l \in L$, $L_v(l) = \mathrm{Exp}_u[\mathrm{Exp}_w[L_w(l)]]$, where an in-neighbor $u$ of $v$ is chosen according to $N_v$ and an in-neighbor $w$ of $u$ is chosen according to $N_u$. Expanding this out, we see that $L_v(l) = \sum_u \sum_w N_v(u)\, N_u(w)\, L_w(l)$.
Algorithm Adsorption-RW:
  Input: $G = (V, E, w)$, $L$, $V_L$, distinguished vertex $v$
  Let $\tilde{G} = (V \cup \tilde{V},\; E \cup \{(\tilde{v}, v) \mid v \in V_L\},\; w)$
  Define $w(\tilde{v}, v) = 1$ for all $v \in V_L$
  done := false
  vertex := $v$
  while (not done) do:
    vertex := pick-neighbor(vertex, $E$, $w$)
    if (vertex $\in \tilde{V}$)
      done := true
  end-while

Figure 3: Adsorption in terms of random walks.
Table 3: Adsorption, with Varying Number of Connections Per Node, 200 Labeled Nodes.

Maximum          Overall          Initial          Later
Connections      Performance      Performance      Performance
per Node         (Samples         (Samples         (Samples
                 500-3900)        500-600)         3800-3900)
10               80.1             90               72
100              88.1             92               83
500              86.6             91               84
1000             85.8             91               82
Unlimited        82.4             94               80
Notice that $N_v(u)$ is the probability of reaching $u$ from $v$ in one step of a random walk starting from $v$ and picking a neighbor according to $N_v$, and similarly, $N_u(w)$ is the probability of picking a neighbor $w$ of $u$ according to $N_u$. Notice also the crucial use of the Markov property (memorylessness) here: conditioned on the random walk having reached $u$, the only information used in picking $w$ is $N_u$, which depends only on $u$, and not on where we initiated the random walk from. Finally, note that if the random walk ever reaches one of the shadow vertices $\tilde{z}$, where $z \in V_L$, then there is no in-edge into $\tilde{z}$, so the random walk stops. Thus vertices in $\tilde{V}$ are "absorbing states" of the Markov chain defined by the random walk. A simple induction now establishes that the Adsorption algorithm is equivalent to the following variation, described in terms of random walks on the reverse of the graph $G$ together with the edges from $\tilde{V}$ to $V$. See Figure 3. Here, pick-neighbor$(v, E, w)$ returns a node $u$ such that $(u, v) \in E$ (so that there is an edge from $v$ to $u$ in the reversed graph) with probability $w(u, v) / \sum_{u'} w(u', v)$.
In our exposition, the algorithm takes a starting vertex $v$ for the random walk, and outputs a label distribution $L_v$ for it when it reaches an absorbing state. Thus, the label distribution for each node is a random variable, whose expectation yields the final label distribution for that vertex. To obtain label distributions for all vertices, this procedure needs to be repeated many times for every vertex, and the average distributions calculated. This yields a very inefficient algorithm; therefore, in practice, we exploit the equivalence of this algorithm to the averaging Adsorption algorithm of Section 3.1, which has very efficient implementations.
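For illustration, a single walk of Adsorption-RW can be sketched as follows; averaging the labels returned by many such walks started at a node approximates that node's label distribution, although, as noted above, the averaging formulation is what we use in practice. All names and data layouts are illustrative assumptions.

```python
import random

def adsorption_rw_sample(start, in_neighbors, weights, seed_labels, rng=random):
    """One walk of Adsorption-RW (cf. Figure 3): from `start`, repeatedly step to
    an in-neighbor chosen with probability proportional to the edge weight.  A
    labeled node also exposes a unit-weight edge to its shadow node, which
    absorbs the walk and emits a label drawn from seed_labels[node]."""
    v = start
    while True:
        choices = [(u, weights[(u, v)]) for u in in_neighbors[v]]
        if v in seed_labels:
            choices.append(("__shadow__", 1.0))          # edge to the label-bearer
        if not choices:                                  # isolated, unlabeled node
            return None
        total = sum(w for _, w in choices)
        r, acc, nxt = rng.uniform(0, total), 0.0, choices[-1][0]
        for u, w in choices:
            acc += w
            if r <= acc:
                nxt = u
                break
        if nxt == "__shadow__":                          # absorbed: output one label
            labels, probs = zip(*seed_labels[v].items())
            return rng.choices(labels, weights=probs)[0]
        v = nxt
```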
It is instructive to compare algorithm
Adsorption-RW with typical uses of stationary
distributions of random walks on graphs, such as the
PageRank algorithm (Brin and Page, 1998). In the
case of PageRank, a fixed Markov random walk is
considered; therefore, the stationary probability
distribution gives, for each node of the graph, the
probability that the walk visits that node. In the
absence of any absorbing node (and assuming the
walk is ergodic), the initial choice of the node from
which the random walk starts is irrelevant in
determining the probability of reaching any
particular node in the long run. Consequently, these
methods do not allow us to measure the influence of
nodes on each other. In our situation, labeled nodes
are absorbing states of the random walk; therefore,
the starting point of the walk determines the
probability with which we will stop the walk at any
of the absorbing states. This implies that we may use
these probabilities as a measure of the influence of
nodes on each other.
3.3 Injection and Abandonment
Probabilities in Adsorption
The three equivalent renditions of the algorithm
(averaging, random walk, system of linear
equations) lead to a number of interesting variations
that one may employ. For example, in the viewpoint
of a linear system of equations, it is easy to see how
we can restrict which labels are allowed for a given
node. In another variation, we can model the
“amount of membership” of a node to a class. Recall
the notion of a "shadow" node $\tilde{v}$ that acts as a "label-bearer" for $v$. A judicious choice of edge weight along the edge to the label-bearer, or equivalently the label injection probability, helps us control how the random walk behaves (this is equivalent to choosing the transition probability from $v$ to $\tilde{v}$ in the reversed graph). For example, lower transition probabilities to the shadow nodes may indicate lower membership in the label class (e.g., a news story may be labeled only ½ "Politics" because it is only tangentially related). Note that assigning a ½ "Politics" label does not imply that the other ½ must be assigned to another class. This will be used in
experiments described in Section 5.
Table 4: Adsorption Performance, with Varying Labeled Samples, 500 connections per node.

Labeled     Overall          Initial          Later
Examples    Performance      Performance      Performance
            (Samples         (Samples         (Samples
            500-3900)        500-600)         3800-3900)
0-100       86.4             91               84
0-200       86.6             91               84
0-300       86.4             91               82
0-400       86.8             92               84
0-500       86.5             93               83

Another important insight is realized when examining Adsorption in terms of random walks. Instead of considering the standard random walk on an edge-weighted graph, one may consider a "hobbled walk," where at each step, with some probability, which we call the abandonment probability, the algorithm abandons the random walk without producing any output label. Our experiments (here and in other applications) have confirmed that abandoning the random walk with a small probability at each iteration is a very useful feature. It slows down the random walk in a quantifiable way: the influence of a label $l$ on a node $u$ falls off exponentially in the number of nodes along the paths from the nodes that carry $l$ to $u$. This strengthens the effect of nearby nodes; this has proven crucial to obtaining good results in large graphs.
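The abandonment mechanism can be illustrated as a one-line change to the random-walk sketch of Section 3.2; the value of the abandonment probability below is illustrative only, not a setting from our experiments.

```python
import random

# Sketch: adding an abandonment probability to the random-walk view above.
# At every step the walk halts with probability `alpha` and emits no label, so
# the influence of a label decays exponentially with path length.
def take_step(rng=random, alpha=0.05):
    """Return False when the walk should be abandoned at this step."""
    return rng.random() >= alpha

# Inside the while-loop of adsorption_rw_sample (Section 3.2 sketch):
#     if not take_step(rng, alpha):
#         return None        # abandoned: this walk contributes no label
```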
4 INITIAL EXPERIMENTS WITH
ADSORPTION
In the experiments presented in this section, we use
the same data that was presented in Section 2, and
apply the Adsorption algorithm. Given the similarity
measurements that were computed via the MIN-
HASH & LSH combination described earlier, the
graph and weights are constructed by simply setting each story as a node in the graph, with the weight of the edge between two stories given by the similarity computation described above. The stories that are in the labeled set have
shadow nodes attached to them with the correct
label; stories outside of the labeled set do not have
shadow nodes. Adsorption computes a label
distribution at each node; the label with the
maximum value at the end of the Adsorption run is
considered the node’s (and therefore story’s)
classification. In constructing the graph to use with
Adsorption, a number of options are available.
Encoding domain-specific information into the
graph topology may be a powerful way to express
any a priori or expert knowledge of the task. For
example, knowing that the most accurate
classifications are likely to happen in stories
temporally close to the labeled stories, connections
to nodes representing earlier news stories may
receive a higher weighting; or connections to the
labeled set may be prioritized over other
connections, etc. Nonetheless, to avoid confusing
the causes of the performance numbers and
introducing ad-hoc, domain specific, heuristics, we
experimented with only domain-independent
parameters. One of the most salient arises when we construct the graph: we can limit the number of closest neighbors connected to each node. In Table 3, we experiment with connecting each node to at most its S = 10, 100, 500, or 1000 most similar stories. (Because there may be fewer than S collisions for a news story in the LSH hash-tables used to rapidly estimate similarity, a node may not reach the maximum of S connections; conversely, since the connections are undirected, a node may end up with more than S connections, although the total number of undirected connections will not exceed $S \cdot |V|$.) Perhaps the most interesting observation is that increasing the number of connections does not necessarily increase the performance. As the maximum number of connections is increased, eventually the connections encode such weak similarities between the news stories that it is better not to use them. Currently, we set the maximum number of connections empirically (to 500); in the future, other methods will be explored.
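A sketch of this graph construction is given below; for clarity it scores all pairs of stories directly from their Min-Hash signatures, whereas in practice candidate neighbors come from LSH collisions, and all names are illustrative assumptions.

```python
def sig_similarity(sig1, sig2):
    """Fraction of matching Min-Hash positions (estimates the Jaccard coefficient)."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

def build_adsorption_graph(signatures, labels, max_connections=500, injection=1.0):
    """signatures: {story_id: Min-Hash signature}; labels: {story_id: category}.
    Each story proposes edges to at most its `max_connections` most similar
    stories; because edges are undirected, a story may still end up with more."""
    neighbors = {sid: [] for sid in signatures}
    weights, seeds = {}, {}
    ids = list(signatures)
    for v in ids:
        sims = sorted(((sig_similarity(signatures[v], signatures[u]), u)
                       for u in ids if u != v), reverse=True)
        for sim, u in sims[:max_connections]:
            if sim > 0 and (u, v) not in weights:
                neighbors[v].append(u)
                neighbors[u].append(v)
                weights[(u, v)] = weights[(v, u)] = sim
    for sid, category in labels.items():       # shadow node with injection weight
        seeds[sid] = {category: injection}
    return neighbors, weights, seeds
```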
Having set the connection count, we examine
the effects of the number of labeled training
samples. With only 100 labeled examples, the Adsorption algorithm achieves overall and long-term performance comparable to, or exceeding, the best k-NN and SVM performance with 500 labeled examples. Results are shown in Table 4.
In the next section, we continue the empirical
evaluation by looking at larger numbers of news
stories, and the effects of even fewer labels.
5 FULL-SCALE EXPERIMENTS
The first full-scale experiment parallels the
experiments presented to this point. We assume that
we have 100 labeled examples and that we would
like to categorize examples that appear up to 80 days after the labeled examples were classified.
The performance is shown in Figure 4; each day on which news stories were gathered is shown in the graph. In Figure 4 (right), the comparative results for k-NN are given. The average performance for Adsorption is 87.8%; for k-NN, 82.5%. Other techniques such as Naïve-Bayes and SVMs were also tried; of these other techniques, k-NN performed the best. Specifically, Naïve-Bayes performed worse than both SVMs and k-NN, and SVMs performed worse than k-NN.

Figure 4: Performance of Adsorption and k-NN over 81 days. (Left: Adsorption, average accuracy 0.878; right: k-NN, average accuracy 0.825; each panel plots per-day classification accuracy.)
In our second experiment, we explore the
ramifications of having two orders of magnitude
less
training data. Only a single example is labeled on
day 1. The goal is to examine the articles in the last
three days (days 78-81), and to rank them according
to the probability of being in the same class as the
single labeled example from the first day. This
scenario is a proxy for a very common scenario
encountered in practice in sites like
news.google.com and other news aggregation sites.
A user may read only a small number of articles one
day, and then come back to the site many days later.
Although there is not much evidence of the user’s
preferences, we know simply that of all the articles
the user could have chosen to read on day 1, (s)he
read a single one. In this case, the labels for the first day's articles are simply 0 (the article was not read) or 1 (the article was read). For Adsorption, we
weighted the examples with label 0 with an injection
probability of 0.1 to reflect uncertainty about why the user did not read the article: was it because of interest, time, or simply not noticing it? The articles labeled 1 ("read") continued to have an injection probability of 1.0.
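Concretely, the seed (injection) assignment for this experiment can be sketched as follows; the dictionary layout matches the averaging sketch of Section 3.1 and is illustrative rather than the exact representation used.

```python
def user_seed_labels(day1_article_ids, read_article_id):
    """Seed (injection) assignment for the single-user experiment: the one
    article the user read is injected with weight 1.0, every other day-1
    article with a weaker 0.1 injection, reflecting the uncertainty about
    why it went unread.  Shape: {node: {label: injection weight}}."""
    seeds = {}
    for article_id in day1_article_ids:
        if article_id == read_article_id:
            seeds[article_id] = {"read": 1.0}
        else:
            seeds[article_id] = {"not_read": 0.1}
    return seeds
```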
The performance was measured as follows. 500
articles from the last few days of the experiment
were ranked according to their probability of being
from the same class as the ‘read’ article. The full
Adsorption connectivity graph was used, as
described in the previous experiments, to propagate
the label through time. In Figure 5, we examine the
top-N (N = 5, 10, 25, 50) of the ranked articles, and
give the percentage of the N that are from the same
class. As can be seen in Figure 5 (left), even with a single example, the average precision rate with Adsorption is approximately 84% for the top-5 examples, and over 80% for the top-10. In Figure 5 (right), the same experiment is performed, but it measures the effect of having added a second labeled example (from the same class as the first).
All algorithms improve dramatically over all ranges
of N. Interestingly, a single additional labeled
example provides information that all the algorithms
effectively exploit. Adsorption continues to outperform k-NN and SVMs in both tests, for all values of N. (Transductive learning with unlabeled samples was again used for the SVM here; it slightly improved performance in a few trials, and the best result of the two is reported.)

Figure 5: Experiments with 1 & 2 labeled examples. Precision at 5, 10, 25, and 50 results in retrieving examples from the same class as the single labeled example (left) or two labeled examples (right). (Each panel plots precision against the number of examples examined for Adsorption, k-NN, and SVM.)
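For completeness, the precision-at-N measurement used in Figure 5 can be computed as in the following sketch; the ranked list itself comes from sorting the candidate articles by their accumulated "read" label probability, and all names are illustrative.

```python
def precision_at_n(ranked_ids, true_class, target_class, n_values=(5, 10, 25, 50)):
    """Precision at N: the fraction of the top N ranked articles whose true
    class matches the class of the labeled ('read') example."""
    return {n: sum(true_class[a] == target_class for a in ranked_ids[:n]) / n
            for n in n_values}
```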
6 CONCLUSIONS & FUTURE
WORK
In this paper, we have presented an efficient and simple procedure for incorporating an often ignored signal into the task of news classification:
time. Although the writing, vocabulary and topics of
the news stories rapidly change over time, we are
able to perform the classification of news stories
with very little training data that is received only at the beginning of the testing period.
There are many avenues for future research, both in
this task and in the development of Adsorption.
First, a comparison with different unlabeled data
learning systems is warranted. Although, in this study, we used transductive SVMs as a means to incorporate unlabeled data, they did not improve
performance significantly. Other methods, such as
spectral clustering may do better. Although most
other techniques do not incorporate a notion of time,
perhaps combinations of the other methods with the
ones presented here can be devised; this is a potentially large area of interest. Second, we used a
simple graph structure that did not incorporate all of
the available domain information (e.g. all the labeled
examples are at the beginning). Using the graph
structure to encode domain knowledge will be very
relevant in new domains as well. Further, graph
pruning algorithms are of interest, especially in the
cases in which domain knowledge may not be
readily available; as was seen in the experiments,
more connections do not imply improved
performance. Finally, this test was conducted over a
period of approximately 3 months with real
examples of rapidly shifting news stories that
exemplify current news-aggregation-site challenges;
longer tests are forthcoming.
REFERENCES
Topic Detection and Tracking Evaluation,
http://www.nist.gov/speech/tests/tdt/
Allen, J. (2002) Topic Detection and Tracking: Event-
Based Information Org., Springer.
Mori, M., Miura, T., Shioya, I. (2006) Topic Detection
And Tracking for News Web Pages, IEEE/WIC/ACM
International Conference on Web Intelligence, 2006.
Baluja, S., Seth, R., Sivakumar, D., Jing, Y., Yagnik, J.,
Kumar, S., Ravichandran, D., Aly M., (2008) Video
Suggestion and Discovery for YouTube: Taking
Random Walks Through the View Graph (WWW-
2008).
Pomikalek, J., Rehurek, R. (2007) The Influence of
preprocessing parameters on text categorization,
Proceedings of World Academy of Sci, Eng. Tech, V21
McCallum A. and Nigam, K. (1998) A comparison of
event models for Naïve Bayes text classification,
AAAI-98 Workshop on Learning for Text
Categorization.
Cortes, C. & Vapnik, V. (1995). Support-Vector
Networks. Machine Learn. J., 273-297.
Joachims T. (2002), Learning to Classify Text Using
Support Vector Machines. Dissertation, Kluwer, 2002.
(code from SVM-Light: http://svmlight.joachims.org/)
Joachims T. (1999), “Transductive Inference for Text
Classification using Support Vector Machines”.
International Conference on Machine Learning
(ICML), 1999.
Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.;
Motwani, R.; Ullman, J.D.; Yang, C. (2001) Finding
interesting associations without support pruning.
Knowledge and Data Engineering, V13:1
Gionis, A., Indyk, P., Motwani, R. (1999), Similarity
search in high dimensions via hashing. Proc.
International Conference on Very Large Data Bases,
1999.
Brin, S. and Page, L. (1998). The anatomy of a large-scale
hypertextual web search engine. Comp. Nets 30
Zhu, X. (2005) Semi-Supervised Learning with Graphs.
Carnegie Mellon U., PhD Thesis.
Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions, in International Conference on Machine Learning-20.
Szummer, M. & Jaakkola, T. (2001) Partially labeled
classification with Markov random walks. NIPS-2001.
Azran, A. (2007) The Rendezvous Algorithm: Multiclass
semi-supervised learning with markov random walks.
In International Conference on Machine Learning -24,
2007.
Baluja, S. & Covell M. (2008) Audio Fingerprinting:
Combining Computer Vision & Data Stream
Processing, Int. Conf. Acoustics, Speech and Signal
Processing (ICASSP-2008).
Ifrim, G. & Weikum, G.,(2006) Transductive Learning for
Text Classification using Explicit Knowledge Models,
PKDD-2006
Project for Excellence in Journalism (2008). “A Year in
the News”, The State of News Media 2008: An Annual
Report on American Journalism.
http://www.stateofthenewsmedia.org/2008/index.php