of supervised clustering, where some approaches optimize or learn the clustering similarity function from the training data. Eick et al. interleave clustering with distance function learning in order to learn a distance function (Eick et al., 2005). Demirez et al. (1999) use genetic algorithms and an appropriate fitness function to guide the search for an optimal clustering. Rendle et al. (2006) have also proposed supervised techniques that use labelled data in the form of Must-Link and Cannot-Link constraints to guide the search for a clustering; these techniques have also been applied in the field of Record Linkage. Basu et al. use labelled data to create Must-Link and Cannot-Link constraints in their semi-supervised clustering approach in order to cluster unlabeled data (Basu et al., 2003).
Learning a similarity function between documents is directly related to the task of learning a ranking function in the area of Information Retrieval. In our paper we propose a method similar to traditional information retrieval, where the system retrieves and ranks documents from the collection according to their similarity to the query document. Joachims proposes Ranking SVMs to learn a linear ranking model that can be exploited to rank documents in information retrieval scenarios (Joachims, 2002). In his scenario, the targets are not class labels but a binary ordering relation. Others like Fakeri-Tabrizi et al. use a Ranking SVM for an imbalanced classification problem in image annotation and show that this type of SVM performs better than a standard SVM (Fakeri-Tabrizi et al., 2011). Freund et al. propose an efficient boosting algorithm to combine linear rankings by experts (Freund et al., 2003). These experts correspond to the similarity functions we use as features in our approach. Our results indeed show that a ranking SVM can be successfully applied to the task of learning a similarity function for the event identification problem. It shows a more stable behavior than a standard SVM and is able to learn the similarity function with fewer training examples. Gao et al. have developed a similar approach but use a perceptron instead of an SVM to learn the ranking function (Gao et al., 2005).
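To make the pairwise idea behind a ranking SVM concrete, the following minimal sketch uses scikit-learn's LinearSVC as a stand-in for a dedicated Ranking SVM implementation: ordering preferences are turned into difference vectors, a linear SVM is trained on them, and the learned weight vector then acts as a linear scoring function. The feature vectors and preference pairs below are purely illustrative assumptions and are not the features or data used in our experiments.

# Sketch of the pairwise transformation behind a ranking SVM (Joachims, 2002),
# using an ordinary linear SVM as a stand-in implementation.
import numpy as np
from sklearn.svm import LinearSVC

# Each row is an illustrative feature vector of one (query document, candidate)
# pair, e.g. scores from several base similarity functions.
X = np.array([[0.9, 0.8],   # candidate that should rank high
              [0.2, 0.1],   # candidate that should rank low
              [0.7, 0.9],
              [0.3, 0.2]])

# Preference pairs (i, j): candidate i should be ranked above candidate j.
prefs = [(0, 1), (2, 3)]

# Pairwise transformation: x_i - x_j labelled +1, x_j - x_i labelled -1.
diffs, labels = [], []
for i, j in prefs:
    diffs.append(X[i] - X[j])
    labels.append(1)
    diffs.append(X[j] - X[i])
    labels.append(-1)

svm = LinearSVC(C=1.0)
svm.fit(np.array(diffs), np.array(labels))

# The learned weight vector acts as a linear scoring (similarity) function:
# higher scores mean the candidate should be ranked closer to the query.
scores = X @ svm.coef_.ravel()
print(scores)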
In the area of event identification, there have been several attempts to decide whether a document is associated with an event or not. The task of grouping documents into clusters describing the same event has been dubbed event identification (Becker et al., 2010). Others learn to distinguish Flickr documents representing an event from those that do not (Rattenbury and Naaman, 2009). Allan and Papka used an incremental clustering approach for detecting events in text document streams (Allan et al., 1998). Becker et al. use a similar method to identify event clusters (Becker et al., 2010).
Firan et al. formulate the event identification problem as a standard classification task, thus learning a function class : D → E (Firan et al., 2010). They use a Naive Bayes classifier for this purpose. In our view this formulation is problematic: it does not account for the dynamic nature of the data, as it does not model the fact that new events constantly emerge. To accommodate a new event added to the system, a new classifier has to be trained, and there is no mechanism for detecting new events. Moreover, our results have shown that there are on average only 26 data points per event. Given this sparsity of data, it is not clear whether it is possible to learn an appropriate classifier for each event.
5 CONCLUSIONS
In this paper we addressed the question of how a suitable similarity function can be learned from training data for the task of identifying events in social media data. We made use of different types of SVMs (ranking and standard SVMs) in order to train an appropriate similarity measure. We investigated different strategies for creating training data, the impact of the amount of training data, and the choice of kernel. We have shown that it is possible to learn a suitable similarity measure with both types of SVM, and that only a few examples are sufficient to train a model. Using a ranking SVM, a single positive and a single negative pair already allow the creation of a suitable similarity measure. Furthermore, the performance of a ranking SVM does not vary significantly for different sizes of training data and is thus more robust than that of a standard SVM. Moreover, we found that a linear kernel consistently outperforms an RBF kernel, regardless of which type of SVM is used.
We have clearly shown that the sampling strategy is crucial for the success of the training process. The use of well-chosen training examples improves the quality significantly. We have seen that searching for the nearest wrong pair helps to create a good model. The fact that the time-based sampling strategy is comparable to the nearest-pair strategy for small amounts of data explains the good results of the SVM in the approach of Becker et al., who used a time-based strategy with 500 training examples (Becker et al., 2010). In general, only little data is needed to create a working model; if too much training data is used, the standard SVM in particular suffers from overfitting.
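To illustrate what a nearest-wrong-pair sampling strategy can look like, the following minimal sketch selects, for each document, the most similar document belonging to a different event as a negative training pair. The feature vectors, event labels, and the use of cosine similarity are assumptions made for illustration only and do not reproduce our exact experimental setup.

# Sketch of a nearest-wrong-pair sampling strategy (illustrative assumptions only).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((6, 4))                   # placeholder document feature vectors
events = np.array([0, 0, 1, 1, 2, 2])    # placeholder event labels

sim = cosine_similarity(X)
negative_pairs = []
for i in range(len(X)):
    wrong = np.where(events != events[i])[0]   # documents of other events
    j = wrong[np.argmax(sim[i, wrong])]        # most similar wrong document
    negative_pairs.append((i, j))

print(negative_pairs)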
Furthermore, we have found in our feature analysis that the lack of a single feature has no great impact on the quality of the system. The results are