task of TDT. In the Topic Detection (TD) task, the most
prominent topics are identified in a collection of
documents. In other words, TD consists in identifying,
in several continuous news streams, the stories that
concern new or previously unidentified events. Such
unidentified events may be retrieved in an accumulated
collection (“retrospective detection”), or, in “on-line
detection”, flagged as new events in real time. Given
that TD is the problem of assigning labels to unlabelled
data by grouping together subsets of news reports with
similar contents, most unsupervised learning methods
proposed in the literature, such as (Wartena and Brussee,
2008) and (Jia Zhang et al., 2011), exploit text
clustering algorithms to solve this problem.
In the most common approaches, such as (Wartena and
Brussee, 2008), no list of topics is given in advance:
identifying and characterizing each topic is a main part
of the task. For this reason, a training set or other
forms of external knowledge cannot be exploited, and
only the information contained in the collection itself
can be used to solve the Topic Detection problem.
The method proposed in (Wartena and Brussee, 2008) is a
two-step approach: in the first step, a list of the most
informative keywords is extracted; the second step
consists in identifying clusters of keywords, whose
centers are defined as the representations of the topics.
The authors of (Wartena and Brussee, 2008) considered
topic detection without any prior knowledge of the
category structure or of the possible categories.
Keywords are extracted and clustered, based on different
similarity measures, using the induced k-bisecting
clustering algorithm. They considered distance measures
between words, based on their statistical distribution
over a corpus of documents, in order to find a measure
that yields good clustering results.
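The idea of comparing words through their statistical distributions can be sketched as follows. This is not the authors' implementation: the co-occurrence representation, the example corpus, and the choice of the Jensen-Shannon divergence (one of several distributional measures they compare) are illustrative assumptions.

```python
import math
from collections import Counter

def word_distribution(word, docs):
    """Illustrative stand-in: represent a word by the distribution of the
    terms that co-occur with it across the corpus (each doc is a token list)."""
    counts = Counter()
    for doc in docs:
        if word in doc:
            counts.update(t for t in doc if t != word)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two sparse distributions
    (base-2 logarithm, so the value lies in [0, 1])."""
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}
    def kl(a):
        return sum(v * math.log2(v / m[t]) for t, v in a.items() if v > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy corpus: two "pet" documents and two "finance" documents.
docs = [["cat", "dog", "pet"], ["cat", "pet", "vet"],
        ["stock", "market", "trade"], ["stock", "trade", "price"]]
p = word_distribution("cat", docs)
q = word_distribution("stock", docs)
```

Words from unrelated contexts have disjoint co-occurrence supports, so their divergence is maximal; such a distance can then feed any clustering algorithm, including the k-bisecting one used by the authors.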
In (Bolelli and Ertekin, 2009), a generative model
based on Latent Dirichlet Allocation (LDA), called
Segmented Author-Topic Model (S-ATM), is proposed; it
integrates the temporal ordering of the documents into
the generative process in an iterative fashion. The
document collection is split into time segments, and the
topics discovered in each segment are propagated to
influence the topic discovery in the subsequent time
segments. The document-topic and topic-word
distributions learned by LDA describe the best topics
for each document and the most descriptive words for
each topic. An extension of LDA is the Author-Topic
Model (ATM). In ATM, a document is represented as a
product of the mixture of topics of its authors, where
each word is generated by the activation of one of the
topics of the document's authors, but the temporal
ordering is discarded. S-ATM is based on ATM and
extends it to integrate the temporal characteristics of
the document collection into the generative process.
Moreover, S-ATM learns author-topic and topic-word
distributions for scientific publications, integrating
the temporal order of the documents into the generative
process.
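The plain-LDA building block that S-ATM extends can be sketched with scikit-learn (this is not S-ATM itself, which adds authors and time segments; the toy corpus and parameters are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus with two evident themes (finance and sport).
docs = [
    "stock market trading prices rise",
    "market prices fall as trading slows",
    "team wins the football match",
    "football season match results",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic mixture (rows sum to 1).
doc_topic = lda.transform(counts)
# Topic-word distributions, normalised so each row sums to 1.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

The two matrices correspond to the document-topic and topic-word distributions mentioned above: the first ranks topics per document, the second ranks words per topic.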
The goals in (Seo and Sycara, 2004) are: i) the
system should be able to group the incoming data into
clusters of items with similar content; ii) it should
report the contents of each cluster in a summarized,
human-readable form; iii) it should be able to track
events of interest in order to take advantage of
developments. The proposed method is motivated by
constructive and competitive learning from neural
network research. In the construction phase, it tries to
find the optimal number of clusters by adding a new
cluster whenever an intrinsic difference is detected
between the presented instance and the existing clusters.
Then each cluster moves toward the optimal cluster
center by adjusting its weight vector according to the
learning rate.
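The construction phase described above can be sketched as follows; the threshold test, parameter values, and function name are illustrative assumptions, not the authors' exact rule.

```python
import numpy as np

def constructive_clustering(instances, threshold, learning_rate):
    """Sketch of constructive competitive learning: a new cluster is created
    when an instance is too far from every existing center; otherwise the
    winning center moves toward the instance by the learning rate."""
    centers = [np.asarray(instances[0], dtype=float)]
    for x in map(np.asarray, instances[1:]):
        dists = [np.linalg.norm(x - c) for c in centers]
        winner = int(np.argmin(dists))
        if dists[winner] > threshold:
            # Intrinsic difference detected: add a new cluster.
            centers.append(x.astype(float))
        else:
            # Competitive step: adjust the winner's weight vector.
            centers[winner] += learning_rate * (x - centers[winner])
    return centers

# Two well-separated groups yield two clusters without fixing k in advance.
centers = constructive_clustering(
    [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]],
    threshold=1.0, learning_rate=0.1)
```

The number of clusters thus emerges from the data and the threshold, which is the point of the construction phase.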
In (Song et al., 2012), a text clustering algorithm
called C-KMC, which combines Canopy clustering and a
modified k-means, is introduced and applied to topic
detection. The algorithm is based on two steps: in the
first, called C-process, Canopy clustering splits all
sample points roughly into overlapping subsets using a
cheap, inaccurate similarity measure; in the second,
called K-process, a modified K-means that uses the
X-means algorithm generates refined clusters from the
canopies that share common instances. In this algorithm,
the canopies are an intermediate result that reduces the
computing cost of the second step and makes it much
easier to apply, although a canopy is not a complete
cluster or topic.
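The C-process can be sketched with the standard two-threshold Canopy procedure; the thresholds, toy data, and distance are illustrative, and the paper's modified K-process is not reproduced here.

```python
def canopy(points, t1, t2, dist):
    """Canopy clustering sketch: T1 > T2 are the loose and tight thresholds
    for the cheap distance. Points within T1 of a canopy center join it
    (canopies may overlap); points within T2 stop being candidate centers."""
    assert t1 > t2, "T1 must be the looser (larger) threshold"
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        members = [i for i in remaining
                   if dist(points[i], points[center]) < t1]
        canopies.append(members)
        remaining = [i for i in remaining
                     if dist(points[i], points[center]) >= t2]
    return canopies

# 1-D toy data: three rough groups under an absolute-difference distance.
groups = canopy([0.0, 0.5, 3.0, 3.4, 10.0], t1=2.0, t2=1.0,
                dist=lambda a, b: abs(a - b))
```

Each canopy then only needs to be refined internally by the (modified) k-means step, which is where the computational saving comes from.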
The authors of (Zhang and Li, 2011) used the vector
space model (VSM) to represent topics, and then applied
the K-means algorithm in a topic detection experiment.
They studied how the corpus size and K-means affect
topic detection performance, and used the TDT
evaluation method to assess the results. The
experiments showed that the optimal topic detection
performance based on a large-scale corpus is 38.38%
higher than that based on a small-scale corpus.
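A minimal VSM-plus-K-means pipeline of the kind used above can be sketched with scikit-learn; the TF-IDF weighting, toy corpus, and parameter values are illustrative assumptions rather than the authors' setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus with two evident topics.
docs = [
    "stock market trading prices",
    "market prices rise in trading",
    "football match season results",
    "team wins the football match",
]

# VSM representation: each document becomes a TF-IDF weighted vector.
vectors = TfidfVectorizer().fit_transform(docs)

# K-means assigns each document vector to one of k topic clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
```

On larger corpora the same pipeline applies unchanged, which is what makes corpus size an easy variable to study in such experiments.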
3 METHODOLOGY
The proposed methodology, aiming to extract and re-
duce the features characterizing the most significant
topic within a corpus of documents, is depicted in