the most important documents on the basis of their
relevance. We also discuss measures used to detect
similarity between Hot Days and Derived Hot days.
2 PROPOSED APPROACH
We first explain the use of NLP techniques to
identify the most important documents and rank them
on the basis of their relevance. We then compare
the similarity of the ranked relevant documents
between pairs of Hot days, and between Hot days
and Derived Hot days, to determine the correlation
between them. This allows us to estimate chains
of correlated Hot days.
2.1 NLP Techniques to Identify and
Rank Important Documents
Once we know the days on which the topic of interest
is Hot, we process all the documents from that day
to identify the most important ones. A standard
tagger, the TnT tagger, is used for part-of-speech
tagging; the default model shipped with TnT serves
as the pre-trained model for providing tags. For
each Hot day we select the relevant documents (those
whose score is greater than 0.5). We parse each
document into a format suitable for the TnT tagger,
with one token per line, and keep the tokens of each
document in a single file. This file serves as our
test file to be tagged, and we use the trigram model
to assign a tag to each token.
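The one-token-per-line preparation step might be sketched as follows; the regular-expression tokenizer is an assumption, since the paper does not specify how tokens are split:

```python
# Sketch: convert a raw document into the one-token-per-line format
# expected by TnT-style taggers. The regex tokenizer (words and
# punctuation as separate tokens) is an assumption, not from the paper.
import re

def to_token_lines(text):
    # \w+ matches word tokens; [^\w\s] matches single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join(tokens)

print(to_token_lines("Elections were held today."))
```

The resulting file can then be passed to the tagger as its test input.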
Once the tags are assigned, we select noun-noun
(NN) phrases and adjective-noun (JN) phrases from
each document. These are the most important phrases
or concepts, and they capture the main content of
the documents efficiently. The same NN or JN phrase
can occur multiple times in a document on a given
Hot day, so we maintain the frequency of occurrence
of each phrase in each document. We can then
estimate the total number of occurrences of each
such phrase across all the documents for that day.
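The phrase extraction and counting step above could be sketched as follows, assuming Penn Treebank tag names (as in the default TnT model); the helper name is illustrative:

```python
# Sketch: extract noun-noun (NN) and adjective-noun (JN) bigram
# phrases from a tagged token stream and count their frequencies.
# Penn Treebank tag prefixes ("NN", "JJ") are assumed.
from collections import Counter

def phrase_counts(tagged_tokens):
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith("NN") and t2.startswith("NN"):    # NN phrase
            counts[(w1, w2)] += 1
        elif t1.startswith("JJ") and t2.startswith("NN"):  # JN phrase
            counts[(w1, w2)] += 1
    return counts

doc = [("presidential", "JJ"), ("election", "NN"), ("result", "NN")]
print(phrase_counts(doc))
```

Summing such counters over all relevant documents of a day yields the day-level phrase totals.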
We thus have a bag-of-phrases model for each
relevant document, and for all relevant documents
taken together (the background information) on a
Hot day. We now define a mechanism to rank the
relevant documents. We convert each bag of phrases
into a vector space model, assigning 0 to phrases
not present in the document and the phrase frequency
to those that are present. We then compute the
cosine similarity between the vector for a given
document and the vector corresponding to the
relevant background information. Mathematically,
this is represented as:
D_i = vector of phrases present in document D_i
N = vector of phrases present in all documents

Score(D_i) = CosineSimilarity(D_i, N)    (1)
The greater the score for a document D_i, the
greater the importance or relevance of that document
for the given Hot day. That is, the greater its
similarity with the background information of that
day, the higher its rank.
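Equation (1) might be sketched as follows, representing each vector as a dict mapping phrase to frequency (absent phrases are implicitly 0); the function and variable names are illustrative:

```python
# Sketch of Equation (1): score a document's phrase-frequency vector
# against the background vector N built from all relevant documents
# on the Hot day. Sparse dict vectors stand in for the vector space
# model; names are illustrative, not from the paper.
import math

def cosine_similarity(d, n):
    dot = sum(freq * n.get(phrase, 0) for phrase, freq in d.items())
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_n = math.sqrt(sum(v * v for v in n.values()))
    if norm_d == 0 or norm_n == 0:
        return 0.0
    return dot / (norm_d * norm_n)

doc = {("presidential", "election"): 2, ("vote", "count"): 1}
background = {("presidential", "election"): 5, ("exit", "poll"): 3}
print(round(cosine_similarity(doc, background), 3))
```

Sorting the relevant documents by this score in descending order produces the ranking.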
2.2 Similarity between Hot Days and
Derived Hot Days
We now estimate whether Hot days are actually
correlated, based on context. For this purpose we
use a variable parameter k, which limits the
selection of the ranked relevant documents: for
example, if we set k = 10, only the top-ranking
10% of the relevant documents are used. We then
construct a vector from the top-ranking k percent
of the documents, in a manner similar to that
described in the previous subsection. We consider
Hot days in consecutive pairs; since there are 31
Hot days, we have 30 such pairs. We then calculate
the cosine similarity between each pair of days:
the greater the similarity, the stronger the actual
correlation between those Hot days. The parameter k
plays an important role in the quality of the
correlation.
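The pairing procedure above can be sketched as follows, assuming each day is given as a list of phrase-frequency dicts already sorted by rank; all function names are illustrative:

```python
# Sketch: build a day vector from the top-ranking k percent of its
# relevant documents, then compare consecutive Hot days. Each day is
# a list of phrase-frequency dicts, highest-ranked first; names are
# illustrative, not from the paper.
import math

def cosine(a, b):
    dot = sum(v * b.get(p, 0) for p, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def day_vector(ranked_docs, k):
    # Keep the top k percent of ranked documents (at least one).
    top = ranked_docs[: max(1, len(ranked_docs) * k // 100)]
    merged = {}
    for vec in top:
        for phrase, freq in vec.items():
            merged[phrase] = merged.get(phrase, 0) + freq
    return merged

def chain_similarities(days, k):
    # days: one ranked-document list per Hot day, in chronological order.
    vectors = [day_vector(d, k) for d in days]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
```

For 31 Hot days, `chain_similarities` returns the 30 pairwise similarities along the chain.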
A similar process is followed to determine the
correlation between Hot days and Derived Hot days.
It is important to note that the existence of a
Derived Hot day corresponding to a given Hot day
depends on the statistical criteria. Once we have
identified a Hot day and Derived Hot day pair, we
again construct vectors from the top-ranking k
percent of the relevant documents for those days
and calculate the cosine similarity or the KL
divergence.
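The KL divergence variant might be sketched as follows; the add-one smoothing over the joint vocabulary is an assumption (the paper does not state how zero-frequency phrases are handled), made here to avoid division by zero:

```python
# Sketch: KL divergence between the phrase distributions of a Hot
# day and its Derived Hot day. Add-one smoothing over the joint
# vocabulary is an assumption to handle phrases missing from one
# side; a lower value indicates more similar distributions.
import math

def kl_divergence(p_counts, q_counts):
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + len(vocab)
    q_total = sum(q_counts.values()) + len(vocab)
    kl = 0.0
    for phrase in vocab:
        p = (p_counts.get(phrase, 0) + 1) / p_total
        q = (q_counts.get(phrase, 0) + 1) / q_total
        kl += p * math.log(p / q)
    return kl

print(round(kl_divergence({("exit", "poll"): 3}, {("vote", "count"): 3}), 3))
```

Unlike cosine similarity, KL divergence is asymmetric, so the order of the Hot day and the Derived Hot day matters.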
3 EXPERIMENTAL RESULTS
We performed our experiment on AG’s corpus of
News Articles, using ‘Presidential Elections’ as
our topic of focus. The presidential elections were
actually held during the period when the data was
collected and were thus chosen as the topic under
consideration. To identify chains of Hot correlated
days, we used cosine similarity and KL divergence
measures.
3.1 Cosine Similarity Results
In Figure 1, the square brackets represent the Hot
days, for which the value is 1. The other sym-
KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval