more reputable sources. Finally, proximity informa-
tion can be better exploited when a sentence is com-
pared against each document separately.
5.2 Selecting Multiple Sentences
While the ranking function gives us the most repre-
sentative sentence, it does not by itself tell us how
to select the best k sentences for a multi-sentence
summary.
There exist many solutions to this problem. The
most basic solution is to select the k top scoring sen-
tences. However, doing so may result in selection of
redundant sentences. To help mitigate this problem,
one may only select sentences that have less than a
certain degree of overlap with every sentence in the
summary set (Kumar et al., 2009). More sophisticated
approaches, such as the MMR algorithm, formulate
the sentence selection problem as a search problem
that seeks to maximize an objective function which
gives credit for the relevance score, and penalizes for
overlap (Carbonell and Goldstein, 1998).
Experiments showed few differences between these
methods; for our evaluation, we therefore use the
overlap-threshold approach of Kumar et al. (2009).
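The overlap-threshold selection described above can be sketched as a greedy procedure. This is an illustrative sketch only: the threshold value and the word-overlap measure are our assumptions, not details given by Kumar et al. (2009).

```python
def select_sentences(ranked, k, max_overlap=0.5):
    """Greedily pick the k top-scoring sentences, skipping any candidate
    whose word overlap with an already-selected sentence exceeds the
    threshold.  `ranked` is a list of (score, sentence) pairs.
    (Sketch: threshold and overlap measure are illustrative assumptions.)"""
    selected = []
    for score, sentence in sorted(ranked, reverse=True):
        words = set(sentence.lower().split())
        # fraction of the candidate's words shared with each selected sentence
        if all(len(words & set(s.lower().split())) / max(len(words), 1)
               <= max_overlap for s in selected):
            selected.append(sentence)
        if len(selected) == k:
            break
    return selected
```

With a threshold of 0.5, a near-duplicate of an already-selected sentence is skipped in favor of a lower-scoring but novel one.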
6 EXPERIMENTAL SETUP
The Summarization Task in the Text Analysis Confer-
ence (TAC) 2009 is an evaluation framework that pro-
vides a comparative analysis for computer-generated
summaries. Using this framework, accuracy scores
for short summaries can be compared among dif-
ferent algorithms. In the task, the challenge is to
generate summaries of up to 100 words from a col-
lection of 10 documents across 44 different topics
(Gillick et al., 2010). Then, the generated summary is
compared against model summaries using ROUGE-
2 measures, which give recall and precision scores
based on whether bigrams in the generated summary
are also present in the model summary (Lin, 2004).
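A simplified version of the ROUGE-2 computation can be sketched as follows. This sketch assumes a single reference summary, whitespace tokenization, and set-based bigram counts; the official ROUGE toolkit uses clipped counts over possibly multiple references.

```python
def rouge_2(candidate, reference):
    """Simplified ROUGE-2: recall and precision from bigram overlap
    between a candidate summary and one reference summary.
    (Set-based counts; real ROUGE uses clipped multiset counts.)"""
    def bigrams(text):
        toks = text.lower().split()
        return set(zip(toks, toks[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    common = len(cand & ref)
    recall = common / len(ref) if ref else 0.0
    precision = common / len(cand) if cand else 0.0
    return recall, precision
```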
6.1 Methods for Comparison
We compare our approach against methods that only
use either PLM or semantic smoothing to see whether
employing both yields better results. In addition to
LSA, we compare semantic smoothing using latent
Dirichlet allocation (LDA) (Blei et al., 2003) and ran-
dom indexing (RI) (Sahlgren and Karlgren, 2005).
We thus consider the following approaches:
• Language Modeling (LM) only.
• Proximity Language Model (PLM).
• LM + latent semantic analysis.
• LM + random indexing.
• LM + latent Dirichlet allocation.
• PLM + latent semantic analysis.
• PLM + random indexing.
• PLM + latent Dirichlet allocation.
Sections 3-4 give the formulas used for the prox-
imity language model and for PLM + latent semantic
analysis. The formulas used for the other methods
can be adapted from these, as we show below.
6.1.1 Formula for the Language Modeling
Approach
The language modeling approach is the most basic
of our comparison methods. It uses the estimator in
Section 3.1, and the KL divergence in Equation 10
can be applied in a straightforward manner to derive
scores for each sentence.
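As a rough illustration of this scoring step, the sketch below ranks a sentence by the cross-entropy term of the KL divergence between a document model and a smoothed sentence model (for a fixed document model, minimizing KL is equivalent to maximizing this term). The Jelinek-Mercer smoothing and the lambda value are assumptions for illustration; Equation 10 in Section 3 gives the exact form used in the paper.

```python
import math
from collections import Counter

def score_sentence(sentence, doc_model, collection_model, lam=0.5):
    """Score a sentence by sum_w p(w|D) * log p(w|S), i.e. -KL(D||S)
    up to a constant in the sentence.  The sentence model p(w|S) is
    Jelinek-Mercer smoothed with the collection model.
    (Smoothing scheme and lambda are illustrative assumptions.)"""
    counts = Counter(sentence.lower().split())
    total = sum(counts.values())
    score = 0.0
    for w, p_doc in doc_model.items():
        p_sent = ((1 - lam) * counts[w] / total
                  + lam * collection_model.get(w, 1e-9))
        score += p_doc * math.log(p_sent)
    return score
```

Higher scores indicate sentences whose smoothed language model is closer to the document model.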
6.1.2 Formula for the Random Indexing
Approach
Random indexing represents another form of seman-
tic smoothing. Thus, employing random indexing in-
stead of LSA will affect equations in Section 4.1.
Running random indexing on the term-document
matrix produces a term-context matrix. We can then
perform LSA on this matrix, which yields the matri-
ces U, Σ, and V used in Equations 7-8 in Section 4.1.
Using these values results in a modified estimator in
Equation 9, and sentence ranking can then proceed as
in Section 4.2. This method follows the random in-
dexing + LSA approach of Sellberg and Jonsson
(2008).
This approach has a performance advantage over
plain LSA because the term-context matrix is of re-
duced dimension: computing the term-context ma-
trix takes an order of magnitude less time than LSA,
and thus this approach improves overall running time
compared to regular LSA.
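The construction of the term-context matrix by random indexing can be sketched as follows. The output dimensionality, the number of non-zero entries, and the use of ternary index vectors are assumed parameters for illustration, not values taken from the paper.

```python
import numpy as np

def random_index(term_doc, dim=100, nonzeros=4, seed=0):
    """Random indexing: assign each document a sparse ternary random
    index vector, then accumulate, for each term, the index vectors of
    the documents it occurs in (weighted by occurrence count).
    Returns a |V| x dim term-context matrix.
    (dim, nonzeros, and ternary vectors are illustrative assumptions.)"""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = term_doc.shape
    index_vecs = np.zeros((n_docs, dim))
    for d in range(n_docs):
        pos = rng.choice(dim, size=nonzeros, replace=False)
        index_vecs[d, pos] = rng.choice([-1, 1], size=nonzeros)
    # each term's context vector is the weighted sum of its documents' vectors
    return term_doc @ index_vecs
```

Because dim is fixed and small relative to the number of documents, the resulting matrix is much cheaper to decompose than the full term-document matrix.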
6.1.3 Formula for the Latent Dirichlet
Allocation Approach
Latent Dirichlet allocation is yet another form of se-
mantic smoothing. LDA can be used to analyze the
term-document matrix and provide latent topic proba-
bilities for each term. The topic probabilities for each
term can be substituted for p(w_j | d) in Equation 6.
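Given fitted LDA factors, this substitution can be sketched as a single matrix product. The matrix names and shapes below are our assumptions about a typical LDA implementation, not notation from the paper.

```python
import numpy as np

def lda_term_probs(topic_word, doc_topic):
    """Compute p(w|d) = sum_z p(z|d) * p(w|z) from fitted LDA factors.
    topic_word: K x |V| matrix, each row p(w|z) summing to 1.
    doc_topic:  D x K matrix, each row p(z|d) summing to 1.
    The resulting D x |V| probabilities replace p(w_j|d) in Equation 6.
    (Matrix names and shapes are illustrative assumptions.)"""
    return doc_topic @ topic_word
```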
A weakness of using LDA in our semantic
smoothing framework is that LDA does not provide
KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval