Polytope Model for Extractive Summarization
Marina Litvak and Natalia Vanetik
Department of Software Engineering, Sami Shamoon College of Engineering, Beer Sheva, Israel
Keywords:
Text Summarization, Quadratic Programming, Polytope Model.
Abstract:
The problem of text summarization for a collection of documents is defined as the problem of selecting a
small subset of sentences so that the contents and meaning of the original document set are preserved in the
best possible way. In this paper we present a linear model for the problem of text summarization, where we
strive to obtain a summary that preserves the information coverage as much as possible in comparison to the
original document set. We construct a system of linear inequalities that describes the given document set
and its possible summaries and translate the problem of finding the best summary to the problem of finding
the point on a convex polytope closest to the given hyperplane. This re-formulated problem can be solved
efficiently with the help of quadratic programming.
1 INTRODUCTION
Automated text summarization is an active field of re-
search in various communities like Information Re-
trieval (IR), Natural Language Processing (NLP), and
Text Mining (TM). Summarization is important for
IR since it helps to access large repositories of tex-
tual data efficiently by identifying the essence of
a document and indexing a repository. Taxonomically, we distinguish between single-document summarization, where a summary is generated per document, and multi-document summarization, where a summary is generated per cluster of related documents. We also distinguish between an automatically generated extract, i.e., the most salient fragments of the input document(s) (e.g., sentences or paragraphs), and an abstract, i.e., a re-formulated synopsis expressing the main idea of the input document(s). Since generating abstracts requires a deep linguistic analysis of the input documents, most existing summarizers work in an extractive manner (Mani and Maybury, 1999).
Moreover, extractive summarization can be applied
to cross-lingual/multilingual domains (Litvak et al.,
2010).
In this paper we deal with the problem of extrac-
tive summarization. Our method can be generalized
for both single-document and multi-document sum-
marization. Since the method includes only very basic
linguistic analysis (see section 5.4), it can be applied
to cross-lingual/multilingual summarization.
Formally speaking, in this paper we introduce:
- A novel text representation model expanding the classic Vector Space Model (Salton et al., 1975) to hyperplanes and half-spaces;
- A distance measure between the text and the information coverage we wish to preserve;
- A re-formulation of the extractive summarization problem as a distance-minimization task and its solution using quadratic programming.
The main contribution of this paper is a new text representation model that makes it possible to represent an exponential number of extracts without computing them explicitly, and to find the optimal one simply by minimizing a distance function in polynomial time.
This paper is organized as follows: Section 2 surveys related work, Section 3 describes the problem setting and definitions, Section 4 introduces the new text representation model and a possible distance measure between text and information coverage, and Section 5 formulates the summarization task as distance optimization in the new representation model. We discuss both unsupervised and supervised approaches. The last section contains conclusions and future work.
2 RELATED WORK
Numerous techniques for automated summarization
have been introduced in the last decades, trying to
reduce the constant information overload of profes-
sionals in a variety of fields. Many works formulated
the summarization as an optimization problem, solving it with techniques such as a standard hill-climbing algorithm (Hassel and Sjobergh, 2006), regression models (Ouyang et al., 2011), and evolutionary algorithms (Alfonseca and Rodriguez, 2003; Liu et al., 2006).
Some authors reduce summarization to the maximum coverage problem (Takamura and Okumura, 2009). The maximum coverage model extracts sentences into a summary so as to cover as much information as possible, where information is measured in text units such as terms or n-grams. Despite its good performance in the summarization field (Takamura and Okumura, 2009; Gillick and Favre, 2009), the maximum coverage problem is known to be NP-hard (Khuller et al., 1999). Some works attempt to find a near-optimal solution with a greedy approach (Filatova, 2004; Takamura and Okumura, 2009). Linear programming helps to find a more accurate approximate solution to the maximum coverage problem and has become very popular in the summarization field in recent years (Gillick and Favre, 2009; Woodsend and Lapata, 2010; Hitoshi Nishikawa and Kikui, 2010; Makino et al., 2011).
Trying to balance the trade-off between summary quality and time complexity, we propose a novel summarization model that solves an approximated maximum coverage problem by quadratic programming in polynomial time. We measure information coverage by
terms (normalized meaningful words) and strive to
obtain a summary that preserves the term frequency
as much as possible in comparison to the original doc-
ument.
3 DEFINITIONS
3.1 Problem Setting
We are given a set of sentences¹ S_1, ..., S_m derived from a document or a cluster of related documents on some subject. Meaningful words in these sentences are entirely described by the terms T_1, ..., T_n. Our goal is to find a subset S_{i_1}, ..., S_{i_k} of sentences such that (1) there are at most N terms in these sentences; (2) term frequency is preserved as much as possible w.r.t. the original sentence set; and (3) redundant information among the k selected sentences is minimized.

¹ Since extractive summarization usually deals with sentence extraction, this paper also focuses on sentences. In general, our method can be used to extract other text units such as phrases or paragraphs.
3.2 The Matrix Model
We describe the sets of sentences and terms by a real matrix A = (a_{i,j}) of size n × m, where a_{i,j} = k if term T_i appears in sentence S_j precisely k times. The columns of A then describe sentences and the rows describe terms. Since we are not interested in redundant sentences, in the case of multi-document summarization we can initially select meaningful sentences by clustering all the columns as vectors in R^n and choosing a single representative from each cluster. The columns then describe representatives of the sentence clusters.
Here and further, we refer to A as the sentence-term matrix corresponding to the given document(s).
Example 1. Given the following text of m = 3 sentences and n = 5 (normalized) terms:

S_1 = A fat cat is a cat that eats fat meat.
S_2 = My cat eats fish but he is a fat cat.
S_3 = All fat cats eat fish and meat.

The matrix corresponding to the text above has the following shape:

                 S_1          S_2          S_3
T_1 = "fat"    a_{1,1} = 2  a_{1,2} = 1  a_{1,3} = 1
T_2 = "cat"    a_{2,1} = 2  a_{2,2} = 2  a_{2,3} = 1
T_3 = "eat"    a_{3,1} = 1  a_{3,2} = 1  a_{3,3} = 1
T_4 = "fish"   a_{4,1} = 0  a_{4,2} = 1  a_{4,3} = 1
T_5 = "meat"   a_{5,1} = 1  a_{5,2} = 0  a_{5,3} = 1

where the a_{i,j} are term counts.
Let s be the total number of terms in all the sentences. We can derive s from the sentence-term matrix A. Formally, we compute

s = \sum_{i=1}^{n} \sum_{j=1}^{m} a_{i,j}

Example 2. For the matrix of Example 1 we have

s = \sum_{i=1}^{5} \sum_{j=1}^{3} a_{i,j} = 16.
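To make the construction concrete, here is a minimal sketch (ours, not part of the paper) that builds the sentence-term matrix A of Example 1 and computes s; the use of NumPy is our own assumption, not a tool prescribed by the paper.

```python
# A minimal sketch (ours, not from the paper): building the sentence-term
# matrix A of Example 1 with NumPy and computing the total term count s.
import numpy as np

# Rows are the terms T_1..T_5 ("fat", "cat", "eat", "fish", "meat"),
# columns are the sentences S_1..S_3; entries are the term counts a_{i,j}.
A = np.array([
    [2, 1, 1],   # "fat"
    [2, 2, 1],   # "cat"
    [1, 1, 1],   # "eat"
    [0, 1, 1],   # "fish"
    [1, 0, 1],   # "meat"
])

s = A.sum()      # total number of term occurrences
print(s)         # -> 16, as in Example 2
```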
3.3 Term Frequencies
We can use the sentence-term matrix to compute the term frequency of each term. Indeed, for the n terms in the document, their term counts form a real vector C of size n, where C[i] is the total number of occurrences of term T_i. The term frequency is then the real vector

F = \frac{1}{s} C

obtained from C by dividing each of its elements by the total term count.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
282
Computing the vector F requires the application of a simple linear transformation to A. We have

F^T = \frac{1}{s} A \times J_{m \times 1}

where J_{m \times 1} is the all-ones vector. The vector F consists of the term frequencies [tf(T_1), \ldots, tf(T_n)] and is easily computed using matrix-vector multiplication.
Example 3. For the matrix of Example 1 we have

C = A \times J_{3 \times 1} = \begin{pmatrix} 2 & 1 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 5 \\ 3 \\ 2 \\ 2 \end{pmatrix}

and s = 16. Then

F^T = \frac{1}{16} C^T = \frac{1}{16} [4\ 5\ 3\ 2\ 2].
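Continuing the earlier sketch (again an illustration of ours, reusing the matrix A from the previous snippet), C and F are obtained with a single matrix-vector product:

```python
# Sketch continuing the previous one: the term-count vector C and the
# term-frequency vector F via a matrix-vector product, as in Example 3.
C = A @ np.ones(A.shape[1])   # C = A x J_{3x1} = [4, 5, 3, 2, 2]
F = C / A.sum()               # F = C / s = [4, 5, 3, 2, 2] / 16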
3.4 The Goal
In this setting, our goal can be reformulated as the problem of finding a subset i_1, ..., i_k of the columns of A such that, for the resulting submatrix A', the distance from F to the vector

(F')^T = \frac{1}{s'} A' \times J_{k \times 1}

is as small as possible. Here the number s' denotes the total count of terms in the selected sentences. The distance function can vary: for example, Manhattan distance, Euclidean distance, cosine similarity, mutual information, etc.
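As a small illustration of this goal (ours, reusing A and F from the sketches above), the following snippet evaluates the Euclidean distance for one hard-coded candidate subset; the polytope model introduced next avoids enumerating subsets explicitly.

```python
# Sketch (ours): the distance for one hard-coded candidate subset {S_1, S_2}.
# The polytope model below avoids enumerating subsets like this explicitly.
cols = [0, 1]                                 # indices of the selected sentences
A_sub = A[:, cols]                            # submatrix A'
s_sub = A_sub.sum()                           # s': term count in the selection
F_sub = (A_sub @ np.ones(len(cols))) / s_sub  # (F')^T
dist = np.linalg.norm(F - F_sub)              # Euclidean distance between F and F'
```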
4 FROM MATRIX MODEL TO
POLYTOPE
4.1 Hyperplanes and Half-spaces
Extractive summarization aims at extracting a sub-
set of sentences that covers as much non-redundant
information as possible w.r.t. the source docu-
ment/documents. Here we introduce a new, efficient text representation model with the purpose of representing all possible extracts without computing them explicitly. Since the number of potential extracts is exponential in the number of sentences, this saves a great portion of computation time. Finding an optimal extract of text units is a general problem in various Information Retrieval tasks such as Question Answering and Literature Search, and our model can be efficiently applied to all of them.
In our representation model, each sentence is represented by a hyperplane, and all sentences derived from a document form a system of intersecting hyperplanes whose boundary defines a polytope. All possible extracts can then be represented by intersections of these hyperplanes, and as such they are not located far from the boundary of the polytope. Therefore, intuitively, the boundary of the resulting polytope is a good approximation for the extracts that can be generated from the given document.
4.2 The Approach
We view every column of the sentence-term matrix as a linear constraint representing a hyperplane in R^{mn}. A term T_i in sentence S_j is represented by the variable x_{i,j}.
Example 4. This example shows the variables corresponding to the 5 × 3 sentence-term matrix A from Example 1.
        S_1       S_2       S_3
T_1   x_{1,1}   x_{1,2}   x_{1,3}
T_2   x_{2,1}   x_{2,2}   x_{2,3}
T_3   x_{3,1}   x_{3,2}   x_{3,3}
T_4   x_{4,1}   x_{4,2}   x_{4,3}
T_5   x_{5,1}   x_{5,2}   x_{5,3}
Together, all the columns define a system of linear inequalities; we also express constraints on the number of terms in the extract we seek. Every sentence in our document is then a hyperplane in R^{mn}, expressed with the help of the elements of column A[][i] of A and the variables x_{j,i} representing appearances of terms in sentences.
We define the linear inequality

A[][i] \cdot [x_{1,i}, \ldots, x_{n,i}]^T = \sum_{j=1}^{n} a_{j,i} x_{j,i} \le A[][i] \cdot \mathbf{1}^T    (1)
Every inequality of this form defines a hyperplane H_i and its lower half-space, specified by equation (1):

H_i := \sum_{j=1}^{n} a_{j,i} x_{j,i} = \sum_{j=1}^{n} a_{j,i}

with normal vector \tilde{n}_i whose coordinate for the variable x_{j,l} is

\tilde{n}_i[j, l] = \begin{cases} a_{j,i}, & l = i \\ 0, & \text{otherwise} \end{cases}    (2)

i.e., \tilde{n}_i is supported only on the coordinates of the variables x_{j,i} of sentence S_i.
To express the fact that every term is either present in or absent from the chosen extract, we add the constraints

0 \le x_{i,j} \le 1    (3)
Intuitively, a point p on H_i represents a sentence with the same term counts as S_i. To study subsets of sentences, we observe intersections of the hyperplanes H_i.
PolytopeModelforExtractiveSummarization
283
Figure 1: Two-dimensional projection of hyperplane intersection (hyperplanes H_1, H_2, H_3 and the intersection of H_1 and H_2).
In this case, we say that the intersection of two hyperplanes H_i and H_j represents the set of two sentences S_i and S_j. A subset of sentences of size r is then represented by the intersection of r hyperplanes.
Example 5. The sentence-term matrix of Example 1 defines the following hyperplane equations:

H_1 := 2x_{1,1} + 2x_{2,1} + x_{3,1} + x_{5,1} = 2 + 2 + 1 + 1 = 6
H_2 := x_{1,2} + 2x_{2,2} + x_{3,2} + x_{4,2} = 5
H_3 := x_{1,3} + x_{2,3} + x_{3,3} + x_{4,3} + x_{5,3} = 5

Here, a summary consisting of the first and the second sentences is expressed by the intersection of the hyperplanes H_1 and H_2. Figure 1 shows how a two-dimensional projection of the hyperplanes H_1, H_2, H_3 and their intersections looks.
5 SUMMARIZATION AS A
DISTANCE FUNCTION
We express summarization constraints in the form of linear inequalities in R^{mn}, using the columns of the sentence-term matrix A as linear constraints. The maximality constraint on the number of terms in the summary can easily be expressed as a constraint on the sum of the term variables x_{i,j}. Since we are looking for summaries that consist of at most N terms, we introduce the following linear constraint:

\sum_{i=1}^{n} \sum_{j=1}^{m} x_{i,j} \le N    (4)
Indeed, every variable x_{i,j} stands for a separate term in a specific sentence, and we intend for their sum to express the number of terms in the selected sentences.
Example 6. Equation (4) for the sentence-term matrix of Example 1 with N = 10 has the form

\sum_{i=1}^{5} \sum_{j=1}^{3} x_{i,j} \le 10
Figure 2: Intersection of hyperplanes (H_i, H_j, and H_i ∩ H_j).
Having defined the linear inequalities that describe each sentence in a document separately and the total number of terms in a sentence subset, we can now look at them together as a system:

\begin{cases}
\sum_{j=1}^{n} a_{j,1} x_{j,1} \le \sum_{j=1}^{n} a_{j,1} \\
\quad \vdots \\
\sum_{j=1}^{n} a_{j,m} x_{j,m} \le \sum_{j=1}^{n} a_{j,m} \\
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{i,j} \le N \\
0 \le x_{i,j} \le 1
\end{cases}    (5)
The first m inequalities describe the sentences S_1, ..., S_m, and the next inequality constrains the total number of terms in a summary.
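The following sketch (an illustration of ours, not code from the paper) assembles system (5) for Example 1 in a generic "G x <= h" form; the flattening of the variables x_{i,j} into a single vector and the value N = 10 (taken from Example 6) are our own modelling choices.

```python
# Sketch (ours): assembling system (5) for Example 1 as "G x <= h".
# Flattening x_{i,j} (term i, sentence j) to index i*m + j is our convention.
n, m = A.shape                      # n terms, m sentences
N = 10                              # length limit on summary terms (Example 6)

# One row per sentence j: sum_i a_{i,j} x_{i,j} <= sum_i a_{i,j}
G_sent = np.zeros((m, n * m))
h_sent = np.zeros(m)
for j in range(m):
    for i in range(n):
        G_sent[j, i * m + j] = A[i, j]
    h_sent[j] = A[:, j].sum()

# One row for the length constraint: sum_{i,j} x_{i,j} <= N
G_len = np.ones((1, n * m))
h_len = np.array([float(N)])

G = np.vstack([G_sent, G_len])
h = np.concatenate([h_sent, h_len])
bounds = [(0.0, 1.0)] * (n * m)     # 0 <= x_{i,j} <= 1
```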
Since every inequality in system (5) is linear, the entire system describes a convex polyhedron in R^{mn}, which we denote by P. The faces of P are determined by intersections of the hyperplanes H_i, the hyperplane \sum_{i,j} x_{i,j} = N, and the hyperplanes x_{i,j} = 0 and x_{i,j} = 1. Intersections of the H_i's represent subsets of sentences (see Figure 2 for an illustration), as the following property shows.
Property 1. The equation of the intersection H_{1,...,k} = H_1 \cap \cdots \cap H_k (which is a hyperplane by itself) satisfies

\sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j} x_{i,j} = \sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j}.

Proof. This property is trivial, since the intersection H_{1,...,k} has to satisfy all the equations of H_1, ..., H_k. Therefore, summing up the equalities (1) for H_1, ..., H_k we obtain

\sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j} x_{i,j} = \sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j}.

Note that the choice of indexes 1, ..., k was arbitrary, so the property holds for any subset of indexes.

Therefore, the hypersurfaces representing the sentence sets we seek are in fact hyperplane intersections that form the boundary of the polytope P.
5.1 Finding the Closest Point
We assume here that the surface of the polyhedron P
is a suitable representation of all the possible sentence
subsets (its size, of course, is not polynomial in m and
n since the number of vertices of P can be very large).
Fortunately, we do not need to scan the whole set of
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
284
P's surfaces but rather to find the point on P which is closest to the term frequency we wish to preserve.
We use the fact that the term frequency vector F is in fact a point in R^n. The polytope P lies in R^{mn}, so we need to bridge the gap between the dimensions of the two spaces.
Let H be a hyperplane representing a summary. Then, by Property 1, H is w.l.o.g. an intersection of several hyperplanes H_1, ..., H_k of the form (1), and its normal vector \tilde{n} satisfies the following condition:

\tilde{n}[i, j] = \begin{cases} a_{i,j}, & j = 1, \ldots, k \\ 0, & \text{otherwise.} \end{cases}

The term count of term T_i in this summary is then precisely

\sum_{j=1}^{m} \tilde{n}[i, j]    (6)
To find the distance from F to possible summaries as a Euclidean distance in R^n, we:
- look at the linear transformation of P that maps a hyperplane with normal vector \tilde{n} = (\tilde{n}_{i,j}) into a hyperplane with normal vector

  \tilde{m} = (\sum_{j=1}^{m} \tilde{n}[1, j], \ldots, \sum_{j=1}^{m} \tilde{n}[n, j])    (7)

- observe F = (f_1, \ldots, f_n) as a point in R^n;
- search for points p = (p_1 := \sum_{j=1}^{m} x_{1,j}, \ldots, p_n := \sum_{j=1}^{m} x_{n,j}) on the transformed polytope whose distance to F is minimal;
- note that such a point p is the image of a point x = (x_{i,j}) \in R^{mn} which lies on the boundary of P; it holds that x_{1,k} = \cdots = x_{n,k} = 1 if the point belongs to the hyperplane H_k.
The distance between the two is computed as

d(p, F) = \sqrt{\sum_{i=1}^{n} (f_i - p_i)^2}    (8)

Since the vector F is constant for a given document or document collection, the values f_i are constants, and therefore d(p, F) is a quadratic function.
The problem of finding the required summary can now be reformulated as finding the point on P closest to the point F (see Figure 3 for an illustration). Since our polytope is defined by a system of linear inequalities and the distance from F to P is expressed as a quadratic function, the minimum of this function is achieved on the boundary of P.
Formally speaking, we are looking for the minimum of the following function under the constraints defined in (5):

d(P, F) = \sqrt{\sum_{i=1}^{n} \left( \left( \sum_{j=1}^{m} x_{i,j} \right) - f_i \right)^2}    (9)
Figure 3: Distance from F to P.
Minimizing function (9) is a quadratic programming problem. Therefore, the original summarization problem can be posed as a quadratic programming problem of minimizing the function d(P, F) under the constraints of (5). Quadratic programming problems of this type can be solved efficiently both in theory and in practice (see (Karmarkar, 1984; Khachiyan, 1996; Berkelaar, 1999)).
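As a rough illustration, the sketch below (ours) minimizes the squared form of (9) under the constraints assembled in the earlier snippet, using SciPy's general-purpose SLSQP solver; the paper itself relies on dedicated LP/QP packages such as lp-solve, so SciPy and the starting point are stand-in assumptions.

```python
# Sketch (ours): minimizing the squared form of objective (9) under the
# constraints G x <= h and 0 <= x <= 1 built earlier, via SciPy's SLSQP.
# The squared objective has the same minimizer as d(P, F) in (9).
from scipy.optimize import minimize

def objective(x):
    X = x.reshape(n, m)          # x_{i,j}: term i in sentence j
    p = X.sum(axis=1)            # p_i = sum_j x_{i,j}
    return np.sum((p - F) ** 2)  # squared form of equation (9)

ineq = {"type": "ineq", "fun": lambda x: h - G @ x}  # encodes G x <= h
x0 = np.full(n * m, 0.5)                             # feasible interior start
res = minimize(objective, x0, method="SLSQP",
               bounds=bounds, constraints=[ineq])
x_opt = res.x.reshape(n, m)                          # solution as a matrix
```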
5.2 Extracting the Summary
Since the solver not only finds the minimal distance but also presents evidence of that minimality in the form of a point x = (x_{i,j}), we use the point's data to determine which sentences belong to the chosen summary. Viewing x as a matrix, we check whether or not each column x[][i] equals 1. If this equality holds, x lies on an intersection of hyperplanes that includes H_i, and therefore sentence S_i is contained in the summary. Otherwise, S_i does not belong to the chosen summary. This test is straightforward and takes O(mn) time.
Applying our method with lp-solve (Berkelaar, 1999) to Example 1 under a length constraint of 11 terms resulted in a summary consisting of the first and second sentences.
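A small sketch of this column test (ours, reusing x_opt, n, and m from the earlier snippets); the numerical tolerance is our assumption, since a solver typically returns values only approximately equal to 1.

```python
# Sketch (ours) of the column test described above: a sentence S_j is selected
# when its whole column of x is (approximately) all ones.
selected = [j for j in range(m) if np.allclose(x_opt[:, j], 1.0, atol=1e-6)]
print("summary sentences:", [f"S{j + 1}" for j in selected])
```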
5.3 Supervised Approach
As described above, finding the closest point is relative to the term frequency we wish to preserve. Trying to preserve the original term frequency of the source document, we get an unsupervised method. Given gold-standard summaries, we can instead train our model to find the closest point to the term frequency of the gold standard and then apply the trained model to new documents, as in a supervised method.
5.4 Text Preprocessing
In order to build the matrix, and then the polytope model, one needs to perform basic text preprocessing, including sentence splitting and tokenization. Additional steps such as stop-word removal, stemming, and synonym resolution may be performed for resource-rich languages. Since the main
PolytopeModelforExtractiveSummarization
285
purpose of these steps is to reduce the matrix dimensionality, the resulting model will be more efficient.
6 CONCLUSIONS AND FUTURE
WORK
In this paper we present a linear programming model
for the problem of extractive summarization. We rep-
resent the document as a sentence-term matrix whose
entries contain term count values and view this matrix
as a set of intersecting hyperplanes. Every possible
summary of a document is represented as an intersection of two or more hyperplanes, and one additional constraint is used to limit the number of terms used in a summary. We consider a summary to be the best if term frequency is preserved during summarization; in this case, the summarization problem translates into the problem of finding the point on a convex polytope (defined by linear inequalities) which is closest to the hyperplane describing the overall term frequencies in the document.
Linear programming problems can be solved in polynomial time (see (Karmarkar, 1984; Khachiyan, 1996)), and numerous packages and applications are available, e.g., (Berkelaar, 1999) and (Makhorin, 2000). In future research, we plan to implement and test our approach in both unsupervised and supervised settings. We would also like to extend our model to query-based summarization by adapting the distance function, and to apply our text representation model to such text mining tasks as text clustering and text categorization.
ACKNOWLEDGEMENTS
The authors thank Ruvim Lipyansky for ideas that led
to development of their approach.
REFERENCES
Alfonseca, E. and Rodriguez, P. (2003). Generating ex-
tracts with genetic algorithms. In Proceedings of the
2003 European Conference on Information Retrieval
(ECIR’2003), pages 511–519.
Berkelaar, M. (1999). lp-solve free software.
http://lpsolve.sourceforge.net/5.5/.
Filatova, E. (2004). Event-based extractive summarization.
In In Proceedings of ACL Workshop on Summariza-
tion, pages 104–111.
Gillick, D. and Favre, B. (2009). A Scalable Global Model
for Summarization. In Proceedings of the NAACL
HLT Workshop on Integer Linear Programming for
Natural Language Processing, pages 10–18.
Hassel, M. and Sjobergh, J. (2006). Towards holistic sum-
marization: Selecting summaries, not sentences. In
Proceedings of LREC - International Conference on
Language Resources and Evaluation.
Nishikawa, H., Hasegawa, T., Matsuo, Y., and Kikui, G. (2010). Opinion Summarization with Integer Linear Programming Formulation for Sentence Extraction and Ordering. In Coling 2010: Poster Volume, pages 910–918.
Karmarkar, N. (1984). New polynomial-time algorithm for
linear programming. Combinatorica, 4:373–395.
Khachiyan, L. G. (1996). Rounding of polytopes in the real
number model of computation. Mathematics of Oper-
ations Research, 21:307–320.
Khuller, S., Moss, A., and Naor, J. S. (1999). The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45.
Litvak, M., Last, M., and Friedman, M. (2010). A new ap-
proach to improving multilingual summarization us-
ing a Genetic Algorithm. In ACL ’10: Proceedings of
the 48th Annual Meeting of the Association for Com-
putational Linguistics, pages 927–936.
Liu, D., Wang, Y., Liu, C., and Wang, Z. (2006). Multiple
Documents Summarization Based on Genetic Algo-
rithm. In Fuzzy Systems and Knowledge Discovery,
volume 4223 of Lecture Notes in Computer Science,
pages 355–364.
Makhorin, A. O. (2000). GNU Linear Programming Kit.
http://www.gnu.org/software/glpk/.
Makino, T., Takamura, H., and Okumura, M. (2011). Bal-
anced coverage of aspects for text summarization. In
TAC ’11: Proceedings of Text Analysis Conference.
Mani, I. and Maybury, M. (1999). Advances in Automatic
Text Summarization. MIT Press, Cambridge, MA.
Ouyang, Y., Li, W., Li, S., and Lu, Q. (2011). Applying
regression models to query-focused multi-document
summarization. Information Processing and Manage-
ment, 47:227–237.
Salton, G., Yang, C., and Wong, A. (1975). A vector-space
model for information retrieval. Communications of
the ACM, 18.
Takamura, H. and Okumura, M. (2009). Text summariza-
tion model based on maximum coverage problem and
its variant. In EACL ’09: Proceedings of the 12th Con-
ference of the European Chapter of the Association for
Computational Linguistics, pages 781–789.
Woodsend, K. and Lapata, M. (2010). Automatic Genera-
tion of Story Highlights. In ACL ’10: Proceedings of
the 48th Annual Meeting of the Association for Com-
putational Linguistics, pages 565–574.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
286