Polytope Model for Extractive Summarization
Marina Litvak and Natalia Vanetik
Department of Software Engineering, Sami Shamoon College of Engineering, Beer Sheva, Israel
Keywords:
Text Summarization, Quadratic Programming, Polytope Model.
Abstract:
The problem of text summarization for a collection of documents is defined as the problem of selecting a
small subset of sentences so that the contents and meaning of the original document set are preserved in the
best possible way. In this paper we present a linear model for the problem of text summarization, where we
strive to obtain a summary that preserves the information coverage as much as possible in comparison to the
original document set. We construct a system of linear inequalities that describes the given document set
and its possible summaries and translate the problem of finding the best summary to the problem of finding
the point on a convex polytope closest to the given hyperplane. This re-formulated problem can be solved
efficiently with the help of quadratic programming.
1 INTRODUCTION
Automated text summarization is an active field of re-
search in various communities like Information Re-
trieval (IR), Natural Language Processing (NLP), and
Text Mining (TM). Summarization is important for
IR since it helps to access large repositories of tex-
tual data efficiently by identifying the essence of
a document and indexing a repository. Taxonomically, we distinguish between single-document summarization, where a summary is generated per document, and multi-document summarization, where a summary is generated per cluster of related documents. We also distinguish between an automatically generated extract, i.e., the most salient fragments of the input document(s) (e.g., sentences or paragraphs), and an abstract, i.e., a re-formulated synopsis expressing the main idea of the input document(s). Since generating abstracts requires a deep linguistic analysis of the input documents, most existing summarizers work in an extractive manner (Mani and Maybury, 1999).
Moreover, extractive summarization can be applied
to cross-lingual/multilingual domains (Litvak et al.,
2010).
In this paper we deal with the problem of extrac-
tive summarization. Our method can be generalized
for both single-document and multi-document sum-
marization. Since the method includes only very basic
linguistic analysis (see section 5.4), it can be applied
to cross-lingual/multilingual summarization.
Formally speaking, in this paper we introduce:
- A novel text representation model expanding the classic Vector Space Model (Salton et al., 1975) to hyperplanes and half-spaces;
- A distance measure between the text and the information coverage we wish to preserve;
- A re-formulation of the extractive summarization problem as a distance-minimization task and its solution using quadratic programming.
The main contribution of this paper is a new text representation model that makes it possible to represent an exponential number of extracts without computing them explicitly, and to find the optimal one simply by minimizing a distance function in polynomial time.
This paper is organized as follows: Section 2 surveys related work, Section 3 describes the problem setting and definitions, Section 4 introduces the new text representation model and a possible distance measure between text and information coverage, and Section 5 formulates the summarization task as distance optimization in the new representation model. We discuss both unsupervised and supervised approaches. The last section contains conclusions and future work.
2 RELATED WORK
Numerous techniques for automated summarization
have been introduced in the last decades, trying to
reduce the constant information overload of profes-
sionals in a variety of fields. Many works formulated
the summarization as an optimization problem, solving it with techniques such as a standard hill-climbing algorithm (Hassel and Sjobergh, 2006), regression models (Ouyang et al., 2011), and evolutionary algorithms (Alfonseca and Rodriguez, 2003; Liu et al., 2006).
Some authors reduce summarization to the maximum coverage problem (Takamura and Okumura, 2009). The maximum coverage model extracts sentences into a summary so as to cover as much information as possible, where information is measured in text units such as terms or n-grams. Despite its good performance in the summarization field (Takamura and Okumura, 2009; Gillick and Favre, 2009), the maximum coverage problem is known to be NP-hard (Khuller et al., 1999). Some works attempt to find a near-optimal solution with a greedy approach (Filatova, 2004; Takamura and Okumura, 2009). Linear programming helps to find a more accurate approximate solution to the maximum coverage problem and has become very popular in the summarization field in recent years (Gillick and Favre, 2009; Woodsend and Lapata, 2010; Hitoshi Nishikawa and Kikui, 2010; Makino et al., 2011).
Trying to balance the trade-off between summary quality and time complexity, we propose a novel summarization model that solves an approximated maximum coverage problem by quadratic programming in polynomial time. We measure information coverage by
terms (normalized meaningful words) and strive to
obtain a summary that preserves the term frequency
as much as possible in comparison to the original doc-
ument.
3 DEFINITIONS
3.1 Problem Setting
We are given a set of sentences¹ S_1, ..., S_m derived from a document or a cluster of related documents on some subject. Meaningful words in these sentences are entirely described by the terms T_1, ..., T_n. Our goal is to find a subset S_{i_1}, ..., S_{i_k} of sentences such that (1) there are at most N terms in these sentences; (2) term frequency is preserved as much as possible w.r.t. the original sentence set; and (3) redundant information among the k selected sentences is minimized.

¹ Since extractive summarization usually deals with sentence extraction, this paper also focuses on sentences. In general, our method can be used to extract other text units such as phrases or paragraphs.
3.2 The Matrix Model
We describe the sets of sentences and terms by a real matrix A = (a_{i,j}) of size n × m, where a_{i,j} = k if term T_i appears in sentence S_j precisely k times. The columns of A then describe sentences and the rows describe terms. Since we are not interested in redundant sentences, in the case of multi-document summarization we can initially select meaningful sentences by clustering all the columns as vectors in R^n and choosing a single representative from each cluster. The columns then describe representatives of the sentence clusters.
Here and further, we refer to A as the sentence-term matrix corresponding to the given document(s).
Example 1. Given the following text of m = 3 sentences and n = 5 (normalized) terms:

S_1 = A fat cat is a cat that eats fat meat.
S_2 = My cat eats fish but he is a fat cat.
S_3 = All fat cats eat fish and meat.

The matrix corresponding to the text above has the following shape:

                 S_1          S_2          S_3
T_1 = "fat"    a_{1,1} = 2  a_{1,2} = 1  a_{1,3} = 1
T_2 = "cat"    a_{2,1} = 2  a_{2,2} = 2  a_{2,3} = 1
T_3 = "eat"    a_{3,1} = 1  a_{3,2} = 1  a_{3,3} = 1
T_4 = "fish"   a_{4,1} = 0  a_{4,2} = 1  a_{4,3} = 1
T_5 = "meat"   a_{5,1} = 1  a_{5,2} = 0  a_{5,3} = 1

where the a_{i,j} are term counts.
Let s be the total number of terms in all the sentences. We can derive s from the sentence-term matrix A. Formally, we compute

s = \sum_{i=1}^{n} \sum_{j=1}^{m} a_{i,j}

Example 2. For the matrix of Example 1 we have

s = \sum_{i=1}^{5} \sum_{j=1}^{3} a_{i,j} = 16.
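To make the construction concrete, here is a minimal sketch (ours, not part of the paper) that builds the sentence-term matrix A of Example 1 and computes s; the use of NumPy is our own assumption, not a tool prescribed by the paper.

```python
# A minimal sketch (ours, not from the paper): building the sentence-term
# matrix A of Example 1 with NumPy and computing the total term count s.
import numpy as np

# Rows are the terms T_1..T_5 ("fat", "cat", "eat", "fish", "meat"),
# columns are the sentences S_1..S_3; entries are the term counts a_{i,j}.
A = np.array([
    [2, 1, 1],   # "fat"
    [2, 2, 1],   # "cat"
    [1, 1, 1],   # "eat"
    [0, 1, 1],   # "fish"
    [1, 0, 1],   # "meat"
])

s = A.sum()      # total number of term occurrences
print(s)         # -> 16, as in Example 2
```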
3.3 Term Frequencies
We can use the sentence-term matrix to compute the term frequency of each term. Indeed, for the n terms in the document, their term counts form a real vector C of size n, where C[i] is the total number of occurrences of term T_i. The term frequency is then the real vector

F = \frac{1}{s} C

obtained from C by dividing each of its elements by the total term count.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
282
Computing the vector F requires the application of a simple linear transformation to A. We have

F^T = \frac{1}{s} A \times J_{m \times 1}

where J_{m \times 1} is the all-ones vector. The vector F consists of the term frequencies [tf(T_1), \ldots, tf(T_n)] and is easily computed using matrix-vector multiplication.
Example 3. For the matrix of Example 1 we have

C = A \times J_{3 \times 1} = \begin{pmatrix} 2 & 1 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 5 \\ 3 \\ 2 \\ 2 \end{pmatrix}

and s = 16. Then

F^T = \frac{1}{16} C^T = \frac{1}{16} [4\ 5\ 3\ 2\ 2].
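Continuing the earlier sketch (again an illustration of ours, reusing the matrix A from the previous snippet), C and F are obtained with a single matrix-vector product:

```python
# Sketch continuing the previous one: the term-count vector C and the
# term-frequency vector F via a matrix-vector product, as in Example 3.
C = A @ np.ones(A.shape[1])   # C = A x J_{3x1} = [4, 5, 3, 2, 2]
F = C / A.sum()               # F = C / s = [4, 5, 3, 2, 2] / 16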
3.4 The Goal
In this setting, our goal can be reformulated as the problem of finding a subset i_1, ..., i_k of the columns of A such that, for the resulting submatrix A', the distance from F to the vector

(F')^T = \frac{1}{s'} A' \times J_{k \times 1}

is as small as possible. Here the number s' denotes the total count of terms in the selected sentences. The distance function can vary: for example, Manhattan distance, Euclidean distance, cosine similarity, mutual information, etc.
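As a small illustration of this goal (ours, reusing A and F from the sketches above), the following snippet evaluates the Euclidean distance for one hard-coded candidate subset; the polytope model introduced next avoids enumerating subsets explicitly.

```python
# Sketch (ours): the distance for one hard-coded candidate subset {S_1, S_2}.
# The polytope model below avoids enumerating subsets like this explicitly.
cols = [0, 1]                                 # indices of the selected sentences
A_sub = A[:, cols]                            # submatrix A'
s_sub = A_sub.sum()                           # s': term count in the selection
F_sub = (A_sub @ np.ones(len(cols))) / s_sub  # (F')^T
dist = np.linalg.norm(F - F_sub)              # Euclidean distance between F and F'
```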
4 FROM MATRIX MODEL TO
POLYTOPE
4.1 Hyperplanes and Half-spaces
Extractive summarization aims at extracting a sub-
set of sentences that covers as much non-redundant
information as possible w.r.t. the source docu-
ment/documents. Here we introduce a new, efficient text representation model with the purpose of representing all possible extracts without computing them explicitly. Since the number of potential extracts is exponential in the number of sentences, this saves a great portion of computation time. Finding an optimal extract of text units is a general problem in various Information Retrieval tasks such as Question Answering and Literature Search, and our model can be efficiently applied to all of them.
In our representation model, each sentence is represented by a hyperplane, and all sentences derived from a document form a system of intersecting hyperplanes whose boundary defines a polytope. All possible extracts can then be represented by intersections of these hyperplanes, and as such they are not located far from the boundary of the polytope. Therefore, intuitively, the boundary of the resulting polytope is a good approximation for the extracts that can be generated from the given document.
4.2 The Approach
We view every column of the sentence-term matrix as a linear constraint representing a hyperplane in R^{mn}. A term T_i in sentence S_j is represented by the variable x_{i,j}.
Example 4. This example shows the variables corresponding to the 5 × 3 sentence-term matrix A from Example 1.
        S_1       S_2       S_3
T_1   x_{1,1}   x_{1,2}   x_{1,3}
T_2   x_{2,1}   x_{2,2}   x_{2,3}
T_3   x_{3,1}   x_{3,2}   x_{3,3}
T_4   x_{4,1}   x_{4,2}   x_{4,3}
T_5   x_{5,1}   x_{5,2}   x_{5,3}
Together, all the columns define a system of linear inequalities; we also express constraints on the number of terms in the extract we seek. Every sentence in our document is then a hyperplane in R^{mn}, expressed with the help of the elements of column A[][i] of A and the variables x_{j,i} representing appearances of terms in sentences.
We define the linear inequality

A[][i] \cdot [x_{1,i}, \ldots, x_{n,i}]^T = \sum_{j=1}^{n} a_{j,i} x_{j,i} \le A[][i] \cdot \mathbf{1}^T    (1)
Every inequality of this form defines a hyperplane H_i and its lower half-space, specified by equation (1):

H_i := \sum_{j=1}^{n} a_{j,i} x_{j,i} = \sum_{j=1}^{n} a_{j,i}

with normal vector \tilde{n}_i whose coordinate for the variable x_{j,l} is

\tilde{n}_i[j, l] = \begin{cases} a_{j,i}, & l = i \\ 0, & \text{otherwise} \end{cases}    (2)

i.e., \tilde{n}_i is supported only on the coordinates of the variables x_{j,i} of sentence S_i.
To express the fact that every term is either present in or absent from the chosen extract, we add the constraints

0 \le x_{i,j} \le 1    (3)
Intuitively, a point p on H_i represents a sentence with the same term counts as S_i. To study subsets of sentences, we observe intersections of the hyperplanes H_i.
PolytopeModelforExtractiveSummarization
283
Figure 1: Two-dimensional projection of hyperplane intersection (hyperplanes H_1, H_2, H_3 and the intersection of H_1 and H_2).
In this case, we say that the intersection of two hyperplanes H_i and H_j represents the set of two sentences S_i and S_j. A subset of sentences of size r is then represented by the intersection of r hyperplanes.
Example 5. The sentence-term matrix of Example 1 defines the following hyperplane equations:

H_1 := 2x_{1,1} + 2x_{2,1} + x_{3,1} + x_{5,1} = 2 + 2 + 1 + 1 = 6
H_2 := x_{1,2} + 2x_{2,2} + x_{3,2} + x_{4,2} = 5
H_3 := x_{1,3} + x_{2,3} + x_{3,3} + x_{4,3} + x_{5,3} = 5

Here, a summary consisting of the first and the second sentences is expressed by the intersection of the hyperplanes H_1 and H_2. Figure 1 shows how a two-dimensional projection of the hyperplanes H_1, H_2, H_3 and their intersections looks.
5 SUMMARIZATION AS A
DISTANCE FUNCTION
We express summarization constraints in the form of linear inequalities in R^{mn}, using the columns of the sentence-term matrix A as linear constraints. The maximality constraint on the number of terms in the summary can easily be expressed as a constraint on the sum of the term variables x_{i,j}. Since we are looking for summaries that consist of at most N terms, we introduce the following linear constraint:

\sum_{i=1}^{n} \sum_{j=1}^{m} x_{i,j} \le N    (4)
Indeed, every variable x_{i,j} stands for a separate term in a specific sentence, and we intend for their sum to express the number of terms in the selected sentences.
Example 6. Equation (4) for the sentence-term matrix of Example 1 with N = 10 has the form

\sum_{i=1}^{5} \sum_{j=1}^{3} x_{i,j} \le 10
Figure 2: Intersection of hyperplanes (H_i, H_j, and H_i ∩ H_j).
Having defined the linear inequalities that describe each sentence in a document separately and the total number of terms in a sentence subset, we can now look at them together as a system:

\begin{cases}
\sum_{j=1}^{n} a_{j,1} x_{j,1} \le \sum_{j=1}^{n} a_{j,1} \\
\quad \vdots \\
\sum_{j=1}^{n} a_{j,m} x_{j,m} \le \sum_{j=1}^{n} a_{j,m} \\
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{i,j} \le N \\
0 \le x_{i,j} \le 1
\end{cases}    (5)
The first m inequalities describe the sentences S_1, ..., S_m, and the next inequality constrains the total number of terms in a summary.
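The following sketch (an illustration of ours, not code from the paper) assembles system (5) for Example 1 in a generic "G x <= h" form; the flattening of the variables x_{i,j} into a single vector and the value N = 10 (taken from Example 6) are our own modelling choices.

```python
# Sketch (ours): assembling system (5) for Example 1 as "G x <= h".
# Flattening x_{i,j} (term i, sentence j) to index i*m + j is our convention.
n, m = A.shape                      # n terms, m sentences
N = 10                              # length limit on summary terms (Example 6)

# One row per sentence j: sum_i a_{i,j} x_{i,j} <= sum_i a_{i,j}
G_sent = np.zeros((m, n * m))
h_sent = np.zeros(m)
for j in range(m):
    for i in range(n):
        G_sent[j, i * m + j] = A[i, j]
    h_sent[j] = A[:, j].sum()

# One row for the length constraint: sum_{i,j} x_{i,j} <= N
G_len = np.ones((1, n * m))
h_len = np.array([float(N)])

G = np.vstack([G_sent, G_len])
h = np.concatenate([h_sent, h_len])
bounds = [(0.0, 1.0)] * (n * m)     # 0 <= x_{i,j} <= 1
```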
Since every inequality in system (5) is linear, the entire system describes a convex polyhedron in R^{mn}, which we denote by P. The faces of P are determined by intersections of the hyperplanes H_i, the hyperplane \sum_{i,j} x_{i,j} = N, and the hyperplanes x_{i,j} = 0 and x_{i,j} = 1. Intersections of the H_i's represent subsets of sentences (see Figure 2 for an illustration), as the following property shows.
Property 1. The equation of the intersection H_{1,...,k} = H_1 \cap \cdots \cap H_k (which is a hyperplane by itself) satisfies

\sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j} x_{i,j} = \sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j}.

Proof. This property is trivial, since the intersection H_{1,...,k} has to satisfy all the equations of H_1, ..., H_k. Therefore, summing up the equalities (1) for H_1, ..., H_k we obtain

\sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j} x_{i,j} = \sum_{j=1}^{k} \sum_{i=1}^{n} a_{i,j}.

Note that the choice of indexes 1, ..., k was arbitrary, so the property holds for any subset of indexes.

Therefore, the hypersurfaces representing the sentence sets we seek are in fact hyperplane intersections that form the boundary of the polytope P.
5.1 Finding the Closest Point
We assume here that the surface of the polyhedron P
is a suitable representation of all the possible sentence
subsets (its size, of course, is not polynomial in m and
n since the number of vertices of P can be very large).
Fortunately, we do not need to scan the whole set of
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
284
P's surfaces but rather to find the point on P which is closest to the term frequency we wish to preserve.
We use the fact that the term frequency vector F is in fact a point in R^n. The polytope P lies in R^{mn}, so we need to bridge the gap between the dimensions of the two spaces.
Let H be a hyperplane representing a summary. Then, by Property 1, H is w.l.o.g. an intersection of several hyperplanes H_1, ..., H_k of the form (1), and its normal vector \tilde{n} satisfies the following condition:

\tilde{n}[i, j] = \begin{cases} a_{i,j}, & j = 1, \ldots, k \\ 0, & \text{otherwise.} \end{cases}

The term count of term T_i in this summary is then precisely

\sum_{j=1}^{m} \tilde{n}[i, j]    (6)
To find the distance from F to possible summaries as a Euclidean distance in R^n, we:
- look at the linear transformation of P that maps a hyperplane with normal vector \tilde{n} = (\tilde{n}_{i,j}) into a hyperplane with normal vector

  \tilde{m} = (\sum_{j=1}^{m} \tilde{n}[1, j], \ldots, \sum_{j=1}^{m} \tilde{n}[n, j])    (7)

- observe F = (f_1, \ldots, f_n) as a point in R^n;
- search for points p = (p_1 := \sum_{j=1}^{m} x_{1,j}, \ldots, p_n := \sum_{j=1}^{m} x_{n,j}) on the transformed polytope whose distance to F is minimal;
- note that such a point p is the image of a point x = (x_{i,j}) \in R^{mn} which lies on the boundary of P; it holds that x_{1,k} = \cdots = x_{n,k} = 1 if the point belongs to the hyperplane H_k.
The distance between the two is computed as

d(p, F) = \sqrt{\sum_{i=1}^{n} (f_i - p_i)^2}    (8)

Since the vector F is constant for a given document or document collection, the values f_i are constants, and therefore d(p, F) is a quadratic function.
The problem of finding the required summary can now be reformulated as finding the point on P closest to the point F (see Figure 3 for an illustration). Since our polytope is defined by a system of linear inequalities and the distance from F to P is expressed as a quadratic function, the minimum of this function is achieved on the boundary of P.
Formally speaking, we are looking for the minimum of the following function under the constraints defined in (5):

d(P, F) = \sqrt{\sum_{i=1}^{n} \left( \left( \sum_{j=1}^{m} x_{i,j} \right) - f_i \right)^2}    (9)
Figure 3: Distance from F to P.
Minimizing function (9) is a quadratic programming problem. Therefore, the original summarization problem can be posed as a quadratic programming problem of minimizing the function d(P, F) under the constraints of (5). Quadratic programming problems of this type can be solved efficiently both in theory and in practice (see (Karmarkar, 1984; Khachiyan, 1996; Berkelaar, 1999)).
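As a rough illustration, the sketch below (ours) minimizes the squared form of (9) under the constraints assembled in the earlier snippet, using SciPy's general-purpose SLSQP solver; the paper itself relies on dedicated LP/QP packages such as lp-solve, so SciPy and the starting point are stand-in assumptions.

```python
# Sketch (ours): minimizing the squared form of objective (9) under the
# constraints G x <= h and 0 <= x <= 1 built earlier, via SciPy's SLSQP.
# The squared objective has the same minimizer as d(P, F) in (9).
from scipy.optimize import minimize

def objective(x):
    X = x.reshape(n, m)          # x_{i,j}: term i in sentence j
    p = X.sum(axis=1)            # p_i = sum_j x_{i,j}
    return np.sum((p - F) ** 2)  # squared form of equation (9)

ineq = {"type": "ineq", "fun": lambda x: h - G @ x}  # encodes G x <= h
x0 = np.full(n * m, 0.5)                             # feasible interior start
res = minimize(objective, x0, method="SLSQP",
               bounds=bounds, constraints=[ineq])
x_opt = res.x.reshape(n, m)                          # solution as a matrix
```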
5.2 Extracting the Summary
Since the solver not only finds the minimal distance but also presents evidence of that minimality in the form of a point x = (x_{i,j}), we use the point's data to determine which sentences belong to the chosen summary. Viewing x as a matrix, we check whether or not each column x[][i] equals 1. If this equality holds, x lies on an intersection of hyperplanes that includes H_i, and therefore sentence S_i is contained in the summary. Otherwise, S_i does not belong to the chosen summary. This test is straightforward and takes O(mn) time.
Applying our method with lp-solve (Berkelaar, 1999) to Example 1 under a length constraint of 11 terms resulted in a summary consisting of the first and second sentences.
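A small sketch of this column test (ours, reusing x_opt, n, and m from the earlier snippets); the numerical tolerance is our assumption, since a solver typically returns values only approximately equal to 1.

```python
# Sketch (ours) of the column test described above: a sentence S_j is selected
# when its whole column of x is (approximately) all ones.
selected = [j for j in range(m) if np.allclose(x_opt[:, j], 1.0, atol=1e-6)]
print("summary sentences:", [f"S{j + 1}" for j in selected])
```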
5.3 Supervised Approach
As described above, finding the closest point is relative to the term frequency we wish to preserve. Trying to preserve the original term frequency of the source document, we get an unsupervised method. Given gold-standard summaries, we can instead train our model to find the closest point to the term frequency of the gold standard and then apply the trained model to new documents, as in a supervised method.
5.4 Text Preprocessing
In order to build the matrix, and then the polytope model, one needs to perform basic text preprocessing, including sentence splitting and tokenization. Additional steps such as stop-word removal, stemming, and synonym resolution may be performed for resource-rich languages. Since the main
PolytopeModelforExtractiveSummarization
285
purpose of these steps is to reduce the matrix dimensionality, the resulting model will be more efficient.
6 CONCLUSIONS AND FUTURE
WORK
In this paper we present a linear programming model
for the problem of extractive summarization. We rep-
resent the document as a sentence-term matrix whose
entries contain term count values and view this matrix
as a set of intersecting hyperplanes. Every possible
summary of a document is represented as an intersection of two or more hyperplanes, and one additional constraint is used to limit the number of terms used in a summary. We consider a summary to be the best if term frequency is preserved during summarization; in this case, the summarization problem translates into the problem of finding the point on a convex polytope (defined by linear inequalities) which is closest to the hyperplane describing the overall term frequencies in the document.
Linear programming problems can be solved in polynomial time (see (Karmarkar, 1984; Khachiyan, 1996)), and numerous packages and applications are available, e.g., (Berkelaar, 1999) and (Makhorin, 2000). In future research, we plan to implement and test our approach in both unsupervised and supervised settings. We would also like to extend our model to query-based summarization by adapting the distance function, and to apply our text representation model to such text mining tasks as text clustering and text categorization.
ACKNOWLEDGEMENTS
The authors thank Ruvim Lipyansky for ideas that led
to development of their approach.
REFERENCES
Alfonseca, E. and Rodriguez, P. (2003). Generating ex-
tracts with genetic algorithms. In Proceedings of the
2003 European Conference on Information Retrieval
(ECIR’2003), pages 511–519.
Berkelaar, M. (1999). lp-solve free software.
http://lpsolve.sourceforge.net/5.5/.
Filatova, E. (2004). Event-based extractive summarization.
In In Proceedings of ACL Workshop on Summariza-
tion, pages 104–111.
Gillick, D. and Favre, B. (2009). A Scalable Global Model
for Summarization. In Proceedings of the NAACL
HLT Workshop on Integer Linear Programming for
Natural Language Processing, pages 10–18.
Hassel, M. and Sjobergh, J. (2006). Towards holistic sum-
marization: Selecting summaries, not sentences. In
Proceedings of LREC - International Conference on
Language Resources and Evaluation.
Nishikawa, H., Hasegawa, T., Matsuo, Y., and Kikui, G. (2010). Opinion Summarization with Integer Linear Programming Formulation for Sentence Extraction and Ordering. In Coling 2010: Poster Volume, pages 910–918.
Karmarkar, N. (1984). New polynomial-time algorithm for
linear programming. Combinatorica, 4:373–395.
Khachiyan, L. G. (1996). Rounding of polytopes in the real
number model of computation. Mathematics of Oper-
ations Research, 21:307–320.
Khuller, S., Moss, A., and Naor, J. S. (1999). The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45.
Litvak, M., Last, M., and Friedman, M. (2010). A new ap-
proach to improving multilingual summarization us-
ing a Genetic Algorithm. In ACL ’10: Proceedings of
the 48th Annual Meeting of the Association for Com-
putational Linguistics, pages 927–936.
Liu, D., Wang, Y., Liu, C., and Wang, Z. (2006). Multiple
Documents Summarization Based on Genetic Algo-
rithm. In Fuzzy Systems and Knowledge Discovery,
volume 4223 of Lecture Notes in Computer Science,
pages 355–364.
Makhorin, A. O. (2000). GNU Linear Programming Kit.
http://www.gnu.org/software/glpk/.
Makino, T., Takamura, H., and Okumura, M. (2011). Bal-
anced coverage of aspects for text summarization. In
TAC ’11: Proceedings of Text Analysis Conference.
Mani, I. and Maybury, M. (1999). Advances in Automatic
Text Summarization. MIT Press, Cambridge, MA.
Ouyang, Y., Li, W., Li, S., and Lu, Q. (2011). Applying
regression models to query-focused multi-document
summarization. Information Processing and Manage-
ment, 47:227–237.
Salton, G., Yang, C., and Wong, A. (1975). A vector-space
model for information retrieval. Communications of
the ACM, 18.
Takamura, H. and Okumura, M. (2009). Text summariza-
tion model based on maximum coverage problem and
its variant. In EACL ’09: Proceedings of the 12th Con-
ference of the European Chapter of the Association for
Computational Linguistics, pages 781–789.
Woodsend, K. and Lapata, M. (2010). Automatic Genera-
tion of Story Highlights. In ACL ’10: Proceedings of
the 48th Annual Meeting of the Association for Com-
putational Linguistics, pages 565–574.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
286