Uniﬁed New Techniques for NP-Hard Budgeted Problems with

Applications in Team Collaboration, Pattern Recognition, Document

Summarization, Community Detection and Imaging

Dorit S. Hochbaum (ORCID: https://orcid.org/0000-0002-2498-0512)

Department of Industrial Engineering and Operations Research, University of California, Berkeley, U.S.A.

Keywords:

Parametric Flow, Maximum Diversity, Quadratic Knapsack, Efﬁcient Frontier, Text Summarization.

Abstract:

This paper introduces new techniques for any NP-hard problems formulated as monotone integer programming

(IPM) with a budget constraint “budgeted IPM”. Problems of this type have diverse applications, including

maximizing team collaboration, the maximum diversity problem, facility dispersion, threat detection, mini-

mizing conductance, clustering, and pattern recognition.

We present a unified framework for effective algorithms for budgeted IPM problems based on the Lagrangian

relaxation of the budget constraint. It is shown that all optimal solutions for all values of the Lagrange mul-

tiplier are generated very efﬁciently, and the piecewise linear concave envelope (convex, for minimization

problems) of these solutions has breakpoints that are optimal solutions for the respective budgets. This is used

to derive high quality upper and lower bounds for budgets that do not correspond to breakpoints. We show that

for all these problems, the weight “perturbation” concept, that was successful for the problem of maximum

diversity in enhancing the number and distribution of breakpoints, is applicable. Furthermore, the insights

derived from this efficient frontier of solutions lead to the result that all the respective ratio problems have a

solution at the “ﬁrst” breakpoint, which generalizes the concept of maximum density subgraph.

1 INTRODUCTION

We explore here NP-hard problems that can be for-

mulated as monotone integer programs with an ad-

ditional budget constraint. Monotone integer pro-

gramming problems (IPM) are solvable in polyno-

mial time as a minimum cut on an associated graph,

(Hochbaum, 2002). The addition of the budget con-

straint renders these problems NP-hard. Among such

problems are the quadratic knapsack, the minimum

conductance problem, the facility dispersion problem,

the ratio problem, image segmentation problems and

others. The relationship of IPM problems to the min-

imum cut problem is crucial, as it allows to solve the

relaxed budget constraint problem parametrically, and

very efﬁciently, and to generate the concave envelope

which has desirable properties: that the breakpoints

give optimal solutions and that the solutions at the se-

quence of breakpoints are nested.

Special cases of IPM with budget constraint, the

maximum diversity and the facility dispersion prob-

lems, were recently addressed in (Hochbaum et al.,


2023), by solving efﬁciently the Lagrangean relax-

ation of the budget constraint for all values of the La-

grange multiplier. The function of all solutions for

each budget value has an upper envelope of the inter-

section of all lines that lie above all solutions, which

is concave, piecewise linear. We refer to this piece-

wise linear function that maps the budget to an up-

per bound on the value of the objective, as the con-

cave envelope. Because IPM problems are minimum

cut problems, the breakpoints in the concave envelope

correspond to solutions that are optimal and nested.

This concave envelope provides an upper bound for

the solutions for each budget, but is also used to gen-

erate high quality feasible solutions as lower bounds

(for maximization). The feasible solutions are gen-

erated with the breakpoints algorithm that utilizes

breakpoints with close budget values to the value in

the input, to append or remove elements, using a fast

heuristic, (Hochbaum et al., 2023). Since increased

density of the breakpoints enhances the quality of the

solutions, it was shown that a perturbation method

on the utility values leads to tighter lower bounds (for

maximization). This resulted in the ability to solve

the problem, even for small budget values, which are

Hochbaum, D.

Uniﬁed New Techniques for NP-Hard Budgeted Problems with Applications in Team Collaboration, Pattern Recognition, Document Summarization, Community Detection and Imaging.

DOI: 10.5220/0012207200003598

In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2023) - Volume 1: KDIR, pages 365-372

ISBN: 978-989-758-671-2; ISSN: 2184-3228


particularly challenging, either to optimality, or very

close to optimality. All this is done within a small

fraction of the running time required by competing

approaches, such as an integer programming solver,

or by state-of-the-art meta-heuristics.

The Lagrangean relaxation of the budget con-

straint has been previously used in the context of the

quadratic knapsack problem, most recently by (Spiers

et al., 2023). The quadratic knapsack, which is iden-

tical to the maximum diversity and maximum facil-

ity dispersion problems, can be formulated as IPM

with a budget constraint. However, the methods de-

veloped for the quadratic knapsack were ad-hoc and

suitable for this case only, and the general structure

has not been recognized up till now. In addition, all

the literature that uses this approach, going back to

(Chaillou et al., 1989; Pisinger, 2006) can solve the

problems to optimality for at most a few hundred vari-

ables, and for selected, relatively high, values of the

budget (accommodating 10%-50% of the variables),

whereas the harder problems, that are more prevalent

in applications, are for smaller budgets. In addition,

the running times of the current approaches do not

scale well. Indeed in the reported results of (Spiers

et al., 2023) the largest instances contain up to 2000

nodes of the GKD benchmark. These instances were

shown to be particularly easy, in (Hochbaum et al.,

2023). In contrast, in (Hochbaum et al., 2023), new

insights as to how to deal with harder problems, with

perturbation, were able to provide, often optimal, or

very close to optimal, solutions, within a tiny fraction

of the running time of competing approaches, includ-

ing integer programming software.

Here we demonstrate that the breakpoints algorithm is applicable to a vast collection of hard problems, with the potential of providing optimal, or very close to optimal, solutions. One example explored

here is the text summarization problem, aka multi-

document summarization, which is modeled, under

the MMR criterion, as combining one goal of max-

imizing the sum of dissimilarities in the selected set

(of sentences), to enhance the diversity of the selected

set and eliminate redundancy, with the second goal

maximizing the similarities between the selected set

and its complement. This combined objective is NP-

hard to solve even without the budget constraint on

the total size of the sentences selected. A straightfor-

ward formulation of this optimization problem, given

in (Lin and Bilmes, 2010), is reported to be solved

for an instance of the problem of size 178 sentences,

in 17 hours, using an integer programming software.

Our approach is to model the problem as budgeted

IPM simply by replacing the similarity weights by

dis-similarity weights, e.g. by taking the reciprocal

of the similarities. Once the problem is modeled as

IPM with a budget constraint, the framework pre-

sented here can utilize the concave envelope to solve

the problem effectively, and with a highly scalable al-

gorithm. This is discussed in detail in Section 3.

There is a close relationship between ratio prob-

lems and budgeted IPM problems. This relationship is

reﬂected in the concave envelope, where the optimal

solution to the respective budget problem is the ﬁrst

breakpoint, for the smallest budget value, that cor-

responds to a generalization of the maximum density

subgraph problem.

Our contributions here include:

• Introducing a large class of NP-hard problems that

are formulated as monotone integer programming

with a budget constraint: budgeted IPM.

• Demonstrating that for all budgeted IPM prob-

lems the concave envelope (for maximization,

convex for minimization) related to the La-

grangean relaxation of the budget constraint is

constructed as the output of a (parametric) min-

imum cut procedure on a respective graph.

• The breakpoints in the concave envelope are

shown to be optimal solutions for the respective

budget values, and correspond to nested solutions.

• The perturbation concept, of (Hochbaum et al.,

2023), applies to the class of budgeted IPM prob-

lems, and can increase the number of breakpoints

and enhance their distribution around the budget

values of interest.

• Relationship of budgeted IPM problems to the

respective IPM ratio problems, that are polyno-

mial time solvable with a parametric cut proce-

dure, showing that the ﬁrst (leftmost for maxi-

mization) breakpoint, solves the ratio problem op-

timally. The ﬁrst breakpoint is shown to general-

ize the concept of maximum density subgraph.

• The newly introduced procedure incremental-para, which solves IPM ratio problems, given an initial feasible solution, in the complexity of a single minimum cut procedure and generates sequentially all breakpoints.

• Showing that all budgeted IPM problems are amenable to the breakpoints algorithm of (Hochbaum et al., 2023), which bodes well for the prospect of a scalable algorithm that delivers high quality solutions.

• Demonstrating a new formulation for the text

summarization problem that renders it a budgeted

IPM problem with the potential of new scalable

methods for the problem.

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval


The paper is structured as follows: the following Sec-

tion 2 provides the basic graph notation used in the

rest of the paper, the formulation of a general IPM

problem, and the construction of the associated graph

and several examples of IPM problems and ratio prob-

lems that are IPM. Section 3 introduces the text sum-

marization problem in the known form of non-IPM

problem, and shows that it can be modeled as a bud-

geted IPM problem. The key insights to the concave

envelope and the implications for budgeted IPM, and

ratio IPM problem are presented in Section 4. Sec-

tion 5 presents a new form of fully parametric cut

procedure, that is more efﬁcient than the parametric

cut approach in that it only computes the solutions at

the breakpoints. Section 6 includes conclusions and

pointers to future research.

It is noted that although most of the results are

presented in the setting of maximization problems,

they all apply analogously to minimization problems

as well.

2 NOTATION AND IPM PROBLEMS

Firstly, we introduce the graph notation used here. Let the input graph be an undirected graph $G = (V,E)$ with edge weights $w_{ij}$ for $[i,j] \in E$. A bipartition of the graph is called a cut, $(S,\bar{S}) = \{[i,j] \in E \mid i \in S, j \in \bar{S}\}$, where $\bar{S} = V \setminus S$. Given two subsets of nodes, $A \subseteq V$, $B \subseteq V$, let the sum of weights of the edges with one endpoint in $A$ and the other in $B$ be $C(A,B) = \sum_{i \in A, j \in B} w_{ij}$. For the cut $(S,\bar{S})$, the capacity of this cut is $C(S,\bar{S})$, and the sum of weights inside the set $A$ is $C(A,A) = \sum_{i,j \in A} w_{ij}$.

If the edges have two sets of weights, these will be denoted by $w^{(1)}_{ij}$ and $w^{(2)}_{ij}$. For inputs with two sets of edge weights we let $C_1(A,B) = \sum_{i \in A, j \in B} w^{(1)}_{ij}$ and $C_2(A,B) = \sum_{i \in A, j \in B} w^{(2)}_{ij}$.

We denote by $d_i$ the weighted degree of node $i$ in $G$: $d_i = \sum_{j \mid [i,j] \in E} w_{ij}$. The sum of the weighted degrees of nodes in a set $H \subset V$ is $d(H) = \sum_{i \in H} d_i$, and is also referred to as the volume of the set $H$. We also allow nodes to be associated with arbitrary weights, which are denoted by $q_i$ for $i \in V$, and the sum of weights of nodes in the set $H$ is $q(H) = \sum_{i \in H} q_i$.
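The notation above can be made concrete with a small sketch. The toy graph, the dictionary `W`, and the helper names `C` and `d` below are illustrative assumptions, not part of the paper; each undirected edge is counted once, which is the convention under which $2C(S,S) + C(S,\bar{S}) = d(S)$ holds.

```python
from itertools import combinations

# Hypothetical toy graph: undirected edges [i, j] with weights w_ij.
W = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0, (2, 3): 4.0}
V = {0, 1, 2, 3}

def C(A, B):
    """Sum of weights of edges with one endpoint in A and the other in B.
    Each undirected edge is counted once."""
    A, B = set(A), set(B)
    total = 0.0
    for (i, j), w in W.items():
        if (i in A and j in B) or (j in A and i in B):
            total += w
    return total

def d(H):
    """Volume of H: the sum of the weighted degrees of the nodes in H."""
    return sum(w for (i, j), w in W.items() for v in (i, j) if v in H)

S = {0, 1}
S_bar = V - S
inside_weight = C(S, S)      # weight of edges with both endpoints in S
cut_weight = C(S, S_bar)     # capacity of the cut (S, S_bar)
volume = d(S)                # equals 2 * inside_weight + cut_weight
```

Enumerating all subsets of `V` with `combinations` is a quick way to confirm that the volume identity holds for every set on this instance.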

Monotone integer programming problems,

IPM, also referred to as Monotone IP3 problems, are

integer programming problems on at most 3 vari-

ables per constraint, where two of the variables, the

x-variables, appear with opposite sign coefﬁcients,

and a third variable, a z-variable, if included, can ap-

pear in at most one constraint. The coefﬁcient of the

third variable, in the objective function, must be non-

negative for minimization problems, or non-positive

for maximization problems. The formulation of a

general maximization (monotone) IP3, for a set of

n x-variables, those that can appear in multiple con-

straints, and a set of constraints involving a collec-

tion of pairs of variables A and a respective set of z-

variables is

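$$
\begin{aligned}
\text{(IPM)} \quad \max \quad & \sum_{i=1}^{n} w_i x_i - \sum_{(i,j) \in A} u_{ij} z_{ij} \\
\text{s.t.} \quad & a_{ij} x_i - b_{ij} x_j \le c_{ij} + z_{ij} \quad \forall\, (i,j) \in A \\
& \ell_i \le x_i \le u_i, \ \text{integer} \quad \forall\, i \in V \\
& z_{ij} \ge 0, \ \text{integer} \quad \forall\, (i,j) \in A.
\end{aligned}
$$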

Here all $a_{ij}$, $b_{ij}$ and $u_{ij}$ are non-negative. A constraint may appear in the form $a_{ij} x_i - b_{ij} x_j \le c_{ij}$, without a third variable. In that case, for the sake of streamlined presentation, we can assume that $z_{ij}$'s coefficient is $u_{ij} = \infty$.

(Hochbaum, 2002) showed that any IPM can be written as the maximum s-excess problem, in $U = \sum_{i=1}^{n} (u_i - \ell_i)$ binary variables. (Solving the problem (IPM) independently of $U$ was proved to be NP-hard.) The maximum s-excess problem is formulated as a binary optimization problem, where $x_i = 1$ iff node $i$ is in the optimal set $S$:

$$
\begin{aligned}
\text{(s-excess)} \quad \max \quad & \sum_{j \in V} w_j x_j - \sum_{(i,j) \in A} u_{ij} z_{ij} \\
\text{subject to} \quad & x_i - x_j \le z_{ij} \quad \text{for } (i,j) \in A \\
& x_j \ \text{binary} \quad j = 1, \ldots, n \\
& z_{ij} \ \text{binary} \quad (i,j) \in A.
\end{aligned}
$$

Although the constraints of the type $x_i - x_j \le 0$ do not appear explicitly in this formulation, these can be written as $x_i - x_j \le z_{ij}$ where the cost coefficient $u_{ij}$ of the respective variable $z_{ij}$ (and the capacity of the corresponding arc) is infinite. We now show that the maximum s-excess set in a graph $G$ is the source set of a minimum cut in an associated graph $G_{st}$, constructed as follows, (Hochbaum, 2002): We add nodes $s$ and $t$ to the graph $G$, with an arc from $s$ to every positive weight node $i$, of capacity $u_{s,i} = w_i$, and an arc from every negative weight node $j$ to $t$ of capacity $u_{j,t} = -w_j$. Let this added set of arcs, adjacent to $s$ and $t$ (source node and sink node respectively), be denoted by $A_{st}$. The arcs of $A$ each carry the capacity $u_{ij}$. The graph $G_{st}$ is then $(V \cup \{s,t\}, A \cup A_{st})$.

Lemma 1. $S^*$ is a set of maximum s-excess capacity in the original graph $G$ if and only if $S^*$ is the source set of a minimum s,t-cut in the associated graph $G_{st}$.

Proof. Let $V^+ \equiv \{i \in V \mid w_i > 0\}$, and let $V^- \equiv \{j \in V \mid w_j < 0\}$. Let $(s \cup S, t \cup T)$ be a minimum s,t cut on $G_{st}$. Then the capacity of this cut is given by

$$
\begin{aligned}
C(s \cup S, t \cup T) &= \sum_{(s,i) \in A_{st},\, i \in T} u_{s,i} + \sum_{(j,t) \in A_{st},\, j \in S} u_{j,t} + \sum_{i \in S,\, j \in T} u_{ij} \\
&= \sum_{i \in T \cap V^+} w_i + \sum_{j \in S \cap V^-} (-w_j) + \sum_{i \in S,\, j \in T} u_{ij} \\
&= W^+ - \Big[ \sum_{j \in S} w_j - \sum_{i \in S,\, j \in T} u_{ij} \Big],
\end{aligned}
$$

where $W^+ = \sum_{i \in V^+} w_i$ is the sum of all positive weights in $G$, which is a constant. Therefore, minimizing $C(s \cup S, t \cup T)$ is equivalent to maximizing $\sum_{j \in S} w_j - \sum_{i \in S,\, j \in T} u_{ij}$, and we conclude that the source set of a minimum s,t cut on $G_{st}$ is also a maximum s-excess set of $G$.
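The identity at the heart of the proof can be checked numerically on a tiny instance. The sketch below is only an illustration under assumed data: the node weights `w`, the arc costs `u`, and the function names are hypothetical, and the minimum cut is found by enumeration rather than by a flow algorithm.

```python
from itertools import combinations

# Hypothetical instance: node weights w_i (any sign) and arc costs u_ij.
w = {1: 5.0, 2: -3.0, 3: 2.0, 4: -1.0}
u = {(1, 2): 4.0, (1, 4): 1.0, (3, 2): 1.5, (3, 4): 0.5}
V = sorted(w)

def s_excess(S):
    """Sum of node weights in S minus the cost of arcs leaving S."""
    S = set(S)
    leaving = sum(c for (i, j), c in u.items() if i in S and j not in S)
    return sum(w[i] for i in S) - leaving

def cut_capacity(S):
    """Capacity of the cut ({s} + S, {t} + T) in the associated graph."""
    S = set(S)
    source_arcs = sum(w[i] for i in V if w[i] > 0 and i not in S)  # arcs (s, i), i in T
    sink_arcs = sum(-w[j] for j in V if w[j] < 0 and j in S)       # arcs (j, t), j in S
    inner_arcs = sum(c for (i, j), c in u.items() if i in S and j not in S)
    return source_arcs + sink_arcs + inner_arcs

W_plus = sum(x for x in w.values() if x > 0)
subsets = [c for r in range(len(V) + 1) for c in combinations(V, r)]

# cut capacity = W+ - s-excess for every candidate source set
identity_holds = all(
    abs(cut_capacity(S) - (W_plus - s_excess(S))) < 1e-9 for S in subsets
)
best_excess = max(subsets, key=s_excess)
best_cut = min(subsets, key=cut_capacity)
```

Because the two quantities differ by the constant $W^+$ on every set, the minimizer of the cut capacity attains the maximum s-excess value.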

Examples of IPM Problems. For the following prob-

lems, the discovery of the IPM model for the problem

led to very fast, high quality solutions.

1. The co-segmentation problem: This is to iden-

tify the same feature in two distinct images. The

problem is modeled as increased similarity be-

tween histogram buckets of the foregrounds and

decreased binary MRF (Markov Random Field -

equivalent to distinctness of the image) in image

1 and in image 2 that determine the foregrounds

(co-segmentation), (Hochbaum and Singh, 2009).

2. Identifying nuclear threat alerts regions: The

model is to identify a region with increased num-

ber and intensity of alerts within; decreased length

of region boundary and decreased number of no-

alerts within the region, contributing to alert con-

centration. (Hochbaum and Fishbain, 2011).

There are many ratio problems that are IPM, but were not always recognized as such. As explained in Section 4, a ratio problem is IPM if the “linearized” λ-question is an IPM problem. One such problem, thought to be NP-hard (Sharon et al., 2006), is to find a subset in the graph that maximizes the ratio of the sum of similarities (edge weights) in the set, divided by the cut capacity separating the set from its complement: $\max_{S \subset V} \frac{C(S,S)}{C(S,\bar{S})}$. This problem was stated in (Sharon et al., 2006) to be equivalent to the normalized cut problem, (Shi and Malik, 2000), which is NP-hard. However, recognizing that the problem is IPM led to a polynomial time algorithm (Hochbaum, 2010). This problem is also called HNC. A variant of HNC, $\max_{S \subset V} \frac{C_1(S,S)}{C_2(S,\bar{S})}$, where the set of weights used in the numerator is different from the set of weights used in the denominator, is also IPM, (Hochbaum, 2010).

Another problem, related to the conductance problem, is to minimize the cut separating the set and its complement divided by the “volume” of the set: $\min_{S \subset V} \frac{C(S,\bar{S})}{d(S)}$. This problem is in fact equivalent to HNC as shown in (Hochbaum, 2010). It is also a relaxation of conductance and the Cheeger constant, (Cheeger, 1970), where the “budget” constraint $d(S) \le \frac{1}{2} d(V)$ is relaxed.

The well known problem, the maximum density subgraph problem, $\max_{S \subset V} \frac{C(S,S)}{d(S)}$, is IPM. We refer to it in its most general form where the nodes can carry arbitrary weights $q_i$: $\max_{S \subset V} \frac{C(S,S)}{q(S)}$.

In all the above problems, the formulations have

binary variables and do not require the binarization

transformation of a general IPM.

3 TEXT SUMMARIZATION AS BUDGETED IPM

The text summarization problem, aka, the multi-

document summarization problem has been studied

extensively, with heuristics and approximations. The

problem is modeled as the Maximal Marginal Rele-

vance (MMR) criterion, introduced in (Carbonell and

Goldstein, 1998), which strives to reduce redundancy

while maintaining query relevance in re-ranking re-

trieved documents and in selecting appropriate pas-

sages for text summarization. This MMR model

was formulated as integer programming in (Lin and

Bilmes, 2010), as shown next.

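The formulation uses the following variables. Let $x_i$ be the binary variable which takes the value 1 if node (sentence) $i$ is selected, and 0 otherwise,

$$
x_i = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \bar{S}. \end{cases}
$$

Further, $z_{ij} = 1$ if exactly one of $i$ and $j$ is in $S$, and $y_{ij} = 1$ if both are in $S$. With these variables the following formulation of (MMR) $\max_{S \subset V} C(S,\bar{S}) - \alpha C(S,S)$ was given in (Lin and Bilmes, 2010):

$$
\text{(MMR)} \quad \max \sum_{(i,j) \in E} w_{ij} z_{ij} - \alpha \sum_{[i,j] \in E} w_{ij} y_{ij}
$$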

subject to x

− x

≤ z

i j

for all (i, j) ∈ E

i j

≤ x

1 + z

i j

≥ x

i j

≤ x

for all [i, j] ∈ E

i j

≤ x

+ x

≤ 1 − y

i j

binary .

This formulation is not IPM since the “third” variables of some of the constraints also appear in other constraints. We maintain the spirit of this MMR objective and change the weights to mean the dissimilarity between every pair of nodes. This results in a polynomial time solvable problem, since it is IPM,

$$
\text{(MMR*)} \quad \max_{S \subset V} C(S,S) - \beta C(S,\bar{S}).
$$

This objective function is $f(x)$ where $x_i = 1$ if sentence $i$ is selected in the “summary” set $S$. Here $\beta$ corresponds to $1/\alpha$ in the maximization formulation.

The problem formulation as IPM is:

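$$
\begin{aligned}
\text{(MMR*)} \quad \max \quad & \sum w'_{ij} y_{ij} - \beta \sum w'_{ij} z_{ij} \\
\text{subject to} \quad & x_i - x_j \le z_{ij} \quad \text{for all } [i,j] \in E \\
& x_j - x_i \le z_{ji} \quad \text{for all } [i,j] \in E \\
& y_{ij} \le x_i \quad \text{for all } [i,j] \in E \\
& y_{ij} \le x_j \\
& x_i,\ z_{ij},\ y_{ij} \ \text{binary}.
\end{aligned}
$$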

To verify the validity of the formulation notice that the objective function drives the values of $z_{ij}$ to be as small as possible, and the values of $y_{ij}$ to be as large as possible. With the constraints, $z_{ij}$ cannot be 0 unless both endpoints $i$ and $j$ are in the same set. On the other hand $y_{ij}$ cannot be equal to 1 unless both endpoints $i$ and $j$ are in $S$. In this formulation, both $x_i$ and $y_{ij}$ are the x-variables and there is a node for each in the associated flow graph. Next we show that there is a more compact formulation that has one node in the graph for each candidate sentence.
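The validity argument can be checked by brute force on a tiny example. The sketch below rests on assumptions that are not stated explicitly in the text: it uses one z-variable per direction of each edge (so that each z-variable appears in a single constraint), charges both directions in the objective, and works with made-up dissimilarity weights `Wd` and a made-up value of `beta`.

```python
from itertools import product

# Hypothetical dissimilarity weights w'_ij on a triangle of 3 sentences.
Wd = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 3.0}
n, beta = 3, 0.5
edges = list(Wd)

def best_completion(x):
    """Best objective over binary y, z satisfying the constraints for a fixed x."""
    best = None
    for y in product((0, 1), repeat=len(edges)):
        for z in product((0, 1), repeat=2 * len(edges)):
            feasible, value = True, 0.0
            for k, (i, j) in enumerate(edges):
                z_ij, z_ji = z[2 * k], z[2 * k + 1]
                # x_i - x_j <= z_ij and x_j - x_i <= z_ji
                if x[i] - x[j] > z_ij or x[j] - x[i] > z_ji:
                    feasible = False
                # y_ij <= x_i and y_ij <= x_j
                if y[k] > x[i] or y[k] > x[j]:
                    feasible = False
                value += Wd[(i, j)] * y[k] - beta * Wd[(i, j)] * (z_ij + z_ji)
            if feasible and (best is None or value > best):
                best = value
    return best

def set_objective(x):
    """C(S,S) - beta * C(S, S-bar) for S = {i : x_i = 1}."""
    inside = sum(w for (i, j), w in Wd.items() if x[i] and x[j])
    cut = sum(w for (i, j), w in Wd.items() if x[i] != x[j])
    return inside - beta * cut

agree = all(
    abs(best_completion(x) - set_objective(x)) < 1e-9
    for x in product((0, 1), repeat=n)
)
```

For every selection vector the best feasible completion of the y- and z-variables reproduces the set objective, which is the behavior the validity argument describes.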

3.1 A Compact IPM MMR* Formulation

We first need additional notation: Let the input graph be an undirected graph $G = (V,E)$ with edge weights $w_{ij}$ for $[i,j] \in E$. Let $d_i$ be the weighted degree of node $i$ in $G$, which is the sum of the weights of the edges adjacent to $i$: $d_i = \sum_{j \mid [i,j] \in E} w_{ij}$. For a subset of nodes $D \subset V$ let $d(D) = \sum_{i \in D} d_i$, which is also known as the “volume” of the set $D$.

The compact formulation is presented as the fol-

lowing Lemma:

Lemma 2. Solving $\max_{S \subset V} C(S,S) - \beta C(S,\bar{S})$ is equivalent to solving

$$
\min_{S \subset V} C(S,\bar{S}) - \frac{1}{1+2\beta}\, d(S).
$$

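Proof. Using the equality

$$
2C(S,S) + C(S,\bar{S}) = d(S) \qquad (1)
$$

we obtain

$$
\begin{aligned}
\max_{S \subset V} C(S,S) - \beta C(S,\bar{S}) &= \max_{S \subset V} \tfrac{1}{2} d(S) - \tfrac{1}{2} C(S,\bar{S}) - \beta C(S,\bar{S}) \\
&= -\tfrac{1}{2} \min_{S \subset V} (1 + 2\beta)\, C(S,\bar{S}) - d(S).
\end{aligned}
$$

The latter is equivalent to $\min_{S \subset V} C(S,\bar{S}) - \frac{1}{1+2\beta}\, d(S)$.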

In the formulation of $\min_{S \subset V} C(S,\bar{S}) - \frac{1}{1+2\beta}\, d(S)$ the variables $y_{ij}$ do not appear. Therefore the number of nodes in the respective graph is $n$, where $n$ is the number of sentences, or x-variables.
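Lemma 2 can be verified numerically. The following sketch uses a hypothetical four-node graph and an arbitrary value of `beta`; the helper names are illustrative. It checks that the two objectives differ only by the negative factor $-(1+2\beta)/2$, so maximizing one is the same as minimizing the other.

```python
from itertools import combinations

# Hypothetical dissimilarity graph and trade-off parameter.
W = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0, (1, 3): 0.5, (2, 3): 4.0}
V = [0, 1, 2, 3]
beta = 0.75

def inside(S):
    """C(S, S): weight of edges with both endpoints in S."""
    return sum(w for (i, j), w in W.items() if i in S and j in S)

def cut(S):
    """C(S, S-bar): weight of edges with exactly one endpoint in S."""
    return sum(w for (i, j), w in W.items() if (i in S) != (j in S))

def volume(S):
    """d(S): sum of weighted degrees, computed directly from the edges."""
    return sum(w for (i, j), w in W.items() for v in (i, j) if v in S)

def mmr_star(S):
    return inside(S) - beta * cut(S)

def compact(S):
    return cut(S) - volume(S) / (1 + 2 * beta)

SUBSETS = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
scale = (1 + 2 * beta) / 2

# mmr_star(S) = -scale * compact(S) for every S
same_ordering = all(abs(mmr_star(S) + scale * compact(S)) < 1e-9 for S in SUBSETS)
best_max = max(SUBSETS, key=mmr_star)
best_min = min(SUBSETS, key=compact)
```

The compact objective needs only the cut weight and the node degrees, which is why the associated graph has one node per sentence.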

4 LINEARIZING RATIOS AND ENVELOPES

We consider a generic IPM problem $\max_{x \in F} f(x)$, for $x$ an integer vector. The entire discussion applies to minimization problems as well. Suppose there is a budget constraint of the form $g(x) \le B$. The Lagrangean relaxation approach is related to minimizing, or maximizing, a fractional (or as it is sometimes called, geometric) objective function over a feasible region $F$, $\max_{x \in F} \frac{f(x)}{g(x)}$. To solve such a fractional problem one can reduce it to a sequence of calls to an oracle that provides the yes/no answer to the λ-question:

Is there a feasible solution $x \in F$ such that $f(x) - \lambda g(x) > 0$?

The answer to this question is given by solving the optimization problem

$$
\max_{x \in F} f(x) - \lambda g(x).
$$

The answer is yes if the maximum value is greater than 0, and no otherwise. If the answer to the λ-question, for maximization, is yes, then the optimal ratio has value greater than $\lambda$. Otherwise, the optimal ratio is less than or equal to $\lambda$. If the maximum value is exactly 0 then the optimum has been found, and the optimal ratio is $\lambda$. Note that this maximization problem is still an IPM problem since adding any term that is linear in the x variables retains the form of the s-excess problem.

The λ-question problem is also the Lagrangean relaxation of the budget constraint for the budgeted IPM problem, $\max_{x \in F} \{ f(x) \mid g(x) \le B \}$.
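The role of the sign of the linearized optimum can be illustrated on the maximum density subgraph problem. In this sketch the graph, the choice $f(S) = C(S,S)$ and $g(S) = |S|$, and the brute-force oracle `lam_question` are all assumptions made for illustration; in the framework above the oracle would be a minimum cut computation.

```python
from itertools import combinations

# Hypothetical graph; f is the weight inside S and g is the size of S.
W = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0, (2, 3): 4.0}
V = [0, 1, 2, 3]

def f(S):
    return sum(w for (i, j), w in W.items() if i in S and j in S)

def g(S):
    return len(S)

# Nonempty subsets only, so that g(S) > 0 and the ratio is defined.
CANDIDATES = [frozenset(c) for r in range(1, len(V) + 1) for c in combinations(V, r)]

def lam_question(lam):
    """Brute-force stand-in for the oracle: the value of max f(S) - lam * g(S)."""
    return max(f(S) - lam * g(S) for S in CANDIDATES)

opt_ratio = max(f(S) / g(S) for S in CANDIDATES)

below = lam_question(opt_ratio - 0.1)   # positive: a better ratio than lam exists
at_opt = lam_question(opt_ratio)        # zero: lam is the optimal ratio
above = lam_question(opt_ratio + 0.1)   # negative: no solution reaches ratio lam
```

On this instance the optimal ratio is attained by the whole node set, and the three probes show the sign pattern that drives a search over $\lambda$.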

Since the λ-question is formulated as IPM (to make it simpler assume the variables are binary, though everything applies to the general integer case), the x-variables of the problem correspond to nodes in the associated graph that are set to 1 if they belong to the source set, and 0 if they belong to the sink set. Furthermore, the terms that depend on λ appear only in the coefficients of the x variables, and therefore the associated flow graph is parametric.

An s,t-graph with source and sink adjacent capacities that depend on a parameter is said to be a parametric flow graph if source adjacent capacities are monotone nondecreasing in λ, and sink adjacent capacities monotone nonincreasing in λ (or vice versa). A parametric flow algorithm solves the maximum flow and minimum cut on a parametric flow graph, for all values of the parameter, in the complexity of a single max-flow, or min-cut. There are two parametric flow algorithms known that solve the problem, for all values of λ, in strongly polynomial time in the complexity of a single cut, $O(mn \log \frac{n^2}{m})$, for $m$ the number of edges in the graph, and $n$ the number of nodes (Gallo et al., 1989; Hochbaum, 2008); the improved run time of the latter is given in (Hochbaum and Orlin, 2013).

First we show the properties of the minimum cut

function of λ in a parametric ﬂow network.

Figure 1: The cut capacity as a function of λ in a parametric

cut.

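Definition 1. A function $C(\lambda)$ is breakpoint-concave if for $\lambda_1 < \lambda_2$, $C_1(\lambda) - C_2(\lambda)$ is a monotone nondecreasing function of $\lambda$, where $C_k(\lambda)$ denotes the capacity, as a function of $\lambda$, of the cut that is minimum at $\lambda_k$.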

Recall that in a parametric ﬂow network the

source adjacent capacities are monotone non-

decreasing and the sink adjacent capacities are mono-

tone non-increasing.

Lemma 1. In a parametric graph, the cut capacity function $C(\lambda)$ is breakpoint-concave.

Proof: By the definition, we need to show that for $\lambda_1 < \lambda_2$, $C_1(\lambda) - C_2(\lambda)$ is a monotone nondecreasing function of $\lambda$.

Let $(\{s\} \cup S_1, \{t\} \cup T_1)$, $(\{s\} \cup S_2, \{t\} \cup T_2)$ be the cuts corresponding to $\lambda_1$ and $\lambda_2$ respectively. Because of the nestedness property, $S_1 \subseteq S_2$ and $T_1 \supseteq T_2$.

$$
\begin{aligned}
C_1(\lambda) - C_2(\lambda) &= C(S_1,T_1) - C(S_2,T_2) \\
&\quad + C(\{s\},T_1)(\lambda) - C(\{s\},T_2)(\lambda) \\
&\quad + C(S_1,\{t\})(\lambda) - C(S_2,\{t\})(\lambda) \\
&= K_{1,2} + C(\{s\},T_1 \setminus T_2)(\lambda) - C(S_2 \setminus S_1,\{t\})(\lambda).
\end{aligned}
$$

Here $K_{1,2} = C(S_1,T_1) - C(S_2,T_2)$ is a constant independent of $\lambda$. $C(\{s\},T_1 \setminus T_2)(\lambda)$ is a monotone nondecreasing function, since it is a sum of capacities adjacent to the source, and $C(S_2 \setminus S_1,\{t\})(\lambda)$ is a monotone nonincreasing function, since it is the sum of capacities adjacent to the sink. Therefore the difference between these two terms is monotone nondecreasing.
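Two related properties of parametric cuts, nested source sets and a concave minimum cut value as a function of $\lambda$ (as sketched in Figure 1), can be observed by enumeration on a small example. Everything in the sketch below is an assumption made for illustration: the three-node graph, the linear source capacities `lam * p[i]`, the constant sink capacities `b[i]`, and the tie-breaking rule that prefers the smallest source set. It is not the parametric flow algorithm itself.

```python
from itertools import combinations

# Hypothetical parametric graph (all names and numbers are illustrative).
# Source arc (s, i) has capacity lam * p[i]   (nondecreasing in lam);
# sink arc (i, t) has constant capacity b[i]  (trivially nonincreasing in lam);
# inner arc (i, j) has capacity u[(i, j)].
p = {1: 2, 2: 1, 3: 3}
b = {1: 3, 2: 4, 3: 5}
u = {(1, 2): 1, (2, 3): 2, (3, 1): 1}
V = sorted(p)
SUBSETS = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def capacity(S, lam):
    """Capacity of the cut ({s} + S, {t} + T) at parameter value lam."""
    source = sum(lam * p[i] for i in V if i not in S)
    sink = sum(b[i] for i in S)
    inner = sum(c for (i, j), c in u.items() if i in S and j not in S)
    return source + sink + inner

def min_cut(lam):
    """Minimum cut by enumeration; ties go to the smallest source set."""
    S = min(SUBSETS, key=lambda T: (capacity(T, lam), len(T)))
    return S, capacity(S, lam)

lams = list(range(0, 7))
cuts = [min_cut(lam) for lam in lams]
source_sets = [S for S, _ in cuts]
values = [c for _, c in cuts]

# Source sets only grow as lam increases.
nested = all(a <= nxt for a, nxt in zip(source_sets, source_sets[1:]))
# Discrete concavity of the minimum cut value on the uniform grid of lam values.
concave = all(
    values[k - 1] + values[k + 1] <= 2 * values[k] for k in range(1, len(values) - 1)
)
```

Each fixed cut has a capacity that is linear in $\lambda$ here, so the minimum over all cuts is a piecewise linear concave function, and the source set changes only at its breakpoints.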

Figure 2: The concave envelope, the breakpoints and the ratio maximizing solution. (Horizontal axis: budget; vertical axis: benefit.)

We are interested next in the link between $\max_{x \in F} f(x) - \lambda g(x)$ and the efficient frontier of the solutions to $\max_{x \in F} \{ f(x) \mid g(x) \le B \}$: Suppose that we graph the optimal solution $x_B$ to the problem $x_B = \arg\max_{x \in F} \{ f(x) \mid g(x) \le B \}$, with the horizontal axis the value of the budgets $B$ and the vertical axis the value of the objective $f(x_B)$. We will refer to $B$ as the “budget” of $x_B$ and to $f(x_B)$ as the “benefit” of $x_B$.

Consider the intersection of all the lines that have the entire collection of optimal solutions below them. This upper envelope is concave piecewise linear and the points at which the line segment changes, to a lower slope line, are called breakpoints, see Figure 2. We note that the first breakpoint is also the optimal solution to the ratio problem, $\max_{x \in F} \frac{f(x)}{g(x)}$.
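The envelope construction can be sketched directly from a list of (budget, benefit) points. The sketch below is only a geometric illustration: the points are made up, the envelope is computed with a standard upper-hull scan rather than with parametric cuts, and the function names `upper_envelope` and `envelope_value` are hypothetical.

```python
def upper_envelope(points):
    """Breakpoints of the upper concave envelope of (budget, benefit) points."""
    hull = []
    for p in sorted(set(points)):
        # Drop the last breakpoint while it lies on or below the segment
        # joining the breakpoint before it to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def envelope_value(hull, budget):
    """Upper bound on the benefit at a given budget, by linear interpolation."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= budget <= x2:
            return y1 + (y2 - y1) * (budget - x1) / (x2 - x1)
    raise ValueError("budget outside the range of the envelope")

# Hypothetical best benefit for each budget 0..5 (origin included).
points = [(0, 0), (1, 1), (2, 5), (3, 6), (4, 9), (5, 10)]
hull = upper_envelope(points)
bound_at_3 = envelope_value(hull, 3)

# The first breakpoint after the origin has the best benefit-to-budget ratio.
first_breakpoint = hull[1]
best_ratio_point = max((p for p in points if p[0] > 0), key=lambda p: p[1] / p[0])
```

At budget 3 the envelope gives an upper bound that is strictly above the benefit of the point at that budget, which is the situation in which the breakpoints algorithm has to work from neighboring breakpoints.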

Let $x_0 \in F$ be a feasible solution, and let $\lambda_0 = \frac{f(x_0)}{g(x_0)}$. We claim that $\max_{x \in F} f(x) - \lambda_0 g(x)$ provides the tangent line to the concave envelope with the slope $\lambda_0$ at a budget $\le g(x_0)$.

Lemma 3. $\max_{x \in F} f(x) - \lambda_0 g(x)$ provides the tangent line to the concave envelope with the slope $\lambda_0$ at a budget $\le g(x_0)$.

Figure 3: Identifying a breakpoint with $\lambda_0$ subgradient. (Horizontal axis: budget; vertical axis: benefit; the tangent line has slope $\lambda_0$.)

Proof. Let $-\Delta$ be the intercept of such a line, of slope $\lambda_0$, on the horizontal axis, as in Figure 3. To find the point of the tangent we want to maximize $\Delta$. Since the line passes through the point $(g(x), f(x))$, we have $f(x) = \lambda_0 (g(x) + \Delta)$, and it follows that $\Delta = \frac{1}{\lambda_0} [ f(x) - \lambda_0 g(x) ]$.

Therefore, maximizing $\Delta$ is equivalent to $\max_{x \in F} f(x) - \lambda_0 g(x)$. For such $\Delta^*$ the line of slope $\lambda_0$ with horizontal intercept $-\Delta^*$ lies above all feasible solutions and is tangential to the concave envelope at breakpoint $x_1$, where $x_1 = \arg\max_{x \in F} f(x) - \lambda_0 g(x)$. $x_1$ is a breakpoint with a left subgradient equal to $\lambda_\ell$ and right subgradient equal to $\lambda_r$, such that $\lambda_\ell \ge \lambda_0 \ge \lambda_r$.

For the convex envelope corresponding to the minimization problem, the tangent is found at a breakpoint to the right, at a bigger budget, of $g(x_0)$.

The proof of Lemma 3 leads to an iterative procedure for ratio improvement. Consider the maximization case: From solution $x_0$ we derive the breakpoint solution $x_1$ with the left subgradient $\lambda_\ell$. When resolving, for $\lambda_\ell$, $\max_{x \in F} f(x) - \lambda_\ell g(x)$, the optimal solution is an adjacent breakpoint on the left, say $x_2$, that has $\lambda_\ell$ as a right derivative. Also, the nestedness property implies that $x_2 \le x_1$. Therefore, the procedure will scan all the breakpoints, one at a time, till it reaches the maximum “density” point, $x^*$, that maximizes $\max_{x \in F} \frac{f(x)}{g(x)}$. Figure 2 illustrates this breakpoint, which is the leftmost of all breakpoints in the concave envelope. This process leads to the incremental parametric cut procedure described in the next section.

5 THE INCREMENTAL PARAMETRIC CUT AND CLUSTER IMPROVEMENT

In many applications the clustering problem has an NP-hard version that has an additional budget constraint on the size or weight of the selected set. Consider for instance the ratio region problem, which is a relaxation of the NP-hard graph expander problem, where the minimization of the ratio is constrained by the requirement that the size of the set is at most half the number of nodes in the graph. A common heuristic method to address this is cluster improvement. One derives an initial solution that satisfies the additional constraint, and then applies the respective ratio optimization to find a subset that optimizes the ratio within this cluster. Although one can solve for the optimal ratio within the subset with the same parametric cut algorithm, after fixing the values of the out-of-initial-cluster nodes to be out of the cluster, there is another approach that was used, for instance, in (Lang and Rao, 2004).

PROCEDURE (MAXIMUM) RATIO IMPROVEMENT ($f()$, $g()$, $x_0 \in F$).

Step 0: Initialize $k = 0$.

Step 1: $\lambda_k = \frac{f(x_k)}{g(x_k)}$. {This step can be implemented extra efficiently adding linear time to all iterations.}

Step 2: Solve, with a min-cut algorithm or procedure incremental-para, ratio($\lambda_k$) $= \max_{x \in F} f(x) - \lambda_k g(x)$, and let $x_{k+1} = \arg\max_{x \in F} f(x) - \lambda_k g(x)$.

Step 3: If ratio($\lambda_k$) $= 0$ stop. Output $x^* = x_k$. Else, continue.

Step 4: {ratio($\lambda_k$) $> 0$} Let $k := k + 1$. Go to step 1.
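The steps of the procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the incremental-para procedure: the graph is made up, $f(S) = C(S,S)$ and $g(S) = |S|$ (maximum density subgraph), and the function `oracle` solves the linearized problem by enumeration where the procedure would call a minimum cut algorithm. The stopping test uses a small tolerance `tol` in place of an exact comparison with 0.

```python
from itertools import combinations

# Hypothetical graph; f is the weight inside S and g is the size of S.
W = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0, (2, 3): 4.0, (3, 4): 0.5}
V = [0, 1, 2, 3, 4]

def f(S):
    return sum(w for (i, j), w in W.items() if i in S and j in S)

def g(S):
    return len(S)

SUBSETS = [frozenset(c) for r in range(1, len(V) + 1) for c in combinations(V, r)]

def oracle(lam):
    """Brute-force stand-in for the min-cut step: maximize f(S) - lam * g(S)."""
    best = max(SUBSETS, key=lambda S: (f(S) - lam * g(S), -len(S)))
    return best, f(best) - lam * g(best)

def ratio_improvement(x0, tol=1e-12):
    """Iterate lam_k = f(x_k) / g(x_k) until the linearized optimum is 0."""
    x, trace = frozenset(x0), []
    while True:
        lam = f(x) / g(x)                 # Step 1
        trace.append((set(x), lam))
        nxt, value = oracle(lam)          # Step 2
        if value <= tol:                  # Step 3: ratio(lam_k) = 0
            return x, lam, trace
        x = nxt                           # Step 4: strictly better ratio found

best, lam, trace = ratio_improvement(V)
```

Starting from the whole node set, the run on this instance takes one improvement step and stops at a four-node set of density 2.5; every step with a positive linearized optimum strictly increases the ratio, which is why the loop terminates.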

To prove validity we need to show that ratio($\lambda_k$) $\ge 0$ (because there is already a feasible solution of the value $\lambda_k$, we know that the optimal ratio value can only be better than $\lambda_k$), and that the solution is optimal among those that have their support (valued 1 variables) contained in the support of $x_k$.

The second claim follows from the fact that the problem is monotone IP3, and is therefore solved with a min cut procedure. We also need to assume that $g(x)$ is a sum that is linear in the variables. Therefore, when multiplied by $\lambda$, the source adjacent capacities and the sink adjacent capacities are monotone in $\lambda$ in opposite directions, and therefore the source sets are nested. That is, for the optimal ratio value $\lambda^* = \frac{f(x^*)}{g(x^*)}$, with $\lambda^* > \lambda_k$, it is guaranteed that the optimal solution set (the set of nodes with variable value = 1) is a strict subset of the set of nodes corresponding to $x_k$.

While this procedure was used until now with each iteration requiring the running time of one minimum cut procedure, the authors adapted it to procedure incremental-para that uses the “continuation” idea of the HPF parametric cut procedure (Chandran and Hochbaum, ). With this procedure the values of the parameter, computed at each iteration, are used to determine the capacities of the source and the sink adjacent arcs during the ratio improvement algorithm when a new solution is found. The number of calls to the procedure is the number of breakpoints traversed till it reaches the leftmost breakpoint. The complexity of procedure incremental-para, however, is that of a single minimum cut procedure on the respective graph, e.g. $O(mn \log \frac{n^2}{m})$.

6 CONCLUSIONS

We present here a general unifying framework for any NP-hard problem that can be formulated as a monotone integer programming problem with a budget constraint (budgeted IPM problem). This framework provides powerful algorithmic methods that work efficiently and deliver very high quality solutions. One example illustrated here is the text summarization problem, for which we introduce a new model that is a budgeted IPM, potentially leading to more efficient and scalable algorithms.

We study here the concave envelope of the Lagrangian relaxation of the budget constraint, and demonstrate a new parametric cut procedure that identifies, consecutively, all the breakpoints. This incremental parametric cut procedure is more easily implemented than the general parametric cut procedure, and reduces the running time on average.

ACKNOWLEDGEMENTS

This research was supported in part by AI Institute NSF Award 2112533.

REFERENCES

Carbonell, J. and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336.

Chaillou, P., Hansen, P., and Mahieu, Y. (1989). Best network flow bounds for the quadratic knapsack problem. In Combinatorial Optimization: Lectures given at the 3rd Session of the Centro Internazionale Matematico Estivo (CIME) held at Como, Italy, August 25–September 2, 1986, pages 225–235. Springer.

Chandran, B. and Hochbaum, D. S. Pseudoflow parametric maximum flow solver version 1.0. https://riot.ieor.berkeley.edu/Applications/Pseudoflow/parametric.html. Accessed: 2023-04-30.

Cheeger, J. (1970). A lower bound for the smallest eigenvalue of the Laplacian. In Problems in Analysis (Papers Dedicated to Salomon Bochner, 1969).

Gallo, G., Grigoriadis, M. D., and Tarjan, R. E. (1989). A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55.

Hochbaum, D. S. (2002). Solving integer programs over monotone inequalities in three variables: A framework for half integrality and good approximations. European Journal of Operational Research, 140(2):291–321.

Hochbaum, D. S. (2008). The pseudoflow algorithm: A new algorithm for the maximum-flow problem. Operations Research, 56(4):992–1009.

Hochbaum, D. S. (2010). Polynomial time algorithms for ratio regions and a variant of normalized cut. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):889–898.

Hochbaum, D. S. and Fishbain, B. (2011). Nuclear threat detection with mobile distributed sensor networks. Annals of Operations Research, 187:45–63.

Hochbaum, D. S., Liu, Z., and Goldschmidt, O. (2023). A breakpoints based method for the maximum diversity and dispersion problems. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23), pages 189–200. SIAM.

Hochbaum, D. S. and Orlin, J. B. (2013). Simplifications and speedups of the pseudoflow algorithm. Networks, 61(1):40–57.

Hochbaum, D. S. and Singh, V. (2009). An efficient algorithm for co-segmentation. In 2009 IEEE 12th International Conference on Computer Vision, pages 269–276. IEEE.

Lang, K. and Rao, S. (2004). A flow-based method for improving the expansion or conductance of graph cuts. In Integer Programming and Combinatorial Optimization: 10th International IPCO Conference, New York, NY, USA, June 7-11, 2004. Proceedings 10, pages 325–337. Springer.

Lin, H. and Bilmes, J. (2010). Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920.

Pisinger, D. (2006). Upper bounds and exact algorithms for p-dispersion problems. Computers & Operations Research, 33(5):1380–1398.

Sharon, E., Galun, M., Sharon, D., Basri, R., and Brandt, A. (2006). Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104):810–813.

Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

Spiers, S., Bui, H. T., and Loxton, R. (2023). An exact cutting plane method for the Euclidean max-sum diversity problem. European Journal of Operational Research.

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval
