Unified New Techniques for NP-Hard Budgeted Problems with
Applications in Team Collaboration, Pattern Recognition, Document
Summarization, Community Detection and Imaging
Dorit S. Hochbaum (ORCID: https://orcid.org/0000-0002-2498-0512)
Department of Industrial Engineering and Operations Research, University of California, Berkeley, U.S.A.
Keywords:
Parametric Flow, Maximum Diversity, Quadratic Knapsack, Efficient Frontier, Text Summarization.
Abstract:
This paper introduces new techniques for NP-hard problems that can be formulated as monotone integer programming (IPM) with a budget constraint, referred to as “budgeted IPM”. Problems of this type have diverse applications, including maximizing team collaboration, the maximum diversity problem, facility dispersion, threat detection, minimizing conductance, clustering, and pattern recognition.
We present a unified framework for effective algorithms for budgeted IPM problems based on the Lagrangean relaxation of the budget constraint. It is shown that all optimal solutions for all values of the Lagrange multiplier are generated very efficiently, and that the piecewise linear concave envelope (convex, for minimization problems) of these solutions has breakpoints that are optimal solutions for the respective budgets. This is used to derive high quality upper and lower bounds for budgets that do not correspond to breakpoints. We show that the weight “perturbation” concept, which was successful in enhancing the number and distribution of breakpoints for the maximum diversity problem, is applicable to all these problems. Furthermore, the insights derived from this efficient frontier of solutions lead to the result that all the respective ratio problems have a solution at the “first” breakpoint, which generalizes the concept of maximum density subgraph.
1 INTRODUCTION
We explore here NP-hard problems that can be formulated as monotone integer programs with an additional budget constraint. Monotone integer programming problems (IPM) are solvable in polynomial time as a minimum cut on an associated graph (Hochbaum, 2002). The addition of the budget constraint renders these problems NP-hard. Among such problems are the quadratic knapsack, the minimum conductance problem, the facility dispersion problem, the ratio problem, image segmentation problems and others. The relationship of IPM problems to the minimum cut problem is crucial, as it allows us to solve the problem with the budget constraint relaxed, parametrically and very efficiently, and to generate the concave envelope, which has two desirable properties: the breakpoints give optimal solutions, and the solutions at the sequence of breakpoints are nested.
Special cases of IPM with budget constraint, the
maximum diversity and the facility dispersion prob-
lems, were recently addressed in (Hochbaum et al.,
2023), by efficiently solving the Lagrangean relaxation of the budget constraint for all values of the Lagrange multiplier. When the optimal objective value is plotted against each budget value, the intersection of all lines that lie above all solutions forms an upper envelope that is concave and piecewise linear. We refer to this piecewise linear function, which maps the budget to an upper bound on the value of the objective, as the concave envelope. Because IPM problems are minimum cut problems, the breakpoints in the concave envelope correspond to solutions that are optimal and nested. This concave envelope provides an upper bound for the solutions for each budget, and is also used to generate high quality feasible solutions as lower bounds (for maximization). The feasible solutions are generated with the breakpoints algorithm, which takes breakpoints whose budget values are close to the input budget and appends or removes elements using a fast heuristic (Hochbaum et al., 2023). Since increased density of the breakpoints enhances the quality of the solutions, it was shown that a perturbation method on the utility values leads to tighter lower bounds (for maximization). This resulted in the ability to solve the problem, even for small budget values, which are particularly challenging, either to optimality or very close to optimality. All this is done within a small fraction of the running time required by competing approaches, such as an integer programming solver or state-of-the-art meta-heuristics.

Hochbaum, D.
Unified New Techniques for NP-Hard Budgeted Problems with Applications in Team Collaboration, Pattern Recognition, Document Summarization, Community Detection and Imaging.
DOI: 10.5220/0012207200003598
In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2023) - Volume 1: KDIR, pages 365-372
ISBN: 978-989-758-671-2; ISSN: 2184-3228
Copyright © 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

The Lagrangean relaxation of the budget con-
straint has been previously used in the context of the
quadratic knapsack problem, most recently by (Spiers
et al., 2023). The quadratic knapsack, which is iden-
tical to the maximum diversity and maximum facil-
ity dispersion problems, can be formulated as IPM
with a budget constraint. However, the methods developed for the quadratic knapsack were ad hoc and suitable for this case only, and the general structure has not been recognized until now. In addition, all the literature that uses this approach, going back to (Chaillou et al., 1989; Pisinger, 2006), can solve the problems to optimality for at most a few hundred variables, and only for selected, relatively high, values of the budget (accommodating 10%-50% of the variables), whereas the harder problems, which are more prevalent in applications, are those with smaller budgets. Moreover, the running times of the current approaches do not scale well. Indeed, in the reported results of (Spiers et al., 2023) the largest instances, from the GKD benchmark, contain up to 2000 nodes. These instances were shown to be particularly easy in (Hochbaum et al., 2023). In contrast, in (Hochbaum et al., 2023), new insights on how to deal with harder problems, using perturbation, provided solutions that are often optimal or very close to optimal, within a tiny fraction of the running time of competing approaches, including integer programming software.
Here we demonstrate that the breakpoints algorithm is applicable to a vast collection of hard problems, with the potential of providing optimal, or very close to optimal, solutions. One example explored here is the text summarization problem, also known as multi-document summarization, which is modeled, under the MMR criterion, as combining two goals: maximizing the sum of dissimilarities in the selected set (of sentences), to enhance the diversity of the selected set and eliminate redundancy, and maximizing the similarities between the selected set and its complement. This combined objective is NP-hard to optimize even without the budget constraint on the total size of the sentences selected. A straightforward formulation of this optimization problem, given in (Lin and Bilmes, 2010), is reported to be solved for an instance with 178 sentences in 17 hours, using integer programming software. Our approach is to model the problem as budgeted IPM simply by replacing the similarity weights by dissimilarity weights, e.g. by taking the reciprocal of the similarities. Once the problem is modeled as IPM with a budget constraint, the framework presented here can utilize the concave envelope to solve the problem effectively, and with a highly scalable algorithm. This is discussed in detail in Section 3.
There is a close relationship between ratio problems and budgeted IPM problems. This relationship is reflected in the concave envelope, where the optimal solution to the respective ratio problem is the first breakpoint, the one with the smallest budget value, which corresponds to a generalization of the maximum density subgraph problem.
Our contributions here include:
• Introducing a large class of NP-hard problems that are formulated as monotone integer programming with a budget constraint: budgeted IPM.
• Demonstrating that for all budgeted IPM problems the concave envelope (for maximization, convex for minimization) related to the Lagrangean relaxation of the budget constraint is constructed as the output of a (parametric) minimum cut procedure on a respective graph.
• Showing that the breakpoints in the concave envelope are optimal solutions for the respective budget values, and correspond to nested solutions.
• Showing that the perturbation concept of (Hochbaum et al., 2023) applies to the class of budgeted IPM problems, and can increase the number of breakpoints and enhance their distribution around the budget values of interest.
• Establishing the relationship of budgeted IPM problems to the respective IPM ratio problems, which are polynomial time solvable with a parametric cut procedure, showing that the first (leftmost for maximization) breakpoint solves the ratio problem optimally. The first breakpoint is shown to generalize the concept of maximum density subgraph.
• Introducing the new procedure incremental-para, which solves IPM ratio problems, with a given initial feasible solution, in the complexity of a single minimum cut procedure and generates sequentially all breakpoints.
• Showing how all budgeted IPM problems are amenable to the breakpoints algorithm of (Hochbaum et al., 2023), which bodes well for the chances of being able to use a scalable algorithm that delivers high quality solutions.
• Demonstrating a new formulation for the text summarization problem that renders it a budgeted IPM problem, with the potential of new scalable methods for the problem.
KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval
The paper is structured as follows: the following Sec-
tion 2 provides the basic graph notation used in the
rest of the paper, the formulation of a general IPM
problem, and the construction of the associated graph
and several examples of IPM problems and ratio prob-
lems that are IPM. Section 3 introduces the text sum-
marization problem in the known form of non-IPM
problem, and shows that it can be modeled as a bud-
geted IPM problem. The key insights to the concave
envelope and the implications for budgeted IPM, and
ratio IPM problem are presented in Section 4. Sec-
tion 5 presents a new form of fully parametric cut
procedure, that is more efficient than the parametric
cut approach in that it only computes the solutions at
the breakpoints. Section 6 includes conclusions and
pointers to future research.
It is noted that although most of the results are
presented in the setting of maximization problems,
they all apply analogously to minimization problems
as well.
2 NOTATION AND IPM
PROBLEMS
Firstly, we introduce the graph notation used here: Let the input graph be an undirected graph $G = (V,E)$ with edge weights $w_{ij}$ for $[i,j] \in E$. A bipartition of the graph is called a cut, $(S,\bar{S}) = \{[i,j] \in E \mid i \in S, j \in \bar{S}\}$, where $\bar{S} = V \setminus S$. Given two subsets of nodes, $A \subseteq V$, $B \subseteq V$, let the sum of weights of the edges with one endpoint in $A$ and the other in $B$ be $C(A,B) = \sum_{i \in A, j \in B} w_{ij}$. For the cut $(S,\bar{S})$, the capacity of this cut is $C(S,\bar{S})$, and the sum of weights inside the set $A$ is $C(A,A) = \sum_{i,j \in A} w_{ij}$.
If the edges have two sets of weights, these will be denoted by $w^1_{ij}$ and $w^2_{ij}$. For inputs with two sets of edge weights we let $C_1(A,B) = \sum_{i \in A, j \in B} w^1_{ij}$ and $C_2(A,B) = \sum_{i \in A, j \in B} w^2_{ij}$.
We denote by $d_i$ the weighted degree of node $i$ in $G$: $d_i = \sum_{j \mid [i,j] \in E} w_{ij}$. The sum of the weighted degrees of nodes in a set $H \subseteq V$ is $d(H) = \sum_{i \in H} d_i$, and is also referred to as the volume of the set $H$. We also allow nodes to be associated with arbitrary weights, which are denoted by $q_i$ for $i \in V$, and the sum of weights of nodes in the set $H$ is $q(H) = \sum_{i \in H} q_i$.
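The notation above can be made concrete with a minimal Python sketch. The toy graph `EDGES` and the helper names `C` and `d` are hypothetical choices for illustration, and the sketch assumes that each undirected edge is counted once in $C(A,B)$. The last line evaluates the three quantities that appear in the identity $2C(S,S) + C(S,\bar{S}) = d(S)$ used in Section 3.1.

```python
# Hypothetical toy graph: undirected edges [i, j] with weights w_ij.
EDGES = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0, (2, 3): 4.0}
V = {0, 1, 2, 3}


def C(A, B):
    """Sum of weights of edges with one endpoint in A and the other in B.

    Assumption: each undirected edge is counted once, also when A == B.
    """
    return sum(
        w for (i, j), w in EDGES.items()
        if (i in A and j in B) or (j in A and i in B)
    )


def d(H):
    """Volume of H: the sum of the weighted degrees d_i over nodes i in H."""
    return sum(w for (i, j), w in EDGES.items() for v in (i, j) if v in H)


S = {0, 1}
S_bar = V - S
print(C(S, S), C(S, S_bar), d(S))
```

Under these conventions an edge inside $S$ contributes twice to the volume and a cut edge contributes once, which is why the identity holds for every subset.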
Monotone integer programming problems,
IPM, also referred to as Monotone IP3 problems, are
integer programming problems on at most 3 vari-
ables per constraint, where two of the variables, the
x-variables, appear with opposite sign coefficients,
and a third variable, a z-variable, if included, can ap-
pear in at most one constraint. The coefficient of the
third variable, in the objective function, must be non-
negative for minimization problems, or non-positive
for maximization problems. The formulation of a general maximization (monotone) IP3, for a set of $n$ $x$-variables, those that can appear in multiple constraints, and a set of constraints involving a collection of pairs of variables $A$ and a respective set of $z$-variables, is
\[
\begin{array}{lll}
\text{(IPM)} & \max & \sum_{i=1}^{n} w_i x_i - \sum_{(i,j) \in A} u_{ij} z_{ij} \\
 & \text{s.t.} & a_{ij} x_i - b_{ij} x_j \le c_{ij} + z_{ij} \quad \forall (i,j) \in A \\
 & & \ell_i \le x_i \le u_i, \ \text{integer} \quad \forall i \in V \\
 & & z_{ij} \ge 0, \ \text{integer} \quad \forall (i,j) \in A.
\end{array}
\]
Here all $a_{ij}$, $b_{ij}$ and $u_{ij}$ are non-negative. A constraint may appear in the form $a_{pq} x_p - b_{pq} x_q \le c_{pq}$, without a third variable. In that case, for the sake of streamlined presentation, we can assume that the coefficient of $z_{pq}$ is $u_{pq} = \infty$.
(Hochbaum, 2002) showed that any IPM can be written as the maximum s-excess problem in $U = \sum_{i=1}^{n} (u_i - \ell_i)$ binary variables. (Solving the problem (IPM) independently of $U$ was proved to be NP-hard.) The maximum s-excess problem is formulated as a binary optimization problem, where $x_i = 1$ iff node $i$ is in the optimal set $S$:
\[
\begin{array}{lll}
\text{(s-excess)} & \max & \sum_{j \in V} w_j x_j - \sum_{(i,j) \in A} u_{ij} z_{ij} \\
 & \text{subject to} & x_i - x_j \le z_{ij} \quad \text{for } (i,j) \in A \\
 & & x_j \ \text{binary} \quad j = 1, \ldots, n \\
 & & z_{ij} \ \text{binary} \quad (i,j) \in A.
\end{array}
\]
Although the constraints of the type $x_i - x_j \le 0$ do not appear explicitly in this formulation, these can be written as $x_i - x_j \le z_{ij}$ where the cost coefficient $u_{ij}$ of the respective variable $z_{ij}$ (and the capacity of the corresponding arc) is infinite. We now show that the maximum s-excess set in a graph $G$ is the source set of a minimum cut in an associated graph $G_{st}$, constructed as follows (Hochbaum, 2002): We add nodes $s$ and $t$ to the graph $G$, with an arc from $s$ to every positive weight node $i$, of capacity $u_{si} = w_i$, and an arc from every negative weight node $j$ to $t$ of capacity $u_{jt} = -w_j$. Let this added set of arcs, adjacent to $s$ and $t$ (source node and sink node respectively), be denoted by $A_{st}$. The arcs of $A$ each carry the capacity $u_{ij}$. The graph $G_{st}$ is then $(V \cup \{s,t\}, A \cup A_{st})$.
Lemma 1. $S^*$ is a set of maximum s-excess capacity in the original graph $G$ if and only if $S^*$ is the source set of a minimum $s,t$-cut in the associated graph $G_{st}$.
Proof. Let $V^+ \equiv \{i \in V \mid w_i > 0\}$, and let $V^- \equiv \{j \in V \mid w_j < 0\}$. Let $(s \cup S, t \cup T)$ be a minimum $s,t$-cut on $G_{st}$. Then the capacity of this cut is given by
\[
\begin{aligned}
C(s \cup S, t \cup T) &= \sum_{(s,i) \in A_{st},\, i \in T} u_{s,i} + \sum_{(j,t) \in A_{st},\, j \in S} u_{j,t} + \sum_{i \in S,\, j \in T} u_{ij} \\
&= \sum_{i \in T \cap V^+} w_i + \sum_{j \in S \cap V^-} (-w_j) + \sum_{i \in S,\, j \in T} u_{ij} \\
&= W^+ - \Bigl[ \sum_{j \in S} w_j - \sum_{i \in S,\, j \in T} u_{ij} \Bigr],
\end{aligned}
\]
where $W^+ = \sum_{i \in V^+} w_i$ is the sum of all positive weights in $G$, which is a constant. Therefore, minimizing $C(s \cup S, t \cup T)$ is equivalent to maximizing $\sum_{j \in S} w_j - \sum_{i \in S,\, j \in T} u_{ij}$, and we conclude that the source set of a minimum $s,t$-cut on $G_{st}$ is also a maximum s-excess set of $G$.
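The construction of $G_{st}$ and the statement of Lemma 1 can be checked on a small instance. The following is a minimal sketch, not the pseudoflow implementation used in the cited work: it assumes a plain Edmonds-Karp max-flow routine written here for self-containment, and the instance (`w`, `u`) and function names are hypothetical. The min-cut source set is compared with a brute-force maximum s-excess set.

```python
from collections import deque
from itertools import combinations


def min_cut_source_set(cap, s, t):
    """Edmonds-Karp max flow; returns (flow value, source set of a min cut)."""
    res = {a: dict(nbrs) for a, nbrs in cap.items()}
    for a, nbrs in cap.items():
        for b in nbrs:
            res.setdefault(b, {}).setdefault(a, 0.0)  # reverse residual arcs
    flow = 0.0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            a = queue.popleft()
            for b, c in res[a].items():
                if c > 1e-12 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if t not in parent:
            # Nodes still reachable from s form the source set of a min cut.
            return flow, set(parent) - {s}
        bottleneck, b = float("inf"), t
        while parent[b] is not None:
            bottleneck = min(bottleneck, res[parent[b]][b])
            b = parent[b]
        b = t
        while parent[b] is not None:
            a = parent[b]
            res[a][b] -= bottleneck
            res[b][a] += bottleneck
            b = a
        flow += bottleneck


def max_s_excess(w, u):
    """Build G_st as in the construction above and read S* off a min cut."""
    cap = {"s": {}, "t": {}}
    for i, wi in w.items():
        cap.setdefault(i, {})
        if wi > 0:
            cap["s"][i] = float(wi)    # arc (s, i) of capacity w_i
        elif wi < 0:
            cap[i]["t"] = float(-wi)   # arc (j, t) of capacity -w_j
    for (i, j), uij in u.items():
        cap[i][j] = float(uij)         # arcs of A keep capacity u_ij
    return min_cut_source_set(cap, "s", "t")


def excess(S, w, u):
    """s-excess of S: node weights in S minus capacities of arcs leaving S."""
    return sum(w[i] for i in S) - sum(
        c for (i, j), c in u.items() if i in S and j not in S)


w = {1: 4, 2: -3, 3: 2, 4: -1}            # hypothetical node weights
u = {(1, 2): 2, (3, 4): 5, (1, 3): 1}     # hypothetical arc capacities
flow, S_star = max_s_excess(w, u)
brute = max(excess(set(c), w, u)
            for r in range(len(w) + 1) for c in combinations(w, r))
print(S_star, excess(S_star, w, u), brute, flow)
```

On this instance the minimum cut value equals $W^+$ minus the maximum s-excess, matching the last line of the derivation in the proof.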
Examples of IPM Problems. For the following prob-
lems, the discovery of the IPM model for the problem
led to very fast, high quality solutions.
1. The co-segmentation problem: This is to iden-
tify the same feature in two distinct images. The
problem is modeled as increased similarity be-
tween histogram buckets of the foregrounds and
decreased binary MRF (Markov Random Field -
equivalent to distinctness of the image) in image
1 and in image 2 that determine the foregrounds
(co-segmentation), (Hochbaum and Singh, 2009).
2. Identifying nuclear threat alerts regions: The
model is to identify a region with increased num-
ber and intensity of alerts within; decreased length
of region boundary and decreased number of no-
alerts within the region, contributing to alert con-
centration. (Hochbaum and Fishbain, 2011).
There are many ratio problems that are IPM, but were not always recognized as such. As explained in Section 4, a ratio problem is IPM if the “linearized” λ-question is an IPM problem. One such problem, thought to be NP-hard (Sharon et al., 2006), is to find a subset in the graph that maximizes the ratio of the sum of similarities (edge weights) in the set, divided by the cut capacity separating the set from its complement: $\max_{S \subset V} \frac{C(S,S)}{C(S,\bar{S})}$. This problem was stated in (Sharon et al., 2006) to be equivalent to the normalized cut problem (Shi and Malik, 2000), which is NP-hard. However, recognizing that the problem is IPM led to a polynomial time algorithm (Hochbaum, 2010). This problem is also called HNC. A variant of HNC, $\max_{S \subset V} \frac{C_1(S,S)}{C_2(S,\bar{S})}$, where the set of weights used in the numerator is different from the set of weights used in the denominator, is also IPM (Hochbaum, 2010).
Another problem, related to the conductance problem, is to minimize the cut separating the set and its complement divided by the “volume” of the set: $\min_{S \subset V} \frac{C(S,\bar{S})}{d(S)}$. This problem is in fact equivalent to HNC, as shown in (Hochbaum, 2010). It is also a relaxation of conductance and the Cheeger constant (Cheeger, 1970), where the “budget” constraint $d(S) \le \frac{1}{2} d(V)$ is relaxed.
The well known maximum density subgraph problem, $\max_{S \subset V} \frac{C(S,S)}{d(S)}$, is IPM. We refer to it in its most general form, where the nodes can carry arbitrary weights $q_i$: $\max_{S \subset V} \frac{C(S,S)}{q(S)}$.
In all the above problems, the formulations have
binary variables and do not require the binarization
transformation of a general IPM.
3 TEXT SUMMARIZATION AS
BUDGETED IPM
The text summarization problem, aka, the multi-
document summarization problem has been studied
extensively, with heuristics and approximations. The
problem is modeled as the Maximal Marginal Rele-
vance (MMR) criterion, introduced in (Carbonell and
Goldstein, 1998), which strives to reduce redundancy
while maintaining query relevance in re-ranking re-
trieved documents and in selecting appropriate pas-
sages for text summarization. This MMR model
was formulated as integer programming in (Lin and
Bilmes, 2010), as shown next.
The formulation uses the following variables: Let $x_i$ be the binary variable which takes the value 1 if node (sentence) $i$ is selected, and 0 otherwise,
\[
x_i = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \bar{S}. \end{cases}
\]
$z_{ij} = 1$ if exactly one of $i$ and $j$ is in $S$; $y_{ij} = 1$ if both are in $S$. With these variables the following formulation of (MMR) $\max_{S \subset V} C(S,\bar{S}) - \alpha C(S,S)$ was given in (Lin and Bilmes, 2010):
(MMR) $\max \sum_{(i,j) \in E} w_{ij} z_{ij} - \alpha \sum_{[i,j] \in E} w_{ij} y_{ij}$
subject to $x_i - x_j \le z_{ij}$ for all $(i,j) \in E$
z
i j
x
i
1 + z
i j
x
j
$y_{ij} \le x_i$ for all $[i,j] \in E$
$y_{ij} \le x_j$, $x_i + x_j - 1 \le y_{ij}$
$x_j, z_{ij}, y_{ij}$ binary.
This formulation is not IPM since the “third” vari-
ables of some of the constraints, also appear in other
constraints. We maintain the spirit of this MMR
objective and change the weights to mean the dissimilarity between every pair of nodes. This results in a polynomial time solvable problem, since it is IPM,
\[
\text{(MMR*)} \quad \max_{S \subset V} C(S,S) - \beta C(S,\bar{S}).
\]
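This objective function is $f(x)$ where $x_i = 1$ if sentence $i$ is selected in the “summary” set $S$. Here $\beta$ corresponds to $1/\alpha$ in the maximization formulation. The problem formulation as IPM is:
\[
\begin{array}{lll}
\text{(MMR*)} & \max & \sum w_{ij} y_{ij} - \beta \sum w_{ij} z_{ij} \\
 & \text{subject to} & x_i - x_j \le z_{ij} \quad \text{for all } [i,j] \in E \\
 & & x_j - x_i \le z_{ji} \quad \text{for all } [i,j] \in E \\
 & & y_{ij} \le x_i \quad \text{for all } [i,j] \in E \\
 & & y_{ij} \le x_j \\
 & & x_j, z_{ij}, y_{ij} \ \text{binary}.
\end{array}
\]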
To verify the validity of the formulation notice that the objective function drives the values of $z_{ij}$ to be as small as possible, and the values of $y_{ij}$ to be as large as possible. With the constraints, $z_{ij}$ cannot be 0 unless both endpoints $i$ and $j$ are in the same set. On the other hand $y_{ij}$ cannot be equal to 1 unless both endpoints $i$ and $j$ are in $S$. In this formulation, both $x_i$ and $y_{ij}$ are the $x$-variables and there is a node for each in the associated flow graph. Next we show that there is a more compact formulation that has one node in the graph for each candidate sentence.
3.1 A Compact IPM MMR*
Formulation
We first need additional notation: Let the input graph be an undirected graph $G = (V,E)$ with edge weights $w_{ij}$ for $[i,j] \in E$. Let $d_i$ be the weighted degree of node $i$ in $G$, which is the sum of the weights of the edges adjacent to $i$: $d_i = \sum_{j \mid [i,j] \in E} w_{ij}$. For a subset of nodes $D \subseteq V$ let $d(D) = \sum_{i \in D} d_i$, which is also known as the “volume” of the set $D$.
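The compact formulation is presented as the following Lemma:
Lemma 2. Solving $\max_{S \subset V} C(S,S) - \beta C(S,\bar{S})$ is equivalent to solving $\min_{S \subset V} C(S,\bar{S}) - \frac{1}{1+2\beta} d(S)$.
Proof. Using the equality
\[
2C(S,S) + C(S,\bar{S}) = d(S) \qquad (1)
\]
we have
\[
\begin{aligned}
\max_{S \subset V} C(S,S) - \beta C(S,\bar{S}) &= \max_{S \subset V} \tfrac{1}{2} d(S) - \tfrac{1}{2} C(S,\bar{S}) - \beta C(S,\bar{S}) \\
&= -\tfrac{1}{2} \min_{S \subset V} \bigl[ (1+2\beta) C(S,\bar{S}) - d(S) \bigr].
\end{aligned}
\]
The latter is equivalent to $\min_{S \subset V} C(S,\bar{S}) - \frac{1}{1+2\beta} d(S)$.
In the formulation of $\min_{S \subset V} C(S,\bar{S}) - \frac{1}{1+2\beta} d(S)$ the variables $y_{ij}$ do not appear. Therefore the number of nodes in the respective graph is $n$, where $n$ is the number of sentences, or $x$-variables.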
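The equivalence in Lemma 2 can be checked numerically by brute force on a small graph. This is a minimal sketch under stated assumptions: the dissimilarity weights in `EDGES`, the value of `BETA`, and all function names are hypothetical, and no budget constraint is imposed, so the optimum here is simply the full node set. The point is only that the two objectives differ by the negative constant factor $-(1+2\beta)/2$ on every subset.

```python
from itertools import combinations

# Hypothetical dissimilarity weights on an undirected toy graph.
EDGES = {(0, 1): 3.0, (0, 2): 1.0, (1, 2): 2.0, (2, 3): 4.0, (1, 3): 1.5}
NODES = [0, 1, 2, 3]
BETA = 0.5


def inside(S):
    """C(S, S): total weight of edges with both endpoints in S."""
    return sum(w for (i, j), w in EDGES.items() if i in S and j in S)


def cut(S):
    """C(S, S-bar): total weight of edges with exactly one endpoint in S."""
    return sum(w for (i, j), w in EDGES.items() if (i in S) != (j in S))


def volume(S):
    """d(S): sum of weighted degrees of the nodes in S."""
    return sum(w for (i, j), w in EDGES.items() for v in (i, j) if v in S)


def mmr_star(S):
    """Objective of (MMR*), to be maximized."""
    return inside(S) - BETA * cut(S)


def compact(S):
    """Objective of the compact form of Lemma 2, to be minimized."""
    return cut(S) - volume(S) / (1 + 2 * BETA)


subsets = [frozenset(c) for r in range(len(NODES) + 1)
           for c in combinations(NODES, r)]

# mmr_star(S) = -(1 + 2*beta)/2 * compact(S) should hold for every subset S.
max_gap = max(abs(mmr_star(S) + 0.5 * (1 + 2 * BETA) * compact(S))
              for S in subsets)
best_max = max(subsets, key=mmr_star)
best_min = min(subsets, key=compact)
print(max_gap, sorted(best_max), sorted(best_min))
```

Since the scaling factor is negative and constant, a maximizer of one objective is a minimizer of the other, with or without an added budget constraint.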
4 LINEARIZING RATIOS AND
ENVELOPES
We consider a generic IPM problem $\max_{x \in F} f(x)$, for $x$ an integer vector. The entire discussion applies to minimization problems as well. Suppose there is a budget constraint of the form $g(x) \le B$. The Lagrangean relaxation approach is related to minimizing, or maximizing, a fractional (or, as it is sometimes called, geometric) objective function over a feasible region $F$, $\max_{x \in F} \frac{f(x)}{g(x)}$. To solve such a fractional problem one can reduce it to a sequence of calls to an oracle that provides the yes/no answer to the λ-question:
Is there a feasible solution $x \in F$ such that $f(x) - \lambda g(x) > 0$?
The answer to this question is given by solving the optimization problem
\[
\max_{x \in F} f(x) - \lambda g(x).
\]
The answer is yes if the maximum value is greater than 0, and no otherwise. If the answer to the λ-question, for maximization, is yes, then the optimal ratio has value greater than λ. Otherwise, the optimal ratio is less than or equal to λ. If the maximum value is exactly 0 then λ is the optimal ratio and the optimum has been found. Note that this maximization problem is still an IPM problem, since adding any term that is linear in the $x$ variables retains the form of the s-excess problem.
The λ-question problem is also the Lagrangean relaxation of the budget constraint for the budgeted IPM problem, $\max_{x \in F} \{ f(x) \mid g(x) \le B \}$.
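The link between the λ-question and the ratio problem can be spelled out in one chain of equivalences. The derivation below is a sketch that assumes $F$ is finite and $g(x) > 0$ for every $x \in F$ under consideration; this positivity assumption is not stated explicitly in the text.

```latex
\[
\max_{x \in F} \bigl[ f(x) - \lambda g(x) \bigr] > 0
\;\iff\; \exists\, x \in F:\ f(x) > \lambda g(x)
\;\iff\; \exists\, x \in F:\ \frac{f(x)}{g(x)} > \lambda
\;\iff\; \max_{x \in F} \frac{f(x)}{g(x)} > \lambda .
\]
```

In the same way, if the maximum equals 0 and is attained at $\hat{x}$, then $f(\hat{x})/g(\hat{x}) = \lambda$ while no feasible solution has a ratio above $\lambda$, so $\hat{x}$ solves the ratio problem.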
Since the λ-question is formulated as IPM (to make it simpler assume the variables are binary, though everything applies to the general integer case), the $x$-variables of the problem correspond to nodes in the associated graph that are set to 1 if they belong to the source set, and 0 if they belong to the sink set. Furthermore, the terms that depend on λ appear only in the coefficients of the $x$ variables, and therefore the associated flow graph is parametric.
An $s,t$-graph with source and sink adjacent capacities that depend on a parameter is said to be a parametric flow graph if source adjacent capacities are monotone nondecreasing in λ, and sink adjacent capacities are monotone nonincreasing in λ (or vice versa). A parametric flow algorithm solves the maximum flow and minimum cut problems on a parametric flow graph, for all values of the parameter, in the complexity of a single max-flow, or min-cut. There are two parametric flow algorithms known that solve the problem, for all values of λ, in strongly polynomial time, in the complexity of a single cut, $O(mn \log \frac{n^2}{m})$, for $m$ the number of edges in the graph and $n$ the number of nodes (Gallo et al., 1989; Hochbaum, 2008) (the improved run time of the latter is given in (Hochbaum and Orlin, 2013)).
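The nestedness of optimal solutions across values of the parameter can be observed on a tiny instance. The sketch below does not implement a parametric flow algorithm; it uses brute-force enumeration as a stand-in, with a hypothetical objective $f(S) - \lambda |S|$ where $f(S)$ is the edge weight inside $S$, and a hypothetical toy graph. Ties are broken toward the smallest set.

```python
from itertools import combinations

# Hypothetical toy graph: a heavy triangle {0, 1, 2} with a light tail 2-3-4.
EDGES = {(0, 1): 5, (0, 2): 5, (1, 2): 5, (2, 3): 1, (3, 4): 1}
NODES = range(5)


def best_set(lam):
    """Brute-force maximizer of f(S) - lam * |S| (smallest set on ties)."""
    best, best_val = frozenset(), 0.0
    for r in range(1, len(NODES) + 1):
        for c in combinations(NODES, r):
            S = frozenset(c)
            inside = sum(w for (i, j), w in EDGES.items()
                         if i in S and j in S)
            val = inside - lam * len(S)
            if val > best_val:
                best, best_val = S, val
    return best


lambdas = [6.0, 4.0, 0.5]   # decreasing values of the multiplier
solutions = [best_set(lam) for lam in lambdas]
for lam, S in zip(lambdas, solutions):
    print(lam, sorted(S))
```

In this example the optimal set grows as the multiplier decreases, and each solution contains the previous one, which is the nested structure that a parametric cut procedure exploits.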
First we show the properties of the minimum cut
function of λ in a parametric flow network.
Figure 1: The cut capacity as a function of λ in a parametric
cut.
Definition 1. A function $C(\lambda)$ is breakpoint-concave if for $\lambda_1 < \lambda_2$, $C_{\lambda_1}(\lambda) - C_{\lambda_2}(\lambda)$ is a monotone nondecreasing function of $\lambda$.
Recall that in a parametric flow network the
source adjacent capacities are monotone non-
decreasing and the sink adjacent capacities are mono-
tone non-increasing.
Lemma 1. In a parametric graph, the cut capacity
function C(λ) is breakpoint-concave.
Proof: By the definition, we need to show that for $\lambda_1 < \lambda_2$, $C_{\lambda_1}(\lambda) - C_{\lambda_2}(\lambda)$ is a monotone nondecreasing function of $\lambda$.
Let $(\{s\} \cup S_1, \{t\} \cup T_1)$, $(\{s\} \cup S_2, \{t\} \cup T_2)$ be the cuts corresponding to $\lambda_1$ and $\lambda_2$ respectively. Because of the nestedness property, $S_1 \subseteq S_2$ and $T_1 \supseteq T_2$.
\[
\begin{aligned}
C_{\lambda_1}(\lambda) - C_{\lambda_2}(\lambda) &= C(S_1,T_1) - C(S_2,T_2) \\
&\quad + C(\{s\},T_1)(\lambda) - C(\{s\},T_2)(\lambda) \\
&\quad + C(S_1,\{t\})(\lambda) - C(S_2,\{t\})(\lambda) \\
&= K_{12} + C(\{s\},T_1 \setminus T_2)(\lambda) - C(S_2 \setminus S_1,\{t\})(\lambda).
\end{aligned}
\]
$K_{12}$ is a constant independent of $\lambda$. $C(\{s\},T_1 \setminus T_2)(\lambda)$ is a monotone nondecreasing function, since it is a sum of capacities adjacent to the source, and $C(S_2 \setminus S_1,\{t\})(\lambda)$ is a monotone nonincreasing function, since it is the sum of capacities adjacent to the sink. Therefore the difference between these two terms is monotone nondecreasing.
Figure 2: The concave envelope, the breakpoints and the ratio maximizing solution. (Horizontal axis: Budget, with the budget value B marked; vertical axis: Benefit.)
We are interested next in the link between $\max_{x \in F} f(x) - \lambda g(x)$ and the efficient frontier of the solutions to $\max_{x \in F} \{ f(x) \mid g(x) \le B \}$: Suppose that we graph the optimal solution $x^B$ to the problem $x^B = \arg\max_{x \in F} \{ f(x) \mid g(x) \le B \}$, with the horizontal axis the value of the budget $B$ and the vertical axis the value of the objective $f(x^B)$. We will refer to $B$ as the “budget” of $x^B$ and to $f(x)$ as the “benefit” of $x$.
Consider the intersection of all the lines that have the entire collection of optimal solutions below them. This upper envelope is concave and piecewise linear, and the points at which the line segment changes to a line of lower slope are called breakpoints, see Figure 2. We note that the first breakpoint is also the optimal solution to the ratio problem, $\max_{x \in F} \frac{f(x)}{g(x)}$.
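The concave envelope and its breakpoints can be illustrated on a small instance. This is a minimal sketch under stated assumptions, not the parametric cut method of the paper: the budgeted optima are found by brute force, the benefit is taken to be the edge weight inside the selected set, the budget is the set size, and the graph and function names are hypothetical. The envelope is computed with a standard upper-hull scan, and the first breakpoint after the origin is compared with the brute-force optimal ratio.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy graph: a heavy triangle {0, 1, 2} with a light tail 2-3-4.
EDGES = {(0, 1): 5, (0, 2): 5, (1, 2): 5, (2, 3): 1, (3, 4): 1}
NODES = range(5)


def benefit(S):
    """f(x): total edge weight inside the selected set S."""
    return sum(w for (i, j), w in EDGES.items() if i in S and j in S)


def best_benefit(B):
    """Brute-force optimum of max{ f(x) | g(x) <= B } with g(x) = |S|."""
    return max(benefit(set(c))
               for r in range(B + 1) for c in combinations(NODES, r))


def concave_envelope(points):
    """Vertices of the upper concave envelope (monotone-chain upper hull)."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            cross = (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox)
            if cross < 0:      # strict right turn: hull[-1] is a breakpoint
                break
            hull.pop()         # hull[-1] lies on or below the chord
        hull.append(p)
    return hull


points = [(B, best_benefit(B)) for B in range(len(NODES) + 1)]
breakpoints = concave_envelope(points)
first = breakpoints[1]          # breakpoints[0] is the origin (empty set)
best_ratio = max(Fraction(benefit(set(c)), r)
                 for r in range(1, len(NODES) + 1)
                 for c in combinations(NODES, r))
print(points, breakpoints, best_ratio)
```

On this instance the budgets 1, 2 and 4 are not breakpoints; for those budgets the envelope gives only an upper bound on the optimal benefit.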
Let $x^0 \in F$ be a feasible solution, and let $\lambda_0 = \frac{f(x^0)}{g(x^0)}$. We claim the following:
Lemma 3. $\max_{x \in F} f(x) - \lambda_0 g(x)$ provides the tangent line to the concave envelope with the slope $\lambda_0$ at a budget $g(x^0)$.
Figure 3: Identifying a breakpoint with $\lambda_0$ subgradient. (Axes: Budget, Benefit; the figure marks the slope $\lambda_0$ and the intercept $\Delta$.)
Proof. Let $\Delta$ be the intercept of such a line, of slope $\lambda_0$, on the horizontal axis, as in Figure 3. To find the point of the tangent we want to maximize $-\Delta$. Since $f(x) = \lambda_0 (g(x) - \Delta)$, it follows that $-\Delta = \frac{1}{\lambda_0} [f(x) - \lambda_0 g(x)]$.
Therefore, maximizing $-\Delta$ is equivalent to $\max_{x \in F} f(x) - \lambda_0 g(x)$. For such optimal $\Delta$ the line $f(x) = \lambda_0 (g(x) - \Delta)$ lies above all feasible solutions and is tangential to the concave envelope at breakpoint $x^1$, where $x^1 = \arg\max_{x \in F} f(x) - \lambda_0 g(x)$. $x^1$ is a breakpoint with a left subgradient equal to $\lambda_1$ and right subgradient equal to $\lambda_2$, such that $\lambda_1 \ge \lambda_0 \ge \lambda_2$.
For the convex envelope corresponding to the minimization problem, the tangent is found at a breakpoint to the right, at a bigger budget, of $g(x^0)$.
The proof of Lemma 3 leads to an iterative procedure for ratio improvement. Consider the maximization case: From solution $x^0$ we derive the breakpoint solution $x^1$ with the left subgradient $\lambda_1$. When re-solving $\max_{x \in F} f(x) - \lambda_1 g(x)$ for $\lambda_1$, the optimal solution is an adjacent breakpoint on the left, say $x^2$, that has $\lambda_1$ as a right derivative. Also, the nestedness property implies that $x^2 \subseteq x^1$ (viewing solutions as their sets of selected nodes). Therefore, the procedure will scan all the breakpoints, one at a time, till it reaches the maximum “density” point, $x^*$, that attains $\max_{x \in F} \frac{f(x)}{g(x)}$. Figure 2 illustrates this breakpoint, which is the leftmost of all breakpoints in the concave envelope. This process leads to the incremental parametric cut procedure described in the next section.
5 THE INCREMENTAL
PARAMETRIC CUT AND
CLUSTER IMPROVEMENT
In many applications the clustering problem has an NP-hard version that has an additional budget constraint on the size or weight of the selected set. Consider for instance the ratio region problem, which is a relaxation of the NP-hard graph expander problem, where the minimization of the ratio is constrained by the requirement that the size of the set is at most half the number of nodes in the graph. A common heuristic method to address this is cluster improvement: one derives an initial solution that satisfies the additional constraint, and then applies the respective ratio optimization to find a subset that optimizes the ratio within this cluster. Although one can solve for the optimal ratio within the subset with the same parametric cut algorithm, after fixing the out-of-initial-cluster nodes to be out of the cluster, there is another approach that was used, for instance, in (Lang and Rao, 2004).
PROCEDURE (MAXIMUM) RATIO IMPROVEMENT ($f()$, $g()$, $x^0 \in F$).
Step 0: Initialize $k = 0$.
Step 1: $\lambda_k = \frac{f(x^k)}{g(x^k)}$. {This step can be implemented extra efficiently, adding linear time to all iterations.}
Step 2: Solve, with a min-cut algorithm or procedure incremental-para, ratio($\lambda_k$) $= \max_{x \in F} f(x) - \lambda_k g(x)$, and let $x^{k+1} = \arg\max_{x \in F} f(x) - \lambda_k g(x)$.
Step 3: If ratio($\lambda_k$) $= 0$ stop. Output $x^* = x^k$. Else, continue.
Step 4: {ratio($\lambda_k$) $> 0$} Let $k := k + 1$. Go to step 1.
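The steps of the procedure can be traced on a small instance. The sketch below is an illustration under stated assumptions rather than procedure incremental-para: the oracle in Step 2 is replaced by brute-force enumeration over nonempty subsets, $f$ is taken to be the edge weight inside the set, $g$ is the set size, and the toy graph and function names are hypothetical. Exact rational arithmetic is used so that the stopping test in Step 3 is reliable.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy graph: a heavy triangle {0, 1, 2} with a light tail 2-3-4.
EDGES = {(0, 1): 5, (0, 2): 5, (1, 2): 5, (2, 3): 1, (3, 4): 1}
NODES = range(5)


def f(S):
    """Benefit: total edge weight inside S."""
    return sum(w for (i, j), w in EDGES.items() if i in S and j in S)


def g(S):
    """Budget consumed: number of selected nodes."""
    return len(S)


def solve_lambda(lam):
    """Brute-force stand-in for the min-cut oracle: max f(x) - lam * g(x)."""
    candidates = [frozenset(c) for r in range(1, len(NODES) + 1)
                  for c in combinations(NODES, r)]
    best = max(candidates, key=lambda S: f(S) - lam * g(S))
    return f(best) - lam * g(best), best


def ratio_improvement(x0):
    """Maximum ratio improvement, following Steps 0-4 above."""
    x, trace = frozenset(x0), []
    while True:
        lam = Fraction(f(x), g(x))          # Step 1
        trace.append(lam)
        value, x_next = solve_lambda(lam)   # Step 2
        if value == 0:                      # Step 3: no better ratio exists
            return x, lam, trace
        x = x_next                          # Step 4: value > 0, iterate


x_star, lam_star, trace = ratio_improvement(NODES)
print(sorted(x_star), lam_star, trace)
```

Starting from the full node set, the sequence of ratios is strictly increasing and the procedure stops at the densest set, after one improvement step in this example.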
To prove validity we need to show that ratio($\lambda$) $\ge 0$ (because there is already a feasible solution of the value $\lambda$, we know that the optimal ratio value can only be better than $\lambda$), and that the solution is optimal among those that have their support (variables of value 1) contained in the support of $x^0$.
The second claim follows from the fact that the problem is monotone IP3, and is therefore solved with a min cut procedure. We also need to assume that $g(x) = \sum q_i x_i$. Therefore, when multiplied by $\lambda$, the source adjacent capacities are monotone increasing and the sink adjacent capacities are monotone decreasing, and therefore the source sets are nested. That is, for the optimal ratio value $\lambda^* = \frac{f(x^*)}{g(x^*)}$, with $\lambda^* > \lambda_k$, it is guaranteed that the optimal solution set (the set of nodes with variable value 1) is a strict subset of the set of nodes corresponding to $x^k$.
While this procedure was used until now with each iteration requiring the running time of one minimum cut procedure, the authors adapted it to procedure incremental-para, which uses the “continuation” idea of the HPF parametric cut procedure (Chandran and Hochbaum, ). With this procedure, the values of the parameter computed at each iteration are used to determine the capacities of the source and sink adjacent arcs during the ratio improvement algorithm when a new solution is found. The number of calls to the procedure is the number of breakpoints traversed until it reaches the leftmost breakpoint. The complexity of procedure incremental-para, however, is that of a single minimum cut procedure on the respective graph, e.g. $O(mn \log \frac{n^2}{m})$.
6 CONCLUSIONS
We present here a general unifying framework for any NP-hard problem that can be formulated as a monotone integer programming problem with a budget constraint (budgeted IPM problem). This framework provides powerful algorithmic methods that work efficiently and deliver very high quality solutions. One example illustrated here is the text summarization problem, for which we introduce a new model which is budgeted IPM, potentially leading to more efficient and scalable algorithms.
We study here the concave envelope of the Lagrangean relaxation of the budget constraint, and demonstrate a new parametric cut procedure that identifies, consecutively, all the breakpoints. This incremental parametric cut procedure is more easily implemented than the general parametric cut procedure, and reduces the running time on average.
ACKNOWLEDGEMENTS
This research was supported in part by AI Institute
NSF Award 2112533.
REFERENCES
Carbonell, J. and Goldstein, J. (1998). The use of mmr,
diversity-based reranking for reordering documents
and producing summaries. In Proceedings of the
21st annual international ACM SIGIR conference on
Research and development in information retrieval,
pages 335–336.
Chaillou, P., Hansen, P., and Mahieu, Y. (1989). Best net-
work flow bounds for the quadratic knapsack problem.
In Combinatorial Optimization: Lectures given at
the 3rd Session of the Centro Internazionale Matem-
atico Estivo (CIME) held at Como, Italy, August 25–
September 2, 1986, pages 225–235. Springer.
Chandran, B. and Hochbaum, D. S. Pseudoflow parametric maximum flow solver version 1.0. https://riot.ieor.berkeley.edu/Applications/Pseudoflow/parametric.html. Accessed: 2023-04-30.
Cheeger, J. (1970). A lower bound for the smallest eigen-
value of the laplacian, problems in analysis (papers
dedicated to salomon bochner, 1969).
Gallo, G., Grigoriadis, M. D., and Tarjan, R. E. (1989). A
fast parametric maximum flow algorithm and applica-
tions. SIAM Journal on Computing, 18(1):30–55.
Hochbaum, D. S. (2002). Solving integer programs
over monotone inequalities in three variables: A
framework for half integrality and good approxima-
tions. European Journal of Operational Research,
140(2):291–321.
Hochbaum, D. S. (2008). The pseudoflow algorithm: A new
algorithm for the maximum-flow problem. Operations
research, 56(4):992–1009.
Hochbaum, D. S. (2010). Polynomial time algorithms for
ratio regions and a variant of normalized cut. IEEE
transactions on pattern analysis and machine intelli-
gence, 32(5):889–898.
Hochbaum, D. S. and Fishbain, B. (2011). Nuclear threat
detection with mobile distributed sensor networks.
Annals of Operations Research, 187:45–63.
Hochbaum, D. S., Liu, Z., and Goldschmidt, O. (2023).
A breakpoints based method for the maximum diver-
sity and dispersion problems. In SIAM Conference
on Applied and Computational Discrete Algorithms
(ACDA23), pages 189–200. SIAM.
Hochbaum, D. S. and Orlin, J. B. (2013). Simplifications
and speedups of the pseudoflow algorithm. Networks,
61(1):40–57.
Hochbaum, D. S. and Singh, V. (2009). An efficient algo-
rithm for co-segmentation. In 2009 IEEE 12th Inter-
national Conference on Computer Vision, pages 269–
276. IEEE.
Lang, K. and Rao, S. (2004). A flow-based method for
improving the expansion or conductance of graph
cuts. In Integer Programming and Combinatorial
Optimization: 10th International IPCO Conference,
New York, NY, USA, June 7-11, 2004. Proceedings 10,
pages 325–337. Springer.
Lin, H. and Bilmes, J. (2010). Multi-document summariza-
tion via budgeted maximization of submodular func-
tions. In Human Language Technologies: The 2010
Annual Conference of the North American Chapter of
the Association for Computational Linguistics, pages
912–920.
Pisinger, D. (2006). Upper bounds and exact algorithms
for p-dispersion problems. Computers & operations
research, 33(5):1380–1398.
Sharon, E., Galun, M., Sharon, D., Basri, R., and Brandt, A.
(2006). Hierarchy and adaptivity in segmenting visual
scenes. Nature, 442(7104):810–813.
Shi, J. and Malik, J. (2000). Normalized cuts and image
segmentation. IEEE Transactions on pattern analysis
and machine intelligence, 22(8):888–905.
Spiers, S., Bui, H. T., and Loxton, R. (2023). An exact cut-
ting plane method for the euclidean max-sum diversity
problem. European Journal of Operational Research.