mating the clustering of the input data. This is not
a major problem in the Bayesian setting, as posterior
inference algorithms are computationally expensive.
Careless use of this algorithm in the optimisation setting, as we shall see in the next section, may be devastating, as computing the clustering of the input data, even for the minimum number of iterations, can be too expensive. The description of CABLR is shown in Algorithm 1.
Input: D: input data; Q_D: k-clustering of D with |Q_D| := k; M: coreset size
Output: ε-coreset C with |C| = M
1  initialise;
2  for n = 1, 2, ..., N do
3      m_n ← Sensitivity(N, Q_D) ;            // compute the sensitivity of each point
4  end
5  m̄_N ← (1/N) Σ_{n=1}^{N} m_n ;
6  for n = 1, 2, ..., N do
7      p_n = m_n / (N m̄_N) ;                  // compute importance weight for each point
8  end
9  (K_1, K_2, ..., K_N) ∼ Multi(M, (p_n)_{n=1}^{N}) ;   // sample coreset points
10 for n = 1, 2, ..., N do
11     w_n ← K_n / (p_n M) ;                  // calculate the weight for each coreset point
12 end
13 C ← {(w_n, x_n, y_n) | w_n > 0};
14 return C
Algorithm 1: CABLR: an algorithm to construct coresets for Logistic Regression.
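For illustration, the sampling steps of Algorithm 1 translate almost directly into code. The sketch below assumes the per-point sensitivities m_n have already been computed from the k-clustering Q_D (the sensitivity bound of Huggins et al. is not reproduced here, so they are passed in as an array); the function name and interface are purely illustrative.

```python
import numpy as np

def cablr_coreset(X, y, sensitivities, M, rng=None):
    """Sketch of the sampling steps (lines 5-13) of Algorithm 1.

    X, y          -- the input data D (features and labels)
    sensitivities -- array of m_n, assumed precomputed from the k-clustering Q_D
    M             -- desired coreset size
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(X)

    m_bar = sensitivities.mean()         # line 5: average sensitivity
    p = sensitivities / (N * m_bar)      # line 7: importance weights, summing to 1

    K = rng.multinomial(M, p)            # line 9: multinomial sampling of coreset points
    w = K / (p * M)                      # line 11: weight for each sampled point

    keep = K > 0                         # line 13: keep only points sampled at least once
    return w[keep], X[keep], y[keep]
```

Only points drawn at least once are kept, matching line 13 of Algorithm 1.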
Remark. In the description of Algorithm 1, we hide
the coreset dependence on the error parameter ε, de-
fined in Section 2.1. There is a good reason for do-
ing this. When theoretically designing a coreset al-
gorithm for some fixed problem, there are two error
parameters involved: ε ∈ [0, 1], the “loss” incurred
by coresets, and δ ∈ (0, 1), the probability that the al-
gorithm will fail to compute a coreset. Then, it is nec-
essary to define the minimum coreset size M in terms
of these error parameters. The norm is to prove there
exists a function t : [0,1] × (0,1) → Z^+, with Z^+ being the set of all positive integers, that gives the corresponding coreset size for all possible error values, i.e., t(ε_1, δ_1) := M_1 implies that M_1 is the minimum number of points needed in the coreset for achieving, with probability 1 − δ_1, the guarantee defined in inequality (1) for ε_1. However, in practice, one does not worry
about explicitly giving the error parameters as inputs;
since each coreset algorithm comes with its own defi-
nition of t, one only needs to give the desired coreset
size M and the error parameters can be computed us-
ing t. Finally, t defines a fundamental trade-off for
coresets: the smaller the error parameters, the larger the resulting coreset size, i.e., smaller coresets may potentially lose more information than larger coresets.³
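To make the trade-off concrete, the bound from footnote 3 can be written as a small helper. This is only a sketch: the constant c and the average sensitivity m̄_N come from the analysis and are not fixed here.

```python
import math

def coreset_size(eps, delta, m_bar, D, c=1.0):
    """Sketch of t(eps, delta) from footnote 3: the minimum coreset size needed to
    obtain an eps-coreset with probability 1 - delta. D is the number of features;
    c and m_bar (the average sensitivity) are problem-dependent placeholders."""
    return math.ceil((c * m_bar / eps**2) * ((D + 1) * math.log(m_bar) + math.log(1.0 / delta)))

# Halving eps roughly quadruples the required coreset size, illustrating the trade-off.
```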
Notice that Algorithm 1 implements the sensitivity framework almost as described in Section 2.3: the k cluster centres Q_D obtained from the input data are used to compute the sensitivities; the sensitivities are then normalised and points are sampled; finally, the weights, which are inversely proportional to the sensitivities, are computed for each of the sampled points. Thus, even though the obtained coreset is for LR, CABLR still needs a clustering of the input data, as is common for any coreset algorithm designed using the sensitivity framework.
2.4 The Clustering Bottleneck
Clustering is known to be a computationally hard
problem (Arthur and Vassilvitskii, 2007). This is why
approximation and data reduction techniques are use-
ful for speeding up existing algorithms. The sen-
sitivity framework, originally proposed for design-
ing coresets for clustering problems, requires a sub-
optimal clustering of the input data D in order to com-
pute the sensitivity for each input point. This require-
ment transfers to CABLR, described in the previous
section. In the Bayesian setting, the time necessary
for clustering D is strongly dominated by the cost of
posterior inference algorithms (see (Huggins et al.,
2016)). However, if we remove the burden of pos-
terior inference and consider the optimisation setting,
then the situation is dramatically different.
Figure 1 compares the clustering time against the time taken by the other steps of CABLR when constructing a coreset, namely sensitivity computation and sampling; the time spent on learning from the coreset is also included.
We can clearly see that obtaining the clustering can make coreset construction for LR impractical in the optimisation setting, as it severely increases the overall coreset-construction time. Even worse, constructing the coreset is slower than learning directly from D, defeating the purpose of using the coreset as an acceleration technique.
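One rough way to reproduce this effect is to time the clustering step against learning directly from the data. The snippet below is only a sketch on synthetic data, using scikit-learn's KMeans and LogisticRegression as stand-ins, so the absolute numbers will differ from those behind Figure 1.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the input data D.
rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 20))
y = (X @ rng.normal(size=20) + rng.normal(size=200_000) > 0).astype(int)

t0 = time.perf_counter()
KMeans(n_clusters=6, n_init=1, max_iter=10).fit(X)  # clustering needed before CABLR can run
clustering_time = time.perf_counter() - t0

t0 = time.perf_counter()
LogisticRegression(max_iter=200).fit(X, y)          # learning directly from D
direct_training_time = time.perf_counter() - t0

print(f"clustering: {clustering_time:.2f}s, direct LR training: {direct_training_time:.2f}s")
```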
³ For CABLR, Huggins et al. proved that t(ε, δ) := ⌈(c m̄_N / ε²)[(D + 1) log m̄_N + log(1/δ)]⌉, where D is the number of features in the input data, m̄_N is the average sensitivity of the input data and c is a constant. The mentioned trade-off can be appreciated in the definition of t.