terwise regression problem which constructs initial
solutions at each iteration using results obtained at the
previous iteration. In tests on multiple regression datasets, they find that the proposed algorithm is very efficient even on large, dense datasets that do not contain outliers. Joki et al. (2020) provide a support vector machine based formulation that approximates the clusterwise regression problem with an $L_1$ accuracy measure, which is naturally more robust to outliers. Park et al. (2017) use a mixed-integer quadratic programming formulation and design and compare the performance of several metaheuristic-based algorithms (e.g., a genetic algorithm, column generation, and a two-stage approach) on synthetic data and real-world retail sales data. Procedures to determine the optimal number of clusters, when this number is not known a priori, have also been suggested in the literature (e.g., Shao and Wu 2005).
In this paper, we study an extension of the regression clustering problem. In our problem, data points belong to predefined subgroups, and data points from the same subgroup are constrained to fall in the same cluster after the regression clustering procedure is applied. The optimality criterion is the minimization of the total sum of squared errors (SSE) over the two resulting clusters after two independent OLS regressions are applied to them. Because the number of possible partitions is large, a complete enumeration approach is computationally prohibitive. Instead, we provide gradient descent based heuristics to solve this problem.
We propose to cycle through the partition variables at each iteration and consider all possible binary splits based on each variable. The candidate splits depend on the type of the independent variable. For an ordered or a continuous variable, we sort the distinct values of the variable and place “cuts” between any two adjacent values to form partitions. Hence, for an ordered variable with $L$ distinct values, there are $L - 1$ possible splits, which can be a very large number for a continuous variable in large-scale data. We therefore specify a threshold $L_{\text{cont}}$ (say, 500) and, if the number of distinct values exceeds $L_{\text{cont}} + 1$, only consider splits at the $L_{\text{cont}}$ equally spaced quantiles of the variable. An alternative way of speeding up the calculation is to use an updating algorithm that “updates” the regression coefficients as the split point changes, which is computationally more efficient than recalculating the regression from scratch every time. Here, we adopt the former approach for its algorithmic simplicity.
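As an illustration, the following sketch shows one way to realize this candidate-generation step (a minimal sketch of our own, not the paper's implementation; the function name and the convention of placing each cut at the midpoint between adjacent distinct values are ours):

```python
import numpy as np

def candidate_cuts(x, L_cont=500):
    """Candidate split points for an ordered or continuous variable.

    Places a cut between every pair of adjacent distinct values,
    giving L - 1 cuts for L distinct values; if L exceeds
    L_cont + 1, falls back to L_cont equally spaced quantiles.
    """
    vals = np.unique(x)                       # sorted distinct values
    if len(vals) <= L_cont + 1:
        return (vals[:-1] + vals[1:]) / 2.0   # midpoints between adjacent values
    # Too many distinct values: use L_cont equally spaced quantiles instead.
    qs = np.linspace(0, 1, L_cont + 2)[1:-1]  # interior quantile levels
    return np.unique(np.quantile(x, qs))

# Each cut c induces the binary split {x <= c} vs. {x > c}.
```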
Splitting on an unordered categorical variable is quite challenging, especially when there are many categories. Setting up a different OLS equation for each level may lead to statistically insignificant results, especially if the number of observations at a particular level is small. Instead, we would like to find collections of levels that share the same regression equation. For a categorical variable with $L$ levels, the number of possible nonempty binary partitions is $2^{L-1} - 1$; when $L > 20$, the number of feasible partitions exceeds one million. In this paper, we focus on this case due to its combinatorially challenging nature. Since it is not possible to search through all these solutions, we propose an integer programming formulation to solve this problem and devise a gradient descent based heuristic as an alternative.
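To give a sense of the scale, the following sketch (our own illustration, not part of the paper's method) enumerates all $2^{L-1} - 1$ nonempty binary partitions of a level set, which is feasible only for small $L$:

```python
from itertools import combinations

def binary_partitions(levels):
    """Yield each partition of `levels` into two nonempty groups once.

    Fixing the first level in group A avoids counting {A, B} and
    {B, A} twice, giving 2**(L-1) - 1 partitions for L levels.
    """
    first, rest = levels[0], levels[1:]
    for k in range(len(levels)):
        for extra in combinations(rest, k):
            A = {first, *extra}
            B = set(levels) - A
            if B:  # skip the split that would leave group B empty
                yield A, B

levels = list(range(1, 6))  # L = 5
assert sum(1 for _ in binary_partitions(levels)) == 2 ** 4 - 1  # 15 partitions
```

Already at $L = 21$ this generator would have to produce more than a million partitions, which motivates the heuristic search.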
The rest of the paper is organized as follows. Section 2 presents the formulation of the problem and the notation used throughout the paper. We provide a list of heuristics for the problem in Section 3 and compare their performance on simulated datasets in Section 4. Section 5 concludes the paper with a discussion of the limitations of the current method and the generated datasets, as well as potential avenues for future research.
2 PROBLEM FORMULATION
Consider the problem of splitting a node based on a single categorical variable $s$ with $L$ unique values, which we refer to as levels or categories. Let $y \in \mathbb{R}$ be the response variable and $x \in \mathbb{R}^p$ denote the vector of linear predictors. The linear regression relationship between $y$ and $x$ varies across the values of $s$. For the sake of our argument, we assume there is a single varying-coefficient variable. The proposed algorithm can be extended to cases with multiple partition variables, either by forming factors through the combination of the original factors or by searching for the optimal partition variable-wise.
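As a concrete illustration of the first extension, two categorical partition variables can be merged into a single factor whose levels are their observed combinations, after which the single-variable search described above applies unchanged. A minimal pandas sketch (the data and column names are hypothetical):

```python
import pandas as pd

# Two partition variables with a few levels each combine into one
# factor whose levels are the observed pairs, e.g. "N:web".
df = pd.DataFrame({"region": ["N", "S", "N", "E"],
                   "channel": ["web", "store", "store", "web"]})
df["combined"] = df["region"] + ":" + df["channel"]
```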
Let $(x_i', y_i, s_i)$ denote the measurements on subject $i$, where $i = 1, \cdots, n$ and $s_i \in \{1, 2, \cdots, L\}$ denotes a categorical variable with $L$ levels. The partitioned regression model is

$$y_i = \sum_{m=1}^{M} x_i' \beta_m \, w_m(s_i) + \varepsilon_i, \qquad (1)$$
where $w_m(s_i) \in \{0, 1\}$ denotes whether the $i$-th observation belongs to the $m$-th group or not. We require that $\sum_{m=1}^{M} w_m(s) = 1$ for any $s \in \{1, 2, \cdots, L\}$.
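To make the objective concrete, the following sketch (ours, not the authors' code) evaluates a candidate binary partition of the levels under model (1) by fitting two independent OLS regressions and summing their residual sums of squares, i.e., the total SSE criterion from Section 1:

```python
import numpy as np

def partition_sse(X, y, s, group1):
    """Total SSE of model (1) with M = 2: one OLS fit on the
    observations whose level s_i lies in `group1`, another on the
    remaining observations, summing the residual sums of squares.
    Assumes both groups are nonempty (guaranteed by construction)."""
    mask = np.isin(s, list(group1))  # w_1(s_i) = 1 iff s_i is in group1
    sse = 0.0
    for idx in (mask, ~mask):
        Xg, yg = X[idx], y[idx]
        beta, *_ = np.linalg.lstsq(Xg, yg, rcond=None)
        resid = yg - Xg @ beta
        sse += float(resid @ resid)
    return sse
```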
In this paper, we consider binary partitions, namely $M = 2$, but the method can be extended to multi-way partitions in a straightforward fashion. To simplify notation in the binary case, let $w_i := w_1(s_i) \in \{0, 1\}$, where $w_1(\cdot)$ is a mapping from $\{1, 2, \cdots, L\}$ to $\{0, 1\}$. Further, define atomic weights for each level