
gle execution of a specific SD algorithm could involve
certain disadvantages. One of them concerns the SD
algorithm itself and its initial hyperparameters. In
this sense, an SD algorithm could implement either
an exhaustive or heuristic exploration strategy, return
either all subgroups explored or the top-k subgroups
explored, apply different pruning strategies, and support
different quality measures. Moreover, different
implementations of the same algorithm may add
hyperparameters beyond those originally
defined (e.g., exploration depth). All the
aforementioned characteristics make the subgroup set
obtained by an SD algorithm highly variable and dependent
on the initial conditions of the algorithm, so that
the subgroups mined by different SD algorithms
or with different hyperparameters can differ notably.
Another disadvantage is the large number of
subgroups that a single SD algorithm execution
can mine (the pattern explosion problem), which increases
the subgroup set size and makes the result hard
for experts to read and interpret.
Taking all this into account, we propose and de-
velop a new approach based on the evaluation of the
overlap between the subgroup sets mined by differ-
ent SD algorithm executions to obtain a reduced sub-
group set. More precisely, the main contributions of
this research are (1) the definition of the problem of
mining a patient phenotype in the form of a reduced
subgroup set and (2) a new 6-step methodology that
tackles this problem and allows the involvement of
clinical experts in the process. This SD-based
methodology builds on
previous work (Lopez-Martinez-Carrasco
et al., 2021), which found patient
cohorts by evaluating the overlap between different
executions of a clustering algorithm.
The experiments carried out in this research
follow the proposed 6-step methodology, and the
results obtained are compared with those of another
descriptive SD method.
2 PROBLEM STATEMENT
This section provides the formal definitions related to
the problem of mining a patient phenotype in the form
of a reduced subgroup set.
An attribute a is a unique characteristic of an ob-
ject, which has an associated value. An example of
an attribute is a = headache : yes. Moreover, the
domain of a, denoted as dom(a), is the set of all
unique values that a can take. Note that an attribute
can be nominal or numeric depending on its domain.
An instance i is a tuple of attributes of the form
$i = (a_1, \ldots, a_m)$. Given the attributes $a_1$ = fever : no
and $a_2$ = headache : yes, an example of an instance
is i = (fever : no, headache : yes). A dataset
d is a tuple of instances of the form $d = (i_1, \ldots, i_n)$.
Given the instances $i_1$ = (headache : yes, fever : no)
and $i_2$ = (headache : yes, fever : yes), an example
of a dataset is d = ((headache : yes, fever : no),
(headache : yes, fever : yes)). Moreover, the notation
$v_{x,y}$ is used to indicate the value of the x-th instance
$i_x$ and its y-th attribute $a_y$ from a dataset d.
Given an attribute $a_y$ from a dataset d, a binary
operator ∈ {=, ≠, <, >, ≤, ≥} and a value
w ∈ dom($a_y$), then a selector e is defined as a 3-tuple of the
form ($a_y$.characteristic, operator, w). Informally, a
selector is a binary relation between an attribute from
a dataset and a possible value of its domain. An example
of a selector is e = (headache, =, yes).
Given an instance i and a selector e, then i is covered
by e if the binary expression “$v_{x,y}$ operator w”
holds true. Otherwise, i is not covered by e.
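A minimal sketch of selector coverage, under stated assumptions: the names `Selector`, `OPERATORS`, and `covered_by_selector` are hypothetical, instances are encoded as tuples of (characteristic, value) pairs, and the operator ≠ is written as the string "!=".

```python
import operator
from typing import NamedTuple

# Assumed mapping from operator symbols to Python comparison functions.
OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


class Selector(NamedTuple):
    characteristic: str  # name of the attribute a_y
    op: str              # one of the keys of OPERATORS
    w: object            # a value from dom(a_y)


def covered_by_selector(instance, e: Selector) -> bool:
    """Return True if the expression "v operator w" holds for the instance."""
    # Look up the instance value for the attribute named by the selector.
    v = dict(instance)[e.characteristic]
    return OPERATORS[e.op](v, e.w)


i = (("headache", "yes"), ("fever", "no"))
e = Selector("headache", "=", "yes")
```

Here `covered_by_selector(i, e)` evaluates the expression "yes = yes". A selector naming an attribute that the instance lacks raises a `KeyError` in this sketch; the text does not define that case.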
Given a dataset d, a pattern p is a list of selectors
of the form $< e_1, \ldots, e_j >$ in which all attributes of the
selectors are different. It is interpreted as a conjunc-
tion of selectors that represents a list of properties of a
subset from d. Additionally, the pattern size is defined
as the number of selectors that it contains.
Given an instance i and a pattern p, then i is cov-
ered by p if i is covered by all selectors e ∈ p. Other-
wise, i is not covered by p.
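Pattern coverage, being a conjunction of selectors, can be sketched as below. The helper names (`make_pattern`, `covered_by_pattern`) and the encoding of selectors as plain (characteristic, operator, w) tuples are assumptions for illustration only.

```python
import operator

# Assumed mapping from operator symbols to Python comparison functions.
OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def covered_by_selector(instance, selector) -> bool:
    characteristic, op, w = selector
    return OPERATORS[op](dict(instance)[characteristic], w)


def make_pattern(*selectors):
    """Build a pattern, enforcing that all selector attributes are different."""
    names = [characteristic for characteristic, _, _ in selectors]
    if len(names) != len(set(names)):
        raise ValueError("all selectors in a pattern must use different attributes")
    return tuple(selectors)


def covered_by_pattern(instance, pattern) -> bool:
    """An instance is covered by p if it is covered by every selector in p."""
    return all(covered_by_selector(instance, e) for e in pattern)


i1 = (("headache", "yes"), ("fever", "no"))
i2 = (("headache", "yes"), ("fever", "yes"))
p = make_pattern(("headache", "=", "yes"), ("fever", "=", "no"))
pattern_size = len(p)  # number of selectors in the pattern
```

In this sketch, `i1` satisfies both selectors of `p`, whereas `i2` fails the fever selector and is therefore not covered.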
Given a pattern p and a selector e, a subgroup
s is a pair (p, e) in which the pattern is called
the ‘description’ and the selector is called
the ‘target’. Additionally, the subgroup size is
defined as the number of selectors that its description
contains. An example of a subgroup is s = (<
(headache, =, yes), (fever, =, no) >, (flu, =, no)).
Given two subgroups s and $s'$, $s'$ is a refinement
of s (denoted as $s \prec s'$) if $s'$ has the same target as s,
i.e., $s'$.target = s.target, and has an extended description,
i.e., $s'$.description = concat(s.description, $< e_1, \ldots, e_j >$).
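The refinement relation can be checked with a short sketch such as the following. The names `Subgroup` and `is_refinement` are hypothetical, selectors are encoded as plain tuples, and the sketch assumes that the appended list of selectors is non-empty, so that a subgroup is not a refinement of itself.

```python
from typing import NamedTuple


class Subgroup(NamedTuple):
    description: tuple  # pattern: a tuple of (characteristic, operator, w) selectors
    target: tuple       # a single (characteristic, operator, w) selector


def is_refinement(s: Subgroup, s_prime: Subgroup) -> bool:
    """Return True if s_prime refines s.

    s_prime must have the same target and a description that extends
    s.description with at least one selector (assumption: j >= 1).
    """
    n = len(s.description)
    return (
        s_prime.target == s.target
        and len(s_prime.description) > n
        and s_prime.description[:n] == s.description
    )


s = Subgroup((("headache", "=", "yes"),), ("flu", "=", "no"))
s_prime = Subgroup(
    (("headache", "=", "yes"), ("fever", "=", "no")),
    ("flu", "=", "no"),
)
```

Here `s_prime` extends the description of `s` with the selector (fever, =, no) and keeps the same target, so it is a refinement of `s` but not vice versa.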
Given a subgroup s and a dataset d, a quality mea-
sure q is a function that computes one numeric value
according to s and certain metrics from d (Atzmueller,
2015).
Focusing on a specific subgroup s and a specific
dataset d, different metrics with which to compute
quality measures can be defined: (1) true positives
(tp), defined as the number of instances i from the
dataset d that are covered by the subgroup description
s.description and by the subgroup target s.target;
(2) false positives (fp), defined as the number of instances
i from d that are covered by s.description, but
not by s.target; (3) true population (TP), defined as
A Methodology Based on Subgroup Discovery to Generate Reduced Subgroup Sets for Patient Phenotyping