A prevalent semi-automatic method is the elbow method (Thorndike, 1953). The user is shown a graph with the different k values on the x-axis and the sum of squared errors (SSE) of the corresponding clustering on the y-axis. This graph is typically monotonically decreasing; the assumption is that the correct k value is the one at which the decrease is largest. The user obtains this point by inspecting the graph and searching for an elbow. Yet, both automatic and semi-automatic methods can only compare different clustering results and tell which one is better; they cannot describe how a clustering result is obtained or which features are meaningful for each individual cluster.
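For illustration, the following minimal sketch shows how such an elbow plot is typically produced with scikit-learn; the function name plot_elbow and the chosen range of k values are our own assumptions, not part of the original method.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_range=range(1, 11)):
    """Plot the SSE for each k so a user can visually search for an elbow."""
    sse = []
    for k in k_range:
        model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
        sse.append(model.inertia_)  # sum of squared distances to the closest centroid
    plt.plot(list(k_range), sse, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("sum of squared errors (SSE)")
    plt.show()
```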
To detect the most significant features in data, fea-
ture selection (Kumar and Minz, 2014) or feature im-
portance (Altmann et al., 2010) algorithms might be
used. The most common use of feature selection is
that based on a target class, e. g., the cluster label, the
features that have the greatest impact on the result are
selected. These algorithms typically assign scores to
the features by measuring the correlation of a feature
with respect to the target. To this end, statistical mea-
sures as, for instance, ANOVA, chi-square or mutual
information are used (Altmann et al., 2010).
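A minimal sketch of this kind of target-based scoring, assuming the cluster labels are used as the target and mutual information as the statistical measure, could look as follows; the function name score_features and the parameter top_k are hypothetical.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def score_features(X, cluster_labels, top_k=5):
    """Score every feature against the cluster labels and keep the top_k features."""
    selector = SelectKBest(score_func=mutual_info_classif, k=top_k)
    selector.fit(X, cluster_labels)
    # scores_ holds one score per feature; get_support returns the selected indices
    return selector.scores_, selector.get_support(indices=True)
```

Chi-square could be used analogously via sklearn's chi2 score function, which, however, requires non-negative feature values.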
Yet, there are also more advanced feature selection algorithms, e.g., Random Forest (Breiman, 2001) or Boruta (Kursa and Rudnicki, 2010), which use a Random Forest model to determine which features contribute most to the result. However, these algorithms cannot be applied to the resulting clusters individually, as each cluster contains only one label, and meaningful statements are not possible if only one class is present. Hence, these approaches are not suited to explain which features are important in the individual clusters or in the clustering result at all.
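The following sketch illustrates this limitation under the assumption that the cluster labels serve as the prediction target; the function name global_feature_importance is ours. The model yields a single global ranking, and restricting the data to one cluster would leave only a single class, so no per-cluster model can be trained.

```python
from sklearn.ensemble import RandomForestClassifier

def global_feature_importance(X, cluster_labels):
    """Train a Random Forest on the cluster labels; yields one global ranking only."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, cluster_labels)
    # One importance value per feature for the clustering as a whole; a per-cluster
    # variant is not meaningful because a single cluster contains only one class.
    return forest.feature_importances_
```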
However, some feature selection algorithms can be used without a target class (Solorio-Fernández et al., 2020). Here, the authors conclude that such feature selection algorithms need additional hyperparameters, such as the number of clusters or the number of features, which domain experts cannot provide in practice, especially in the context of exploratory analysis. Furthermore, the scalability of these methods is limited, and differences between individual clusters and the entire dataset are not taken into account. As a result, meaningful features cannot be determined for each cluster individually.
The approach most related to our work is Interpretable k-Means (Alghofaili, 2021), which is not a scientific publication but a promising article published on towardsdatascience.com, including an implementation on GitHub. This approach utilizes the SSE that is minimized in the k-Means method. To this end, it calculates for each cluster which feature minimizes the SSE the most. Since the objective of k-Means is to minimize the SSE, this corresponds to the feature with the most significant impact on the clustering result. This makes it possible to assign an importance score to each feature and to select the most meaningful features on this basis. Though this method is able to select features for each cluster individually, it is only applicable to k-Means.
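The core idea can be sketched as follows; this is our own reading of the article, the exact scoring in the original implementation may differ, and the function name per_cluster_feature_sse as well as the assumption that cluster labels index the centroid array are ours.

```python
import numpy as np

def per_cluster_feature_sse(X, labels, centroids):
    """Rank features per cluster by their contribution to the within-cluster SSE.

    Features with the smallest contribution are the ones that 'minimize the SSE
    the most' and are therefore considered most important for that cluster.
    """
    ranking = {}
    for c in np.unique(labels):
        diff = X[labels == c] - centroids[c]      # per-feature deviations from the centroid
        feature_sse = (diff ** 2).sum(axis=0)     # SSE split up by feature
        ranking[c] = np.argsort(feature_sse)      # most important feature first
    return ranking
```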
Approaches that aim to increase the explainability and interpretability of clustering results typically use decision trees (Dasgupta et al., 2020; Loyola-González et al., 2020) for that purpose. Though decision trees are supervised methods, they can be applied to a clustering result by using the cluster labels as class labels. A decision tree is then trained on the clustering result, and the resulting tree can subsequently be used to explain for a certain data instance why it belongs to a cluster. Though this is suitable to explain why certain instances are within a cluster, it is not suitable for domain experts if there are thousands or millions of data instances. As a consequence, explaining every single instance does not scale for domain experts, and it remains unclear what characterizes an individual cluster in detail. Hence, we follow a different approach, i.e., we aim to summarize and describe the clusters themselves rather than each data instance. Therefore, our approach is particularly suited for large-scale datasets with millions of data instances and hundreds of features.
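A minimal sketch of this decision-tree-based explanation, assuming k-Means as the clustering method and a shallow tree for readability, is given below; the function name explain_clustering_with_tree and its parameters are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def explain_clustering_with_tree(X, n_clusters=3, max_depth=3):
    """Cluster the data, then train a decision tree using the cluster labels as classes."""
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(X)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X, labels)
    # The textual rules explain why a given instance ends up in a cluster, but
    # every instance still has to be traced through the tree individually.
    print(export_text(tree))
    return tree, labels
```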
With regard to commercial software, IBM DB2 Warehouse offers an option to visualize and communicate clustering results, including the possibility to sort the features according to their importance. However, based on the documentation¹, this sorting contains all features and is based either on normalized chi-square values, on the homogeneity of the values, or on alphabetical order.
In summary, approaches to explain clustering results only describe how the result was generated, e.g., by decision trees, but not the content or meaning of the result. Feature selection algorithms can only be used to identify meaningful features for the entire result, but not for individual clusters. The only algorithm we could find for identifying meaningful features at the cluster level, Interpretable k-Means, is limited to k-Means and thus not generally applicable. Furthermore, Interpretable k-Means a) ignores the differences between clusters and the entire dataset and b) lacks functionality to determine the number of meaningful features, instead returning a ranking over all features available in the dataset.
¹ https://www.ibm.com/docs/en/db2/10.5