Position Paper: Computer Supported Education vs. Education Supported Computing - On the Problem of Informed Decision Making of Appropriate Data Analytics Method

Daniyal Kazempour¹ᵃ, Christiane Attig²ᵇ, Peer Kröger¹ᶜ, Muhammad Aammar Tufail¹ᵈ, Daniela E. Winkler¹ᵉ and Claudius Zelenka¹ᶠ

¹Christian-Albrechts-Universität zu Kiel, Kiel, Germany
²Universität zu Lübeck, Lübeck, Germany

ᵃ https://orcid.org/0000-0002-2063-2756
ᵇ https://orcid.org/0000-0002-6280-2530
ᶜ https://orcid.org/0000-0001-5646-3299
ᵈ https://orcid.org/0000-0002-2795-4985
ᵉ https://orcid.org/0000-0001-7501-2506
ᶠ https://orcid.org/0000-0002-9902-2212
Keywords: Data Analytics, Method Competence, Method-Application Gap, Interdisciplinary Research.
Abstract: In the field of data analytics, the overwhelming number of available methods presents a challenge: Which method should actually be chosen for a given problem? In this position paper, we raise awareness of this issue and propose educational and computational concepts to address the related challenges and possibilities. As a unique contribution, we include the perspectives of scientists from different domains, including biology, bioinformatics, and psychology, on the problem of method selection, aiming to initiate future discussions and advancement.
1 INTRODUCTION
Teaching data analytics methods is growing in importance. As the field of database and machine learning research advances, novel methods gradually come into focus. These new methods can discover patterns, such as clusters or correlations, that previous methods failed to detect. However, they may also lack the ability to detect patterns that could be discovered by earlier techniques. In short, there is no 'one-size-fits-all' solution. Relying on either older or newer methods as multi-purpose tools can be tempting, but may lead to a form of 'blindness', causing relevant patterns to be missed. In this work we present four positions deemed relevant for domain and computer science alike, addressing the teaching of methods and their case-aware application.
Position 1:
Wealth of Methods vs. Lack of Knowledge:
The Method-Application Gap.
Data analytics appears to be omnipresent in many
different domain sciences such as physics, social science, economics, and biology. This comes as no surprise, since data analytics provides the means to boost scientific advancement. Similarly, in the field of data analytics and machine learning a wealth of methods has been developed, each of them addressing partially disjoint, partially overlapping challenges in order to discover patterns within data. As a side-effect we observe something that we address in this position paper by the term method-application gap: On the one hand, we have within the domain sciences well-established subsets of methods that are 'common practice' for data analytics. This subset of methods is partially taught in a cookbook style within the educational processes of the academic landscape, as elaborated in Section 3. Each of these methods, however, excels at its own subset of characteristics (e.g., discovering arbitrarily shaped patterns, linear correlations, etc.), which raises the need to utilize other and potentially more recent methods. On the other hand, the sheer number of methods developed and published in the field of data analytics and machine learning renders it impossible for the domain sciences to 'catch up', i.e., to know (a) that other methods exist, (b) which one of them to choose, and (c) for which reasons. This problem has also been discussed in Data Clustering: 50 Years Beyond K-means (Jain, 2010), where the author states:
“In spite of thousands of clustering algorithms
that have been published, a user still faces a
dilemma regarding the choice of algorithm,
distance metric, data normalization, number
of clusters, and validation criteria.”
While it may be argued that the 'go-to' methods are all that domain scientists need, we, an interdisciplinary group of scientists, claim that knowledge of other methods can enable the discovery of patterns, and hence ultimately novel insights, that would otherwise be inaccessible. As a consequence, an Education Supported Computing approach that is tailored to teach 'when to use what, and why' is of paramount importance.
Position 2:
Automated Machine Learning Is Not All You Need.
Automated machine learning (ML) pipelines like AutoML provide high comfort and are easy to use. At this point one might be tempted to ask 'Why should we teach students how and when to choose which data analytics method?', since an entirely automated approach would render the need to answer such questions obsolete. However, AutoML approaches are not the 'holy grail':
In a meta-review, Barbudo et al. (2023) performed a literature search on AutoML based on a proposed taxonomy, encompassing 447 primary studies selected from a set of 31,048 papers. They found that the majority (91%) of tasks addressed by AutoML belong to supervised archetypes such as classification or regression. The more challenging unsupervised tasks like clustering or anomaly detection are addressed by only 1-2% of the publications.
Even more severe disadvantages of AutoML revolve around the fact that AutoML approaches operate as black-box methods [(Barbudo et al., 2023), (Quaranta et al., 2024)], which implies that users have to rely on the generated models regardless of whether they can be interpreted or plausibly explained by humans, hindering scientists' interpretation of the results. In this respect, explainability is essential for humans to provide more details in order to obtain more meaningful results [(Barbudo et al., 2023), (Quaranta et al., 2024)], which ultimately can benefit the automation process itself. Additionally, Quaranta et al. (2024) confirm that AutoML's capabilities are limited in unsupervised settings, especially when confronted with 'non-standard' use cases and domains, failing to adapt to the complexities of such scenarios.
Overall, keeping humans in the loop remains of paramount importance. As Xanthopoulos et al. (2020) state, one of the most important criteria for users when choosing a method is the interpretability of the results. The authors specifically mention that users are rarely satisfied with only a predictive model, but aim to understand the discovered patterns within the data; or, in the authors' terms of brevity: AutoML should automate, not obfuscate (Xanthopoulos et al., 2020).
As a bottom line of this position, it is not advisable to rely entirely on automated machine learning pipelines while neglecting or discarding any need to teach and learn when to use which method. This is especially the case when it comes to exploring novel and hence mostly unknown data. Instead, we deem it more important to educate students and scientists alike to learn and understand when to use which existing method than to rely on automated ML processes.
Position 3:
Learning by Doing: On the Need to Interactively Practice Data Analytics Methods.
So far we have addressed the Education Supported Computing (ESC) field, meaning that one needs to learn when to use which of the existing methods.
To achieve this, we now transition to the realm of Computer Supported Education (CSE), the main theme of this conference. The third position discusses the need to incorporate computer science methods to support the education on when to use which method.
Many data analytics modules provide the means to practice the learned methods in the tutorials of their respective courses. This practice, however, is mostly tailored to completing an existing code fragment (e.g., local sequence alignment in bioinformatics) or to applying a tool to a specific dataset. Some courses even require students to perform the steps of an algorithm in 'pen-and-paper' style. While these approaches indeed foster the understanding of how the algorithms work and how they can be used, they do not explicitly focus on the strengths and limitations of methods. Moreover, they do not actively demand an understanding of the methods and their case-aware application.
In ongoing courses teaching bachelor's and master's students unsupervised machine learning methods, using clustering as an example, we provide datasets and ask the students to interactively run different methods on these datasets with different parameter settings, using ELKI (Schubert and Zimek, 2019). The students are then instructed to note what they observe regarding differences in the clustering results. In the case of data streams we use the MOA framework (Bifet et al., 2011) such that the students can simulate
and observe different data stream scenarios.
To deepen the experience of learning the strengths and limitations of methods, we task the students with designing their own datasets through a sample generator (https://guoguibing.github.io/librec/datagen.html). While this generator is simple and does not require any installation or complex learning efforts, it allows students to focus on the provided task:

Design and modify datasets in such a way that the results of the clustering improve or worsen. Characterize the properties of the dataset that lead to either change in performance. Provide possible reasons with respect to the used method that explain the exceptionally good or poor performance.
With this combination of exploring the performance of algorithms under different parameter settings, across different methods, and in relation to the impact of the data's characteristics on the methods' performance, we see a Computer Supported Education approach that sustainably prepares computer scientists and domain scientists alike to become proficient in when to choose which method.
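As a minimal sketch of such an exercise, the following Python snippet uses scikit-learn as a stand-in for ELKI (which our courses actually use): the clustering method is kept fixed while a single dataset property is varied, so students can observe how the result quality changes. All dataset and parameter values are illustrative choices, not part of our course material.

```python
# Minimal sketch: keep the clustering method fixed, vary a dataset
# property (cluster spread), and observe how the result quality changes.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

for cluster_std in (0.5, 1.5, 3.0):  # students would vary such properties
    X, y_true = make_blobs(n_samples=300, centers=3,
                           cluster_std=cluster_std, random_state=0)
    y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(f"cluster_std={cluster_std}: "
          f"ARI={adjusted_rand_score(y_true, y_pred):.2f}")
```

As the clusters spread and begin to overlap, the agreement with the ground truth (measured here by the adjusted Rand index) degrades; characterizing and explaining such changes is precisely the students' task.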
2 APPROACHING THE
METHOD-APPLICATION GAP
Despite the experience gained on when to use which method, the sheer number of data analytics methods itself can make it prohibitive to explore and use novel methods that may be more suitable for the respective problem.
But what can be done to discover more suitable
algorithms with low(er) effort?
One obvious approach lies in computer supported solutions. We ask at this point: What if we had a recommendation system? A recommendation system that can be queried and then responds with archetype methods (e.g., density-based clustering, hierarchical clustering) to choose from. More importantly, a system that provides an explanation for the selection of methods, which enhances the understandability of the underlying selection. The idea of relying on such a recommender system is neither new nor far-fetched: consider movie streaming platforms that recommend movies or online market platforms that recommend products. The idea of a recommender system for suggesting algorithms has been approached by Collins et al. (2018) and is actively discussed in the Interdisciplinary Workshop on Algorithm Selection and Meta-Learning in Information Retrieval (Beel and Kotthoff, 2019).
However, two aspects remain unclear: Which questions should be posed to the recommender system, and which information (e.g., properties of the data) should be provided? To approach this problem, it is of paramount importance to have some kind of structure that allows a categorization of different data analytics methods for a specific task (in the scope of this work, we take clustering as an example). As an open question, we ask for the choice of criteria to structure the algorithms so that researchers in different fields can use the system to their advantage. In various survey papers [(Sim et al., 2013), and references within], we see tables that indicate potential structures; these, however, seem to cover only certain aspects, e.g., the way algorithms operate (bottom-up, top-down, grid-based, etc.) or their parameters, which might not be relevant for all research questions in all scientific fields.
To provide a more application-tailored structure, we deem it necessary to propose a categorization of algorithms. This idea itself is not new per se and can be seen in different approaches, e.g., meta-data information. It serves the purpose of understanding in which instance which types of data mining and machine learning algorithms are a reasonable choice. It fosters taking aspects/properties like:

1. data set specific properties,

2. algorithm specific properties, and

3. model specific properties

into account.

Each of the aspects/properties is in itself governed by certain assumptions that operate on different levels. The following list of aspects and assumptions is by no means complete. Here we face the challenge of a delicate balance between the coverage of different (use) cases and complexity, on which we elaborate in more detail in the following section.
In the case of (1) data set specific properties, we consider:
a. data type and semantic-level assumptions
b. data-origin/generation-level assumptions
c. instance-level assumptions
d. feature-level assumptions
e. pattern-level assumptions
f. outlier/anomaly-level assumptions
In the case of (2) algorithm specific properties, we suggest the following levels with their respective underlying assumptions:
a. objective-level assumptions
b. process-level assumptions
c. parameter-level assumptions
d. output-level assumptions
Lastly, in the case of (3) model specific properties, we advise the following levels, including their assumptions:
a. model-level assumptions
b. relationship-level assumptions
c. explainability-level assumptions
A categorization into different properties and assumptions is, in consequence, a form of highly structured prompt engineering. The benefits of structured prompts in the context of learning data analysis have been demonstrated with ChatGPT (Garg and Rajendran, 2024). The novel aspect that we propose here is a more systematic structuring with respect to data set, algorithm, and model properties, with a benefit that is two-fold: For students it fosters thinking in more structured ways about the input, the properties of methods, and the output, while for the recommender system (e.g., via ChatGPT) it enables the discovery of more suitable methods and improved explanations, since it is provided with explicit properties and assumptions to account for.
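To make this concrete, the following minimal sketch (Python) shows how such a categorization could be serialized into a highly structured prompt for a recommender, e.g., an LLM-based one. The dictionary keys mirror the property levels listed above, while all concrete property values are hypothetical examples chosen for illustration, not a fixed schema.

```python
# Minimal sketch: turn the proposed property categorization into a
# structured prompt. All concrete values below are hypothetical examples.
properties = {
    "data set": {
        "data type": "continuous, tabular",
        "instance level": "n ~ 10,000, i.i.d. assumed",
        "feature level": "50 features, partially correlated",
        "pattern level": "arbitrarily shaped clusters expected",
        "outlier level": "noise present",
    },
    "algorithm": {
        "objective level": "group similar instances",
        "parameter level": "few, intuitive parameters preferred",
        "output level": "hard cluster labels",
    },
    "model": {
        "explainability level": "results interpretable by domain experts",
    },
}

prompt_lines = ["Recommend suitable clustering method archetypes "
                "and explain why."]
for aspect, levels in properties.items():
    prompt_lines.append(f"{aspect} specific properties:")
    prompt_lines.extend(f"  - {level}: {value}"
                        for level, value in levels.items())
print("\n".join(prompt_lines))
```

The point of the sketch is not the specific wording but that every level of the categorization is made explicit, so the recommender can ground both its suggestion and its explanation in stated assumptions.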
3 PERSPECTIVES IN
DIFFERENT DOMAINS
Position 4:
Beyond the Ivory Tower: On Different Preconditions and Practices in Domain Science.
So far we have mostly taken a computer scientist's view. In the following, we include visions from the perspectives of different domains, exemplified in this work by our coauthors from biology, psychology, and bioinformatics.
3.1 Biology and Psychology Perspective
The understanding of data and how to find suitable
statistical methods varies widely between and within
biological and psychological sciences, and so does the
type of data. While ecologists may compare occur-
rence of a certain species (frequency, re-catch rates),
they may also model complex inter-species dynam-
ics. While personality psychologists may be partic-
ularly interested in inter-individual differences and
how to measure them, clinical scientists may apply
pre-post-comparisons for clinical trials, and devel-
opmental psychologists may analyze longitudinal or
nested data in path models and multilevel modeling.
There is definitely not ’one size fits all’ in biology and
psychology, so the perspective given here is on a very
narrow field dealing with morphometric, psychomet-
ric, and quantitative parameter data, by no means rep-
resentative and based on subjective experiences. This
perspective also comments on to what extent knowl-
edge on data analysis is (or is not) present among stu-
dents again, from a very limited, subjective angle.
Biology students often seem to lack a basic understanding of how to statistically analyze their data beyond reporting descriptive statistics. This may be due to statistics or biostatistics being only a footnote (or one class) in the undergraduate curriculum. Still, students are expected to perform data analysis at the end of their undergraduate studies, and far too often basic statistics education starts within the lab in which they have decided to write their Bachelor's thesis. Therefore, we may need to start with the basics: What kind of data do we deal with (continuous, ordinal, nominal)? How is the data distributed (normal versus non-normal)? Is the data dependent or independent? How is the variance distributed (heteroscedasticity), and why does that even matter? Can I/should I normalize my data, and if so, how? How do I find the correct statistical test for my scientific question? A common workflow applied to many types of parametric biological data by researchers with little statistical knowledge may look like this (a sketch in code follows the list):
- Gather and prepare data
- Test for normality with Shapiro-Wilk test
(Shapiro and Wilk, 1965)
- Transform data if not normally distributed with
simple transformations (log, log10, exponential)
- Univariate methods: t-test for normally dis-
tributed data, Wilcoxon-test (Wilcoxon, 1945) for
non-normally distributed data
- Multivariate methods: ANOVA (Fisher, 1935;
Girden, 1992) for normally distributed data, PCA
for non-normally distributed data (Pearson, 1901)
- Then consider correction for multiple compar-
isons, e.g., (Bonferroni, 1936; Benjamini and
Hochberg, 1995)
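The following minimal sketch (Python/SciPy) illustrates this cookbook workflow for the simple case of two independent groups of continuous measurements; the synthetic lognormal data and the 0.05 threshold are illustrative assumptions only.

```python
# Minimal sketch of the cookbook workflow: test normality, transform if
# needed, then pick a parametric or rank-based comparison test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=40)  # skewed toy data
group_b = rng.lognormal(mean=0.3, sigma=0.5, size=40)

# Test for normality (Shapiro-Wilk); apply a simple log transform if needed.
if min(stats.shapiro(group_a).pvalue, stats.shapiro(group_b).pvalue) < 0.05:
    group_a, group_b = np.log(group_a), np.log(group_b)

# Choose the comparison test depending on (post-transform) normality.
if min(stats.shapiro(group_a).pvalue, stats.shapiro(group_b).pvalue) >= 0.05:
    result = stats.ttest_ind(group_a, group_b)      # parametric t-test
else:
    result = stats.mannwhitneyu(group_a, group_b)   # Wilcoxon-Mann-Whitney
print(result)
```

With more than two groups, the comparison step would be replaced by ANOVA or a rank-based alternative, followed by a multiple-comparison correction as listed above.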
More specifically, let us look at two examples. In functional morphology, we use Geometric Morphometrics [(Adams et al., 2004), and references within] to study shape using landmark and semi-landmark coordinates that capture morphological features. The resulting Cartesian coordinate data is treated in the following way (a sketch in code follows the list):
- Conduct Procrustes Superimposition (Dryden and Mardia, 1998) (to exclude size as a factor)
- Perform Principal Component Analysis (PCA) (Pearson, 1901) using the landmark coordinates
- Plot PC1 and PC2, use an appropriate statistical test to compare means (e.g., Mann-Whitney U (Mann and Whitney, 1947), Wilcoxon (Wilcoxon, 1945), Dunn's (Dunn, 1964)), using correction for multiple comparisons if applicable
- Test for covariation between analyzed features with two-block Partial Least Squares (2-block PLS)
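A minimal sketch of the first two steps is given below, with the simplifying assumption that all specimens are superimposed onto the first one; a full Generalized Procrustes Analysis would iteratively align to a mean shape, and the random landmark data here is purely illustrative.

```python
# Minimal sketch: pairwise Procrustes superimposition followed by PCA on
# the aligned landmark coordinates (simplified stand-in for full GPA).
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
shapes = rng.normal(size=(20, 10, 2))  # 20 specimens, 10 landmarks, (x, y)

reference = shapes[0]
aligned = [procrustes(reference, s)[1] for s in shapes]  # superimposed copies

scores = PCA(n_components=2).fit_transform(
    np.array(aligned).reshape(len(aligned), -1))  # flatten landmark sets
print(scores[:, 0])  # PC1 scores, ready for plotting and group comparisons
```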
In a second, completely different study, we may apply 3D Surface Texture Analysis to obtain characteristics of biological surfaces (eggshells, bones, teeth, etc.) [(Attard et al., 2023), (Winkler et al., 2022), (Martisius et al., 2020)]. The obtained surface data are expressed as standardized ISO roughness parameters that are then treated as follows:
- Test for normality and heteroscedasticity of the parameters
- Perform normalization
- Compare means between groups with appropriate tests (t-test, Wilcoxon (Wilcoxon, 1945), Dunn's (Dunn, 1964))
- Conduct PCA to reduce dimensions, as up to 50 parameters are often obtained
This sequence is not wrong, but it follows a basic cookbook structure that some students may have acquired from their supervisors while lacking the understanding of where to adjust it and how to advance. Unfortunately, some steps may even be skipped or ignored if researchers are not aware of their importance; for example, if normality is not tested, the default analysis when comparing the means of multiple groups may always be ANOVA, and without correction for multiple comparisons, type I error rates may be inflated. We are not trying to paint a picture of incompetent researchers here, but we need to address the fact that there is no formalized education in data analysis for the biological sciences, and as a result we may have very different competence levels among students and researchers. An accessible and hands-on approach to data analysis using a sample dichotomous decision tree (Breiman et al., 1984) (illustrated by examples, and minimally sketched below) that can be used by researchers and students of different proficiency would be a great tool to support data analysis on a consistent level. From this level, it would be possible to advance to modeling and multivariate methods, which are not as common.
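As an illustration of what such a decision aid could look like, the following sketch hard-codes a tiny dichotomous tree for the two-group comparison case; the branches and test names are a deliberately simplified example, not a complete tool.

```python
# Minimal sketch of a dichotomous decision aid for the two-group case.
# A real tool would cover far more branches (>2 groups, ordinal data,
# heteroscedasticity, repeated measures, ...).
def suggest_test(continuous: bool, normal: bool, independent: bool) -> str:
    if not continuous:
        return "chi-squared / Fisher's exact test"
    if normal:
        return "independent t-test" if independent else "paired t-test"
    return ("Mann-Whitney U test" if independent
            else "Wilcoxon signed-rank test")

print(suggest_test(continuous=True, normal=False, independent=True))
```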
In contrast to biology, research methods and statistics are crucial parts of study programs in psychology, both at the undergraduate and graduate level. While undergraduate curricula commonly focus on descriptive statistics, exploratory data analysis, and basic inferential statistics (e.g., graphical data analysis, correlational methods, t-tests), graduate curricula focus more on advanced inferential statistics (e.g., multiple regression, ANOVA, non-parametric tests, multilevel linear models, structural equation models) as well as methods exploring clusters and latent factors (factor analysis, cluster analysis; see (Field, 2024) for a popular book on statistical analyses in psychology). However, from our perspective, new algorithms from database and machine learning research rarely enter common statistical analyses in psychology, despite the shift from SPSS as the usual statistical software to the more versatile R, even though they might prove useful, particularly for complex multilevel and time series data sets.
3.2 Bioinformatics Perspective
The field of bioinformatics presents unique challenges due to the complexity and diversity of biological data, especially OMICs data (Li and Wong, 2008). Researchers often deal with high-dimensional datasets, noisy measurements, and intricate biological networks. The choice of computational methods significantly impacts the ability to uncover meaningful biological insights. Here, we illustrate how selecting appropriate algorithms can make a substantial difference in bioinformatics research outcomes.
Gene Expression Clustering

Clustering algorithms are essential for analyzing gene expression data to identify groups of genes with similar expression patterns, cf. (Eisen et al., 1998).
- Many bioinformaticians default to k-means clustering because of its simplicity and ease of implementation.
- K-means (Jain, 2010) assumes spherical clusters of equal variance and may not capture the true structure of gene expression data, which often contains irregularly shaped clusters and varying cluster sizes.
- Density-based clustering algorithms like DBSCAN (Ester et al., 1996) can better identify clusters of arbitrary shapes and are robust to noise. Such methods can reveal subtle gene expression patterns associated with specific biological conditions or phenotypes that k-means might miss (see the sketch below).
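The contrast can be demonstrated in a few lines; the following sketch uses scikit-learn's two-moons toy data as a stand-in for irregularly shaped expression clusters, and the DBSCAN parameters are illustrative choices.

```python
# Minimal sketch: k-means vs. DBSCAN on non-spherical toy clusters.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
for name, model in [("k-means", KMeans(n_clusters=2, n_init=10,
                                       random_state=0)),
                    ("DBSCAN", DBSCAN(eps=0.3, min_samples=5))]:
    labels = model.fit_predict(X)
    print(f"{name}: ARI={adjusted_rand_score(y_true, labels):.2f}")
```

K-means splits the two interleaved moons along a straight boundary, while DBSCAN recovers the arbitrarily shaped groups, illustrating exactly the assumption mismatch described above.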
Sequence Alignment and Assembly

Accurate sequence alignment and genome assembly are critical for understanding genetic information.
- Tools like BLAST (Altschul et al., 1990) for alignment and assemblers based on de Bruijn graphs (Pevzner et al., 2001) are widely used due to their speed and familiarity.
- These methods may not handle genomic variations like large insertions, deletions, or repetitive sequences effectively.
- Employing algorithms such as Smith-Waterman for local alignment (Smith and Waterman, 1981), or assemblers like SPAdes (Bankevich et al., 2012) and Canu (Koren et al., 2017) that are designed to work with long-read sequencing data, can provide more accurate results (a minimal sketch of Smith-Waterman scoring follows this list). These methods account for complex genomic rearrangements and repetitive regions, leading to better assembly quality.
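For illustration, the following sketch implements the Smith-Waterman scoring recursion with a simple linear gap penalty; traceback and the substitution matrices used in practice are omitted for brevity, and the scoring parameters are illustrative.

```python
# Minimal sketch of Smith-Waterman local alignment scoring (linear gaps).
def smith_waterman_score(a: str, b: str,
                         match=2, mismatch=-1, gap=-2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clipped at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])  # best local alignment score
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))  # toy sequences
```

The zero floor in the recursion is what makes the alignment local: poorly matching prefixes are discarded rather than dragged along, which is why the method recovers conserved subsequences that global scoring can dilute.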
Dimensionality Reduction in Single-Cell RNA-Seq Analysis

Single-cell RNA sequencing (scRNA-seq) (Macosko et al., 2015) generates high-dimensional data that require dimensionality reduction for visualization and interpretation.
- Researchers often use Principal Component Analysis (PCA) due to its ability to reduce dimensionality while preserving variance.
- PCA is a linear method and may not capture the non-linear relationships inherent in scRNA-seq data, potentially obscuring meaningful biological variation.
- Non-linear dimensionality reduction techniques like t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) and Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) preserve local and global data structures, respectively. These methods can uncover cell subpopulations and developmental trajectories that PCA might overlook (see the sketch below).
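The following sketch contrasts the two embedding families on generic high-dimensional data, using scikit-learn's digits dataset as a stand-in for an expression matrix; real scRNA-seq pipelines would add count normalization and typically run t-SNE or UMAP on a PCA-reduced matrix.

```python
# Minimal sketch: linear (PCA) vs. non-linear (t-SNE) 2D embeddings.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)            # 64-dimensional stand-in data
pca_2d = PCA(n_components=2).fit_transform(X)  # preserves global variance
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(pca_2d.shape, tsne_2d.shape)             # both (n_samples, 2)
```

Plotting the two embeddings side by side typically shows t-SNE separating locally coherent groups far more cleanly than the linear PCA projection, mirroring the subpopulation argument above.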
Protein Structure Prediction

Predicting protein structures from amino acid sequences is fundamental for understanding protein function.
- Traditional methods like homology modeling (Schwede et al., 2003) rely on known structures of similar proteins but may not work well for proteins without close homologs.
- Sole reliance on homology models can lead to inaccuracies when templates are distant or unavailable.
- Utilizing advanced algorithms like AlphaFold (Jumper et al., 2021), which employs deep learning techniques, can predict protein structures with high accuracy even in the absence of close homologs. Incorporating such methods can significantly enhance the understanding of protein functions and interactions.
Phylogenetic Analysis

Constructing phylogenetic trees helps in understanding evolutionary relationships among species or genes.
- Methods like Neighbor-Joining (NJ) (Saitou and Nei, 1987) are popular for their simplicity and speed.
- NJ may not account for varying rates of evolution across lineages or the complexities of genomic data, potentially resulting in incorrect tree topologies.
- Maximum Likelihood (ML) (Felsenstein, 1985) and Bayesian Inference (Huelsenbeck and Ronquist, 2001) methods provide more accurate phylogenetic reconstructions by modeling sequence evolution more comprehensively. Although computationally intensive, these methods can yield insights into evolutionary processes that NJ cannot.
Outlier Detection in Genomic Data

Identifying outliers is important for quality control and detecting rare variants.
- Simple statistical thresholds or z-scores are commonly used to flag outliers.
- These methods may not account for the complex, high-dimensional structure of genomic data, leading to false positives or negatives.
- Robust Mahalanobis Distance (Rousseeuw and Driessen, 1999) or Isolation Forests (Liu et al., 2008) can detect multivariate outliers by considering the covariance structure of the data (see the sketch below). Applying these algorithms improves the accuracy of outlier detection, ensuring that downstream analyses are based on high-quality data.
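The following sketch contrasts a univariate z-score rule with an Isolation Forest on strongly correlated toy data, where a point can stay within each feature's normal range while violating the joint covariance structure; the data and the |z| > 3 threshold are illustrative assumptions.

```python
# Minimal sketch: univariate z-scores vs. Isolation Forest on a
# multivariate outlier that hides inside the per-feature ranges.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.95], [0.95, 1]], size=500)
X = np.vstack([X, [[2.0, -2.0]]])  # off-axis point violating correlation

z_flags = (np.abs((X - X.mean(0)) / X.std(0)) > 3).any(axis=1)
iso_flags = IsolationForest(random_state=0).fit_predict(X) == -1
print("z-score flags last point:", z_flags[-1])
print("isolation forest flags last point:", iso_flags[-1])
```

The appended point has per-feature z-scores of only about 2, so the univariate rule misses it, while the Isolation Forest is likely to isolate it because it lies far off the correlation axis shared by all other points.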
These examples from bioinformatics demonstrate that
the choice of algorithm profoundly influences re-
search findings. By expanding the repertoire of com-
putational methods and making informed algorithm
selections, bioinformaticians can enhance the qual-
ity and impact of their research. A recommendation
system would serve as a valuable tool, guiding re-
searchers and students toward methods best suited to
their specific data characteristics and research ques-
tions.
Another interesting observation is that while these different tests are prominent and actively used in the domain sciences, we do not observe them explicitly in the processes of data mining and machine learning, such as the KDD process (Fayyad et al., 1996). We would like to remind the reader that the experiences shared and the structures provided are not intended to be regarded as generally valid or as by any means complete.
4 CONCLUSION
In this paper we elaborate on the challenges that arise with the richness of different methods for data analytics and the need to educate on the decision of when to use which method. We discuss four positions related to this problem. These positions encompass: (1) there is a rich plethora of methods, which is a blessing and, in light of the sheer amount, at the same time a curse; (2) automated data analytics pipelines are not a 'holy grail', meaning that learning when to use which method is of paramount importance; (3) computer supported approaches to understand the strengths and weaknesses of methods are indispensable; and (4) to facilitate informed decision making across different domains, it is required to first understand their common practices and education approaches for data analytics. In conclusion, we hope that with this position paper we can foster fruitful discussions toward computer supported education in data analytics with the goal of education supported computing across domains.
ACKNOWLEDGEMENTS
The project is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Kiel University UP23/1, and University of Lübeck in the context of DenkRaum, an inter- and transdisciplinary fellowship program for postdoctoral researchers.
REFERENCES
Adams, D. C., Rohlf, F. J., and Slice, D. E. (2004). Geomet-
ric morphometrics: ten years of progress following the
‘revolution’. Italian journal of zoology, 71(1):5–16.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and
Lipman, D. J. (1990). Basic local alignment search
tool. Journal of Molecular Biology, 215(3):403–410.
Attard, M. R., Bowen, J., and Portugal, S. J. (2023).
Surface texture heterogeneity in maculated bird
eggshells. Journal of the Royal Society Interface,
20(204):20230293.
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A.,
Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko,
S. I., Pham, S. P., Prjibelski, A. D., Pyshkin, A. V.,
Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev,
M. A., and Pevzner, P. A. (2012). SPAdes: A new
genome assembly algorithm and its applications to
single-cell sequencing. Journal of Computational Bi-
ology, 19(5):455–477.
Barbudo, R., Ventura, S., and Romero, J. R. (2023). Eight years of AutoML: categorisation, review and trends. Knowledge and Information Systems, 65(12):5097–5149.
Beel, J. and Kotthoff, L. (2019). Preface: The 1st interdisciplinary workshop on algorithm selection and meta-learning in information retrieval (AMIR). In AMIR@ECIR, pages 1–9.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the
false discovery rate: A practical and powerful ap-
proach to multiple testing. Journal of the Royal Statis-
tical Society: Series B (Methodological), 57(1):289–
300.
Bifet, A., Holmes, G., Pfahringer, B., Read, J., Kranen, P., Kremer, H., Jansen, T., and Seidl, T. (2011). MOA: a real-time analytics open source framework. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III 22, pages 617–620. Springer.
Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Libreria Internazionale Seeber, Florence, Italy.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984).
Classification and Regression Trees. Wadsworth Inter-
national Group, Belmont, CA, USA.
Dryden, I. L. and Mardia, K. V. (1998). Statistical Shape
Analysis. John Wiley & Sons, Chichester, UK.
Dunn, O. J. (1964). Multiple comparisons using rank sums.
Technometrics, 6(3):241–252.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein,
D. (1998). Cluster analysis and display of genome-
wide expression patterns. Proceedings of the National
Academy of Sciences, 95(25):14863–14868.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34.
Felsenstein, J. (1985). Confidence limits on phyloge-
nies: An approach using the bootstrap. Evolution,
39(4):783–791.
Field, A. (2024). Discovering statistics using IBM SPSS
statistics. Sage publications limited.
Fisher, R. A. (1935). The Design of Experiments. Oliver
and Boyd, Edinburgh.
Garg, A. and Rajendran, R. (2024). The impact of struc-
tured prompt-driven generative ai on learning data
analysis in engineering students. In CSEDU (2), pages
270–277.
Girden, E. R. (1992). ANOVA: Repeated Measures, vol-
ume 84 of Quantitative Applications in the Social Sci-
ences. SAGE Publications, Newbury Park, CA.
Huelsenbeck, J. P. and Ronquist, F. (2001). MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755.
Jain, A. K. (2010). Data clustering: 50 years beyond k-
means. Pattern recognition letters, 31(8):651–666.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Li, T. H., Degrave, R. J. L., Bickerton, C. M., Meyer, W. J., Velankar, A. A., and Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589.
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., and Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722–736.
Li, C. and Wong, W. H. (2008). Model-based analysis of
oligonucleotide arrays: Expression index computation
and outlier detection. Proceedings of the National
Academy of Sciences, 98(1):31–36.
Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation
forest. In Proceedings of the 2008 Eighth IEEE Inter-
national Conference on Data Mining, pages 413–422.
IEEE.
Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar,
K., Goldman, M., Tirosh, I., Bialas, A. R., Kamitaki,
N., Martersteck, E. M., Trombetta, J. J., Weitz, D. A.,
and Regev, A. (2015). Highly parallel genome-wide
expression profiling of individual cells using nanoliter
droplets. Cell, 161(5):1202–1214.
Mann, H. B. and Whitney, D. R. (1947). On a test of
whether one of two random variables is stochastically
larger than the other. The Annals of Mathematical
Statistics, 18(1):50–60.
Martisius, N. L., McPherron, S. P., Schulz-Kornas, E.,
Soressi, M., and Steele, T. E. (2020). A method for the
taphonomic assessment of bone tools using 3d surface
texture analysis of bone microtopography. Archaeo-
logical and Anthropological Sciences, 12:1–16.
McInnes, L., Healy, J., Saul, N., and Großberger, L. (2018). UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861.
Pearson, K. (1901). On lines and planes of closest fit to
systems of points in space. The London, Edinburgh,
and Dublin Philosophical Magazine and Journal of
Science, 2(11):559–572.
Pevzner, P. A., Tang, H., and Waterman, M. S. (2001).
An eulerian path approach to dna fragment assem-
bly. Proceedings of the National Academy of Sciences,
98(17):9748–9753.
Quaranta, L., Azevedo, K., Calefato, F., and Kalinowski,
M. (2024). A multivocal literature review on the ben-
efits and limitations of industry-leading automl tools.
Information and Software Technology, page 107608.
Rousseeuw, P. J. and Driessen, K. V. (1999). A fast algo-
rithm for the minimum covariance determinant esti-
mator. Technometrics, 41(3):212–223.
Saitou, N. and Nei, M. (1987). The neighbor-joining
method: A new method for reconstructing phylo-
genetic trees. Molecular Biology and Evolution,
4(4):406–425.
Schubert, E. and Zimek, A. (2019). ELKI: A large open-source library for data analysis - ELKI release 0.7.5 "Heidelberg". arXiv preprint arXiv:1902.03616.
Schwede, T., Kopp, J., Guex, N., and Peitsch, M. C.
(2003). Swiss-model: An automated protein
homology-modeling server. Nucleic Acids Research,
31(13):3381–3385.
Shapiro, S. S. and Wilk, M. B. (1965). An analysis
of variance test for normality (complete samples).
Biometrika, 52(3-4):591–611.
Sim, K., Gopalkrishnan, V., Zimek, A., and Cong, G.
(2013). A survey on enhanced subspace clustering.
Data mining and knowledge discovery, 26:332–397.
Smith, T. F. and Waterman, M. S. (1981). Identification of
common molecular subsequences. Journal of Molec-
ular Biology, 147(1):195–197.
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
Wilcoxon, F. (1945). Individual comparisons by ranking
methods. Biometrics Bulletin, 1(6):80–83.
Winkler, D. E., Kubo, T., Kubo, M. O., Kaiser, T. M., and Tütken, T. (2022). First application of dental microwear texture analysis to infer theropod feeding ecology. Palaeontology, 65(6):e12632.
Xanthopoulos, I., Tsamardinos, I., Christophides, V., Simon, E., and Salinger, A. (2020). Putting the human back in the AutoML loop. In EDBT/ICDT Workshops.