GAna: Model Generators and Data Analysts for Streamlined Processing
Stephanie C. Fendrich¹, Philipp Flügger¹, Annegret Janzso¹, David Kaub¹, Stefan Klein¹, Anna Kravets¹, Patrick Mertes¹, Nhat Tran¹, Jan Ole Berndt¹ and Ingo J. Timm¹,²
¹Cognitive Social Simulation, German Research Center for Artificial Intelligence, 54296 Trier, Germany
²Business Informatics, Trier University, Universitaetsring 15, 54296 Trier, Germany
{stephanie.fendrich, philipp.fluegger, annegret.janzso, david.kaub, stefan.klein, patrick.mertes, nhat.tran, anna.kravets, jan_ole.berndt, ingo.timm}@dfki.de
Keywords:
Data Processing, Workflow, Explainable AI, Transparency, Agent-Based Modeling, Simulation.
Abstract:
In the evolving landscape of Explainable AI, reliable and transparent data processing is essential to ensure
trustworthiness in model development. While agent-based modeling and simulation are used to provide in-
sights into complex systems, this becomes vital when applying results to decision-making processes. This
paper presents the GAna workflow, an approach that integrates model generation and data analysis to
streamline the workflow from data preprocessing to result interpretation. By automating data handling and fa-
cilitating the reuse of processed and generated data, the GAna workflow significantly reduces the manual effort
and computational expense typically associated with creating synthetic populations and other data-intensive
tasks. We demonstrate the effectiveness of the workflow through two distinct case studies, highlighting its
potential to enhance transparency in AI applications.
1 INTRODUCTION
In the shift towards Explainable AI, transparency and
trustworthiness are key principles. Simulation-based
approaches, such as Agent-Based Modeling (ABM), contribute to these principles, as decisions made by
agents are usually based on clear decision rules,
mostly derived from empirically grounded theories,
such as from psychology or sociology. Furthermore,
the individualized consideration of agents can help
explain which factors influence decision-making and,
through emergence, can lead to the overall behav-
ior of a population under consideration at the macro
level. This allows outputs to be backtracked and pro-
cesses to be more explainable by following the steps
taken in the simulated environment, based on the di-
rectly specified parameters. In the following, men-
tions of models typically refer to ABM, unless other-
wise specified.
Over the last decades, there has been a significant
increase in the use of ABM in various disciplines,
such as social sciences, behavioral sciences, urban
land-use modeling, or spatial sciences. For instance,
ABMs have been employed to study social behaviors
and dynamics (Asgharpour et al., 2010; Hedström and
Manzo, 2015), land-use patterns (Huang et al., 2014;
Matthews et al., 2007) and spatial processes (Torrens,
2010). Ensuring that models are understandable and
their processes are clear is essential for building trust
in their outcomes, particularly in real-world applica-
tions like these.
In the context of ABM, this is especially relevant
to how data are handled, as transparent data prepara-
tion and reuse can significantly impact the reliability
and interpretability of the model. Hence, the majority
of time spent in ABM is often directed towards data
preparation and analysis rather than the design and
implementation of the models themselves (cf. Lee
et al. 2015; Munson 2012). Thus, by automating these
phases, the entire development process, from design
to dissemination, can be made more efficient. For ex-
ample, by reusing components such as (preprocessed)
synthetic population data or infrastructure modules,
researchers can reduce the resources required for data
preparation in models with similar elements. This
reuse, thus, allows a greater focus on refining the
models and interpreting their results.
In this paper, we propose the GAna (Model
Generators and Data Analysts) workflow, which aims
to streamline both preprocessing and postprocessing
of data in the modeling process. By including these
steps in the overall application, it becomes easier to
backtrack the computational processes determining
results, thereby increasing explainability and promot-
ing reproducibility. Due to the automation of pro-
cesses that are often manual, this concept is less prone
to human error. Additionally, both of these aspects are
likely to increase user trust in the completed model,
thereby increasing acceptance. Moreover, this im-
proves the efficiency of the overall workflow, allow-
ing for faster and cleaner results at every step. The
development of intelligent model generators and data
analysts can be applied to a wide range of practical
applications. While we mainly focus on the bene-
fits of such a workflow for ABM and especially social
simulation, it is possible to transfer the concept to ad-
ditional fields such as Natural Language Processing.
The straightforward approach allows for quick adjust-
ments to individual components, allowing for the ex-
change of data, required formats, and model types.
The following Section 2 details the baseline upon
which the concept has been built, focusing on simi-
lar previous approaches and tools, as well as a gen-
eral overview of methods to enhance reproducibil-
ity and efficiency in simulation experiments. Subse-
quently, Section 3 formally introduces the workflow
concept by discussing the functionalities of the com-
ponents within the workflow, as well as requirements
for those components. In Section 4 the practical use
of the concept is highlighted, featuring two projects as
case studies which successfully make use of the core
components in the workflow - model generators and
data analysts. These projects focus on different top-
ics, with AKRIMA dealing with crisis management,
and GreenTwin focusing on last-mile logistics. This
demonstrates how the GAna workflow is applicable
to a variety of projects in ABM contexts. Finally, re-
sults and limitations are discussed in Section 5, with
an outlook on future work.
2 ENHANCING REPRODUCIBILITY IN SIMULATION EXPERIMENTS
The increasing use of simulation experiments across
various research fields presents several challenges.
One of these challenges is bridging the gap between
the technical knowledge of developers, who design
and implement the models, and the domain-specific
expertise—such as in the social or economic sci-
ences—held by experts in those fields. This gap often
leads to errors, particularly during the manual cus-
tomization of models, which in turn affects the re-
producibility of experiments. Consequently, there is
a growing need for systematic methodologies to en-
hance reproducibility, aligning with the broader ob-
jectives of enhancing transparency and accountability
in computational science. According to Dalle (2012),
both technical and human-related factors hinder re-
producibility. An additional challenge lies in the in-
consistent and sometimes careless use of the term “reproducibility” itself. Careful attention must be paid
to the correct usage of the term to prevent misunder-
standings, as the related terminology is not consis-
tently defined (Feitelson, 2015).
To mitigate these issues, automation is empha-
sized as a critical solution, reducing human interven-
tion and thereby minimizing the risk of errors in simu-
lation studies. Reproducibility in modeling and simu-
lation is inherently limited, which is why Taylor et al.
(2018) differentiate between the “art” and “science”
of simulation. They argue that while the scientific el-
ements of models—such as data collection and com-
putational modules—are often reproducible, the artis-
tic aspects, like conceptual modeling, rely heavily on
tacit knowledge and are therefore less reproducible.
Nevertheless, reproducibility remains critical for en-
suring scientific rigor in modeling and simulation.
Without it, research findings may lack credibility and
broader applicability. As a result, reproducibility has
become a widely recognized best practice in science
(Feger and Woźniak, 2022). Challenges to achieving
full reproducibility include legal restrictions, evolving
software platforms, and the inherent complexities of
model construction. Taylor et al. (2018) suggest that
while full reproducibility may not always be attain-
able, improved documentation, open access practices,
and standardization can significantly enhance trans-
parency and accountability in the field. Effective doc-
umentation is especially crucial, as poor documenta-
tion is often the primary reason experiments cannot
be reproduced (Raghupathi et al., 2022). Therefore,
addressing these challenges requires thorough docu-
mentation of the processes behind data creation.
Provenance, as highlighted by Herschel et al.
(2017), ensures that data processing steps are trans-
parent, reproducible, and verifiable, which enhances
the reliability of AI models. Provenance documents
how data are created, manipulated, and interpreted,
allowing users to trace each step and validate the re-
sults. Ruscheinski and Uhrmacher (2017) identify key
gaps in current provenance methodologies and pro-
pose a model to bridge these gaps by effectively doc-
umenting and managing the processes behind simula-
tion models and data. In further research, Ruscheinski
et al. (2018) present advancements in the application
of the PROV Data Model (PROV-DM) to simulation
models, proposing a PROV ontology to capture the
provenance of these models. While earlier research
primarily focused on documenting the provenance of
simulation data, this work shifts the focus to the mod-
els themselves, addressing the complexities of model
development. The authors emphasize the importance
of documenting the entire process of model genera-
tion, including the relationships between data, simu-
lation experiments, and model refinements, ensuring
that each step is traceable and verifiable. By leverag-
ing PROV-DM, they provide a framework for identi-
fying and relating the entities and activities involved
in the development of a simulation model.
Beyond the approach presented, there are addi-
tional, more technical methods for designing simu-
lation experiments. Teran-Somohano et al. (2015) demonstrate a model-driven approach, offering web-
based assistance for creating simulation experiments.
This allows experts from various domains to design
experiments without needing expertise in experimen-
tal design or specialized knowledge. Another ap-
proach involves using schemas to describe an exper-
iment, which are then mapped to executable code
(Wilsdorf et al., 2019). When models from different
domains already exist, it is possible to merge them.
Pierce et al. (2018) present an iterative approach for
such a procedure. Additionally, existing models from
previous studies can be adapted to the specific con-
ditions of a new study (Wilsdorf et al., 2021). These
approaches support both model development and the
documentation of experiments.
GAna aims to increase reproducibility and effi-
ciency by structuring and automating operations with
data for pre- and postprocessing, e.g., enabling trans-
fer to other regions, domains, and models (see also Skoogh and Johansson, 2008). The presented ap-
proach focuses on improving both the efficiency and
quality of input and output data management in sim-
ulation contexts. By streamlining the identification,
collection, and preparation of data, it helps address
common challenges in these areas, ensuring that data
used in simulations is reliable and of high quality.
3 THE GAna WORKFLOW
In this section, we present the GAna workflow, which
outlines the key steps ranging from selecting and processing data inputs for use in a model to the evaluation and analysis of the model output. Our group works
on cognitive social systems that include important
recurring components in modeling, such as popula-
tion structures, daily routines (e.g., job assignments)
and infrastructure setup. To streamline future devel-
opments, the value of reusable components through
automation and process structuring has become ap-
parent over time. The structuring of components in
this approach simplifies the integration and reuse of
generators and analysts in the modeling process and
ensures detailed documentation of input and output
files. The group recognizes that different use cases
require different levels of detail and therefore empha-
sizes the importance of modularity in the initial setup
and adaptation. By creating common and meaning-
ful interfaces, the group aims to improve interoper-
ability and adaptability in future modeling contexts.
These concepts were originally developed in the AKRIMA project and later applied in the GreenTwin project as a second use case to demonstrate their wider applicability and potential for continued development.
First, we describe the general workflow structure
in its entirety. Subsequently, we focus on each com-
ponent in the workflow, introduce its functionalities,
and discuss requirements that should be met when im-
plementing the approach. Since the needs of different
working groups or projects will vary, the formulated
requirements and how they can be achieved and vali-
dated might need to be extended or adapted.
The upper half of Figure 1 displays the steps of the
GAna workflow, consisting of the components: data,
model generator, model, data analyst and output. The
data input, further discussed in Section 3.1, is criti-
cal and is tailored both to the model’s data require-
ments and the hypotheses being tested. The raw input
data are used by model generators. A model genera-
tor’s purpose is to process incoming data, which often
originate from several data sources, and prepare these
for use in the respective model. Further details on
the generator’s function and structure are described
in Section 3.2. The prepared and processed data are
handed over to the respective (simulation) model (see
Section 3.3), where the model output is generated,
which can be further processed by data analysts. A
data analyst’s aim is to structure and summarize the
model output. This data postprocessing step prepares
the results for hypothesis testing and the creation of
visual aids (see Section 3.4). The GAna workflow
output primarily consists of visual aids, such as charts
and diagrams, which aid in the dissemination and dis-
cussions of results with stakeholders. Key statisti-
cal figures are used in analysis, decision-making, and
communication (see Section 3.5).
[Figure 1: The GAna Workflow Concept. The upper half shows the GAna workflow components: raw data input 1..m, model generator, prepared data 1..m, model, model output 1..m, data analyst, and analyst output 1..m. The lower half shows the accompanying hypothesis-driven development and analysis (definition of hypotheses, hypothesis testing, interpretation), with the components annotated with requirements RQ1-RQ11.]
The lower half of the figure illustrates the accom-
panying hypothesis-driven development and analysis.
While this part can theoretically be considered sep-
arately from the actual workflow, we emphasize the
importance of such model development. Forming hy-
potheses, along with specific research questions that
can be accepted or rejected through scientific inquiry,
is a foundation for the development of any model.
This approach enables clear and objective conclusions
to be drawn (Lorig et al., 2017a,b). The choice of data
input depends on the hypotheses being addressed, and
vice versa, as data availability may limit or require the
reformulation of hypotheses. The defined hypotheses
are tested using the model’s output, e.g., with the help
of data analysts, which process data accordingly. Fi-
nally, the interpretation of the results depends on the
choice of hypotheses, as well as the visualized out-
put. As described in (Lorig et al., 2017a,b), a step
towards automated testing of hypotheses can be made
by formalizing these hypotheses, e.g., by using a for-
mal specification language which allows for an auto-
mated evaluation of hypotheses using methods such
as statistical hypothesis tests (cf. Lorig et al. 2017a).
This further increases objectivity and reproducibility
by minimizing experimenter bias.
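As an illustration of such an automated evaluation, the following is a minimal sketch that compares an output metric across replications of two scenario variants using a one-sided Welch t-test; the file names and the "peak_infections" column are hypothetical and only indicate how a formalized hypothesis could be mapped to a statistical test.

import pandas as pd
from scipy import stats

# Hypothetical replication outputs; "peak_infections" is an illustrative metric.
baseline = pd.read_csv("runs_baseline.csv")["peak_infections"]
intervention = pd.read_csv("runs_intervention.csv")["peak_infections"]

# H0: the intervention does not reduce the peak (one-sided Welch t-test).
t_stat, p_value = stats.ttest_ind(baseline, intervention,
                                  equal_var=False, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0 at 5%: {p_value < 0.05}")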
3.1 Data
To be applicable for the GAna workflow, data input
must fulfill certain requirements, as indicated in Fig-
ure 1. Requirement RQ1: Data Format and Structure
refers to the format and internal structure of the in-
put data. While custom adapters can bridge the gap
between unusual file formats and the workflow, files
should be generally provided in widely accepted for-
mats, such as CSV or JSON. The structure of the files
should be standardized, i.e., ensuring consistent units
of measurement and data types. This reduces errors
and simplifies the integration of data with other model
components, such as model generators. Additionally,
good standardization eliminates the need for prepro-
cessing steps that would otherwise be necessary.
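A minimal sketch of such a structural check is given below; the expected columns, data types, and the CSV input are assumptions made for illustration, not part of the workflow specification.

import pandas as pd

# Assumed schema for a census-cell input file (illustrative only).
EXPECTED_DTYPES = {"cell_id": "int64", "population": "int64", "area_km2": "float64"}

def validate_input(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    missing = set(EXPECTED_DTYPES) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Enforce consistent data types before handing the data to a generator (RQ1).
    return df.astype(EXPECTED_DTYPES)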
The GAna workflow aims for flexibility, allow-
ing a variety of different models to benefit from it.
This is addressed in Requirement RQ2: Data Main-
tenance and Accessibility. Since model generators
can produce data inputs essential for multiple models,
commonly used data should be accessible to all users. To increase reproducibility and to allow utilization by different working groups, the usage of publicly accessible data is preferred. Additionally, with fast-
paced advancements leading to frequent data changes,
it is crucial to ensure that the data remains as current
as possible. Hence, the workflow should make use of
recurrent studies whenever possible. Another crucial
aspect is tackled in Requirement RQ3: Data Quality
concerning the completeness of the data input for the
respective application case or model scope. The files
should contain all relevant data fields, with a minimal
number of missing values. This leads directly to the
last Requirement we discuss concerning data input,
namely RQ4: Data Privacy and Usefulness. This re-
quirement addresses the trade-off between the neces-
sary anonymity of people in the data and the need for
unbiased data, ensuring that groups of people are not
completely excluded and can therefore be considered
in the subsequent data set and model (see also Leavy
et al. 2020; Kuhlman et al. 2020). Anonymity is al-
ways a prerequisite and must be guaranteed, but the data should remain valid with regard to people with particular characteristics who form a small minority in the data set yet could be of great relevance for certain models or general research (cf.
Schroeder et al. 2024).
3.2 Model Generator
The model generator typically uses raw data as in-
put, which is preprocessed according to its type and
returned in a common structure across all incoming
data. This process is repeated for every data source
and allows for the unified processing of data in differ-
ent formats. Preprocessing multiple data sources individually improves the quality of the process;
this also allows for using differently structured data
and different data types as input, which can be com-
bined later on. The modular approach enables creat-
ing multiple versions and combinations of generator
steps, thereby allowing different versions of the pro-
cessed data to be created. Combining and filtering the
data removes unnecessary, unclean, or unprocessable
data, while grouping relevant data points from multi-
ple sources enhances the overall results.
One such example is how data regarding building
location and data on population statistics can be com-
bined to create a synthetic population of an area (fur-
ther described in Section 4.3). The model generator
plays a key role in structural transformation by defin-
ing the structure of the data used in the model. Multi-
ple generators may contribute to the input of a single
model. For instance, one generator may create a synthetic population, whereas another prepares data for
the construction of street networks for models.
To ensure the effective operation of model gener-
ators within the GAna workflow, several key require-
ments must be met to maintain the quality and appli-
cability of the output. Requirement RQ5: Modular-
ity and Reusability emphasizes that model generators
should produce outputs that are not bound to any spe-
cific model. By using a well-defined data structure,
this allows for data to be reused across various ap-
plications, such as synthetic populations or road net-
works, for different modeling scenarios. By support-
ing modular, multistep processes where each compo-
nent operates independently, the system gains greater
flexibility. This modularity enables the generator to
be adapted to different models or workflows with-
out requiring significant changes, making it easier to
modify or replace components without disrupting the
overall workflow.
Another critical aspect of model generation is pre-
serving the Data Integrity and Reproducibility (RQ6).
RQ6 ensures that any transformations or processes
applied to the data do not alter its original seman-
tics or logical relationships. For example, when ag-
gregating or combining different datasets, it is es-
sential to consider the semantic overlap to maintain
coherence of the data. This helps avoid inconsis-
tencies that could arise from mismatches in mean-
ing or scope between datasets. Additionally, gener-
ators must behave deterministically, consistently pro-
ducing the same output for identical inputs, especially
when using pseudo-random values within the process.
This guarantees reproducibility and reliability across
repeated operations.
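One simple way to meet the determinism part of RQ6 is to pass an explicit, documented seed into every stochastic generator step, as in the following sketch; the household-size distribution and function name are purely illustrative.

import numpy as np

def sample_household_sizes(n_households: int, seed: int = 42) -> np.ndarray:
    rng = np.random.default_rng(seed)  # local generator, no hidden global state
    # Illustrative size distribution; real probabilities would come from statistics.
    return rng.choice([1, 2, 3, 4, 5], size=n_households,
                      p=[0.40, 0.30, 0.15, 0.10, 0.05])

# Identical inputs (including the seed) yield identical outputs across runs.
assert (sample_household_sizes(1000, seed=7) == sample_household_sizes(1000, seed=7)).all()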
Efficiency is also a fundamental requirement for the operation of model generators. Where feasible, Requirement RQ7: Performance and Scalability urges generators to run fast enough to support ad hoc analyses. They must be able
to adapt to various use cases, from handling small
datasets to scaling up for larger, more complex data
environments. For example, this could include mod-
eling different region sizes, ranging from individual
cities to larger municipalities. This adaptability en-
sures that the system remains responsive to diverse
performance needs while maintaining flexibility.
Finally, ensuring Compliance with Standards and
Transparency (RQ8) is essential. RQ8 emphasizes
the need to follow established guidelines and best
practices, grounded in well-founded methodologies.
This could be as simple as considering technical and
ethical guidelines for statistical practice, data protec-
tion regulation, or the specific data usage rights of
a dataset. For synthetic data, it is recommended to use established technical methods like Iterative Proportional Fitting (Deming and Stephan, 1940) or Simulated Annealing (Kirkpatrick et al., 1983) to guar-
antee the expected statistical properties. The system
must operate with full transparency, recording all as-
sumptions to build trust and maintain accuracy. Doc-
umenting the provenance of the entire process makes
each step traceable and verifiable, as proposed by
Ruscheinski and Uhrmacher (2017).
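To illustrate the kind of established method RQ8 refers to, the following is a minimal Iterative Proportional Fitting sketch that scales a seed contingency table to known marginals; the table dimensions and target values are invented for the example and do not stem from any project data.

import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-6, max_iter=100):
    table = np.asarray(seed, dtype=float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]  # match row marginals
        table *= (col_targets / table.sum(axis=0))[None, :]  # match column marginals
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

# Example: age groups x household types, fitted to invented census marginals.
fitted = ipf(np.ones((3, 2)),
             row_targets=np.array([50.0, 30.0, 20.0]),
             col_targets=np.array([60.0, 40.0]))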
3.3 Model
The model is treated as a black box: it receives tai-
lored inputs from the model generators, processes them, e.g., by making use of the input for simulation
runs, and produces outputs for the downstream data
analysts. Depending on the application’s focus, the
model can be, for instance, a simulation or a math-
ematical model. In our case, we typically focus on
simulation models, specifically those that make use
of ABMs, which facilitates the integration of model
generator outputs across multiple models. Examples
of such models are given in Section 4, where two case
studies are discussed.
To ensure flexibility within the GAna work-
flow, Requirement RQ9: Interface Compatibility and
Workflow Adaptability states that the model’s input
and output interfaces must be semantically aligned
with those of the generator and analyst, respectively,
for the relevant workflows. Alternatively, custom
adapters can be provided to make the data compatible
with those workflows. The model should meaning-
fully utilize the generator’s output (e.g., integrating a
synthetically generated population) and produce out-
puts that are usable for the analyst’s workflows. This
adaptability allows for model modifications without
changing the generator or analyst, preserving work-
flow integrity and efficient data transfer. Requirement
RQ10: Documentation and Usability emphasizes the
need for comprehensive documentation detailing the
model’s input and output interfaces, such as formats
and data types. Thus, it should outline the workflow
requirements for both the generator and analyst to
ensure proper integration. Clear documentation sup-
ports any necessary data transformations and efficient
use of the model within the workflow.
3.4 Data Analyst
Data analysts serve the purpose of making the model
output reveal its key insights by visualizing the data
using graphs or calculating aggregated results like
means, standard deviations, or confidence intervals. The analyst output thus could be any kind of plot, an (intermediate) data set, or even the result of auto-
mated hypothesis testing.
To accomplish this, several analyst functions are
implemented, ranging from data cleansing, validation, and aggregation steps through filtering and statistics to plotting of the data. By combining these functions
into integrated workflows, complex evaluations can
be prepared once and then be executed automatically.
A typical workflow for a data analyst looks like
this: The output of a simulation model is represented
as a data node, which then undergoes several prepro-
cessing steps. These steps involve aggregating mul-
tiple simulation runs, filtering out irrelevant data, or
cleaning and organizing the data for analysis. Once
preprocessing is complete, the data are transformed
into the required format for further analysis. The final
step is to represent the data visually or statistically, us-
ing plots, descriptive statistics, or more complex ana-
lytical methods, to extract meaningful insights.
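A stripped-down version of such a pipeline is sketched below; the column names ("run", "tick", "infected") and the input and output files are assumptions used only to show how modular, independently reusable steps can be chained.

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Remove incomplete records before any aggregation.
    return df.dropna(subset=["infected"])

def aggregate_runs(df: pd.DataFrame) -> pd.DataFrame:
    # Mean and standard deviation per time step across all simulation runs.
    return df.groupby("tick")["infected"].agg(["mean", "std"]).reset_index()

raw = pd.read_csv("model_output.csv")        # one row per run and time step (assumed layout)
summary = aggregate_runs(clean(raw))         # reusable, independently testable steps
summary.to_csv("aggregated_output.csv", index=False)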
Analysts should be structured in a modular way to allow reuse of their subcomponents. Some workflows might
be completely generic or adjustable, while others can
refer to specific aspects of the respective model’s ap-
plication area.
Like model generators, data analysts within the
GAna workflow must maintain quality, reliability, and
reusability in their processes. Requirement RQ5:
Modularity and Reusability emphasizes that data analysts should implement their evaluations as multistep processes, where each component operates independently and can be reused across different models. This adaptability ensures that evaluations can be applied to various outputs, such as analyzing affected agents in a crisis or traffic behavior during an emergency, without requiring significant changes to the overall analysis workflow. A
critical aspect of model analysis is ensuring that the
outputs remain consistent and logical throughout the
entire process.
Requirement RQ6: Data Integrity and Repro-
ducibility mandates that analysts maintain accuracy
and consistency in the outputs, especially when com-
bining or comparing multiple datasets. When inte-
grating different model outputs, it is crucial to pre-
serve the semantic relationships and logical structure.
Analysts must also ensure that outputs are determinis-
tic, consistently producing the same results for identi-
cal inputs. This ensures reliability in both current and
future analyses.
Requirement RQ7: Performance and Scalability
ensures that the analysis process operates efficiently
under various conditions. This allows for ad-hoc anal-
ysis of previously generated data, considering outputs
from models with different levels of complexity. The
analysis process should support a range of tasks, from
simple plotting to complex calculations, and make the
results quickly accessible whenever possible. Trans-
parency and adherence to industry standards are es-
sential for ensuring the credibility of the analysis out-
put, which lays the foundation for further interpre-
tation. Requirement RQ8: Compliance with Stan-
dards and Transparency highlights the importance of
data analysts following established best practices or
widely recognized methodologies to ensure both ac-
curacy and clarity of the analysis. For example, using
reputable libraries such as Plotly (https://plotly.com/python/) for data visualizations or scikit-learn (https://scikit-learn.org/) for data analysis can contribute
to standardization and reproducibility. All assump-
tions must be clearly documented to ensure trust and
transparency.
3.5 Output
The workflow results in the formation of suitable out-
put, for example, in the form of statistical key figures
or visual aids such as histograms or graphs. To be ap-
plicable in communication with stakeholders and to
be usable for dissemination, output should also fol-
low a few requirements. Firstly, output should be
easy to interpret and consistently presented to make
sure that various user groups, especially stakeholders,
can understand its meaning and derive the intended
conclusions. The given output in combination with
the respective model should be able to communicate
assumptions made as well as limitations, allowing
user groups to be aware of possible constraints (trans-
parency) (RQ11: Clarity and Consistency). Secondly,
as was already mentioned for input data, it is essential
that the data types used for output files follow indus-
try standards (see Section 3.1), enabling further pro-
cessing using common tools (RQ1: Data Format and
Structure).
The requirements formulated above should be
considered a guideline that can be extended or
adapted to the specific needs of a working group or
project. While adaptation is possible, and the specific
implementation is up to the user, they are intended as
a starting point and should help to ensure that basic
functionality as well as quality and transparency stan-
dards are met.
4 APPLYING GENERATORS AND ANALYSTS: TWO CASE STUDIES
In this section we introduce two distinct case studies
to demonstrate the applicability and effectiveness of
the proposed approach in different contexts. To this
end, the application projects are first presented, the
data and tools used to realize the workflow are ex-
plained and examples of model generators and data
analysts are described.
The AKRIMA project (Automatic Adaptive Crisis Monitoring and Management System, https://akrima.dfki.de/) aims at of-
fering a generic toolkit for monitoring arbitrary re-
gions, with a focus on logistic processes and criti-
cal infrastructure. The approach builds mostly upon
publicly available data sources that can be combined
and processed to present an extensive overview for
monitoring and crisis management. This should support decision makers in the search for appro-
priate crisis response measures. For this project, vari-
ous crisis-relevant software and analysis components
are developed and integrated, with the aim of present-
ing explainable information to support decision mak-
ers. These include, among others, a social simulation
dashboard focusing on the analysis of pandemic sce-
narios, a process simulation for evaluating business
processes during times of crisis, a map dashboard to
visualize the impact on supply chains, and a critical
infrastructure analysis that estimates the impact of cri-
sis scenarios regarding their Robustness of Accessibil-
ity (RoA) (Kaub et al., 2024). The applied software
components focus on the representation of the pop-
ulation, their homes, workplaces, critical infrastruc-
ture, geographical features like water levels or flood-
ing zones, and logistic routes.
The GreenTwin project (Green digital twin with
artificial intelligence for CO2-saving cooperative mo-
bility and logistics in rural areas) researches how pro-
environmental behavior in rural areas can be pro-
moted, with a particular focus on individual trans-
portation and logistics. For the project, several sce-
narios are investigated, using an agent-based simu-
lation approach with a Digital Twin of a rural area.
Besides delivery services and demand-driven prod-
uct ranges, scenarios regarding mobility on-demand
or shared economy are examined. These scenar-
ios are combined into a marketplace platform with
the goal of motivating individuals to move towards
pro-environmental behavior by offering compelling
and financially sensible alternatives to CO2-intensive
individual transportation (Bae et al., 2024). The
project’s Digital Twin is structured in three lay-
ers (individual, spatial, and social) with each one
requiring specific types of model generators with
varying degrees of complexity (Rodermund et al.,
2024). The GreenTwin simulation model shares many
represented entities with the previously discussed AKRIMA model. However, it focuses more on the representation of individual schedules and daily routines, such as going to work, getting groceries, or pursuing leisure activities, while considering the agents' preferred mode of transport.
Both AKRIMA and GreenTwin benefit from sev-
eral of the aforementioned workflow components, as
they have a need for tailored data from heterogeneous
sources to implement realistic model behavior. This
requirement stems from the geospatial and social na-
ture of the applied modeling approaches.
4.1 Data & Statistics
Depending on the specific scenario, various general
and more specific data sources need to be combined
to create composite data components that can be used
by subsequent processes in the workflow. These can range from crowdsourced data, such as OpenStreetMap (OSM, https://www.openstreetmap.de/), to proprietary data from companies or administrative authorities. Additionally, the results from
scientific research, in the form of theories and statis-
tics, are often required for a valid implementation of
generators and analysts.
One essential requirement shared by both case
studies is the representation of various infrastructural
entities - like street networks, buildings, Points of
Interest (POIs), or district boundaries. Therefore,
the usage of crowdsourced, publicly available data
from OSM is reasonable (RQ2). Due to the uniform
API provided by various libraries (RQ1, RQ8), deal-
ing with OSM data becomes a straightforward and
generic way to fulfill this need. OSM also performs
quite well when evaluated against the requirements
formulated in Section 3, as it comes in a standard-
ized cross-regional format, as well as getting regu-
larly maintained by a wide range of contributors. Data
quality in poorly covered regions can be compromised
(especially regarding completeness), but depending
on the specific use case, this might not have a notice-
able negative impact (RQ3).
The other fundamental data input for various
model generators is the regularly surveyed census statistics (e.g., in Germany: https://www.zensus2022.de). Unlike the non-personal data available
through OSM, census data such as from the Ger-
man census provides detailed demographic infor-
mation with a resolution as fine as 1 hectare. Due
to its regular nationwide standardized procedure and
advanced statistical methods, it also meets our formu-
lated data requirements (see Section 3.1). Data pri-
vacy (RQ4) is always a concern when dealing with
personal data, but the responsible statistical office en-
sures the anonymization of the published statistics.
4.2 Tools
In our case studies, we use several tools to handle
both preprocessing and postprocessing phases of the
model generators and data analysts respectively. At
the core of our workflow, we apply Python as a fun-
damental programming language for data preprocess-
ing and analysis. Its vast ecosystem provides libraries
and frameworks for varying use cases like geospatial computing (Pandas, https://pandas.pydata.org/) or social simulation (Mesa, https://mesa.readthedocs.io/).
Python and its libraries are widely adopted for
modeling and data processing, offering strong sup-
port for transparency and adherence to best prac-
tices, as required by RQ8. The active development
and thorough documentation of these tools ensure the
clear recording of assumptions and methodologies,
supporting accuracy, clarity, and transparency in the
GAna workflow.
Taipy (https://taipy.io/) is a versatile Python library designed to
create data-driven web applications. It allows or-
ganizing the codebase into three main components:
data nodes, tasks, and scenarios. Data nodes repre-
sent variables, tasks correspond to functions, and sce-
narios are well-defined ordered combinations of data
nodes and tasks. Such a scenario implementation is
shown in Figure 2, which visualizes the process of a
model generator realized in Taipy.
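The sketch below shows how such a scenario could be declared with Taipy's configuration API; Taipy 3.x conventions are assumed, and the identifiers, file paths, and the placeholder task function are illustrative rather than taken from the project code.

import taipy as tp
from taipy import Config

def build_population(census_df):
    # Placeholder for the actual population-generation logic.
    return census_df

# Data nodes, a task, and a scenario, mirroring the structure shown in Figure 2.
census_cfg = Config.configure_data_node("census", storage_type="csv",
                                        default_path="census_cells.csv")
population_cfg = Config.configure_data_node("population", storage_type="csv",
                                            default_path="population.csv")
task_cfg = Config.configure_task("generate_population", build_population,
                                 input=census_cfg, output=population_cfg)
scenario_cfg = Config.configure_scenario("population_generator", task_configs=[task_cfg])

if __name__ == "__main__":
    tp.Core().run()                        # start the orchestration service
    scenario = tp.create_scenario(scenario_cfg)
    tp.submit(scenario)                    # executes the task and persists the data nodes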
The modular design aligns with RQ5, enabling
components to be reused and easily modified. This al-
lows for the seamless integration of additional tasks,
such as data transformers for individual models, into
existing scenarios. The library provides a high-level
abstraction layer for defining and automating analysis
workflows in the form of a GUI, where tasks and data
nodes can be arranged and connected (e.g., via Taipy Studio, https://marketplace.visualstudio.com/items?itemName=Taipy.taipy-studio).
Taipy supports RQ6 by providing structured work-
flows that define clear relationships between data
nodes and tasks, ensuring that data transformations
are traceable and consistent. The library enforces val-
idation and control over data flows, reducing errors
and maintaining data quality. Additionally, by of-
fering a transparent and well-documented approach
to workflow design, Taipy aligns with RQ8, making
workflows more comprehensible and enabling less
technical users to modify or interpret the output.
GeoPandas (https://geopandas.org/) is a critical tool for handling and
manipulating geospatial data. It offers a range of
integrated functions that streamline the processing
steps by building on Pandas and extending the core
functionality to support various geometric operations.
One example is the possibility of spatially joining two
datasets by defining an appropriate predicate, such as
within or intersects. This is achieved due to the in-
tegration of the Shapely library, which offers a wide
range of geometric operations and classes.
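As an example of this kind of operation, the following sketch assigns buildings to census cells via the within predicate; the file names and the assumption that cell attributes are of interest are illustrative only.

import geopandas as gpd

buildings = gpd.read_file("buildings.geojson")
cells = gpd.read_file("census_cells.geojson").to_crs(buildings.crs)

# Each building inherits the attributes of the census cell it lies within.
buildings_with_cells = gpd.sjoin(buildings, cells, how="left", predicate="within")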
Its functionalities can be reused for data process-
ing across different tasks, supporting the modular ap-
proach outlined in RQ5. Besides this, GeoPandas proves to be a good choice due to its easy integration with
other data formats and libraries such as GeoJSON,
Shapefiles, or spatial databases. For our case stud-
ies, GeoPandas proves to be an essential tool due to the
nature of the data required by our models.
4.3 Model Generators
As outlined in Section 3.2, model generators serve as
preprocessing units designed for specific use cases,
depending on the model they support. The complexity
of these generators can vary significantly, from simple
operations like data filtering to complex processes in-
volving multiple data sources and advanced statistical
methods.
Figure 2: Synthetic Population Taipy Scenario.
At the simplest level, a model generator might
only load data and perform minimal preprocessing.
For example, using OSMnx (https://osmnx.readthedocs.io/), data can be queried and
filtered from OSM to retrieve specific geospatial in-
formation. In our discussed case studies, we apply
several generators of this type to load buildings, ad-
ministrative boundaries, street networks, POIs, or ge-
ographical features like rivers.
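A minimal generator of this kind could look like the following sketch; the place name is illustrative, and current OSMnx (version 1.3 or later) function names are assumed.

import osmnx as ox

place = "Trier, Germany"  # illustrative region
street_graph = ox.graph_from_place(place, network_type="drive")      # routable street network
buildings = ox.features_from_place(place, tags={"building": True})   # building footprints

# Persist the network in a model-agnostic format for reuse across models (RQ5).
ox.save_graphml(street_graph, "street_network.graphml")
print(f"{len(street_graph)} street nodes, {len(buildings)} building footprints loaded")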
A more advanced example of a model genera-
tor is used to prepare for the RoA analysis intro-
duced in Section 4. This generator integrates data
from multiple sources, including historical flooding
records from administrative authorities and outputs
from the previously mentioned street network gen-
erator. The RoA index is used to evaluate the in-
frastructural robustness under crisis scenarios such as
flooding. The generator intersects the street network
with the flooded shapes to identify flooded and non-
flooded sections in the network. The data are then
used by the method to compare accessibility of vari-
ous points in the network in disrupted and undisrupted
states.
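The intersection step can be illustrated as follows; the layer names are placeholders, and the snippet is a simplified sketch rather than the project's actual RoA implementation.

import geopandas as gpd

streets = gpd.read_file("street_segments.geojson")
flood_zones = gpd.read_file("flood_zones.geojson").to_crs(streets.crs)

# Merge all flooded areas into one geometry and flag affected street segments.
flooded_area = flood_zones.unary_union
streets["flooded"] = streets.intersects(flooded_area)

# Downstream, accessibility is compared between the full and the non-flooded network.
passable_streets = streets[~streets["flooded"]]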
Even more sophisticated generators are the Pop-
ulation Generator (cf. Figure 2), Workplace As-
signer, and Social Network Generator, which require
the integration of heterogeneous data sources from
various origins. These sources include administra-
tive authorities (e.g., census data), publicly available
data (e.g., OSM), statistical methods (e.g., Simulated
Annealing), proprietary employee data from compa-
nies, or proprietary datasets from authorities (RQ2).
More specifically, the population generation process
as shown in Figure 2 involves several steps ranging
from loading and filtering the required input data, to
creating and assigning households to cells, and ex-
plicitly mapping the households to the filtered build-
ings. Based on the specific household types, demo-
graphic attributes like age and gender can be deduced
to create a population with sufficient attributes that
can be used by various models (RQ4).
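The mapping of households to buildings can be sketched as a random assignment within each census cell; real generators rely on the statistical methods named above, and the column names, function name, and seed here are illustrative assumptions.

import numpy as np
import pandas as pd

def assign_households_to_buildings(households: pd.DataFrame,
                                   buildings: pd.DataFrame,
                                   seed: int = 42) -> pd.DataFrame:
    """Assign each household to a residential building within its census cell."""
    rng = np.random.default_rng(seed)  # deterministic for reproducibility (RQ6)
    assigned = []
    for cell_id, cell_households in households.groupby("cell_id"):
        candidates = buildings.loc[buildings["cell_id"] == cell_id, "building_id"]
        if candidates.empty:
            continue  # no residential building mapped to this cell
        choice = rng.choice(candidates.to_numpy(), size=len(cell_households))
        assigned.append(cell_households.assign(building_id=choice))
    return pd.concat(assigned, ignore_index=True)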
4.4 Data Analysts
Following the explanation in Section 3.4, data ana-
lysts serve the purpose of making the model output
reveal its key insights. Similar to the model generators,
the complexity of workflows will vary based on the
number, size, and properties of the datasets and the
specific analytical requirements.
The most commonly used analyst tool allows ba-
sic analysis of the results of agent-based simulation.
Since some degree of randomness is typically part of an ABM, and therefore of the data farming process, multiple replications of the same parameter combination are run using individual random seeds. While this allows reflecting uncertainty, e.g.,
in a decision-making process within a model, it also
has to be considered when analyzing the data.
The Generic Time series Chart Analyst allows the
aggregation of such data by solely providing the name
of a generic aggregation variable (i.e., the column
name of a dataset) in addition to the data itself, and
calculates the mean, standard deviation, and confi-
dence intervals. The resulting data are either plotted statically and exported as an image, or rendered as an interactive Plotly chart and serialized to allow, e.g., embedding it on a website. Besides the plot itself, in-
termediate results can either be used for further anal-
ysis like hypothesis testing or for result tables due to
the modular approach of such analysts.
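A condensed sketch of such an analyst is shown below; the aggregation column, the normal-approximation confidence interval, and the input file layout are illustrative assumptions, and Plotly Express is used for the interactive chart as described above.

import numpy as np
import pandas as pd
import plotly.express as px

def aggregate_time_series(df: pd.DataFrame, value_column: str) -> pd.DataFrame:
    # Mean, standard deviation, and a 95% confidence interval per tick over all replications.
    agg = df.groupby("tick")[value_column].agg(["mean", "std", "count"]).reset_index()
    agg["ci95"] = 1.96 * agg["std"] / np.sqrt(agg["count"])
    return agg

runs = pd.read_csv("model_output.csv")        # assumed: one row per replication and tick
agg = aggregate_time_series(runs, "infected")
fig = px.line(agg, x="tick", y="mean", error_y="ci95")
fig.write_html("infected_over_time.html")     # serialized for embedding, e.g., in a dashboard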
A more advanced analyst is used to process the
output of the above-mentioned RoA component. The
analyst takes data about the distances and reachability
of POIs as well as geographic information about the
administrative boundaries of the relevant area to cal-
culate scores for individual samples that can be hierar-
chically aggregated on a street, district, city, county,
or even country level. Including the street network
created by the upstream generator, data can be plot-
ted as a graph, highlighting potential problem areas or
relevant (emergency) services. Serialization of the an-
alyzed results allows downstream visualizations, e.g.,
within a dashboard or web application.
5 DISCUSSION AND CONCLUSIONS
The process of pre- and postprocessing of data takes
up a majority of time in the entire development pro-
cess of a (simulation) model. In an attempt to simplify
this process and to make it more efficient and trans-
parent for users and stakeholders, we introduced the
GAna workflow - a conceptual framework describing the steps from processing incoming data for use in a model to preparing model output for dissemination purposes. To do so, we first presented the entire work-
flow process including the alignment to hypothesis-
driven execution of simulation studies. Subsequently,
we focused on the respective steps in the workflow
whereby we defined requirements for each step that
should be met, either by the choice of data, tools or
the overall structure of the components. The core of
the workflow consists of model generators and data
analysts that are designed to be applicable in as many
contexts as possible. To demonstrate this, we intro-
duced the application of model generators and data
analysts in two simulation models with different fo-
cuses.
An important challenge lies in testing when new
modules are added or existing modules are modified.
For this reason, the development of single compo-
nents and their subsequent integration in the work-
flow is recommended. This approach also ensures
reusability. In addition, a consistency check of the
intermediate results is recommended to avoid an er-
ror propagation between the components. To address
this issue, Taipy allows emitting intermediate results
and thus enables testing for consistency. This tool is
also a good choice to reduce the implementation over-
head. Versioning the stages of development of the
components, e.g., using Git, enables their use in differ-
ent contexts at different points in time. This also en-
hances transparency and provenance of the processes.
Currently, the GAna workflow has been predom-
inantly applied to simulation models, like those pre-
sented in Section 4, which, despite differing in their
objectives, share similar settings such as the granular-
ity of the population and spatial aspects. This limits
the conclusions that can be drawn about the applica-
bility of the approach when applied to other model
types.
Hence, the next steps in the development pro-
cess of this approach include the testing of the over-
all workflow on a wider range of models than has
been done thus far. By performing evaluation studies
in terms of key characteristics such as usability and
transferability, conclusions on the general applicabil-
ity of the approach in different contexts can be drawn.
Furthermore, future work might focus on evaluating
and enhancing the robustness of the model genera-
tor by analyzing how variations in the incoming data
impact the stability and consistency of the generated
models. At the time of writing, an entire workflow
has not yet been implemented, but results could still
be obtained by using individual model generators and
data analysts without a fully automated workflow; this applies especially to the formulation and testing of hypotheses.
ACKNOWLEDGEMENTS
This work has been conducted in the projects
AKRIMA and GreenTwin. AKRIMA - Automatic Adaptive Crisis Monitoring and Management System - is a consortium project funded from 01/2022
until 03/2025 within the “Research for civil secu-
rity” program (sifo.de) by the German Federal Min-
istry of Education and Research (BMBF) under grant
number 13N16251. In its context the conceptual
development of the GAna workflow was conducted
and first generators and analysts were implemented.
GreenTwin - Green Digital Twin with artificial in-
telligence for CO2-saving cooperative mobility and
logistics in rural areas - is a project funded by the
Federal Ministry for the Environment, Nature Con-
servation, Nuclear Safety and Consumer Protection
(BMUV) (No. 67KI31073C). In this context, additional generators, e.g., to create daily routines for the agents, and specific analysts were implemented and applied to support the simulation-based analysis of last-mile logistics. We would further like to acknowledge the work of Marek Graca, Brian Krämer, Benedikt Lüken-Winkels, and Lukas Tapp in the development of this workflow.
REFERENCES
Asgharpour, A., Bravo, G., Corten, R., Gabbriellini, S.,
Geller, A., Manzo, G., Gilbert, N., Takács, K., Terna,
P., and Troitzsch, K. G. (2010). The impact of agent-
based models in the social sciences after 15 years of
incursions. History of economic ideas, 18:197.
Bae, Y. E., Berndt, J. O., Brinkmann, J., Dartmann, G.,
Hucht, A., Janzso, A., Kopp, S., Kravets, A., Lidynia,
C., Lotz, V., et al. (2024). GreenTwin: Developing a
Digital Twin for sustainable cooperative mobility and
logistics in rural areas. In Proceedings of REAL CORP
2024, pages 361–373. CORP–Competence Center of
Urban and Regional Planning.
Dalle, O. (2012). On reproducibility and traceability of sim-
ulations. In Proceedings of the 2012 Winter Simula-
tion Conference, page 11p. IEEE.
Deming, W. E. and Stephan, F. F. (1940). On a least squares
adjustment of a sampled frequency table when the
expected marginal totals are known. The Annals of
Mathematical Statistics, 11(4):427–444.
Feger, S. S. and Woźniak, P. W. (2022). Reproducibility: A
researcher-centered definition. Multimodal Technolo-
gies and Interaction, 6(2).
Feitelson, D. G. (2015). From repeatability to reproducibil-
ity and corroboration. ACM SIGOPS Oper. Syst. Rev.,
49:3–11.
Hedström, P. and Manzo, G. (2015). Recent trends in agent-
based computational research: A brief introduction.
Sociological Methods & Research, 44(2):179–185.
Herschel, M., Diestelkämper, R., and Ben Lahmar, H.
(2017). A survey on provenance: What for? what
form? what from? The VLDB Journal, 26(6):881–
906.
Huang, Q., Parker, D. C., Filatova, T., and Sun, S. (2014). A
review of urban residential choice models using agent-
based modeling. Environment and Planning B: Plan-
ning and Design, 41(4):661–689.
Kaub, D., Lohr, C., Reis David, A., Das Chandan, M. K.,
Chanekar, H., Nguyen, T., Berndt, J. O., and Timm,
I. J. (2024). Shortest-path-based resilience analysis of
urban road networks. In International Conference on
Dynamics in Logistics, pages 132–143. Springer.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983).
Optimization by simulated annealing. Science,
220(4598):671–680.
Kuhlman, C., Jackson, L., and Chunara, R. (2020). No com-
putation without representation: Avoiding data and
algorithm biases through diversity. In Proceedings
of the 26th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining, page 3593,
New York, NY, USA. Association for Computing Ma-
chinery.
Leavy, S., O’Sullivan, B., and Siapera, E. (2020). Data,
power and bias in artificial intelligence. In AI for So-
cial Good Workshop.
Lee, J.-S., Filatova, T., Ligmann-Zielinska, A., Hassani-
Mahmooei, B., Stonedahl, F., Lorscheid, I., Voinov,
A., Polhill, J. G., Sun, Z., and Parker, D. C. (2015).
The complexities of agent-based modeling output
analysis. Journal of Artificial Societies and Social
Simulation, 18(4).
Lorig, F., Becker, C. A., and Timm, I. J. (2017a). For-
mal specification of hypotheses for assisting computer
simulation studies. In Proceedings of the Symposium
on Theory of Modeling & Simulation, pages 1–12.
Lorig, F., Lebherz, D. S., Berndt, J. O., and Timm, I. J.
(2017b). Hypothesis-driven experiment design in
computer simulation studies. In 2017 Winter Simu-
lation Conference (WSC), pages 1360–1371. IEEE.
Matthews, R. B., Gilbert, N. G., Roach, A., Polhill, J. G.,
and Gotts, N. M. (2007). Agent-based land-use mod-
els: a review of applications. Landscape Ecology,
22:1447–1459.
Munson, M. A. (2012). A study on the importance of and
time spent on different modeling steps. ACM SIGKDD
Explorations Newsletter, 13(2):65–71.
Pierce, M. E., Krumme, U., and Uhrmacher, A. M. (2018).
Building simulation models of complex ecological
systems by successive composition and reusing sim-
ulation experiments. In 2018 Winter Simulation Con-
ference (WSC), pages 2363–2374. IEEE.
Raghupathi, W., Raghupathi, V., and Ren, J. (2022). Repro-
ducibility in computing research: An empirical study.
IEEE Access, 10:29207–29223.
Rodermund, S. C., Janzso, A., Bae, Y. E., Kravets, A.,
Schewerda, A., Berndt, J. O., and Timm, I. J. (2024).
Driving towards a sustainable future: A multi-layered
agent-based digital twin approach for rural areas. In
ICAART (1), pages 386–395.
Ruscheinski, A., Gjorgevikj, D., Dombrowsky, M., Budde,
K., and Uhrmacher, A. M. (2018). Towards a prov
ontology for simulation models. In International
Provenance and Annotation Workshop, pages 192–
195. Springer.
Ruscheinski, A. and Uhrmacher, A. M. (2017). Provenance
in modeling and simulation studies bridging gaps.
In Proceedings of the 2017 Winter Simulation Confer-
ence, pages 871–883. IEEE.
Schroeder, D., Chatfield, K., Chennells, R., Partington, H.,
Kimani, J., Thomson, G., Odhiambo, J. A., Snyders,
L., and Louw, C. (2024). The exclusion of vulnera-
ble populations from research. In Vulnerability Re-
visited: Leaving No One Behind in Research, pages
25–47. Springer.
Skoogh, A. and Johansson, B. (2008). A methodology
for input data management in discrete event simula-
tion projects. In 2008 Winter Simulation Conference,
pages 1727–1735. IEEE.
Taylor, S. J. E., Eldabi, T., Monks, T., Rabe, M., and
Uhrmacher, A. M. (2018). Crisis, what crisis–does re-
producibility in modeling and simulation really mat-
ter? In 2018 Proceedings of the Winter Simulation
Conference, page 11p. IEEE.
Teran-Somohano, A., Smith, A. E., Ledet, J., Yilmaz, L., and Oğuztüzün, H. (2015). A model-driven engi-
neering approach to simulation experiment design and
execution. In 2015 Winter Simulation Conference
(WSC), pages 2632–2643. IEEE.
Torrens, P. M. (2010). Agent-based models and the spatial
sciences. Geography Compass, 4(5):428–448.
Wilsdorf, P., Dombrowsky, M., Uhrmacher, A. M., Zim-
mermann, J., and van Rienen, U. (2019). Simula-
tion experiment schemas–beyond tools and simulation
approaches. In 2019 Winter simulation conference
(WSC), pages 2783–2794. IEEE.
Wilsdorf, P., Wolpers, A., Hilton, J., Haack, F., and Uhrma-
cher, A. M. (2021). Automatic reuse, adaption, and
execution of simulation experiments via provenance
patterns. ACM Transactions on Modeling and Com-
puter Simulation, 33:1 – 27.