Exploring the Test Driven Development of a Big Data Infrastructure
Examining Gun Violence Incidents in the United States of America
Daniel Staegemann (https://orcid.org/0000-0001-9957-1003), Malte Rathjens (https://orcid.org/0009-0001-9229-4303), Hannes Hinniger, Vivian Schmidt (https://orcid.org/0009-0008-0530-5643) and Klaus Turowski (https://orcid.org/0000-0002-4388-8914)
Magdeburg Research and Competence Cluster VLBA, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
Keywords: Software Engineering, Big Data, Test Driven Development, TDD, Testing, Quality Assurance, Microservice,
Prediction, Gun Violence.
Abstract: Big data (BD) and the systems used for its harnessing heavily impact many aspects of today's society, and it has been repeatedly shown that they can positively impact the operations of organizations that incorporate them. However, creating and maintaining these applications is extremely challenging. Therefore, it is necessary to pay additional attention to the corresponding quality assurance. One software engineering approach that provides high test coverage, enables comprehensive regression tests, and also positively impacts the developed applications' design is test driven development. Even though it has by now a somewhat long history in software development in general, its use in the context of BD engineering is not common. However, an approach for the test driven development of BD applications that is based on microservices has been proposed rather recently. To gain further insights into its feasibility, the publication at hand explores its application in the context of a prototypical project implementation. Hereby, the chosen use case is the analysis and prediction of gun violence incidents in the United States of America, which also incorporates NFL game data under the assumption that the games could potentially influence the occurrence of such incidents.
1 INTRODUCTION
In the contemporary landscape of technological
advancement, the proliferation of data has emerged as
a significant driver of innovation and progress,
presenting both unprecedented opportunities and challenges. The exponential growth in the volume, velocity, and variety of data (Laney 2001) led to a situation where common technologies were no longer able to keep up with the demands (Chang and Grady 2019). This situation is generally described by the term big data (BD), even though there is not a single, universally applied definition for it (Diebold 2021; Volk et al. 2022). Since its emergence, BD has been identified as
a facilitator of increased productivity (Müller et al.
2018) and shifted how organizations perceive,
analyze, and leverage information. Thus, the ability
to harness the insights buried within all this data has
become instrumental in addressing issues across
various domains, for instance, by enabling evidence-based policymaking and targeted interventions (Höchtl et al. 2016).
However, assuring the quality of the
corresponding applications is a challenging task
(Staegemann et al. 2019), whose importance also
shows in the vastness of the related research
endeavors (Ji et al. 2020).
In turn, test driven development (TDD) is a
popular approach for developing software. Due to its
associated advantages (Agha et al. 2023) that could also be valuable in the BD realm, the idea was proposed (Staegemann et al. 2020b) to apply it not
only for common software but also in the context of
big data engineering (BDE). However, despite the
associated potential advantages and some
corresponding works such as (Staegemann et al.
2023), it is still rather underexplored in the context of
BD and the related applications. To contribute to the
corresponding knowledge base and help assess its
viability, in the following, TDD will be applied to
develop a microservice-based BD application that
analyzes gun violence incidents in the United States
of America (USA). More precisely, the application is
designed to provide users with an overview of the
expected incidents of gun violence in each state of the
USA. By analyzing historical data, probabilities for
incidents leading to injuries and/or deaths within a
specified time period will be calculated and displayed.
The calculations are based on a dataset (Ko 2018) on
gun violence in the USA, which records more than
260,000 gun violence incidents from January 2013 to
March 2018 (inclusive). This data will be processed
along with seasonal patterns, specifically NFL game
data (Horowitz 2018), by a predictive microservice to
generate forecasts for the expected incidents. The
results are then visually presented to the users. Based
on its objectives, the application is, therefore, of a
descriptive and predictive nature (Roy et al. 2022).
Even though the presented prototypical
implementation uses historical data that are available
in batch, stream processing will be simulated by
transmitting the data in a somewhat continuous way
over time, instead of providing it all at once. Thus,
even though the dataset is rather small in volume in
the context of BD, the developed application still
showcases a highly typical BD use case (Volk et al.
2020). Moreover, the developed architecture is
designed in a way to allow for future scaling, leaning
further into the demands associated with BD.
While the pursued analytical task is generally
interesting, it has to be noted that the publication’s
primary focus is not on the actual content of the use
case but on the technical aspects with the goal of
further exploring the application of TDD in BDE
itself. For this reason, the actual quality of the
prediction is of secondary importance. Moreover, the
application is only of a prototypical nature and does
not aspire to be fully refined.
The publication is structured as follows. Following this introduction, the application’s basic design is outlined. This is followed by a section that describes the actual implementation, including the applied test driven approach and the testing itself. Subsequently, the findings are discussed and, finally, a conclusion of the work is given.
2 THE DESIGN
The following section describes the developed
application’s design, including the BD infrastructure and the utilized technologies, and broadly outlines the corresponding functionality. To meet the project
requirements, a message-driven BD infrastructure
setup was selected and automated, with Apache
Kafka (Apache Software Foundation 2024a) being
specifically chosen as the central component. In the
scope of this work, Kafka and its ecosystem
component ZooKeeper (Apache Software Foundation
2024b) are harnessed for scalability. This setup is intended to provide high availability, allowing the system to adapt to growing
data volumes and user requirements while
maintaining consistent performance and minimizing
downtime. Figure 1 provides an overview of the
infrastructure created.
For the implementation of the functionalities, the
programming language Python is predominantly used
due to its widespread use in the data science field.
Moreover, the open-source automation tool Ansible
(Red Hat 2024) is also based on this programming
language. For this reason, both the Kafka Producer
and Consumer in the backend, as well as a Flask
Figure 1: The developed application’s architecture.
Application (Ronacher 2024) on the frontend side,
will be based on Python.
2.1 Ansible
Within the scope of this prototypical implementation,
Ansible serves as the foundational technology for
both the deployment and the execution of automated tests. To meet the demands of the proposed TDD approach, it is mandated that tests are written for all services provisioned via Ansible, which are primarily Docker containers (Docker Inc. 2024), before
composing the code for the automated deployment.
This practice ensures that the functionality and
reliability of each service are verified at the outset,
laying a robust groundwork for facilitating
continuous integration and continuous delivery
(CI/CD) while adhering to the principles of TDD.
2.2 Docker
For this implementation, containerization is a
fundamental element. The chosen development
approach and the defined use case favor the decision
to containerize individual functions of the
infrastructure and the web application. For this,
Docker has been chosen. In line with the prototype’s
requirements, the installation or general provisioning and the subsequent availability check of Docker, including Docker networks, are validated by predefined tests. The Docker network is essential to
ensure that the various microservices can
communicate with each other without disruption.
With the aid of Docker and Ansible, the infrastructure
for processing the data can then be set up and tested.
2.3 Zookeeper
For the operation of the presented application,
ZooKeeper is indispensable, managing fundamental
Kafka operations such as broker coordination and
maintaining state. Its functionality is particularly
crucial to ensuring a reliable messaging system and
must be considered in the implementation phase when
applying TDD.
2.4 Kafka Broker
In this project, a Kafka Broker enables the efficient
management and transfer of data streams and is
central to high availability and scalability. In
connection with TDD, unit and integration tests must
be carried out, with particular emphasis placed on
correct interaction with the broker. When defining
tests, it must be noted that the Kafka Broker is
dependent on ZooKeeper.
2.5 Kafka Producer
Within the implemented infrastructure, a Kafka
Producer is deployed and tested as a Python container.
The Producer reads data on gun violence incidents
from a given CSV file and sends it to Kafka. As part
of the tests, it must be ensured that the Producer
container is not only set up and running but also
capable of reaching and processing the provided data.
2.6 Kafka Consumer
The Consumers are responsible for reading and
processing the messages, whereby they filter, pre-
process, and subsequently store the data in a database.
Within this application, each Consumer is responsible
for a certain number of (federal) states, which can
vary depending on the number of Consumers. This
ensures that with increased data traffic, the
deployment of more Consumers and, hence, faster
processing can be facilitated.
The NFL Game data is retrieved from a CSV file
in the prototype, then processed and written to a
database. In a production environment, this data
would most likely be pulled from an API. This
process takes place at regular intervals to ensure that
the latest data is available and can be used by the
prediction algorithm. The retrieved NFL data
includes the game ID, game day, home team, and
away team. This data is enriched in pre-processing by
mapping the available team data to the states of
origin of these teams. This information (game ID,
game day, home team, away team, home state, away
state) is then written to the database in order to make
it persistently available to Consumers. A Consumer
instance then retrieves the Gun Violence data relevant
to it, which are those for the states assigned to the
Consumer.
During the pre-processing step, the data provided
by the Producer are filtered and cleaned. For instance,
only those raw data records that have at least an incident
ID, a date, and an assigned state are processed.
Additionally, fields such as state, county, and city are
checked and cleansed of simple irregularities, such as stray special characters. Moreover, for incidents that
lack a value for killed persons, this value is set to zero.
These pre-processing steps are essential to
meaningfully feed the data into the fbprophet
(Prophet) algorithm (Facebook Open Source 2024),
in accordance with the objective.
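To illustrate these pre-processing rules, a minimal Python sketch is given below; the function name, the record layout, and the field names (e.g., n_killed) are simplifications for illustration and not taken from the actual implementation.

def preprocess_record(record: dict):
    """Filter and clean a single raw incident record (illustrative sketch)."""
    # Discard records that lack the mandatory fields.
    if not all(record.get(field) for field in ("incident_id", "date", "state")):
        return None
    # Remove stray special characters from the location fields.
    for field in ("state", "county", "city"):
        if record.get(field):
            record[field] = "".join(
                c for c in record[field] if c.isalnum() or c in ". +-"
            )
    # Default a missing number of killed persons to zero.
    if not record.get("n_killed"):
        record["n_killed"] = 0
    return record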
The processing of the data for the purpose of forecasting, using the Prophet service, subsequently also takes place within the Consumer. As soon as data for an additional month are available (the data are consumed sorted by date), another forecast for the coming 365 days is carried out. This forecast is performed for each state the Consumer is handling.
The results of the forecast, as well as the incident
numbers, are written into the connected database.
The Prophet forecasting procedure retrieves all
incidents per state, as well as the NFL game dates on which an NFL team from the specific state is involved.
With respect to the given data, Prophet will forecast
the expected number of incidents and victims.
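As an illustration of how the NFL game dates can be fed into Prophet as an additional effect, a condensed sketch is shown below; the column names and the exact configuration are assumptions for this illustration and not the project's actual code.

import pandas as pd
from prophet import Prophet  # older installations import from fbprophet


def forecast_incidents(df: pd.DataFrame, nfl_game_days: pd.Series) -> pd.DataFrame:
    """Forecast daily incidents for one state, treating NFL game days as holidays.

    df            -- columns "ds" (date) and "y" (incident count), assumed layout
    nfl_game_days -- the game dates of teams from that state
    """
    nfl_holidays = pd.DataFrame({
        "holiday": "nfl_game_day",
        "ds": pd.to_datetime(nfl_game_days),
    })
    model = Prophet(holidays=nfl_holidays)
    model.fit(df)
    future = model.make_future_dataframe(periods=365)  # the coming year
    return model.predict(future)[["ds", "yhat"]]       # expected incidents per day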
To complement the existing forecasting
mechanism focused on fatalities within the
Consumers, an additional Consumer will be
introduced to establish predictions related to injuries.
This data is already provided by the producer. This
expansion introduces a microservice to the
architecture that can, in theory, be seamlessly
replaced, and is validated and secured by the
corresponding existing tests (Staegemann et al.
2020b). Such adaptability needs to be considered not
only in a TDD context but also in the actual
deployment and maintenance of the system. This
ensures that as the project evolves and different
analytical needs emerge, the system remains agile and
capable of integrating new functionalities with
minimal disruption, which is common in real systems
(Staegemann et al. 2020a).
For error analysis and monitoring purposes, in this
artificial setting (Sonnenberg and Vom Brocke 2012),
the preprocessed raw data are additionally stored in
the database.
2.7 Database
To have the generated data persistently available for
the web application, it is stored in a database. As
previously introduced, the database stores a
prediction value for each day and state per forecast.
Since the forecast always projects one year ahead and a new forecast is triggered at each month-end, there will eventually be twelve values per date and state in the table, comprising eleven predictions and the actual realized value.
By maintaining all forecasts, not only can
transparency be ensured, but also the values can be
provided as an additional feature to the users on the
front end, allowing them to gain an implicit
understanding of the error tolerance.
Within the microservices architecture, the
database should be structured to enable the seamless
integration and testing of different database services.
Additionally, ensuring this modularity aligns with the
proposed TDD for BDE approach (Staegemann et al.
2020b). This, in turn, needs to be considered in the
implementation phase.
2.8 Access
The user can access the application through a Flask
web application, which can be reached via an instance
of HAProxy (Tarreau 2023), which is an open-source
high-performance load balancer and proxy. However,
these are of little relevance to the actual use case and
are, therefore, only occasionally mentioned but not
further discussed in the following.
3 THE IMPLEMENTATION
The implementation of the showcased application can
be generally divided into two logical areas. These are
the infrastructure and the application.
Since the main motivation of the implementation
is to showcase the use of TDD in BDE, both of these
were strictly implemented following this approach.
3.1 General Implementation Approach
The general approach adopted in this study entails
several sequential steps aimed at ensuring efficient
implementation. Initially, the implementation process
was conceptualized. Subsequently, thorough
planning of necessary integration tests and unit tests
was conducted to ensure comprehensive coverage.
An essential aspect of TDD is the implementation of the tests prior to the development of the code.
Thus, the approach facilitates the creation of code that
aligns closely with the predefined testing criteria.
Finally, the implementation of the actual productive
code follows after the completion of the test planning
and implementation. The iterative nature of this
process allows for the incorporation of additional
tests as deemed necessary during the implementation
phase.
The development approach for the infrastructure
follows TDD principles. In doing so, the following decisions had to be considered during the conceptualization of the tests.
Emphasizing scalability and efficiency, a primary
objective of the endeavor is to construct a
microservice-based infrastructure tailored for a BD
application. This infrastructure is intended to
encompass functionalities such as data creation,
storage, consumption, processing, analysis, and
visualization.
To achieve these objectives, a message-driven
architecture utilizing Apache Kafka was chosen. This
architectural choice necessitates the inclusion of
ZooKeepers, Kafka Brokers, Kafka Producers, and
Kafka Consumers. Additionally, a database is
required to store both the processed raw data and
predictions based on these data, along with the Flask-
based web application.
The web application, the Kafka producer
responsible for processing CSV files, as well as the
Consumers performing data processing and
prediction tasks, are implemented in Python.
A crucial consideration in the design is the
scalability of the Consumers to handle high message
volumes. Consequently, the Kafka brokers are also
engineered with scalability in mind.
Notably, the microservice architecture's flexibility
as well as the availability of the corresponding tests
allow for the interchangeability of components,
including the database, which can be seamlessly
switched between PostgreSQL and MySQL. The
robustness of the test framework is paramount to
ensure smooth transitioning between these database
systems.
3.2 The Infrastructure
In accordance with the development approach, the
development process for the infrastructure part
adhered to rigorous testing procedures. Before the
commencement of any coding activities, the testing
strategy and specific test cases were planned. Given
that the infrastructure setup was orchestrated using
Ansible, the testing framework was conceptualized
within the context of Ansible's operational
environment.
Leveraging Ansible's modular structure
encompassing plays, playbooks, tasks, and roles
facilitated the planning of both unit and integration
tests. In this framework, a unit is defined as either an
individual Ansible task or a cohesive set of tasks
(combined as a playbook) aimed at accomplishing a
specific objective. Conversely, an integration test is a
play consisting of a collection of roles or playbooks
that are interdependent in at least one direction.
Each unit test adheres to a standardized procedure,
encompassing the following sequential steps:
1) Verification of preconditions to ensure that the
specified conditions were not fulfilled prior to
executing the unit
2) Provisioning of possible prerequisites
3) Execution of the unit under test
4) Validation to confirm that the prescribed
conditions have been successfully met subsequent
to the unit's execution
5) Cleanup of any artifacts or residual elements
generated during the testing process to maintain
the integrity of subsequent tests
By systematically following this testing regimen, the
robustness and reliability of the infrastructure were
continuously validated, ensuring adherence to
defined specifications and facilitating the detection
and resolution of potential issues at an early stage of
development.
To ensure that each unit is tested properly and that the respective mapping is always clearly visible, for every unit there is a test in a test directory that has the same name as the tested unit. Since, in most cases, a unit in this project is a playbook, the test directory contains a test playbook with the same name as the corresponding setup playbook.
3.2.1 Docker
In accordance with prior specifications, the Docker
environment is configured using Ansible. The setup
process is designed to be a singular playbook. In order
to validate this playbook, initial verification ensures
the absence of Docker. Subsequently, execution of
the setup-docker playbook occurs, followed by
validation to confirm Docker installation completion.
These procedural steps are combined into an Ansible
play with three distinct playbooks:
1) The initial playbook verifies preconditions.
2) The "setup-docker" playbook is imported.
3) The final playbook verifies postconditions.
The preconditions verification playbook utilizes
Ansible's service_facts module to ascertain the
absence of the Docker service. Additionally, the
package_facts module confirms the non-existence of
containerd.io, docker-ce-cli, and docker-ce packages.
Conversely, the postconditions verification
playbook validates the presence of the Docker binary
in service_facts, confirms the Docker service status
as running, and verifies the installation of
containerd.io, docker-ce-cli, and docker-ce packages.
These validations collectively fulfill the requirements
of the unit test.
It is deemed unnecessary to individually test
minor tasks such as repository addition or GPG Key
insertion, as failure in any of these tasks would result
in the failure of the Docker installation playbook.
At this stage, a dedicated integration test is
deemed unnecessary, as subsequent container tests
inherently verify the integrity of Docker integration.
3.2.2 Docker Network
To enable inter-container communication, especially
via Docker container names, a dedicated Docker
bridge network is set up. This network configuration
is encapsulated within a specialized playbook to
facilitate ease of customization or substitution with
alternative network setups.
Similar to the testing methodology outlined for
Docker, the validation of this playbook can be
approached through unit testing via an Ansible play.
This combines a pre-validation playbook, the import
of the setup-docker-network playbook, a post-
validation playbook, and the incorporation of a
cleanup playbook to remove the network upon test
completion.
Pre-validation utilizes the docker network
command to verify the non-existence of the
designated network. An expected outcome is the
failure of docker network inspect NETWORK_NAME
if the network is absent.
Conversely, post-validation involves the
verification that docker network inspect
NETWORK_NAME no longer results in failure,
indicating successful network creation.
Given the subsequent testing of inter-container
communication during integration tests for
components, additional integration tests specific to
the Docker network are deemed redundant. The
functionality of the Docker network will be implicitly
affirmed during these broader integration tests.
The Ansible implementation of the mentioned test, as an example of what these tests look like, is depicted in Figure 2.
3.2.3 Zookeeper
The infrastructure, centered around Apache Kafka,
necessitates the inclusion of ZooKeepers to oversee various aspects of the cluster, such as the management of Kafka topics and brokers. In consideration of ensuring high
availability as well as scalability, for this
implementation, a foundational configuration
comprising three ZooKeepers has been established.
Scaling the number of ZooKeepers is facilitated by a
singular Ansible variable within the inventory.
The ZooKeeper setup, like the previous setup steps, consists of a single Ansible playbook, and an Ansible play is used to verify the setup-zookeeper unit.
To ensure the correctness of the test, a pre-
validation playbook within the Ansible unit test is
essential. This playbook aims to verify that the
specified number, denoted as N, of ZooKeepers has
not yet been instantiated. This validation process
involves iterating over the designated quantity of
ZooKeepers and examining their existence through
the utilization of the docker_container_info Ansible
module. This module yields a crucial exists fact,
indicating the presence or absence of the container.
After preconditions are validated, necessary
dependencies are provisioned. This includes the setup
of Docker as well as the setup of the Docker bridge
network.
Subsequently, a post-validation playbook is
executed following the execution of the imported
setup-zookeepers playbook. Similar to the pre-
validation phase, this post-validation playbook
iterates over the specified number of ZooKeepers.
Figure 2: The Docker network test.
However, in this phase, the playbook reassesses the
existence of each container and further ensures that
after a time interval of 20 seconds, the status of the
container transitions to running, and the ExitCode
attains a value of 0. These additional checks serve to
corroborate the successful instantiation and
operational state of each ZooKeeper instance since
misconfigurations of ZooKeepers will result in an instant failure on start-up.
This validation could be strengthened by parsing
the logs of ZooKeeper, but this would lead to a fragile
test, which was, therefore, omitted.
Upon completion of the post-validation phase, a
cleanup task is initiated to remove all containers and
associated dependencies. This measure ensures that
the testing process leaves no residual artifacts within
the system, maintaining the integrity and cleanliness
of the testing environment.
Integration tests specific to ZooKeeper are
considered redundant. The functionality of
ZooKeeper will be implicitly affirmed during
integration tests of Kafka.
3.2.4 Kafka Broker
In the context of the developed application, Apache
Kafka serves as the central component. Given that the
infrastructure design relies on Apache Kafka's
integration with ZooKeeper, the testing of Kafka can
be construed as an integration test for the ZooKeeper
setup described earlier. Consequently, testing Kafka
essentially entails testing ZooKeeper as well.
Just like the previously discussed test implementations, the testing process involves an Ansible play that combines various playbooks. Initially, a pre-validation
Ansible playbook is executed to ascertain the absence
of the Kafka container. This verification is conducted
using the docker_container_info and assert Ansible
modules, which validate the absence of the container
by assessing the boolean value of exists.
Subsequently, the setup entails configuring any
requisite preconditions, including Docker, Docker
Bridge Network, and Zookeepers. Following this, the
setup-kafka playbook is imported to initiate the Kafka
setup process.
Upon completion of the setup, a post-validation
step is undertaken. This involves verifying the
existence of the container, ascertaining its
operational status as running, and confirming the
absence of any error exit codes. This validation
process employs the docker_container_info and
assert Ansible modules, which assess the boolean
values of exists, the status of the container, and the
absence of error exit codes, respectively.
Finally, the cleanup phase ensues, wherein any
artifacts generated during the testing process are
removed to maintain a clean testing environment.
3.2.5 Kafka Producers
In the context of Kafka Producers within the
infrastructure setup, it shall be noted that the decision
to implement the Kafka Producer in Python is aligned
with the utilization of the official Python container
image. As outlined previously, the Producer connects
to Kafka, establishes the gun-violence-raw-data
Kafka topic, reads the Gun Violence Dataset from a
CSV file, and subsequently writes each line to the
specified Kafka topic. Upon successful execution of
the send command, a log entry is generated,
indicating the delivery of messages to the gun-
violence-raw-data topic. The log message is "Message delivered to the gun-violence-raw-data topic".
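A condensed Python sketch of this behaviour is given below; it assumes the kafka-python client and a broker reachable under the hostname kafka, both of which are assumptions of this illustration rather than details taken from the actual code.

import logging
import time

from kafka import KafkaProducer  # assuming the kafka-python client


def produce_from_csv(csv_path: str, bootstrap_servers: str = "kafka:9092") -> None:
    """Send each CSV line to the gun-violence-raw-data topic (illustrative sketch)."""
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    with open(csv_path, encoding="utf-8") as csv_file:
        next(csv_file)  # skip the header row
        for line in csv_file:
            # Block until the broker acknowledges the record, then log the delivery.
            producer.send("gun-violence-raw-data", line.strip().encode("utf-8")).get(timeout=10)
            logging.info("Message delivered to the gun-violence-raw-data topic")
            time.sleep(0.01)  # crude simulation of a continuous stream
    producer.flush()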
From an infrastructure standpoint, the entire setup
of the Kafka Producer constitutes a unit, necessitating
a unit test for the infrastructure setup using Ansible.
This test encompasses the verification of a running
Kafka broker, which relies on ZooKeeper, as well as
the existence of a functional Docker bridge network
and Docker. This comprehensive integration test is
designed to specifically evaluate the Kafka Producer
unit.
The testing methodology for the setup-kafka-
producer unit is structured similarly to the tests
conducted for other components discussed earlier.
Pre-validation checks are conducted, prerequisites
are established, the setup-kafka-producer playbook is
imported, post-validation assessments are performed,
and a cleanup playbook is executed to remove any
generated artifacts.
The pre-validation stage involves confirming the non-existence of the Kafka Producer container. From an infrastructure
perspective, the validation of the Kafka Producer's
functionality is based on the logging of the specific
message upon successful message delivery to Kafka.
This validation is incorporated into the post-
validation playbook, wherein the Ansible command
module is combined with the until directive to retrieve
container logs intermittently over a 30-second period,
with a 3-second interval, to verify the occurrence of
the expected log message.
3.2.6 Kafka Consumers
In alignment with the architectural decisions, the
Kafka Consumer is implemented utilizing the official
Python container image. As previously outlined, the
Consumer connects to Kafka, subscribes to the gun-
violence-raw-data Kafka topic, retrieves messages,
processes them, and subsequently stores both
predictions and processed raw data in a designated
database (either MySQL or PostgreSQL). Upon
successful message processing, the Consumer logs a confirmation message, namely "Successfully processed message".
Similar to the approach taken for the Kafka
Producer, the entire setup of the Kafka Consumer is
regarded as a unit from an infrastructure standpoint.
Therefore, it necessitates a unit test for the
infrastructure setup utilizing Ansible. This
comprehensive test encompasses the verification of a
running database, Kafka Producer, Kafka Broker
(which in turn requires ZooKeeper), as well as the
existence of a functional Docker bridge network and
Docker.
As elaborated previously, the Kafka Consumer
script also requires unit tests from a software
development perspective, which will be discussed in
section 3.3.
The Ansible Play designed to test the setup-kafka-
consumer unit mirrors the structure of the Kafka
Producer test described earlier. However, there are two notable differences: the additional prerequisites include the presence of the database and the Kafka Producer, and the expected log message during post-validation differs.
From an infrastructure perspective, the validation of
the Kafka Consumer's functionality depends on the
logging of a specific message upon successful
message processing. This validation process is similar
to that of the Kafka Producer. It incorporates the
Ansible command module combined with the until directive. This combination facilitates the retrieval of
container logs intermittently over a 30-second period,
with a 3-second interval, to verify the occurrence of
the expected log message.
3.2.7 Database
In accordance with the architecture described earlier
and depicted in Figure 1, a switch for the database
technology has been implemented. The choice
between MySQL and PostgreSQL can be determined
via the Ansible variable database_technology.
Consequently, two technology-specific unit tests are
necessary to validate the playbook's functionality. It
is imperative that integration tests are robust enough
to seamlessly handle the switch between databases
without any additional complexities.
Integration tests are performed during the testing
of the Kafka Consumer unit, as it necessitates a
database and orchestrates the setup of the selected
database beforehand. The unit tests for MySQL and
PostgreSQL share a similar structure, as consistency
in testing approaches for analogous components is
paramount (Staegemann et al. 2022).
Similar to the aforementioned tests, both MySQL
and PostgreSQL unit tests follow a standardized
structure:
1) Pre-validations are executed
2) Prerequisites are established
3) The playbook responsible for setting up the
database is imported
4) Post-validation procedures are undertaken
5) Artifacts generated during the testing process are
cleaned up
To comprehensively test the database setup and
functionality, the post-validation phase involves
initiating a database client container. Subsequently,
various SQL commands are executed on the database
using the docker_container Ansible module and its
command option. These commands include creating
a table and inserting data into it. Following this,
database logs are retrieved and analyzed to verify that
the executed commands are appropriately reflected.
This validation is conducted using the assert Ansible
module to ensure the integrity of the database
operations. Once the validation of database
functionality is complete, the database instance is
removed to simulate cleanup operations.
Subsequently, the database is restarted to confirm the
persistence of data through volume management.
Finally, the persistence of data is validated by
utilizing the database client and executing a SELECT
statement to verify that the data remains intact within
the newly created database container.
3.3 The Application
The testing of the application adheres to the same
principles as outlined for infrastructure setup testing.
However, unlike the infrastructure components
implemented with Ansible, the application is
developed in Python. It comprises three major
components: Kafka Producer logic, Kafka Consumer
logic, and a web application, whereby the latter, as
previously mentioned, will not be further described.
Although each of these components undergoes
integration and unit testing from the perspective of
the infrastructure setup, as previously detailed,
additional unit tests are required to validate the
internal implementation of the Python code. To
facilitate this, the unittest Python module is employed.
Each Python script or project is accompanied by a
tests.py file, which contains multiple unit tests
validating the internal functionality of the project.
These tests cover various aspects of the script's
functionality, ensuring the correctness of its behavior.
From a procedural standpoint, these unit tests
could be executed before deploying the logic via
Ansible or integrated as a dedicated step within a
CI/CD pipeline.
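To illustrate this convention, a minimal tests.py skeleton is sketched below; the module name, test name, and the assumed return behaviour of the parser are placeholders rather than the project's actual identifiers. Such a file could be run locally via python -m unittest or as a dedicated CI/CD step.

import unittest

from consumer import parse_kafka_message  # placeholder import; the actual module name may differ


class ConsumerTests(unittest.TestCase):
    """Unit tests validating the internal functionality of the accompanying script."""

    def test_invalid_message_is_rejected(self):
        # Assumption for this sketch: a message without the mandatory fields is filtered out.
        self.assertIsNone(parse_kafka_message(""))


if __name__ == "__main__":
    unittest.main()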
Again, the development process follows the
principles of TDD as well, wherein tests are written
prior to the implementation of logic, ensuring a robust
and testable codebase.
3.3.1 Kafka Producer
As mentioned earlier, in this implementation, to
simulate real data creation, the Kafka Producer
ingests data from a CSV file, wherein each line of the
file is transmitted as a message to the Kafka cluster.
In real-world scenarios, assessing the functionality of the Python function responsible for data production proves to be difficult due to its indefinite runtime. Consequently, unit testing of this function is challenging. As a result, validating
the implementation relies on integration tests. These
integration tests, implemented in Ansible, were
described in section 3.2.5.
An aspect of the internal functionality can be
tested, however, by ensuring that the function fails
when presented with an invalid path to the CSV file.
While this file is only used as an input for the
simulated data streaming, to adhere to the TDD
philosophy, this aspect will still be tested.
Furthermore, the Kafka producer's connectivity to
the Kafka cluster is testable. This can be validated by
calling the function with erroneous addresses, which shall lead to a failure, while valid addresses facilitate a successful connection establishment, returning a producer instance. The Kafka connection test is shown in
Figure 3.
Figure 3: Kafka connection test.
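A minimal sketch of such a connectivity test is given below; it assumes the kafka-python client and a broker reachable as kafka:9092, and is therefore an illustration rather than the exact code depicted in Figure 3.

import unittest

from kafka import KafkaProducer
from kafka.errors import NoBrokersAvailable


def connect_producer(bootstrap_servers: str) -> KafkaProducer:
    """Return a producer instance connected to the given broker address."""
    return KafkaProducer(bootstrap_servers=bootstrap_servers)


class KafkaConnectionTest(unittest.TestCase):
    def test_invalid_address_fails(self):
        # An unreachable address must lead to a failure instead of a silent success.
        with self.assertRaises(NoBrokersAvailable):
            connect_producer("invalid-host:9092")

    def test_valid_address_returns_instance(self):
        # Assumes a broker has been deployed beforehand and is reachable as kafka:9092.
        self.assertIsInstance(connect_producer("kafka:9092"), KafkaProducer)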
3.3.2 Kafka Consumer
As shown in section 2.6, the Kafka Consumer serves
a multi-faceted role within the project, extending
beyond mere data consumption. To summarize, it
engages in the consumption of Gun violence data
from the Kafka broker by subscribing to a designated
topic. A connection to the database is established, and
it proceeds to iterate over the consumed data,
determining its responsibility based on the US state
mentioned in each message. Each message undergoes
pre-processing, encompassing data cleanup, invalid
data filtering, and relevancy assessment. The pre-
processed data are then stored in a dedicated database
table. Additionally, the Kafka Consumer periodically
conducts predictions using fbprophet (with
seasonality adjusted for NFL Game days) regarding
gun violence per assigned state for the ensuing 365
days. By doing this, eleven predictions are created per
state and date followed by the actual value. The
predictions as well as the actual values are,
subsequently, written to another database table,
facilitating data comparison. Since the Kafka
Consumer constitutes a sophisticated script, it
necessitates comprehensive test coverage to validate
critical functionalities. Accordingly, prior to
implementation, multiple tests are provided.
Even though the script could mostly be tested
locally, some of the tests require several
dependencies to be deployed. This includes a working
database, as well as a working Kafka Broker.
As described earlier, there can be 1 to n Kafka
Consumers. Each Consumer is responsible for
handling a subset of all messages. The assignment is
based on the US state the message mentions.
Therefore, each Consumer instance calculates a
subset of US states, based on the total number of
Consumers as well as its own id, defined via input
argument.
For this, a basic test ensures that the script fails with a non-zero exit code if the input data is invalid. Invalid could mean that the ID is greater than the total number of Consumers or that the ID is smaller than 1. Additionally, it is crucial to validate that the summed-up number of all states distributed to the Consumers is equal to the actual total number. This ensures that the logic consistently handles varying numbers of Consumers without duplicated or missing entries.
The corresponding code is shown in Figure 4.
To further validate the consistency, an additional
check is performed, which does not count the entries
but instead validates that the combined list of all
states that were distributed between the Consumers
only contains unique values.
Figure 4: Test to verify the number of states after distribution to the consumers.
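A condensed sketch of this check is given below; it assumes a simple round-robin assignment scheme and a shortened state list, both of which may differ from the project's actual logic shown in Figure 4.

import unittest

# Shortened list for illustration; the project covers all states present in the dataset.
US_STATES = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"]


def assigned_states(consumer_id: int, total_consumers: int) -> list:
    """Assign states to a Consumer instance via round-robin (assumed scheme)."""
    if consumer_id < 1 or consumer_id > total_consumers:
        raise ValueError("invalid consumer id")
    return [state for index, state in enumerate(sorted(US_STATES))
            if index % total_consumers == consumer_id - 1]


class DistributionTests(unittest.TestCase):
    def test_all_states_distributed_exactly_once(self):
        for total in range(1, len(US_STATES) + 1):
            combined = [state for consumer_id in range(1, total + 1)
                        for state in assigned_states(consumer_id, total)]
            self.assertEqual(len(combined), len(US_STATES))       # no missing entries
            self.assertEqual(len(set(combined)), len(US_STATES))  # no duplicates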
Moreover, there are specific fields like state,
incident_id, and date that are mandatory for further
processing. Additionally, there are invalid messages
that need to be either cleaned up or ignored. Based on
the fact that the raw data is static and well-known, all
critical errors within the data are known as well.
The function parse_kafka_message is used to
parse the input and perform the preprocessing for
each consumed message. It is evident that this is a
crucial part of the logic. Therefore, several tests are
planned and implemented.
The test test_parsing_messages validates that:
- parsed valid messages contain 29 comma-separated attributes,
- parsed valid messages contain non-empty incident_id, state, and date attributes,
- parsed invalid messages that contain empty incident_id, state, or date attributes are expected to be filtered out by the function,
- special characters, i.e., all characters other than alphanumeric signs (letters and numbers), "." (period), " " (space), "+" (plus), and "-" (hyphen), are removed from parsed messages (see the sketch after this list).
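As a small illustration of the last rule, the cleaning of special characters could be realized roughly as follows; the helper name and the regular expression are assumptions of this sketch, not the project's actual code.

import re

# Keep letters, digits, ".", " ", "+", and "-"; drop every other character.
DISALLOWED = re.compile(r"[^A-Za-z0-9. +-]")


def clean_field(value: str) -> str:
    return DISALLOWED.sub("", value)


assert clean_field("Washington, D.C.!") == "Washington D.C."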
Further tests are provided for the mapping of teams from the NFL data to US states. Here,
- the function should return an error if an invalid state is provided,
- the function should return an empty list if a state has no NFL team with scheduled games,
- the function should fail if the input is not a string,
- the function should return a list if a state has 1 to n teams with scheduled games,
- the function should return a unique list of games even if two teams in the same state are opponents.
4 DISCUSSION
During the development of the described BD
application, the utilization of TDD has yielded
several benefits.
Since BD projects entail unique challenges
(Staegemann et al. 2019), including scalability, high
availability, handling peak loads of data,
and managing times of reduced data flow, an effective
infrastructure must be capable of addressing these
challenges. TDD compels developers to consider edge cases and possibilities that may otherwise be overlooked, which, in turn, can help to
reduce the developed applications’ susceptibility to
errors that might impact correctness or performance.
A notable example arises from the dynamic
handling of Consumers and workload distribution
across all instances. An unexpected edge case
identified was a potential scenario in which the total
number of US states may not be evenly divisible by
the total number of Consumers.
Additionally, several situations were encountered
where mishandling a single invalid message could
disrupt consumption, potentially leading to the need for manual intervention or to missing information being overlooked. Yet, the use of TDD facilitated the
identification and proactive handling of errors as
crucial aspects of the development process. This is
important because potential malfunctions that are not
proactively eliminated during the development phase
might, at worst, not even be noticed in the
environment of BD applications in productive
operation but still negatively impact the obtained
results.
Consequently, to minimize the risk of such issues without adhering to the TDD approach, most likely several additional iterations of error search and code refactoring would have been necessary during the
implementation of the infrastructure as well as the
application logic.
However, the extensive data infrastructure, which
is characterized by numerous co-dependencies, has
led to many integration tests. Although these prove to
be more costly than unit tests, they are essential for
validating the interactions between all the
interconnected components. Further, ensuring the
smooth implementation of unit tests in the presence
of interdependencies presented challenges that
required prior consideration of testing strategies.
Nevertheless, overall, the proposed TDD
approach for BDE has shown its feasibility and utility.
Hereby, especially the option to swap a part of the infrastructure (in this case the database) for an alternative rather effortlessly, while being protected through a net of existing tests, highlighted the approach’s potential to increase the flexibility and maintainability of the developed solutions. This, in turn, can prove highly
valuable in dynamic business environments as
discussed in (Staegemann et al. 2020a).
5 CONCLUSION
The harnessing of BD has been shown to have a
positive impact on organizational operations.
However, creating the corresponding applications is
a demanding task and one of the big challenges is to
ensure that the applications have the necessary quality.
While quality assurance for BD applications is widely
researched, this is still far from a solved problem. In
response, one rather recent suggestion was the
utilization of the TDD approach in the BD domain.
Thus, to further explore the application of TDD in
BDE, in this research, a corresponding
implementation was created. As a use case for this,
the prediction of incidents of gun violence in the USA
was chosen as this is a typical BD use case. Further,
even though the data were available in batch, stream-
processing was simulated, since this is a rather typical
requirement in BD scenarios.
Within the implementation, there are two major
parts, the infrastructure and the application logic,
which were both created in a test driven manner.
For the prediction itself, historical gun violence
data were used, which were amended by NFL game day data that were considered a potentially impactful
influencing factor. In doing so, the combination of
data from different sources, as another typical BD
requirement, was reflected.
Through the project, it was not only shown that it
is possible to apply the test driven approach for the
regarded use case but also that it benefitted the
development. This happened on one hand through
positive impacts on the developed application’s
design and on the other hand through the associated
quality assurance that helped to avoid errors that would otherwise have necessitated cumbersome and time-consuming additional code reviews to find or might, in a worst-case scenario, even have remained undetected without the use of TDD.
Moreover, as a particularly noteworthy feature,
the increased flexibility and maintainability of the
developed application stand out because the availability of the corresponding tests facilitates the swapping of components for alternatives with comparatively low effort while still being protected by the previously implemented tests. Thereby,
the proposed test driven approach seems especially
valuable in dynamic business environments, where
the applications continuously need to evolve to adapt
to changing environments, circumstances, and
business needs.
However, it has to be noted that the focus was
decisively on the technical realization of the approach
itself. Therefore, the quality of the results of the prediction algorithm and the frontend application used for visualization were not the main priority.
As part of similar future research endeavors, beyond the unit and integration tests employed here, end-to-end testing could, for instance, be conducted to validate the system’s entire functionality. Performance and security tests, on the other hand, could help to identify other issues. Further,
exploring the implementation of hardening measures
for strengthening the system’s security, such as TLS-
encrypted communications, might also be worthwhile,
since it could potentially significantly influence the
required test framework and tests themselves.
REFERENCES
Agha, D., Sohail, R., Meghji, A. F., Qaboolio, R., and
Bhatti, S. (2023). “Test Driven Development and Its
Impact on Program Design and Software Quality: A
Systematic Literature Review,” VAWKUM
Transactions on Computer Sciences (11:1), pp. 268-
280 (doi: 10.21015/vtcs.v11i1.1494).
Apache Software Foundation. (2024a). “Apache Kafka,”
available at https://kafka.apache.org, accessed on Apr
11 2024.
Apache Software Foundation. (2024b). “Welcome to
Apache ZooKeeper,” available at https://
zookeeper.apache.org/, accessed on Apr 11 2024.
Chang, W. L., and Grady, N. (2019). “NIST Big Data
Interoperability Framework: Volume 1, Definitions,”
Special Publication (NIST SP), Gaithersburg, MD:
National Institute of Standards and Technology.
Diebold, F. X. (2021). “What's the big idea? “Big Data” and
its origins,” Significance (18:1), pp. 36-37 (doi:
10.1111/1740-9713.01490).
Docker Inc. (2024). “docker,” available at https://
www.docker.com/, accessed on Apr 11 2024.
Facebook Open Source. (2024). “Prophet: Forecasting at
scale.,” available at https://facebook.github.io/prophet/,
accessed on Apr 12 2024.
Höchtl, J., Parycek, P., and Schöllhammer, R. (2016). “Big
data in the policy cycle: Policy decision making in the
digital era,” Journal of Organizational Computing and
Electronic Commerce (26:1-2), pp. 147-169 (doi:
10.1080/10919392.2015.1125187).
Horowitz, M. (2018). “Detailed NFL Play-by-Play Data
2009-2018: nflscrapR generated NFL dataset with
expected points and win probability,” available at
https://www.kaggle.com/datasets/maxhorowitz/nflplay
byplay2009to2016/data, accessed on Apr 12 2024.
Ji, S., Li, Q., Cao, W., Zhang, P., and Muccini, H. (2020).
“Quality Assurance Technologies of Big Data
Applications: A Systematic Literature Review,”
Applied Sciences (10:22), p. 8052 (doi:
10.3390/app10228052).
Ko, J. (2018). “Gun Violence Data: Comprehensive record
of over 260k US gun violence incidents from 2013-
2018,” available at https://www.kaggle.com/datasets/
jameslko/gun-violence-data, accessed on Apr 12 2024.
Laney, D. (2001). “3D data management: Controlling data
volume, velocity and variety,” META group research
note (6:70).
Müller, O., Fay, M., and Vom Brocke, J. (2018). “The
Effect of Big Data and Analytics on Firm Performance:
An Econometric Analysis Considering Industry
Characteristics,” Journal of Management Information
Systems (35:2), pp. 488-509 (doi: 10.1080/
07421222.2018.1451955).
Red Hat. (2024). “Ansible,” available at https://
www.ansible.com/, accessed on Apr 11 2024.
Ronacher, A. (2024). “Flask,” available at https://
flask.palletsprojects.com/en/3.0.x/, accessed on Jan 16
2024.
Roy, D., Srivastava, R., Jat, M., and Karaca, M. S. (2022).
“A Complete Overview of Analytics Techniques:
Descriptive, Predictive, and Prescriptive,” in Decision
Intelligence Analytics and the Implementation of
Strategic Business Management, P. M. Jeyanthi, T.
Choudhury, D. Hack-Polay, T. P. Singh and S. Abujar
(eds.), Cham: Springer International Publishing, pp. 15-
30 (doi: 10.1007/978-3-030-82763-2_2).
Sonnenberg, C., and Vom Brocke, J. (2012). “Evaluations
in the Science of the Artificial – Reconsidering the
Build-Evaluate Pattern in Design Science Research,” in
Design Science Research in Information Systems.
Advances in Theory and Practice, D. Hutchison, T.
Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C.
Mitchell, M. Naor, O. Nierstrasz, C. Pandu Rangan, B.
Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y.
Vardi, G. Weikum, K. Peffers, M. Rothenberger and B.
Kuechler (eds.), Berlin, Heidelberg: Springer Berlin
Heidelberg, pp. 381-397 (doi: 10.1007/978-3-642-
29863-9_28).
Staegemann, D., Chadayan, A. K., Mathew, P., Sudhakaran,
S. N., Thalakkotoor, S. J., and Turowski, K. (2023).
“Showing the Use of Test-Driven Development in Big
Data Engineering on the Example of a Stock Market
Prediction Application,” in Proceedings of Eighth
International Congress on Information and
Communication Technology, X.-S. Yang, R. S. Sherratt,
N. Dey and A. Joshi (eds.), Singapore: Springer Nature
Singapore, pp. 867-877 (doi: 10.1007/978-981-99-
3243-6_70).
Staegemann, D., Volk, M., Daase, C., and Turowski, K.
(2020a). “Discussing Relations Between Dynamic
Business Environments and Big Data Analytics,”
Complex Systems Informatics and Modeling Quarterly
(23), pp. 58-82 (doi: 10.7250/csimq.2020-23.05).
Staegemann, D., Volk, M., Jamous, N., and Turowski, K.
(2020b). “Exploring the Applicability of Test Driven
Development in the Big Data Domain,” in Proceedings
of the 31st Australasian Conference on Information
Systems (ACIS), Wellington, New Zealand. 01.12.2020
- 04.12.2020.
Staegemann, D., Volk, M., Nahhas, A., Abdallah, M., and
Turowski, K. (2019). “Exploring the Specificities and
Challenges of Testing Big Data Systems,” in
Proceedings of the 15th International Conference on
Signal Image Technology & Internet based Systems,
Sorrento, Italy. 26.11.2019 - 29.11.2019.
Staegemann, D., Volk, M., Pohl, M., Haertel, C., Hintsch,
J., and Turowski, K. (2022). “Identifying Guidelines for
Test-Driven Development in Software Engineering—A
Literature Review,” in Proceedings of Seventh
International Congress on Information and
Communication Technology, X.-S. Yang, S. Sherratt, N.
Dey and A. Joshi (eds.), Singapore: Springer Nature
Singapore, pp. 327-336 (doi: 10.1007/978-981-19-
2397-5_30).
Tarreau, W. (2023). “HAProxy,” available at
https://www.haproxy.org/, accessed on Apr 11 2024.
Volk, M., Staegemann, D., Trifonova, I., Bosse, S., and
Turowski, K. (2020). “Identifying Similarities of Big
Data Projects–A Use Case Driven Approach,” IEEE
Access (8), pp. 186599-186619 (doi: 10.1109/
ACCESS.2020.3028127).
Volk, M., Staegemann, D., and Turowski, K. (2022).
“Providing Clarity on Big Data: Discussing Its
Definition and the Most Relevant Data Characteristics,”
in Proceedings of the 14th International Joint
Conference on Knowledge Discovery, Knowledge
Engineering and Knowledge Management, Valletta,
Malta. 24.10.2022 - 26.10.2022, SCITEPRESS -
Science and Technology Publications, pp. 141-148
(doi: 10.5220/0011537500003335).