GreenCC: A Hybrid Approach to Sustainably Validate Manufacturing Data in Industry 4.0 Environments

Simon Paasche¹ and Sven Groppe²
¹Automotive Electronics, Robert Bosch Elektronik GmbH, Salzgitter, Germany
²Institute of Information Systems, University of Lübeck, Germany
Keywords: Consistency Checking, Green Computing, Industry 4.0, Big Data.
Abstract: The era of big data streams forces companies to rethink their business models to gain competitive advantages. To fully make use of the collected information, data have to be available in high quality. With big data, the impact of information and communications technology (ICT) is also increasing. The extended use of ICT leads to an increase in energy consumption and thus also in the CO2 footprint, both of which in turn result in high costs. A tradeoff arises between making use of the data and reducing the resources required for data acquisition and validation. Our work investigates how data validation in smart manufacturing environments can be implemented in an energy-efficient and resource-saving way. To this end, we present a combination of a light consistency checker (LightCC) and a full consistency checker (FullCC), the latter of which can be activated in periods with a high probability of defects. Our LightCC uses heuristics to predict missing messages and identifies time frames with an increased likelihood of further inconsistencies. In these periods, our FullCC can be activated to perform an accurate validation. We call the resulting system the green consistency checker (GreenCC).
1 INTRODUCTION
Data-driven technologies enable companies to achieve competitive advantages (Tao et al., 2018). In order to fully exploit these advantages, data must be of high quality (Tao et al., 2018; Tian et al., 2017). At the same time, the energy consumption of information and communication technologies (ICT) has been increasing steadily for years, ranging between 1 % and 3.2 % of global consumption in 2020, and is projected to increase to up to 23 % by 2030 (Geiger et al., 2021). Depending on a country's energy mix, ICT is therefore also responsible for high emissions of climate-damaging gases.
In our work, we focus on ICT in manufacturing environments at Bosch. Figure 1 shows a smart surface mount technology (SMT) line for assembling printed circuit boards with electronic components. The four most important processes of such a line from a data point of view are: (1) Solder Paste Printing (SPP), for printing solder paste on a panel, (2) Solder Paste Inspection (SPI), to check the print, (3) Surface Mounted Devices (SMD), to assemble individual components, and (4) Solder Joint Inspection (SJI), to inspect the final product. During processing, the machines continuously send data about their completed steps.

Figure 1: Run through a smart SMT line with data from SPP, SPI, SMD, and SJI.

In previous research we have covered the topics of data quality in such scenarios. We name deviations from the target state inconsistencies. These can be divided into four categories: (1) missing message, which describes the absence of an expected message, (2) multiple message, when information is available twice, (3) incorrect content, which refers to the content of a single message, and (4) with contradictions, which considers the relationships between messages.
To identify inconsistencies, we developed a system termed consistency checker (CC) (Paasche and Groppe, 2022). Taking up the above-mentioned issue of energy usage in ICT applications,
this work examines the energy usage of our current system and presents an enhanced version with lower consumption. Our green consistency checker (GreenCC) consists of two units: (1) a light consistency checker (LightCC), to perform a resource-efficient light check, and (2) a full consistency checker (FullCC), to check incoming data in more detail.
The paper is organized as follows: Section 2 provides an overview of related approaches in the field of data validation in industry scenarios. In Section 3, we focus on streamified consistency checking and present our approach to reach sustainability. Afterwards, Section 4 takes a look at our evaluation results. In experiments, we compare the energy consumption and the associated climate footprints of our prototypes. In particular, we address application scenarios of our GreenCC. Thereafter, Section 5 discusses our experimental results. During the discussion, we primarily point out the green aspects of our approach. Finally, we conclude in Section 6.
2 RELATED WORK
In the literature, there is a large body of work on the topic of data validation. Haav et al. (Haav et al., 2019) refer to a real-world data validation scenario from the timber industry. SHACL shapes are used for the definition of the constraints. Although they describe their showcase in detail and give descriptive examples, Haav et al. do not present a concrete demonstration of the approach. Furthermore, their approach is not intended to work on big data streams. Another practical streaming scenario, from the healthcare sector, is given by Cortés et al. (Cortés et al., 2015). In their work, the authors address data validation approaches, but evaluate data throughput to identify challenges in the big data area. Further approaches are provided by (Gao et al., 2018), (Tian et al., 2017), and (Xuanyuan et al., 2016). However, these works either use fixed knowledge bases, do not work on a data stream, or do not use a real application scenario. Further, the presented prototypes do not offer different modes of operation depending on the incoming data. Although the impact of green solutions in software engineering and manufacturing is increasing, none of the related approaches focus on or even mention sustainability aspects.
3 GreenCC
In earlier studies, we have already examined inconsistencies in our data in detail. Our consistency checker (CC) validates an incoming data stream using SPARQL Protocol And RDF Query Language (SPARQL, https://www.w3.org/TR/sparql11-query/) queries. Since our machine data is in JSON format, we first have to transform it into the Resource Description Framework (RDF, https://www.w3.org/RDF/). Due to the size of each file, this process already consumes time and thus computing resources.
To be more efficient, we have developed an approach to predict inconsistencies of categories 1 and 2 using heuristics. We make use of the fact that when similar products are manufactured on similar lines, a similar number of messages is also generated. If we create a knowledge base about which machines are involved in a specific line for the production of certain products, we can deviate from the level of detail of the previous checks for categories 1 and 2. In this case, we only need the number of messages as well as the current manufacturing line and product type. This information is only a fraction of the total JSON files and can be retrieved without complex transformations. However, the number can vary slightly due to various quality tests in a line, so we have to define a tolerance range for each product family. We then compare this data with our knowledge base and obtain a prediction of whether our data is valid or not.
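The following minimal Python sketch illustrates the idea of this heuristic check; the table of expected counts and tolerances is an invented stand-in for the knowledge base, and all names are illustrative rather than our actual implementation.

    # Minimal sketch of the heuristic count check for categories 1 and 2.
    # The expected counts and tolerances below are invented examples; in the
    # real system they come from the knowledge base per line and product family.

    EXPECTED = {
        # (line_id, product_family): (expected_message_count, tolerance)
        ("line-01", "ecu-a"): (42, 3),
    }

    def predict_consistent(line_id, product_family, observed_count):
        # A count below the tolerance range hints at category 1 (missing
        # message), a count above it at category 2 (multiple message).
        expected, tolerance = EXPECTED[(line_id, product_family)]
        return abs(observed_count - expected) <= tolerance

    print(predict_consistent("line-01", "ecu-a", 41))  # True
    print(predict_consistent("line-01", "ecu-a", 50))  # False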
Since categories 3 and 4 refer to the message content, it is not possible to use heuristics for checking them. Nonetheless, we identified that most inconsistencies of categories 3 and 4 occur after changes in the manufacturing environment or in our data pipeline. The occurrence of specific patterns, e.g. unusual message frequencies, triggers more granular inspections to check for category 3 and 4 inconsistencies.
Figure 2: Architecture of our GreenCC. Based on the current manufacturing data stream, the system predicts the likelihood of inconsistencies. In periods with a high probability of inconsistencies, a complete check can be activated.
Figure 2 shows the overall architecture of our GreenCC. Our system consists of two modules: LightCC and FullCC. The LightCC monitors an incoming data stream for discrepancies. To make the monitoring step adaptable to changes in the manufacturing environment, this unit accesses a knowledge base during operation. In our approach, the LightCC unit also takes on the task of handling a continuous data stream. If the stream shows signs of a high probability of inconsistencies, a detailed consistency check can be performed on the incoming data if required.
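As a rough illustration of this interplay, the following Python sketch shows one way the two modules could be wired together; the names and the stubbed components are assumptions for illustration, not our actual interfaces.

    # Sketch of the GreenCC dispatch: the LightCC predicts by default, the
    # FullCC is only consulted while a change is active on the line.

    class GreenCC:
        def __init__(self, predict_light, validate_full, change_active):
            self.predict_light = predict_light    # cheap heuristic check
            self.validate_full = validate_full    # accurate SPARQL-based check
            self.change_active = change_active    # per-line change state

        def process_window(self, line_id, window):
            if self.change_active(line_id):
                return self.validate_full(window)   # high inconsistency risk
            return self.predict_light(window)       # regular, efficient mode

    # Usage with stubbed components:
    cc = GreenCC(
        predict_light=lambda w: len(w) == 4,
        validate_full=lambda w: all(m.get("ok") for m in w),
        change_active=lambda line_id: False,
    )
    print(cc.process_window("line-01", [{"ok": True}] * 4))  # True (light mode)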
3.1 Light Consistency Checking
Figure 3: Overview of the LightCC. Our system predicts the likelihood of inconsistencies on the current data stream.
Figure 3 presents a schematic overview of our LightCC approach. Similar to its predecessors, the system is directly connected to our manufacturing data. In regular mode, our LightCC monitors the incoming data stream. The Monitoring unit is the heart of our LightCC. This unit enables the handling and monitoring of a continuous data stream. Thereby, it carries out central tasks of the entire consistency checker. For this purpose, it consists of four submodules: (1) Stream Handler, (2) Window Builder, (3) Consistency Prediction, and (4) Change Detection.
(1) Stream Handler. Our Stream Handler accesses a continuous data stream and forwards incoming data to the Window Builder. Through a publish-subscribe structure, the unit acts in an event-based way.
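A minimal in-process sketch of such an event-based handler is given below; in our system, a message broker provides the publish-subscribe mechanics, for which a plain callback list stands in here.

    class StreamHandler:
        def __init__(self):
            self._subscribers = []

        def subscribe(self, callback):
            # Register a downstream unit, e.g. the Window Builder.
            self._subscribers.append(callback)

        def on_message(self, message):
            # Invoked for every incoming machine file; fan out to subscribers.
            for callback in self._subscribers:
                callback(message)

    handler = StreamHandler()
    handler.subscribe(lambda msg: print("received:", msg["id"]))
    handler.on_message({"id": "panel-123", "process": "SPI"})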
(2) Window Builder. For this unit we follow the idea of a keyed window. Based on an identifier, the incoming machine files are divided into data sets. In this way we can check the consistency of the data of a product (cf. data set at the beginning of Section 3).
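The sketch below shows the keyed-window idea under the assumption that messages carry a panel identifier and that a data set is complete after a fixed number of messages; both assumptions are simplifications for illustration.

    from collections import defaultdict

    class WindowBuilder:
        def __init__(self, window_size):
            self.window_size = window_size    # expected messages per product
            self.windows = defaultdict(list)  # key -> collected messages

        def add(self, message):
            # Group incoming machine files by the product identifier (the key).
            key = message["panel_id"]
            self.windows[key].append(message)
            if len(self.windows[key]) >= self.window_size:
                return self.windows.pop(key)  # emit the completed data set
            return None

    builder = WindowBuilder(window_size=4)
    for process in ("SPP", "SPI", "SMD", "SJI"):
        window = builder.add({"panel_id": "p-1", "process": process})
    print(window is not None)  # True: a full data set is ready for checking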
(3) Consistency Prediction. During runtime, this module accesses a knowledge base in which experience values for the products to be manufactured are mapped. The knowledge base is implemented using the Web Ontology Language (OWL, https://www.w3.org/OWL/). The matching between knowledge base and manufacturing data can thus be done using SPARQL queries. Depending on the frequency distribution, we can predict inconsistencies from categories 1 and 2.
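The following sketch shows how such a lookup could be phrased with RDFLib; the vocabulary (ex:expectedCount etc.) is invented for illustration and does not reproduce our actual domain ontology.

    from rdflib import Graph

    # Toy knowledge base entry: expected message count and tolerance for one
    # line/product combination (illustrative vocabulary).
    KB = """
    @prefix ex: <http://example.org/manufacturing#> .
    ex:entry1 ex:line "line-01" ;
              ex:product "ecu-a" ;
              ex:expectedCount 42 ;
              ex:tolerance 3 .
    """

    graph = Graph()
    graph.parse(data=KB, format="turtle")

    QUERY = """
    PREFIX ex: <http://example.org/manufacturing#>
    SELECT ?expected ?tolerance WHERE {
      ?entry ex:line "line-01" ;
             ex:product "ecu-a" ;
             ex:expectedCount ?expected ;
             ex:tolerance ?tolerance .
    }
    """

    for row in graph.query(QUERY):
        # Feed these values into the heuristic count check of the LightCC.
        print(int(row.expected), int(row.tolerance))  # 42 3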
(4) Change Detection. This unit monitors whether changes have occurred in a line. An indication of this can be, for example, that no data of a line has been sent over a longer period and the information flow then starts again. Furthermore, certain message types signal an adjustment. If a change has been made to a machine or line, there tends to be a higher risk of inconsistencies occurring. For this reason, it is important to check complete data sets after changes to be able to exclude inconsistencies of all four categories. In this case, the connection between Window Builder and Consistency Prediction is interrupted and the whole window is transmitted to the FullCC unit. For this purpose, there is a bidirectional connection between Window Builder and Change Detection. If no inconsistencies or further changes occur within a predefined period of time (e.g. 30 mins), the system automatically sets the change status of this line to inactive and our LightCC continues to monitor the stream and to predict inconsistencies for categories 1 and 2.
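A minimal sketch of this per-line change state, assuming the 30-minute deactivation window from the example above, could look as follows (names are illustrative):

    import time

    CHANGE_TIMEOUT = 30 * 60  # seconds until a change status becomes inactive

    class ChangeDetection:
        def __init__(self):
            self._last_change = {}  # line_id -> timestamp of last change signal

        def signal_change(self, line_id):
            # Called e.g. when a line resumes sending after a long silence.
            self._last_change[line_id] = time.time()

        def is_active(self, line_id):
            # True while complete windows should be routed to the FullCC.
            last = self._last_change.get(line_id)
            return last is not None and time.time() - last < CHANGE_TIMEOUT

    detector = ChangeDetection()
    detector.signal_change("line-01")
    print(detector.is_active("line-01"))  # True: windows go to the FullCC
    print(detector.is_active("line-02"))  # False: LightCC keeps predicting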
3.2 Full Consistency Checking
Figure 4: Architecture of our FullCC. The unit performs an accurate consistency check.
Figure 4 provides a detailed overview of our FullCC. When triggered, the FullCC receives JSON files as input. These JSON files are forwarded to be transformed into RDF format. The Validation unit consists of two submodules: (1) RDF Transform and (2) SPARQL Validator.

(1) RDF Transform. During validation, expert knowledge is incorporated into the checking process. Since this knowledge is represented in SPARQL queries, we first have to convert a data set into the semantic RDF format. For this we use a domain ontology, which includes a formal understanding of our manufacturing steps. In this way we obtain a mapping between machine parameters and knowledge. The result is a graph structure.
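A much simplified sketch of this transformation with RDFLib is shown below; the flat key-to-predicate mapping stands in for the domain ontology and is an assumption for illustration only.

    import json
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/manufacturing#")

    def json_to_rdf(raw):
        # Map each machine parameter of a JSON message to a triple.
        message = json.loads(raw)
        graph = Graph()
        subject = EX[message["panel_id"]]
        graph.add((subject, RDF.type, EX.Panel))
        for key, value in message.items():
            graph.add((subject, EX[key], Literal(value)))
        return graph

    g = json_to_rdf('{"panel_id": "p-1", "process": "SPI", "result": "pass"}')
    print(len(g))  # 4 generated triples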
(2) SPARQL Validator. In the actual consistency check, we validate the previously generated RDF graph using SPARQL queries. The SPARQL queries contain our definition of consistency. As a result, this step provides an overview of whether there is an inconsistency and, if so, what kind of inconsistency occurred. With this result we can annotate the initial JSON file and write it back to our message broker.
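The sketch below illustrates this validation step with an ASK query on the graph from the previous sketch; the rule shown (every panel must carry an SPI result) and its vocabulary are invented examples of a consistency definition, not our actual queries.

    from rdflib import Graph

    # One illustrative consistency rule: report panels without an SPI result.
    RULE = """
    PREFIX ex: <http://example.org/manufacturing#>
    ASK {
      ?panel a ex:Panel .
      FILTER NOT EXISTS { ?panel ex:spiResult ?result . }
    }
    """

    def validate(graph):
        violated = bool(graph.query(RULE).askAnswer)
        # The verdict annotates the initial JSON file before it is written
        # back to the message broker.
        return {"inconsistent": violated,
                "category": "incorrect content" if violated else None}

    print(validate(Graph()))  # {'inconsistent': False, 'category': None}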
4 EXPERIMENTAL RESULTS
Our experiments focus on the energy consumption of our current CC in comparison to our novel approach using efficient design elements. We performed the evaluations on a computer with 16 GB RAM and an 11th-generation Intel i5-1145G7 processor.
For our evaluation, we implemented four benchmarking systems. Three systems are based on Apache Flink (https://flink.apache.org/), using a keyed window operator. Our first Flink approach validates the data for category 1 and 2 inconsistencies using predefined SPARQL rules. Since Flink does not offer a semantic package, we use RDFLib (https://rdflib.dev/) to implement it. The second approach leverages Flink's streaming capability and already relies on heuristics to perform consistency checking. Thus, we still use semantics to query knowledge, but we already predict whether a data set is consistent or not (cf. Section 3.1). Our third Flink system is constructed like the first one, also using predefined SPARQL queries, with the difference that a complete check for all categories is performed. Comparing these three systems, we determine the semantic overhead. Further, we use an optimized SPARQL query to showcase the effort of performing semantic data validation. Doing so, we are able to determine the overhead of Flink in our applications. We compare these four systems at the end with our optimized LightCC and FullCC. For better comparability of the approaches, we have implemented each system in Python. To measure the energy consumption we use the Python package CodeCarbon (https://codecarbon.io/). CodeCarbon can be added to existing code to measure the consumed energy in kilowatt-hours (kWh). In our evaluation, we primarily want to know how much energy and emissions we can save using our GreenCC (LightCC + FullCC) in comparison to our previously developed approaches. In the following analyses, we evaluate our LightCC and FullCC separately to determine the overhead for relevant operations such as RDF transformation and change detection.
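The measurement setup follows the usual CodeCarbon pattern sketched below; the project name and the stubbed workload are placeholders, and the exact tracker options we used are not reproduced here.

    from codecarbon import EmissionsTracker

    def run_consistency_check():
        # Stand-in workload for one LightCC/FullCC evaluation run.
        sum(i * i for i in range(1_000_000))

    tracker = EmissionsTracker(project_name="greencc-benchmark")
    tracker.start()
    try:
        run_consistency_check()
    finally:
        emissions_kg = tracker.stop()  # estimated kg CO2e for the tracked run
        # CodeCarbon also writes its measurements, including the consumed
        # energy in kWh, to an output file (emissions.csv by default).
        print(f"emissions: {emissions_kg:.6f} kg CO2e")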
4.1 Energy Consumption
In our first evaluation, we compare the total consumption when applying our systems in manufacturing plants of different sizes (small, medium, and large). For a small plant, we consider about 400 k messages per day. In medium plants, it is about 1.5 million, and for large plants we assume 2.5 million messages per day.
As can be seen in Table 1, the energy consumptions of our semantic and heuristic Flink approaches are close to each other. All three have a daily consumption of around 7 kWh (medium size). This is approximately as much energy as is needed to prepare 490 cups of coffee (https://www.verivox.de/strom/themen/1-kilowattstunde/). This result shows that the scope and the validation method play a minor role when using Flink. With an additional look at the consumption of pure SPARQL, we can conclude that consistency checks with queries produce only a small overhead in our use case. However, the single SPARQL query only provides a reference value: since it offers no features to handle a continuous data stream, the pure query cannot be used directly in our manufacturing scenario. Therefore, it is not surprising that our optimized approaches also consume more energy than pure SPARQL. The difference between our semantic and heuristic Flink approaches can be explained by the fact that in the semantic approaches the JSON data has to be transformed into RDF (approx. 0.2 kWh (medium size) per day). From this, we can conclude that handling the data stream and partitioning it into data sets consumes most of the energy. However, the RDF transformation is also noticeable.
Overall, our optimized approaches are very close to each other. In a medium-sized plant, the maximum difference is less than 0.2 kWh per day (LightCC vs. FullCC (all)). This is approximately the energy consumption already determined for the RDF transformation in the Flink systems. The difference seems marginal at first. Considering the fact that consistency checks are applied in a huge manufacturing environment, the energy saving becomes more important. Considering only ten plants, the saving is already 2 kWh each day. Calculated over a year, this is far more than 700 kWh. Furthermore, our table shows that the additional management of the line states does not have a significant impact. Thus, the LightCC with Change Detection is preferable. The marginal difference between our FullCC approaches again shows that the actual SPARQL evaluation has a small impact on energy consumption.
Table 1: Overview of daily energy consumption in kilowatt-hours (kWh) and corresponding costs for small, medium, and large manufacturing plants.

Approach                      | Small plant per day     | Medium plant per day    | Large plant per day
------------------------------|-------------------------|-------------------------|--------------------------
Flink (1&2)                   | 1.898 kWh (24.04 Cent)  | 7.116 kWh (90.16 Cent)  | 11.860 kWh (150.27 Cent)
Flink (all)                   | 1.949 kWh (24.69 Cent)  | 7.308 kWh (92.59 Cent)  | 12.180 kWh (154.32 Cent)
Flink heuristic               | 1.856 kWh (23.52 Cent)  | 6.960 kWh (88.18 Cent)  | 11.600 kWh (146.97 Cent)
SPARQL                        | 0.090 kWh + e_s         | 0.336 kWh + e_m         | 0.560 kWh + e_l
                              | (1.14 Cent + c_s)       | (4.26 Cent + c_m)       | (7.10 Cent + c_l)
LightCC                       | 1.226 kWh (15.53 Cent)  | 4.596 kWh (58.23 Cent)  | 7.660 kWh (97.05 Cent)
LightCC with Change Detection | 1.229 kWh (15.57 Cent)  | 4.608 kWh (58.38 Cent)  | 7.680 kWh (97.31 Cent)
FullCC (1&2)                  | 1.251 kWh (15.85 Cent)  | 4.692 kWh (59.45 Cent)  | 7.820 kWh (99.08 Cent)
FullCC (all)                  | 1.261 kWh (15.97 Cent)  | 4.728 kWh (59.90 Cent)  | 7.880 kWh (99.84 Cent)

Legend: e_s ≈ 0.049 kWh, e_m ≈ 0.183 kWh, e_l ≈ 0.305 kWh denote the energy for the RDF transform in a small, medium, and large plant; c_s ≈ 0.61 Cent, c_m ≈ 2.32 Cent, c_l ≈ 3.86 Cent the corresponding costs.
Comparing our optimized consistency checkers with Flink, it is noticeable that the required energy on a daily basis differs significantly. The difference between, e.g., our LightCC and the heuristic Flink approach amounts to almost 2.4 kWh in a medium-sized plant. These differences can be explained by Flink's complex range of functions and its focus on performance. Our approach is designed precisely for consistency checking in manufacturing scenarios and offers the better choice from a purely ecological point of view.
The strong differences in power consumption are also reflected in the cost analysis, as costs and consumption are directly related. Applying an average German kWh price for large industrial customers of about 12.67 Cent (https://www.bmwk.de/Redaktion/DE/Artikel/Energie/energiedaten-gesamtausgabe.html), we obtain annual operating costs between about 212.54 Euro and 337.95 Euro for a medium-sized manufacturing plant. Although the costs are within acceptable dimensions, the differences between our prototyped systems are at a high level and thus also offer a financial incentive.
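To make these figures traceable: for the LightCC in a medium-sized plant, the annual costs follow directly from Table 1 as 4.596 kWh/day × 365 days ≈ 1677.5 kWh per year, and 1677.5 kWh × 12.67 Cent/kWh ≈ 212.54 Euro; the upper bound of 337.95 Euro results analogously from the 7.308 kWh/day of Flink (all).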
4.2 Climate Footprint
In our second evaluation, we consider the climate footprint of our eight systems. We combine the energy consumption determined in Table 1 with the electricity mix of relevant industrial states. This calculation is less relevant for the Automotive Electronics (AE) area itself, since the plants are supplied with green electricity, which keeps emissions at a very low level. However, savings ensure that the generated energy can be used elsewhere. Furthermore, the results highlight the general necessity of green software approaches. We report the footprint in carbon dioxide equivalents (CO2e) in grams per day. Our reference data refer to the year 2021 (https://ember-climate.org/data-catalogue/yearly-electricity-data/) and are averaged values for the specified region.
Our results are shown in Table 2. Higher electricity consumption entails higher CO2e emissions. Our optimized systems, and especially our LightCC, have an advantage in terms of daily emissions. Compared, for example, to the transport sector in Germany (148 million t CO2e in 2022 (Hendzlik et al., 2022)), our measured values are still relatively small. However, we should note that for just one medium-sized plant, daily CO2e emissions are already between 1.20 kg and 1.92 kg in the EU and between 2.03 kg and 3.22 kg worldwide. This corresponds to a saving of nearly 40 %.
Table 2: Daily carbon dioxide equivalents (CO2e) in grams, based on the average emission factors (gCO2e/kWh) of the respective regions. The table shows the footprint for small, medium, and large plants.

Approach                      | Plant size | Germany        | EU            | USA           | World
                              |            | 366 gCO2e/kWh  | 262 gCO2e/kWh | 379 gCO2e/kWh | 441 gCO2e/kWh
------------------------------|------------|----------------|---------------|---------------|--------------
Flink (1&2)                   | small      |  695 g         |  497 g        |  719 g        |  837 g
                              | medium     | 2604 g         | 1864 g        | 2697 g        | 3138 g
                              | large      | 4341 g         | 3107 g        | 4495 g        | 5230 g
Flink heuristic               | small      |  679 g         |  486 g        |  703 g        |  818 g
                              | medium     | 2547 g         | 1824 g        | 2638 g        | 3069 g
                              | large      | 4246 g         | 3039 g        | 4396 g        | 5116 g
Flink (all)                   | small      |  713 g         |  511 g        |  739 g        |  859 g
                              | medium     | 2675 g         | 1915 g        | 2770 g        | 3223 g
                              | large      | 4458 g         | 3191 g        | 4616 g        | 5371 g
SPARQL                        | small      |   33 g + ge_s  |   23 g + eu_s |   34 g + us_s |   40 g + w_s
                              | medium     |  123 g + ge_m  |   88 g + eu_m |  127 g + us_m |  148 g + w_m
                              | large      |  205 g + ge_l  |  147 g + eu_l |  212 g + us_l |  247 g + w_l
LightCC                       | small      |  449 g         |  321 g        |  465 g        |  540 g
                              | medium     | 1682 g         | 1204 g        | 1742 g        | 2027 g
                              | large      | 2804 g         | 2007 g        | 2903 g        | 3378 g
LightCC with Change Detection | small      |  450 g         |  322 g        |  466 g        |  542 g
                              | medium     | 1687 g         | 1207 g        | 1746 g        | 2032 g
                              | large      | 2811 g         | 2012 g        | 2911 g        | 3387 g
FullCC (1&2)                  | small      |  458 g         |  328 g        |  474 g        |  552 g
                              | medium     | 1717 g         | 1229 g        | 1778 g        | 2069 g
                              | large      | 2826 g         | 2049 g        | 2964 g        | 3449 g
FullCC (all)                  | small      |  461 g         |  330 g        |  478 g        |  556 g
                              | medium     | 1730 g         | 1239 g        | 1792 g        | 2085 g
                              | large      | 2884 g         | 2065 g        | 2987 g        | 3475 g

Legend: ge_x: additional CO2e in a plant of size small (~17.9 g), medium (~67.0 g), large (~111.6 g) in Germany; eu_x: small (~12.8 g), medium (~47.9 g), large (~79.9 g) in the EU; us_x: small (~18.6 g), medium (~69.4 g), large (~115.6 g) in the USA; w_x: small (~21.6 g), medium (~80.7 g), large (~134.5 g) worldwide.
On an annual basis, this means about 262 kg in the EU or over 434 kg worldwide. Since the number of plants worldwide is higher, savings of several tons can be assumed. The total savings potential is therefore significant in terms of climate protection and sustainable data management. This is particularly evident when comparing our LightCC and FullCC approaches: in a medium-sized plant, the difference in footprint ranges from 30 g to 50 g per day.
Overall, our Flink approaches have a higher energy consumption and a correspondingly larger climate footprint. The main reason for this is that Flink is a generic stream processing framework with a large feature set, which has a negative impact in our scenario. Our custom-developed systems are lean applications specifically designed to be used as data validation systems in the smart manufacturing domain.
4.3 Scenarios for GreenCC
Our analyses have shown that heuristic approaches for data validation tasks have a positive impact on energy consumption. Since we monitor changes in a manufacturing line and switch the mode of operation, our GreenCC generates a small additional overhead (see Table 1). This overhead can be attributed primarily to the additional effort required for RDF transformation.
Table 3 summarizes our evaluated scenarios. The GreenCC used for the evaluation is a combination of the LightCC with Change Detection and the FullCC (all). With this we model the operation of our CC in the manufacturing environment. Our first scenario shows the GreenCC with no change; the energy consumption and footprint correspond to the LightCC. If there is at least one change per 30 mins and per line, the measured values correspond to our FullCC (last scenario).
Table 3: Energy consumption, costs, and CO2e footprint of our GreenCC (LightCC with Change Detection + FullCC (all)) in different scenarios. A scenario indicates how often changes have taken place during one day. The footprint refers to an average German footprint of 366 gCO2e/kWh in 2021.

GreenCC scenario                     | Energy    | Footprint (366 gCO2e/kWh) | Costs (12.67 Cent/kWh)
-------------------------------------|-----------|---------------------------|-----------------------
No change                            | 4.608 kWh | 1687 g                    | 58.38 Cent
                                     | 4.636 kWh | 1697 g                    | 58.74 Cent
Half of the time and lines           | 4.668 kWh | 1708 g                    | 59.14 Cent
                                     | 4.696 kWh | 1719 g                    | 59.50 Cent
Min. one change per 30 mins per line | 4.728 kWh | 1730 g                    | 59.90 Cent
In real operation, we are at the front end of these scenarios: usually, only a few changes occur in a small number of lines. Thus, our GreenCC runs predominantly in heuristic mode. During 24 h operation, the energy requirement per hour of our LightCC is about 0.192 kWh; in comparison, the FullCC requires approx. 0.197 kWh. In particular, the low hourly energy requirement also shows that it is advisable to switch to the FullCC for a certain period of time in the event of changes in a line. With the FullCC, it can be quickly determined in an automated manner whether the adjustments have been made as desired. In this way, our system can be used not only as a validation framework but also as a test environment for machines and data pipelines.
5 DISCUSSION
In our GreenCC, we currently employ heuristics to detect category 1 and 2 inconsistencies and to identify time periods in which further inconsistencies are more likely. If we categorize our inconsistencies by urgency, the following picture emerges:
High: Missing message: A missing message represents the most important inconsistency. The reason for this is that the information contained in the messages cannot be easily recovered. It is indeed possible to map the actual process if the machine and product type are known. However, information about the exact process duration and execution remains hidden. For later analyses, this information is of utmost importance.
Medium: Incorrect content + with contradiction: Missing or wrong content constitutes an information gap. Usually, only a few parameters are faulty, which leads to additional work in subsequent analyses. However, through targeted data cleaning approaches, the faulty parts can be reproduced. Considering the additional overhead of our FullCC and taking into account that inconsistencies of these two categories usually occur after changes in the manufacturing environment, it is recommended to use our heuristic check during normal operation.
Low: Multiple messages: Multiple messages can be excluded in a subsequent analysis (e.g. using the SQL keyword DISTINCT) or removed from the cluster via expensive operations. There is no loss of information due to the occurrence of this inconsistency. Moreover, by checking category 1 in our LightCC, this category is also indirectly checked without additional overhead.
This prioritization shows the necessity of permanently monitoring completeness. With the LightCC unit, we focus exactly on this issue: during regular operation mode, an efficient check for completeness is continuously performed.
As we already mentioned in Section 2, current data validation approaches do not place their focus on sustainability. In our evaluation, however, we found that, especially when applying validation systems in real scenarios, we can significantly reduce energy consumption and CO2e emissions. Sustainable aspects of software, however, encompass much more than just active operation. In their work, Geiger et al. (Geiger et al., 2021) provide an overview of how to practice sustainable software engineering. Their criteria include, among others, modularity and adaptability. As we explained in Section 3, our GreenCC is divided into modules. The modules communicate with each other via defined interfaces and are thus interchangeable. In addition, the knowledge base used is outsourced and realized as an ontology. Using semantics allows for standardized access to the knowledge and enables good extensibility towards a changing manufacturing environment. Further, ontologies allow our approach to be transferred to other areas by adapting checking rules and domain knowledge.
Another aspect of making software more sustainable is to use an event-based approach instead of a pull-based one. We address this by using a publish-subscribe architecture (message broker). Further, the broker allows multiple instances of our GreenCC to be connected and thus the workload to be balanced. Since we developed our GreenCC unit-wise, it is further conceivable to run each module on a different node.
Geiger et al. (Geiger et al., 2021) also mention that software should be implemented in a lean way and perform exactly one task, which our system complies with. Further, it is recommended to choose the programming language wisely; the selection criteria strongly depend on the specific use case. By using Python, we are in the lower middle range in terms of execution time, energy, and memory consumption (cf. (Pereira et al., 2017)). In our evaluation, the focus is on comparing different methods to validate manufacturing data. For better comparability, especially with regard to the overhead introduced by monitoring software, all systems are implemented in the same language and monitored with the same tools. In general, when looking at our systems, we can see that the chosen method already offers a decisive advantage in terms of costs, energy consumption, and emissions. With further regard to the integration in a globally operating company and the resulting need for maintainability of the language used, Python offers an advantage at this point compared to, for example, C or Go. However, it is conceivable to implement parts of the system (e.g. the stream handling unit) more efficiently in future work.
6 CONCLUSION
We present a system for validating stream data in a resource-efficient manner. Our GreenCC is a Python-based system that monitors incoming messages and predicts inconsistencies based on occurring patterns. For detailed analyses, a full consistency check can be initiated. Our analyses have shown that, compared to our previous systems, energy consumption can be reduced significantly, especially when applying the system to large manufacturing plants. The lower energy consumption stands out in particular when considering CO2e emissions. Over the year, these can be reduced in a medium-sized plant in the EU to a factor of about 0.6, which corresponds to savings of 262 kgCO2e (LightCC vs. Flink (all)). Overall, the use of our GreenCC is profitable in each of our scenarios.

As the relevance of smart sustainable software continues to grow, future work will continue to focus on green data validation in manufacturing environments. Pattern detection, as used in our GreenCC, offers many opportunities, e.g. by using machine learning (ML) algorithms. Depending on how resource-consuming a training process is and how often we have to re-train, ML can offer a benefit when applied in a large heterogeneous manufacturing environment.
REFERENCES
Cortés, R., Bonnaire, X., Marin, O., and Sens, P. (2015). Stream processing of healthcare sensor data: studying user traces to identify challenges from a big data perspective. Procedia Computer Science, 52:1004–1009.

Gao, S., Dell'Aglio, D., Pan, J. Z., and Bernstein, A. (2018). Distributed stream consistency checking. In International Conference on Web Engineering, pages 387–403. Springer.

Geiger, L., Hopf, T., Loring, J., Renner, M., Rudolph, J., Scharf, A., Schmidt, M., and Termer, F. (2021). Ressourceneffiziente Programmierung.

Haav, H.-M., Maigre, R., Lupeikiene, A., Vasilecas, O., and Dzemyda, G. (2019). A semantic model for product configuration in timber industry. In Databases and Information Systems X, volume 315, pages 143–158. IOS Press.

Hendzlik, M., Lange, M., Klöckner, P., Lambrecht, M., Frey, K., Dziekan, K., Dross, M., and Schmied, M. (2022). Bausteine für einen klimagerechten Verkehr. Internationales Verkehrswesen, 74(1).

Paasche, S. and Groppe, S. (2022). Enhancing data quality and process optimization for smart manufacturing lines in industry 4.0 scenarios. In Proceedings of The International Workshop on Big Data in Emergent Distributed Environments, BiDEDE '22, New York, NY, USA. Association for Computing Machinery.

Pereira, R., Couto, M., Ribeiro, F., Rua, R., Cunha, J., Fernandes, J. P., and Saraiva, J. (2017). Energy efficiency across programming languages: how do energy, time, and memory relate? In Proceedings of the 10th ACM SIGPLAN International Conference on Software Language Engineering, pages 256–267.

Tao, F., Qi, Q., Liu, A., and Kusiak, A. (2018). Data-driven smart manufacturing. Journal of Manufacturing Systems, 48:157–169.

Tian, Y., Michiardi, P., and Vukolić, M. (2017). Bleach: A distributed stream data cleaning system. In 2017 IEEE International Congress on Big Data (BigData Congress), pages 113–120. IEEE.

Xuanyuan, S., Li, Y., Patil, L., and Jiang, Z. (2016). Configuration semantics representation: A rule-based ontology for product configuration. In 2016 SAI Computing Conference (SAI), pages 734–741. IEEE.