Real-Time Manufacturing Data Quality: Leveraging Data Profiling and
Quality Metrics
Teresa Peixoto 1, Bruno Oliveira 1, Óscar Oliveira 1 and Fillipe Ribeiro 2
1 CIICESI, School of Management and Technology, Porto Polytechnic, Portugal
2 JPM Industry, Portugal
{tmop, bmo, oao}@estg.ipp.pt, fillipe.ribeiro@jpm.pt
ORCID: Teresa Peixoto https://orcid.org/0009-0002-6136-8602, Bruno Oliveira https://orcid.org/0000-0001-9138-9143, Óscar Oliveira https://orcid.org/0000-0003-3807-7292, Fillipe Ribeiro https://orcid.org/0000-0002-8851-9688
Keywords:
Data Quality, Data Profiling, Real-Time Data Analysis, Smart Manufacturing Environments, Industry 4.0.
Abstract:
Ensuring data quality in decision-making is essential, as it directly impacts the reliability of insights and
business decisions based on data. Measuring data quality can be resource-intensive, and it is challenging to
balance high data quality against operational costs. Data profiling is a fundamental step in ensuring data quality,
as it involves thoroughly analyzing data to understand its structure, content, and quality. Data profiling enables
teams to assess the state of their data at an early stage, uncovering patterns, anomalies, and inconsistencies
that might otherwise go unnoticed. In this paper, we analyze data quality metrics within Industry 4.0 envi-
ronments, emphasizing various critical aspects of data quality, including accuracy, completeness, consistency,
and timeliness, and showing how typical data profiling outputs can be leveraged to monitor and improve data
quality. Through a case study, we validate the feasibility of our approach and highlight its potential to improve
data-driven decision-making processes in smart manufacturing environments.
1 INTRODUCTION
In modern data-driven business systems, maintaining
high data quality is essential for ensuring the relia-
bility of decision-making processes (Rangineni et al.,
2023). The increasing adoption of the Internet of
Things (IoT) and Cyber-Physical Systems (CPS) in
the industry has significantly escalated the volume
of data generated (Goknil et al., 2023), revealing the
need for advanced techniques for analyzing and un-
derstanding its characteristics (Tverdal et al., 2024).
Much of this data is streamed in real-time, resulting in
substantial volumes of (time series) data organized by
timestamps (Hu et al., 2023), which require constant
supervision. Ensuring data quality is vital for opti-
mizing processes, enabling predictive maintenance,
and supporting data-driven decisions, ultimately en-
hancing the efficiency and reliability of automated
systems while minimizing error rates (Goknil et al.,
2023). However, maintaining data quality within the
context of Industry 4.0 presents challenges, including
managing multiple data sources, varying formats and
standards, real-time validation, and integrating legacy
systems with emerging technologies.
Through data profiling (Abedjan et al., 2018), en-
gineers and analysts gain detailed visibility into var-
ious data attributes, such as distributions, data types,
and relationships across tables or databases. This in-
sight helps in establishing baseline quality metrics,
enabling teams to set realistic and meaningful qual-
ity standards. For instance, if profiling reveals that a
particular dataset has a high percentage of null val-
ues in key fields, it signals that corrective actions
are necessary, like data cleaning or enrichment, be-
fore the data is used in downstream applications.
Once the data characteristics are known, engineers
can automate quality monitoring processes that con-
tinuously check data against predefined thresholds
(Tverdal et al., 2024). For example, if certain data
fields should never be empty, data profiling enables
the creation of validation rules to enforce this require-
ment. As data flows through pipelines, these checks
help maintain quality by identifying any deviations
from expected standards. Overall, data profiling is
more than a preliminary step; it is an ongoing prac-
tice in data engineering that sustains data quality over
time.
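To make this concrete, the sketch below shows how a profiling finding (an acceptable null ratio observed in historical data) can be turned into an automated validation rule with Pandas; the column name sensor_value and the 5% threshold are illustrative assumptions, not values from the case study.

```python
import pandas as pd

def check_nulls(df: pd.DataFrame, column: str, max_null_ratio: float = 0.05) -> bool:
    """Validation rule derived from profiling: the column's null ratio must
    stay below a threshold observed as acceptable in historical data."""
    null_ratio = df[column].isna().mean()  # fraction of missing values
    return null_ratio <= max_null_ratio

# Hypothetical batch of incoming readings
batch = pd.DataFrame({"sensor_value": [140.2, None, 141.0, 139.8]})
if not check_nulls(batch, "sensor_value"):
    print("sensor_value exceeded the allowed null ratio; flag for cleaning")
```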
This paper presents a data quality evaluation ap-
proach tailored to the unique characteristics of IoT
and Industry 4.0 data, applied within a specific case
study. It investigates key dimensions and metrics of
data quality, leveraging data profiling outputs to ac-
tively enhance the evaluation process. By integrat-
ing profiling results, this approach enables continu-
ous, real-time assessment of manufacturing data, re-
ducing the need for human intervention and support-
ing automated, high-frequency quality monitoring in
industrial environments.
The rest of this article is structured as follows.
Section 2 reviews related work. Section 3 describes
the case study. Section 4 presents the results and dis-
cusses the main findings. Finally, Section 5 concludes
the article.
2 RELATED WORK
With the advances in industrial technology, the in-
creasing number of sensors deployed for monitoring
manufacturing processes leads to the constant gener-
ation of large volumes of time series data (Schultheis
et al., 2024). The growing volume and complexity
of time series data present significant challenges for
data analysis (Hu et al., 2023). Analyzing this data
can uncover underlying patterns, reveal correlations
and periodicity between events, and provide a deeper
understanding of the nature and mechanisms of these
events. Through the analysis of time trends, valuable
information can be extracted, such as anomaly de-
tection, classification, and clustering (Bandara et al.,
2020). Missing data, outliers or duplicated records are
some examples of problems typically found in time-
series data (Tverdal et al., 2024).
The quality and continuity of data present significant bottlenecks in Industry 4.0 settings. Various factors
can lead to a decline in data quality. For instance, systems can face sensor malfunctions and failures, resulting in corrupted sensor measurements. These issues
can range from electromagnetic interference and packet loss to signal processing faults (Goknil et al., 2023).
Poor data quality affects trust in and reliance on these Industry 4.0 systems. Data Monitoring, Data Cleaning
and Data Repair are three categories of data quality activities identified in (Goknil et al., 2023).
The definition of data quality is intricate and
context-dependent. It can be described as the
degree to which data characteristics meet explicit
and implicit requirements in specific circumstances
(ISO/IEC 25012:2008, 2008). Notably, data quality
cannot be easily distilled into a single metric or defi-
nition. Instead, it is a multifaceted concept that must
be carefully evaluated against the particular needs and
objectives of data users.
Data Quality Dimensions (DQD) (Loshin, 2011)
are characteristics or attributes used to assess the qual-
ity of data. These dimensions are crucial for under-
standing and measuring the fitness of data for its in-
tended use. By establishing clear criteria for evalu-
ation, these dimensions ensure that data aligns with
established needs and expectations. Identifying these
relevant dimensions forms the basis for effectively
assessing data quality and initiating continuous im-
provement activities (Cichy and Rass, 2019).
Batini (Batini and Scannapieco, 2016) proposed
several DQD, such as Accuracy, Completeness and
Consistency, and organised them into groups based
on their similarity, with each group addressing spe-
cific problem categories, strategies and metrics for
evaluating data quality. In the Accuracy group, Ba-
tini distinguishes between Structural Accuracy and
Temporal Accuracy. Structural Accuracy refers to
the correctness of data within a stable time frame,
while Temporal Accuracy measures how quickly up-
dates in data values reflect real-world changes. In
addition, the author identifies Timeliness as one of
the principal temporal accuracy dimensions, repre-
senting how up-to-date the data are for the specific
task. Completeness measures the extent to which
all required data is present and accounted for, ensur-
ing that the dataset includes all necessary informa-
tion without omissions or gaps, while Consistency
can be defined as the ability of information to present
a uniform and synchronized representation of real-
ity, as established by integrity constraints, business
rules, and other formalities. This dimension identi-
fies violations of semantic rules defined on a data set.
Consequently, data values must be uniform and syn-
chronized in all instances and applications. Other au-
thors have proposed classifications for DQDs, such
as (Loshin, 2011; Mahanti, 2019; Zhang et al., 2021).
The latter presents a comprehensive evaluation frame-
work for sensor measurements in the context of IoT.
The authors in (Goknil et al., 2023) present a com-
prehensive overview of metrics for IoT, categorized
into 17 Data Quality Dimensions (DQDs). Notably, among these, the study by Liu et al.
(Liu et al., 2021) identifies the dimensions of Accu-
racy, Completeness, Consistency and Timeliness as
the most relevant to the main data problems in smart
manufacturing scenarios.
For Accuracy, i.e., the degree of precision with which
the stored data reflects reality, (Goknil et al., 2023)
presents 5 metrics found in the literature. For in-
stance, the metric identified as M4 and used in (Sicari et al., 2016) is an accuracy metric that varies between
0 and 1 and indicates how close a value is to the cor-
rect values. This metric is calculated using the fol-
lowing mathematical expression:
\[ \text{Accuracy} = \frac{x - \min(X)}{\max(X) - \min(X)} \quad (1) \]
where x is the value to be analyzed and X is the data set. This normalized accuracy score is useful for understanding the relative position of a value within
the entire range of data. If Accuracy is close to 0,
it means that x is close to the minimum value ob-
served in X, while Accuracy close to 1 means that x
is close to the maximum value observed in X. When
Accuracy is less than 0 or greater than 1, it indicates
that the x score is outside the range of values observed
in X, suggesting the presence of outliers or errors in
the data.
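As a minimal illustration (the function and variable names are ours, not from the cited works), this metric can be computed as follows:

```python
import numpy as np

def accuracy(x: float, X: np.ndarray) -> float:
    """Normalized accuracy (eq. 1): position of x within the observed range of X.
    Values below 0 or above 1 suggest x lies outside the observed range."""
    lo, hi = np.min(X), np.max(X)
    return (x - lo) / (hi - lo)

readings = np.array([140.1, 141.3, 139.7, 142.0])  # hypothetical sensor values
print(accuracy(141.0, readings))  # ~0.57, well inside the observed range
```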
For Completeness, which refers to the expecta-
tion that certain attributes should have assigned val-
ues in the data under evaluation, many authors iden-
tify and propose various metrics to measure this di-
mension. Although the terminology may differ, the
metrics themselves are fundamentally similar. For in-
stance, (Goknil et al., 2023) presents six metrics, in-
cluding the metric identified as M13, which is also
used by (Byabazaire et al., 2020). Similarly, (Ma-
hanti, 2019) proposes metrics based on the same con-
cept. Despite minor variations, the underlying prin-
ciples of these quantitative metrics for completeness
remain consistent. The metric for the Completeness
dimension is expressed as follows:
\[ \text{Completeness} = \frac{N_{total} - N_{miss}}{N_{total}} \quad (2) \]
where N_miss is the number of missing values (such as nulls, blank values or others) and N_total is the total number of values that should have been filled in. This
metric can be applied at both the record and attribute
levels, allowing completeness gaps to be identified at
different layers. In this way, it is possible to detect
a variety of underlying causes for data completeness
problems.
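A minimal sketch of this metric for a single attribute, using Pandas (the names and the example data are illustrative):

```python
import pandas as pd

def completeness(series: pd.Series) -> float:
    """Completeness (eq. 2): (N_total - N_miss) / N_total for one attribute."""
    n_total = len(series)
    n_miss = series.isna().sum()
    return (n_total - n_miss) / n_total

# Hypothetical sensor column with one missing reading
col = pd.Series([72.5, None, 73.1, 72.9])
print(completeness(col))  # 0.75
```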
For Consistency, (Goknil et al., 2023) does not offer a predefined metric, since its evaluation is based on contextual rules. However, (Mahanti, 2019) measures consistency as the ratio of the number of rules that hold in the data (N_rule) to the number of previously established rules that should hold (N_total), resulting in the following metric:
\[ \text{Consistency} = \frac{N_{rule}}{N_{total}} \quad (3) \]
This metric should be applied to individual
records as well as to cross-records of different data
sets. For example, on a plastic extrusion machine,
if the screw rotation sensor is showing high values,
it would be expected that the barrel temperature sen-
sors would also be showing high values, as the fric-
tion generated by high rotation tends to heat the ma-
terial. This type of analysis is essential to identify
inconsistencies that could indicate faults in the mon-
itoring system or problems with the machine. In ad-
dition, when receiving data associated with a specific
machine, such as a unique identifier generated by the
sensors, it is essential that this machine is correctly
identified in a separate set of data that lists all of the
company’s machines.
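The following sketch illustrates this rule-based view of consistency; the two rules, their thresholds, and the column names (rotation, temp2) are illustrative assumptions inspired by the extrusion example above, not rules from the case study:

```python
import pandas as pd

def consistency(df: pd.DataFrame, rules: list) -> float:
    """Consistency (eq. 3): fraction of predefined rules that hold on the data block."""
    satisfied = sum(1 for rule in rules if rule(df))
    return satisfied / len(rules)

# Illustrative rules; thresholds are assumptions
rules = [
    lambda d: d["rotation"].between(20, 60).all(),                    # speed within spec
    lambda d: not ((d["rotation"] > 55) & (d["temp2"] < 150)).any(),  # high speed implies hot barrel
]

block = pd.DataFrame({"rotation": [40.0, 58.0], "temp2": [165.0, 172.0]})
print(consistency(block, rules))  # 1.0 when both rules hold
```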
For Timeliness, which refers to the degree of
timeliness of data for a specific task, one of the met-
rics presented by (Goknil et al., 2023), identified as
M28, was used in (Sicari et al., 2016). This metric is
defined based on data Age and Volatility, where Age
represents the time elapsed since the creation of the
data, while Volatility characterizes the period during
which this data remains valid. Thus, Timeliness is
calculated as follows:
\[ \text{Timeliness} = \max\left(0,\ 1 - \frac{Age}{Volatility}\right) \quad (4) \]
In this metric, the Timeliness value varies between
0 and 1, where 0 indicates that the data is outside the
ideal period of analysis, while 1 means that the data is
entirely within the ideal range for decision-making.
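A minimal sketch of this metric, assuming timestamps are available as datetime objects and that volatility is expressed as a fixed time window:

```python
from datetime import datetime, timedelta

def timeliness(recorded_at: datetime, now: datetime, volatility: timedelta) -> float:
    """Timeliness (eq. 4): max(0, 1 - Age / Volatility)."""
    age = now - recorded_at
    return max(0.0, 1.0 - age / volatility)

now = datetime(2024, 10, 11, 12, 0, 0)            # hypothetical evaluation time
recorded = now - timedelta(minutes=2, seconds=30)
print(timeliness(recorded, now, timedelta(minutes=10)))  # 0.75
```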
Data profiling presents itself as a critical and rou-
tine task for IT professionals and researchers, involv-
ing a wide range of techniques to analyze datasets
and generate metadata. This process can yield sim-
ple statistics and the most common patterns within
data values. More complex metadata, such as inclu-
sion and functional dependencies, require examina-
tion across multiple columns (Naumann, 2014).
Data profiling faces three primary challenges:
ingestion, computation, and output interpretation
(Abedjan et al., 2018). Ingestion involves efficiently
loading and preparing data from diverse sources. The
metadata discovered through profiling can be applied
to improve data quality by translating patterns and
dependencies into constraints or rules for validation,
cleansing, and integration. (Oliveira and Oliveira,
2022) introduced a data pipeline to ensure a cer-
tain level of data "normality". This framework de-
rives system behaviours, such as load management
and quarantine, based on a straightforward reliabil-
ity score. It relies on services that utilize a message
and communication broker. The reliability score can
be enhanced through a plugin architecture and sim-
ple configuration, enabling the development of spe-
cialized systems. This approach offers flexibility and
adaptability, making it easier to maintain data quality
and system reliability.
Computational complexity is also a significant
factor, as profiling algorithms must scale with the number
of rows and columns in a dataset. Tasks often involve
inspecting various column combinations, which can
lead to exponential complexity, particularly in smart
manufacturing domains (Tverdal et al., 2024).
The use of data profiling outputs for data qual-
ity assessment was explored in (Kusumasari and Fi-
tria, 2016), in which OpenRefine was used to per-
form multiple analysis techniques on different data
elements. Data profiling with OpenRefine (https://openrefine.org/) can
detect data quality issues and provide suggestions to
improve data quality. This process is particularly im-
portant in research information systems, where data
quality directly impacts the success of Business Intel-
ligence applications (Azeroual et al., 2018). (Tverdal
et al., 2024) proposed EDPRaaS (Edge-based Data
Profiling and Repair as a Service), an approach de-
signed for efficient data quality profiling and repair in
IoT environments. It uses data profiling to comple-
ment data quality assessment. The repair component
leverages Great Expectations (https://greatexpectations.io/) outputs for data correction tasks, while Pandas (https://pandas.pydata.org/) profiling provides end-users
with reports on identified data quality issues.
In (Heine et al., 2019), the authors present a pro-
filing component designed to streamline data quality
management by automatically generating rule sugges-
tions and parameters based on existing data. The pro-
filing component analyzes data to propose rule candi-
dates, allowing users to review and activate the most
suitable rules with the aid of their business knowl-
edge.
This highlights the importance of treating data
profiling as a primary tool in data analysis environ-
ments, using metadata generated by profiling mecha-
nisms as an active asset for data quality assessment
and monitoring. Profiling outputs should be lever-
aged to drive data quality processes, enabling a more
proactive approach to data quality management and
allowing for quicker responses to shifts in data char-
acteristics. This is particularly valuable in smart man-
ufacturing scenarios, where data streams from numer-
ous sensors require agile, real-time quality control
(Agolla, 2021).
(Abedjan et al., 2018) define a set of tasks for
data profiling that ranges from simple analysis of
individual columns to identifying dependencies be-
tween multiple columns. For analyzing individual
columns, the authors divide the analysis into three
main categories: Cardinalities, Value Distributions
and Data Types, Patterns and Domains. The Cardi-
nalities category provides simple summaries of the data by means of counts; among the main tasks are the number of rows in a table, the number or percentage of null values, the count of distinct values in a column, and uniqueness, which is the ratio between the number of distinct values and the total number of rows. The Value Distributions category summarises the distribution of values per column; in this category are the different types of frequency histograms (Equi-width histograms, Equi-depth histograms, etc.), the extremes of a numerical column (minimums and maximums), the constancy of a column, which is the ratio between the frequency of the most common value and the total number of rows, the quartiles and, finally, the first-digit task, which is the distribution of the first digits of a set of numerical values. The last category for analysing individual columns, Data Types, Patterns and Domains, brings together eight tasks: determining basic data types such as numeric, alphabetic, alphanumeric, dates or times, and identifying more specific data types such as booleans, integers, and timestamps, among others. Other tasks in this cat-
egory are the minimum, maximum, median and av-
erage lengths of values in a column, the maximum
length of digits in numerical values, the maximum
number of decimal places in numerical values, his-
tograms of patterns of values, identification of class
data (generic semantic data types) and identification
of semantic domains.
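As an illustration of these single-column tasks, a minimal profiling sketch with Pandas could look as follows (the selection of statistics is ours):

```python
import pandas as pd

def profile_column(s: pd.Series) -> dict:
    """A few single-column profiling tasks: cardinalities, value distribution, data type."""
    n_rows = len(s)
    return {
        "rows": n_rows,
        "null_pct": float(s.isna().mean()),
        "distinct": int(s.nunique(dropna=True)),
        "uniqueness": s.nunique(dropna=True) / n_rows,               # distinct / rows
        "constancy": float(s.value_counts(dropna=True).iloc[0] / n_rows),
        "min": s.min(), "max": s.max(),
        "quartiles": s.quantile([0.25, 0.5, 0.75]).tolist(),
        "dtype": str(s.dtype),
    }

print(profile_column(pd.Series([140.1, 140.1, 141.3, None, 139.7])))
```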
In addition to analysing individual columns,
(Abedjan et al., 2018) also highlight the dependen-
cies between columns, which describe the relation-
ships between them and are essential for data in-
tegrity. Functional dependencies, Unique column
combinations and Inclusion dependencies are some
of the main types. These dependencies help to iden-
tify primary keys and foreign keys, guaranteeing con-
sistency and facilitating data cleansing by identifying
dependency patterns which, when violated, indicate
possible quality problems.
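For example, the inclusion-dependency idea can be sketched as a containment check between a sensor stream and a machine registry; the table and column names below are illustrative assumptions:

```python
import pandas as pd

def inclusion_satisfied(child: pd.Series, parent: pd.Series) -> bool:
    """Inclusion dependency: every value of `child` must appear in `parent`."""
    return child.dropna().isin(set(parent.dropna())).all()

readings = pd.DataFrame({"machine_id": ["M01", "M02", "M02"]})   # hypothetical stream
machines = pd.DataFrame({"machine_id": ["M01", "M02", "M03"]})   # hypothetical registry
print(inclusion_satisfied(readings["machine_id"], machines["machine_id"]))  # True
```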
3 CASE STUDY
Data profiling and data quality assessment are inter-
connected, with profiling as a foundational step that
supports quality monitoring and management. By
systematically analyzing the structure, distribution,
and relationships within datasets, data profiling gen-
erates detailed metadata that reveals potential issues.
By using profiling outputs to shape quality rules, or-
ganizations can maintain high data quality standards
dynamically, reducing the risk of data degradation
over time and supporting more accurate, reliable ana-
lytics (Heine et al., 2019). The case study presented
in this paper demonstrates how this approach can be
effectively implemented to manage data quality in a
dynamic manufacturing setting.
The case study describes plastic extrusion in a
manufacturing environment scenario. Extrusion is
a core process in the manufacture of plastic prod-
ucts used in the production of various items, includ-
ing pipes, coatings, wire and cable insulation, and
monofilaments. In this scenario, the focus is on the
single-screw extrusion process, which converts raw
material in the form of plastic granules into a viscous
molten fluid, resulting in a finished solid or flexible
product. Single-screw extrusion is a process that in-
volves a rotating screw with helical blades, which is
positioned inside a heated barrel. The extruder is fed
from a hopper mounted on top, with the plastic mate-
rial transported through the barrel by a rotating screw.
This screw moves the material along the barrel, where
it is heated and compressed. The molten plastic mate-
rial is then forced through a hole, known as a matrix,
which moulds it into the desired shape (Khan et al.,
2014).
To ensure the smooth operation of this process, the
machine has several sensors monitoring key parame-
ters such as temperature, pressure, and speed. As de-
tailed in (Groover, 2010), the screw has multiple
functions and is divided into sections that align with
these functions. The sections and functions are as fol-
lows:
1. Feed Section: Responsible for moving the mate-
rial from the hopper door and preheating it.
2. Compression Section: Transforms the material
into a liquid consistency, extracts air trapped be-
tween the pellets, and compresses the molten
mass.
3. Metering Section: Homogenizes the molten
mass and develops sufficient pressure to pump the
material through the die opening.
Several sensors capture data throughout the flow
to analyze and optimize the process, reduce waste,
and minimize defects. The machines are equipped
with four temperature sensors (one for each screw
section and one for the ambient temperature), two
pressure sensors (one in the barrel and one for the am-
bient pressure), and a speed sensor in the screw.
Temperature sensors serve vital and distinct func-
tions in each section of the extruder. The tempera-
ture sensor in the feeding section (temp1) monitors
the initial temperature of the plastic granules as they
are transported from the hopper to the barrel. This
ensures that the material is being preheated correctly
to facilitate subsequent melting and avoid thermal
shocks that could affect the quality of the product.
In the compression section, the temperature sensor
(temp2) measures the temperature during the melting
and compression of the plastic material. This allows
the liquid consistency of the plastic and the extraction
of trapped air to be controlled, ensuring that the mate-
rial reaches the ideal viscosity for extrusion and pre-
venting defects caused by air bubbles. In the meter-
ing section, the temperature sensor (temp3) monitors
the final temperature of the melt before it is forced
through the die. This ensures that the melt is at the
correct temperature to be moulded, preventing varia-
tions in the quality of the finished product. The am-
bient temperature sensor (ambient_temp) measures the temperature of the working environment, allowing the extruder's operating parameters to be adjusted based on the am-
bient conditions. This is because the ambient temper-
ature can influence the efficiency of heat transfer and
the behaviour of the plastic material.
It is also important to note the crucial role that
pressure sensors play. The pressure sensor in the bar-
rel monitors the internal pressure during the extrusion
process, controlling it to ensure efficient melting of
the material and to prevent problems such as over-
pressure, which can result in equipment damage or
cause the final product to fail. The ambient pressure
(ambient_pressure) sensor measures atmospheric pres-
sure in the working environment, allowing adjust-
ments to be made to the process to maintain consis-
tency in production. This is essential because ambi-
ent pressure conditions can affect the operation of the
equipment and the behavior of the plastic material.
The screw speed sensor (rotation) monitors the
rotation speed of the extruder screw, controlling the
feed rate and the flow of material through the barrel.
The speed of the screw directly influences the homo-
geneity of the melt and the quality of the end prod-
uct. It is therefore crucial to keep the speed within
the ideal parameters to guarantee the efficiency of the
process and the integrity of the product.
The sensors transmit data continuously every sec-
ond. Considering the study environment, the tempera-
ture in the first section of the screw is expected to vary between 130 and 150 °C, in the second section between 150 and 180 °C, and in the third section between 180 and 220 °C. The pressure in the barrel varies between 70 and 350 bar, while the rotational speed of the screw varies between 20 and 60 rpm. The ambient temperature varies between 18 and 30 °C, and the ambient pressure varies between 1005 and 1025
hPa. The details of the dataset fields are summarized
in Table 1.
A sample of data generated, covering 3 working
days, was considered to demonstrate the approach.
Table 1: Dataset Details.
Field              Type       Description
machine_id         string     Machine id
timestamp          timestamp  Specific time of registration
ambient_temp       float      Ambient temperature (°C)
ambient_pressure   float      Ambient pressure (hPa)
rotation           float      Screw rotation speed (RPM)
temp1              float      Temperature in the first section of the screw (°C)
temp2              float      Temperature in the second section of the screw (°C)
temp3              float      Temperature in the third section of the screw (°C)
pressure           float      Pressure in the barrel (bar)
The dataset consists of 203273 records, with data
collected from multiple sensors. Of these, 139209
records from the first two days are used for histori-
cal analysis, while the remaining 64065 records from
the last day are used for real-time analysis. The pres-
ence of null values in a specific sensor may indicate a
malfunction, which is crucial for quickly identifying
and solving technical problems that could affect pro-
duction. Additionally, the presence of outliers may be
caused by sensor malfunctions, resulting in substan-
tially non-standard records. Such anomalies often in-
dicate that the process deviates from ideal conditions,
which could affect the quality of the final product.
Figure 1 represents this sample, showing the values
recorded by each sensor over the 3 days and provid-
ing an overview of the dataset. This visualization fa-
cilitates the identification of null and anomalous val-
ues, critical aspects for data quality analysis. Identi-
fying and correcting these discrepancies immediately
is essential to maintaining operational efficiency and
avoiding large-scale production failures. This sam-
ple was chosen precisely because it contains the most
common data quality problems, representing a realis-
tic scenario for evaluating and dealing with inconsis-
tencies.
In industrial data processing, having a clear archi-
tecture is essential for capturing, processing, and stor-
ing information from various sources. Our architec-
ture leverages Apache Kafka (https://kafka.apache.org/)
as a message broker to
ensure smooth, reliable, and efficient communication
between sources and the data ingestion service. Each
industrial device acts as a distinct source, sending
data such as pressure and temperature through Kafka.
Messages are organized by specific topics (e.g., ma-
chine type) to maintain a coherent data flow.
At the core of this system is the ingestion service,
which consumes messages from Kafka. It uses a con-
figuration file to dynamically determine the topics and
tasks to execute on each message. To evaluate data
quality metrics, we configured only two tasks for this
study as represented in Figure 2: Raw Task for storing
data in its raw format and Data Quality for conducting
a data quality assessment. This assessment calculates
the metrics and stores them for analysis. Tools like
Grafana (https://grafana.com/)
can be used to visualize and analyze both
real-time and historical data, providing valuable in-
sights into system performance.
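The sketch below illustrates the idea of such a configuration-driven consumer. It uses the kafka-python client as one possible choice; the topic names, configuration structure, and task functions are illustrative assumptions rather than the actual implementation.

```python
import json
from kafka import KafkaConsumer  # kafka-python client (one possible choice)

# Illustrative configuration: which tasks run for which topic
CONFIG = {"extruder": ["raw", "data_quality"]}

def store_raw(record: dict) -> None:
    print("storing raw record", record["timestamp"])

def assess_quality(record: dict) -> None:
    print("queueing record for quality metrics", record["timestamp"])

TASKS = {"raw": store_raw, "data_quality": assess_quality}

consumer = KafkaConsumer(
    *CONFIG.keys(),
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:  # one message per sensor reading
    for task_name in CONFIG.get(message.topic, []):
        TASKS[task_name](message.value)
```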
To evaluate data quality, the records were orga-
nized into 5-minute blocks, allowing for continuous
and incremental analysis over time. Each block goes
through specific processing steps before the quality
metrics are calculated. For instance, null values are
identified and treated, as they may indicate sensor
failures or data transmission interruptions. Dynamic
statistics, such as minimum and maximum values ad-
justed by percentiles, are calculated based on the data
available up to the previous block. The results are
stored and made available for viewing, ensuring that
both the processed data and the derived metrics can
be analyzed in real-time. Although some examples
of pre-processing have already been presented, all the
steps required to calculate each quality metric will be
detailed later.
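A simplified sketch of this block-wise processing with Pandas is shown below; it assumes a DataFrame with a timestamp column, uses temp1 as an example sensor, and derives percentile limits only from data seen before the current block.

```python
import pandas as pd

def process_blocks(df: pd.DataFrame, column: str = "temp1") -> pd.DataFrame:
    """Group records into 5-minute blocks and compute per-block statistics,
    using limits derived only from data seen up to the previous block."""
    df = df.set_index("timestamp").sort_index()
    history = pd.Series(dtype=float)
    results = []
    for block_start, block in df.groupby(pd.Grouper(freq="5min")):
        values = block[column]
        results.append({
            "block": block_start,
            "nulls": int(values.isna().sum()),                       # possible sensor failures
            "p10": history.quantile(0.10) if len(history) else None,  # dynamic lower limit
            "p90": history.quantile(0.90) if len(history) else None,  # dynamic upper limit
        })
        history = pd.concat([history, values.dropna()])
    return pd.DataFrame(results)
```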
4 DATA QUALITY ANALYSIS
To ensure data quality in a manufacturing environ-
ment, it is essential to implement metrics that fa-
cilitate effective assessment. These metrics monitor
data and support informed decision-making, enhanc-
ing operational performance. In this case study, data
quality metrics are applied every 5 minutes, incorpo-
rating previous data and profile results derived from
functions using the Pandas library. While some met-
rics were used as presented in Section 2, others were
adapted to better align with the specific context of the
study.
To analyze the Accuracy dimension, Metric 1 was
Figure 1: Variation of each sensor over time.
Figure 2: Data Stream Pipeline.
applied, as defined in eq. 1. In each numerical col-
umn, for blocks of 5 minutes, the average normal-
ized accuracy of the sensor data is calculated. The
10th and 90th percentiles of all data up to the time
of the block are used as dynamic limits, which al-
lows the calculation to adapt to the sensor’s behaviour
over time. The choice of percentiles, rather than the
actual minimum and maximum values, prevents out-
liers previously identified in Figure 1 from influenc-
ing accuracy inappropriately. Figure 3 illustrates the
variation of the actual minimum and maximum values
measured in each 5-minute block, along with the dy-
namic limits based on the percentiles. It can be seen
that the actual values (dashed in red for the maximum
and green for the minimum) show more pronounced
fluctuations, which could distort accuracy if they were
used directly. The percentiles, on the other hand, of-
fer a more stable and adaptable range, justifying their
choice for a robust analysis that is less susceptible to
extreme deviations (orange line for the maximum and
blue for the minimum). For this calculation, profiling
tasks were used, such as identifying null values, ex-
tremes, and data types in each column. These tasks
correspond to the profiling categories defined by (Abedjan et al., 2018):
Cardinality, Value Distributions and Data Types, Pat-
terns and Domains, respectively.
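A minimal sketch of this adapted accuracy calculation, averaging the normalized scores of one block against percentile limits taken from previously seen data (the example values are hypothetical):

```python
import pandas as pd

def block_accuracy(values: pd.Series, history: pd.Series) -> float:
    """Mean normalized accuracy of a 5-minute block, using the 10th and 90th
    percentiles of all previously seen data as dynamic limits (eq. 1 adapted)."""
    p10, p90 = history.quantile(0.10), history.quantile(0.90)
    normalized = (values.dropna() - p10) / (p90 - p10)
    return float(normalized.mean())

history = pd.Series([138.0, 139.5, 140.2, 141.0, 142.3, 143.8])  # hypothetical temp1 history
block = pd.Series([140.0, 140.8, 141.5])                         # current 5-minute block
print(round(block_accuracy(block, history), 2))                  # ~0.47, inside [0.4, 0.6]
```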
Figure 3: Minimum and Maximum Variation.
The results of the Accuracy Metric 1 for the above
sample data are shown in Figure 4. This figure shows
the accuracy values of all the sensors, where it is pos-
sible to identify 4 to 5 situations on day 11 of Octo-
ber where the accuracy is close to the limits of the
[0, 1] interval. As the data from each column are
grouped into 5-minute blocks (with around 300 val-
ues per block) and the average accuracy of each block
is calculated dynamically based on normalization be-
tween the 10th and 90th percentiles, it is expected that
the majority of readings will fall within the accuracy
range [0.4, 0.6]. The 10th and 90th percentiles were
chosen after several attempts, as they proved to be
the most effective in representing the minimum and
maximum acceptable values for each column. This
range indicates a concentration of values within rel-
atively predictable limits. If the results fall outside
this expected range, this suggests the presence of val-
ues outside the expected standard, which may indi-
cate possible anomalies or significant inaccuracies in
the operation of the sensors or the machine. Figure 5
shows the same data on the same day, but in separate
graphs, which makes it easier to identify the inaccu-
racies in each sensor in more detail.
To evaluate the Completeness dimension, two
metrics were considered. Metric 2 (eq. 2),
when applied to different contexts (i.e., both rows and
columns), offers a unique perspective on data com-
pleteness. This metric is used to assess column in-
tegrity, providing a comprehensive view of data com-
pleteness by attribute, which is essential to identify
problems in each sensor. By row, it provides a de-
tailed view of data integrity at the record level, help-
ing to identify problems at the machine level as a
whole. To evaluate this metric, the last day of the de-
fined dataset sample (11 October) was used and Met-
ric 2 was applied to each 5-minute block (around 300
values per column and around 2400 values per row).
Through data profiling tasks, it was possible to iden-
tify the number of null values and the total count of
values, both by column and by row, as well as identify
the types of data. These tasks fall into the categories
of Cardinality and Data Types, Patterns and Domains.
The results obtained for completeness at row level are
shown in Figure 6, where it can be seen that complete-
ness is almost always 1 (ideal value), except between
5:55 and 6:05 where completeness drops to 0.79 and
0.66 in each block.
Metric 5, as presented in eq. 5, evaluates data
completeness by offering a complementary perspec-
tive and identifies missing records through timestamp
analysis, comparing the expected event pattern with
its actual occurrence. This approach provides insights
into the uniformity and temporal integrity of the data.
The mathematical representation of the metric is as
Figure 4: Accuracy results.
Figure 5: Accuracy results by column.
follows:
\[ \text{Completeness} = \frac{N_{occur}}{N_{exp}} \quad (5) \]
where N_occur represents the number of occurrences in the specified time interval and N_exp the number of expected occurrences in that same interval. This met-
ric provides a means of assessing the regularity of the
data and identifying any gaps or irregularities in the
Figure 6: Completeness results by row.
collection of events. To calculate eq. 5, the data pro-
filing tasks presented by (Abedjan et al., 2018) are
used, such as counting the number of rows in the Car-
dinality category and the task of identifying data types
in the Data Types, Patterns, and Domains category.
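A minimal sketch of Metric 5 for a single 5-minute block, assuming one reading per second (300 expected occurrences per block):

```python
import pandas as pd

def timestamp_completeness(timestamps: pd.Series, expected_per_block: int = 300) -> float:
    """Completeness (eq. 5): observed occurrences vs. expected occurrences,
    assuming one reading per second and 5-minute blocks (300 expected)."""
    n_occur = timestamps.notna().sum()
    return n_occur / expected_per_block

# Hypothetical block missing some readings
ts = pd.Series(pd.date_range("2024-10-11 06:00:00", periods=280, freq="s"))
print(timestamp_completeness(ts))  # ~0.93
```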
Figure 7 illustrates the comparison between the
result obtained by the Metric 5 for 5-minute blocks
(represented by the blue line, where each point indi-
cates the total completeness in a block) and the ideal
completeness value (which would be 1). As shown in
the figure, there are times when the result corresponds
to the expected pattern, while at other times the result
falls below the expected value, which could indicate
machine faults or data transfer problems. Incidentally,
none of the completeness values in the sample used
were higher than the expected value; if they were, this
could indicate duplicate values.
To calculate the Consistency Metric 3 (Eq. 3),
a historical dataset is required to establish the rules.
In this study, the first two days were used to identify
the rules, while the last day was used for verification.
Profiling and column dependencies were employed to
detect correlations, with a threshold of 0.7 indicat-
ing a high probability of a rule. Although correlation
alone is not sufficient to confirm a functional depen-
dency, it suggests possible relationships. Three strong
correlations were identified between different sensors
(temp1 with temp2, temp2 with temp3, and temp1
with temp3). The results, shown in Figure 8, reveal
that most of the 5-minute blocks on October 11 had a
consistency value of 1, indicating the identification of
all three rules. A smaller number of blocks showed a
consistency value of 0.66, indicating the identification
of two rules, and one block had a consistency value of
0.33, indicating the identification of only one rule.
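A simplified sketch of this two-step procedure, deriving correlation-based rule candidates from historical data and re-checking them per block, is shown below; the per-block correlation check is our assumption of how verification could be implemented:

```python
import pandas as pd

TEMP_COLS = ["temp1", "temp2", "temp3"]

def derive_rules(history: pd.DataFrame, threshold: float = 0.7) -> list:
    """Derive candidate rules from historical data: column pairs whose
    correlation exceeds the threshold are expected to stay correlated."""
    corr = history[TEMP_COLS].corr()
    return [(a, b) for i, a in enumerate(TEMP_COLS) for b in TEMP_COLS[i + 1:]
            if corr.loc[a, b] >= threshold]

def block_consistency(block: pd.DataFrame, rules: list, threshold: float = 0.7) -> float:
    """Consistency (eq. 3): fraction of derived rules still holding in the block."""
    if not rules:
        return 1.0
    held = sum(1 for a, b in rules if block[a].corr(block[b]) >= threshold)
    return held / len(rules)
```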
The Timeliness dimension, as defined in Metric 4
(eq. 4), assesses the relevance of data based on its age,
which refers to the time that has passed since the data
was collected. This metric is essential to determine
whether the data is still valid and relevant for analysis
at the current time. To calculate Timeliness, the age
of the data is first determined by calculating the dif-
ference between the current time and the timestamp
when the data was recorded. This process utilizes the
Data Types, Patterns and Domains category of data
profiling. Specifically, it involves identifying the col-
umn that represents the timestamp of the data.
The volatility parameter is crucial in defining the
time window during which data remains relevant for
analysis. It can be expressed as a range, based on
domain knowledge, operational norms, or experience.
In this study, a volatility value of 10 minutes was ini-
tially assumed, meaning that data is considered cur-
rent and valid for 10 minutes from the moment of col-
lection. It is important to note that the volatility value
may vary depending on the system’s specific charac-
teristics or sensors used.
Thus, Timeliness provides a metric between 0 and
1, where 1 indicates that the data is perfectly current
and 0 indicates that the data is outside its valid period.
The closer the data are to the present, the higher the
timeliness score, reflecting its relevance to decision-
making. The results of the timeliness metric remained
at 0 for most of the day, indicating that the data was
outside the range of validity defined for the analysis.
However, in the last blocks of time (between 17:50
and 18:00), there was an increase, with values reach-
ing 0.25 and 0.7499, because this data was
the most recent.
To summarise how data profiling contributes to as-
sessing each data quality metric, Figure 9 presents a
source-to-target diagram based on the taxonomy iden-
tified by the authors in (Abedjan et al., 2018). The di-
agram begins with the different types of data profiling
analysis - the analysis of individual columns and de-
pendencies between multiple columns - which form
the basis of the subsequent analysis categories.
These categories of analysis include Cardinality,
Value Distributions, Data Types, Patterns, Domains
and Functional Dependencies (the latter is not associ-
ated with any task and is directly related to the met-
ric), each of which plays a key role in specific tasks
Figure 7: Completeness results from Metric 5.
Figure 8: Consistency results.
Figure 9: Correspondence between data profiling tasks, metrics and dimensions.
such as checking for null values, counting records,
identifying extreme values and data types. These
tasks provide essential information for calculating
various quality metrics, which in turn are linked to
specific data quality dimensions such as accuracy,
completeness, consistency and timeliness.
This linkage provides a structured view of how
different elements of analysis and metrics converge
to support the dimensions of data quality.
5 CONCLUSIONS
The Internet of Things (IoT) and Cyber-Physical Sys-
tems (CPS) are integral to the advancement of Indus-
try 4.0 and smart manufacturing (Goknil et al., 2023).
IoT supports interconnected devices to collect and ex-
change data during the manufacturing process, while
CPS combines computing, networking, and physical
processes to create autonomous, adaptive systems.
These technologies enhance automation, efficiency,
and innovation within Industry 4.0. However, the vol-
ume and diversity of data generated by this environ-
ment present significant challenges, including issues
like transmission noise, device malfunctions, and in-
stability.
To address these, we propose a data quality moni-
toring pipeline that integrates seamlessly into the core
process, ensuring continuous management of data
quality as part of the operational workflow, thus im-
proving data reliability and process efficiency. Met-
rics specifically tailored for IoT scenarios are used to
monitor data quality, allowing real-time assessment
with minimal configuration and eliminating the need
for complex, custom solutions.
Data profiling is a fundamental component of this
pipeline, providing insights into the structure, dis-
tribution, and relationships within datasets. Profil-
ing tasks, such as detecting null values, extreme val-
ues, data types, and dependencies, generate metadata
crucial for assessing data quality dimensions such as
Accuracy, Completeness, Consistency, and Timeli-
ness. By taking a proactive profiling approach, we en-
able rapid responses to quality issues, ensuring high
data quality over time. Moreover, integrating data
profiling into the monitoring pipeline helps address
common IoT challenges, such as sensor malfunctions
and data gaps, which could otherwise affect opera-
tional performance and product quality. The profiling
outputs allow for automated checks, reducing human
intervention and enabling timely adjustments to main-
tain process stability.
Future work will focus on improving both perfor-
mance and outcomes by incorporating advanced tech-
niques such as sketching methods (e.g., t-digest (Dun-
ning, 2021)).
ACKNOWLEDGEMENTS
This work has been supported by the European Union
under the Next Generation EU, through a grant of
the Portuguese Republic’s Recovery and Resilience
Plan (PRR) Partnership Agreement, within the scope
of the project PRODUTECH R3 "Agenda Mobilizadora da Fileira das Tecnologias de Produção para a Reindustrialização". Total project investment:
166.988.013,71 Euros; Total Grant: 97.111.730,27
Euros.
REFERENCES
Abedjan, Z., Golab, L., Naumann, F., and Papenbrock, T.
(2018). Data profiling. Synthesis Lectures on Data
Management, 10:1–154.
Agolla, J. E. (2021). Smart Manufacturing: Quality Control
Perspectives. IntechOpen.
Azeroual, O., Saake, G., and Schallehn, E. (2018). Analyz-
ing data quality issues in research information systems
via data profiling. International Journal of Informa-
tion Management, 41:50–56.
Bandara, K., Bergmeir, C., and Smyl, S. (2020). Fore-
casting across time series databases using recurrent
neural networks on groups of similar series: A clus-
tering approach. Expert Systems with Applications,
140:112896.
Batini, C. and Scannapieco, M. (2016). Data and Informa-
tion Quality. Springer International Publishing.
Byabazaire, J., O’Hare, G., and Delaney, D. (2020). Us-
ing trust as a measure to derive data quality in data
shared iot deployments. In 2020 29th International
Conference on Computer Communications and Net-
works (ICCCN), pages 1–9. IEEE.
Cichy, C. and Rass, S. (2019). An overview of data quality
frameworks. IEEE Access, 7:24634–24648.
Dunning, T. (2021). The t-digest: Efficient estimates of
distributions. Software Impacts, 7:100049.
Goknil, A., Nguyen, P., Sen, S., Politaki, D., Niavis, H.,
Pedersen, K. J., Suyuthi, A., Anand, A., and Ziegen-
bein, A. (2023). A systematic review of data quality in
cps and iot for industry 4.0. ACM Computing Surveys,
55(14s):1–38.
Groover, M. P. (2010). Fundamentals of modern manufac-
turing: materials, processes, and systems. John Wiley
& Sons.
Heine, F., Kleiner, C., and Oelsner, T. (2019). Automated
Detection and Monitoring of Advanced Data Quality
Rules, pages 238–247. Springer, Cham.
Hu, C., Sun, Z., Li, C., Zhang, Y., and Xing, C. (2023).
Survey of time series data generation in iot. Sensors,
23.
ISO/IEC 25012:2008 (2008). Software engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Data quality model. Standard, Interna-
tional Organization for Standardization, Geneva, CH.
Khan, J., Dalu, R., and Gadekar, S. (2014). Defects in ex-
trusion process and their impact on product quality.
International journal of mechanical engineering and
robotics research, 3(3):187.
Kusumasari, T. F. and Fitria (2016). Data profiling for data
quality improvement with openrefine. In 2016 Inter-
national Conference on Information Technology Sys-
tems and Innovation (ICITSI), pages 1–6. IEEE.
Liu, C., Peng, G., Kong, Y., Li, S., and Chen, S. (2021).
Data quality affecting big data analytics in smart fac-
tories: research themes, issues and methods. Symme-
try, 13(8):1440.
Loshin, D. (2011). The Practitioner’s Guide to Data Qual-
ity Improvement. Elsevier.
Mahanti, R. (2019). Data Quality: Dimensions, Measure-
ment, Strategy, Management, and Governance. ASQ
Quality Press, USA.
Naumann, F. (2014). Data profiling revisited. ACM SIG-
MOD Record, 42:40–49.
Oliveira, Ó. and Oliveira, B. (2022). An extensible frame-
work for data reliability assessment. In Proceedings
of the 24th International Conference on Enterprise In-
formation Systems, pages 77–84. SCITEPRESS - Sci-
ence and Technology Publications.
Rangineni, S., Bhanushali, A., Suryadevara, M., Venkata,
S., and Peddireddy, K. (2023). A review on enhancing
data quality for optimal data analytics performance.
International Journal of Computer Sciences and En-
gineering, 11:51–58.
Schultheis, A., Malburg, L., Grüger, J., Weich, J., Bertrand,
Y., Bergmann, R., and Asensio, E. S. (2024). Iden-
tifying Missing Sensor Values in IoT Time Series
Data: A Weight-Based Extension of Similarity Mea-
sures for Smart Manufacturing, pages 240–257.
Sicari, S., Cappiello, C., Pellegrini, F. D., Miorandi, D.,
and Coen-Porisini, A. (2016). A security-and quality-
aware system architecture for internet of things. Infor-
mation Systems Frontiers, 18:665–677.
Tverdal, S., Goknil, A., Nguyen, P., Husom, E. J., Sen, S.,
Ruh, J., and Flamigni, F. (2024). Edge-based data pro-
filing and repair as a service for iot. pages 17–24. As-
sociation for Computing Machinery.
Zhang, L., Jeong, D., and Lee, S. (2021). Data qual-
ity management in the internet of things. Sensors,
21(17):5834.