Exploring the Impact of Dataset Accuracy on Machinery Functional

Safety: Insights from an AI-Based Predictive Maintenance System

Padma Iyenghar

1,2 a

innotec GmbH, Hornbergstrasse 45, 70794 Filderstadt, Germany

Faculty of Engineering and Computer Science,

University of Applied Sciences Osnabrueck, 49009 Osnabruck, Germany

Keywords:

Data Quality, Data Accuracy, Functional Safety, Artiﬁcial Intelligence (AI), Machine Learning (ML),

Predictive Maintenance, Reliability, Availability.

Abstract:

This paper focuses on the critical role of dataset accuracy in the context of machinery functional safety within

an AI-based predictive maintenance system in a manufacturing setting. Through experiments introducing

perturbations simulating real-world challenges, a decrease in performance metrics was observed—factors such

as sensor noise, labeling errors, missing data, and outliers were identiﬁed as contributors to the compromise

of the AI model’s accuracy. Implications for reliability and availability were discussed, emphasizing the

need for high-quality datasets to minimize the risk of unplanned downtime. Recommendations include the

implementation of robust data quality assurance processes and improved outlier detection mechanisms to

ensure the reliability and availability of machinery in high-risk environments.

1 INTRODUCTION

AI-enabled machinery functional safety involves the

application of Artiﬁcial Intelligence (AI) techniques

to enhance safety functions within machinery in ad-

herence to established standards such as ISO 13849

(ISO13849, 2023) and ISO 62061 (IEC62061, 2022).

One prominent example is predictive maintenance,

where AI algorithms analyse data from sensors and

equipment to predict potential failures before they oc-

cur. By employing Machine Learning (ML) mod-

els, such systems can identify patterns, anomalies,

and degradation in machinery performance, enabling

timely maintenance interventions to prevent unex-

pected breakdowns. This approach aligns with the

broader principles of machinery functional safety,

aiming to minimize risks and ensure safe operation.

However, recognizing the pivotal role of data

quality is essential for the effective implementation

of AI-driven enhancements in a machinery functional

safety. Multiple deﬁnitions of data quality exist in the

literature. A commonly adopted deﬁnition is ﬁtness

for use for a speciﬁc purpose. In the context of func-

tional safety, this entails ensuring that the data is suit-

able for attaining technological objectives, facilitating

https://orcid.org/0000-0002-1765-3695

informed and effective decision-making, and optimiz-

ing functional safety processes for AI-based software

and systems. Poor data quality can lead to unreliable

models and may pose signiﬁcant risks. Thus, ensuring

data quality is crucial when developing AI/ML mod-

els for high-risk applications in the context of func-

tional safety.

In the aforesaid context, this paper conducts a

comprehensive investigation on the impact of dataset

accuracy in machinery functional safety example.

The study provides novel insights into the implica-

tions of dataset inaccuracies on the reliability and

availability of machinery in high-risk environments.

Beyond analysis, the paper offers practical recom-

mendations for practitioners, suggesting measures to

enhance data quality assurance processes and improve

outlier detection mechanisms. The ﬁndings and rec-

ommendations serve as valuable guidance for practi-

tioners involved in the development of AI systems for

machinery functional safety and beyond, emphasiz-

ing practical considerations for achieving reliable and

available machinery.

Inline with the aforementioned contributions, the

following represent the novelties:

• Real-world use case focus: The paper centres on

a practical use case of predictive maintenance in a

manufacturing setting, providing insights into the

484

Iyenghar, P.

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System.

DOI: 10.5220/0012683600003687

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2024), pages 484-497

ISBN: 978-989-758-696-5; ISSN: 2184-4895

implications of dataset inaccuracies.

• Systematic experimentation: The study employs

systematic experiments introducing perturbations

(i.e., intentional disruptions or alternations), sim-

ulating common challenges faced by real-world

datasets used for training AI models.

• Identiﬁcation of contributing factors: Speciﬁc

factors, including sensor noise, labelling errors,

missing data, and outliers, are identiﬁed and anal-

ysed for their contribution to the compromise of

the AI model’s accuracy.

The remainder of the paper is organized as fol-

lows. Following this introduction section, related

work and a summary of key insights on data quality

assurance for AI/ML model in the context of func-

tional safety is presented in section 2. The real-world

practical use case and experimental scenarios are de-

tailed in section 3. Results and analysis are presented

in section 4. Discussion and conclusion is presented

in section 5.

2 RELATED WORK AND

SUMMARY OF KEY INSIGHTS

Data quality is a fundamental consideration in the de-

velopment of AI/ML models, because it directly im-

pacts the performance, accuracy, and reliability of AI

models

. The critical importance of high-quality data

for AI applications is empirically investigated in (Bu-

dach, 2022). The impact of six traditional data quality

dimensions such as consistent representation, com-

pleteness, feature accuracy, target accuracy, unique-

ness and target class balance on the performance of

ﬁfteen popular ML algorithms across various tasks,

highlighting the relationship between data quality di-

mensions and algorithm effectiveness, is presented in

(Budach, 2022). This study underscores the critical

role of accurate labelling, feature accuracy, and data

quality dimensions in different ML scenarios, pro-

viding insights on algorithm performance robustness.

But, a speciﬁc use case pertaining to a speciﬁc domain

like functional safety is missing in (Budach, 2022).

More recently, in (Priestley et al., 2023) a survey

of data quality requirements that matter in ML devel-

opment pipelines is presented. The proposed frame-

work categorizes data quality criteria based on both

the stage of the ML lifecycle and the main dimen-

sions of data quality, providing a practical guide for

data practitioners and organizations to enhance their

https://research.aimultiple.com/data-quality-ai/

data management routines in preparation for ML ap-

plications. While (Sessions and Valtorta, 2009) in-

troduces a data accuracy assessment method employ-

ing a Bayesian Network learning algorithm, empirical

studies examining the inﬂuence of dataset accuracy

on machinery functional safety, particularly in real-

world scenarios, are currently lacking.

On the other hand, several approaches have

emerged to validate data fed to ML pipelines. For ex-

ample, the validation system implemented in (Poly-

zotis et al., 2019) and (Schelter et al., 2020) focus

on validating data given a classiﬁcation pipeline as a

black box. In (Schelter et al., 2021), an experimental

library JENGA is introduced for testing ML-model’s

robustness under data errors. While the authors use

the concept of polluters or data corruptions, an exten-

sive experimental study is missing in (Schelter et al.,

2021). Automatic and precise data validation for ML

is discussed in (Shankar et al., 2023).

A most recent survey on AI for safety-critical sys-

tems in industrial and transportation domains is pre-

sented in (Perez-Cerrolaza, 2023). However, it lacks

an in-depth experimental evaluation on a speciﬁc as-

pect such as dataset quality in an industrial use case.

Further, experiences of adopting automated data val-

idation in an industrial ML project is presented in

(Lwakatare et al., 2021). The results in (Lwakatare

et al., 2021) indicate that adopting a data validation

process and tool in ML projects is an effective ap-

proach of testing ML-enabled software systems.

The literature review, state-of-the-art analysis, and

related work collectively reveal that various dimen-

sions of data quality have been scrutinized in relation

to the performance of ML algorithms. Nevertheless,

there is a noticeable gap in speciﬁc research focusing

on data quality assurance within the functional safety

domain. Moreover, there is a lack of exploration into

crucial aspects like dataset accuracy in a machinery

setting, particularly in real-world use cases such as

AI-based predictive maintenance systems.

2.1 Data Quality Assurance Facets

In this section, some important facets of data qual-

ity assurance within AI-based machinery functional

safety software/systems are summarized. General in-

sights are drawn from the literature review, neverthe-

less, an effort is made to adapt these insights to the

speciﬁc requirements of the functional safety domain

wherever applicable.

1. Safety critical scenario identiﬁcation (data rele-

vance): In the context of machinery functional

safety, it’s crucial to identify safety-critical sce-

narios. This involves collaboration with domain

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

485

experts to understand potential hazards and risks

associated with machinery operations. Simulation

tools like Simulink

and LabVIEW

can be em-

ployed to analyse and simulate critical scenarios

to ensure the system’s response meets safety re-

quirements.

2. Safety validation procedures (data accuracy): En-

suring data accuracy includes implementing val-

idation procedures during data collection and

cross-referencing with authoritative sources. Im-

plementing validation procedures speciﬁc to

safety-critical data involves rigorous testing and

veriﬁcation processes during data collection.

Tools like Great Expectations

or custom scripts

are useful for data integrity checks.

3. Safety standard compliance (data consistency):

Machinery functional safety often requires adher-

ence to speciﬁc safety standards (e.g., ISO 13849

for machinery safety). Data consistency in terms

of formats, units, and representations becomes

crucial for compliance. Libraries such as Pandas

or TensorFlow Data Validation

can be applied for

standardization and cleaning.

4. Safety gap analysis (data completeness): Ad-

dressing data completeness involves performing

gap analysis speciﬁcally tailored to safety require-

ment. Identifying missing data points that are

critical for safety assessments and augmenting

datasets with relevant safety-related information

is essential. Tools like Apache NiFi

or custom

scripts are valuable for data augmentation.

5. Real-time safety monitoring (temporal relevance):

Here, temporal relevance extends to real-time

monitoring of safety-critical data. Ensuring that

the safety data is continuously updated to reﬂect

changes in the operational environment is cru-

cial. Automated data update scripts, alongside

version control systems like Git

, facilitate track-

ing changes over time.

6. Safety critical bias mitigation (data collection

bias): Similar to general AI systems, machinery

functional safety must address bias, especially if

it could impact safety-critical decisions. Conduct-

ing bias analysis speciﬁc to safety-related data and

employing tools like TensorFlow’s Fairness Indi-

https://www.mathworks.com/help/simulink/

https://www.ni.com/documentation/en/labview/

https://greatexpectations.io/

https://pandas.pydata.org/

https://www.tensorﬂow.org/tfx/guide/tfdv

https://niﬁ.apache.org/docs.html

https://git-scm.com/doc

cators

and AI Fairness 360

becomes essen-

tial for mitigating biases in safety-critical appli-

cations.

7. Functional safety security measures (data security

and privacy): Data security and privacy in ma-

chinery functional safety involve additional con-

siderations due to the potential impact on hu-

man safety. Encryption and anonymization tech-

niques must be robustly implemented. Privacy-

preserving libraries like PySyft

are valuable for

safeguarding safety-critical data in the context of

machinery operations.

8. Safety critical event labeling (data labelling

q)uality: In the realm of machinery functional

safety, labelling quality extends beyond general

data labelling. It involves accurately labelling

safety-critical events and standardizing proce-

dures for doing so. Labelling tools like Label-

box

, along with tools like Dataturks

for reli-

ability testing, are instrumental in accurate data

labelling. They can be adapted to ensure accurate

labelling of safety-critical data/events.

9. Safety outlier detection and handling: Outliers

in safety-critical data may indicate abnormal be-

haviour that could lead to hazardous situations.

Specialized outlier detection methods using sta-

tistical techniques or ML algorithms Tools like

Scikit-learn (Pedregosa, 2011) for Python provide

various algorithms for outlier detection, and cus-

tom preprocessing scripts can be employed for

identifying and handling safety-critical outliers.

10. Safety documentation protocols: Documentation

in machinery functional safety domain involves

not only general data documentation but also spe-

ciﬁc safety documentation. Comprehensive data

dictionaries, safety protocols, and preprocess-

ing steps must be documented. Documentation

systems like Conﬂuence

or version-controlled

repositories like GitHub

are effective for man-

aging data documentation.

11. Validation and Testing: Validation and testing in

the context of machinery functional safety require

a heightened focus on ensuring that safety require-

ments are met. Using separate datasets for train-

ing, validation, and testing becomes critical, and

https://www.tensorﬂow.org/tfx/guide/tfma

https://github.com/Trusted-AI/AIF360

https://github.com/OpenMined/PySyft

https://www.labelbox.com/

https://www.dataturks.com/

https://www.atlassian.com/software/conﬂuence

https://github.com/

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

486

automated testing frameworks like TensorFlow

Extended (TFX)

or pytest

should be adapted

for safety-critical model validation and testing.

12. Continuous Monitoring: Continuous monitoring

in the context of machinery functional safety

involves real-time monitoring of safety-critical

data and model performance. Establishing alert-

ing systems for drift detection like Seldon Al-

ibi Detect

and utilizing monitoring platforms

contribute to effective continuous monitoring of

safety-related aspects in machinery operations.

Thus, to ensure reliability and integrity of data

in functional safety-critical applications, a compre-

hensive data quality assurance framework is essen-

tial. The crux of the such a framework involves en-

suring accuracy, consistency, completeness, and rel-

evance of data. This includes measures such as do-

main analysis, accurate data handling, standardiza-

tion, addressing bias, security measures, high-quality

labelling, outlier detection, thorough documentation,

and continuous monitoring. The selection of tools is

crucial, emphasizing the need for careful evaluation

based on project requirements and technology stack

in functional safety domains.

While various aspects have been discussed above,

the work presented in this paper primarily focuses on

assessing the inﬂuence of dataset accuracy on func-

tional safety. The evaluation is conducted through the

analysis of an AI-based predictive maintenance sys-

tem, extracting valuable insights on dataset accuracy

from a systematic experimentation.

3 SYSTEMATIC

EXPERIMENTATION

Let us consider a real-world scenario in the context of

machinery functional safety, speciﬁcally for a manu-

facturing process where an AI-based system is used

for predictive maintenance. The AI model is trained

to predict potential failures in a critical component of

a machine. The machine has a key component, a mo-

tor, and the aforementioned AI model has been de-

ployed to predict potential failures in the motor based

on (historical) sensor data.

As outlined in section 2.1, data accuracy stands

out as one of the key aspects in ensuring data qual-

ity assurance. The objective of our systematic exper-

imentation is to illustrate the impact of input data ac-

curacy on the performance of the model, subsequently

https://www.tensorﬂow.org/tfx

https://docs.pytest.org/en/7.4.x/

https://github.com/SeldonIO/alibi-detect

inﬂuencing predictive maintenance outcomes and im-

plications for functional safety aspects, such as relia-

bility and availability.

3.1 Dataset Description

The sensor values are used to simulate input data for

the aforesaid predictive maintenance AI model. In the

real-world scenario, this involves training a model to

predict whether a machine is likely to experience a

failure or malfunction based on historical sensor data.

The AI model, is trained to learn patterns in the

input data (features) to predict the target variable (la-

bels). The synthetic dataset is generated with ﬁve

features for each data point. They are, temperature:

which is the operating temperature of the motor, vi-

bration: the vibration levels experienced by the mo-

tor, current draw: the amount of current the motor is

drawing during operation, voltage ﬂuctuations: ﬂuc-

tuations in the voltage supplied to the motor and run-

ning hours: the total number of hours the motor has

been in operation. These values are intended to sim-

ulate readings from sensors on machinery. The labels

are binary (0 or 1) and represent the target variable.

In the context of predictive maintenance, these labels

indicate whether a machine is operating normally (0)

or if there is a potential issue (1).

In the experimental setup, a range of datasets

each comprising different sample sizes namely, 1000,

10000, 20000, 50000, 100000, 150000, and 300000

is utilized. The investigation aims to analyse the im-

pact of varying data quality and also incorporates a

sensitivity analysis by simulating issues within the

datasets. This approach allows to comprehensively

assess how challenges in data accuracy affect the per-

formance of AI/ML models, especially concerning

accuracy and generalization across different dataset

sizes.

Please note that, while the paper presents a real-

life case study of predictive maintenance, the uti-

lization of synthetic datasets is a common practice

in ML research for several reasons. Firstly, syn-

thetic datasets offer control over various parameters,

enabling systematic experimentation and analysis of

model behaviour under different conditions. Sec-

ondly, real-world datasets for high-risk environments

like predictive maintenance can be limited or inacces-

sible due to privacy concerns or proprietary reasons.

Therefore, synthetic data allows to simulate realistic

scenarios while maintaining data privacy and acces-

sibility. Additionally, using synthetic data facilitates

the reproducibility of the study, as the dataset gen-

eration process can be fully documented and shared,

ensuring transparency and enabling other researchers

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

487

to validate and build upon the ﬁndings. Thus, while

the data may be synthetic, the study’s focus remains

on addressing real-world challenges and providing in-

sights applicable to practical predictive maintenance

scenarios.

3.2 Issues in Dataset

The experimental setup considers a speciﬁc set of

issues with the AI dataset, and the subsequent text

brieﬂy each identiﬁed issue. A remark on the solu-

tions to avoid each of the issue is brieﬂy mentioned.

Note that the selection of these speciﬁc issues is based

on their critical impact on the overall quality and re-

liability of the AI model for predictive maintenance.

This decision is based on a combination of domain

knowledge, best practices in AI/ML and literature

reviews (Katsuki and Osogami, 2023), (Teh et al.,

2020), (Northcutt et al., 2021).

3.2.1 Sensor Precision and Calibration

• Issue: Sensor measurements may have inherent

inaccuracies or imprecisions.

• Solution: Implement sensor calibration proce-

dures to enhance the precision of measurements.

Regularly validate and adjust sensor accuracy to

minimize errors in the dataset.

3.2.2 Labelling Errors

• Issue: Errors in labelling data instances can lead

to misinterpretations by the AI model.

• Solution: Conduct a thorough review and vali-

dation of labelled data. Implement a process for

cross-checking and verifying labelled instances to

ensure accuracy.

3.2.3 Incomplete or Missing Data

• Issue: Incomplete or missing data points may

result in an imprecise understanding of the ma-

chine’s health.

• Solution: Perform a gap analysis to identify and

address missing data. Augment the dataset with

additional instances to ensure completeness and

accuracy.

3.2.4 Outliers in Training Data

• Issue: An outlier refers to anomalies or extreme

values present in the training dataset and can neg-

atively impact model accuracy.

• Solution: Utilize outlier detection techniques to

identify and handle anomalies in the dataset. En-

sure that the model is not biased by rare or ex-

treme cases.

3.3 Experiments on Dataset Accuracy

Impact

To empirically illustrate the importance of AI dataset

accuracy in the context of machinery functional

safety, experiments are designed that highlight the im-

pact of inaccurate or incomplete datasets on the per-

formance of the predictive maintenance AI model.

The AI model is trained using a dataset with properly

labelled instances and minimal noise. This is referred

to as the baseline model for the experiments.

3.3.1 Data Perturbation

Perturbations are introduced (in the baseline model)

to simulate common issues in real-world datasets, as

listed below

• Sensor Noise: Random noise is added to sensor

measurements to simulate inaccuracies.

• Labelling Errors: Labelling errors are introduced

to simulate inaccuracies in the ground truth (i.e.,

accurate and reliable data)

• Missing Data: Random data points are removed

to simulate incomplete datasets.

• Outliers: Outliers are introduced to test the

model’s robustness to extreme cases.

The perturbation strength ranges from 0.1 to

0.9 indicating the proportion or percentage of the

dataset that will be perturbed. For example, perturba-

tion strength = 0.1 implies that 10% of the dataset will

be perturbed. The speciﬁc behaviour of the perturba-

tion depends on the perturbation type, and each type

of perturbation (sensor noise, labelling errors, missing

data, outliers) has its own mechanism for introducing

changes to the dataset.

For the experiments in this paper, perturbations in-

troduced are: sensor noise: adds random noise to the

features, labelling errors: ﬂips a certain percentage of

labels, missing data: introduces missing values (NaN)

in a certain percentage of data and outliers: replaces

a certain percentage of data with outliers. The combi-

nation of perturbation type and perturbation strength

allows simulating different scenarios of data imper-

fections and evaluate how well the AI/ML model per-

forms under such conditions.

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

488

3.4 Performance Metrics and

Evaluation Scenarios

To demonstrate the impact of data accuracy and noisy

data on model performance, various evaluation met-

rics such as accuracy, precision, recall, and F1 score

are used to compare the model’s performance for var-

ious evaluation scenarios. In the following, the per-

formance metrics are deﬁned, and the evaluation sce-

narios are brieﬂy described.

3.4.1 Performance Metrics

Accuracy metric represents the overall correctness of

the model’s predictions. It’s calculated as the ratio

of correctly predicted instances to the total instances.

Precision is the proportion of true positive predic-

tions among all positive predictions. It indicates the

model’s ability to avoid false positives. Recall (sen-

sitivity) is the proportion of true positive predictions

among all actual positive instances. It measures the

model’s ability to capture all relevant instances. F1-

Score is the harmonic mean of precision and recall. It

provides a balanced measure of precision and recall,

especially when there is an imbalance in the class dis-

tribution.

Apart from these performance metrics, sensitivity

analysis is conducted to assess the model’s robustness

and performance under varying conditions or param-

eter changes, providing insights into how the model’s

predictions may be inﬂuenced by alterations in input

parameters or external factors. Thus, sensitivity anal-

ysis contributes to quality assurance for data quality

by revealing the impact of data variations on model

performance. It facilitates the identiﬁcation of poten-

tial data quality issues and guides strategies for im-

proving both data quality and model robustness.

3.4.2 Evaluation Scenarios

The evaluation scenarios in the empirical study are

outlined below.

• Baseline Evaluation: The baseline model is eval-

uated on a clean, accurate dataset to establish a

reference performance level.

• Impact of Sensor Noise: The model’s perfor-

mance when trained and tested with datasets con-

taining varying degrees of sensor noise is evalu-

ated.

• Effect of Labelling Errors: The model’s robust-

ness to labelling errors by training it on datasets

with introduced label inaccuracies is evaluated.

• Handling Missing Data: The model’s perfor-

mance when faced with missing data points, as-

Figure 1: Baseline evaluation metrics for different dataset

sizes.

sessing the impact of data incompleteness is in-

vestigated.

• Robustness to Outliers: The model’s ability to

handle outliers by training and testing on datasets

containing extreme values is tested.

3.5 Implementation

A custom-deﬁned script is implemented in Python

which uses libraries such as NumPy, scikit-learn and

Matplotlib to generate synthetic datasets, perturb the

data, train and evaluate logistic regression models,

perform baseline evaluation, detailed sensitivity anal-

yses varying perturbation strength and type, and visu-

alize the results. The experiments are run on an Intel

Core i7-8550U-1.8 GHz CPU, X64-based PC system

running Windows 10.

4 RESULTS AND ANALYSIS

Results and analysis of the results is presented in this

section, primarily based on the performance metrics

and evaluation scenarios outlined in section 3.4.1 and

section 3.4.2 respectively.

4.1 Baseline Evaluation

The evaluation of the model on clean and accurate

datasets of varying sizes reveals interesting insights

into its performance metrics, as seen in Fig. 1. The

accuracy of our model, representing the proportion

of correctly classiﬁed instances, ranges between 49%

and 55%.

Precision, which measures the accuracy of posi-

tive predictions, demonstrates moderate values rang-

ing from approximately 48% to 53%. This implies

that when the model predicts a positive instance, it is

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

489

reasonably accurate, but there is room for improve-

ment.

Recall (sensitivity) values, ranging from 26% to

81%, indicate the model’s ability to capture a sub-

stantial proportion of actual positive instances. De-

spite the variability, the model excels in identifying

positive instances.

F1 score, a harmonic mean of precision and re-

call, provides a balanced assessment. The F1 scores

in range from 34% to 62%, indicating a reasonable

compromise between precision and recall. The high

recall values suggest that the model is adept at iden-

tifying positive instances, but the lower overall accu-

racy implies challenges in correctly classifying neg-

ative instances. This discrepancy may stem from the

model’s struggle with certain negative cases.

Thus, the model used here model exhibits

strengths in capturing positive instances but faces

challenges in achieving higher overall accuracy. This

is left intentionally unoptimized, as the primary focus

of this work is on investigating the effects of pertur-

bations on data quality. Therefore, maintaining the

baseline performance as is, the performance on per-

turbed datasets was analysed.

4.2 Sensor Noise Perturbation

The sensitivity analysis for sensor noise perturbation

for each perturbation strength across different datasets

is shown in Fig.2. For a dataset size of 1000 samples,

the accuracy decreases as perturbation strength in-

creases. Precision and F1 Score show a similar trend.

High recall, at lower perturbation strengths, suggests

a trade-off between precision and recall. For a dataset

size of 10000, similar trends are seen as the 1000-

size dataset. The impact of sensor noise on accuracy

is consistent across dataset sizes. For dataset sizes

20000 and 30000 the general trend of decreasing ac-

curacy with increased perturbation strength persists.

In summary, from Fig. 2, it is evident that the model’s

performance is consistently affected by sensor noise

across different dataset sizes.

4.3 Labelling Errors Perturbation

The sensitivity analysis for labelling errors perturba-

tion for various dataset sizes is shown in Fig. 3. For

a dataset size of 1000, similar trends to sensor noise

are seen, but labelling errors have a slightly more pro-

nounced impact on accuracy. On the other hand, pre-

cision and recall are affected, showing a trade-off. For

dataset size of 10000, 20000 and 30000 consistent

patterns with a decrease in accuracy, precision, and

recall are seen as perturbation strength increases. In

summary, labelling errors have a more signiﬁcant im-

pact on performance compared to sensor noise.

4.4 Missing Data Perturbation

The sensitivity analysis for missing data perturbation

for various dataset sizes is shown in Fig. 4. For

dataset size of 1000, a similar pattern of decreas-

ing accuracy with higher perturbation strength is ob-

served. The impact is more pronounced in recall, sug-

gesting missing data features affects model sensitiv-

ity. For data set size of 10000, 20000 and 30000 con-

sistent trends across dataset sizes is observed. The

model is sensitive to missing data, and the impact in-

creases with dataset size.

4.5 Outliers Perturbation

The sensitivity analysis for outliers perturbation for

various dataset sizes is shown in Fig. 5. For dataset

size of 1000 samples, the accuracy shows a ﬂuctuat-

ing pattern with perturbation strength. Precision and

recall also show a similar pattern. For dataset size of

10000, 20000 and 30000, accuracy, precision, and re-

call exhibit similar trends. Outliers have a noticeable

impact on the model’s performance, and the effect re-

mains consistent across different dataset sizes.

4.6 Overall Inference on Impact of Data

Accuracy

The baseline model performance serves as a reference

point or control and provides a basis for understand-

ing how the model responds to different data condi-

tions. Hence, the baseline model with about 50% ac-

curacy is justiﬁed in the context of demonstrating the

impact of perturbations on data accuracy.

The sensitivity analysis clearly demonstrates that

the model’s performance is sensitive to various types

of perturbations. The decrease in accuracy, precision,

and recall indicates that noisy data signiﬁcantly af-

fects model performance. In general, as the dataset

size increases, the impact of perturbations on model

performance becomes more pronounced. Labelling

errors and missing data have a more substantial im-

pact on performance compared to sensor noise and

outliers. There’s a noticeable trade-off between preci-

sion and recall in many scenarios, indicating the sen-

sitivity of the model to noisy data.

The experiment underscores the importance of

data accuracy in training robust ML. Noisy data, in-

troduced through different perturbation types, con-

sistently degrades model performance across various

dataset sizes. Larger datasets provide some resilience

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

490

Figure 2: Sensitivity analysis for sensor noise perturbation for various dataset sizes.

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

491

Figure 3: Sensitivity analysis for labelling errors perturbation for various dataset sizes.

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

492

Figure 4: Sensitivity analysis for Missing data perturbation for various dataset sizes.

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

493

Figure 5: Sensitivity analysis for Outliers perturbation for various dataset sizes.

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

494

to perturbations, but maintaining data accuracy is still

critical.

4.7 Implications on Functional Safety

The observed impact of dataset inaccuracies on the

performance of the AI model has signiﬁcant implica-

tions for the reliability and availability of machinery.

These are crucial aspects in the context of functional

safety.

4.7.1 Reliability

Reliability, in this context, refers to the ability of the

AI model to consistently provide accurate predictions

regarding the health and maintenance needs of ma-

chinery. The decrease in performance metrics under

perturbations indicates that inaccuracies in the dataset

can compromise the reliability of the AI model. Fur-

ther, inaccuracies in the dataset can lead to false

alarms or missed predictions, affecting the overall re-

liability of the predictive maintenance system. An

unreliable AI model may result in unexpected break-

downs, leading to unplanned downtime and potential

safety risks. Thus, to enhance model reliability, it’s

crucial to address dataset inaccuracies through im-

proved data quality assurance processes. Implement-

ing a more robust outlier detection mechanisms to

handle extreme cases and to ensure the model’s pre-

dictions are trustworthy could help improve model re-

liability.

4.7.2 Availability

Availability, in the context of machinery functional

safety, refers to the system’s ability to operate as ex-

pected without disruptions, downtime, or failures. In-

accurate predictions from the AI model can impact

the availability of machinery, as maintenance actions

may be taken when unnecessary or delayed when

needed. Unreliable predictions may lead to unnec-

essary maintenance interventions, reducing the avail-

ability of machinery due to time spent on non-critical

tasks. Conversely, if the model fails to detect ac-

tual issues due to inaccuracies, it may lead to unex-

pected failures, increasing downtime and impacting

availability.

Thus, it is imperative to ensure that the AI model

is trained on a high-quality dataset to improve the ac-

curacy of predictions and reduce the likelihood of un-

necessary maintenance actions. Further, implement-

ing a well-deﬁned feedback loop to continuously im-

prove the model’s accuracy over time, contributing

to increased availability by minimizing false predic-

tions and identifying actual issues promptly, is rec-

ommended.

4.7.3 Overall Considerations

Rigorous data quality assurance processes serve as a

critical risk mitigation strategy. By addressing dataset

inaccuracies, the overall risk of unreliable predictions

and potential safety hazards is reduced. Striking a

balance between accurate predictions and minimizing

downtime is essential. The AI model should provide

reliable insights without causing unnecessary disrup-

tions to machinery operations. The ﬁndings under-

score the importance of continuous monitoring, feed-

back loops, and adaptation to maintain and improve

the reliability and availability of the AI system over

time. The AI system should be integrated with exist-

ing safety measures and protocols to ensure that pre-

dictions align with the overall safety goals of the ma-

chinery.

Thus, addressing dataset inaccuracies is pivotal

for ensuring the reliability and availability of machin-

ery in functional safety applications. Accurate pre-

dictions from AI models contribute to efﬁcient main-

tenance practices, reduce the risk of unplanned down-

time, and enhance overall safety in high-risk environ-

ments.

4.8 Discussion

Factors such as sensor noise, labelling errors, missing

data, and outliers were identiﬁed as contributors to

the compromise of the AI model’s accuracy through

systematic experimentation. Perturbations simulating

these factors were introduced into the datasets used

for training and testing the AI model. By observing

the changes in performance metrics such as accuracy,

precision, recall, and F1 score under different per-

turbation scenarios, the impact of each factor on the

model’s accuracy was quantiﬁed and analysed. This

empirical approach allowed for a comprehensive un-

derstanding of how inaccuracies in the dataset affect

the model’s predictive capabilities.

4.8.1 Importance of Dataset Quality

In the experiments, perturbations were systematically

introduced to simulate common real-world challenges

in datasets used for training AI models in machinery

functional safety. The consistent trend of decreasing

performance metrics across all scenarios underscores

the paramount importance of high-quality, accurate

datasets in ensuring the reliability and effectiveness

of AI models.

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

495

• Sensor noise impact: The decline in performance

under increased sensor noise emphasizes the vul-

nerability of the AI model to inaccuracies in sen-

sor measurements. This underscores the need for

precise sensor calibration and robust data prepro-

cessing techniques to mitigate noise-induced er-

rors.

• Effect of labelling errors: The decrease in per-

formance when labelling errors were introduced

highlights the sensitivity of the AI model to in-

accuracies in ground truth information (i.e., accu-

rate or reliable data). It underscores the necessity

for rigorous data labelling validation processes to

prevent misinterpretations and subsequent model

inaccuracies.

• Handling missing data: The deterioration in per-

formance due to missing data emphasizes the

signiﬁcance of data completeness. Incomplete

datasets compromise the model’s ability to un-

derstand and predict machine health accurately.

Strategies for data augmentation or careful han-

dling of missing data are crucial in maintaining

dataset completeness.

• Robustness to outliers: The observed decline in

performance in the presence of outliers under-

scores the importance of robust model architec-

ture. Models must be designed to handle extreme

cases without signiﬁcant degradation in perfor-

mance. Additionally, outlier detection and prepro-

cessing steps are essential to ensure model stabil-

ity.

4.8.2 Implications of Compromised Dataset

Accuracy

The implications of compromised dataset accuracy

are profound, particularly in safety-critical environ-

ments such as machinery operations. Inaccurate or

noisy data can lead to a cascade of issues that under-

mine the reliability and availability of machinery, ul-

timately impacting operational efﬁciency and safety.

Firstly, compromised dataset accuracy can result

in unreliable predictions from AI models. For in-

stance, sensor noise or labelling errors may cause

false alarms or missed detections of potential machin-

ery failures, leading to unnecessary maintenance in-

terventions or, conversely, delayed responses to crit-

ical issues. This not only affects the reliability of

predictive maintenance systems but also jeopardizes

the availability of machinery by increasing the risk

of unplanned downtime. Moreover, inaccurate pre-

dictions may erode trust in AI-driven systems, po-

tentially leading to scepticism among operators and

maintenance personnel, further exacerbating safety

risks. Additionally, compromised dataset accuracy

can hinder the effectiveness of decision-making pro-

cesses, as unreliable insights may prompt inappropri-

ate actions or overlook genuine threats. These impli-

cations underscore the need for stringent data quality

assurance measures to ensure the integrity and relia-

bility of datasets used in AI-driven applications

4.8.3 Recommendations for Data Quality

Assurance

The ﬁndings in this paper have direct implications for

practitioners and organizations involved in the devel-

opment of AI models not only for machinery func-

tional safety but in high-risk environments across in-

dustries. The recommendations presented in this pa-

per, although lacking extensive real-world empirical

validation, are based on fundamental principles of

data quality assurance and ML. These principles, such

as sensor calibration, labeling validation, data com-

pleteness strategies, and outlier handling, transcend

speciﬁc industries and can be adapted to diverse do-

mains beyond manufacturing. The challenges posed

by sensor noise, labeling errors, missing data, and

outliers are ubiquitous in AI-driven systems, making

the strategies proposed here applicable across vari-

ous applications. While empirical validation would

strengthen the evidence, the transferability of these

strategies lies in their robustness and resilience to

address common data quality issues across different

contexts.

Thus, to enhance the robustness and reliability of

AI systems in high-risk applications, the following

recommendations are proposed:

• Investment in sensor calibration: Prioritize reg-

ular calibration of sensors to ensure precise and

accurate measurements. This reduces the impact

of sensor noise on the AI model’s predictions and

improves overall system reliability.

• Rigorous labelling validation: Establish robust

procedures for validating and cross-verifying la-

belled data. This includes periodic audits and con-

sistency checks to minimize the introduction of la-

belling errors during model training.

• Data completeness strategies: Implement strate-

gies to address missing data, such as data aug-

mentation or imputation techniques. Ensuring a

complete dataset enhances the model’s ability to

accurately capture the underlying patterns in ma-

chinery health.

• Enhanced outlier handling: Develop models with

enhanced outlier detection mechanisms. This in-

cludes preprocessing steps to identify and handle

ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering

496

outliers effectively, preventing them from nega-

tively impacting model performance.

• Continuous monitoring and improvement: Estab-

lish continuous monitoring mechanisms to track

the performance of the AI model over time. Im-

plement feedback loops that allow the model to

adapt and improve based on corrected data, main-

taining accuracy in dynamic operational environ-

ments.

In summary, the experiments provide empirical evi-

dence supporting the critical role of accurate datasets

in machinery functional safety setting. By adhering

to these recommendations, practitioners can build AI

systems that are more resilient, accurate, and reliable,

ultimately contributing to enhanced safety outcomes

in high-risk applications.

5 CONCLUSION

The empirical experiments conducted to investigate

the impact of dataset accuracy on AI model perfor-

mance in the realm of machinery functional safety

have yielded valuable insights into the critical na-

ture of high-quality data in ensuring the reliability

and effectiveness of AI systems. It is evident that in-

accuracies in the training dataset lead to diminished

predictive capabilities, potentially compromising the

safety and reliability of machinery in industrial set-

tings. The ﬁndings in this work underscore two key

points, namely the importance of dataset quality and

recommendations for data quality assurance.

Future work in this domain could explore ad-

vanced techniques for enhancing dataset quality, such

as the integration of anomaly detection algorithms

and robust preprocessing methods. Investigating

the adaptability of the AI model to dynamic opera-

tional environments and evolving machinery condi-

tions would be valuable. Another avenue for research

involves experimentation with a combination of real-

world datasets and synthetic datasets. This approach

would allow for a more comprehensive evaluation of

model performance and generalizability by incorpo-

rating the complexities and nuances present in real-

world data, while still maintaining the beneﬁts of con-

trolled experimentation offered by synthetic datasets.

REFERENCES

Budach, Lukas, e. (2022). The effects of data quality on ma-

chine learning performance. arXiv:2207.14529. arXiv

preprint, https://arxiv.org/abs/2207.14529.

IEC62061 (2022). Safety of machinery—Functional

safety of safety-related electrical, electronic and pro-

grammable electronic control systems. IEC.

ISO13849 (2023). Safety of machinery—Safety-related

parts of control systems—Part 1: General principles

for design. ISO.

Katsuki, T. and Osogami, T. (2023). Regression with sen-

sor data containing incomplete observations. In Pro-

ceedings of the 40th International Conference on Ma-

chine Learning, volume 202 of Proceedings of Ma-

chine Learning Research, Honolulu, Hawaii, USA.

Lwakatare, L. E., R

ange, E., Crnkovic, I., and Bosch,

J. (2021). On the experiences of adopting auto-

mated data validation in an industrial machine learn-

ing project. CoRR, abs/2103.04095.

Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Perva-

sive label errors in test sets destabilize machine learn-

ing benchmarks.

Pedregosa, F. e. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

Perez-Cerrolaza, e. (2023). Artiﬁcial intelligence for safety-

critical systems in industrial and transportation do-

mains: A survey. ACM Comput. Surv. Just Accepted.

Polyzotis, N., Zinkevich, M., Roy, S., Breck, E., and

Whang, S. (2019). Data validation for machine learn-

ing. In Talwalkar, A., Smith, V., and Zaharia, M., ed-

itors, Proceedings of Machine Learning and Systems,

volume 1, pages 334–347.

Priestley, M., O’donnell, F., and Simperl, E. (2023). A

Survey of Data Quality Requirements That Matter in

ML Development Pipelines. J. Data and Information

Quality, 15(2).

Schelter, S., Rukat, T., and Biessmann, F. (2020). Learn-

ing to validate the predictions of black box classi-

ﬁers on unseen data. In 2020 ACM SIGMOD Interna-

tional Conference on Management of Data, SIGMOD

’20, pages 1289–1299, New York, NY. Association for

Computing Machinery.

Schelter, S., Rukat, T., and Biessmann, F. (2021). JENGA -

A framework to study the impact of data errors on the

predictions of machine learning models. In EDBT,

pages 529–534. OpenProceedings.org.

Sessions, V. and Valtorta, M. (2009). Towards a method for

data accuracy assessment utilizing a bayesian network

learning algorithm. J. Data and Information Quality,

1(3).

Shankar, S., Fawaz, L., Gyllstrom, K., and Parameswaran,

A. (2023). Automatic and precise data validation for

machine learning. page 2198–2207, New York, NY,

USA. Association for Computing Machinery.

Teh, H., Kempa-Liehr, A., and Wang, K. (2020). Sensor

data quality: A systematic review. Journal of Big

Data, 7:11.

Exploring the Impact of Dataset Accuracy on Machinery Functional Safety: Insights from an AI-Based Predictive Maintenance System

497