tailored for a Portuguese manufacturer specialising in
wood-based panels, aims to optimise the melamine
impregnation process.
3.2 Data Analysis and Pre-Processing
The data pertaining to the melamine impregnation
process was collected from sensors placed across
two identical production lines from January 2022 to
February 2024. This data was in tabular format and
consisted of 77 distinct feature columns. The sepa-
rate datasets were consolidated into a single unified
dataset to facilitate more accurate analysis and in-
sights, resulting in a total of 105,000 samples.
A "Defect Code" was associated with each sample, serving as an identifier for the type of defect that occurred during the process. Samples produced with no defects were assigned the "0" defect code.
Initially, 91 different codes were identified. How-
ever, the analysis revealed that many of these actually
corresponded to the same defect description despite
having different defect codes. To address this issue, samples with repeated descriptions were merged under the code with the most samples, reducing the number of distinct defect codes to 60. However, predicting the defect type using 60 classes in a classification task would overly complicate the process. As a solution, all defect codes and
descriptions were grouped into 7 defect categories
based on similar properties. This categorisation en-
sured a more manageable classification.
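The merging step described above can be sketched in a few lines of plain Python. The codes, descriptions, and counts below are illustrative toy values, not the manufacturer's actual defect codes; the idea is simply to keep, for each repeated description, the code that occurs most often:

```python
from collections import Counter, defaultdict

# Toy records of (defect_code, description); values are illustrative only.
samples = [
    (12, "scratch"), (12, "scratch"), (12, "scratch"),
    (47, "scratch"),                      # same description, different code
    (30, "stain"), (30, "stain"),
    (55, "stain"), (55, "stain"), (55, "stain"),
    (0, "no defect"),
]

# Count occurrences of each (code, description) pair.
counts = Counter(samples)

# For every description, find the code with the most samples.
per_description = defaultdict(Counter)
for (code, desc), n in counts.items():
    per_description[desc][code] += n
best_code = {desc: c.most_common(1)[0][0] for desc, c in per_description.items()}

# Remap every sample onto its description's retained code.
merged = [(best_code[desc], desc) for _code, desc in samples]
```

After this remapping, every description is represented by a single code, which is what reduces the 91 initial codes to 60 in the paper's dataset.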
Given that the dataset had not undergone any prior cleaning or processing, further preparation was required to make it suitable for subsequent modelling. Initially, irrelevant features, duplicated columns, and
those with a majority of missing or invalid values
were removed. Afterwards, the Pearson correlation coefficient was calculated for each pair of feature columns. One feature from each pair with an absolute correlation coefficient of 0.9 or greater (|r| ≥ 0.9) was removed to eliminate redundancy. Categorical feature columns were then converted to a numerical representation, as most ML algorithms require numerical inputs to process the data effectively. The mapping between each category and
its numerical representation was saved in an external
file to ensure consistency in later processing. Follow-
ing this, samples with missing or invalid values were
discarded, and boxplots were utilised to identify and
eliminate samples with outlier values.
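The correlation-based pruning step can be illustrated with a small pure-Python sketch. The feature names and values below are hypothetical; for each pair with |r| ≥ 0.9, the first feature of the pair is kept and the second dropped:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy feature table (column name -> values); names are illustrative.
features = {
    "temp":     [20, 21, 22, 23, 24, 25],
    "temp_dup": [40, 42, 44, 46, 48, 50],   # perfectly correlated with temp
    "speed":    [5, 3, 6, 2, 7, 4],
}

# Drop one feature from every pair whose |r| >= 0.9.
dropped = set()
for a, b in combinations(features, 2):
    if a in dropped or b in dropped:
        continue
    if abs(pearson(features[a], features[b])) >= 0.9:
        dropped.add(b)          # keep the first feature of the pair

kept = [name for name in features if name not in dropped]
```

In practice this would run over all 77 columns; which feature of a correlated pair is retained is a design choice the paper does not detail, so keeping the first is only one possible convention.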
After completing the data cleaning and pre-
processing, the final dataset contained approximately
72,000 samples, indicating an initial reduction of
nearly 30%. Only around 2% of the samples rep-
resented defects, resulting in an imbalanced dataset.
The number of feature columns was reduced from the
initial 77 to 50.
3.3 Defect Prediction and Explanation
Modelling
Four ML methods were evaluated to predict defective
wood panels. Given the availability of labelled data
and its tabular format, supervised learning methods
were implemented, focusing on classification tasks.
Specifically, CatBoost, RF, XGB, and an ensem-
ble combining CatBoost, XGB, and RF were tested.
These algorithms were implemented using libraries
such as Scikit-learn, CatBoost, and XGBoost. Hyper-
parameter tuning and model optimisation were con-
ducted using the Scikit-learn GridSearchCV method.
The implemented models were trained to perform
three different types of classification:
• Binary classification: Predicting whether a sam-
ple is likely to be defective.
• Multiclass classification (1): Predicting the spe-
cific defect type for a sample previously identified
as defective.
• Multiclass classification (2): Predicting the defect
category for a sample previously identified as de-
fective.
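The three tasks above form a two-stage pipeline: a binary model first flags likely defects, and a multiclass model assigns a type or category only to flagged samples. The sketch below shows this control flow with stub predictors; the feature names and thresholds are hypothetical placeholders, whereas the paper's actual models are CatBoost, RF, XGB, and an ensemble of the three:

```python
# Two-stage prediction sketch. Stub models stand in for the trained
# classifiers; "pressure" and "temp" are hypothetical sensor features.

def binary_model(sample):
    # Stub: flag a sample as defective above an arbitrary threshold.
    return sample["pressure"] > 8.0

def category_model(sample):
    # Stub: pick one of the defect categories (the paper uses 7).
    return "surface" if sample["temp"] > 150 else "structural"

def predict(sample):
    """Return 0 for non-defective samples, else the predicted category."""
    if not binary_model(sample):
        return 0
    return category_model(sample)

samples = [
    {"pressure": 5.0, "temp": 120},   # not flagged as defective
    {"pressure": 9.5, "temp": 180},   # defective, surface category
    {"pressure": 9.0, "temp": 100},   # defective, structural category
]
predictions = [predict(s) for s in samples]
```

Chaining the stages this way mirrors the paper's setup, where the multiclass models only ever see samples previously identified as defective.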
The dataset underwent a chronological train-test split. The training dataset contained approximately 56,000 samples, while the testing dataset
comprised around 15,500 samples. As previously dis-
cussed, the available dataset was imbalanced, with de-
fective samples representing only 2% of the data. This
can negatively impact model training, compromising
both its accuracy and efficiency. To mitigate this issue, the SMOTE algorithm was applied to the training data to balance class occurrences. Figure 2 showcases the significant variation in sample counts observed among the different defect types before applying the SMOTE algorithm.
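SMOTE's core idea, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in pure Python. This is a didactic simplification of the algorithm (the paper presumably uses an off-the-shelf implementation such as imbalanced-learn's), with toy 2-D feature vectors:

```python
import random

def smote(minority, n_new, k=1, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating a randomly chosen sample with one of its k nearest
    minority neighbours. `minority` is a list of feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (e.g. one defect type) in 2-D feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
new_samples = smote(minority, n_new=5)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling stays inside the minority region instead of duplicating samples exactly.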
The models' performance was evaluated using recall and precision metrics, given the imbalanced nature of the dataset. Emphasis was placed on recall as
it focuses on minimising false negatives, ensuring that
the model is proficient at identifying all actual defects.
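Both metrics follow directly from the confusion counts, treating "defective" as the positive class. A minimal sketch with toy labels:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall with `positive` as the defect class.
    Recall = TP / (TP + FN) penalises missed defects (false negatives),
    which is why it is the emphasised metric here."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 1 = defective, 0 = non-defective.
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]
p, r = precision_recall(y_true, y_pred)
```

With only ~2% defective samples, plain accuracy would be misleading (predicting "no defect" everywhere scores ~98%), which is why these two metrics are used instead.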
An added layer of transparency and interpretabil-
ity was integrated using XAI methods. Local model-
agnostic techniques, specifically LIME and SHAP,
were employed for this purpose. LIME uncovered
the specific "rules" or conditions that influenced each
prediction, while SHAP highlighted the contribution
of each feature to the model’s predictions.
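The per-feature contributions SHAP reports are Shapley values. The sketch below computes them exactly for a tiny toy model by averaging each feature's marginal contribution over all feature orderings, with absent features set to a baseline; this illustrates the underlying idea rather than the shap library's (far more efficient) estimators, and the linear model and its weights are purely illustrative:

```python
from itertools import permutations

def shapley_values(model, baseline, sample):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings, filling absent features with baseline values."""
    n = len(sample)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = sample[i]     # "reveal" feature i
            now = model(current)
            phi[i] += now - prev       # marginal contribution of feature i
            prev = now
    return [v / len(orderings) for v in phi]

# Toy linear model over 3 hypothetical sensor features.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] - 3.0 * x[2]

baseline = [0.0, 0.0, 0.0]
sample = [1.0, 2.0, 1.0]
phi = shapley_values(model, baseline, sample)
```

For a linear model each Shapley value reduces to weight × (feature − baseline), and the values always sum to the difference between the prediction for the sample and for the baseline, which is the additivity property SHAP exposes to the operator.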
ICINCO 2024 - 21st International Conference on Informatics in Control, Automation and Robotics