Labor Accidents Classiﬁcation Using Machine Learning

adio da Silva Barbosa (1, a), Yandre M. G. Costa (1, b), Juliano H. Foleis and Diego Bertolini (2, c)

1 Department of Information Technology, State University of Maringá (DIN/UEM), Paraná, Brazil

2 DACOM, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil

Keywords:

Labor Accidents, Work Accidents, Occupational Accidents, Machine Learning, Classiﬁcation.

Abstract:

The application of artificial intelligence is growing in all public and private industry fields. In this work, we propose applying machine learning techniques to classify work accidents according to Brazilian law. The type of accident is part of the Communication of Occupational Accidents (CAT) database held by the National Institute of Social Security. In Brazil, these communications can come from different sources, and as a result some of them lack the type of work accident. This information is crucial to allow labor authorities to better understand the circumstances surrounding the accidents and to help plan and create more specific strategies to prevent them. To this end, we perform data cleaning and use feature engineering techniques to treat data from the CAT database. We then use machine learning algorithms to classify records according to the type of accident. Specifically, we investigate a strategy to identify the type of labor accident when this information is missing, using algorithms based on trees or gradient boosting trees. Preliminary results showed promising performance, with the algorithms achieving the following weighted-average F1-scores for labor accident type classification: XGBoost 0.94, CatBoost 0.94, LightGBM 0.94, and Random Forest 0.91. As far as we know, this is the first work to address work accident type classification using machine learning while considering Brazilian labor legislation and a large governmental dataset.

1 INTRODUCTION

The International Labour Organization (ILO) (ILO,

2021) estimates that around 2.39 million workers die

annually around the world due to labor-related acci-

dents. Approximately 340 million labor accidents

happen worldwide yearly, meaning a great social

and ﬁnancial cost for the respective countries. In

Brazil, according to SmartLab Observatory of Occu-

pational Safety and Health of Public Labour Prosecu-

tor (MPT) (MPT, 2020), between 2018 and 2020, the

average of labor accidents reached 569,998 per year.

This shows the seriousness of the situation and how important it is to build better statistical information to understand the problem and to propose effective solutions.

https://orcid.org/0000-0002-3597-1961

https://orcid.org/

https://orcid.org/0000-0002-6196-4538

In this work, 'labor accidents', 'work accidents', and 'occupational accidents' refer to the same concept.

Labor accidents may have many different causes, most of which could be avoided (Alli, 2008). The causes range from insufficient protective measures in the working environment and exhausting working schedules to a lack of adequate rest and exposure of many workers to excessive overtime. According to Brazilian Law No. 8,213 of 1991 (PBPS, 1991), labor accidents can be classified into three types:

• Typical - represents the labor accident that occurs

while the worker performs their duties.

• Displacement - represents the labor accident that

occurs while the worker is going to or leaving the

workplace.

• Labor Illness - all illnesses suffered by workers that are caused by or related to the tasks they perform while working.

Still, according to the same law, labor accidents

must be communicated to public authorities. This

communication is essential to help the federal govern-

ment and the Labour Inspectorate, in particular, build

more knowledge about the overall picture of work accidents in the country and how they occur. It also serves as proof when workers request governmental assistance.

Barbosa, E., Costa, Y., Foleis, J. and Bertolini, D.

Labor Accidents Classiﬁcation Using Machine Learning.

DOI: 10.5220/0011856500003467

In Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS 2023) - Volume 1, pages 509-516

ISBN: 978-989-758-648-4; ISSN: 2184-4992

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

In Brazil, a labor accident is communicated through a form known as CAT (acronym for "Comunicação de Acidente de Trabalho", in Portuguese). It can be sent by different entities (people or institutions): the company where the accident happened, the respective employee's union, any public authority, or even the workers themselves or their families. Allowing many sources to send CATs helps increase the number of communications sent. However, it also causes a problem of missing information. For example, many CATs lack the type of accident, which makes it difficult for the labor authorities to understand the accidents fully.

As it is difﬁcult to deal with the consequences of

work accidents, prevention is crucial because it can

avoid all related suffering and ﬁnancial costs. Be-

yond reducing the risks of damage to the worker’s

health, which is by far the most important one, pre-

vention also allows the companies’ activities to keep

going without interruptions. Another reason is that preventing accidents spares companies from government sanctions when it becomes evident that the accident was caused by the company's failure to observe the safety and security measures established by law. Therefore, it is crucial to have all information, especially

the type of accident, to understand the whole picture

and move straight toward building better strategies

and concrete actions to minimize labor accidents.

Machine Learning (ML) is an area of artiﬁcial in-

telligence that allows computers to learn from data

(Abu-Mostafa et al., 2012). It is part of an inter-

disciplinary scientiﬁc ﬁeld that combines computer

science, data science, and statistics through complex

data analysis, searching for patterns in the data. There are four types of ML algorithms (Burkov, 2019): supervised, semi-supervised, unsupervised, and reinforcement learning.

Supervised learning algorithms perform predic-

tion or classiﬁcation for new data samples based on

knowledge obtained from examples previously presented to the algorithm in the training phase. These algorithms can improve as more data are analyzed over time and more predictions and classifications are made, because information about earlier outputs is fed back to help correct their own mistakes. In this

work, the hypothesis is that ML can use data related

to labor accidents to classify the types of labor acci-

dents. Therefore, this work proposes to use Machine

Learning to create a model that uses supervised learn-

ing algorithms to predict the missing type of accident

in the CATs. It can help the Brazilian labor inspec-

torate improve its service to the general public. Thus,

the following research question (RQ) is raised in this

scenario: Is it possible to apply ML algorithms to

classify the types of labor accidents using the CAT

database as input data?

The remainder of this paper is organized as follows: Section 2 reviews related work available in the literature. Section 3 describes the machine learning techniques that support this research. Section 4 introduces the CAT database and the study design. Section 5 presents results and discussions. Finally, Section 6 presents concluding remarks and future work.

2 RELATED WORKS

(Di Noia et al., 2020) used machine learning strate-

gies in the context of work accidents, predicting oc-

cupational disease risks using computational intelli-

gence and pattern recognition techniques. They used

real data about the worker and the workplace from the

Italian Health Authority (ASL). Three machine learning algorithms were evaluated: K-means, Support Vector Machines, and K-Nearest Neighbors. In summary, the authors obtained encouraging results using

artiﬁcial intelligence approaches to create an alterna-

tive for occupational disease risk prediction.

(Shkanov, 2019) investigated multiclass classifiers for processing archives of accidents in manufacturing, comparing the best techniques for preprocessing labor accident data. The data comprised 1,600 acts from the archive of accidents in Russia's metallurgy industry. Each accident act is written in free text and includes three parts: event description, analysis of reasons, and recommendations. The authors selected and adjusted the best classification methods for these data and used two classifiers: one to group 19 classes related to the causes of accidents and another to group 39 classes of recommendations. The preprocessing was performed in three steps: text normalization, filtering, and parametrization. They used the following algorithms in the classification phase: Logistic Regression, Naive Bayes, Random Forest, and Gradient Boosting. For classifying the reasons for accidents, the accuracy was 79% for Random Forest, 82% for Gradient Boosting, and 84% for Logistic Regression. For classifying the recommendations, the accuracy was 63% for Random Forest, 64% for Gradient Boosting, and 66% for Logistic Regression.

Another related work uses the Random Forest model to predict occupational accidents at construction sites in Korea. In this work, (Kang and Ryu, 2019) analyzed labor accidents in the Korean construction industry. The data come from the Korea Occupational Safety and Health Agency (KOSHA). The KOSHA dataset includes 9,796 accident reports from construction sites from 2008 to 2014, with 55 input variables such as age, occupational injury, occupational accident, and others. The occupational accident types were grouped into nine classes and used as output values. The authors also gathered weather data from the Korea Meteorological Agency (KMA) to add temperature, humidity, wind speed, and precipitation to the accident dataset. Initially, they generated and analyzed derived variables comparing the occurrence of accidents against temperature, season, and wind chill changes. They used feature importance to rank the features by their contribution to the overall result of the model. To classify the types of labor accidents, they used the Random Forest CART (Classification and Regression Tree) algorithm, which reached an accuracy of approximately 71%.

(Suárez Sánchez et al., 2011) investigated the prediction of work-related accidents according to working conditions using the Support Vector Machines classifier. The database was composed of the responses to a 78-variable questionnaire applied to a total of 11,054 workers in Spain between December 2006 and April 2007. The target was to determine whether or not a worker had suffered an accident in the last year. They employed SPPCA (semi-parametric principal component analysis) to reduce the dimensions of the feature vectors. The MARS (multivariate adaptive regression splines) algorithm was also evaluated, and it was able to reduce the number of features from 78 to 18. The features ranked as most important by MARS were used as input to the SVM classifier. Hence, based on work conditions, they were able to classify the workers who had and had not suffered occupational accidents in a twelve-month period with an accuracy of 99.77%.

In (Suárez Sánchez et al., 2016), the authors applied K-Nearest Neighbors to classify workers according to their risk of musculoskeletal disorders. That work dealt with the binary classification of whether or not workers develop musculoskeletal disorders caused by occupational tasks. The input database was composed of the responses to a 78-variable questionnaire applied to a total of 11,054 workers in Spain between December 2006 and April 2007.

(Sarkar et al., 2019) dealt with the prediction of occupational accidents and the extraction of decision-making rules from labor accident data collected at a steel plant. The database had 3,308 incident records and 16 features (15 categorical and one textual). The dataset generated after a preprocessing step had 1,500 instances and 13 attributes. The authors used SVM and ANN (Artificial Neural Network) to perform classification. For parameter optimization, they used GA (genetic algorithm) and PSO (particle swarm optimization) to maximize the models' accuracy and robustness. Of the two classifiers, SVM achieved the higher accuracy and robustness. Finally, a set of nine rules was extracted to identify the root causes of injuries, of near misses (circumstances where workers were barely hit), and of property damage.

3 GRADIENT BOOSTING TREES (GBT)

In this section, we brieﬂy describe Gradient Boosting

Trees (GBT) algorithms, used for the development of

the experiments presented in this work.

In this work, we deal with tabular data as inputs

for the models. The literature shows that GBT is

among the most used methods for modeling discrete

or tabular data (Feng et al., 2018). (Shkanov, 2019)

also corroborates that, as GBT was also successfully

used in that work.

As pointed out by (Freund and R, 1997), Gradient Boosting algorithms combine weak learners into strong learners in an iterative way. The objective of gradient boosting is to find an approximation function $F(x)$ that maps instances $x$ to their output values by minimizing a given loss function over a specific set of training data $TD = \{(x_i, y_i)\}_{i=1}^{N}$. The algorithm approaches a minimum of the loss by following the direction of the negative gradient. Gradient boosting builds an additive approximation of $F(x)$ as a weighted sum of functions, as shown in Equation 1.

$$F_m(x) = F_{m-1}(x) + \rho_m h_m(x) \qquad (1)$$

In this case, $\rho_m$ is the weight of the $m$-th function, $h_m(x)$. These functions are the models of the ensemble (i.e., decision trees), and the approximation is constructed iteratively. As stated in (Zhou, 2012), ensemble learning is a technique that constructs a set of learners and combines them, boosting weak learners that are just slightly better than random. The ensembled learner has better generalization ability and can make very accurate predictions.
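The additive update in Equation 1 can be illustrated with a minimal sketch. It assumes a squared-error loss (so the negative gradient is simply the residual), one-split regression stumps as weak learners, and a constant stage weight `rho`; these choices, the function names, and the toy data are assumptions made for illustration and are not taken from the experiments in this paper.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Candidate thresholds exclude the largest x so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = sum((r - l_mean) ** 2 for r in left) + sum((r - r_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean

def gradient_boost(xs, ys, n_rounds=20, rho=0.5):
    """Build F_m(x) = F_{m-1}(x) + rho * h_m(x), starting from the mean of ys."""
    f0 = mean(ys)
    stumps = []

    def predict(x):
        return f0 + sum(rho * h(x) for h in stumps)

    for _ in range(n_rounds):
        # For the loss 1/2 * (y - F(x))^2 the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Toy one-dimensional data (hypothetical).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mse_const = mse(lambda x: mean(ys), xs, ys)
mse_model = mse(model, xs, ys)
```

On this toy data the boosted model fits the training points more closely than the constant baseline, which is the effect of adding one weighted weak learner per stage.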

GBT algorithms are based on trees and, as explained, use methods that create a robust predictor by combining less efficient ones (Yuichiro, 2012; James et al., 2013; Kubat, 2017). These methods are known as committees. When the result of the committee is considered, the prediction tends to compensate for the individual errors of each predictor (Kubat, 2017). The first adaptive boosting algorithm proposed in the literature was AdaBoost (Freund, 1997). Since then, boosting methods have evolved up to the arrival of Gradient Boosting algorithms (Bentejac, 2020).

GBT produces models sequentially in the form of a linear combination of decision trees, working on an infinite-dimensional optimization problem (Biau, 2019). Boosting is described as an ensemble strategy that works by dividing the training data and using each part to train different models, or one model with different setups; the results are then combined using majority vote (Daoud, 2019). GBT algorithms use a stage-wise approach and the loss function to avoid overfitting: learners are trained by minimizing the differentiable loss function of a weak learner through a gradient descent optimization process.

Concerning the GBT algorithms, in this work we experiment with and compare the results of XGBoost, LightGBM, and CatBoost. Each of them has some peculiarities, which are explored below as explained by (Daoud, 2019).

XGBoost was developed by (Chen and Guestrin, 2016) as a scalable machine learning system. It differs from other gradient boosting methods mainly in that it adds a regularization term to the loss function to deal with overfitting.
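As a hedged illustration of that extra term, the regularized objective described by (Chen and Guestrin, 2016) can be written as below. Here $l$ is a differentiable loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization hyperparameters. The symbols follow that reference and are not used elsewhere in this paper.

```latex
\mathcal{L} = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

The penalty $\Omega$ grows with the number of leaves and with the magnitude of the leaf weights, so simpler trees are preferred when two candidates reduce the loss by a similar amount.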

According to (Ke, 2017), LightGBM uses XGBoost as a baseline, but tackles the classification problem through the combined application of the following two techniques:

• Gradient-based One-Side Sampling - The model omits most of the examples whose gradients are expected to be small, which helps it focus on the examples that matter most for the information gain.

• Exclusive Feature Bundling - Reduces feature sparsity by bundling features together and reducing their total number, which helps decrease training time.
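The first technique can be sketched as follows. This is a minimal illustration of Gradient-based One-Side Sampling as it is usually described for LightGBM: keep every example among the top fraction by absolute gradient, draw a random subset of the rest, and up-weight that subset so gain estimates stay roughly unbiased. The function name, the rates, and the toy gradients are assumptions for illustration only.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    rng = random.Random(seed)
    n = len(gradients)
    # Rank examples from largest to smallest absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]                         # keep all large-gradient examples
    rest = rng.sample(order[n_top:], n_other)   # random subset of the small-gradient ones
    amplify = (1.0 - top_rate) / other_rate     # compensates for the discarded examples
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in rest})
    return top + rest, weights

# Toy gradients (hypothetical): example i has gradient i / 100.
gradients = [i / 100 for i in range(100)]
indices, weights = goss_sample(gradients)
```

With these rates only 30 of the 100 examples are used to evaluate splits, which is where the reduction in training cost comes from.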

As stated in (Prokhorenkova, 2019), CatBoost introduces two new techniques: one for handling categorical features and another for ordered, permutation-driven boosting. It handles the exponential growth of feature combinations by using a greedy method at every new split of the current tree.
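One way to picture the categorical handling is the ordered target statistic commonly associated with CatBoost: each row's category is replaced by a smoothed mean of the targets of rows that come before it in a random permutation, so a row never sees its own label. The sketch below is a simplified illustration under that assumption; the function name, the `prior` and `strength` parameters, and the toy data are hypothetical and not taken from this paper.

```python
import random

def ordered_target_encoding(categories, targets, prior=0.5, strength=1.0, seed=0):
    """Encode each row using only the targets of rows preceding it in a random order."""
    rng = random.Random(seed)
    order = list(range(len(categories)))
    rng.shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + strength * prior) / (n + strength)
        # Only now is the row's own target added to the running statistics.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Toy data (hypothetical): a categorical column and a binary target.
cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 0]
enc = ordered_target_encoding(cats, ys)
```

The first row of each category in the permutation receives exactly the prior, and later rows move toward the observed target rate for that category.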

The growing development, utilization, and flexibility of ML put this technique in a good position to solve the specific problem of this work. In this research, supervised machine learning classification algorithms are used to determine the types of labor accidents that are missing in the available CATs. Among a great number of algorithms, GBT shows good performance in the literature on data with the same characteristics as the CAT data. As an example, we can cite (Shkanov, 2019), in which Gradient Boosting and Logistic Regression obtained the best performances when compared to Random Forest and Naive Bayes.

4 MATERIALS AND METHODS

This section introduces the main characteristics of the

dataset used for developing this work. In addition, we

also describe the methodological design of the pro-

posed solution.

4.1 CAT Database

In this section, we describe the ﬁelds selected to com-

pose the CAT database and some details to help un-

derstand its particularities and usefulness in the de-

velopment of this work. As part of the work, we se-

lected the attributes to compose the database also tak-

ing into account tax secrecy concerns. In addition,

we have discarded some columns, originally present

in the data, in which we noticed too many null or spu-

rious values. The database produced contains 76,017

instances and a total of 30 attributes. This data was

collected from 2019 to 2022.

Table 1 shows the ﬁelds of the CAT database used

in the experiments described here. As we can see,

the data basically contains relevant information asso-

ciated with the accident occurrences.

4.2 Study Design

In this section, we describe the study design ﬂow, ac-

cording to the steps shown in Figure 1, which start

from selecting the data to be used as input for the clas-

siﬁcation model, followed by the classiﬁcation phase,

and ﬁnal results.

First, we analyzed the CAT database to better understand the overall quality of the data. In this stage, the aim was to determine whether the data had missing, null, or spurious values that could affect the models' results. We then performed data cleaning to eliminate those values from the database and preserve data integrity. As part of this step, we also removed blanks and dots from some column names.

The next stage was to perform feature engineering. First, we executed a feature transformation by calculating the age of each worker from their date of birth. As the database had some categorical values, we applied one-hot encoding to transform all categorical columns into numeric ones. This method creates a new column for every distinct value that a categorical column holds and assigns the value 1 to it in the rows where that value occurs (and 0 otherwise).

Figure 1: Graphical Abstract - Study Design Flow.
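The two feature engineering steps can be sketched as below. The sketch assumes that age is measured on the accident date (the text does not state the reference date) and uses hypothetical record keys (`birth`, `accident`, `sex`) rather than the actual CAT column names.

```python
from datetime import date

def age_at(birth: date, event: date) -> int:
    """Age in whole years on the day of the event."""
    had_birthday = (event.month, event.day) >= (birth.month, birth.day)
    return event.year - birth.year - (0 if had_birthday else 1)

def one_hot(rows, column):
    """Replace one categorical column with one 0/1 indicator column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded

# Two hypothetical records.
records = [
    {"birth": date(1980, 6, 15), "accident": date(2020, 6, 14), "sex": "F"},
    {"birth": date(1990, 1, 1), "accident": date(2021, 3, 2), "sex": "M"},
]
for r in records:
    r["age"] = age_at(r["birth"], r["accident"])
encoded = one_hot(records, "sex")
```

The first worker is still 39 on the accident date because the birthday falls one day later, and the `sex` column becomes the two indicator columns `sex_F` and `sex_M`.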

After performing feature engineering, guided by expert knowledge, to define the data composition, we chose some of the most used and promising classification algorithms described in the ML literature for tabular data. Comparing these algorithms on the task investigated here is one of the main objectives of this work.

The classification algorithms selected for this work are based on gradient boosting trees. They were chosen because the literature regards them as the dominant method for modeling discrete or tabular data (Feng et al., 2018).

In the following step, we observed the feature importance built into XGBoost. The importance is calculated as the amount by which each feature's split points improve the performance, weighted by the number of observations each node is responsible for. This creates a ranking that shows how much each feature improves the overall results. Figure 2 shows the top-10 features that contribute most to the results.

An important step for improving the results of boosting-based tree algorithms is finding the best set of hyperparameters. For this, we performed a grid search with 5-fold cross-validation. Grid search works by exhaustively searching a subset of the hyperparameter space of the targeted algorithm to find the combination that most improves the outcome. Table 2 lists the hyperparameters found using grid search for the algorithms used in this work.

Figure 2: Top-10 Feature Importance.
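The search procedure can be sketched in a few lines. The sketch below enumerates every combination in a small grid and averages a score over five folds. The scoring function is a hypothetical stand-in that ignores the fold indices; a real run would train and validate a classifier on each fold, typically through a library routine such as scikit-learn's GridSearchCV. The paper does not state which implementation was used.

```python
from itertools import product

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def grid_search(param_grid, score_fn, n_samples, k=5):
    """Try every combination in param_grid and keep the best mean fold score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [score_fn(params, train, valid)
                  for train, valid in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train_idx, valid_idx):
    # Hypothetical stand-in for "train on train_idx, evaluate on valid_idx".
    return -abs(params["learning_rate"] - 0.2) - 0.01 * abs(params["max_depth"] - 6)

grid = {"learning_rate": [0.01, 0.1, 0.2], "max_depth": [4, 5, 6]}
best_params, best_score = grid_search(grid, toy_score, n_samples=10)
folds = list(k_fold_indices(10, 5))
```

The cost grows with the product of the grid sizes times the number of folds (here 3 x 3 x 5 = 45 evaluations), which is why grid search is normally restricted to a small subset of the hyperparameter space.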

In the next stage, the machine learning algorithms were configured with their respective tuned hyperparameters to classify the type of accident. The final stage was to plot a confusion matrix for each algorithm and compare their results, showing which one performed better.
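A confusion matrix of this kind, and the per-class precision and recall derived from it, can be computed as in the sketch below. The label names and the toy predictions are hypothetical and only illustrate the bookkeeping.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def precision_recall(matrix, i):
    """Precision and recall of class i from a confusion matrix."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column total: predicted as class i
    actual = sum(matrix[i])                    # row total: truly class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

labels = ["typical", "displacement", "illness"]
y_true = ["typical"] * 4 + ["displacement"] * 2 + ["illness"]
y_pred = ["typical", "typical", "typical", "displacement",
          "displacement", "typical", "illness"]
matrix = confusion_matrix(y_true, y_pred, labels)
typical_precision, typical_recall = precision_recall(matrix, 0)
```

Off-diagonal cells show which classes are confused with each other, which per-class scores alone do not reveal.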

5 RESULTS AND DISCUSSIONS

After the hyperparameters were set up for each algorithm, the classification task was performed. The data were split into training and test subsets in an 80/20 proportion. Table 3 shows the number of accidents per type. As we can see, the data are imbalanced.
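With imbalanced classes, one common way to perform an 80/20 split is to stratify it, so that each accident type keeps roughly the same share in both subsets. The text does not say whether the split was stratified, so the sketch below is an assumption for illustration. It uses the class counts from Table 3, while the function name and the seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class keeps about the same share in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Class counts as reported in Table 3.
counts = {"typical": 63602, "displacement": 10788, "illness": 1627}
labels = [name for name, n in counts.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
illness_in_test = sum(1 for i in test_idx if labels[i] == "illness")
```

A purely random split could leave the smallest class (illness, about 2% of the records) under-represented in the test subset, whereas the stratified version reserves about 20% of each class for testing.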

These experiments aimed to improve the initial

classiﬁcation to achieve better and more robust re-

sults. Furthermore, the aim is also to compare differ-

ent classiﬁers’ performance and demonstrate which

achieves the best results. Therefore, all classiﬁers re-

ceived the same data to make it possible to compare

their results.

We used several metrics to analyze the performance of the models. Among them is the macro-averaged F1 score (or macro F1 score). It is computed as the arithmetic (unweighted) mean of all per-class F1 scores, so all classes are treated equally regardless of their support values. Another metric is the weighted-average F1 score, computed as the mean of all per-class F1 scores weighted by each class's support. Support refers to the number of actual occurrences of the class in the dataset, while the weight refers to the proportion of each class's support relative to the sum of all support values. The macro average is affected by the imbalance of the CAT database, since a poorly classified minority class lowers it as much as the majority class would, while the weighted average accounts for each class's support.

Table 1: CAT Database.

#   Field                                 Description
1   Accident's Date                       Date of the accident
2   Accident's state code                 State code of the accident
3   CBO                                   Employee occupation code
4   Employer's CNAE                       Employer's activity code
5   Contractor's CNAE                     Contractor's activity code
6   SEX                                   Employee's gender
7   CID-10                                International disease code
8   Date of birth                         Employee's date of birth
9   CAT's type                            Type of CAT
10  Employer's city                       Employer's city
11  Employer's state code                 Employer's state code
12  Employee marital status               Employee marital status
13  Pensioner's Activity                  Pensioner's activity
14  Pensioner's benefit value             Pensioner's benefit value
15  Pensioner's city                      Pensioner's city
16  Pensioner's state code                Pensioner's state code
17  Accident local type                   Accident local type
18  Part of the body hit                  Part of the body hit
19  Agent responsible                     Agent responsible for the accident
20  Accident's situation                  Situation that caused the accident
21  Nature of the injury                  Nature of the injuries
22  Retirement indicator                  Retirement indicator
23  Leave Indicator                       Leave indicator
24  Police complaint indicator            Police complaint indicator
25  Hospitalization indicator             Hospitalization indicator
26  Lack of CAT indicator                 Lack of CAT indicator
27  CAT issuance indicator                CAT issuance indicator
28  CAT issuance delay indicator          CAT issuance delay indicator
29  Medical leave solicitation indicator  Medical leave solicitation indicator
30  Death indicator                       Death indicator
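The difference between the two averages can be seen in a short worked example. The per-class F1 values below are the ones reported for XGBoost in Table 7, while the supports are hypothetical, since the per-class sizes of the test subset are not listed in the text.

```python
def macro_f1(f1_scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_scores) / len(f1_scores)

def weighted_f1(f1_scores, supports):
    """Mean of the per-class F1 scores weighted by each class's support."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total

per_class_f1 = [0.97, 0.82, 0.83]  # per-class F1 values from Table 7
supports = [800, 150, 50]          # hypothetical supports, for illustration only
macro = macro_f1(per_class_f1)
weighted = weighted_f1(per_class_f1, supports)
```

The macro average of these three values rounds to 0.87, matching the macro row of Table 7. With supports dominated by one class, the weighted average is pulled toward that class's score, which is why it is higher than the macro average.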

In the first experiment, we used the CatBoost classifier. The model was trained with 5-fold cross-validation, and its hyperparameters were set according to the grid search: learning_rate: 0.2, max_depth: 5, n_estimators: 300.

CatBoost works similarly to XGBoost. In this case, the number of iterations (or estimators) controls the maximum number of trees in the model, and the depth parameter controls how deep each tree can grow. Table 4 shows the results: the first part of the table shows the rates obtained with different metrics for each class, while the second part shows the overall performance.

Another algorithm evaluated was the LightGBM classifier. Again, the model was trained with 5-fold cross-validation, and its parameters were tuned with grid search. The parameters were set up as follows: boosting_type: gbdt, colsample_bytree: 0.65, learning_rate: 0.01, n_estimators: 8, num_class: 3, num_leaves: 6, objective: multiclass, reg_alpha: 1, reg_lambda: 1, seed: 500, subsample: 0.7.

LightGBM works, in general, similarly to the previous algorithms, but it performs leaf-wise tree growth. Here too, the number of iterations (or estimators) controls the maximum number of trees in the model, and the depth parameter controls how deep each tree can grow.

Table 5 shows the results for this model. The first part of the table shows the classification metrics for each class, while the second part shows the overall performance.

The next algorithm evaluated was the Random Forest classifier. As in the previous experiments, the model was trained and its parameters were tuned with grid search. The parameters were set up as listed in Table 2: n_estimators: 200, max_features: 7, min_samples_leaf: 1, min_samples_split: 2, n_jobs: 1.

The Random Forest algorithm also works with a number of decision trees assembled as a committee. The fundamental concept of this model is that the trees are relatively uncorrelated and, consequently, may correct each other's errors. Table 6 shows the results for the model. The first part of the table shows the classification metrics for each class, while the second part shows the overall performance.

The last experiment used the XGBoost classifier, which achieved the best results. The algorithm was run with 5-fold cross-validation and tuned with the hyperparameters that resulted from the grid search. The best hyperparameters found were: subsample: 0.5, num_class: 3, n_estimators: 100, max_depth: 6, learning_rate: 0.2, colsample_bytree: 0.5, colsample_bylevel: 0.5.

XGBoost works by creating trees level-wise (Daoud, 2019; Chen and Guestrin, 2016) and adding them to the model sequentially, in order to correct the residual errors in the predictions from the existing sequence of trees. As trees are added, the learning rate, or shrinkage factor, controls how fast the model learns, that is, how much of a correction each new tree contributes. The parameter n_estimators stands for the number of estimators or iterations and represents the total number of trees in the model, and the depth parameter represents how deep each tree is.

Table 2: Hyperparameters.

XGBoost                 CatBoost            LightGBM                Random Forest
subsample: 0.5          n_estimators: 300   subsample: 0.7          n_estimators: 200
num_class: 3            max_depth: 5        num_class: 3            max_features: 7
n_estimators: 100       learning_rate: 0.2  n_estimators: 8         min_samples_leaf: 1
max_depth: 6                                boosting_type: gbdt     min_samples_split: 2
learning_rate: 0.2                          learning_rate: 0.01     n_jobs: 1
colsample_bytree: 0.5                       colsample_bytree: 0.65
colsample_bylevel: 0.5                      num_leaves: 6
                                            objective: multiclass
                                            reg_alpha: 1
                                            reg_lambda: 1
                                            seed: 500

Table 3: CAT - Work Accidents per Type.

Typical   Displacement   Illness
63,602    10,788         1,627

Table 4: CatBoost Classifier results.

Class             Precision  Recall  F1-score
0 - Illness       0.96       0.97    0.97
1 - Displacement  0.89       0.68    0.77
2 - Typical       0.81       0.80    0.81
Macro avg         0.89       0.82    0.85
Weighted avg      0.94       0.94    0.94

Table 5: LightGBM Classifier results.

Class             Precision  Recall  F1-score
0 - Illness       0.96       0.97    0.97
1 - Displacement  0.90       0.66    0.76
2 - Typical       0.84       0.82    0.83
Macro avg         0.90       0.82    0.85
Weighted avg      0.94       0.94    0.94

Table 6: Random Forest Classifier results.

Class             Precision  Recall  F1-score
0 - Illness       0.92       0.98    0.95
1 - Displacement  0.97       0.50    0.66
2 - Typical       0.84       0.58    0.68
Macro avg         0.91       0.68    0.76
Weighted avg      0.91       0.91    0.91
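For reference, the tuned values reported in Table 2 can be written as plain configuration dictionaries. The sketch assumes that the underscores in the parameter names were lost in typesetting (for example, that "nestimators" stands for `n_estimators`), and it uses the keyword spellings documented by the respective libraries. The dictionary names are hypothetical, and the paper does not show the actual training calls.

```python
# Tuned values as reported in Table 2, with underscores restored (an assumption).
XGBOOST_PARAMS = {
    "subsample": 0.5,
    "num_class": 3,
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.2,
    "colsample_bytree": 0.5,
    "colsample_bylevel": 0.5,
}

CATBOOST_PARAMS = {
    "n_estimators": 300,
    "max_depth": 5,
    "learning_rate": 0.2,
}

LIGHTGBM_PARAMS = {
    "boosting_type": "gbdt",
    "colsample_bytree": 0.65,
    "learning_rate": 0.01,
    "n_estimators": 8,
    "num_class": 3,
    "num_leaves": 6,
    "objective": "multiclass",
    "reg_alpha": 1,
    "reg_lambda": 1,
    "seed": 500,
    "subsample": 0.7,
}

RANDOM_FOREST_PARAMS = {
    "n_estimators": 200,
    "max_features": 7,
    "min_samples_leaf": 1,
    "min_samples_split": 2,
    "n_jobs": 1,
}
```

Keeping the settings in one place like this makes it easier to rerun the four classifiers on the same data with exactly the reported configuration.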

As shown in Table 7, the XGBoost algorithm presented the best overall performance. Thus, we also present in Figure 3 the confusion matrix obtained with it. The first part of Table 7 shows the rates obtained with different classification metrics for each class, while the second part shows the overall performance.

Figure 3: XGBoost Confusion Matrix.

Table 7: XGBoost Classifier results.

Class             Precision  Recall  F1-score
0 - Illness       0.96       0.97    0.97
1 - Displacement  0.89       0.76    0.82
2 - Typical       0.83       0.82    0.83
Macro avg         0.90       0.85    0.87
Weighted avg      0.94       0.95    0.94

Finally, taking into account the results obtained,

we can conclude that the RQ raised in the introduction

was positively answered.

6 CONCLUDING REMARKS AND FUTURE WORK

The suffering of workers who are victims of occupational accidents is accompanied by a negative impact on companies' activities and on Brazil's economy as a whole. It also means a greater expenditure of public funds. Hence, there is an urgent need to search for alternatives that support the creation of preventive actions to reduce the occurrence of occupational accidents.

Although the initiatives from companies aiming at

reducing work accidents can be extremely useful, the

Brazilian labor inspectorate is deﬁnitely the institu-

tion capable of proposing alternatives that can be used

in the entire country and across all industries.

In this context, machine learning presents itself as a tool that, applied to the communications of labor accidents, can automatically classify the type of labor accident in cases where this information is missing. This classification can help the labor inspectorate create educational and inspection actions to reduce the problem. However, it is important to note that this is a very complex problem and that several initiatives have to be implemented simultaneously to have a wider impact. The initiative proposed in this research can be one of them.

Experiments on the CAT database showed that XGBoost achieved the best
performance for classifying the type of labor accident, with a
macro-average F1-score of 0.87 and a weighted-average F1-score of
0.94.

Future research could focus on other aspects of work accidents.
Machine learning can be used in many other ways, for instance, to
predict work-related illness and work accidents with fatal outcomes.
This subject is clearly very important, and new research is welcome
to contribute to reducing labor accidents and, therefore, to help
create a safer environment for workers across industries and across
the globe.

ACKNOWLEDGEMENTS

This research has been partly supported by the Brazilian agencies
National Council for Scientific and Technological Development (CNPq)
and the Brazilian Labor Ministry.

REFERENCES

Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T.

(2012). Learning from data, volume 4. AMLBook

New York, NY, USA.

Alli, B. O. (2008). Fundamental principles of occupational

health and safety second edition. Geneva, Interna-

tional Labour Organization, 15.

Bentejac, C. et al. (2020). A comparative analysis of gradient

boosting algorithms.

Biau, G. et al. (2019). Accelerated gradient boosting.

Burkov, A. (2019). The Hundred-Page Machine Learning

Book. Andriy Burkov, Canada, 1 edition.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree

boosting system.

Daoud, E. A. (2019). Comparison between XGBoost, LightGBM and
CatBoost using a home credit dataset. International Journal of
Computer and Information Engineering, 13(1):6–10.

Di Noia, A., Martino, A., Montanari, P., and Rizzi, A.

(2020). Supervised machine learning techniques and

genetic optimization for occupational diseases risk

prediction. Soft Computing, 24(6):4393–4406.

Feng, J., Yu, Y., and Zhou, Z.-H. (2018). Multi-layered

gradient boosting decision trees. Advances in neural

information processing systems, 31.

Freund, Y. and Schapire, R. (1997). A short introduction to boosting.

Freund, Y. et al. (1997). A decision-theoretic generalization
of on-line learning and an application to boosting.

ILO (2021). International labour organization world statis-

tic. Available on 25th May 2021.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013).

An introduction to statistical learning, volume 112.

Springer.

Kang, K. and Ryu, H. (2019). Predicting types of occu-

pational accidents at construction sites in korea using

random forest model. Safety Science, 120:226–236.

Ke, G. et al. (2017). LightGBM: A highly efficient gradient

boosting decision tree.

Kubat, M. (2017). An introduction to machine learning,

volume 2. Springer, Zurich, Switzerland.

MPT (2020). Observatory of occupational safety and health

of the public labor prosecutor of brazil (MPT). Avail-

able on 25th October 2020.

PBPS (1991). Brazilian labor law nº 8,213, from 24th July

1991. Brazilian ofﬁcial journal of the union.

Prokhorenkova, L. et al. (2019). CatBoost: unbiased boosting
with categorical features.

Sarkar, S., Vinay, S., Raj, R., Maiti, J., and Mitra, P. (2019).

Application of optimized machine learning techniques

for prediction of occupational accidents. Computers &

Operations Research, 106:210–224.

Shkanov, B. et al. (2019). Multiclass classifiers for processing
archives of accidents in manufacturing. In 2019 IEEE 14th
International Conference on Computer Sciences and Information
Technologies (CSIT), volume 1, pages 187–190.

Suárez Sánchez, A., Iglesias-Rodríguez, F., Riesgo Fernández, P.,
and de Cos Juez, F. (2016). Applying the k-nearest neighbor technique
to the classification of workers according to their risk of suffering
musculoskeletal disorders. International Journal of Industrial
Ergonomics, 52:92–99.

Suárez Sánchez, A., Riesgo Fernández, P., Sánchez Lasheras, F.,
de Cos Juez, F., and García Nieto, P. (2011). Prediction of
work-related accidents according to working conditions using support
vector machines. Applied Mathematics and Computation,
218(7):3539–3552.

Yuichiro, A. (2012). Pattern recognition and machine

learning. Elsevier.

Zhou, Z.-H. (2012). Ensemble Methods: Foundations and

Algorithms. Chapman & Hall/CRC.

ICEIS 2023 - 25th International Conference on Enterprise Information Systems
