Occupational Accidents Prediction in Brazilian States: A Machine Learning Based Approach

J. M. Toledo 1,2,a and Thiago J. M. Moura

Federal Institute of Paraíba (IFPB), Avenida Primeiro de Maio, 720, Jaguaribe, João Pessoa, Paraíba, CEP 58015-435, Brazil

Ministério do Trabalho e Emprego, Esplanada dos Ministérios, Bloco F, Brasília, DF, Brazil

Keywords:

Occupational Accidents, Machine Learning, Regression Problems.

Abstract:

An occupational accident is an unexpected event connected to work that may result in injury and/or death of workers. Thus, the possibility of predicting the occurrence of occupational accidents can assist the government in labor policy-making, protecting the lives and health of workers. In this work, we propose the use of machine learning models to predict the occurrence of occupational accidents in each Brazilian state. We use multiple datasets concerning socio-economic, employment, and demographic data as sources to obtain an integrated table used to train regression models (linear regression, support vector regressor, and LightGBM) and make predictions. We verify that the developed models show high predictive performance and explainability, with the R-squared metric reaching 0.90.

1 INTRODUCTION

The ﬁrst joint report produced by the International La-

bor Organization (ILO) and the World Health Organi-

zation (WHO) to assess the burden of illnesses and

injuries at work estimates that these cause the deaths

of almost two million workers per year (Organiza-

tion et al., 2021). In 2016 alone, occupational acci-

dents and work-related diseases caused the death of

1.9 million people, overloading the countries’ health

systems, reducing family income, and decreasing eco-

nomic productivity (Organization et al., 2021).

In Brazil, between 2012 and 2021, there were 6,161,623 (six million, one hundred and sixty-one thousand, six hundred and twenty-three) occupational accidents and work-related diseases reported to official government agencies, in addition to 22,954 (twenty-two thousand, nine hundred and fifty-four) deaths due to work-related reasons (MPT, 2023). It is worth highlighting that the social security expenses estimated to result from these events have already exceeded 133 billion reais (MPT, 2023), approximately 25.7 billion dollars at the 2023 exchange rate, a significant portion of the Brazilian Gross Domestic Product (GDP).

According to specialized literature, however, work-related accidents and illnesses are caused by multiple factors that could be prevented (Alli, 2008). According to Brazilian legislation, employers are required to implement preventive measures to eliminate or mitigate the risks present in the workplace, while the government is responsible for enforcing labor legislation and promoting a safe working environment, with a focus on preventing accidents and work-related diseases. To achieve this objective, government agencies can use technology to increase the efficiency of their actions.

a https://orcid.org/0000-0001-9284-0549

Recently, we have experienced the growth of ma-

chine learning (ML), driven not only by the data avail-

ability but also by the increasing processing power of

computers (Alpaydin, 2021). This ﬁeld of study al-

lows machines to learn from past data and make pre-

dictions (Alpaydin, 2021). Machine learning algo-

rithms have been applied to solve various problems,

such as building recommendation systems, fraud de-

tection, and image recognition (Alpaydin, 2021).

Several areas of knowledge, such as medicine and en-

gineering, have also made use of advances in the area

to automate diagnoses and anticipate results.

Although Brazilian legislation requires occupational accidents and work-related diseases to be reported to public entities, there is a delay between the reporting of occupational accidents and the use of these reports by the government. Thus, forecasting the number of occupational accidents allows preventive actions to be anticipated.

Toledo, J. and Moura, T.

Occupational Accidents Prediction in Brazilian States: A Machine Learning Based Approach.

DOI: 10.5220/0012557900003690

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 26th International Conference on Enterprise Information Systems (ICEIS 2024) - Volume 1, pages 595-602

ISBN: 978-989-758-692-7; ISSN: 2184-4992

Furthermore, predicting work accidents for the coun-

try’s economic sectors can help in establishing public

policies for health and safety at work and prevention

of occupational accidents.

The objective of this work is twofold. First, we build a dataset containing the number of occupational accidents together with information extracted and processed from multiple sources, which is used as independent variables in predictive models. Subsequently, we analyze the use of machine learning algorithms to predict the number of occupational accidents in each economic activity and Brazilian state (the territorial divisions of Brazil). As far as we know, this work is unprecedented in Brazil and may help the government to act more efficiently, reducing pension costs and increasing the general well-being of society.

2 BACKGROUND AND RELATED WORKS

In this section, we brieﬂy examine the fundamental

concepts of machine learning and perform a biblio-

graphical review of the employment of ML in occu-

pational accident prediction.

2.1 Machine Learning

In recent years, the increase in computational capacity

and data storage has driven the development of ML,

which has been applied in many areas of knowledge.

In this work, we intend to predict the number of occupational accidents in each Brazilian state and economic activity. The target variable is therefore numerical (a count, treated here as a continuous quantity), and, as a consequence, we propose the use of regression models. Let us briefly describe the ML regression algorithms implemented in the proposed experimental protocol.

The simplest regression model is linear regression (James et al., 2013). It assumes that there is an approximately linear relationship between the features and the target variable. Data are used to find the linear coefficients that minimize the discrepancies between predicted and actual output values. This kind of model, although simple compared to more modern models such as the ones described below (Support Vector Machines, SVM, and LightGBM), is widely used in science (James et al., 2013).

SVM algorithms are a class of ML models developed in the 1990s that have gained popularity since then (James et al., 2013). SVM models were initially introduced for classification problems and later generalized to other settings, including regression (the support vector regressor, SVR, used in this work). They are currently used in various application domains, such as text categorization and computer vision (Mammone et al., 2009).

LightGBM is a gradient boosting tree algorithm developed by Microsoft with a focus on the efficiency and scalability of the ML model (Ke et al., 2017). Compared to other boosting tree implementations, LightGBM saves time and computational cost, allowing researchers and developers to deal with big datasets (Schapire et al., 1999).

Given the variety of available machine learning models, deciding which method produces the best results on a given dataset is an important task (James et al., 2013). Thus, let us briefly summarize the metrics used in this work to evaluate the predictions of the trained regression models.

The root mean square error (RMSE) is the square root of the mean of the squared differences between the actual and predicted values of a variable. The closer the RMSE is to zero, the better the predictions. The mean absolute percentage error (MAPE), in turn, is the mean of the absolute differences between predicted and actual values, each expressed as a percentage of the actual value. Finally, the coefficient of determination (R²) represents the proportion of variance in the target that can be explained by the features. For a least-squares fit with an intercept evaluated on its own training data, R² ranges between 0 and 1 (on held-out data it can become negative when the model predicts worse than the mean of the target), and the greater the value of this metric, the better the target variable is explained by the features through the regression model.
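To make the three metrics concrete, the following minimal sketch computes them with the Python standard library only. The function names and the toy values are illustrative assumptions and are not taken from the experiments of this work.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (undefined if a true value is 0)."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (made up for illustration only).
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"MAPE = {mape(y_true, y_pred):.2f}%")
print(f"R2   = {r2(y_true, y_pred):.4f}")
```

Note that MAPE divides by the actual value, so instances whose actual value is zero need special handling before this metric is computed.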

2.2 Related Works

In recent years, some works have been produced us-

ing ML techniques on themes related to workers’

health and, more speciﬁcally, using data on occupa-

tional accidents.

In this regard, Sarkar et al. predicted whether an accident caused damage to workers or property with an accuracy of around 90% (ninety percent) by testing SVM and ANN (Artificial Neural Networks) models and applying GA (genetic algorithm) and PSO (particle swarm optimization) algorithms to tune the hyperparameters of the models (Sarkar et al., 2019). In turn, Recal and Demirel used logistic regression, SVM, ANN, and SGB (Stochastic Gradient Boosting) to classify work accidents that occurred in the construction industry in Turkey in two scenarios: binary prediction (fatal accident or not) and prediction in three classes (simple, severe, or fatal accident) (Recal and Demirel, 2021). The authors conclude that the SVM and SGB algorithms performed better in the two-class problem, while SGB obtained better metrics in the three-class problem. In addition, the authors state that the predictions in the class of fatal accidents surpassed the results of other classes in accuracy, which reveals that the selected features have characteristics associated with the severity of accidents and, therefore, the trained models can be used to prevent future occurrences (Recal and Demirel, 2021).

Khairuddin et al. analyzed a public OSHA (Occupational Safety and Health Administration) database using five machine learning algorithms: SVM, KNN (K-Nearest Neighbors), Naïve Bayes, Decision Tree, and Random Forest (Khairuddin et al., 2022). The authors used a feature optimization technique through which only the three most important features of the models are kept in the algorithms' training process. Using this methodology, the authors could predict the possibility of hospitalization with 89% (eighty-nine percent) accuracy and the occurrence of amputation as a result of an accident at work with 95% (ninety-five percent) accuracy.

ML models were also used for predictions in some specific economic activities. Koc et al., for example, used data from approximately 48,000 accidents in civil construction in Turkey and predicted the possibility of permanent disability of the injured workers with an accuracy of 82% (eighty-two percent) by applying the XGBoost (Extreme Gradient Boosting) algorithm and using a genetic algorithm to tune the hyperparameters of the model (Koc et al., 2021).

In another work, Scott et al. used prehospital care data to predict which admissions occurred as a result of occupational accidents in rural areas (Scott et al., 2021). Intending to help reduce the underreporting of occupational accidents, the authors used the Naïve Bayes algorithm and claimed to reduce by 69% (sixty-nine percent) the need for visual inspection of pre-hospital care cases (Scott et al., 2021). In the medical-hospital activity, Koklonis et al. used post-accident (or post-incident) data to classify events into five classes: needle/cut accident, fall, incident, accident, and safe condition (Koklonis et al., 2021). The authors categorized the data into the classes above with an accuracy of 93% (ninety-three percent), performing tests with the Naïve Bayes, MLP (multilayer perceptron), KNN, and BN (Bayesian Networks) algorithms.

In Brazil, the Labor Inspectors created a binary classification model that outputs a probability of occurrence of accidents (Toledo et al., 2020). The trained model presented an 86% (eighty-six percent) accuracy on the test dataset, and the generated probabilities have been used in the planning of inspections in the country (Toledo et al., 2020).

3 OCCUPATIONAL ACCIDENTS IN BRAZIL

As an initial step in building predictive models, it is

necessary to understand the data used as features and

target variables. To this end, an exploratory analy-

sis of the occupational accident data in Brazil is per-

formed in this section.

Brazilian law obliges all companies in which occupational accidents and work-related diseases occur to report these events to the government through a digital document named Occupational Accident Communication (CAT, Comunicação de Acidente de Trabalho). This document contains data about the employee (such as age, gender, and professional activity), the accident/disease (type of accident/disease, causative factor, etc.), and the employer (such as its economic activity). These data are received by the Brazilian government, which creates a dataset that is used in this work after an anonymization process. It is worth mentioning that we do not consider work-related diseases, keeping only the occupational accidents in the analyzed data. From 2016 to 2022, a total of 2,387,938 occupational accidents were reported in Brazil, which are analyzed in what follows.

In Fig. 1, we show the line plot of the number of occupational accidents (shown in blue) and deaths resulting from accidents (shown in red) in Brazil for the period under consideration. We can observe that the number of accidents decreased in 2020 due to the COVID-19 pandemic outbreak, while, on the other hand, there was an increase in the number of work-related diseases due to the same cause. Excluding the year 2020, the number of occupational accidents per year stayed at the level of 450 thousand, while the number of deaths resulting from work-related causes was close to 2 thousand.

Figure 1: Line plots of the number of occupational acci-

dents and work-related deaths.

In Fig. 2, we show the distribution of occupational accidents in Brazil by sex and age of workers, in the form of an age pyramid. The most affected group is made up of young men aged between 21 and 25 years. In general, the number of occupational accidents is also higher among men (69.4% of the occupational accidents involve men). The types of activities carried out by male workers in Brazil and the inexperience of young people at work may explain this demographic distribution.

Figure 2: Age pyramid of occupational accidents in Brazil.

In Fig. 3, we can observe the distribution of occupational accidents by type of injury, classified using the categories of the International Statistical Classification of Diseases (ICD), for the ten most frequent types. We can notice that musculoskeletal injuries (hand and wrist injuries and fractures, foot and ankle injuries, etc.) are the most frequent consequences of accidents. Communicable diseases are also on the list, mostly related to healthcare professionals.

Figure 3: Bar diagram of the distribution of occupational

accidents in Brazil by type of injury for the ten most fre-

quent types.

In this work, we aim to obtain a machine learning

model to predict the number of occupational accidents

in the Brazilian states. Thus, the target variable is ob-

tained from the CAT dataset as we discuss below.

4 PROPOSED APPROACH AND DATA PREPARATION

This section describes the methodology adopted in

this work in addition to analyzing the steps in data

preparation and the dataset obtained.

4.1 The Methodological Path

The methodology used in the present work is summarized in Fig. 4. Firstly, we use data from multiple sources to obtain an integrated dataset containing all the features and the target variable that are used in this work. Then, we execute a preprocessing stage and split the dataset into training and test data, which are used to implement the ML models and analyze the results. In this section, we detail the data integration and preparation step presented in Fig. 4, while the data preprocessing and the ML model training and evaluation are described in Sec. 5.

It is important to mention that we used the Python programming language (Van Rossum et al., 2007) in all steps of this work, from data preparation to model training and evaluation.

4.2 Data Preparation

Let us start by describing the data preparation, the

ﬁrst step of the methodology used in this work and

depicted in Fig. 4.

In this work, the target variable is the number of occupational accidents in Brazilian states for each economic activity and year. Thus, we use the data acquired from the CAT communications described in Sec. 3 and obtain the number of occupational accidents in a given economic activity in a Brazilian city each year by grouping the records and counting the number of rows. When constructing the dataset, only accidents that occurred between 2016 and 2021 were kept, since not all the features were available for 2022.
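The grouping and counting step described above can be sketched as follows. This is a dependency-free illustration using only the standard library; the record layout (`city`, `cnae`, `year`) and the sample values are hypothetical and do not reproduce the actual CAT schema.

```python
from collections import Counter

# Hypothetical anonymized CAT records: one dict per reported accident.
cat_records = [
    {"city": "João Pessoa", "cnae": "41", "year": 2020},
    {"city": "João Pessoa", "cnae": "41", "year": 2020},
    {"city": "João Pessoa", "cnae": "86", "year": 2021},
    {"city": "Brasília", "cnae": "41", "year": 2022},  # outside the kept period
]

# Keep only accidents from 2016 to 2021, then count rows per
# (city, economic activity, year) group.
accident_counts = Counter(
    (record["city"], record["cnae"], record["year"])
    for record in cat_records
    if 2016 <= record["year"] <= 2021
)

for (city, cnae, year), count in sorted(accident_counts.items()):
    print(city, cnae, year, count)
```

With a dataframe library, the same result would typically come from a group-by followed by a row count; the `Counter` version is used here only to keep the sketch self-contained.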

The data used as features were obtained by inte-

grating multiple datasets, as shown in Fig. 4. These

datasets were obtained from public sources and the

Brazilian Labor Ministry databases. Thus, an impor-

tant contribution is made in this work: the integration

of data from multiple sources to obtain a single uni-

ﬁed table containing all variables needed to train the

models.

The construction of a public sociodemographic dataset in Brazil has already been done (Toledo et al., 2023). Integrating data from public sources related to population, economy, employment, education, and health, the authors obtained a socioeconomic statistics dataset for all 5,570 Brazilian cities (Toledo et al., 2023). From these public sources, some variables are chosen to compose the dataset used in this work, as we discuss below.

Figure 4: The methodological path used in this work. The diagram has three stages: (1) data preparation; (2) data pre-processing and train-test split, which yields the training data and the test data; (3) model training and evaluation, comprising model training/hyperparameter tuning and model evaluation.

The Gross Domestic Product (GDP) represents the

value of all the ﬁnished goods and services produced

in a region and, as a consequence, it is related to the

economic activity and the need for work. On the other

hand, the Human Development Index (HDI) is related

to health, education, and work conditions.

General data from Brazilian cities are also included as features: the population and the working staff. We can expect that the larger a region is (in population and working staff), the greater its number of occupational accidents.

From the Brazilian Labor Ministry databases, we include data related to the employers and employees. The economic activity of an enterprise, given by the Brazilian National Classification of Economic Activities (CNAE), is used as a categorical variable. The numbers of employers and employees in each Brazilian state are also included. As described in Sec. 3, the sex and age of workers are closely associated with the occurrence of accidents. Thus, the mean age of the employees, the average time they have worked for a given employer, and the proportion of female workers were included as features in our study.

Finally, we included features obtained from the Brazilian Labor Inspection. The number of irregularities related to informal workers and the number of irregularities related to working hours are added, since the WHO/ILO joint report mentioned in Sec. 1 points to exposure to long working hours as the major cause of work-related deaths and to informal jobs as being correlated with the occurrence of accidents (Organization et al., 2021). In Brazil, Labor Inspectors stop work activities if serious and imminent risks to workers' health are detected, in procedures called embargoes or interdictions, whose numbers per economic activity in a given state are also included in this work. Since a higher number of occupational accidents should be expected in economic activities in which a greater number of irregularities are detected, these variables are considered in ML model training.

It is worth mentioning that the described datasets

are joined using the Brazilian cities and the year as

keys.
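As an illustration of joining the datasets on the city and year keys, the sketch below performs a left join with plain dictionaries. All table names, column names, and values are hypothetical placeholders rather than the actual data sources of this work.

```python
# Hypothetical target rows: accident counts per city, year and economic activity.
accident_rows = [
    {"city": "João Pessoa", "year": 2020, "cnae": "41", "accidents": 12},
    {"city": "Brasília", "year": 2020, "cnae": "86", "accidents": 30},
]

# Hypothetical feature tables, each indexed by the (city, year) key.
socioeconomic = {
    ("João Pessoa", 2020): {"population": 800_000, "hdi": 0.75},
    ("Brasília", 2020): {"population": 3_000_000, "hdi": 0.80},
}
inspection = {
    ("João Pessoa", 2020): {"nr_embargoes": 3},
}

def left_join(rows, lookup, keys=("city", "year")):
    """Attach the columns found in `lookup` to every row with a matching key.

    Rows without a match are kept unchanged (left-join semantics).
    """
    joined = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        joined.append({**row, **lookup.get(key, {})})
    return joined

unified = left_join(left_join(accident_rows, socioeconomic), inspection)
for row in unified:
    print(row)
```

Rows that have no match in a feature table simply lack the corresponding columns, which is where missing values would originate before the preprocessing stage.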

4.2.1 The Resulting Dataset

After the data integration step, a unified dataset is obtained containing, for each Brazilian city, the number of occupational accidents in each economic activity and all the corresponding features. Brazil is divided into 27 federative units (26 states and the Federal District), referred to here as states, which are in turn divided into cities. As we intend to predict the accidents in each Brazilian state, we aggregate the data by state, summing all the numerical variables except the ones that are averages (whose names begin with "avg"). At this step, we also calculate the population density (ratio between population and surface area) and the employers' density (number of employers divided by the surface area). The features of the dataset used to train the machine learning models proposed in this work are displayed in Table 1, which describes each variable and gives its type, unit, and minimum and maximum values.
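A minimal sketch of the city-to-state aggregation is given below. The text does not specify how the average-type variables are combined, so the employee-weighted mean used here is an assumption, as are the column names and the toy figures.

```python
from collections import defaultdict

# Hypothetical city-level rows (values are made up for illustration).
city_rows = [
    {"uf": "PB", "population": 800_000, "area_km2": 200.0,
     "nr_employees": 300_000, "avg_age": 36.0},
    {"uf": "PB", "population": 200_000, "area_km2": 600.0,
     "nr_employees": 100_000, "avg_age": 40.0},
    {"uf": "DF", "population": 3_000_000, "area_km2": 6_000.0,
     "nr_employees": 1_200_000, "avg_age": 38.0},
]

def aggregate_by_state(rows):
    """Sum count-like columns per state; combine averages with employee weights."""
    totals = defaultdict(lambda: {"population": 0, "area_km2": 0.0,
                                  "nr_employees": 0, "age_weighted": 0.0})
    for row in rows:
        state = totals[row["uf"]]
        state["population"] += row["population"]
        state["area_km2"] += row["area_km2"]
        state["nr_employees"] += row["nr_employees"]
        # Assumption: averages are weighted by the number of employees.
        state["age_weighted"] += row["avg_age"] * row["nr_employees"]

    result = {}
    for uf, state in totals.items():
        employees = state["nr_employees"]
        result[uf] = {
            "population": state["population"],
            "nr_employees": employees,
            "avg_age": state["age_weighted"] / employees if employees else 0.0,
            # Density is computed after summing, as population / surface area.
            "population_density": state["population"] / state["area_km2"],
        }
    return result

states = aggregate_by_state(city_rows)
print(states["PB"])
```

Computing the density after the sums, rather than averaging city-level densities, keeps it equal to the state population divided by the state area.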

Table 1: Data dictionary.

Variable | Description | Type | Unit | Min value | Max value
UF | Brazilian state | string | - | - | -
Cnae | Brazilian economic activity classification | string | - | - | -
Population | Population | int | - | 1.85 × 10 | 3.08 × 10
WorkingStaff | Working staff | int | - | 0 | 2.33 × 10
PopulationDensity | Number of people by km² | float | - | 0.6 | 5363.08
HDI | Human Development Index (HDI) | float | - | 0.469 | 0.847
GPD | Gross Domestic Product (GDP) | float | 10 R$ | 2.36 | 2.18 × 10
NrEmployers | Number of employers | int | - | 0 | 633,656
EmployersDensity | Number of enterprises by km² | float | - | 0 | 5.36
NrEmploees | Number of employees | int | - | 0 | 1.07 × 10
PropFemale | Proportion of female workers | float | - | 0 | 1
AvgAge | Average age of employees | float | years | 0 | 56.70
AvgTime | Average time working for the employer | float | years | 0 | 25.55
NrIrregularities | Nr. of irregularities related to informal workers | int | - | 0 | 662
NrIrregHours | Nr. of irregularities related to working hours | int | - | 0 | 270
NrEmbargoes | Nr. of embargoes/closures | int | - | 0 | 1547

It is essential to state that we take into account the correlation between variables when choosing the features for model training: if a pair of features has a correlation near one, we remove one of them, keeping only the ones listed in Table 1. For example, we initially intended to use the total number of female workers and the total salaried workers as features. But as the number of female workers and the number of employees have a Pearson correlation near 1, only the second variable is kept. Similarly, the total salaried population and the working staff have a correlation coefficient near one and, thus, the first feature was removed.
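The correlation-based feature removal can be illustrated with the short sketch below. The threshold of 0.95 is an assumption (the text only says "near one"), and the column names and values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) < threshold for other in kept):
            kept.append(name)
    return kept

# Toy columns: the number of female workers closely tracks the number of employees.
columns = {
    "nr_employees": [10, 20, 30, 40, 50],
    "nr_female_workers": [4, 9, 13, 18, 22],
    "avg_age": [30, 45, 33, 41, 36],
}

print(drop_correlated(columns))
```

Because the first feature of a correlated pair is the one retained, the order in which columns are listed decides which of the two survives.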

5 EXPERIMENTAL PROTOCOL

In this section, we describe the data preprocessing and the training of the machine learning models, the final steps of Fig. 4.

5.1 Data Preprocessing

As a data preprocessing step, null numerical data were replaced by zero. The categorical variables were transformed into numerical variables using the target encoding strategy (Micci-Barreca, 2001), since they have a large number of categories, a situation for which the strategy has proven effective (Pargent et al., 2022). The numerical variables were rescaled by subtracting their means and dividing by the standard deviations of their distributions, a strategy called standard scaling (standardization). After the preprocessing step, the resulting dataset has 11,255 (eleven thousand, two hundred and fifty-five) rows (also called instances in ML problems).
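The three preprocessing operations can be sketched as follows. This is a simplified, dependency-free illustration: the target encoding shown is the plain per-category mean without the smoothing discussed by Micci-Barreca (2001), the use of the population standard deviation is an assumption, and all names and values are hypothetical.

```python
import statistics
from collections import defaultdict

def fill_nulls(values):
    """Replace missing numerical values (None) by zero."""
    return [0.0 if v is None else float(v) for v in values]

def target_encode(categories, targets):
    """Replace each category by the mean target observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories]

def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Toy data (made up for illustration only).
uf = ["PB", "PB", "DF", "DF"]
accidents = [10.0, 30.0, 100.0, 300.0]
nr_embargoes = fill_nulls([1, None, 3, 4])

uf_encoded = target_encode(uf, accidents)
embargoes_scaled = standard_scale(nr_embargoes)
print(uf_encoded)
print(embargoes_scaled)
```

In practice, the category means and the scaling statistics would usually be estimated on the training instances only and then applied to the test instances, so that no information about the test targets leaks into the features.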

5.1.1 Train-Test Split

Evaluating the performance of an ML model on data not used during training is an essential step for obtaining an unbiased estimate. So, it is common practice to split the initial dataset into training and test sets. In this work, the dataset resulting from the preprocessing step was randomly divided into a training dataset, which contains 80% of the data instances, or 9,004 (nine thousand and four) rows, and a test dataset, including the remaining 20% of the rows, or 2,251 (two thousand, two hundred and fifty-one).
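The 80/20 random split can be reproduced in a few lines. The sketch below works on row indices only; the seed value and the function name are arbitrary assumptions, and the row count is the one reported above.

```python
import random

def split_indices(n_rows, test_share=0.2, seed=42):
    """Shuffle the row indices and reserve a share of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_share)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(11_255)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the partition reproducible, which matters when the metrics of several models are compared on the same test instances.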

5.2 Model Training

As already mentioned, we intend to predict the number of occupational accidents in Brazil in each state and for each economic activity. This is a regression problem and, as a consequence, regression models must be chosen.

Although linear regression is a very simple supervised learning model, it is useful and widely used in science (James et al., 2013). In this work, linear regression with the default hyperparameters of the Python library scikit-learn is used as a baseline model.

In this study, we analyze the use of the SVM and LightGBM models, since they have shown high performance in regression problems (Bentéjac et al., 2021) and have also been used in other countries in problems similar to the one proposed in this work (Di Noia et al., 2020; Toledo et al., 2020).

The ML models have a set of hyperparameters, which are tuned during the training step and can improve the models' performance and help prevent overfitting. The hyperparameter search spaces used for the SVM and LightGBM models can be seen in Table 2.

Table 2: Hyperparameter search spaces.

Model | Hyperparameter search space
SVR | C: [0.1, 1, 10, 100]; gamma: ['scale', 'auto']; kernel: ['linear', 'poly', 'rbf']
LightGBM | max_depth: [5, 9]; num_leaves: [6, 17]; boosting_type: [gbdt, dart]; subsample: [0.7, 0.8, 0.9, 1.0]; colsample_bytree: [0.8, 0.9, 1.0]; learning_rate: [0.05, 0.5]

Table 3: Metrics for the implemented regression models.

Model | R² | MAPE | RMSE
Linear regression | 0.492 | 21.27% | 743.20
SVR | 0.725 | 3.31% | 546.54
LightGBM | 0.908 | 1.86% | 316.60

In the model training step, the hyperparameter search uses Bayesian optimization and cross-validation with four folds for all the models. In each iteration, the training dataset is divided into four folds: one of them is used for evaluating the model, while the other three are used for training the algorithm. After all the iterations, the best hyperparameters are chosen and the whole training dataset is used to train the algorithm, which is then evaluated with the test dataset.
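The four-fold cross-validation loop can be sketched as below. To stay self-contained, this illustration deliberately swaps in two simplifications that are not the setup of this work: an exhaustive search over three candidate values replaces Bayesian optimization, and a toy model (a scaled mean of the training targets, with a hypothetical `shrinkage` hyperparameter) replaces SVR and LightGBM.

```python
import random
import statistics

def k_fold_indices(n_rows, k=4, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cv_mse(y, shrinkage, k=4):
    """Mean validation MSE of the toy model over the k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(y), k):
        # Toy model: predict a scaled mean of the training targets.
        prediction = shrinkage * statistics.fmean([y[i] for i in train])
        fold_errors.append(
            statistics.fmean([(y[i] - prediction) ** 2 for i in validation])
        )
    return statistics.fmean(fold_errors)

# Toy targets and candidate hyperparameter values (illustrative only).
y = [float(v) for v in range(1, 21)]
candidates = [0.5, 1.0, 1.5]

# Pick the candidate with the lowest cross-validated error.
best_shrinkage = min(candidates, key=lambda s: cv_mse(y, s))
print(best_shrinkage)
```

After the search, the chosen configuration would be refit on all training instances and evaluated once on the held-out test instances, mirroring the procedure described above.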

6 RESULTS AND DISCUSSION

This section discusses the predictions obtained by the

ML models trained as shown in Sec. 5.

In Table 3, we list the metrics obtained for the models on the test dataset. We can notice that LightGBM has the highest R² and the lowest values of MAPE and RMSE. Observe that R² reaches 0.725 for SVM and 0.908 for LightGBM, which tells us that the chosen features and models explain most of the variance of the target variable and perform far better than random guesses.

Upon analyzing the MAPE metric, we can observe that the mean absolute percentage difference between the actual and predicted values of the target variable is only 1.86% when using the LightGBM model. In addition, the values of RMSE presented in Table 3 are below the standard deviation of the target variable, which is 1051.72.
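The comparison between RMSE and the standard deviation of the target can be made quantitative, since R² is approximately 1 - (RMSE / σ)² when σ is the standard deviation of the target on the evaluated instances. The short check below applies this relation to the reported figures; it is only approximate, under the assumption that the reported standard deviation (1051.72) is close to that of the test instances.

```python
# Reported standard deviation of the target variable.
target_std = 1051.72

# Reported test metrics per model: (R2, RMSE).
reported = {
    "Linear regression": (0.492, 743.20),
    "SVR": (0.725, 546.54),
    "LightGBM": (0.908, 316.60),
}

# R2 is approximately 1 - (RMSE / std)^2 when std describes the evaluated data.
approx_r2 = {
    model: 1.0 - (rmse / target_std) ** 2
    for model, (_, rmse) in reported.items()
}

for model, (r2_reported, _) in reported.items():
    print(f"{model}: reported R2 = {r2_reported:.3f}, "
          f"approximated R2 = {approx_r2[model]:.3f}")
```

The approximated values stay close to the reported ones, which indicates that the RMSE and R² columns of Table 3 are mutually consistent.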

The LightGBM algorithm calculates a score for

each feature, representing the feature’s importance,

with a higher score representing a larger effect on the

prediction. We depict in Fig. 5 the relative feature

importance for the trained model.

We can observe that the Brazilian state has the highest importance. The territory is related to the economic activities developed in it and to its population, which can explain the score. The total working staff is the second most important feature, since we can expect a growth in the number of accidents in territories with a higher number of workers. The average time that employees have worked for their employers is also an important feature, which may indicate that experience in the workplace reduces the probability of accidents.

Figure 5: Feature importance for LightGBM algorithm.

Table 4: Best hyperparameters.

Model | Best hyperparameters
SVR | C = 10; gamma = 'auto'; kernel = 'poly'
LightGBM | max_depth = 9; num_leaves = 7; boosting_type = gbdt; subsample = 0.7; colsample_bytree = 1.0; learning_rate = 0.48

Finally, for reproducibility reasons, we list in Ta-

ble 4 the models’ hyperparameters that gave the best

metrics in the training step.

7 CONCLUDING REMARKS

In this work, we obtain an integrated dataset containing the number of occupational accidents in each Brazilian state and socioeconomic variables used as features. We then examine the use of ML models to predict the number of occupational accidents in the country.

The results obtained so far show that it is possible to build predictive models for the number of accidents occurring in a given state of the federation.

The high R² values obtained for the SVM and LightGBM algorithms allow the conclusion that the trained models can explain the target variable based on the selected features. Besides that, MAPE values in the order of 1.8% to 3.3% mean that there is a low percentage difference between the predicted and actual values of the number of accidents.

In this work, we predict the number of work accidents for each economic activity in Brazilian states. A challenge that still needs to be faced is the prediction of work accidents for each of the 5,570 (five thousand, five hundred and seventy) cities in the country, which we intend to address in future contributions. In this problem, the data have a finer granularity, which considerably increases the number of training instances. Furthermore, not all economic activities are developed in all cities in the country, which will need to be handled in the data preprocessing stages.

Another possibility for future work is the use of time series analysis techniques to forecast the number of occupational accidents. To this end, it is necessary to perform appropriate transformations in the occupational accident dataset, evaluate the granularity of the information, and choose the correct experimental protocol.

Given the importance of the government’s preven-

tive action strategies to safeguard workers’ health, the

continuity of research seems to be essential.

REFERENCES

Alli, B. O. (2008). Fundamental Principles of Occupational Health and Safety.

Alpaydin, E. (2021). Machine learning. MIT Press.

Bentéjac, C., Csörgő, A., and Martínez-Muñoz, G. (2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54:1937–1967.

Di Noia, A., Martino, A., Montanari, P., and Rizzi, A.

(2020). Supervised machine learning techniques and

genetic optimization for occupational diseases risk

prediction. Soft Computing, 24(6):4393–4406.

James, G., Witten, D., Hastie, T., Tibshirani, R., et al.

(2013). An introduction to statistical learning, vol-

ume 112. Springer.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.

Khairuddin, M. Z. F., Lu Hui, P., Hasikin, K., Abd Razak,

N. A., Lai, K. W., Mohd Saudi, A. S., and Ibrahim,

S. S. (2022). Occupational injury risk mitigation:

machine learning approach and feature optimization

for smart workplace surveillance. International jour-

nal of environmental research and public health,

19(21):13962.

Koc, K., Ekmekcioğlu, Ö., and Gurgun, A. P. (2021). Integrating feature engineering, genetic algorithm and tree-based machine learning methods to predict the post-accident disability status of construction workers. Automation in Construction, 131:103896.

Koklonis, K., Saraﬁdis, M., Vastardi, M., and Koutsouris,

D. (2021). Utilization of machine learning in support-

ing occupational safety and health decisions in hos-

pital workplace. Engineering, Technology & Applied

Science Research, 11(3):7262–7272.

Mammone, A., Turchi, M., and Cristianini, N. (2009). Sup-

port vector machines. Wiley Interdisciplinary Re-

views: Computational Statistics, 1(3):283–289.

Micci-Barreca, D. (2001). A preprocessing scheme for

high-cardinality categorical attributes in classiﬁcation

and prediction problems. ACM SIGKDD Explorations

Newsletter, 3(1):27–32.

MPT (2023). Observatório de segurança e saúde no trabalho. Accessed: 2023-10-02.

Organization, W. H. et al. (2021). WHO/ILO joint estimates of the work-related burden of disease and injury, 2000–2016: global monitoring report.

Pargent, F., Pﬁsterer, F., Thomas, J., and Bischl, B.

(2022). Regularized target encoding outperforms tra-

ditional methods in supervised machine learning with

high cardinality features. Computational Statistics,

37(5):2671–2692.

Recal, F. and Demirel, T. (2021). Comparison of machine

learning methods in predicting binary and multi-class

occupational accident severity. Journal of Intelligent

& Fuzzy Systems, 40(6):10981–10998.

Sarkar, S., Vinay, S., Raj, R., Maiti, J., and Mitra, P. (2019).

Application of optimized machine learning techniques

for prediction of occupational accidents. Computers &

Operations Research, 106:210–224.

Schapire, R. E. et al. (1999). A brief introduction to boost-

ing. In Ijcai, volume 99, pages 1401–1406. Citeseer.

Scott, E., Hirabayashi, L., Levenstein, A., Krupa, N., and

Jenkins, P. (2021). The development of a machine

learning algorithm to identify occupational injuries in

agriculture using pre-hospital care reports. Health in-

formation science and systems, 9:1–9.

Toledo, J., Moura, T. J., and Timoteo, R. (2023). Brstats: a socioeconomic statistics dataset of the Brazilian cities. In Anais do V Dataset Showcase Workshop, pages 67–78. SBC.

Toledo, J., Timoteo, R. D. A., and Silva Barbosa, E. (2020). Inteligência artificial para predição de acidentes de trabalho no Brasil e sua aplicação pela inspeção do trabalho. Revista da Escola Nacional da Inspeção do Trabalho.

Van Rossum, G. et al. (2007). Python programming lan-

guage. In USENIX annual technical conference, vol-

ume 41, pages 1–36. Santa Clara, CA.
