
the importance of comprehensive analyses. In Brazil,
Gonçalves et al. (2018) found that low education
levels and physical inactivity are associated with
depression in women, while living with a partner and
engaging in physical exercise act as protective factors.
Additionally, Emerson and Llewellyn (2023) point out
that about 20
Furthermore, Maia et al. (2023) demonstrate that
Machine Learning techniques can identify risk fac-
tors for depression among Brazilian youth, provid-
ing support for public policies. However, they high-
light ethical challenges in using such technologies.
Therefore, policies that consider the social determi-
nants of depression and train healthcare professionals
for proper interventions are essential (Santos and Kas-
souf, 2007). Early intervention programs and contin-
uous support are crucial, especially for young people
(Brito, 2011).
The related studies reviewed above each offer valuable
insights into youth depression. However, they also have
limitations, such as small sample sizes, narrow focuses,
or the absence of longitudinal data, which our work
aims to address more comprehensively. Table 1 presents
comparisons and limitations of each study.
This study’s proposal stands out by specifically
addressing depression among Brazilian youth and
identifying specific factors that may cause this con-
dition. It aims to contribute to the field by broadening
existing perspectives, overcoming the limitations of
related works, and exploring new aspects of youth de-
pression, such as violence, socioeconomic conditions,
and access to mental health services.
4 MATERIALS AND METHODS
4.1 Database Description
The research was based on the National Health Sur-
vey (PNS) Database, a nationwide household sur-
vey conducted by the Ministry of Health (MS) and
the Brazilian Institute of Geography and Statistics
(IBGE). For this study, the most recent version of the
PNS, from the year 2019, was utilized. This version
offers comprehensive information on various sociode-
mographic, behavioral, and health characteristics, in-
cluding data related to depression. The original 2019
PNS database comprises 1,087 attributes and 293,726
instances.
The primary objective of this study is to analyze
the occurrence of depression among Brazilian youth.
For this purpose, the central attribute used for filtering
the instances was Q092, which indicates whether a
physician or mental health professional has ever diag-
nosed the respondent with depression. All instances
where this attribute was absent were excluded from
the analysis. Additionally, the attribute C008, which
refers to the age of the household member on the ref-
erence date, was used as a filtering criterion to restrict
the analysis to youth aged 15 to 29 years, as defined
by the Statute of Youth (da Juventude, 2015).
From the original 1,087 attributes, the most rel-
evant ones were selected based on the risk factors
associated with depression, as identified in previous
studies such as Simões (2021). After filtering the
instances and selecting the attributes of interest, the
resulting dataset contained 63,260 instances (62,334
without a depression diagnosis and 926 with a depres-
sion diagnosis) and 32 attributes. Table 2 provides a
detailed breakdown of these attributes and their re-
spective descriptions.
4.2 Methodology
During Step 0, we performed several preprocessing
steps to reduce noise in the database, including han-
dling duplicates, removing outliers, and managing
missing values. We also analyzed the correlation of
attributes with the class to eliminate redundancies.
These actions ensured that the database was clean and
organized before machine learning algorithms were
applied, so that the models were trained on more
representative data with fewer distortions.
In Step 1, we further analyzed the dataset and
made additional adjustments by removing attributes
with incomplete information or few responses. We
also merged some attributes and their responses to ob-
tain more consistent variables, improving the quality
of the data. By the end of this step, the dataset
contained 27,701 instances (26,775 without a depression
diagnosis and 926 with a depression diagnosis) and
24 attributes.
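The class counts reported for the two stages imply a strong imbalance, which the short calculation below makes explicit. The counts are taken from the text; the function name class_balance and the stage labels are illustrative assumptions.

```python
def class_balance(negatives, positives):
    """Return (total, share of positives, negatives per positive)."""
    total = negatives + positives
    return total, positives / total, negatives / positives


# Counts reported in the text: (without diagnosis, with diagnosis).
stages = {
    "after filtering": (62_334, 926),
    "after Step 1": (26_775, 926),
}

for stage, (negatives, positives) in stages.items():
    total, prevalence, ratio = class_balance(negatives, positives)
    print(f"{stage}: n={total}, positives={prevalence:.2%}, ratio={ratio:.1f}:1")
```

Under these counts, diagnosed cases make up roughly 1.5% of the instances after filtering and roughly 3.3% after Step 1, since the number of positive cases stays at 926 while the number of negative cases falls.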
After preprocessing, we split the dataset into 80%
for training and 20% for testing, with stratification to
maintain the correct proportion between cases with
and without a depression diagnosis. We applied stratified
cross-validation with 10 iterations (StratifiedKFold),
using the average of the results to represent
model performance. This process was essential for
a more reliable evaluation of the models during hy-
perparameter tuning, improving the robustness of the
results.
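The partitioning scheme described above can be sketched without external dependencies. The text names StratifiedKFold, which is a scikit-learn class; the hand-rolled functions below (stratified_split, stratified_folds, evaluate) are hypothetical stand-ins that only illustrate the idea and are not that library's implementation. The sketch further assumes that cross-validation is run on the training portion, uses toy labels, and replaces model training with a majority-class baseline.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, indices, n_folds=10, seed=0):
    """Deal the indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(n_folds)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, idx in enumerate(class_indices):
            folds[position % n_folds].append(idx)
    return folds


def evaluate(labels, fit_idx, val_idx):
    """Stand-in for model training: accuracy of a majority-class baseline."""
    fit_labels = [labels[i] for i in fit_idx]
    majority = max(set(fit_labels), key=fit_labels.count)
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)


# Toy labels (assumed): 30 positive and 970 negative cases.
labels = [1] * 30 + [0] * 970
train_idx, test_idx = stratified_split(labels)      # 80% / 20%
folds = stratified_folds(labels, train_idx)         # 10 folds on the training part

scores = []
for k, val_idx in enumerate(folds):
    # Fit on the other nine folds, validate on fold k.
    fit_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    scores.append(evaluate(labels, fit_idx, val_idx))

mean_score = sum(scores) / len(scores)  # average over the 10 iterations
print(f"mean validation score: {mean_score:.3f}")
```

The high accuracy of the majority-class baseline on such imbalanced labels also shows why accuracy alone can be misleading in this setting; the metric used here is only a placeholder.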
Additionally, we used the Ant Colony-based in-
stance and attribute selection technique (RantIFS). In-
Using Machine Learning to Analyze the Impact of Lifestyle and Socioeconomic Factors on the Incidence of Depression Among Young
Brazilians