Feature Selection for Stock Market Prediction: A Comparison of Relief and Information Gain Methods

Humberto O. Bragança¹, Rafael A. Berri¹, Bruno L. Dalmazo¹, Eduardo N. Borges¹, Viviane L. D. de Mattos¹, Richard F. Pinto¹, Fabian C. Cardoso² and Giancarlo Lucca³

¹ Federal University of Rio Grande (FURG), Rio Grande, Brazil

² University of Rio Verde (UniRV), Rio Verde, Brazil

³ Catholic University of Pelotas (UCPel), Pelotas, Brazil

humberto.obj@gmail.com, {dalmazo, rafaelberri, eduardoborges, vivianemattos, richard pinto}@furg.br,

Keywords:

Machine Learning, Feature Selection, Stocks, Technical Analysis, Financial Market.

Abstract:

This study explores an approach to predictive analysis in the financial market, using a data set composed of financial information from different companies listed on the stock market, which provides a more detailed and contextualized view of the behavior of shares. Based on these indicators, feature selection methods, such as Relief and Information Gain, are applied to identify the most relevant variables for building predictive models. One of the main contributions of this work is the use of cross-validation to evaluate attribute selection, a technique that has not yet been explored in this context with this dataset. The results show that the combination of new financial indicators and cross-validation offers a solid basis for more accurate analysis, with important implications for investors, financial analysts and policymakers in the stock market. This work expands the boundaries of the literature on feature selection and opens possibilities for future research in emerging markets.

1 INTRODUCTION

The Brazilian financial market, B3, is a large and dynamic emerging market with unique characteristics that require adapted analytical and predictive approaches (Chen and Metghalchi, 2012). Its complexity, driven by diverse economic sectors and volatility, presents challenges and opportunities for financial analysis (Bouri et al., 2020).

In recent years, predictive analysis using machine

learning has proven effective for ﬁnancial decision-

making. Supervised learning models help assess

risks and make informed decisions, emphasizing their

importance in managing corporate ﬁnancial perfor-

mance (Cuervo, 2023).

https://orcid.org/0009-0006-6610-500X

https://orcid.org/0000-0002-3812-4186

https://orcid.org/0000-0002-6996-7602

https://orcid.org/0000-0003-1595-7676

https://orcid.org/0000-0002-3512-6290

https://orcid.org/0009-0007-0176-3383

https://orcid.org/0000-0002-2842-0387

https://orcid.org/0000-0002-3776-0260

https://www.b3.com.br/

Feature selection, a key step in improving model efficiency and generalization, involves identifying the most relevant variables to enhance prediction accuracy and streamline the learning process (Chandrashekar and Sahin, 2014). This process is particularly important in financial markets, where the volume of data can overwhelm traditional methods; there, feature selection improves model performance and reduces overfitting (Htun et al., 2023).

Despite the importance of feature selection in

global markets, there is limited research on its appli-

cation in the Brazilian context. The country’s eco-

nomic and ﬁnancial speciﬁcities, such as its regula-

tory environment and market structure, impact the be-

havior of ﬁnancial indicators. Emerging technologies

and new indicators derived from detailed data can im-

prove ﬁnancial analysis and provide deeper insights

into the Brazilian market (Kohn and Moraes, 2007).

The application of machine learning methods,

coupled with extensive datasets, can signiﬁcantly en-

hance the accuracy and reliability of economic fore-

casts in Brazil, underscoring the importance of tai-

lored approaches for ﬁnancial analysis in emerging

markets (Araujo and Gaglianone, 2023). The combination of cross-validation and feature selection in Brazilian stock market data is still underexplored, highlighting a research opportunity to enhance predictive models and forecasting precision.

Bragança, H. O., Berri, R. A., Dalmazo, B. L., Borges, E. N., D. de Mattos, V. L., Pinto, R. F., Cardoso, F. C. and Lucca, G.

Feature Selection for Stock Market Prediction: A Comparison of Relief and Information Gain Methods.

DOI: 10.5220/0013481300003929

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 27th International Conference on Enterprise Information Systems (ICEIS 2025) - Volume 1, pages 996-1003

ISBN: 978-989-758-749-8; ISSN: 2184-4992

To address this gap, this study proposes the use of a dataset with financial indicators specific to Brazil, sourced from decades of detailed financial data. This dataset is used to apply advanced feature selection techniques and evaluate the predictive performance of models using cross-validation, a technique little explored in the national context.

The main objective of this work is to evaluate how

different feature selection methods, applied to this

set of ﬁnancial data, can improve the performance of

predictive models in the Brazilian ﬁnancial market.

Speciﬁcally, Information Gain and Relief methods are

used to choose the key features.

This document is organized as follows: Section 2 provides background on Technical Analysis Indicators and Feature Selection Methods. Section 3 summarizes key prior studies. Our methodology is detailed in Section 4, while Section 5 presents the study’s findings. Finally, Section 6 presents the conclusions and future research opportunities.

2 BACKGROUND

This section aims to give insight into key concepts

for the article, starting with the technical analysis in-

dicators used in the models, followed by the feature

selection methods.

2.1 Technical Analysis Indicators

Technical analysis indicators are vital instruments

used to examine the price trends of various ﬁnancial

assets, such as stocks, currencies, and commodities.

Their main goal is to predict future market move-

ments through graphical analysis, employing mathe-

matical formulas based on historical price and trading

volume data of the assets (Shi et al., 2022).

2.1.1 Moving Average

Moving average is a statistical technique used to

smooth the volatility of a time series of data (Billah

et al., 2024), facilitating the identiﬁcation of patterns

and trends by reducing random variation. It computes

the average of a set of values within a sliding window

over time, offering a clearer insight into the underly-

ing movements within the time series.

2.1.2 Standard Deviation

The standard deviation is a key metric in stock tech-

nical analysis, measuring price volatility by calculat-

ing the variability of closing prices around their mov-

ing average (Altman and Bland, 2005). It is derived

from the variance, which averages the squared differ-

ences between prices and the mean, with its square

root yielding the standard deviation. This measure is

essential for constructing Bollinger Bands, identify-

ing overbought and oversold levels, and assessing as-

set risk—where higher values indicate greater volatil-

ity and risk, while lower values suggest stability.
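The two indicators above can be illustrated with a minimal sketch. It assumes a plain list of closing prices, a trailing window, and the population form of the standard deviation; the function names and the sample prices are hypothetical and are not taken from the study's implementation.

```python
from math import sqrt

def rolling_mean(values, window):
    """Simple moving average; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

def rolling_std(values, window):
    """Standard deviation of each window around that window's own mean."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            mean = sum(chunk) / window
            # Variance: average squared difference from the window mean.
            variance = sum((v - mean) ** 2 for v in chunk) / window
            out.append(sqrt(variance))
    return out

closes = [10.0, 11.0, 12.0, 11.0, 10.0]  # hypothetical closing prices
sma3 = rolling_mean(closes, 3)
sd3 = rolling_std(closes, 3)
```

A larger window smooths more but reacts later, which is one reason to compute the same indicator over several window lengths.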

2.1.3 MACD

The Moving Average Convergence Divergence

(MACD) is a key technical analysis tool used to iden-

tify changes in an asset’s trend strength, direction,

momentum, and duration. By leveraging historical

data, it helps forecast price movements in ﬁnancial

markets. The MACD is computed using two exponen-

tial moving averages (EMAs) (Halilbegovic, 2016),

which assign greater weight to recent data. Typically,

these EMAs are based on 26-period and 12-period

time frames. Additionally, a signal line, which is a

nine-period EMA of the MACD line, is included in

the dataset as a feature.

Several indicators stem from the MACD, including the MACD Slope and MACD Histogram. The MACD Slope measures the rate of change of the MACD over time, representing its angular coefficient. A rising MACD Slope suggests a strengthening uptrend, whereas a falling slope indicates a downtrend. It is computed by measuring the difference between MACD values at different time points. The MACD Histogram, another derivative indicator, represents the difference between the MACD and the signal line (MACD - Signal) (Kang, 2021), visually depicting momentum shifts and trend changes.
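As an illustration of how the MACD family fits together, the sketch below computes the MACD line, signal line, histogram, and slope from closing prices. The EMA seeding (first value), the one-step slope, and all function names are assumptions made for this example.

```python
def ema(values, period):
    """Exponential moving average with smoothing factor 2 / (period + 1).

    Seeding with the first value is one common convention (an assumption here).
    """
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd_features(closes, fast=12, slow=26, signal_period=9):
    """Return the MACD line, signal line, histogram, and one-step slope."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal = ema(macd, signal_period)               # 9-period EMA of the MACD line
    histogram = [m - s for m, s in zip(macd, signal)]
    # Slope as the difference between consecutive MACD values.
    slope = [0.0] + [macd[i] - macd[i - 1] for i in range(1, len(macd))]
    return macd, signal, histogram, slope

closes = [100.0 + 0.5 * i for i in range(60)]  # hypothetical steadily rising prices
macd, signal, histogram, slope = macd_features(closes)
```

On a steadily rising series the 12-period EMA stays above the 26-period EMA, so the MACD line is positive.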

2.1.4 Relative Strength Index (RSI)

The RSI, created by J. Welles Wilder in 1978, gauges whether a stock is overbought or oversold by analyzing recent closing prices.

It is an oscillator ranging from 0 to 100, commonly applied to identify swing points, that is, overbought or oversold conditions of an asset, which helps to predict potential trend reversals. By identifying these conditions, the indicator can effectively predict market movements, further supporting its practical application in financial markets (Bansal, 2016).
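A minimal sketch of the oscillator follows. It uses simple averages of gains and losses over the last `period` price changes; Wilder's original formulation applies a smoothed average instead, so this variant is an assumption chosen for brevity, as is the value returned for a completely flat window.

```python
def rsi(closes, period=14):
    """RSI from simple averages of gains and losses over the last `period` changes."""
    out = [None] * len(closes)
    for i in range(period, len(closes)):
        changes = [closes[j] - closes[j - 1] for j in range(i - period + 1, i + 1)]
        avg_gain = sum(c for c in changes if c > 0) / period
        avg_loss = sum(-c for c in changes if c < 0) / period
        if avg_gain == 0 and avg_loss == 0:
            out[i] = 50.0   # flat window: treated as neutral (assumption)
        elif avg_loss == 0:
            out[i] = 100.0  # only gains in the window
        else:
            rs = avg_gain / avg_loss
            out[i] = 100.0 - 100.0 / (1.0 + rs)
    return out

rising = [float(i) for i in range(1, 21)]                       # hypothetical prices
alternating = [10.0 + (i % 2) for i in range(15)]               # +1, -1, +1, ...
```

Values close to 100 correspond to windows dominated by gains and values close to 0 to windows dominated by losses.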

The indicators VSDME12 and VSDME26 are also used. The VSDME (which stands for Volatility and Speed Divergence Moving Average) is an adaptive variation of the moving average that incorporates volatility and speed in its calculation. VSDME12 uses α equal to 12 and τ equal to 26, while VSDME26 uses α equal to 26 and τ equal to 52.

$$\mathrm{VSDME} = \mathrm{VSDME}_{\alpha} - \mathrm{VSDME}_{\tau} \qquad (1)$$

2.2 Feature Selection Methods

Feature selection is vital in model development. While more features can improve performance, too many, especially with limited training data, can hinder learning and cause overfitting. The goal is to retain only essential attributes, remove redundancies, and improve model efficiency (Janecek et al., 2008).


2.2.1 Information Gain

Information Gain is a metric used to measure the re-

duction in uncertainty or entropy in a set of data when

a characteristic (or attribute) is chosen to divide the

data. It is often used in machine learning algorithms,

such as decision trees, to determine which feature

should be used at each node. The central idea is that

dividing the data based on a characteristic should re-

sult in purer subsets, that is, with less unpredictability.

Information Gain is calculated as the difference between the original entropy (before splitting) and the weighted sum of the entropies of the subsets generated after splitting, each subset weighted by its share of the instances. The greater the Information Gain of a feature, the more relevant it is to predict the target and, therefore, the more useful it is in building the predictive model. The effectiveness of Information Gain in selecting relevant features in high-dimensional contexts, such as microarray data, demonstrates its applicability in different domains (Yu and Liu, 2016).
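The calculation can be sketched for a categorical feature as follows. The toy labels and feature values are hypothetical, and a continuous indicator would first have to be discretized (a step not shown here and not specified in the text).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

labels = ["Rise", "Rise", "Descent", "Descent"]   # hypothetical targets
informative = ["high", "high", "low", "low"]      # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]              # tells nothing about the class
```

Here `informative` removes all uncertainty about the label, while `uninformative` leaves each subset as mixed as the original set, so a ranking by Information Gain would place the former first.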

2.2.2 Relief Method

Relief is an individual evaluation filter method (Urbanowicz et al., 2018) that evaluates the relevance of attributes based on the proximity of instances of the same and of different classes. For each instance in the dataset, the algorithm identifies the closest instance of the same class (near neighbor, commonly called the nearest hit) and the closest instance of a different class (far neighbor, commonly called the nearest miss).

It then adjusts the attribute weights based on how

those attributes help differentiate instances of differ-

ent classes. Attributes that help distinguish between

classes receive greater weight, while those that do not

make a difference have reduced weight. This method

is useful in problems with complex, high-dimensional

data, as it selects the most informative features for the

learning model.
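The weight update can be sketched as below. This is the basic Relief scheme with a single nearest hit and a single nearest miss per instance and range-scaled Manhattan distance; multiclass extensions such as ReliefF treat misses per class, and the text does not state which variant the study used, so the details here are assumptions.

```python
def relief_weights(X, y):
    """Basic Relief: reward features that differ on the nearest miss
    and penalize features that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    # Scale differences by each feature's range so they lie in [0, 1].
    ranges = []
    for f in range(d):
        col = [row[f] for row in X]
        ranges.append((max(col) - min(col)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / ranges[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
```

Sorting the features by decreasing weight gives the ranking from which the least relevant attributes can be removed first.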

3 RELATED WORK

With the huge amount of data generated by the ﬁnan-

cial market, more predictions are being made by Ma-

chine Learning algorithms (Jain and Vanzara, 2023).

A notable example is the application of deep learn-

ing techniques, particularly Long Short-Term Mem-

ory (LSTM) networks, to the S&P 500 dataset for

predicting stock price movements based on histori-

cal data (Kamalov et al., 2020). The study empha-

sizes the importance of daily closing values and trad-

ing volumes, analyzing data from 1990 to 2020.

The proposed model outperformed several benchmark models in predicting the directional movements of the index. In another study, Support Vector Regression (SVR) was applied to predict stock prices, with a focus on preprocessing the NASDAQ (National Association of Securities Dealers Automated Quotations) dataset (Dash et al., 2023).

Technical analysis indicators like MACD, ADX,

Williams, and MFI were converted into correlation

tensors for enhanced processing in deep learning

models, including LSTM and DNN networks. This

method improved stock price predictions and buy/sell

signal detection (Kamalov et al., 2019).

A recent publication in the academic literature in-

troduces the BovDB as a benchmark dataset for re-

search in stock market prediction (Cardoso et al.,

2022). This dataset, which is publicly accessible and

pre-processed, encompasses daily stock data for all

companies listed on B3 from 1995 to 2020. Notably,

the authors have introduced a novel metric referred

to as the “factor” aimed at mitigating the inﬂuence of

signiﬁcant events within the dataset. Utilizing both

the factor and the BovDB allows for a comprehensive

analysis of the historical time series of Brazilian stock

prices, tracing back to the inception of Brazil’s Real

monetary plan.

This article presents an innovative approach by in-

tegrating new ﬁnancial indicators, developed from an

unprecedented dataset composed of detailed ﬁnancial

information from Brazilian companies over several

decades. Unlike conventional indicators, these new

indicators capture nuances of local ﬁnancial behavior,


providing a more in-depth and relevant view for pre-

dictive analysis in the national context. The creation

of this dataset not only ﬁlls a critical data gap, but

also establishes a solid foundation for future research,

allowing for more robust and contextualized analyses.

Furthermore, the use of cross-validation as part

of the methodology for feature selection is an inno-

vative approach in the context of the Brazilian stock

exchange. Although cross-validation is a technique

widely used in machine learning and feature selection

studies, its speciﬁc application in the selection of ﬁ-

nancial attributes for analyzing the Brazilian market

is still rare.

By employing this technique, we ensure that the

results obtained are not only speciﬁc to the dataset

used, but also generalizable, increasing the reliability

and practical applicability of the conclusions. This

rigorous approach raises the methodological stan-

dard of research in emerging markets, encouraging

the adoption of more robust and replicable practices.

Theoretically, the work enriches the feature selection

literature by introducing a new perspective based on

ﬁnancial indicators speciﬁc to the Brazilian market,

while, in practice, it offers valuable insights for in-

vestors, ﬁnancial analysts and policymakers.

4 METHODOLOGY

This section presents the methodology for evaluating

predictive models in the Brazilian ﬁnancial market.

The study constructs a dataset with daily trading data

from B3 (Brazilian Stock Exchange) covering 1995

to 2020, with a focus on 2010-2020. This dataset,

structured with price data, dates, and stock identiﬁers,

enables market trend analysis and serves as the foun-

dation for generating technical analysis indicators.

Feature selection methods, including Relief and

Information Gain, are applied to identify the most rel-

evant attributes. Sequential techniques such as Se-

quential Forward and Backward Selection reﬁne the

feature set further (Aha and Bankert, 1995). Features

are eliminated iteratively, prioritizing model accuracy.

Cross-Validation ensures robust performance evalua-

tion by dividing data into k subsets, reducing bias and

improving generalization.

Stratified K-Fold Cross-Validation is used at key feature selection points, preserving class distributions. Performance metrics such as accuracy and F1-score are averaged across folds. Visualization tools highlight critical feature contributions, and models undergo final training on the entire dataset before deployment. This approach minimizes overfitting and enhances predictive reliability for the Brazilian financial market.

Figure 1: Diagram methodology. Stages shown: Data; Labels (Stability, Descent, Sharp Descent, Rise, Sharp Rise); Generate Technical Analysis indicators; Feature Selection Methods (Information Gain, Relief); Robust Classifier Methods.
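The iterative elimination described in this section can be sketched as a loop over a feature ranking. The sketch assumes that the least relevant feature is dropped at each step and that a scoring callback returns, for example, the mean cross-validated accuracy; `score_fn`, `toy_score`, and the feature names are hypothetical stand-ins, not the study's code.

```python
def removal_curve(ranked_features, score_fn, min_features=1):
    """Drop the lowest-ranked feature one at a time and score each subset.

    ranked_features: feature names ordered from most to least relevant.
    score_fn: callable that takes a list of feature names and returns a score
              (hypothetical; supplied by the caller).
    """
    curve = []
    kept = list(ranked_features)
    while len(kept) >= min_features:
        curve.append((len(kept), score_fn(kept)))
        kept = kept[:-1]  # remove the currently least relevant feature
    return curve

def best_point(curve):
    """Return the (feature_count, score) pair with the highest score."""
    return max(curve, key=lambda point: point[1])

def toy_score(features):
    """Stand-in for a cross-validated classifier: the first three features
    help, and every additional one adds a small penalty."""
    useful = min(len(features), 3)
    noise = max(len(features) - 3, 0)
    return 0.40 + 0.03 * useful - 0.01 * noise

ranking = ["f1", "f2", "f3", "f4", "f5", "f6"]
curve = removal_curve(ranking, toy_score)
```

Plotting `curve` gives a score-versus-feature-count line, and `best_point(curve)` locates its peak.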

5 RESULTS

In this section, we present the findings and insights obtained from the research, organizing the discussion into two subsections. Subsection 5.1 focuses on BovDB (Cardoso et al., 2022; Souza et al., 2024), offering a detailed examination of its tables and how we handled them. The second subsection outlines the results from the cross-validation process and evaluates the feature selection methods employed.

5.1 Input Data

The data on Brazilian stocks are publicly available as raw text files on B3’s website. This study uses data collected from BovDB (Cardoso et al., 2022), a preprocessed dataset of the shares traded on B3 that allows a better understanding of the market and its behavior. It contains daily trading data for all shares on B3 from 1995 to 2020, but we focused on the 7 most representative shares on the Brazilian stock market and considered only the period from 2010 to 2020. Even in this shorter period, the companies generated an ample amount of data, ensuring the relevance of the analysis. BovDB comprises five distinct tables, providing a deep view of the market landscape.

https://sol.sbc.org.br/index.php/dsw/article/view/17411


The first table is the Company table, which relates the name and identifier of every company that was listed on B3 between 1995 and 2020, a total of 1728 companies. The “id company” column is an auto-incremented integer that serves as the unique identifier of the company and is the primary key. It is also used as a foreign key in the Ticker table to reference the company. The other column, “Company”, holds the company’s name.

The Ticker table stores the data of the stocks. It relates each stock code to its company; the codes follow a pattern of letters and numbers that helps the investor identify the company and the type of share. The table contains 2540 stocks. The number of stocks exceeds the number of companies because a single company can have more than one type of share. The first column, “id ticker”, is an auto-incremented integer that serves as the Ticker identifier and primary key. It is also used as a foreign key in both the EventPrice and Price tables to reference the ticker. The “ticker” column is the company’s stock symbol, and the “codisi” column is the stock code in B3.

The “Price” table stores the trading data of each stock for each trading session, providing enough information to understand the movement of the stock across sessions. The “date” column is the trade date of a stock; together with the id ticker, it forms a composite primary key that identifies a specific ticker on a given date. In EventPrice, the date, along with the id ticker and id event, forms a composite primary key that gives the date on which a particular event occurred.

The remaining columns hold the values for a given date: “open” is the opening price, “high” the highest price, “low” the lowest price, “average” the average price, “close” the closing price, “buy offer” the best buy offer price, “sell offer” the best sell offer price, “business” the number of transactions executed with the stock, and “amount stock” the aggregate trading volume of the stock. The last column is “Factor”, the combined effect of events, accumulated from the most recent to the oldest until a particular date is reached, which reflects the chronological progression.

The “Event” table lists the different types of events and contains 12 entries. The “id event” column is an auto-incremented integer that serves as the unique identifier of the Event and is the primary key. It is also used as a foreign key in the EventPrice table to reference the Event. The “description” column describes the event, and “ds bovespa” is the abbreviated Event designation, as given in the documentation supplied by B3.

The last table is the “EventPrice” table. A stock can undergo different events over time, and this table records the trading days on which an event occurred. It also records whether the factor was applied to a stock, and its value. An example is a stock split, in which the number of shares increases to provide greater liquidity without affecting the total value of the company’s capital; the factor is applied in this case, which makes a better analysis of the stock over time possible. The “factor” column expresses how significant the event is for a specific stock and trade. The “applied” column indicates whether or not an event occurred on a specific day.

The “Price” table was used to build the Technical Analysis Indicators, which serve as features for the train and test datasets. For the Moving Average and Standard Deviation calculations, we analyzed windows of γ stock market trading sessions. We examined the opening, maximum, minimum, average, closing, offer/buy, offer/sell, volume, and business amount data. To determine the values for MACD, MACD Histogram, MACD Slope, MACD DF, MACD VSDME12, and MACD VSDME26, we used the opening, closing, maximum, average, and minimum prices. Additionally, for the MACD Signal, we used the values of the other MACD indicators, considering windows of 9 stock market trading sessions. Similarly, for RSI, we performed analogous calculations using β stock market trading sessions for each data point.

The window lengths, in stock market trading sessions, are β = 7, 14, and 21, and γ = 5, 10, 15, 20, 25, 30, 60, and 90 (representing the previous prices for the ongoing analysis).

This resulted in 194 Technical Analysis Indicator features, which had to be normalized because they were not in the same range. After analyzing the price table and considering the percentage gain or loss in the stock market trading sessions ahead, we identified 5 distinct labels. Descent indicates a 0.5% decrease in the stock value, while Sharp Descent signifies a 1% decrease. On the other hand, Rise denotes a 0.5% increase, and Sharp Rise indicates a 1% increase. Lastly, Stability represents a negligible fluctuation in the stock value, either up or down, by less than 0.5%.
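The labeling rule can be sketched as a threshold function over the percentage change between the current price and a future price. The handling of values that fall exactly on a threshold, the forecast horizon, and the function names are assumptions, since the text gives only the thresholds themselves.

```python
def pct_change(current_price, future_price):
    """Percentage gain or loss from the current to the future price."""
    return (future_price - current_price) / current_price * 100.0

def label_return(change):
    """Map a percentage change to one of the five classes.

    Inclusive boundaries (>= and <=) are an assumption.
    """
    if change <= -1.0:
        return "Sharp Descent"
    if change <= -0.5:
        return "Descent"
    if change >= 1.0:
        return "Sharp Rise"
    if change >= 0.5:
        return "Rise"
    return "Stability"

example = label_return(pct_change(100.0, 101.5))  # a hypothetical 1.5% gain
```

Because the Stability band is only one percentage point wide, small price moves near ±0.5% can flip the label, which makes neighboring classes hard to separate.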

With these data, we built the train and test datasets. The train dataset is composed of 15237 rows in total: 2756 rows of Stability, 3314 of Descent, 2612 of Sharp Descent, 3472 of Rise, and 3083 of Sharp Rise. The test dataset is composed of 3810 rows: 689 rows of Stability, 829 of Descent, 653 of Sharp Descent, 868 of Rise, and 771 of Sharp Rise.

5.2 Evaluation

The prediction was performed using Random For-

est, a machine learning technique (Breiman, 2001).

The algorithm combines multiple decision trees to

improve accuracy and reduce the risk of overﬁtting,

making it ideal for classiﬁcation and regression tasks.

In this study, we use Random Forest to evaluate the

performance of the model based on selected attribute

data. The model was conﬁgured to generate 100 deci-

sion trees, without depth restriction, allowing the trees

to grow to their maximum height to capture complex

interactions in the data. Data sampling was performed

with 100% of the dataset in each tree, ensuring a com-

plete view during the construction of each tree. We

did not calculate the importance of attributes and all

characteristics were used in the trees without restric-

tion. Additionally, out-of-bag validation has been dis-

abled, with a focus on other evaluation metrics to en-

sure model robustness.

The model performance evaluation was carried

out using accuracy and F1-Score metrics. To calcu-

late these metrics, a simple average of the results ob-

tained in the 9 folds of the cross-validation process

was applied. The ﬁnal accuracy was calculated as the

simple average of the accuracies of each of the k-fold

cross-validation iterations.

Speciﬁcally, in the ﬁrst set of experiments, the k-

fold cross-validation technique with 9 folds was used.

Then, in the second part of the experiment, we ap-

plied stratiﬁed k-fold cross-validation, ensuring that

the distribution of classes was maintained in each of

the 9 folds, which is particularly important in unbal-

anced data sets. During this process, performance

metrics, such as accuracy and F1-Score, were cal-

culated and the simple average of these metrics was

used to evaluate the overall performance of the model.

This ensures that the model is trained and tested on representative distributions of classes across all folds, avoiding the random variation in class proportions that can occur with plain k-fold splitting. Furthermore, the use of stratification helps to reduce variation in performance metrics, such as accuracy and F1-score, providing a more consistent evaluation of the model.
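A minimal sketch of stratified fold assignment and of the fold-averaged metrics is given below. The macro-averaged form of the F1-Score is an assumption, since the averaging scheme is not stated, and the labels and helper names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal the class out round-robin
    return folds

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mean(values):
    """Simple average, used to combine the per-fold metrics."""
    return sum(values) / len(values)

labels = ["Rise"] * 18 + ["Descent"] * 9 + ["Stability"] * 9  # hypothetical
folds = stratified_folds(labels, k=9)
```

Each of the 9 folds serves once as the validation set while the remaining 8 are used for training, and `mean` over the 9 per-fold accuracies (or F1 scores) gives the reported value.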

Figures 2 and 3 show the performance of the models as features are removed according to the Information Gain and Relief methods, respectively. Random Forest was employed as the classifier; using all available features (194 in total), it achieved an average F1-Score of 0.475 and an accuracy of 0.476 on the test data. Each graph presents an orange dot (OD) and a green dot (GD), which mark, respectively, the highest accuracy and a limit beyond which the removal of further features drastically reduces the accuracy of the models, showing the importance of the remaining features.

Our ﬁrst analysis addresses the Information Gain

method. Initially the accuracy increases as the fea-

tures are being removed, until reaching its peak, and

then declining. The order in which the features were

eliminated corresponds to the reverse sequence ob-

tained from the Information Gain feature selection

approach. The accuracy results shown in the graphs

are derived from Cross-Validation conducted without

stratiﬁcation.

Figure 2: Information Gain features removal.

In this analysis, the OD marks a feature count of 79, where the trained model achieved an accuracy of 0.493. At the GD, where the feature count was 38, the model delivered an accuracy of 0.483.

Next, the same approach is adopted with the Relief method. The accuracy increases as the features are removed, and the OD and GD are closer to each other than with the Information Gain method.

In this analysis, the OD on the chart corresponds to 73 features, where Cross-Validation yielded an accuracy of 0.498. At the GD, which corresponds to 52 features, the model’s accuracy was 0.490.

Note that the Relief method initially performs better, achieving higher accuracy values as it removes the initial features. However, after obtaining these initial accuracies superior to those of the Information Gain method, the Relief method models were unable to maintain them, causing the accuracy curve to begin declining earlier. This can be better observed when we compare the two GDs, where the Relief method with 52 features achieved an accuracy of 0.490 and the Information Gain method with 14 fewer features achieved an accuracy of 0.483.

Figure 3: Relief method features removal.

Table 1 presents the top 10 most relevant features according to Information Gain and the Relief method. In the feature names, sd and dp (from the Portuguese desvio padrão) both denote the standard deviation indicator.

Table 1: Features selected by Information Gain and Relief.

Rank   Information Gain     Relief
1      sd 90 average        dp 90 offer/sell
2      sd 90 minimum        dp 90 offer/buy
3      sd 90 opening        dp 90 minimum
4      sd 90 closing        dp 90 opening
5      sd 90 maximum        dp 90 closing
6      sd 60 opening        dp 90 average
7      sd 90 offer/sell     dp 90 maximum
8      sd 90 offer/buy      dp 60 offer/buy
9      sd 60 maximum        dp 60 offer/sell
10     sd 60 minimum        rsi 21 average

The standard deviation ﬁnancial indicator stands

out as extremely relevant in both methods, occupying

all positions in the top 10 in each of them, except the

tenth position in the Relief method. In the information

gain method, eight standard deviation indicators refer

to the period of 90 windows and two to the period of

60 windows. In the Relief method, seven indicators

correspond to the period of 90 windows, two to the

period of 60 windows, while the tenth place was oc-

cupied by the RSI in 21 windows.

With Stratified K-fold Cross-Validation, the Relief method with 73 features (OD) achieved an accuracy of 0.493. This model was then evaluated on the test dataset, where it achieved an accuracy of 0.515 and an F1-Score of 0.516. Applying the same approach to the GD, a model with 52 features, yielded an accuracy of 0.488 with Stratified K-fold Cross-Validation and, on the test dataset, an accuracy of 0.492 and an F1-Score of 0.493.

Figure 4: Graphic Information Gain and Relief methods.

For Information Gain, the OD marks the point where the number of features is 79. A new model built with Stratified K-fold Cross-Validation achieved an accuracy of 0.494; assessed on the test dataset, it achieved an accuracy of 0.512 and an F1-Score of 0.512. The same procedure was adopted for the GD: with Stratified Cross-Validation the model achieved an accuracy of 0.480, and on the test dataset it achieved an accuracy of 0.500 and an F1-Score of 0.501.

The accuracy values of around 50% can be partially explained by the complexity of the financial market, but also by the multiclass nature of our classification problem. The dataset was structured to predict five distinct price movement classes, which inherently makes the classification task more challenging. In financial prediction, price movements are often subtle and influenced by numerous external factors, and distinguishing between similar classes, such as Stability and small rises or descents, adds complexity.
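As a point of reference for reading these numbers, uniform random guessing over five classes is correct with probability 1/5 = 0.20 regardless of the class distribution. The sketch below shows how accuracy, a macro-averaged F1-Score, and a majority-class baseline can be computed for a multiclass problem. The toy labels are hypothetical, and macro averaging is an assumption, since the F1 averaging scheme is not stated here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical toy labels for five price-movement classes (0..4).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3, 4, 2]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
baseline = majority_baseline(y_true)
uniform_chance = 1 / len(set(y_true))
```

Comparing a reported accuracy against both `uniform_chance` and `baseline` gives a fairer picture than comparing it against the 50% chance level of a binary problem.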

6 CONCLUSIONS

This research explored the effectiveness of the Information Gain and Relief methods in improving predictive performance in the Brazilian financial market, using Random Forest models. The data set, composed of 194 technical analysis indicators, was subjected to attribute selection, with each method evaluated by progressive attribute removal and validated through Cross-Validation.
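The progressive attribute removal summarized above can be sketched as a loop that repeatedly drops the lowest-ranked features and re-scores the remaining subset. This is a simplified illustration, not the authors' procedure: `progressive_removal`, `best_subset_size`, and the toy scoring function are hypothetical names, the ranking would in practice come from Relief or Information Gain, and the score would be a cross-validated accuracy rather than the stand-in used here. The paper's own OD and GD selection criteria are not reproduced.

```python
def progressive_removal(ranked_features, score_fn, step=1, min_features=1):
    """Drop the lowest-ranked features step by step, scoring each subset.

    ranked_features: feature names sorted from most to least relevant.
    score_fn: callable returning a validation score for a feature subset
              (for example, mean Stratified K-fold accuracy).
    """
    history = []
    n = len(ranked_features)
    while n >= min_features:
        subset = ranked_features[:n]
        history.append((n, score_fn(subset)))
        n -= step
    return history


def best_subset_size(history):
    # Highest score wins; ties are broken in favour of fewer features.
    return max(history, key=lambda item: (item[1], -item[0]))[0]


# Hypothetical ranking and scorer: three informative features, three noisy ones.
ranked = ["f1", "f2", "f3", "f4", "f5", "f6"]
informative = {"f1", "f2", "f3"}


def toy_score(subset):
    # Stand-in for cross-validated accuracy: informative features help, noise hurts.
    useful = sum(f in informative for f in subset)
    noise = len(subset) - useful
    return 0.40 + 0.03 * useful - 0.01 * noise


history = progressive_removal(ranked, toy_score)
best_n = best_subset_size(history)
```

With a real scorer, `history` gives the accuracy curve over the number of retained features, from which a cut-off point can be chosen.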

ICEIS 2025 - 27th International Conference on Enterprise Information Systems


The results provide insights for developing more efficient models for the Brazilian financial market. Future studies could explore other attribute selection methods or adapt the methodology to different financial scenarios.

In future work, we intend to explore the potential of the selected features in new analyses, using these optimized features in advanced machine learning models, such as deep learning architectures, to enhance prediction accuracy.

ACKNOWLEDGEMENTS

The authors would like to thank FAPERGS (24/2551-0001396-2, 23/2551-0000773-8), CNPq (305805/2021-5) and FAPERGS/CNPq (23/2551-0000126-8). Fabian thanks Fesurv-UniRV for the paid leave, which enabled his collaboration on this work.

REFERENCES

Aha, D. W. and Bankert, R. L. (1995). A comparative evaluation of sequential feature selection algorithms. In Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics.

Altman, D. G. and Bland, J. M. (2005). Standard deviations and standard errors. In BMJ. British Medical Journal Publishing Group.

Araujo, G. S. and Gaglianone, W. P. (2023). Machine learning methods for inflation forecasting in Brazil: New contenders versus classical models. In Latin American Journal of Central Banking. Elsevier.

Bansal, S. (2016). Investigating the efficacy of RSI in the Nifty 50 index. In Global Journal of Business and Integral Security.

Billah, M. M., Sultana, A., Bhuiyan, F., and Kaosar, M. G. (2024). Stock price prediction: comparison of different moving average techniques using deep learning model. In Neural Computing and Applications. Springer.

Bouri, E., Demirer, R., Gupta, R., and Sun, X. (2020). The predictability of stock market volatility in emerging economies: Relative roles of local, regional, and global business cycles. In Journal of Forecasting. Wiley Online Library.

Breiman, L. (2001). Random forests. In Machine learning.

Springer.

Cardoso, F. C., Malska, J. A. V., Ramiro, P. J., Lucca, G., Borges, E. N., de Mattos, V. L. D., and Berri, R. A. (2022). BovDB: a data set of stock prices of all companies in B3 from 1995 to 2020. In Journal of Information and Data Management.

Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection methods. In Computers & electrical engineering. Elsevier.

Chen, C.-P. and Metghalchi, M. (2012). Weak-form market efficiency: Evidence from the Brazilian stock market. In International Journal of Economics and Finance. Citeseer.

Cuervo, R. (2023). Predictive AI for SME and large enterprise financial performance management. In arXiv preprint arXiv:2311.05840.

Dash, R. K., Nguyen, T. N., Cengiz, K., and Sharma, A. (2023). Fine-tuned support vector regression model for stock predictions. In Neural Computing and Applications. Springer.

Halilbegovic, S. (2016). MACD-analysis of weaknesses of the most powerful technical analysis tool. In Independent Journal of Management & Production. Instituto Federal de Educação, Ciência e Tecnologia de São Paulo.

Htun, H. H., Biehl, M., and Petkov, N. (2023). Survey of

feature selection and extraction techniques for stock

market prediction. In Financial Innovation. Springer.

Jain, R. and Vanzara, R. (2023). Emerging trends in AI-based stock market prediction: A comprehensive and systematic review. In Engineering Proceedings. MDPI.

Janecek, A., Gansterer, W., Demel, M., and Ecker, G. (2008). On the relationship between feature selection and classification accuracy. In New challenges for feature selection in data mining and knowledge discovery. PMLR.

Kamalov, F., Smail, L., and Gurrib, I. (2019). Stock price prediction using technical indicators: a predictive model using optimal deep learning. In Learning.

Kamalov, F., Smail, L., and Gurrib, I. (2020). Forecasting with deep learning: S&P 500 index. In 2020 13th International Symposium on Computational Intelligence and Design (ISCID). IEEE.

Kang, B.-K. (2021). Improving MACD technical analysis by optimizing parameters and modifying trading rules: evidence from the Japanese Nikkei 225 futures market. In Journal of Risk and Financial Management. MDPI.

Kohn, K. and Moraes, C. d. (2007). O impacto das novas tecnologias na sociedade: conceitos e características da sociedade da informação e da sociedade digital. In XXX Congresso Brasileiro de Ciências da Comunicação.

Shi, Y., Li, B., Long, W., and Dai, W. (2022). Method for improving the performance of technical analysis indicators by neural network models. In Computational Economics. Springer.

Souza, A. S., Lucca, G., Borges, E. N., Cardoso, F. C., Dalmazo, B. L., and Berri, R. (2024). Dataset for Intraday Analysis of B3 stock prices.

Urbanowicz, R. J., Olson, R. S., Schmitt, P., Meeker, M., and Moore, J. H. (2018). Benchmarking relief-based feature selection methods for bioinformatics data mining. In Journal of biomedical informatics. Elsevier.

Yu, L. and Liu, H. (2016). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03).
