(Dighe et al., 2018). They compare the usage pattern with the current transaction to classify it as either a fraudulent or a legitimate transaction. Among the techniques implemented are KNN, Naïve Bayes, CFLANN, M-Perceptron and DTrees.
It is stated that credit card frauds have no constant patterns (Pumsirirat and Yan, 2018); therefore, the use of unsupervised learning is necessary. They take into account that frauds are committed once through online mediums and that the techniques then change. To address this issue, they implement a deep auto-encoder model and a restricted Boltzmann machine, which reconstruct normal transactions in order to find anomalies in the patterns.
An intelligent agent can obtain a high fraud detection rate with a low false alarm rate, providing a convenient way to detect frauds (Chukwuneke, 2018). Their implementation of the intelligent agent focuses on detecting the fraud while the transaction is in progress, taking into account the customer's pattern; any deviation from the regular pattern is considered a fraudulent transaction.
2.2 Imbalanced Classification
Research work in imbalanced data classification is focused on two levels: the data level and the algorithmic level. At the data level, the objective is to balance the class distribution by manipulating the training samples, through over-sampling of the minority class, under-sampling of the majority class, or combinations of both. The authors take into account that over-sampling can lead to overfitting, while under-sampling loses valuable information from the majority class. On the other hand, the objective of algorithmic-level methods is to increase the importance of the minority class by improving the algorithms through decision threshold adjustment, cost-sensitive learning and ensemble learning (Lin et al., 2019). An alternative loss function for deep neural networks that can capture the classification errors from both the minority class and the majority class has been established (Wang et al., 2016). Another approach extracts hard samples of the minority classes and improves the bootstrapping sampling algorithm to ensure their presence in the training data of each mini-batch, using batch-wise optimization with a Class Rectification Loss function (Dong et al., 2019).
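As a concrete illustration of the data-level strategies described above, the following sketch performs plain random over-sampling and under-sampling with NumPy. The function name, the 1:1 target ratio and the binary label convention are assumptions made for the example, not the procedure of the cited works.

```python
# Random over-sampling of the minority class and random under-sampling of the
# majority class (illustrative sketch; labels and 1:1 target ratio assumed).
import numpy as np

def rebalance(X, y, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]

    # Over-sampling: duplicate randomly drawn minority samples until both
    # classes have the same number of examples.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    X_over, y_over = np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])

    # Under-sampling: keep only a random subset of the majority class of the
    # same size as the minority class (information is discarded).
    keep = rng.choice(majority, size=len(minority), replace=False)
    X_under = np.concatenate([X[keep], X[minority]])
    y_under = np.concatenate([y[keep], y[minority]])

    return (X_over, y_over), (X_under, y_under)
```

Over-sampling leaves the original data in place but repeats minority examples, which is why it can promote overfitting; under-sampling discards majority examples, which is why it can lose information.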
2.3 Reinforcement Learning in
Classification
Recently, deep reinforcement learning has achieved excellent results, because it can assist classifiers in learning important features or selecting good instances from noisy data. The authors cast the classification task as a sequential decision-making process in which multiple agents interact with the environment to obtain the optimal classification policy. However, the interaction between the agents and the environment generates an extremely high time complexity (Lin et al., 2019). A deep reinforcement learning model divided into an instance selector and a relational classifier has been established to learn relation classification in noisy text data. The instance selector implements an agent that selects high-quality sentences from the noisy data, while the relational classifier learns from the previously selected data and gives a reward to the instance selector. The final model obtains a better classifier and a high-quality data set (Feng et al., 2018). A deep reinforcement learning framework for time series data classification has also been established; this framework uses a specific reward function and a clearly formulated Markov process (Martinez et al., 2018). The information available about imbalanced data classification with reinforcement learning is quite limited.
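To make the sequential formulation concrete, the following toy environment presents one training sample per step, takes the predicted label as the action, and returns a reward of +1 for a correct prediction and -1 for an incorrect one. The episode structure and the ±1 reward are assumptions made for this sketch, not the exact formulation of the cited works.

```python
# Toy environment casting binary classification as a sequential decision
# process: one sample per step, the action is the predicted label, and the
# reward signals whether the prediction was correct.
import numpy as np

class ClassificationEnv:
    def __init__(self, X, y, seed=0):
        self.X, self.y = X, y
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Shuffle the data; an episode is one pass over the training set.
        self.order = self.rng.permutation(len(self.y))
        self.i = 0
        return self.X[self.order[self.i]]

    def step(self, action):
        correct = action == self.y[self.order[self.i]]
        reward = 1.0 if correct else -1.0
        self.i += 1
        done = self.i >= len(self.order)
        next_state = None if done else self.X[self.order[self.i]]
        return next_state, reward, done
```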
3 TECHNICAL BACKGROUND
3.1 Q-Learning
Q-Learning is one of the far-reaching reinforcement learning techniques that do not require a model of the environment to learn to execute complex tasks. Essentially, Q-Learning makes it possible for an algorithm to learn a sequential task, where rewards are released in a step-by-step fashion, until a journey called an "episode" is completed. After training, the "educated" agent develops a road-map memory called a "policy", usually represented by a matrix Q, which optimizes reward-capturing trajectories in any definable environment.
ment. Q(s
t
,a
t
) gives the value of taking action a
t
in
a state s
t
. Equation 1 is the leading actor of the Q-
learning algorithm, derived from the Bellman equa-
tion by considering the first and second term of an
infinite series (Watkins and Dayan, 1992):
Q
obs
(s
t
,a
t
) = r + γmax
a
Q(s
t+1
,a
t+1
), (1)
where $\gamma$ is the discount factor, which manages the balance between immediate and future rewards. In this equation the value $Q(s_t, a_t)$ of a state and action is given by the sum of the reward $r$ and the discounted maximum future expected reward after moving to the next state $s_{t+1}$. The value of $Q_{obs}(s_t, a_t)$ is computed by the agent and then used to update its own estimate of $Q^{*}(s_t, a_t)$ in a Q-table. The term $\max_{a} Q(s_{t+1}, a_{t+1})$ gives the maximum value over all actions available in the next state $s_{t+1}$.
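A minimal tabular sketch of this update is given below. The learning rate $\alpha$, the $\epsilon$-greedy exploration that would normally surround it, and the small discrete state space are assumptions of the example; Equation 1 itself only fixes the form of the observed target value.

```python
# Tabular Q-learning update following Equation 1: the observed value
# Q_obs = r + gamma * max_a' Q(s', a') pulls the stored estimate Q(s, a)
# toward itself at rate alpha (alpha is an assumption of this sketch).
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    q_obs = r + gamma * np.max(Q[s_next])   # observed (target) value, Eq. 1
    Q[s, a] += alpha * (q_obs - Q[s, a])    # move the table entry toward it
    return Q

# Usage: a 5-state, 2-action Q-table updated after one observed transition.
Q = np.zeros((5, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```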