Forests and Support Vector Machines) is introduced
in (Bhattacharyya et al., 2011), where the effective-
ness of these methods in this field is discussed.
Data Unbalance. As previously underlined, the imbalance of the transaction data represents one of the most relevant issues in this context, since almost all learning approaches are unable to operate with this kind of data structure (Batista et al., 2000), i.e., when the number of instances in each class differs excessively. Several pre-processing techniques have been developed to face this problem (Japkowicz and Stephen, 2002; Drummond et al., 2003).
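As a concrete illustration of this family of techniques, the sketch below rebalances a transaction set by randomly undersampling the majority (legitimate) class; the function name and data layout are illustrative assumptions, not the specific methods proposed in the cited works.

```python
import random

def undersample_majority(transactions, labels, seed=0):
    """Rebalance a dataset by randomly discarding majority-class
    (legitimate) instances until both classes have equal size.
    Illustrative sketch; not the exact method of the cited works."""
    rng = random.Random(seed)
    legit = [i for i, y in enumerate(labels) if y == 0]
    fraud = [i for i, y in enumerate(labels) if y == 1]
    kept = sorted(rng.sample(legit, len(fraud)) + fraud)  # keep time order
    return [transactions[i] for i in kept], [labels[i] for i in kept]
```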
Detection Models. The static approach (Pozzolo et al., 2014) represents a canonical way to detect fraudulent events in a stream of transactions. It is based on the initial building of a user model, which is used for a long period of time before being rebuilt. In the so-called updating approach (Wang et al., 2003), instead, when a new block appears, the user model is trained by using a certain number of the latest contiguous blocks of the sequence; the model can then be used to infer the future blocks, or aggregated into a larger model composed of several models. In another strategy, based on the so-called forgetting approach (Gao et al., 2007), a user model is defined at each new block by using a small number of non-fraudulent transactions, extracted from the last two blocks, while keeping all previous fraudulent ones. Also in this case, the model can be used to infer the future blocks, or aggregated into a larger model composed of several models. In any case, regardless of the adopted approach, the problems of the non-stationary distribution of the data and of the unbalanced class distribution remain unsolved.
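To make the contrast between these strategies concrete, the sketch below expresses the updating and forgetting approaches over a sequence of transaction blocks; the block encoding, the `train` callback, and the `fraud` flag are illustrative assumptions rather than the exact formulations of the cited papers.

```python
def updating_models(blocks, k, train):
    """Updating approach (sketch): when a new block appears, retrain
    on the k latest contiguous blocks; the resulting models can infer
    future blocks or be aggregated into a larger composite model."""
    models = []
    for i in range(k, len(blocks)):
        window = [t for block in blocks[i - k:i] for t in block]
        models.append(train(window))
    return models

def forgetting_models(blocks, train):
    """Forgetting approach (sketch): at each new block, train on the
    non-fraudulent transactions of the last two blocks while keeping
    all previously seen fraudulent transactions."""
    models, frauds = [], []
    for i in range(1, len(blocks)):
        frauds.extend(t for t in blocks[i - 1] if t["fraud"])
        legit = [t for block in blocks[i - 1:i + 1]
                 for t in block if not t["fraud"]]
        models.append(train(legit + frauds))
    return models
```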
Differences with Our Approach. The proposed approach introduces a novel strategy that, firstly, takes into account all elements of a transaction (i.e., numeric and non-numeric), reducing the problem related to the lack of information, which leads toward an overlapping of the classes of expense. The introduction of the Transaction Determinant Field (TDF) set also allows us to give more importance to certain elements of the transaction during the model building. Secondly, differently from the canonical approaches in the state of the art, our approach is not based on a unique model, but instead on multiple user models that involve the entire set of data. This allows us to evaluate a new transaction by comparing it with a series of behavioral models related to many parts of the user transaction history. The main advantage of this strategy is the reduction, or removal, of the issues related to the non-stationary distribution of the data and to the unbalanced class distribution, because the operative domain is represented by the limited event blocks rather than by the entire dataset. The discretization of the models, according to a certain value of d, permits us to adjust their sensitivity to the peculiarities of the operating environment.
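A minimal sketch of this multiple-model evaluation, under the assumption that `build_model` and `compare` stand in for the model-building and comparison steps formalized later in the paper, might look as follows:

```python
def evaluate_transaction(new_t, history_blocks, build_model, compare):
    """Sketch of the proposed strategy: one behavioral model is built
    per block of the user's transaction history, and a new transaction
    is scored against all of them instead of a single global model.
    `build_model` and `compare` are placeholders (assumptions); the
    parameter d would control each model's discretization granularity."""
    models = [build_model(block) for block in history_blocks]
    scores = [compare(m, new_t) for m in models]
    return sum(scores) / len(scores)  # aggregated agreement (assumption)
```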
3 PROBLEM DEFINITION
This section defines the problem faced by our approach, preceded by a set of definitions aimed at introducing its notation.
Definition 3.1 (Input Set). Given a set of users $U = \{u_1, u_2, \dots, u_M\}$, a set of transactions $T = \{t_1, t_2, \dots, t_N\}$, and a set of fields $F = \{f_1, f_2, \dots, f_X\}$ that compose each transaction $t$ (we denote as $V = \{v_1, v_2, \dots, v_W\}$ the values that each field $f$ can assume), we denote as $T^{+} \subseteq T$ the subset of legal transactions, and as $T^{-} \subseteq T$ the subset of fraudulent transactions. We assume that the transactions in the set $T$ are chronologically ordered (i.e., $t_n$ occurs before $t_{n+1}$).
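To fix ideas, the sets of Definition 3.1 can be encoded concretely as follows; the field names and values are purely illustrative assumptions.

```python
# Illustrative encoding of Definition 3.1 (field names are assumptions).
U = ["u1", "u2"]                          # users
F = ["amount", "merchant", "date"]        # fields of each transaction
T = [                                     # chronologically ordered
    {"user": "u1", "amount": 29.9, "merchant": "grocery",
     "date": "2016-01-02", "fraud": False},
    {"user": "u1", "amount": 950.0, "merchant": "jewelry",
     "date": "2016-01-03", "fraud": True},
]
T_plus  = [t for t in T if not t["fraud"]]  # legal transactions
T_minus = [t for t in T if t["fraud"]]      # fraudulent transactions
```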
Definition 3.2 (Fraud Detection). The main objective of a fraud detection system is the isolation and ranking of the potentially fraudulent transactions (Fan and Zhu, 2011), i.e., by assigning a high rank to them, since in real-world applications this allows a service provider to focus the investigative efforts on a small set of suspect transactions, maximizing the effectiveness of the action and minimizing its cost. For this reason, we evaluate the ability of our fraud detection strategy in terms of its capacity to assign a high rank to frauds, using as measure the average precision (denoted as α), since it is considered the correct metric in this context (Fan and Zhu, 2011). Other metrics commonly used to evaluate fraud detection strategies, such as the AUC (a measure for unbalanced datasets) and the PrecisionRank (a measure of precision within a certain number of observations with the highest rank) (Pozzolo et al., 2014), are therefore not taken into consideration in this work.
The formalization of the average precision is shown in Equation 1, where $N$ is the number of transactions in the dataset and $\Delta R(t_r) = R(t_r) - R(t_{r-1})$. Denoting as $\pi$ the number of fraudulent transactions in the dataset, and as $h(t) \leq t$ the hits (i.e., the truly fraudulent transactions) among the top $t$ percent of ranked candidates, we can calculate $recall(t) = h(t)/\pi$ and $precision(t) = h(t)/t$, and then the value of $\alpha$.
$$\alpha = \sum_{r=1}^{N} P(t_r)\,\Delta R(t_r) \qquad (1)$$
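For clarity, Equation 1 can be computed directly from a ranked list of transactions; the sketch below assumes the labels are already sorted by descending rank, with 1 marking a fraudulent transaction.

```python
def average_precision(ranked_labels):
    """Compute alpha (Equation 1): at each hit of rank r, the precision
    P(t_r) = hits/r is weighted by the recall increment Delta R(t_r),
    which equals 1/pi at a hit and 0 elsewhere (pi = total frauds)."""
    pi = sum(ranked_labels)
    if pi == 0:
        return 0.0
    hits, alpha = 0, 0.0
    for r, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            alpha += (hits / r) * (1.0 / pi)
    return alpha

# Frauds ranked 1st and 3rd: (1/1 + 2/3) / 2 ≈ 0.833
print(average_precision([1, 0, 1, 0]))
```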