mendations on this matter for the new POs. For the subset of old POs, however, the audit step is effectively executed, and these results will be reported after the discussion of the new POs. In what follows, the term data set refers to the subset of new POs (33.814 observations).
The most important attributes describing a PO and its life cycle are the following: the name of the creator, the supplier, the purchasing group, the type of purchasing document, the number of changes, the number of changes after the last release, and the number of price related changes after the last release. Concerning the categorical attributes, there are 91 creators recurring in the data set, 3.708 suppliers, 13 purchasing groups and 6 document types (see Table 1).
Table 1: Categorical attributes.

Categorical attribute    Recurrence in data set
Creator                  91
Supplier                 3.708
Purchasing Group         13
Document Type            6
Of the 91 creators, not all enter equally many POs into the ERP system, because of the individual characteristics of each purchase. Some creators, responsible for a particular type of purchase, need to enter many POs, while other creators, responsible for other types of purchase, enter only a few. Personnel turnover is also reflected in the number of POs per employee.
As with creators, the frequency of suppliers in the data set depends on the specific characteristics of the product or service supplied. For example, there will be more POs concerning monthly leasing contracts for cars than POs for supplying desks; hence the former supplier will be more frequently present in the data set than the latter. Concerning the 13 purchasing groups, there is no difference in expected fraud risk between the groups. Some groups are more present than others in the data set, but this can be explained by domain knowledge. The same goes for the six different purchasing document types: all types have their specific characteristics, but there is no expected difference concerning fraud risk.
The numerical attributes are described in Table 2.
For each attribute, three intervals were created, based
on their mean and standard deviation. For the first
attribute, the intervals were [2-4], [5-8] and [9-...], for
the second attribute [0-0], [1-2] and [3-...] and for the
last attribute [0-0], [1-1] and [2-...]. In Table 2 we see that the distribution of all three attributes is highly skewed, which is to be expected for variables counting these types of changes: such counts are supposed to be small.
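To make the interval construction concrete, the binning can be sketched as follows (a minimal Python sketch; the exact cut-off rule is not stated above, so the cuts at the mean and at the mean plus one standard deviation are an assumption):

```python
import numpy as np

def three_intervals(values):
    """Bin a count attribute into three intervals derived from its
    mean and standard deviation (assumed cut-offs: below the mean,
    between the mean and mean + std, and above mean + std)."""
    mu = np.mean(values)
    sigma = np.std(values)
    return np.digitize(values, [mu, mu + sigma])  # labels 0, 1 or 2

# Hypothetical change counts for eight POs.
changes = np.array([2, 3, 5, 9, 4, 12, 3, 2])
labels = three_intervals(changes)  # -> [0, 0, 1, 2, 0, 2, 0, 0]
```

Because the cut-offs are data-driven, each of the three attributes gets its own interval boundaries, matching the different ranges reported above.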
After creating these attributes and providing de-
scriptives, we turn to the third step of our method-
ology. For the specification of our model, we take
into account the particular type of fraud risk we wish
to reduce. The fraud risk linked with entering POs into the ERP system is connected with the number of changes made to a PO, and more specifically with the changes made after the last release. This is because the ERP system offers built-in flexibility to modify released POs without triggering a new release procedure. To assess the related risk, we selected four attributes to mine the data. The first attribute is the total number of changes a PO is subjected to. The second attribute is the number of changes executed on a PO after it was released for the last time. The third attribute is the percentage of this last count that is price related: what share of the changes made after the last release concerns price issues? The fourth attribute concerns the magnitude of these price changes. Considering the price related changes, we calculate the mean and standard deviation of all price changes per PO. By themselves, these statistics were not believed to add value: every purchaser has his or her own field of purchases, so a cross-sectional analysis is not really an option. However, we combine the mean (µ) and standard deviation (σ) into a theoretical upper limit per PO of µ + 2σ, and count for each PO how often this limit was exceeded. This new attribute is also taken into account in our model. In this core model, no categorical attributes were added. As a robustness check, however, attributes such as document type and purchasing group were included in the model; the results did not change significantly with these inclusions.
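The µ + 2σ exceedance count can be sketched as follows (a minimal Python sketch on hypothetical toy data; the array names and the use of the sample standard deviation are assumptions):

```python
import numpy as np

# Hypothetical toy data: one entry per price related change on a PO.
po_id = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2])
price_change = np.array([10., 10., 10., 10., 10., 100., 5., 5.5, 4.8])

# Per PO: theoretical upper limit mu + 2*sigma over its own price
# changes, then a count of how often that limit is exceeded.
exceed = {}
for po in np.unique(po_id):
    changes = price_change[po_id == po]
    limit = changes.mean() + 2 * changes.std(ddof=1)  # mu + 2*sigma
    exceed[po] = int((changes > limit).sum())

# The single 100.0 change on PO 1 exceeds that PO's limit once;
# PO 2 never exceeds its limit.
```

Because the limit is computed per PO, the attribute stays comparable across purchasers with very different price levels, in line with the remark that a cross-sectional analysis is not an option.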
4 LATENT CLASS CLUSTERING
ALGORITHM
For a descriptive data mining approach, we have chosen a clustering algorithm, more specifically a latent class (LC) clustering algorithm. We prefer LC clustering to the more traditional K-means clustering for several reasons. The most important reason is that this algorithm allows for overlapping clusters. Each observation is assigned a probability of belonging to each cluster, for example .80 for cluster 1, .20 for cluster 2 and .00 for cluster 3. This gives us the additional opportunity to look for outliers, in the sense of observations that do not belong to any cluster at all. This is for example the case with probabilities like .35, .35 and .30.
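This soft assignment can be illustrated with a small sketch (an assumed, simplified stand-in: posterior membership probabilities for a two-component one-dimensional Gaussian mixture with fixed parameters, whereas the actual LC model is estimated from the data):

```python
import numpy as np

def membership_probs(x, means, sds, weights):
    """Posterior probability P(cluster k | x) for each component k
    of a one-dimensional Gaussian mixture with fixed parameters."""
    x = np.asarray(x, dtype=float)[:, None]
    dens = weights / (sds * np.sqrt(2 * np.pi)) * np.exp(
        -0.5 * ((x - means) / sds) ** 2)
    return dens / dens.sum(axis=1, keepdims=True)

means = np.array([0.0, 5.0])
sds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

probs = membership_probs([0.0, 5.0, 2.5], means, sds, weights)
# The observation at 0.0 clearly belongs to cluster 1 and the one at
# 5.0 to cluster 2, while 2.5 gets roughly .50/.50 and belongs to no
# cluster convincingly: the outlier pattern described in the text.
```

K-means, by contrast, would force the observation at 2.5 into one of the two clusters with no indication of the ambiguity.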
Other considerations for applying the LC clustering algorithm are the ability to handle attributes of mixed scale
INTERNAL FRAUD RISK REDUCTION - Results of a Data Mining Case Study