Figure 1 summarizes the process we chose to
detect and eliminate outliers.
[Figure: flowchart. A normality test determines whether the column values follow a normal distribution; if yes, Grubbs' test is applied, otherwise the box plot method is used.]
Figure 1: The outlier detection process.
Finally, the computation of γ3 and γ2 used to evaluate the JB statistic, as well as the calculation of the Grubbs' and box plot statistics, is performed in parallel in the manner shown in Listing 1 (cf. Section 2), in order to speed up response times. Other statistics used in the next section are collected simultaneously at this stage. Because the corresponding algorithm is very simple (the computation of each statistic is treated as a single task), we do not present it.
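As an illustration only, here is a minimal sketch of such a parallel evaluation in Python, assuming textbook formulas for γ3, γ2, JB, the Grubbs statistic and the box plot fences; the names and the task decomposition are ours, not those of Listing 1:

```python
# Illustrative sketch (not the paper's Listing 1): each statistic is an
# independent task submitted to a pool, so all of them run in parallel.
import math
from concurrent.futures import ThreadPoolExecutor

def skewness(xs):            # gamma_3
    n, m = len(xs), sum(xs) / len(xs)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def kurtosis(xs):            # gamma_2 (excess kurtosis)
    n, m = len(xs), sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * s2 ** 2) - 3.0

def jarque_bera(xs):         # JB = n/6 * (gamma_3^2 + gamma_2^2 / 4)
    n = len(xs)
    return n / 6.0 * (skewness(xs) ** 2 + kurtosis(xs) ** 2 / 4.0)

def grubbs_stat(xs):         # G = max |x - mean| / sample stddev
    n, m = len(xs), sum(xs) / len(xs)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return max(abs(x - m) for x in xs) / s

def boxplot_fences(xs):      # [Q1 - 1.5*IQR, Q3 + 1.5*IQR], rough quartiles
    ys = sorted(xs)
    q1, q3 = ys[len(ys) // 4], ys[3 * len(ys) // 4]
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def column_statistics(xs):
    """Run every statistic as a single task, in parallel."""
    tasks = {"jb": jarque_bera, "grubbs": grubbs_stat, "fences": boxplot_fences}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(f, xs) for name, f in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}
```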
4 DISCRETIZATION METHODS
Discretization methods, like outlier management methods, apply to columns of numerical values. However, in a previous work, we also integrated other types of column values, such as strings, by translating such values (based on their frequency) into numerical ones (Ernst and Casali, 2011). This is why our approach, a priori dedicated to numerical values, can easily be extended to any given database.
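As a side illustration, here is a minimal sketch of such a frequency-based translation; the function name and the exact encoding are our assumptions, not the method of (Ernst and Casali, 2011):

```python
from collections import Counter

def frequency_encode(column):
    """Replace each string by its frequency in the column (illustrative)."""
    freq = Counter(v for v in column if v is not None)
    return [freq[v] if v is not None else None for v in column]

# frequency_encode(["a", "b", "a", None]) -> [2, 1, 2, None]
```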
The discretization of an attribute consists in finding NbBins disjoint intervals that will subsequently represent it in a consistent way. The final objective of discretization methods is to ensure that the mining part of the KDD process generates efficient results. In our approach, we use only direct discretization methods, in which NbBins must be known in advance and represents the upper limit for every column of the input data. In the previous work mentioned above, NbBins was a parameter fixed by the end-user. As an alternative, the literature proposes several formulas (Rooks-Carruthers, Huntsberger, Scott, etc.) for computing such a number. We use the Huntsberger formula, the best from a theoretical point of view (Cauvin et al., 2008), given by: NbBins = 1 + 3.3 × log10(N). We apply this formula to the non-null values of each column.
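A one-line sketch of this computation (assuming the count N excludes nulls, as stated, and that the result is truncated to an integer, which matches Example 1 below, where N = 12 yields 4 bins):

```python
import math

def nb_bins(values):
    """Huntsberger formula applied to the non-null values of a column."""
    n = sum(1 for v in values if v is not None)
    return int(1 + 3.3 * math.log10(n))   # e.g. n = 12 gives 4 bins
```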
4.1 Related Work
In this section, we only discuss the discretization methods that were retained for this work, because the other methods we implemented either did not prove as efficient as expected (such as Embedded Means Discretization) or were not a worthy alternative to the retained ones (Quantile-based Discretization). The methods we use are: Equal Width Discretization (EWD), Equal Frequency Fisher-Jenks Discretization (EFD-Jenks), AVerage and STandard deviation based discretization (AVST), and K-Means (KMEANS). These methods, which are unsupervised and static (Mitov et al., 2009), have been widely discussed in the literature: see for example (Cauvin et al., 2008) for EWD and AVST, (Jenks, 1967) for EFD-Jenks, or (Kanungo et al., 2002), (Arthur et al., 2011) and (Jain, 2010) for KMEANS. For these reasons, we only summarize their main characteristics and fields of applicability in Table 1.
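For concreteness, here is a minimal sketch of the two simplest of these methods, under their usual textbook definitions (equal-width cut points for EWD; cut points at the mean plus or minus one standard deviation for AVST); this reflects the standard form of the methods, not necessarily the paper's exact implementation:

```python
def ewd_bins(values, nb):
    """Equal Width Discretization: nb intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nb
    return [(lo + i * width, lo + (i + 1) * width) for i in range(nb)]

def avst_bins(values):
    """AVST sketch: 4 bins bounded by mean - std, mean, mean + std
    (assumed standard definition, not taken from the paper)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    edges = [min(values), mean - std, mean, mean + std, max(values)]
    return list(zip(edges, edges[1:]))
```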
Let us underline that the computed NbBins value is in fact an upper limit which is not always reached, depending on the applied discretization method. Thus, EFD-Jenks and KMEANS generate, most of the time, fewer than NbBins bins. This implies that other methods which determine the number of bins differently, for example through iteration steps, may also be applied, as long as NbBins can be upper bounded.
Example 1. Let us consider the numeric attribute representing the weight of several persons: SX = {59.04, 60.13, 60.93, 61.81, 62.42, 64.26, 70.34, 72.89, 74.42, 79.40, 80.46, 81.37}. SX contains 12 values, so by applying the Huntsberger formula, if we aim to discretize this set, we have to use 4 bins.
Table 2 shows the bins obtained by applying all the discretization methods proposed in Table 1. Table 3 shows the number of values of SX belonging to each bin for every discretization method.
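Since Tables 2 and 3 are not reproduced here, the point can still be illustrated with the hypothetical ewd_bins and nb_bins sketches above: on SX, EWD yields four bins of width roughly 5.58, over which the values distribute as 6, 0, 3, 3.

```python
sx = [59.04, 60.13, 60.93, 61.81, 62.42, 64.26,
      70.34, 72.89, 74.42, 79.40, 80.46, 81.37]
bins = ewd_bins(sx, nb_bins(sx))                # 4 bins of width ~5.58
counts = [sum(lo <= v < hi for v in sx) for lo, hi in bins]
counts[-1] += 1                                 # max(sx) sits on the last upper edge
print(counts)                                   # [6, 0, 3, 3]
```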
As is easy to see, no two discretization methods produce the same set of bins. As a consequence, the distribution of the values of SX differs depending on the method used.
4.2 Discretization Methods and
Statistical Characteristics
As seen in the previous section, the shape of a column's value distribution is very important when attempting to discern the best discretization method for it. We characterize the shape of a distribution according to four criteria: (i) multimodal, (ii) symmetric or antisymmetric, (iii) uniform, and (iv) normal. This is done in order to determine the most appropriate discretization method for each column.
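Purely as an illustration, and reusing the skewness, kurtosis and jarque_bera helpers sketched earlier, such criteria might be checked as follows; the thresholds are our assumptions, not values taken from the paper:

```python
def shape_criteria(xs):
    """Illustrative shape checks; thresholds are assumptions.
    Multimodality (e.g. counting histogram peaks) is omitted here."""
    g3, g2 = skewness(xs), kurtosis(xs)
    return {
        "symmetric": abs(g3) < 0.5,         # near-zero skewness
        "uniform": abs(g2 + 1.2) < 0.3,     # a uniform law has excess kurtosis -1.2
        "normal": jarque_bera(xs) < 5.99,   # 5% critical value of chi-square(2)
    }
```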