basic probability assignment. In this case, E is the
frame of discernment. Once this correspondence is
established, the evaluation of the set of mined pat-
terns would be carried out by the method described in
the previous section, that is, using the function Q(m)
defined in Equation 11. In general terms, this measure provides the expert with additional information about the quality of the sequences base. In particular, in this paper the Q(m) function (specifically Q(f)) will be used to objectively compare three sets of frequent
patterns mined from a dataset belonging to a medical
domain, which has been discretized using three differ-
ent discretization methods. Let TDM be an Apriori-
like temporal data mining algorithm, and let D be a
dataset. A special feature of TDM is that it cannot handle numerical attributes. Since D contains continuous attributes, which are very common in real datasets, a discretization method is required to obtain a
dataset with only nominal attributes. Let d_1, d_2, and d_3 be three different discretization methods that generate three different datasets, D_1 = d_1(D), D_2 = d_2(D), and D_3 = d_3(D). The execution of the algorithm on each of the datasets (TDM(D_i)) will result in three different sequences bases, denoted as BS_{D_i}, each one characterized by a normalized frequency distribution f_i. In order to compare the three discretization methods and determine which one provides information with less uncertainty, we propose the use of the Q(f_i) function, such that the best method is the one that generates a base with the highest value of Q. For a more complete assessment, in the empirical evaluation we will also use three different values for the minimum support parameter of the TDM algorithm.
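To make the procedure concrete, the following Python sketch outlines the comparison; the helpers discretize_i, tdm, and quality_q are hypothetical placeholders (neither the TDM algorithm nor the Q measure of Equation 11 is reproduced here), so only the orchestration of the comparison is illustrated.

```python
# Sketch of the comparison, assuming hypothetical helpers:
#   discretize_i(D)   -> nominal dataset D_i = d_i(D)
#   tdm(D_i, min_sup) -> {sequence: frequency}, the mined sequences base
#   quality_q(f)      -> the Q measure of Equation 11 (not reproduced here)

def normalize_frequencies(sequences_base):
    """Normalize absolute frequencies so that they sum to 1 (distribution f_i)."""
    total = sum(sequences_base.values())
    return {seq: freq / total for seq, freq in sequences_base.items()}

def compare_discretizations(D, methods, min_sup, tdm, quality_q):
    """Return Q(f_i) for each discretization method; the best one maximizes Q."""
    scores = {}
    for name, discretize_i in methods.items():
        D_i = discretize_i(D)                  # D_i = d_i(D)
        base = tdm(D_i, min_sup)               # sequences base BS_{D_i}
        f_i = normalize_frequencies(base)      # normalized frequency distribution
        scores[name] = quality_q(f_i)
    return scores
    # Preferred method: max(scores, key=scores.get)
```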
4 EMPIRICAL EVALUATION
From a practical point of view, we have carried out an
empirical evaluation using a preprocessed dataset that
represents the evolution of 363 patients in an Inten-
sive Care Burn Unit (ICBU) between 1992 and 2002.
The original database stores, for each patient, many clinical parameters such as age, presence of inhalation injury, the extent and depth of the burn, the need for early mechanical ventilation, and the patient's status on the last day of stay in the Unit, among others.
However, in the construction of the dataset, we only
take into account the temporal parameters, which in-
dicate the evolution of the patients during the resusci-
tation phase (the first 2 days) and during the stabilization phase (the following 3 days). Fluid intake, diuresis, fluid balance, acid-base balance (pH, bicarbonate, base excess), and other variables help to define objectives and
to assess the evolution and treatment response. For
each of these temporal variables, we used three dif-
ferent discretization methods (called d_1, d_2, and d_3, respectively). The first one (d_1) is based on clinical criteria and uses the knowledge previously defined in the domain. A second discretization method (d_2) can
be based on the usual interpretation of mean value
and standard deviation. In statistics, the mean is a
central value around which the rest of the values are
spread. When the distance of an element to the mean is greater than two standard deviations, it should be examined carefully because it may be a potential outlier. We have made a similar distinction, defining as normal those values within one standard deviation of the mean, as slightly high or slightly low those within two standard deviations of the mean, and as high or low those whose dis-
tance to the mean is greater than two standard devia-
tions. This discretization method is the one that generates the highest number of patterns, since it does not consider the domain values (which can be arbitrary in the dataset) but the values actually found in the data; therefore, most of them should fall in the “normal” interval if they are distributed around the mean.
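A minimal Python sketch of this mean/standard-deviation labelling might look as follows; the label names and the handling of the zero-variance case are illustrative assumptions, not the exact implementation used in our experiments.

```python
import statistics

def discretize_mean_std(values):
    """d_2-style labelling: distance to the mean measured in standard deviations.
    Label names are illustrative."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    if sigma == 0:                       # degenerate case: all values identical
        return ["normal"] * len(values)
    labels = []
    for v in values:
        z = abs(v - mu) / sigma
        if z <= 1:
            labels.append("normal")
        elif z <= 2:
            labels.append("slightly high" if v > mu else "slightly low")
        else:
            labels.append("high" if v > mu else "low")
    return labels
```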
In the last method (d_3), we used an entropy-based information gain criterion with respect to the output variable. The
information gain is equal to the total entropy for an
attribute if for each of the attribute values a unique
classification can be made for the output variable.
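For illustration only, the following sketch shows how such an information gain can be computed for a candidate partition, where labels holds the values of the output variable and groups holds those same labels split according to the attribute intervals; it is a generic formulation, not the specific discretization procedure we applied.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, groups):
    """Total entropy of the output variable minus the weighted entropy of the
    groups induced by the attribute intervals; the gain equals the total
    entropy exactly when every interval determines a unique class."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder
```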
For each discretized dataset (D_i), we have ob-
tained a set of maximal sequences using a version of a
temporal data mining algorithm designed for the ex-
traction of frequent sequences from datasets with a
time-stamped dimensional attribute. In the analysis
of the data, different values for the parameters of the
algorithm were set, in particular, maxspan to 5 days
(resuscitation and stabilization), and minimum sup-
port to 20%, 30%, and 50%. In total, 6 sets of fre-
quent sequences were extracted that, after the normal-
ization process, resulted in the generation of 6 nor-
malized sequences bases (denoted as B S
D
i
ms
, where D
i
is the dataset obtained by the discretization technique
d
i
, and ms is the minimum support). In terms of the
Theory of Evidence, each normalized sequences base
corresponds to a body of evidence or belief structure,
and is the input of the evaluation method that we pro-
pose in this paper. Table 1 shows the results of the
proposed evaluation method. For each discretization
technique, and for each minimum support value, we
indicate in the table the entropy (E_m) and the nonspecificity-based measure (J_m), where J_m = 1 − S_m and S_m is the specificity-like measure, both proposed by Yager as quality indicators of a belief structure.
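For reference, the sketch below computes both indicators under Yager's usual definitions, assuming the belief structure is represented as masses over nonempty set-valued focal elements; the logarithm base and any other deviation from the exact formulation used in the paper are assumptions.

```python
import math

def yager_measures(m):
    """Entropy E_m and nonspecificity J_m = 1 - S_m of a belief structure given
    as {frozenset(focal_element): mass}, with nonempty focal elements and
    masses summing to 1."""
    focal = list(m.items())

    def plausibility(A):
        # Pl(A): total mass of the focal elements that intersect A.
        return sum(mass for B, mass in focal if A & B)

    E_m = -sum(mass * math.log2(plausibility(A)) for A, mass in focal if mass > 0)
    S_m = sum(mass / len(A) for A, mass in focal)   # specificity-like measure
    return E_m, 1.0 - S_m                           # (E_m, J_m)
```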
The Q parameter is the quality measure defined in Equation 11, that is, the inverse of the classi-