pEn does not count self-matches and therefore can re-
duce bias. It has been found that SampEn can pro-
vide better relative consistency than ApEn because
it is largely independent of sequence length (Richman and Moorman, 2000). MSE measures the complexity of time series data by taking multiple time scales into account, and uses SampEn to quantify the regularity of the data. All three of these methods depend on the selection of two parameters, known
as m and r: parameter m is used to determine the se-
quence length, whereas parameter r is the tolerance
threshold for computing pattern similarity. Results
are sensitive to the selections of these two parameters
and it has recently been reported that good estimates
of these parameters for different types of signals are
not easy to obtain (Lu et al., 2008). In this paper we
introduce a new entropy method called GeoEntropy
(GeoEn) which can provide an analytical procedure
for estimating the control parameter r. We then apply
various entropy methods to study the complexity or
predictability of cancer using mass spectrometry data,
which are complex and large datasets. To improve
the entropy analysis, we use a novel probabilistic fu-
sion framework based on the engineering hypothesis
of permanence of ratio to combine the results from
different entropy algorithms.
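To make the roles of m and r concrete, the following is a minimal sketch of a sample-entropy computation. It assumes that r is expressed as a fraction of the standard deviation of the series and that the Chebyshev distance is used to compare templates; both are common conventions rather than choices stated in this paper, and the function name and default values are illustrative.

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """Sample entropy sketch: -ln(A / B), with self-matches excluded.

    m is the template (sequence) length and r is the tolerance, taken
    here as a fraction of the standard deviation of x (an assumption).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    tol = r * np.std(x)

    def count_matches(length):
        # Use the same n - m starting points for both template lengths.
        templates = np.array([x[i:i + length] for i in range(n - m)])
        count = 0
        for i in range(len(templates) - 1):
            # Compare with later templates only, so a template is never
            # matched against itself and each pair is counted once.
            dist = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += int(np.sum(dist <= tol))
        return count

    b = count_matches(m)      # template pairs matching over m points
    a = count_matches(m + 1)  # template pairs matching over m + 1 points
    if a == 0 or b == 0:
        return float("inf")   # the estimate is undefined without matches
    return float(-np.log(a / b))
```

A strictly periodic series yields a value of zero, while white noise yields a clearly positive value; changing m or r shifts both counts, which is the sensitivity discussed above.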
1.1 GeoEntropy
Let z(X) be a regionalized variable which has charac-
teristics in a given region D of a spatial or time contin-
uum (Matheron, 1989). In the setting of a probabilis-
tic model, a regionalized variable z(X) is considered
to be a realization of a random function Z(X). In such
a setting, the data values are samples from a particular
realization z(X) of Z(X). We now consider I observations $z(X_\alpha)$, α = 1, ..., I, taken at locations or times α. If the objects are points in time or space, the possibility of infinite observations of the same kind of data is introduced by relaxing the index α. The regionalized variable is therefore defined as z(X) for all X ∈ D, and $\{z(X_\alpha),\ \alpha = 1, \ldots, I\}$ is viewed as a collection of a few values of the regionalized variable.
Each measured value in the dataset is associated with a geometrical or time point in the respective domain D and is called a regionalized value. The family of random variables {Z(X), X ∈ D} is called the random function. The variability of a regionalized variable z(X) at different scales can be measured by calculating the dissimilarity between pairs of data values, denoted by $z(X_\alpha)$ and $z(X_\beta)$, located at geometrical or time points α and β in a spatial or time domain D, respectively (from now on we use point/domain to mean either a geometrical or a time point/domain). The measure of this semi-dissimilarity, denoted by $\gamma_{\alpha\beta}$, is computed by taking half of the squared difference between the pairs of sample values (the term semi indicates the half difference) as
$$\gamma_{\alpha\beta} = \frac{1}{2}\,\bigl[z(X_\alpha) - z(X_\beta)\bigr]^{2} \tag{1}$$
The two points $X_\alpha$ and $X_\beta$ in space or time can be linked by a space or time lag $h = X_\beta - X_\alpha$ (we use h here as a scalar, but its generalized form is a vector that indicates various spatial orientations). Letting the semi-dissimilarity depend on the lag h of the point pair, we have
$$\gamma_{\alpha}(h) = \frac{1}{2}\,\bigl[z(X_\alpha + h) - z(X_\alpha)\bigr]^{2} \tag{2}$$
Using all sample pairs in a dataset, a plot of γ(h) against the separation h is called the semi-variogram. The function γ(h) is referred to as the semi-variance and is defined as
$$\gamma(h) = \frac{1}{2N(h)} \sum_{(\alpha,\beta)\,|\,h_{\alpha\beta} = h} \bigl[z(X_\alpha) - z(X_\beta)\bigr]^{2} \tag{3}$$
where N(h) is the number of pairs of data points
whose locations are separated by lag h.
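As an illustration of (3), the sketch below computes the experimental semi-variance of a time series. It assumes equally spaced samples, so that the lag h is an integer number of sampling steps and N(h) equals the series length minus h; the function name and the `max_lag` parameter are illustrative and do not come from the paper.

```python
import numpy as np

def experimental_semivariogram(z, max_lag):
    """Return lags h = 1..max_lag and the semi-variance gamma(h) of z.

    Assumes equally spaced samples and max_lag < len(z).
    """
    z = np.asarray(z, dtype=float)
    lags = np.arange(1, max_lag + 1)
    gamma = np.empty(len(lags))
    for k, h in enumerate(lags):
        diffs = z[h:] - z[:-h]  # all N(h) pairs separated by lag h
        gamma[k] = np.sum(diffs ** 2) / (2 * len(diffs))
    return lags, gamma

lags, gamma = experimental_semivariogram([1.0, 2.0, 4.0, 7.0], max_lag=3)
```

Plotting `gamma` against `lags` gives the experimental semi-variogram of the series.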
The semi-variance defined in (3) is known as the experimental semi-variance, and its plot against h is called the experimental semi-variogram, to distinguish it from the theoretical semi-variogram that characterizes the underlying population. The theoretical semi-variogram is thought of as a smooth function represented by a model equation, whereas the experimental semi-variogram estimates its form. The behavior of the semi-variogram can be illustrated graphically by the theoretical semi-variogram using the spherical or Matheron model, which is defined as (Isaaks and Srivastava, 1989)
$$\gamma(h) = \begin{cases} s\left[1.5\,\dfrac{h}{g} - 0.5\left(\dfrac{h}{g}\right)^{3}\right], & h \le g \\[4pt] s, & h > g \end{cases} \tag{4}$$
where g and s are called the range and the sill of the
theoretical semi-variogram, respectively.
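The spherical model in (4) can be evaluated directly, as in the short sketch below. The function name is illustrative, and the range g is assumed to be positive.

```python
import numpy as np

def spherical_semivariogram(h, g, s):
    """Spherical (Matheron) model with range g > 0 and sill s."""
    h = np.asarray(h, dtype=float)
    ratio = h / g
    rising = s * (1.5 * ratio - 0.5 * ratio ** 3)  # branch for h <= g
    return np.where(h <= g, rising, s)             # constant sill beyond g

values = spherical_semivariogram([0.0, 5.0, 10.0, 20.0], g=10.0, s=2.0)
```

The curve rises from zero at h = 0, reaches the sill s exactly at h = g, and stays at s for all larger lags.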
The concept of regionalized variables and the modeling of their variability in a spatial continuum by means of the semi-variogram have now been described. The key observation is that the range g of the semi-variogram offers a way to capture the auto-relationship of time-series data: within the range g, the data points are related; when h > g, the relationship between the data points becomes saturated and carries no further useful information. Based on this principle of the
BIOSIGNALS 2010 - International Conference on Bio-inspired Systems and Signal Processing