course it is impossible to consider all codes, let alone all possible combinations of such codes. Therefore, in algorithm A1 we set a more modest goal and adopt the convention that to Assign a Code [as in lines (05) and (09)] means that we restrict ourselves to combinations of the integers between 1 and Ni (recall that Ni is the number of different values of variable i in the data). Still, there are Ni! possible ways to assign a code to categorical variable i, and Ni! × Nj! possible encodings of two categorical variables i and j. An exhaustive search is, in general,
out of the question. Instead, we take advantage of the fact that, regardless of how a random variable is distributed (here, each random encoding of variables i and j yields a correlation $r_{ij}$, which is itself a random variable), the means of sufficiently large samples very closely approach a normal distribution (Feller, 1966). Furthermore, the mean $\mu_{\bar{X}}$ of a sample of means and its standard deviation $\sigma_{\bar{X}}$ are related to the mean $\mu$ and standard deviation $\sigma$ of the original distribution by $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma/\sqrt{S}$, where $S$ is the sample size. What constitutes a sufficiently large sample is a matter of convention; here we set $S = 25$, which is a reasonable choice. Therefore, the
loop between lines (03) and (15) is guaranteed to
end. In our implementation we split the area under the normal curve into deciles and then used a goodness-of-fit test with $p = 0.05$ to determine whether normality had been achieved. This approach avoids arbitrary assumptions regarding the distribution of the correlation and, therefore, avoids fixing a sample size a priori to establish the reliability of our results. Rather, the algorithm determines at what point the proper value of $\mu$ has been reached.
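A minimal sketch of this stopping rule follows (in Python with NumPy and SciPy; the names and the batching scheme are ours for illustration, not a transcription of A1):

import numpy as np
from scipy import stats

S = 25          # batch size for each sample of correlations (as in the text)
ALPHA = 0.05    # significance level of the goodness-of-fit test

def random_code(values, rng):
    """Map each different value to a random integer in 1..Ni."""
    distinct = sorted(set(values))
    codes = rng.permutation(np.arange(1, len(distinct) + 1))
    mapping = dict(zip(distinct, codes))
    return np.array([mapping[v] for v in values], dtype=float)

def sample_correlation(vi, vj, rng):
    """Randomly re-encode two categorical variables and return their correlation."""
    return np.corrcoef(random_code(vi, rng), random_code(vj, rng))[0, 1]

def deciles_look_normal(means):
    """Chi-square test: do the standardized means fill the normal deciles evenly?"""
    z = (means - means.mean()) / means.std(ddof=1)
    edges = stats.norm.ppf(np.linspace(0.1, 0.9, 9))           # inner decile bounds
    observed, _ = np.histogram(z, bins=np.r_[-np.inf, edges, np.inf])
    expected = np.full(10, len(means) / 10.0)
    return stats.chisquare(observed, expected).pvalue > ALPHA  # fail to reject

def estimate_mu(vi, vj, max_batches=2000, seed=0):
    rng = np.random.default_rng(seed)
    means = []
    for _ in range(max_batches):
        means.append(np.mean([sample_correlation(vi, vj, rng) for _ in range(S)]))
        if len(means) >= 30 and deciles_look_normal(np.asarray(means)):
            break               # normality achieved; mu is well approximated
    return float(np.mean(means))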
Furthermore, from Chebyshev's theorem, we know
that:
$$P(|X - \mu| < k\sigma) \;\geq\; 1 - \frac{1}{k^2} \qquad (2)$$
If we take $k = 3$ and assume a symmetric distribution, the probability of being within three σ of the mean is roughly 0.95.
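To make the arithmetic explicit: Chebyshev's inequality alone with $k = 3$ guarantees only

$$P(|X - \mu| < 3\sigma) \;\geq\; 1 - \frac{1}{3^2} = \frac{8}{9} \approx 0.889,$$

while the figure of roughly 0.95 is consistent with the tighter Vysochanskij-Petunin bound for unimodal distributions,

$$P(|X - \mu| < 3\sigma) \;\geq\; 1 - \frac{4}{9 \cdot 3^2} = 1 - \frac{4}{81} \approx 0.951.$$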
Three other issues remain to be clarified.
1) To Assign a code to V[i] means that we generate a sequence of numbers between 1 and Ni and then randomly assign one of these numbers to every different value of V[i].
2) To Store the code [as in line (06)] does NOT mean that we store the assigned code (for this would imply storing a large set of sequences). Rather, we store the value of the calculated correlation along with the seed of the pseudo-random number generator from which the assignment was derived.
3) Thereafter, selecting the best code (i.e., the one yielding a correlation whose value is closest to μ), as in line (18), is a simple matter of recovering the seed of the pseudo-random number generator and regenerating the original random sequence from it.
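The following is a minimal sketch of these three points (in Python with NumPy; the data, the seed offset for variable j, and all names are illustrative):

import numpy as np

def assign_code(values, seed):
    """Point 1): map each different value of V[i] to a random integer in 1..Ni."""
    rng = np.random.default_rng(seed)
    distinct = sorted(set(values))
    codes = rng.permutation(np.arange(1, len(distinct) + 1))
    mapping = dict(zip(distinct, codes))
    return np.array([mapping[v] for v in values], dtype=float)

# Illustrative categorical data for two variables i and j.
vi = ["low", "mid", "high", "mid", "low", "high", "mid", "low"]
vj = ["red", "blue", "red", "green", "blue", "green", "red", "blue"]

# Point 2): store only (seed, correlation) pairs, never the code sequences.
store = []
for seed in range(500):
    xi = assign_code(vi, seed)
    xj = assign_code(vj, seed + 10**6)          # an independent seed for j
    store.append((seed, np.corrcoef(xi, xj)[0, 1]))

# Point 3): recover the best code by re-seeding the generator.
mu = np.mean([r for _, r in store])
best_seed = min(store, key=lambda sr: abs(sr[1] - mu))[0]
best_code_i = assign_code(vi, best_seed)        # regenerated, not retrieved
best_code_j = assign_code(vj, best_seed + 10**6)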
3 CASE STUDY: PROFILE OF CAUSES OF DEATH IN A LARGE HUMAN POPULATION
In order to illustrate our method we analyzed a database recording the life span and cause of death of 50,000 individuals between the years 1900 and 2007. The confidentiality of the data has been preserved by changing the locations and regions involved. Otherwise, the data are a faithful replica of the original.
The database contains 50,000 tuples consisting
of 11 fields: BirthYear, LivingIn, DeathPlace,
DeathYear, DeathMonth, DeathCause, Region, Sex,
AgeGroup, AilmentGroup and InterestGroup.
Therefore, our working database has 10 dimensions, since the last variable (InterestGroup) corresponds to interest groups identified by human healthcare experts in this particular case and was not considered. This field amounts to a heuristic clustering of the data and could be used for a final comparison between the experts' clusters and the clusters our method produces. We will explore this line in future work.
Once the data were encoded, we proceeded to use an unsupervised learning method for the clustering process. First we needed to determine the number of clusters into which we would group our sample.
We applied the fuzzy c-means algorithm to our coded sample. To determine the number of clusters we experimented with 17 different possibilities (from 2 to 18 clusters). At each step we calculated the partition coefficient (pc) and classification entropy (pe) of the clustered data; see (Lee, 2005), (Shannon, 1949), (Vinh, 2009). Plotting
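A minimal sketch of this selection loop follows (in Python, assuming the scikit-fuzzy package; we do not claim this reproduces our implementation, and the random array merely stands in for the coded sample):

import numpy as np
import skfuzzy as fuzz

def classification_entropy(u):
    """pe = -(1/N) * sum over samples and clusters of u * log(u)."""
    return float(-np.sum(u * np.log(u + 1e-12)) / u.shape[1])

# Coded sample as a (features x samples) array, the layout skfuzzy expects.
data = np.random.default_rng(0).random((10, 500))   # stand-in for the real data

for c in range(2, 19):                              # the 17 candidate counts
    cntr, u, *rest, fpc = fuzz.cluster.cmeans(
        data, c=c, m=2.0, error=1e-5, maxiter=1000, seed=0)
    print(c, fpc, classification_entropy(u))        # pc and pe for each c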