However, for a given database, selecting the appropriate algorithm and the optimal number of clusters is usually a difficult task, as is producing an interpretation suited to the application area.
The clValid R package (Brock et al., 2008) was used for this purpose. It provides three types of validation measures for assessing the goodness of clustering results and for identifying the best-performing clustering algorithm.
We used internal validation, which relies only on intrinsic information in the data. The internal measures reflect the compactness, connectedness, and separation of the cluster partitions. They are:
Connectivity. Indicates the degree to which observations are placed in the same cluster as their nearest neighbors in the data space. It takes values between 0 and infinity and should be minimized.
Average silhouette width. Measures the degree of confidence in the clustering assignment of a particular observation. It takes values between -1 and 1 and should be maximized.
Dunn index. The ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. It takes values between 0 and infinity and should be maximized.
The Dunn index and silhouette width are both non-
linear combinations of the compactness and
separation.
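For a fixed partition, these three measures can be computed directly with functions from the clValid and cluster packages. The following sketch assumes a numeric data frame named dat and a k-means partition with three clusters, both of which are placeholders rather than the exact settings used in this work:

  library(clValid)   # provides connectivity() and dunn()
  library(cluster)   # provides silhouette()
  d  <- dist(dat)                             # Euclidean distance matrix
  cl <- kmeans(dat, centers = 3)$cluster      # any partition; here k-means with k = 3
  connectivity(distance = d, clusters = cl)   # lower is better
  mean(silhouette(cl, d)[, "sil_width"])      # average silhouette width, higher is better
  dunn(distance = d, clusters = cl)           # higher is better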
K-means is an unsupervised learning algorithm that groups data based on similarity. It is unsupervised because there is no outcome to be predicted; the algorithm simply looks for patterns in the data. The number of clusters must be specified in advance. The algorithm first assigns each record to a cluster at random and computes the center of each cluster. It then iterates, reassigning each data point to the cluster whose center is closest and recomputing the center of each cluster. These two steps are repeated until there is no significant variation in the objective, calculated as the sum of the Euclidean distances between the data points and their respective cluster centers.
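This procedure corresponds to the base R function kmeans. The sketch below uses a placeholder data frame homes and three clusters; neither is necessarily the configuration used in the study:

  set.seed(123)                                  # the initial assignment is random
  km <- kmeans(homes, centers = 3, nstart = 25)  # 25 random starts, the best one is kept
  km$cluster                                     # cluster assigned to each record
  km$centers                                     # final cluster centers
  km$tot.withinss                                # total within-cluster sum of squared distances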
PAM stands for “Partitioning Around Medoids”. It finds a set of medoids that are centrally located within their clusters. The medoids are placed in S, the set of selected objects; if O is the universe of objects, then U = O - S contains the unselected objects. The goal is to minimize the average dissimilarity of objects to their closest selected object.
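A minimal sketch with the pam function of the cluster package, again on a placeholder data frame homes with three clusters, would be:

  library(cluster)
  pm <- pam(homes, k = 3)   # dissimilarities are computed internally (Euclidean by default)
  pm$medoids                # the k selected objects, i.e. the set S
  pm$clustering             # cluster membership of every object
  pm$objective              # average dissimilarity after the build and swap phases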
Hierarchical Clustering involves creating clusters that have a predetermined ordering from top to bottom. The basic process, illustrated by the R sketch after the steps, is:
1. Assign each item to its own cluster. The distances
(similarities) between the clusters equal the
distances between the items they contain.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered
into a single cluster of size N.
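In base R this process is carried out by hclust, as in the following sketch (homes is a placeholder data frame and the cut at two clusters is only an example):

  d  <- dist(homes)                     # step 1: pairwise distances between items
  hc <- hclust(d, method = "complete")  # steps 2-4: repeatedly merge the closest clusters
  groups <- cutree(hc, k = 2)           # cut the dendrogram to obtain a flat partition
  table(groups)                         # size of each cluster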
4.2 Cluster Algorithms Evaluation
We started by checking the maximum number of people per home reported in the questionnaires. The maximum was 10, but each person has different characteristics such as age, gender, and type of family member, which might be head of household, spouse or partner, child, domestic worker, another relationship, or guest.
There is great diversity in the characteristics of the columns of the original database, so we extracted the characteristics that allow a more natural and familiar classification of the social needs that the families present. Columns 18-27 of the database record the type of each family member, columns 29-37 the age of each member, and columns 38-47 the gender of each one. It is important to note that each home has between 1 and 10 members, so some records contain NA values, although there is always a head of household. To deal with this, the preprocessing reduced the number of columns per home from 30 to 4. The first new column refers to the gender of the head of household, the second to whether there is at least one child, the third to whether the head of household has a partner, and the last to whether there is a person over 64 living in the home. This made it possible to work with all 3,000 questionnaires, since every record now contains only 1s and 0s, which the clustering algorithms can handle. The four indicators, built as sketched after the list below, are:
- Gender of the head of household: 1 for men and 0 for women.
- Children: 1 if there is at least one child, 0 otherwise.
- Partner: 1 if the head of household has a partner, 0 otherwise.
- Older than 64: 1 if a person aged 64 or older lives in the house, 0 otherwise; 64 is the age most commonly used in Mexico to define an elderly person.
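A sketch of this reduction is given below; the column names (member_type_1 ... member_type_10, age_1 ... age_10, gender_1) and the category labels are hypothetical placeholders for the actual questionnaire columns, and raw denotes the original database:

  member_cols <- paste0("member_type_", 1:10)   # hypothetical names for columns 18-27
  age_cols    <- paste0("age_", 1:10)           # hypothetical names for the age columns
  homes <- data.frame(
    head_gender = as.integer(raw$gender_1 == "male"),   # member 1 is the head of household
    children    = as.integer(rowSums(raw[member_cols] == "child", na.rm = TRUE) > 0),
    partner     = as.integer(rowSums(raw[member_cols] == "spouse or partner", na.rm = TRUE) > 0),
    older_64    = as.integer(apply(raw[age_cols], 1, function(a) any(a >= 64, na.rm = TRUE)))
  )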
The results obtained by the clValid function (Brock et
al., 2008) are shown in Table 4.
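The comparison itself can be reproduced with a call along the following lines; the data frame name homes and the range of cluster numbers are assumptions of this sketch, not the exact settings reported in Table 4:

  library(clValid)
  cv <- clValid(homes, nClust = 2:6,
                clMethods = c("hierarchical", "kmeans", "pam"),
                validation = "internal")
  summary(cv)         # connectivity, Dunn index and silhouette per method and cluster number
  optimalScores(cv)   # best value of each internal measure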
We discarded the two-cluster hierarchical solution, because we are trying to bring services more in line with the physiological needs of a given region.