symbol and unitary cost for the remaining string editing operations. Moreover, differ-
ent types of normalization are used, namely the classical normalization by the string
length (NSED), and the normalization by the length of the editing path, the normalized
string edit distance (NSEDL).
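As a concrete illustration of the two normalizations, the following sketch computes the unit-cost edit distance together with the length of one optimal editing path. This is a simplification: the exact path-normalized edit distance minimizes the cost-to-length ratio over all editing paths, whereas this sketch normalizes the optimal distance by the length of one optimal alignment. All function names are illustrative.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance; also returns the length of one
    optimal editing path (operations plus matches)."""
    m, n = len(s1), len(s2)
    # dp[i][j] = (distance, path length) for prefixes s1[:i], s2[:j]
    dp = [[(0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, i)
    for j in range(1, n + 1):
        dp[0][j] = (j, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(
                (dp[i - 1][j][0] + 1, dp[i - 1][j][1] + 1),            # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1][1] + 1),            # insertion
                (dp[i - 1][j - 1][0] + cost, dp[i - 1][j - 1][1] + 1), # sub/match
            )
    return dp[m][n]

def nsed(s1, s2):
    """Normalization by string length (here: the longer string)."""
    d, _ = edit_distance(s1, s2)
    return d / max(len(s1), len(s2), 1)

def nsedl(s1, s2):
    """Normalization by the length of the editing path."""
    d, path = edit_distance(s1, s2)
    return d / path if path else 0.0
```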
In a different paradigm, structural resemblance approaches use grammars to model the
cluster's structure, and rules of composition of clusters are assumed to reflect the simi-
larity between patterns. Several approaches are described in the literature: Fu proposed
a distance between strings based on the concept of error correcting parsing (ECP); Fred
explored the notion of compressibility of sequences and algorithmic complexity using
Solomonoff’s code (SOLO); another approach by Fred, the ratio of decrease in grammar
complexity (RDGC), is based on the idea that if two sentences are structurally similar,
then their joint description is more compact than their isolated descriptions, due to shared
rules of symbol composition. For details on how to compute these measures consult,
for instance, [2] and the references therein.
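The compressibility intuition behind SOLO and RDGC can be illustrated with an off-the-shelf compressor standing in for Solomonoff coding or grammar complexity. The sketch below is a normalized-compression-style approximation, not the measures described in [2]; the names are illustrative.

```python
import zlib

def c(s: str) -> int:
    """Compressed size as a practical stand-in for description complexity."""
    return len(zlib.compress(s.encode()))

def compression_similarity(s1: str, s2: str) -> float:
    """If two strings share structure, their joint description compresses
    better than their separate descriptions (the intuition behind RDGC).
    Higher values indicate more shared structure."""
    joint = c(s1 + s2)
    return 1.0 - (joint - min(c(s1), c(s2))) / max(c(s1), c(s2))
```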
Clustering Algorithms. Several clustering algorithms are addressed using both the
partitional and hierarchical agglomerative approaches.
In the first approach, one of the best known and most widely used clustering algorithms
is the K-means algorithm. To apply it to string descriptions, we adapted it to use the
proximity measures described previously for string pairs; moreover, the cluster proto-
types are selected as the median string. A nearest-neighbor approach, which we will
refer to as Fu-NN, was also explored, adopting the string edit distance (SED) as dis-
tance measure. The nearest-neighbor rule is the basis for
another algorithm, where clusters are modeled by grammars, but where sequences are
compared, not directly with patterns previously included in clusters, but with the best
matching elements in languages generated by the grammars inferred from clustered
data. For grammatical inference we used Crespi-Reghizzi's method [2], without as-
suming a priori information. We will refer to this method as Fu-ECP. In a different per-
spective, spectral clustering algorithms [8] map the original data set into a different
feature space based on the eigenvectors of an affinity matrix, with a clustering method
then applied in the new feature space. To extend the applicability of the method to
string patterns, the definition of the affinity matrix is derived from the normalized string
edit distance (NSEDL).
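A minimal sketch of how such an affinity matrix can be built from a normalized string distance follows; the Gaussian kernel and the `sigma` scale parameter are common choices in spectral clustering, assumed here rather than taken from the text.

```python
import numpy as np

def affinity_matrix(strings, dist, sigma=0.5):
    """Gaussian affinity built from a pairwise string distance `dist`
    (e.g. a normalized string edit distance), suitable as input to a
    spectral clustering algorithm. `sigma` is a scale parameter to be
    tuned per data set."""
    n = len(strings)
    a = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(strings[i], strings[j])
            a[i, j] = a[j, i] = np.exp(-d * d / (2 * sigma ** 2))
    return a
```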
In the hierarchical perspective [9], we explore the classical Single Link (SL),
Complete Link (CL), Average Link (AL), Ward's Link (WL), and Centroid Based Link
(Centroid) methods [9]. To convert the similarity measures defined above, generically
referred to as $S(s_1, s_2)$, into dissimilarity measures, we use: $d(s_1, s_2) = \max(\text{similarity}) - S(s_1, s_2)$.
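In code, this conversion is direct (a minimal sketch; `to_dissimilarity` is an illustrative name):

```python
def to_dissimilarity(sim_matrix):
    """Convert a pairwise similarity matrix into dissimilarities via
    d(s1, s2) = max(similarity) - S(s1, s2), as defined above."""
    max_sim = max(max(row) for row in sim_matrix)
    return [[max_sim - s for s in row] for row in sim_matrix]
```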
2.2 Clusterings Combination
Several combination methods have been proposed to obtain the combined solution, $P^*$
[3–5].
Fred and Jain proposed a method, the Evidence Accumulation Clustering (EAC),
for finding consistent data partitions, where the combination of the clustering ensemble is
performed by transforming the partitions into a co-association matrix, which maps the coher-
ent associations and represents a new similarity measure between patterns. To find the
number of clusters in an unsupervised way, the lifetime method [3] can be used. Strehl and Ghosh