scenario. In general, this approach can be used for unweighted and weighted graphs. However, we observe that it is possible to extend this method also to graphs with real vectors as weights on the edges, by taking into account the (Euclidean) norm $W_{ij} = \|\nu(e_{ij})\|$ of each edge vector, under the assumption that the higher the norm, the stronger the relationship.
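For concreteness, the following minimal sketch (in Python with NumPy; the data layout and function name are ours and purely illustrative) builds the scalar weight matrix from vector-valued edge labels via their Euclidean norm.

```python
import numpy as np

def weight_matrix(num_vertices, edge_vectors):
    """Build W where W[i, j] = ||nu(e_ij)||_2 for each labeled edge.

    edge_vectors: dict mapping (i, j) vertex pairs to NumPy vectors
    (the labels nu(e_ij)); missing pairs mean no edge (weight 0).
    """
    W = np.zeros((num_vertices, num_vertices))
    for (i, j), vec in edge_vectors.items():
        W[i, j] = np.linalg.norm(vec)  # higher norm -> stronger relationship
    return W

# Example: a 3-vertex graph with 2-dimensional edge labels in [0, 1]^2
edges = {(0, 1): np.array([0.9, 0.4]), (1, 2): np.array([0.1, 0.2])}
print(weight_matrix(3, edges))
```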
In this paper, we will make use of this seriation
method (Robles-Kelly and Hancock, 2005) for three
types of edge-labeled graphs: unweighted, weighted
with a scalar in $[0,1]$ and with a vector in $[0,1]^n$. Note that the vertices can be arbitrarily labeled, but their labels are irrelevant in this particular phase.
3.2 Sequence Mining and Embedding
In order to represent a seriated graph as a real-valued vector, we propose a method called GRADIS (GRanular Approach for DIscrete Sequences), aimed at performing two crucial tasks: the identification of inexact substructures, which form the alphabet of symbols, and the embedding of the sequences using the symbolic histogram approach (Del Vescovo and Rizzi, 2007). With a little formalism, the proposed methodology, seen from a high level of abstraction, can be characterized by a mapping function $f : \Sigma \to E$, with $E \subseteq \mathbb{R}^n$, that maps each sequence to a numeric vector. In this way, the problem is transferred to the Euclidean domain.
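As a preview of the embedding task, the following simplified sketch illustrates a symbolic-histogram style mapping: each component of the embedded vector counts the occurrences of one alphabet symbol in the sequence. It uses exact matching for brevity, whereas the actual method relies on inexact matching (see below); all names are illustrative.

```python
def symbolic_histogram(sequence, alphabet):
    """Map a sequence to a vector in R^d, d = |A|: component j counts how
    many contiguous subsequences of `sequence` equal the symbol a_j.
    (Simplified: the real method allows inexact matches within a tolerance.)"""
    h = [0] * len(alphabet)
    for j, symbol in enumerate(alphabet):
        k = len(symbol)
        h[j] = sum(1 for s in range(len(sequence) - k + 1)
                   if tuple(sequence[s:s + k]) == tuple(symbol))
    return h

# Example: embed the sequence (A,B,C,B,C) against a 2-symbol alphabet
print(symbolic_histogram("ABCBC", ["BC", "AB"]))  # -> [2, 1]
```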
Given an input dataset of sequences (i.e., sequences of characters) $S = \{s_1, \dots, s_q\}$, generated from a finite alphabet $\Omega$, the first task is to identify a set $A = \{a_1, \dots, a_d\}$ of symbols. These symbols are pattern substructures that recur in the
input dataset. The identification of this set is per-
formed using an inexact matching strategy in con-
junction with a clustering procedure. Given a lower
l and upper L limit for the length of the subse-
quences, a variable length n-gram analysis is per-
formed on each input sequence of the dataset. An
n-gram of a given sequence s is defined as a con-
tiguous subsequence of s of length n. For in-
stance, if an input sequence is s = (A,B,C,D), with
$l = 2$, $L = 3$, we obtain the set of n-grams $\mathcal{N}_s = \{(A,B), (A,B,C), (B,C), (B,C,D), (C,D)\}$. The cardinality of this set is then $|\mathcal{N}_s| = \sum_{i=l}^{L} (n - i + 1) = O(n^2)$, where $L \le n = \mathrm{len}(s)$ ($\mathrm{len}(s)$ is the number of elements in the sequence $s$). However, normally $L \ll n$ and thus $|\mathcal{N}_s| \simeq c \cdot n$, where $c > 1$ is a constant factor. If $M$ is the maximum length of a sequence in the input dataset $S$, the cardinality of the whole set of n-grams $\mathcal{N} = \{n_1, \dots, n_{|\mathcal{N}|}\}$ of $S$ is upper bounded by $|\mathcal{N}| \le |S| \cdot \sum_{i=l}^{L} (M - i + 1) \simeq |S| \cdot c \cdot M$.
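The variable-length n-gram extraction described above can be sketched as follows (an illustrative Python snippet; the function name is ours):

```python
def ngrams(sequence, l, L):
    """Return all contiguous subsequences of `sequence` with length
    between l and L (inclusive), i.e. the set N_s of variable-length n-grams."""
    n = len(sequence)
    out = []
    for length in range(l, min(L, n) + 1):
        for start in range(n - length + 1):
            out.append(tuple(sequence[start:start + length]))
    return out

s = ("A", "B", "C", "D")
print(ngrams(s, 2, 3))
# [('A','B'), ('B','C'), ('C','D'), ('A','B','C'), ('B','C','D')]
# |N_s| = sum_{i=l}^{L} (n - i + 1) = 3 + 2 = 5
```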
Since the cardinality of $\mathcal{N}$ can become large, it is convenient to perform this analysis on a subset of $\mathcal{N}$, say $\mathcal{N}^*$. For this purpose, a probabilistic selection of the n-grams can be performed: with a user-defined probability $p$ an n-gram is selected, and with probability $1 - p$ it is discarded. Hence, given the set $\mathcal{N}$ and a selection probability $p$, the number of chosen n-grams follows a Binomial distribution: the expected number of selected n-grams in $\mathcal{N}^*$ is $|\mathcal{N}| \cdot p$, with variance $|\mathcal{N}| \cdot p \cdot (1 - p)$.
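The probabilistic selection of $\mathcal{N}^*$ amounts to one independent Bernoulli trial per n-gram, as in the following sketch (the seed parameter is only for reproducibility):

```python
import random

def subsample(ngram_set, p, seed=None):
    """Keep each n-gram independently with probability p; the size of the
    resulting subset N* follows a Binomial(|N|, p) distribution, with
    mean |N| * p and variance |N| * p * (1 - p)."""
    rng = random.Random(seed)
    return [g for g in ngram_set if rng.random() < p]
```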
The set $\mathcal{N}^*$ is subjected to a clustering procedure based on the BSAS algorithm. The Levenshtein distance is used as the dissimilarity measure between n-grams, and each cluster is represented by its MinSOD (Minimum Sum Of Distances) element. The clustering procedure aims to identify a list of different partitions of $\mathcal{N}^*$, say $L = (P_1, \dots, P_z)$. In fact, the outcome of the BSAS algorithm is known to depend strongly on a clustering parameter called $\Theta$. The optimal value of $\Theta$ is automatically determined using a logarithmic search algorithm: the $\Theta$ search interval (usually $[0,1]$) is recursively split in halves, stopping each time two successive values of the parameter yield the same partition; the recursive search stops in any case when the distance between two successive $\Theta$ values falls under a given threshold.
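A compact, illustrative reconstruction of this clustering step is sketched below: a one-pass BSAS with a normalized Levenshtein dissimilarity, MinSOD representatives, and a bisection search over $\Theta$. This is our own sketch of the procedure described above, not the reference implementation; in particular, the normalization of the distance to $[0,1]$ and the stopping details are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def dissimilarity(a, b):
    """Levenshtein distance normalized to [0, 1] so that it is comparable
    with a Theta value drawn from [0, 1] (the normalization is our assumption)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def minsod(cluster):
    """MinSOD representative: the member minimizing the sum of distances
    to all the other members of the cluster."""
    return min(cluster, key=lambda c: sum(dissimilarity(c, o) for o in cluster))

def bsas(ngrams, theta):
    """One-pass BSAS: each n-gram joins the cluster whose MinSOD
    representative is closer than theta, otherwise it opens a new cluster."""
    clusters = []
    for g in ngrams:
        if clusters:
            dists = [dissimilarity(g, minsod(c)) for c in clusters]
            best = min(range(len(clusters)), key=dists.__getitem__)
            if dists[best] <= theta:
                clusters[best].append(g)
                continue
        clusters.append([g])
    return clusters

def theta_search(ngrams, lo=0.0, hi=1.0, eps=0.05):
    """Bisect the Theta interval, collecting the distinct partitions produced
    by BSAS; a branch stops when its endpoints yield the same partition or
    when the interval becomes narrower than eps."""
    partitions = []
    def explore(a, b):
        pa, pb = bsas(ngrams, a), bsas(ngrams, b)
        for p in (pa, pb):
            if p not in partitions:
                partitions.append(p)
        if pa == pb or (b - a) < eps:
            return
        mid = (a + b) / 2.0
        explore(a, mid)
        explore(mid, b)
    explore(lo, hi)
    return partitions  # the list L = (P_1, ..., P_z)
```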
Each partition $P_i = \{C_1, \dots, C_{u_i}\}$, $i = 1, \dots, z$, is composed of $u_i$ clusters of n-grams. Successively, these clusters are subjected to a validation analysis that takes into account
the quality of the cluster. For this purpose, for each
cluster $C_j \in P_i$, a measure of cluster compactness cost $K(C_j)$ and size cost $S(C_j)$ is evaluated. Starting from these measures, the following convex combination is defined as the total cluster cost
$$\Gamma(C_j) = (1 - \mu) \cdot K(C_j) + \mu \cdot S(C_j) \qquad (1)$$
where $K(C_j) = \frac{1}{n-1} \sum_{i=1}^{n-1} d(n_i, n^{SOD}_{C_j})$ (with $n$ the cardinality of $C_j$), $n^{SOD}_{C_j}$ is the MinSOD element of the cluster $C_j$, and $d(\cdot,\cdot)$ is the Levenshtein distance. The size cost is defined as $S(C_j) = 1 - \frac{|C_j|}{|\mathcal{N}^*|}$. If the cost $\Gamma(C_j)$ is lower than a given threshold $\tau$, the cluster is retained; otherwise it is rejected.
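Under the definitions above, the validation of a partition can be sketched as follows; the dissimilarity and representative functions are passed as parameters (so the helpers of the previous sketch can be plugged in), and the values of $\mu$ and $\tau$ in the usage comment are arbitrary examples.

```python
def compactness_cost(cluster, dist, rep):
    """K(C_j): average distance of the other cluster members from the
    MinSOD representative (rep contributes zero to the sum)."""
    if len(cluster) < 2:
        return 0.0
    return sum(dist(g, rep) for g in cluster) / (len(cluster) - 1)

def size_cost(cluster, total):
    """S(C_j) = 1 - |C_j| / |N*|: small clusters are penalized."""
    return 1.0 - len(cluster) / total

def total_cost(cluster, dist, rep, total, mu):
    """Gamma(C_j): convex combination of compactness and size costs (Eq. 1)."""
    return (1.0 - mu) * compactness_cost(cluster, dist, rep) \
        + mu * size_cost(cluster, total)

def accepted_clusters(partition, dist, rep_of, total, mu, tau):
    """Retain the clusters whose total cost falls below the threshold tau."""
    kept = []
    for cluster in partition:
        rep = rep_of(cluster)
        if total_cost(cluster, dist, rep, total, mu) < tau:
            kept.append((rep, cluster))  # the representative becomes a symbol a_j
    return kept

# Usage (with the `dissimilarity` and `minsod` helpers of the previous sketch):
# symbols = accepted_clusters(partition, dissimilarity, minsod, len(ngrams),
#                             mu=0.5, tau=0.4)
```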
Each (accepted) cluster $C_j$ is modeled by a representative element, that is, the n-gram $n^{SOD}_{C_j}$ that minimizes the sum of distances to all the other elements of the cluster. This representative element, say $a_j = n^{SOD}_{C_j}$, is considered as a symbol of the input dataset and is added to the alphabet $A$. At the end of this clustering procedure, the alphabet $A$ of the input dataset is determined. Actually, each symbol $a_j$ is defined as a triple $(n^{SOD}_{C_j},\, K(C_j) \cdot \varepsilon,\, 1 - \Gamma(C_j))$, where $K(C_j) \cdot \varepsilon$ is a factor used in the subsequent embedding phase ($\varepsilon$ is again a user-defined tolerance) and $1 - \Gamma(C_j)$ is the ranking score (quality value of the cluster $C_j$) of the symbol $a_j$ in the alphabet $A$. Our