is also the nearest to the ‘virtual’ true center of the
set (Studer et al., 2009).
Sequence likelihood. The sequence likelihood $P(s)$ is defined as the
product of the probabilities with which each of its successive observed
states is supposed to occur at its position. Let $s = s_1 s_2 \cdots s_\ell$
be a sequence of length $\ell$. Then

$$P(s) = P(s_1, 1) \cdot P(s_2, 2) \cdots P(s_\ell, \ell)$$

with $P(s_t, t)$ the probability of observing state $s_t$ at position $t$.
The question is how to determine the state probabilities $P(s_t, t)$. One
commonly used method for computing them is to postulate a Markov model,
which can be of various orders. Below, we consider only probabilities
derived from the first-order Markov model, that is, each $P(s_t, t)$,
$t > 1$, is set to the transition rate $p(s_t \mid s_{t-1})$ estimated
across sequences from the observations at positions $t$ and $t-1$. For
$t = 1$, we set $P(s_1, 1)$ to the observed frequency of the state $s_1$ at
position 1. Since the likelihood $P(s)$ is generally very small, we use
$-\log P(s)$ as the sorting criterion. This quantity reaches its minimum, 0,
when $P(s)$ is equal to 1, so the sequences are sorted in ascending order of
their score.
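The likelihood criterion can be sketched as follows. This is only an illustration under stated assumptions: sequences are non-empty strings with one character per state, transition rates are estimated separately for each pair of positions (t-1, t), which is one reading of "estimated across sequences from the observations at positions t and t-1", and the function name `neg_log_likelihoods` is hypothetical.

```python
import math
from collections import Counter


def neg_log_likelihoods(sequences):
    """Return -log P(s) for each sequence under a first-order Markov model.

    Assumption: P(s_1, 1) is the observed frequency of s_1 at position 1,
    and for t > 1, P(s_t, t) is the rate p(s_t | s_{t-1}) estimated from
    the states observed at positions t-1 and t across all sequences.
    """
    n = len(sequences)
    first = Counter(s[0] for s in sequences)
    pair_counts = {}  # pair_counts[t][(a, b)]: a at position t-1, b at t
    prev_counts = {}  # prev_counts[t][a]: a at position t-1, t observed
    for s in sequences:
        for t in range(1, len(s)):
            pair_counts.setdefault(t, Counter())[(s[t - 1], s[t])] += 1
            prev_counts.setdefault(t, Counter())[s[t - 1]] += 1

    scores = []
    for s in sequences:
        score = -math.log(first[s[0]] / n)
        for t in range(1, len(s)):
            p = pair_counts[t][(s[t - 1], s[t])] / prev_counts[t][s[t - 1]]
            score -= math.log(p)
        scores.append(score)
    return scores


sequences = ["AAB", "AAB", "ABB", "BBB"]
scores = neg_log_likelihoods(sequences)
# Candidates sorted in ascending order of -log P(s), best first.
ranking = sorted(range(len(sequences)), key=lambda k: scores[k])
```

Working with the sum of logarithms rather than the product of probabilities avoids numerical underflow for long sequences.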
3.2 Eliminating Redundancy
Once a sorted list of candidates has been defined, the
second stage consists in eliminating redundancy since
we do not want our representative set to contain simi-
lar sequences. The procedure is as follows:
• Select the first sequence in the candidate list (the
best one given the chosen criterion);
• Process each subsequent sequence in the sorted list of
candidates. If this sequence is not similar to any of
those already in the representative set, that is, if its
distance from each of them exceeds a predefined
threshold, add it to the representative set.
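The two steps above amount to a greedy filter over the sorted candidate list. A minimal sketch, assuming candidates are given as indices into a precomputed pairwise distance matrix (the function and argument names are illustrative, not taken from the paper):

```python
def eliminate_redundancy(candidates, dist, threshold):
    """Greedily keep candidates that are far enough from all kept ones.

    candidates: sequence indices sorted best-first by the chosen criterion.
    dist: square matrix (list of lists) of pairwise distances.
    threshold: two sequences are considered similar when their distance
        does not exceed this value.
    """
    representatives = []
    for c in candidates:
        # all() over an empty list is True, so the first candidate is kept.
        if all(dist[c][r] > threshold for r in representatives):
            representatives.append(c)
    return representatives


dist = [
    [0, 1, 5],
    [1, 0, 4],
    [5, 4, 0],
]
kept = eliminate_redundancy([0, 1, 2], dist, threshold=2)
```

In this toy example the second candidate is dropped because it lies within the threshold of the first, while the third is kept.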
The threshold for sequence similarity is defined as a proportion of the
maximal theoretical distance. For the OM distance, this theoretical maximum
is, for two sequences $(s_1, s_2)$ of lengths $(\ell_1, \ell_2)$,

$$D_{max} = \min(\ell_1, \ell_2) \cdot \min\big(2C_I, \max(S)\big) + |\ell_1 - \ell_2| \cdot C_I$$

where $C_I$ is the indel cost and $\max(S)$ the maximal substitution cost.
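As a worked illustration of the formula for the maximal OM distance, the sketch below computes it and derives a similarity threshold as a proportion of that maximum. The function names and the numeric values are assumptions chosen for the example.

```python
def om_max_distance(len1, len2, indel_cost, max_subst_cost):
    """Theoretical maximum OM distance between sequences of given lengths.

    Each of the min(len1, len2) aligned positions costs at most the cheaper
    of one substitution or one deletion plus one insertion; the remaining
    |len1 - len2| positions each require one indel.
    """
    return (min(len1, len2) * min(2 * indel_cost, max_subst_cost)
            + abs(len1 - len2) * indel_cost)


def similarity_threshold(proportion, len1, len2, indel_cost, max_subst_cost):
    """Threshold expressed as a proportion of the maximal distance."""
    return proportion * om_max_distance(len1, len2, indel_cost, max_subst_cost)


d_max = om_max_distance(10, 8, indel_cost=1, max_subst_cost=2)
threshold = similarity_threshold(0.10, 10, 8, indel_cost=1, max_subst_cost=2)
```

With lengths 10 and 8, an indel cost of 1 and a maximal substitution cost of 2, the maximum is 8 · 2 + 2 · 1 = 18.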
3.3 Size of the Representative Set
Limiting our representative set to the sequence(s) with the best
representativeness score may leave a large number of sequences poorly
represented. Conversely, processing the complete list of candidates to
achieve full coverage of the data set is not a suitable solution either,
since we are looking for a small set of representative sequences.
To control the size of the representative set, we
limit the size of the candidate list so that the cu-
mulated frequency of the retained distinct candidates
reaches a threshold proportion trep of the whole data
set. Setting for instance trep = 25% ensures that at
least 25% of the sequences will have a representative
in their neighborhood and that the final representative
set will contain at most 25% of the distinct sequences
in the whole set. Thus trep also defines a minimum
coverage level.
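One possible reading of this rule is sketched below: distinct candidates are taken in sorted order until their cumulated frequency reaches the proportion trep of the data set. The function name, the dictionary of frequencies and the stopping convention are assumptions made for illustration; the paper does not spell out these details.

```python
def limit_candidates(sorted_candidates, frequencies, trep):
    """Truncate the candidate list once cumulated frequency reaches trep.

    sorted_candidates: distinct sequences, best first.
    frequencies: dict mapping each distinct sequence to its number of
        occurrences in the data set.
    trep: threshold proportion of the whole data set (e.g. 0.25).
    """
    n = sum(frequencies.values())
    retained, cumulated = [], 0
    for seq in sorted_candidates:
        if cumulated >= trep * n:
            break
        retained.append(seq)
        cumulated += frequencies[seq]
    return retained


frequencies = {"AAB": 5, "ABB": 3, "BBB": 2}
ordered = ["AAB", "ABB", "BBB"]
short_list = limit_candidates(ordered, frequencies, trep=0.25)
```

Here the first candidate alone accounts for 50% of the ten sequences, so a threshold of 25% retains only that candidate.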
There are of course other possible ways of controlling the size of the
representative set, such as fixing a) the number or the proportion of
sequences in the final representative set, or b) the desired coverage level.
3.4 Measuring Quality
A first step in defining quality measures for the representative set is to
assign each sequence to its nearest representative according to the
considered pairwise distances. Let $r_1 \ldots r_{nr}$ be the $nr$ sequences
in the representative set and $d(s, r_i)$ the distance between the sequence
$s$ and the $i$th representative. Each sequence $s$ is assigned to its
closest representative. When a sequence is equally distant from two or more
representatives, the one with the highest representativeness score is
selected. Hence, letting $n$ be the total number of sequences and $na_i$ the
number of sequences assigned to the $i$th representative, we have
$n = \sum_{i=1}^{nr} na_i$. Once each sequence in the set is assigned to a
representative, we can derive the following quantities from the pairwise
distance matrix.
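The assignment step, including the tie-breaking rule, can be sketched as follows. The layout of `dist_to_reps` (one row per sequence, one column per representative) and the convention that a larger value in `rep_scores` means a more representative sequence are assumptions made for this example.

```python
def assign_to_representatives(dist_to_reps, rep_scores):
    """Assign each sequence to its closest representative.

    dist_to_reps[j][i]: distance d(s_j, r_i) from sequence j to
        representative i.
    rep_scores[i]: representativeness score of representative i (assumed
        here: higher means more representative). Used to break exact ties.
    Returns a list giving, for each sequence, the index of its representative.
    """
    assignment = []
    for row in dist_to_reps:
        # Smallest distance first; among equal distances, highest score.
        best = min(range(len(row)), key=lambda i: (row[i], -rep_scores[i]))
        assignment.append(best)
    return assignment


dist_to_reps = [
    [0, 3],
    [2, 2],  # equally distant from both representatives
    [4, 1],
]
assignment = assign_to_representatives(dist_to_reps, rep_scores=[0.2, 0.9])
# na[i]: number of sequences assigned to representative i.
na = [assignment.count(i) for i in range(2)]
```

Since every sequence receives exactly one representative, the counts in `na` sum to the total number of sequences.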
Mean distance. Let $SD_i = \sum_{j=1}^{na_i} d(s_j, r_i)$ be the sum of
distances between the $i$th representative and its $na_i$ assigned
sequences. A quality measure is then

$$MD_i = \frac{SD_i}{na_i}$$

the mean distance to the $i$th representative.
Coverage. Another quality indicator is the number of sequences assigned to
the $i$th representative that are in its neighborhood, that is, within a
distance $dn_{max}$:

$$nb_i = \sum_{j=1}^{na_i} \big( d(s_j, r_i) < dn_{max} \big)$$

where the condition in parentheses counts as 1 when it holds and 0
otherwise. The threshold $dn_{max}$ is defined as a proportion of $D_{max}$.
The total coverage of the representative set is the sum
$nb = \sum_{i=1}^{nr} nb_i$ expressed as a proportion of the number $n$ of
sequences, that is $nb/n$.
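The mean distance and coverage measures can be computed together once each sequence has a representative. The sketch below assumes a matrix of distances to the representatives and a precomputed assignment list; these names and the returned dictionary are illustrative choices, not part of the paper.

```python
def quality_measures(dist_to_reps, assignment, dn_max):
    """Compute na_i, MD_i, nb_i and the total coverage nb/n.

    dist_to_reps[j][i]: distance from sequence j to representative i.
    assignment[j]: index of the representative assigned to sequence j.
    dn_max: neighborhood radius (strict inequality, as in the definition).
    """
    n = len(dist_to_reps)
    nr = len(dist_to_reps[0])
    na = [0] * nr    # sequences assigned to each representative
    sd = [0.0] * nr  # sum of distances to each representative
    nb = [0] * nr    # assigned sequences lying in the neighborhood
    for j, i in enumerate(assignment):
        d = dist_to_reps[j][i]
        na[i] += 1
        sd[i] += d
        if d < dn_max:
            nb[i] += 1
    # A representative with no assigned sequence has no defined mean.
    md = [sd[i] / na[i] if na[i] else float("nan") for i in range(nr)]
    return {"na": na, "MD": md, "nb": nb, "coverage": sum(nb) / n}


dist_to_reps = [
    [0, 3],
    [2, 2],
    [4, 1],
]
result = quality_measures(dist_to_reps, assignment=[0, 1, 1], dn_max=2)
```

In this example the second sequence lies exactly at distance 2 from its representative, so with a strict inequality it is not counted in the neighborhood.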
Distance gain. A third quality measure is obtained
by comparing the sum $SD_i$
of distances to the ith rep-
SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences