A and B, respectively, and let b be a “short” sequence from source B. As proposed by
Benedetto et al., the relative entropy D(A||B) (per symbol) can be estimated by
\[
D(A\|B) = (\Delta_{Ab} - \Delta_{Bb})/|b|, \tag{4}
\]
where $\Delta_{Ab} = L_{A+b} - L_{A}$ and $\Delta_{Bb} = L_{B+b} - L_{B}$. Notice that $\Delta_{Ab}/|b|$ can be seen
as the code length (per symbol) obtained when coding a sequence from B (sequence b)
using a code optimized for A, while $\Delta_{Bb}/|b|$ can be interpreted as an estimate of the
entropy of the source B.
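As a concrete illustration, the estimator (4) can be approximated by using the output size of an off-the-shelf compressor as a stand-in for the ideal code lengths L. The sketch below is a minimal rendering of this idea, not the exact setup of Benedetto et al.; the choice of zlib and the function names are our own assumptions:

```python
import zlib

def clen(s: bytes) -> int:
    """Compressed size in bytes, used as a proxy for the code length L."""
    return len(zlib.compress(s, 9))

def relative_entropy_estimate(A: bytes, B: bytes, b: bytes) -> float:
    """Estimate D(A||B) per symbol via Eq. (4): (Delta_Ab - Delta_Bb) / |b|."""
    delta_Ab = clen(A + b) - clen(A)  # code length of b under a code adapted to A
    delta_Bb = clen(B + b) - clen(B)  # proxy for the entropy of source B
    return (delta_Ab - delta_Bb) / len(b)
```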
To handle the text authorship attribution problem, Benedetto, Caglioti and Loreto
(BCL) [4] defined a simplified “distance” function d(A, B) between sequences,
\[
d(A, B) = \Delta_{AB} = L_{A+B} - L_{A}, \tag{5}
\]
which we will refer to as the BCL divergence. As mentioned before, $\Delta_{AB}$ is a measure of
the description length of B when the coding is optimized for A, obtained by subtracting
the description length of A from the description length of A+B. Hence, $d(A, B'') < d(A, B')$
means that $B''$ is more similar to A than $B'$ is. Notice that the
BCL divergence is not symmetric.
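A minimal sketch of (5), again assuming a generic off-the-shelf compressor (zlib here) as a stand-in for the code lengths; the function name is ours, not from [4]:

```python
import zlib

def bcl_divergence(A: bytes, B: bytes) -> int:
    """BCL divergence d(A, B) = L(A+B) - L(A), Eq. (5).
    Smaller values indicate that B is more similar to A."""
    clen = lambda s: len(zlib.compress(s, 9))  # compressed size as code length
    return clen(A + B) - clen(A)
```

As noted above, this measure is not symmetric: bcl_divergence(A, B) and bcl_divergence(B, A) will in general differ.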
More recently, Puglisi et al. [13] studied in detail what happens when a compression
algorithm, such as LZ77 [10], tries to optimize its features at the interface between two
different sequences A and B, while compressing the sequence A + B. After having
compressed sequence A, the algorithm starts compressing sequence B using the dictio-
nary that it has learned from A. After a while, however, the dictionary starts to become
adapted to sequence B, and when we are well into sequence B the dictionary will tend
to depend only on the specific features of B. That is, if B is long enough, the algorithm
learns to optimally compress sequence B. This is not a problem when sequence B is
sufficiently short for the dictionary not to become completely adapted to B, but it
becomes a serious problem for a long sequence B. The Ziv-Merhav method, described
next, does not suffer from this problem, which is what motivated us to consider it for
sequence classification problems [7].
3.2 Ziv-Merhav Empirical Divergence
The method proposed by Ziv and Merhav [3] for measuring relative entropy is also
based on two Lempel-Ziv-type parsing algorithms:
– The incremental LZ parsing algorithm [12], which is a self-parsing procedure
of a sequence into c(z) distinct phrases, such that each phrase is the shortest
sequence that is not a previously parsed phrase. For example, let n = 11 and z =
(01111000110); then the incremental self-parsing yields (0, 1, 11, 10, 00, 110),
namely, c(z) = 6.
– A variation of the LZ parsing algorithm described in [3], which is a sequential
parsing of a sequence z with respect to another sequence x (cross parsing). Let c(z|x)
denote the number of phrases in z with respect to x. For example, let z be as before
and x = (10010100110); then, parsing z with respect to x yields (011, 110, 00110),
that is, c(z|x) = 3. Both parsing procedures are illustrated in the sketch below.
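Both parsers translate directly into a few lines of code. The following sketch is our own unoptimized rendering of the definitions above (not the implementation of [3] or [12]); it reproduces the two worked examples:

```python
def lz_self_parse(z: str) -> list:
    """Incremental (LZ78-style) self parsing: each phrase is the shortest
    string that has not been parsed before."""
    phrases, seen, i = [], set(), 0
    while i < len(z):
        j = i + 1
        while j <= len(z) and z[i:j] in seen:  # grow until the phrase is new
            j += 1
        phrases.append(z[i:j])
        seen.add(z[i:j])
        i = j
    return phrases

def lz_cross_parse(z: str, x: str) -> list:
    """Sequential cross parsing of z with respect to x: each phrase is the
    longest prefix of the remaining part of z that occurs somewhere in x."""
    phrases, i = [], 0
    while i < len(z):
        j = i + 1
        while j <= len(z) and z[i:j] in x:  # extend while still a substring of x
            j += 1
        k = max(j - 1, i + 1)  # fall back to a single symbol if z[i] is not in x
        phrases.append(z[i:k])
        i = k
    return phrases

z = "01111000110"
x = "10010100110"
print(lz_self_parse(z))      # ['0', '1', '11', '10', '00', '110'] -> c(z) = 6
print(lz_cross_parse(z, x))  # ['011', '110', '00110']             -> c(z|x) = 3
```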