The third common approach to computing a word's complexity is Shannon entropy. It uses the distribution of letters in the word to estimate the word's informativeness: $H = \sum_{i=1}^{|A|} p_i \log(1/p_i)$, where $A$ is the word's alphabet and $p_i \in [0,1]$ is the relative frequency of the $i$-th letter in the word. From our point of view, this is a variant of subword complexity in which the subword length is limited to 1 and, instead of the "number of distinct subwords", one simple function of the "frequencies of distinct letters" is used. A detailed comparison between Shannon entropy and Kolmogorov complexity can be found in (Grunwald and Vitanyi, 2004). Without going into details here, we must note that Shannon entropy is a "rougher" measure of informativeness than subword complexity. For example, the statistics of the symbols {0,1} underlying Shannon entropy consider the two strings "0000011111" and "0110001110" equally complex, because both contain 5 zeros and 5 ones, while subword complexity reflects the more complex inner structure of the second word.
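This contrast can be checked directly. The following sketch (the function names are ours, not from the paper) computes the letter-frequency Shannon entropy and the number of distinct subwords for the two example strings:

```python
import math

def shannon_entropy(word: str) -> float:
    """H = sum_i p_i * log2(1/p_i) over the relative letter frequencies."""
    n = len(word)
    freqs = {c: word.count(c) / n for c in set(word)}
    return sum(p * math.log2(1 / p) for p in freqs.values())

def subword_complexity(word: str) -> int:
    """Number of distinct non-empty subwords (contiguous substrings)."""
    n = len(word)
    return len({word[i:j] for i in range(n) for j in range(i + 1, n + 1)})

w1, w2 = "0000011111", "0110001110"
print(shannon_entropy(w1), shannon_entropy(w2))        # 1.0 1.0 -- identical
print(subword_complexity(w1), subword_complexity(w2))  # 35 40 -- w2 is richer
```

Both strings have entropy exactly 1 bit per symbol, yet the second string contains more distinct subwords, in line with its more complex inner structure.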
We begin the article with a demonstration of the computational difficulties connected with the use of subword complexity. These difficulties inspire us to analyze the structure of subword complexity and to propose a new simple measure of a word's complexity, which develops the notion of subword complexity but is convenient in practice. At the end we give some experiments supporting the proposed complexity measure. In the experiments we show that the proposed measure not only gains an advantage in computation time over the normalized classical subword complexity but also corresponds to the AbC much better.
2 THEORY
Definition 1. Let $W = (w_1, \ldots, w_n)$ be a finite word whose length is $n = |W|$, where $w_i \in A = \{a_1, \ldots, a_{|A|}\}$ for all $i = 1..n$ and $A$ is a finite set. Any word $W_s = (w_i, \ldots, w_j)$, where $1 \le i \le j \le n$, consisting of consecutive letters of $W$ is called a subword of $W$. A subword whose length is $k$ is called a $k$-subword.
Definition 2. Let us consider a word $W$. The number of distinct $k$-subwords of the word $W$ is called the $k$-subword complexity $K_k(W)$ of $W$. The number of all distinct subwords of $W$ is called the subword complexity $K(W)$ of $W$.
Definition 3. A random word is a word $W_H = (b_1, \ldots, b_n)$ over the alphabet $A = \{a_1, \ldots, a_{|A|}\}$, where $P(b_i = a_j) = 1/|A|$ for all $i = 1..n$ and $j = 1..|A|$.
To count the distinct $k$-subwords in a given word of length $n$, we need to perform $O(n - k + 1)$ operations. Summing over all $k = 1..n$ and applying the formula for the sum of an arithmetic progression with $n$ terms, we obtain time complexity $O(n^2)$. This time complexity is too high to apply the notion of subword complexity in practice to long words. Evidently, the subword complexity is the sum of the $k$-subword complexities, which are computed successively:
$$\sum_{k=1}^{n} K_k(W) = K(W) \quad (1)$$
But do all the $k$-subword complexities give an informative contribution to understanding the inner structure of a given word? If we take a very small $k$, then almost all possible $k$-subwords will occur in a sufficiently long word, so for small $k$ the $k$-subword complexity tends to $|A|^k$:
$$\lim_{|W| \to \infty} K_k(W) = |A|^k \quad (2)$$
On the other hand, for a large $k$ almost all the $k$-subwords are different, so the number of distinct $k$-subwords tends to the number of all $k$-subwords, which is equal to $n - k + 1$ (we must note that this situation is typical even for $k \ll n$):
$$\lim_{|W| \to \infty} K_k(W) = n - k + 1 \quad (3)$$
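Both limiting regimes can be observed on a random word. In this illustrative sketch (the parameters $n = 2000$ and $k = 3, 50$ are our own choices), the small-$k$ count matches $|A|^k$ while the large-$k$ count is close to $n - k + 1$:

```python
import random

random.seed(0)
alphabet = "ab"
n = 2000
w = "".join(random.choice(alphabet) for _ in range(n))

def K(word: str, k: int) -> int:
    # k-subword complexity: number of distinct subwords of length k
    return len({word[i:i + k] for i in range(len(word) - k + 1)})

print(K(w, 3), len(alphabet) ** 3)  # small k: matches |A|**3 = 8
print(K(w, 50), n - 50 + 1)         # large k: close to n - 50 + 1 = 1951
```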
We see that in both cases the $k$-subword complexity is usually determined by the global parameters of a given word, such as the size of the alphabet or the word's length. "Good" values of $k$ are presumably situated between "small" and "large", so we will search for a $k$ that satisfies both conditions simultaneously:
$$k = k_0:\ |A|^{k_0} = \lim_{|W| \to \infty} K_{k_0}(W) = n - k_0 + 1 \ \Rightarrow\ |A|^{k_0} = n - k_0 + 1 \ \Rightarrow\ k_0 \approx \log_{|A|} n \quad (4)$$
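Numerically, the approximation $k_0 \approx \log_{|A|} n$ is trivial to compute; a small sketch (the values of $n$ and $|A|$ here are arbitrary examples of ours):

```python
import math

# k_0 approximates the crossing point of |A|**k and n - k + 1 (equation (4)).
n, alphabet_size = 10_000, 4
k0 = math.log(n, alphabet_size)
print(k0)  # log_4(10000), approximately 6.64
```

By construction $|A|^{k_0} = n$ exactly, while $n - k_0 + 1$ differs from it only by roughly $k_0$, which is negligible for long words.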
This $k_0$ is not necessarily an integer, so we will approximate the value of $K_{k_0}(W)$ by the interpolation polynomial in Lagrange form:
$$K_{k_0}(W) = \sum_{i=1}^{p} K_{k_i}(W) \, \frac{\prod_{j \in B}(k_0 - k_j)}{\prod_{j \in B}(k_i - k_j)} \quad (5)$$
where $B = \{1..p\} \setminus \{i\}$, $p = 4$, and the $k_1, \ldots, k_4$ used for the approximation are the nearest integers:
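Equation (5) can be evaluated directly. The sketch below is our illustrative implementation; in particular, the choice of the four integer nodes as $\lfloor k_0 \rfloor - 1, \ldots, \lfloor k_0 \rfloor + 2$ is our assumption about what "nearest integers" means:

```python
import math

def K(word: str, k: int) -> int:
    # k-subword complexity: number of distinct subwords of length k
    return len({word[i:i + k] for i in range(len(word) - k + 1)})

def K_at_k0(word: str, alphabet_size: int) -> float:
    """Estimate K_{k0}(W) at the non-integer point k0 = log_{|A|} n
    by Lagrange interpolation (equation (5)) over p = 4 integer nodes."""
    k0 = math.log(len(word), alphabet_size)
    m = max(1, int(k0) - 1)
    ks = list(range(m, m + 4))  # four integer nodes around k0 (our choice)
    result = 0.0
    for i, ki in enumerate(ks):
        basis = 1.0
        for j, kj in enumerate(ks):
            if j != i:
                basis *= (k0 - kj) / (ki - kj)
        result += K(word, ki) * basis
    return result

print(round(K_at_k0("0110001110", 2), 2))
```

For the example word "0110001110" with $|A| = 2$, $k_0 = \log_2 10 \approx 3.32$, the nodes are $k = 2, 3, 4, 5$ with $K_k = 4, 6, 7, 6$, and the interpolated value lies between $K_3$ and $K_4$.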
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval