In Figure 8, note that the final segment of the modified
sequence $y^{(0)}$ has its datums shifted by one position
relative to the original sequence $x^{(0)}$. That becomes
a shift by 1 position in $y^{(1)}$, and by 0.25 positions in
$y^{(3)}$. Nevertheless, as can be seen in the figure, the
interpolated curves still coincide, even in that part.
6 COMPARISON WITH OTHER ENCODINGS
We claim that our three-channel encoding is better
suited for multi-scale analysis than other alternatives
that have been considered.
6.1 Why Not a Single-channel Encoding?
An obvious alternative is to encode each letter by
a single distinct number; say, map A, T, C, G to
0, 1, 2, 3, respectively. However, with this encoding
certain sequences consisting of very distinct bases
would map to the same averaged code when filtered,
a coincidence that has no biological justification.
For example, if X = (AGAGAG...) and Y = (TCTCTC...)
we would have x = (0, 3, 0, 3, 0, 3, ...) and
y = (1, 2, 1, 2, 1, 2, ...), which would produce
approximately the same sequence (1.5, 1.5, ...) when
filtered with a moderately wide kernel.
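
The coincidence described above can be checked numerically. The following is a minimal sketch, assuming a uniform (box) kernel of width 6 as a stand-in for the "moderately wide kernel"; the names `encode_single` and `box_filter` and the kernel choice are illustrative assumptions, not part of the method described in this paper.

```python
# Single-channel encoding discussed above: A, T, C, G -> 0, 1, 2, 3.
SINGLE_CODE = {"A": 0.0, "T": 1.0, "C": 2.0, "G": 3.0}


def encode_single(seq):
    """Map a DNA string to a list of numbers, one per base."""
    return [SINGLE_CODE[base] for base in seq]


def box_filter(samples, width):
    """Moving average over `width` consecutive samples, without padding.

    Assumption: a uniform kernel is used here only for illustration.
    """
    count = len(samples) - width + 1
    return [sum(samples[k:k + width]) / width for k in range(count)]


x = encode_single("AG" * 8)   # (0, 3, 0, 3, ...)
y = encode_single("TC" * 8)   # (1, 2, 1, 2, ...)

fx = box_filter(x, 6)
fy = box_filter(y, 6)

# Both filtered sequences collapse to the same constant value.
print(fx[:3], fy[:3])
```

With an even kernel width the two filtered sequences are identical; with an odd width they oscillate slightly around 1.5, which is consistent with the word "approximately" in the text.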
Another problem of this encoding is that the strength
of the Fourier spectrum of a pattern depends on which
nucleotides it uses. For example, the sinusoidal
components with period 2 in the sequences X and Y
above have amplitudes 3.0 and 1.0, respectively, even
though the patterns are basically the same (an
alternation of two letters).
6.2 Why Not a Two-channel Encoding?
The same problems would inevitably occur if we
were to map each base to a two-component vector
or a complex number, as proposed by E. A. Cheever
et al. (Cheever et al., 1989) and used by L. Pessôa
et al. (Pessôa et al., 2004). In these works, each
base is represented by a complex number: A, T, C,
and G are mapped to +1, −1, +i, and −i, respectively,
where $i = \sqrt{-1}$ is the imaginary unit. For
example, with X = (ATATAT...) and Y = (GCGCGC...)
we would have x = (+1, −1, +1, −1, +1, −1, ...) and
y = (−i, +i, −i, +i, −i, +i, ...), and both would
become very close to (0, 0, ...) when filtered with a
moderately wide kernel.
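
The same kind of numerical check applies to the complex encoding. This is a minimal sketch, again assuming a uniform kernel of width 6; the helper names are hypothetical and the kernel is an illustrative choice, not one prescribed by the cited works.

```python
# Complex encoding described above: A, T, C, G -> +1, -1, +i, -i.
COMPLEX_CODE = {"A": 1 + 0j, "T": -1 + 0j, "C": 0 + 1j, "G": 0 - 1j}


def encode_complex(seq):
    """Map a DNA string to a list of complex numbers, one per base."""
    return [COMPLEX_CODE[base] for base in seq]


def box_filter(samples, width):
    """Moving average over `width` consecutive samples, without padding.

    Assumption: a uniform kernel is used here only for illustration.
    """
    count = len(samples) - width + 1
    return [sum(samples[k:k + width]) / width for k in range(count)]


x = encode_complex("AT" * 8)
y = encode_complex("GC" * 8)

fx = box_filter(x, 6)
fy = box_filter(y, 6)

# Two unrelated alternating patterns both filter down to zero.
print(fx[:3], fy[:3])
```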
6.3 Why Not a Four-channel Encoding?
Another obvious alternative would be to use a
four-channel encoding where each letter is mapped to a
cardinal vector of $\mathbb{R}^4$; that is, where
coordinate j of x[i] is 1 if and only if X[i] is the
j-th letter of the allowed alphabet. Namely,
A → (1,0,0,0)
T → (0,1,0,0)
C → (0,0,1,0)
G → (0,0,0,1)
However, note that the sum of all four coordinates
is always 1, not only for the individual codes
but also for any average of codes. Thus the
four-channel codes actually lie on a three-dimensional
affine subspace of $\mathbb{R}^4$, meaning that the
encoding is redundant.
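
The redundancy can be illustrated with a short sketch: after averaging one-hot codes, the four coordinates still add up to 1. The example sequence, the kernel width of 4, and the helper names are arbitrary choices made for illustration only.

```python
# Four-channel (one-hot) codes as listed above.
ONE_HOT = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "T": (0.0, 1.0, 0.0, 0.0),
    "C": (0.0, 0.0, 1.0, 0.0),
    "G": (0.0, 0.0, 0.0, 1.0),
}


def encode_one_hot(seq):
    """Map a DNA string to a list of 4-tuples, one per base."""
    return [ONE_HOT[base] for base in seq]


def box_filter_vec(vectors, width):
    """Coordinate-wise moving average of a list of equal-length tuples."""
    count = len(vectors) - width + 1
    result = []
    for k in range(count):
        window = vectors[k:k + width]
        # zip(*window) groups the values of each coordinate together.
        result.append(tuple(sum(col) / width for col in zip(*window)))
    return result


filtered = box_filter_vec(encode_one_hot("ACGTTGCAAGCT"), 4)
coordinate_sums = [sum(v) for v in filtered]

# Every averaged code still has coordinates that sum to 1.
print(coordinate_sums)
```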
Indeed, the four-channel codes of the DNA letters
are the corners of a regular three-dimensional
tetrahedron $T^4$ in $\mathbb{R}^4$, and any weighted
average of those codes is a point of $T^4$. Moreover,
there is a simple one-to-one mapping from a point
$x' = (x'_0, x'_1, x'_2)$ of $\mathbb{R}^3$ to a point
$x'' = (x''_A, x''_T, x''_C, x''_G)$ of $\mathbb{R}^4$
that maps $T^3$ to $T^4$.
Namely,
$$
\begin{aligned}
x''_A &= (+x'_0 + x'_1 - x'_2 + 1)/4 \\
x''_T &= (+x'_0 - x'_1 + x'_2 + 1)/4 \\
x''_C &= (-x'_0 + x'_1 + x'_2 + 1)/4 \\
x''_G &= (-x'_0 - x'_1 - x'_2 + 1)/4
\end{aligned}
\tag{5}
$$
Note that $x''_A + x''_T + x''_C + x''_G$ is always 1.
It can be verified that the following projection of
$\mathbb{R}^4$ to $\mathbb{R}^3$ is a one-to-one mapping
of $T^4$ to $T^3$ that is the inverse of the above
mapping:
$$
\begin{aligned}
x'_0 &= +x''_A + x''_T - x''_C - x''_G \\
x'_1 &= +x''_A - x''_T + x''_C - x''_G \\
x'_2 &= -x''_A + x''_T + x''_C - x''_G
\end{aligned}
\tag{6}
$$
Therefore, we conclude that the 4-channel encoding
above contains exactly the same information as our
proposed 3-channel encoding.
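
The claim that mappings (5) and (6) are mutually inverse can be checked with a short script. This is a sketch only: the function names `to_four` and `to_three` are hypothetical, and the three-channel corner codes are obtained here by applying (6) to the one-hot codes rather than taken from the definition of the encoding given earlier in the paper.

```python
def to_four(p):
    """Mapping (5): a point (x0, x1, x2) of R^3 to (xA, xT, xC, xG) of R^4."""
    x0, x1, x2 = p
    return (
        (+x0 + x1 - x2 + 1) / 4,
        (+x0 - x1 + x2 + 1) / 4,
        (-x0 + x1 + x2 + 1) / 4,
        (-x0 - x1 - x2 + 1) / 4,
    )


def to_three(q):
    """Mapping (6): a point (xA, xT, xC, xG) of R^4 to (x0, x1, x2) of R^3."""
    a, t, c, g = q
    return (a + t - c - g, a - t + c - g, -a + t + c - g)


ONE_HOT = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "T": (0.0, 1.0, 0.0, 0.0),
    "C": (0.0, 0.0, 1.0, 0.0),
    "G": (0.0, 0.0, 0.0, 1.0),
}

# Corner codes in R^3 implied by (6); derived here, not quoted from the paper.
corners3 = {base: to_three(code) for base, code in ONE_HOT.items()}

# Round trip on the four corners.
round_trip = {base: to_four(corners3[base]) for base in ONE_HOT}

# Round trip on an averaged code (coordinates sum to 1).
mixed4 = (0.5, 0.25, 0.25, 0.0)
mixed3 = to_three(mixed4)
mixed_back = to_four(mixed3)

print(corners3)
print(mixed3, mixed_back)
```

The round trip from four channels to three and back is exact only for points whose four coordinates sum to 1, which is the case for every averaged one-hot code.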
7 CONCLUSIONS
We described a three-channel encoding of DNA
sequences that is adequate for multi-scale analysis
(specifically, for filtering and anti-aliased resampling)
and allows visualization of the results as smooth
curves in three-dimensional space.
We found that the tetrahedral encoding of genomic
sequences described in this paper is convenient for
visualization of genomic sequences of moderate length
(a few hundred nucleotides), especially after filtering
and subsampling. We also found that, with correct