protein with fixed number of points, which are
equidistant along the backbone.
Finally, we use a modification of the ray-based
descriptor (Vranic, 2004).
In the process of retrieval, descriptors are
compared by using the $L_1$ and $L_2$ norms.
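As an illustration of this comparison step, the following minimal sketch ranks a set of descriptors by their distance to a query. The function names and the list-based database are hypothetical and are not taken from the implemented system; descriptors are assumed to be numeric vectors of equal length.

```python
import numpy as np

def l1_distance(a, b):
    """L1 norm of the difference: sum of absolute component differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

def l2_distance(a, b):
    """L2 norm of the difference: Euclidean distance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def rank_by_distance(query, database, distance):
    """Return database indices ordered from the closest to the farthest descriptor."""
    scores = [distance(query, descriptor) for descriptor in database]
    return sorted(range(len(database)), key=lambda index: scores[index])
```

Either norm can be passed as the `distance` argument, so the same ranking routine serves both comparisons.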
2.3 Wavelet Based Approach
In this approach, we use a scheme similar to the one given in
(Marsolo, 2006). First, the distance matrix is calculated.
The distance matrix is a very good representation of the
protein 3D structure: proteins with similar structures
have similar distance matrices. In addition, the distance
matrix is invariant to scaling, translation and rotation of
the protein.
However, the distance matrix cannot be used directly as
a descriptor, because calculating the distance
between two descriptors of that size has
complexity $O(n^2)$. Thus, in this approach a wavelet
analysis of the distance matrix is performed.
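The following minimal sketch shows how such a distance matrix could be computed, assuming the protein is given as an array of 3D backbone coordinates (one row per Cα atom). The function names are hypothetical. The helper `random_rotation` exists only to check numerically that the matrix does not change under translation and rotation.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between the rows of an (N, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # shape (N, N, 3)
    return np.sqrt(np.sum(diff ** 2, axis=-1))       # shape (N, N)

def random_rotation(rng):
    """Random proper rotation matrix obtained from a QR factorization."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0   # flip one axis so that det(q) = +1
    return q

rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3))                    # stand-in for Calpha coordinates
moved = coords @ random_rotation(rng).T + np.array([5.0, -3.0, 2.0])
same = np.allclose(distance_matrix(coords), distance_matrix(moved))
```

Here `same` records whether the rotated and translated copy yields the same matrix as the original coordinates.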
The wavelet transform is a very useful and popular
tool in signal processing. Its main advantage is that
it can analyse an image at different resolutions and,
unlike the Fourier transform, it not only yields the
frequency components of the image but also
localizes them in space.
Since the discrete wavelet transform can be
applied only to signals of length $2^n$, the distance
matrix is scaled up to the nearest power of two by using
image scaling techniques. This can also be done by
interpolating the 3D skeleton with additional Cα
atoms up to some predefined number of the form $2^n$. The
obtained distance matrix has size $2^n \times 2^n$.
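The scaling step can be sketched as follows. The text does not specify which image scaling technique is used, so separable linear (bilinear) interpolation is assumed here purely for illustration. The function names are hypothetical.

```python
import numpy as np

def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def resize_bilinear(matrix, size):
    """Resize a square matrix to size x size by linear interpolation along each axis."""
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    old = np.linspace(0.0, 1.0, n)       # sample positions of the original grid
    new = np.linspace(0.0, 1.0, size)    # sample positions of the target grid
    # Interpolate every row to the new width, then every column to the new height.
    rows = np.array([np.interp(new, old, row) for row in matrix])    # (n, size)
    return np.array([np.interp(new, old, col) for col in rows.T]).T  # (size, size)

def scale_to_power_of_two(matrix):
    """Scale a square matrix up to the nearest power-of-two side length."""
    return resize_bilinear(matrix, next_power_of_two(len(matrix)))
```

Because the same interpolation is applied along both axes, a symmetric distance matrix remains symmetric after scaling.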
The detail coefficients, which represent the
high-frequency components of the signal, are obtained by
filtering the signal with a high-pass filter. Similarly,
the approximation coefficients, which represent the
low-frequency components of the signal, are
obtained by filtering the signal with a low-pass filter.
The filtering is performed by convolving the signal
with the impulse responses of the high-pass and
low-pass filters of the particular wavelet transform,
which gives the detail and approximation
coefficients respectively.
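A single filtering step can be sketched as below for the Haar wavelet, assuming the orthonormal filter normalization and a signal of even length. The constant and function names are hypothetical. Each filter output is downsampled by two, so the approximation and detail sequences each have half the length of the input.

```python
import numpy as np

# Impulse responses of the orthonormal Haar filters. np.convolve flips the
# kernel, so HIGH_PASS is written such that detail[k] = (s[2k] - s[2k+1]) / sqrt(2).
LOW_PASS = np.array([1.0, 1.0]) / np.sqrt(2.0)
HIGH_PASS = np.array([-1.0, 1.0]) / np.sqrt(2.0)

def haar_step(signal):
    """One decomposition level: convolve with each filter, then keep every second sample."""
    signal = np.asarray(signal, dtype=float)
    approx = np.convolve(signal, LOW_PASS, mode="valid")[::2]
    detail = np.convolve(signal, HIGH_PASS, mode="valid")[::2]
    return approx, detail

approx, detail = haar_step([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0])
```

With this normalization the step preserves the energy of the signal, i.e. the sum of squares of `approx` and `detail` equals that of the input.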
In this paper the wavelet analysis goes to the last,
n-th level of decomposition (if the size of the matrix
is $2^n$), i.e. to the minimal resolution of the image,
which is a single approximation coefficient (pixel). That
coefficient represents the average intensity of the
image. If the feature vector consisted of
all the obtained wavelet coefficients, the
performance of the comparison would not improve.
So, the feature vector must be some subset of the
wavelet coefficients. The high-frequency
coefficients of the wavelet transform carry the
signal details. Thus, those coefficients are not
relevant for the descriptor, because they
represent local characteristics of the protein 3D
structure, whereas structural comparison of proteins
means comparing their global structural
characteristics. That means that the low-frequency
coefficients should represent the protein. In our experiments,
we have used the 20 to 255 largest wavelet coefficients.
Additionally, they are quantized to the values 1 (positive
ones) and -1 (negative ones). In this paper we use
the Haar wavelet transform.
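A minimal sketch of the decomposition, truncation and quantization steps is given below. Several details are assumptions made for illustration and are not stated in the text: the standard 2D decomposition (full 1D transform of every row, then of every column), the orthonormal normalization, the selection of coefficients by absolute value, and the separate handling of the coefficient at [0,0]. All function names are hypothetical.

```python
import numpy as np

def haar_1d(signal):
    """Full orthonormal Haar decomposition of a signal of length 2^n."""
    out = np.asarray(signal, dtype=float).copy()
    length = out.shape[0]
    while length > 1:
        half = length // 2
        even = out[0:length:2].copy()
        odd = out[1:length:2].copy()
        out[:half] = (even + odd) / np.sqrt(2.0)         # approximation coefficients
        out[half:length] = (even - odd) / np.sqrt(2.0)   # detail coefficients
        length = half
    return out

def haar_2d(matrix):
    """Standard 2D decomposition: transform all rows, then all columns."""
    rows = np.apply_along_axis(haar_1d, 1, np.asarray(matrix, dtype=float))
    return np.apply_along_axis(haar_1d, 0, rows)

def truncate_and_quantize(coeffs, keep):
    """Keep the `keep` largest-magnitude coefficients (excluding [0,0]) as +1 or -1."""
    work = np.array(coeffs, dtype=float)
    work[0, 0] = 0.0                      # the overall average is handled separately
    flat = work.ravel()
    order = np.argsort(np.abs(flat))[::-1][:keep]
    signature = np.zeros(work.shape, dtype=int)
    signature.flat[order] = np.sign(flat[order]).astype(int)
    return signature
```

With the orthonormal normalization assumed here, the coefficient at [0,0] of a $2^n \times 2^n$ matrix equals the mean intensity multiplied by $2^n$, so it is proportional to the average value mentioned above.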
By observing the distance matrices as
images, it can be noticed that they are divided into
regions of the same or similar colour, which means
that the low-frequency components dominate in the
image. Since the Haar wavelet represents
low-frequency regions well, it is expected that
this wavelet will retrieve similar images with high
precision. The Haar filters have length 2, so
the calculation of the Haar wavelet transform is
very efficient.
In the process of retrieval, let Q[i,j] and
T[i,j] denote the coefficients of the two wavelet matrices at
position [i,j]. Suppose that Q[0,0] = T[0,0] = 0. For
matrices with dimensions m x n, we can use (4) to
measure their distance. The weights $w_{i,j}$ are
obtained empirically, and because the main
information of the image is in the upper-left corner,
the restriction $w_{i,j} = w_{\min(\max(i,j),5)}$ is applied. To
speed up the evaluation of the function d(Q,T),
we only consider the coefficients that are at the
same location.
$$d(Q,T) = w_{0,0}\,\bigl|Q[0,0] - T[0,0]\bigr| - \sum_{i=0}^{n} \sum_{j=0}^{m} w_{\min(\max(i,j),5)} \qquad (4)$$
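One possible reading of (4) is sketched below. It assumes that the double sum runs only over positions where both quantized matrices hold the same nonzero value, which is how we interpret "coefficients that are on the same location". It also assumes that the overall averages are stored separately from the quantized signatures. The function name, the argument layout and the weight values in the example are hypothetical; the empirically obtained weights are not reproduced here.

```python
import numpy as np

def wavelet_distance(q_avg, q_sig, t_avg, t_sig, weights):
    """Score two signatures; lower values mean more similar.

    q_avg, t_avg : overall average coefficients (the [0,0] entries)
    q_sig, t_sig : integer matrices with entries in {-1, 0, +1}
    weights      : six values, weights[b] for bin b = min(max(i, j), 5);
                   weights[0] multiplies the difference of the averages
    """
    q_sig = np.asarray(q_sig)
    t_sig = np.asarray(t_sig)
    w = np.asarray(weights, dtype=float)
    i, j = np.indices(q_sig.shape)
    bins = np.minimum(np.maximum(i, j), 5)       # weight bin of every position
    match = (q_sig != 0) & (q_sig == t_sig)      # same nonzero sign at same place
    match[0, 0] = False                          # [0,0] is covered by the first term
    return float(w[0] * abs(q_avg - t_avg) - np.sum(w[bins[match]]))

example_weights = [5.0, 1.0, 2.0, 3.0, 4.0, 6.0]   # placeholder values only
```

Under this reading the score decreases with every matching coefficient and can become negative, so it serves for ranking rather than as a metric in the strict sense.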
3 EXPERIMENTAL RESULTS
We have implemented a system for protein retrieval
based on the three approaches described above. Our
ground truth data contains 6979 randomly selected
protein chains from 150 SCOP protein domains.
90% of the data set serves as training data and
the remaining 10% serves as testing data. We
examine the retrieval accuracy of the descriptors
according to the SCOP hierarchy (Murzin, 1995).