protein with a fixed number of points, which are 
equidistant along the backbone. 
Finally, we use a modification of the ray-based 
descriptor (Vranic, 2004).  
In the process of retrieval, descriptors are 
compared by using the L1 and L2 norms. 
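As an illustration of the comparison step, the two norms can be sketched as follows. This is a minimal sketch, assuming that each descriptor is a plain list of numbers of equal length; the function names are ours and are not part of the described system.

```python
import math


def l1_distance(a, b):
    """L1 norm of the difference: sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """L2 norm of the difference: Euclidean distance between descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A query descriptor would be compared against every stored descriptor with one of these functions, and the results ranked by increasing distance.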
2.3  Wavelet Based Approach 
In this approach, we use a scheme similar to the one 
given in (Marsolo, 2006). First, the distance matrix 
is calculated. The distance matrix is a very good 
representation of the protein 3D structure: proteins 
with similar structure will have similar distance 
matrices. Also, the distance matrix is invariant to 
translation and rotation of the protein. 
However, the distance matrix cannot be used directly 
as a descriptor, because calculating the distance 
between two descriptors of that size has 
complexity O(n^2). Thus, in this approach a wavelet 
analysis of the distance matrix is performed.  
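The construction of the distance matrix can be sketched as follows. This is a minimal sketch, assuming that the protein is given as a list of (x, y, z) coordinates, one per backbone point; the function name and the input layout are our assumptions.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between backbone points.

    coords is assumed to be a list of (x, y, z) tuples, one per point.
    The result is a symmetric n x n matrix with a zero diagonal.
    """
    n = len(coords)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(coords[i], coords[j])
            d[i][j] = d[j][i] = dist
    return d
```

Because only pairwise distances are stored, moving or rotating the whole structure leaves the matrix unchanged, while the number of entries grows quadratically with the number of points.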
The wavelet transform is a very useful and popular 
tool in signal processing. Its main advantage is that 
it can analyse an image at different resolutions 
and, unlike the Fourier transform, it yields 
frequency components of the image that are also 
localized in space. 
Since the discrete wavelet transform can be 
applied only to signals of length 2^n, the distance 
matrix is scaled up to the nearest size of the form 
2^n by using image scaling techniques. This can also 
be done by interpolating the 3D skeleton with 
additional Cα atoms up to some predefined number of 
the form 2^n. The obtained distance matrix will have 
size 2^n x 2^n.  
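The scaling step can be sketched as follows. The text does not name the image scaling technique, so bilinear interpolation is used here purely as an assumption; both function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def resize_bilinear(m, size):
    """Resample a square matrix m to size x size by bilinear interpolation.

    Bilinear interpolation is an assumed choice; any image scaling
    method could be substituted. Both sizes must be at least 2.
    """
    n = len(m)
    if n < 2 or size < 2:
        raise ValueError("matrix and target size must be at least 2")
    scale = (n - 1) / (size - 1)
    out = []
    for r in range(size):
        y = r * scale
        y0 = min(int(y), n - 2)      # upper row of the enclosing cell
        fy = y - y0
        row = []
        for c in range(size):
            x = c * scale
            x0 = min(int(x), n - 2)  # left column of the enclosing cell
            fx = x - x0
            top = m[y0][x0] * (1 - fx) + m[y0][x0 + 1] * fx
            bottom = m[y0 + 1][x0] * (1 - fx) + m[y0 + 1][x0 + 1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Usage: a 3 x 3 matrix is scaled up to 4 x 4.
small = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
scaled = resize_bilinear(small, next_power_of_two(len(small)))
```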
The detail coefficients, which represent the 
high-frequency components of the signal, are obtained 
by filtering the signal with a high-pass filter. 
Similarly, the approximation coefficients, which 
represent the low-frequency components of the signal, 
are obtained by filtering the signal with a low-pass 
filter. The filtering is performed by convolving the 
signal with the impulse responses of the high-pass 
and low-pass filters of the particular wavelet 
transform, for the detail and approximation 
coefficients respectively.  
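For the Haar wavelet, one level of this filtering reduces to pairwise averages and pairwise differences. The sketch below assumes the averaging normalisation (filters (1/2, 1/2) and (1/2, -1/2)) followed by keeping every second sample; other normalisations, such as 1/sqrt(2), are equally common, and the function name is ours.

```python
def haar_step(signal):
    """One level of the Haar transform on an even-length sequence.

    Returns (approximation, detail). Each output has half the input
    length: the low-pass output holds pairwise averages and the
    high-pass output holds pairwise half-differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in pairs]
    return approx, detail
```

The original samples can be recovered from the two outputs as approx + detail and approx - detail for each pair.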
In this paper the wavelet analysis goes to the last, 
n-th level of decomposition (if the size of the matrix 
is 2^n), i.e. to the minimal resolution of the image, 
which is one approximation coefficient (pixel). That 
coefficient represents the average intensity of the 
image. If the feature vector consisted of all the 
obtained wavelet coefficients, the cost of the 
comparison would not be reduced, so the feature 
vector must be some subset of the wavelet 
coefficients. The high-frequency coefficients of the 
wavelet transform carry the signal details. Those 
coefficients are not relevant for the descriptor, 
because they represent local characteristics of the 
protein 3D structure, whereas structural comparison 
of proteins means comparing their global structural 
characteristics. Hence the low-frequency coefficients 
should represent the protein. In our experiments, 
we have used the 20 to 255 largest wavelet 
coefficients. Additionally, they are quantized to the 
values 1 (positive ones) and -1 (negative ones). In 
this paper we use the Haar wavelet transform.  
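The decomposition and the construction of the feature vector can be sketched as follows. Several choices here are assumptions rather than details given in the text: the standard 2D decomposition (all rows, then all columns), the averaging normalisation, selection of coefficients by absolute value, and the function names.

```python
def haar_1d(values):
    """Full Haar decomposition of a sequence whose length is a power of two.

    Output layout: [overall average, coarsest detail, ..., finest details].
    """
    out = list(values)
    length = len(out)
    while length > 1:
        half = length // 2
        approx = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = approx + detail
        length = half
    return out


def haar_2d(matrix):
    """Standard decomposition: transform every row, then every column."""
    rows = [haar_1d(r) for r in matrix]
    cols = [haar_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def quantize_top_k(coeffs, k):
    """Keep the k largest-magnitude coefficients as +1 or -1, zero the rest.

    Position [0][0] (the overall average) is returned separately and is
    left at 0 in the quantized matrix.
    """
    size = len(coeffs)
    ranked = sorted(
        ((abs(coeffs[i][j]), i, j)
         for i in range(size) for j in range(size) if (i, j) != (0, 0)),
        reverse=True,
    )
    q = [[0] * size for _ in range(size)]
    for mag, i, j in ranked[:k]:
        if mag > 0:
            q[i][j] = 1 if coeffs[i][j] > 0 else -1
    return coeffs[0][0], q


# Usage on a small block-structured matrix.
image = [[1, 1, 5, 5], [1, 1, 5, 5], [9, 9, 9, 9], [9, 9, 9, 9]]
average, signature = quantize_top_k(haar_2d(image), 3)
```

For an image made of uniform blocks, as in this usage example, the non-zero coefficients concentrate in the upper-left (low-frequency) corner of the coefficient matrix.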
When the distance matrices are viewed as 
images, it can be noticed that they are divided into 
regions with the same or similar colour, which means 
that the low-frequency components dominate in the 
image. Since the Haar wavelet represents such 
low-frequency regions well, it is expected that 
this wavelet will retrieve similar images with high 
precision. The length of the Haar filters is 2, so 
the calculation of the Haar wavelet transform is 
very efficient.  
In the process of retrieval, let Q[i,j] and 
T[i,j] denote the coefficients of the two wavelet 
matrices at position [i,j]. Suppose that 
Q[0,0] = T[0,0] = 0. For matrices with dimensions 
m x n, we can use (4) to measure their distance. The 
weights w_{i,j} are empirically obtained and, because 
the main information of the image is in the 
upper-left corner, the restriction 
w_{i,j} = w_{min(max(i,j),5)} is applied. To 
speed up the evaluation of the function d(Q,T), 
we only compare the coefficients that are at the 
same location.  
 
\[ d(Q,T) = w_{0,0}\,|Q[0,0]-T[0,0]| + \sum_{i=0}^{n}\sum_{j=0}^{m} w_{\min(\max(i,j),5)}\,|Q[i,j]-T[i,j]| \qquad (4) \]
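A weighted comparison of two quantized coefficient matrices in the spirit of (4) can be sketched as follows. The weight values are hypothetical placeholders, since the empirically obtained weights are not listed here. We also assume that the overall averages are passed separately and that position [0][0] of each quantized matrix holds 0.

```python
# Hypothetical placeholder weights w_0 .. w_5 (NOT the empirical values).
WEIGHTS = [5.0, 1.0, 0.8, 0.6, 0.4, 0.2]


def wavelet_distance(avg_q, q, avg_t, t, weights=WEIGHTS):
    """Weighted distance between two quantized wavelet matrices.

    avg_q, avg_t: overall average coefficients of the two images.
    q, t: equally sized matrices with entries in {-1, 0, 1}.
    The weight for position (i, j) is weights[min(max(i, j), 5)], so all
    positions beyond row or column 5 share the last weight.
    """
    total = weights[0] * abs(avg_q - avg_t)
    for i, (row_q, row_t) in enumerate(zip(q, t)):
        for j, (cq, ct) in enumerate(zip(row_q, row_t)):
            total += weights[min(max(i, j), 5)] * abs(cq - ct)
    return total
```

Only coefficients at the same position are compared, so a single pass over the two matrices is sufficient.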
3 EXPERIMENTAL RESULTS 
We have implemented a system for protein retrieval 
based on the three approaches described above. Our 
ground truth data contains 6979 randomly selected 
protein chains from 150 SCOP protein domains. 
90% of the data set serves as the training data and 
the other 10% serves as the testing data. We will 
examine the retrieval accuracy of the descriptors 
according to the SCOP hierarchy (Murzin, 1995).