tures SSSDB (Sonego et al., 2007; Chiang et al.,
2007).
Protein Preparation
We use C++ routines and the BALL library
(Kohlbacher and Lenhof, 2000) to analyse protein
structures. As a first step, hydrogen atoms are added.
After a consistency check of each residue, we assign
partial charges and atom radii to atoms according to
the AMBER force field. Finally, all hetero residues
are removed from the structure.
To be independent of protein annotation, all
SSEs are recomputed using the DSSP algorithm
(Kabsch and Sander, 1983); afterwards, all residues
within loops are removed. α-helices are not further
modified, whereas all connected components of β-
strands linked by hydrogen bonds are merged into β-
sheets. In this way, each β-sheet is treated as a com-
plete 3D structure during the computation of interac-
tion energies.
Graphical Encoding of Energetic SSE Interactions
To prepare the graph construction, we compute the
pairwise matrix of SSE interaction energies. Let A
and B be any two secondary structures in a protein, and
let E[A, B, ...] be the AMBER energy of a set of SSEs.
The pairwise interaction energy I[A, B] is then given as
I[A, B] = E[A, B] − E[A] − E[B]. (1)
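Equation (1) can be illustrated with a minimal sketch. The callable `energy` is a hypothetical stand-in for the AMBER energy evaluation, which the text performs with C++ and BALL; here it is replaced by a toy additive model, so the names and numbers below are assumptions for illustration only.

```python
from itertools import combinations


def interaction_matrix(sses, energy):
    """Return {(A, B): I[A, B]} with I[A, B] = E[A, B] - E[A] - E[B].

    `energy` is a hypothetical callable that maps a tuple of SSEs to the
    energy of that set; it stands in for the force-field evaluation.
    """
    single = {a: energy((a,)) for a in sses}
    inter = {}
    for a, b in combinations(sses, 2):
        value = energy((a, b)) - single[a] - single[b]
        inter[(a, b)] = value
        inter[(b, a)] = value  # the interaction energy is symmetric
    return inter


# Toy energy model (assumed values): self energies plus pairwise terms.
SELF = {"H1": -10.0, "H2": -8.0, "S1": -20.0}
PAIR = {
    frozenset(("H1", "H2")): -3.0,
    frozenset(("H1", "S1")): -1.0,
    frozenset(("H2", "S1")): -4.0,
}


def toy_energy(group):
    total = sum(SELF[s] for s in group)
    total += sum(PAIR[frozenset(p)] for p in combinations(group, 2))
    return total


I = interaction_matrix(["H1", "H2", "S1"], toy_energy)
```

In this toy model the self energies cancel, so `I` recovers exactly the pairwise terms that were put in.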
Graph construction serves both to normalize the
structures and to extract the interaction model.
It filters out needless relations while being
independent of the magnitude of the computed
I[A, B] energies and of the relative distance of the
SSEs. It therefore makes proteins with different
numbers of SSEs comparable.
One convenient graph for this task is the Relative
Neighbourhood Graph (RNG) (Toussaint, 1980). The
RNG connects two labelled SSE nodes if the following
edge condition
I[A, B] ≤ max_C {I[A, C], I[B, C]}, (2)
holds, where A, B, C are SSEs from the protein and
A ≠ B ≠ C. The RNG is a connected proximity graph
and, therefore, also connects SSEs that are too distant
for direct protein residue contacts. As its edge con-
dition resembles an ultra-metric (Milligan and Isaac,
1980), the RNG has shown great robustness in prac-
tice and is a powerful tool to extract meaningful per-
ceptual structures (Toussaint, 1980).
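The edge condition can be sketched as follows. This is a minimal illustration that follows the standard RNG definition of Toussaint (1980), in which the condition must hold for every third node C, and that treats I[A, B] directly as the dissimilarity; whether Eq. (2) is evaluated in exactly this way in the described pipeline is an assumption. The function names and the toy energies are hypothetical.

```python
def symmetric(pairs):
    """Expand {(A, B): value} into a lookup that also contains (B, A)."""
    inter = {}
    for (a, b), value in pairs.items():
        inter[(a, b)] = value
        inter[(b, a)] = value
    return inter


def relative_neighbourhood_graph(nodes, inter):
    """Return the set of edges (A, B) for which
    I[A, B] <= max(I[A, C], I[B, C]) holds for every third node C."""
    edges = set()
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if all(
                inter[(a, b)] <= max(inter[(a, c)], inter[(b, c)])
                for c in nodes
                if c != a and c != b
            ):
                edges.add((a, b))
    return edges


# Toy interaction energies (assumed values; more negative = stronger).
inter = symmetric({
    ("A", "B"): -5.0, ("A", "C"): -1.0, ("A", "D"): -0.5,
    ("B", "C"): -4.0, ("B", "D"): -0.2, ("C", "D"): -3.0,
})
rng_edges = relative_neighbourhood_graph(["A", "B", "C", "D"], inter)
```

For these values the weak pairs (A, C), (A, D) and (B, D) are filtered out, and the remaining edges form the connected path A, B, C, D.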
Graphlet Analysis
Graphlet analysis makes use of subgraph sampling
and, therefore, relies on a graph isomorphism test.
Each sampled subgraph is referred to as graphlet and
its frequency or probability within a network is esti-
mated by repeated sampling. In addition, statistical
graphlet analysis requires the knowledge of a back-
ground distribution to compute the probability of an
observation. As no analytical distribution function for
graphlets is known, their probabilities are in general
estimated from random graphs.
To obtain a random model resembling the input
graph distribution, each protein graph is randomized.
We use a random rewiring method where each edge
is split into two half-edges. Then, all half-edges are
randomized and rewired. This is repeated until a con-
nected graph is obtained or a maximum number of
iterations is reached. In the latter case, the last
sample is kept. Random rewiring conserves
important graph properties (e.g. the node degree). By
randomizing each graph once, we obtain a collection
of random graphs that closely resembles the test dis-
tribution.
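The rewiring step can be sketched as below. This is a minimal illustration under stated assumptions: the text does not say how self-loops or duplicate edges produced by the random pairing are handled, so the sketch simply keeps them, and the default `max_iter` is a hypothetical value.

```python
import random
from collections import Counter


def is_connected(nodes, edges):
    """Depth-first search over an undirected edge list."""
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


def rewire(nodes, edges, max_iter=100, rng=None):
    """Split every edge into two half-edges, shuffle them and pair them
    up again. Repeat until the result is connected or `max_iter` is
    reached; in the latter case the last sample is returned."""
    rng = rng or random.Random()
    stubs = [n for edge in edges for n in edge]  # two half-edges per edge
    sample = list(edges)
    for _ in range(max_iter):
        rng.shuffle(stubs)
        sample = list(zip(stubs[0::2], stubs[1::2]))
        if is_connected(nodes, sample):
            break
    return sample


def degrees(edges):
    """Node degree, counting each edge end once."""
    return Counter(n for edge in edges for n in edge)


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A"), ("A", "C")]
random_edges = rewire(nodes, edges, rng=random.Random(0))
```

Because every node keeps exactly its original half-edges, the degree sequence of `random_edges` equals that of `edges` regardless of how the half-edges are paired.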
Next, we estimate the graphlet distribution by
randomly sampling connected subgraphs. The goal of
the sampling is twofold: first, all existing graphlets
should be detected and, second, their distribution
should be estimated correctly. If all graphlets were
known in advance, drawing a fixed number of samples
would yield the maximum likelihood estimate of the
graphlet distribution, which is a multinomial
distribution (Wassermann, 2004; Georgii, 2004). To
obtain this estimate, we employ a two-pass approach.
In the first pass, the data are sampled in an
exploratory manner. To count the graphlet
frequencies, we make use of a Move-to-Front (MF)
list that holds a counter for each graphlet type.
Each sampled graphlet is first searched for in the
MF list; if it is found, its counter is incremented,
otherwise it is inserted as a new entry.
Consequently, the MF list grows during this pass.
We draw 1000 samples per graph of the database to
minimize the possibility of missing patterns.
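A Move-to-Front counter of this kind might look as follows. The sketch makes two assumptions that the text does not state: graphlets are represented by hashable labels compared with `==` (a real implementation would use the graph isomorphism test here), and new entries are inserted at the front of the list. The class and label names are hypothetical.

```python
class MoveToFrontCounter:
    """Linear list of [graphlet, count] entries; an entry that is hit
    moves to the front, so frequent graphlets are found quickly."""

    def __init__(self):
        self.items = []

    def count(self, graphlet, insert=True):
        """Count one occurrence; return True if the graphlet was known."""
        for i, entry in enumerate(self.items):
            if entry[0] == graphlet:  # stand-in for an isomorphism test
                entry[1] += 1
                self.items.insert(0, self.items.pop(i))
                return True
        if insert:
            self.items.insert(0, [graphlet, 1])
        return False


# First pass: the list grows whenever a new graphlet type is sampled.
mf = MoveToFrontCounter()
for label in ["triangle", "path", "triangle", "star", "triangle"]:
    mf.count(label)
```

Calling `count(..., insert=False)` leaves the list unchanged for unseen graphlets, which corresponds to keeping the list fixed in a later pass.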
In the second pass, we keep the MF list fixed during
sampling to compute the maximum likelihood
estimates. Again, we draw a total of 1000 samples
from each graph, in 5 repetitions of 200 samples
each, and thus obtain 5 independent distribution
estimates. If sampling detects an unknown graphlet
within the second pass, a counter for unknown
patterns is increased. Finally, we compute the
distribution estimate by averaging and normalizing
all samplings for a graph.
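The second pass can be sketched as below. The relative frequencies within one repetition are the maximum likelihood estimate of the multinomial parameters, and the final estimate averages the repetitions. The callable `sample_graphlet`, the `UNKNOWN` label and the toy sampler are hypothetical; the sketch normalizes each repetition before averaging, which gives the same result as averaging counts first because all repetitions have equal size.

```python
import itertools
import random

UNKNOWN = "<unknown>"  # hypothetical label for graphlets missed in pass one


def second_pass(sample_graphlet, known, repetitions=5, samples=200, rng=None):
    """Estimate the graphlet distribution of one graph with a fixed list
    of known graphlet types; unseen types are pooled under UNKNOWN."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(repetitions):
        counts = {g: 0 for g in known}
        counts[UNKNOWN] = 0
        for _ in range(samples):
            g = sample_graphlet(rng)
            counts[g if g in counts else UNKNOWN] += 1
        # Relative frequencies: multinomial maximum likelihood estimate.
        estimates.append({g: c / samples for g, c in counts.items()})
    return {
        g: sum(e[g] for e in estimates) / repetitions
        for g in estimates[0]
    }


# Toy sampler (assumed): a fixed cycle instead of real subgraph sampling.
cycle = itertools.cycle(["triangle", "path", "triangle", "novel"])
estimate = second_pass(lambda rng: next(cycle), known=["triangle", "path"])
```

With this deterministic toy sampler, half of the samples are triangles, a quarter are paths, and the remaining quarter is counted as unknown.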
We choose a sampling size of 1000 graphlets in
GRAPHLET DATA MINING OF ENERGETICAL INTERACTION PATTERNS IN PROTEIN 3D STRUCTURES