conformations of the structurally variable regions
(termed loops) and adds the side chains. Some
approaches first align multiple known structures,
then identify structurally conserved regions and
construct an average structure for modeling these
regions of the query protein.
In this communication, we analyze a database of
pairs of proteins, aligned by sequence and by structure,
and raise a few questions:
i. Can we predict the accuracy of the modelled
structure based on sequence identity score?
ii. When is the selection of the protein with the highest
identity score justified?
iii. Can we formulate a set of rules for homology
modeling?
1.1 Materials and Methods
More than 124 unique homologs of the serine
protease family of proteins that have sequence
identity below 99% were downloaded from the
Brookhaven Protein Data Bank (PDB). IMSA
(Intelligent Multiple Sequence Alignment)⁴, in-house
software based on the Intelligent Learning Engine
(ILE) optimization technology, was then used to
align the whole set of sequences optimally.
A sequence identity score was calculated for each pair
of sequences. Only 96 proteins (see table 1) had
coordinates for all residues of the multiple
sequence alignment; the other proteins lack
coordinates for at least one residue in their 3D
structures. The alpha carbons of the residues of the
selected proteins were extracted from the PDB
structures and structurally superimposed.
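The pairwise identity score described above can be illustrated with a minimal sketch. The exact definition used by IMSA is not given here, so this assumes that columns in which either sequence has a gap are skipped and that the score is normalised by the number of remaining columns. The names `percent_identity`, `pairwise_identities` and the toy alignment are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol in the alignment rows


def percent_identity(row_a, row_b):
    """Percent identity between two rows of a multiple sequence alignment.

    Assumption: columns where either row has a gap are skipped, and the
    score is normalised by the number of columns actually compared.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == GAP or b == GAP:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identities(alignment):
    """Map each unordered pair of sequence names to its percent identity."""
    return {
        (name_a, name_b): percent_identity(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment, not taken from the serine protease database.
toy_alignment = {
    "seqA": "IVGG-YTC",
    "seqB": "IVGGQYEC",
    "seqC": "IIGG-HEC",
}
scores = pairwise_identities(toy_alignment)
```

Other normalisations (for example, by the length of the shorter sequence) give different values for gapped pairs, so the choice should be stated whenever identity scores are compared.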
The quality of the models obtained by homology
modeling is quantified by the Cα RMSD between the
model and the experimental structure. We define a
“highly accurate” model as one having a Cα RMSD <= 2 Å
from the experimentally determined structure, while
models having a Cα RMSD above this threshold and
<= 4 Å are termed “reliable” models, which could be
suitable for designing mutagenesis experiments but not
for drug design and binding affinity tests. BioLib was
used for performing the structural alignment and for
computing the Cα RMSD (BioLib is an open-environment
development toolkit developed by
BioLog Technologies Ltd.).
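The accuracy criterion above can be sketched as follows. This is not BioLib code: it assumes the two Cα traces have already been superimposed and contain the same residues in the same order, and it only evaluates the RMSD and applies the 2 Å and 4 Å thresholds. The function names and the label "unclassified" for models above 4 Å are our own illustrative choices.

```python
import math


def ca_rmsd(model, reference):
    """RMSD (in Å) between two equally long lists of (x, y, z) Cα coordinates.

    Assumption: the structures are already optimally superimposed, so no
    rotation or translation is applied here.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))


def classify_model(rmsd):
    """Apply the thresholds from the text: <= 2 Å and <= 4 Å."""
    if rmsd <= 2.0:
        return "highly accurate"
    if rmsd <= 4.0:
        return "reliable"
    return "unclassified"  # label assumed; the text names no category above 4 Å


# Hypothetical three-residue example: every atom is displaced by 1 Å along z.
model_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference_ca = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
example_rmsd = ca_rmsd(model_ca, reference_ca)
example_class = classify_model(example_rmsd)
```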
The multiple sequence alignment matrix
obtained by running our in-house software on the
selected database of serine proteases was processed
as described below in order to specify which parts
of the whole set of sequences to select for homology
modeling. We use a “voting” approach, in which
each amino acid contributes to the conservation at a
sequence position according to its frequency at that
particular position (see equation 1). These
frequencies are measured over all sequences of the
database.
$$C_{ij} = \frac{n_{ij}}{k} \times 100\% \qquad (1)$$

$C_{ij}$ is thus the conservation factor for residue type $i$
at sequence position $j$, $n_{ij}$ is the number of
sequences that have amino acid $i$ at position $j$ of
the multiple alignment, and $k$ is the total number of
sequences in the database.
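The voting scheme of equation 1 can be sketched in a few lines. This is an illustration rather than the in-house implementation: `conservation_factors` and the toy rows are hypothetical, and gap characters are assumed to be counted like any other symbol so that the factors in each column sum to 100%.

```python
from collections import Counter


def conservation_factors(alignment_rows):
    """Return one dict per alignment column j: {residue i: C_ij in percent}.

    C_ij = n_ij / k * 100, where n_ij is the number of sequences with
    residue i at column j and k is the total number of sequences.
    """
    if not alignment_rows:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment_rows[0])
    if any(len(row) != length for row in alignment_rows):
        raise ValueError("all aligned rows must have equal length")
    k = len(alignment_rows)
    factors = []
    for j in range(length):
        counts = Counter(row[j] for row in alignment_rows)  # n_ij for column j
        factors.append({res: 100.0 * n / k for res, n in counts.items()})
    return factors


# Hypothetical toy alignment of four sequences and three columns.
toy_rows = ["HDS", "HDS", "HES", "HDT"]
toy_factors = conservation_factors(toy_rows)
```

In this toy case the first column is fully conserved (H at 100%), while the second column has D at 75% and E at 25%.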
Table 1: PDB codes of the 96 serine proteases (the first four
characters are the code of the protein in the PDB, while the last
character is the chain ID).
1AMHA 1ANB0 1ANC0 1AND0 1BRBE
1CO7E 1DPO0 1F7ZA 1SLUB 1SLWB
3TGJE 1QL9A 1J16A 1TRMA 1EZSC
1F5RA 1FY8E 3TGKE 1AN1E 1MCTA
1S83A 1TAWA 1UTNA 1OPHB 1V2OT
1V2QT 1V2RT 1V2ST 1V2WT 1V2NT
1V2LT 1H4WA 1TRNA 1UTMA 1HJ8A
1MBQA 1BIT0 1A0JA 1DX5M 1JOUB
1RD3B 1THPB 1C5LH 1H8DH 2THFB
1H8IH 1B7XB 1BTHH 1TQ7B 1SHHB
1VR1H 1UCYK 1EUFA 1FI8A 1PJPA
1NN6A 1KLT0 1IAUA 1GVKB 1HAXB
1QNJA 1BRUP 1DST0 1BIO0 1RFNA
1PFXC 1A0LA 1CGHA 1FXYA 1LO6A
1G2LA 1FAXA 1LTOA 1TON0 1NPMA
1MZAA 3RP2A 1AO5A 1KLIH 1KIGH
1AZZA 1EAXA 1GVZA 1PYTD 1OP8A
1ORFA 1RTFB 1AUTC 1P57B 1FIZA
1FIWA 1BQYA 1A5IA 1MD8A 1EQ9A
1EKBB
2 RESULTS AND DISCUSSION
In this study, we aim to assess models obtained by
homology modeling by examining a large
set of sequence/structure alignments that belong to
the same protein family (adopt the same “fold”). We
used in-house software for the multiple sequence
alignment and selected the regions for model construction
in two ways: first, using all the Cα atoms of the 160 common
residues; second, using SCRs chosen on the basis of the
structural analysis of one protein (1A0JA); see figure 1.
The pairwise sequence identities in our database range between
28% and 100%.
Sequence analysis of the database revealed
highly conserved amino acids that were distributed
along the protein chain (see figure 1, number of
BIOSIGNALS 2009 - International Conference on Bio-inspired Systems and Signal Processing