Constructing Structural Proﬁles for Protein Torsion Angle Prediction

Zafer Aydin

, David Baker

and William Stafford Noble

Department of Computer Engineering, Abdullah Gul University, 38080, Kayseri, Turkey

Department of Biochemistry, University of Washington, Seattle, WA 98195 U.S.A.

Department of Genome Sciences, Department of Computer Science and Engineering,

University of Washington, Seattle, WA 98195 U.S.A.

Keywords:

Protein Torsion Angle Prediction, Structural Frequency Proﬁles, Template Scoring, Proﬁle-Proﬁle Alignment.

Abstract:

Structural frequency proﬁles provide important constraints on structural aspects of a protein and is receiving a

growing interest in the structure prediction community. In this paper, we introduce new techniques for scoring

templates that are later combined to form structural proﬁles of 7-state torsion angles. By employing various

parameters of target-template alignments we improve the quality and accuracy of structural proﬁles consider-

ably. The most effective technique is the scaling of templates by integer powers of sequence identity score in

which the power parameter is adjusted with respect to the similarity interval of the target. Incorporating other

alignment scores as multiplicative factors further improves the accuracy of proﬁles. After analyzing the indi-

vidual strengths of various structural proﬁle methods, we combine them with ab-initio predictions of 7-state

torsion angles by a linear committee approach. We show that incorporating template information improves the

accuracy of ab-initio predictions signiﬁcantly at all levels of target-template similarity even when templates

are distant from the target. Template scaling methods developed in this work can be applied in many other

prediction tasks and in more advanced methods designed for computing structural proﬁles.

1 INTRODUCTION

Protein 3D structure prediction beneﬁts greatly from

prediction of various 1D and 2D structural attributes

such as secondary structure, backbone torsion (dihe-

dral) angles, solvent accessibility, disordered regions,

and contact maps (Cheng et al., 2008). Methods that

predict structural properties of proteins typically em-

ploy sequence-based frequency proﬁles in their fea-

ture sets to utilize information in similar proteins.

These proﬁles can be in the form of position speciﬁc

scoring matrices (PSSM) or hidden Markov models

(HMM) and can be derivedby aligning the amino acid

sequence of the query with sequences in a large pro-

tein database using an efﬁcient algorithm such as PSI-

BLAST (Altschul et al., 1997) or HHBlits (Remmert

et al., 2011). Despite the many efforts for improving

the quality of sequence-based alignments and their

proﬁles, the accuracy of 1D and 2D predictions has

come to saturation due to the difﬁculty of eliminat-

ing false positives especially when the query sequence

diverges from those in the protein database consider-

ably. Recently, there has been a growing interest in

using structural proﬁles as input features for predict-

ing various structural characteristics of proteins. A

structural frequency proﬁle is a position speciﬁc scor-

ing matrix (PSSM) that is constructed from the struc-

tural labels of templates (i.e., hit proteins) obtained

by aligning the target (i.e., query) against a set of pro-

teins. To date structural proﬁles have been derived

mainly for protein secondary structure (Li et al., 2012;

Cong et al., 2013); backbone structural motifs, sol-

vent accessibility, contact density (Mooney and Pol-

lastri, 2009); and shape strings (Sun et al., 2012).

To construct a structural proﬁle, the occurrence

frequencies of template residues are accumulated fol-

lowed by a normalization step. Methods that have

been developed for this task mainly use Laplacian

counts, which is a technique that gives equal weights

to templates (Li et al., 2012). As an alternative to

the Laplacian count method, a new scoring technique

has been proposed which scale the templates by the

third power of sequence identity score and the struc-

tural quality information (Pollastri et al., 2007; Walsh

et al., 2009). In this paper, we propose new struc-

tural proﬁle methods for 7-state torsion angles of pro-

teins by incorporating various score terms of HH-

search alignments (Soding, 2005) and by adjusting

Aydin Z., Baker D. and Noble W..

Constructing Structural Proﬁles for Protein Torsion Angle Prediction.

DOI: 10.5220/0005208500260035

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2015), pages 26-35

ISBN: 978-989-758-070-3

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

the power parameter according to the target-template

similarity.

Despite the variety of methods proposed for pre-

dicting backbone torsion angles of proteins (Singh

et al., 2014; Song et al., 2012; Wu and Zhang,

2008a; Faraggi et al., 2012; Shen et al., 2009; Ber-

janskii et al., 2006), less effort has been made to sys-

tematically incorporate structurally related templates

into torsion angle predictions (Mooney and Pollastri,

2009). To the best of our knowledge, there is no work

in the literature that inspects the accuracy of torsion

angle predictions at all levels of target-template simi-

larity (i.e., from easy to difﬁcult targets). Therefore

after deriving structural proﬁles, we combine them

with ab-initio predictions of a two-stage classiﬁer us-

ing a linear committee approach. Our method is able

to generate speciﬁc and effective predictions for tar-

gets at all difﬁculty levels. We achieve this by adjust-

ing the power parameter and the weight of the struc-

tural proﬁles with respect to the similarity interval of

the target.

2 METHODS

2.1 Backbone Torsion Angles

Each residue (i.e., amino acid) has three associated

torsion angles: φ, ψ, and ω. The angle φ denotes rota-

tion about theC

-N bond of the residue, ψ denotes ro-

tation about the bond linkingC

and the carbonyl car-

bon, and ω denotes rotation about the bond between

the carbonyl carbon of the current residue and the ni-

trogen of the next residue. We compute φ, ψ, and ω

from the 3-D coordinate information in Protein Data

Bank (PDB), which is the database of solved protein

structures. Each of these angles is constrained to the

range [−180, 180].

Following (Blum et al., 2008), we ﬁrst subdivided

residues into ﬁve torsion angle classes, which repre-

sent the major clusters observed in PDB. However, to

reduce the imbalance in the sizes of these classes, we

further subdivided the two most common labels (A

and B) according to whether the secondary structure

class is loop or not. The resulting seven labels are

described in Table 1.

2.2 Torsion Angle Class Prediction

Based on the deﬁnition given in Table 1, the 7-state

torsion angle prediction problem can be stated as fol-

lows. For a given protein, the goal is to assign to each

amino acid a torsion angle label from the alphabet

{L, A, M, B, E, G, O} as shown in Fig. 1.

LWGLVKQGLKCEDCGMNVHHKCREKVANLC

MMELMGLBBBBLLLGMBBMAAAALLMMLMO

Figure 1: 7-state torsion angle class prediction problem.

The ﬁrst row shows the amino acid sequence of the target

and the second row is the sequence of 7-state torsion angle

labels, which are deﬁned according to Table 1.

2.3 Alignment Methods

2.3.1 Deriving Templates for Structural Proﬁles

In this paper, we used the HHsearch method (Soding,

2005) to detect the templates that are similar to

a given target. HHsearch ﬁrst derives an HMM-

proﬁle for the target and aligns it against a database

of HMM-proﬁles (Soding, 2005). At the end of

the alignment, it ranks the templates (i.e. hits)

according to a probability score ranging from 0%

to 100% and reports the ones that score above

a threshold. An example alignment is shown in

Fig. 2. We used the following commandline to

compute HMM-HMM alignments for each tar-

get:

./hhsearch -i protein.hhm -d hhm3 -o

protein.hhr -cpu 2 -mact 0.05 -ssw 0.11

-atab protein.start.tab -realign -E 100

-cov 20 -b 20

. We then selected the HMM-HMM

alignments that score above the given threshold as

the templates. Note that, HHsearch uses predicted

secondary structure to be able to compute sensitive

HMM-HMM alignments. We used the PSIPRED

version 2.61 (Jones, 1999) to predict secondary

structures. All these alignments were generated in

2011. Further details on HHsearch and the HMM-

HMM alignments can be found in the corresponding

documentation (Soding, 2006; Soding et al., 2012).

2.3.2 Generating Position-Speciﬁc Scoring

Matrices for the Ab-Initio Method

We employed PSSMs generated by the PSI-BLAST

(Altschul et al., 1997) and HHMAKE (Soding, 2005)

algorithms as input features. We used BLAST version

2.2.20 and the NCBI’s non-redundant (NR) database

dated June 2011 to generate PSI-BLAST PSSMs. We

generated the HMM-proﬁles by HHsearch version

1.5.1 (Soding, 2005). Note that in deriving the HH-

MAKE PSSMs we did not perform any HMM-HMM

alignments. After deriving PSSMs we scaled them to

the interval [0, 1] by applying a sigmoidal transforma-

tion. Detailed descriptions of the PSSMs and the sig-

moidal transformation can be found in (Aydin et al.,

2011).

ConstructingStructuralProfilesforProteinTorsionAnglePrediction

Table 1: 7-state torsion angle labels. The table lists the seven torsion angle classes, their deﬁnitions, and the percentage of

residues assigned to each class in the PDB-PC90 data set. ss denotes the secondary structure label of the amino acid residue.

Label Deﬁnition Percent

L |ω| ≥ 90, φ < 0, −125 < ψ ≤ 50, ss = loop 11.94

A |ω| ≥ 90, φ < 0, −125 < ψ ≤ 50, ss 6= loop 38.21

M |ω| ≥ 90, φ < 0, ψ ≤ −125 OR ψ > 50, ss = loop 20.08

B |ω| ≥ 90, φ < 0, ψ ≤ −125 OR ψ > 50, ss 6= loop 22.27

E |ω| ≥ 90, φ ≥ 0, |ψ| > 100 1.92

G |ω| ≥ 90, φ ≥ 0, |ψ| ≤ 100 4.73

O |ω| < 90 0.84

No 3

>3fy3A

Probab=100.00 E-value=3.6e-41 Score=236.01 Aligned_columns=201 Identities=22%

Q ss_pred CCEEECCC---EEE--ECCCCCEEEECCCC---EEEE-CHHCCCCCCCEEEEC-------------------CCCHHHHH

Q ss_conf 99896176---479--87898379944863---4985-133664889879976-------------------88256532

Q 2odlA.fasta 4 GMDVVHGT---ATM--QVDGNKTIIRNSVD---AIIN-WKQFNIDQNEMVQFL-------------------QENNNSAV 55

(372)

Q Consensus 4 g~~v~~g~---~~i--~~~~~~~i~q~s~~---~~~n-w~sFnIg~~~~v~f~-------------------q~~~a~vi 55

(372)

.+.|+.+. -.+ ..++.+.|+..+|. ..|| |++|||++.+.+.+| ...+|++|

T Consensus 1 ~gIv~~~~~~~~~v~~~~nG~~vinI~~Pn~~GiS~N~y~~FnV~~~G~vlNNs~~~~~t~l~G~i~~NpnL~~~~A~~I 80

(234)

T 3fy3A 1 NGIVPDAGHQGPDVSAVNGGTQVINIVTPNNEGISHNQYQDFNVGKPGAVFNNALEAGQSQLAGHLNANSNLNGQAASLI 80

(234)

T ss_dssp CCCEECSSTTCCEEEEETTTEEEEECCCCCTTSEEEEEEEECCBCTTCEEEEECSSCEEETTTEEECCCTTSSSCCCSEE

T ss_pred CCEEECCCCCCCEEEECCCCCEEEEEECCCCCCCEEEEEEEEECCCCCEEEECCCCCCCCCCCCEECCCCCCCCCCCCEE

T ss_conf 95663798988677755999328986678888534532012103999759965755453000202215753378764289

Figure 2: An example HMM-HMM alignment by HHsearch.

2.4 Structural Frequency Proﬁles for

7-State Torsion Angles

Our structural proﬁle is a 7 × N matrix where rows

represent the torsion angle classes and columns de-

note the amino acids of the target. An example struc-

tural proﬁle is illustrated in Fig. 3.

1 2 ... N

L 0.40

0.20 ... 0.18

A 0.17

0.30 ... 0.02

M 0.01

0.13 ... 0.08

B 0.03

0.07 ... 0.12

E 0.19

0.11 ... 0.04

G 0.16

0.09 ... 0.06

O 0.04

0.10 ... 0.50

Figure 3: A structural frequency proﬁle for 7-state tor-

sion angle representation. Rows represent the torsion angle

classes and columns denote the amino acids of the target.

Each column sums to 1.

The entries of a structural frequency proﬁle matrix

represent the propensity of having a particular torsion

angle class at a given amino acid residue of the tar-

get protein. It therefore acts as a signature summariz-

ing the 7-state torsion label expectancy of the target

residues. The entries of this matrix are normalized to

the interval [0,1] and each column sums to 1 similar

to a marginal a posteriori probability distribution. For

this reason, we denote a structural proﬁle by P

|y),

where t

is the torsion angle class of the j

residue

and y is the amino acid sequence of the target.

A structural proﬁle can be obtained by collecting

the occurrence frequencies of the structure labels of

the template proteins. There could be various ap-

proaches for computing a structural frequency proﬁle.

The following sections explain the methods that have

been implemented in this paper.

2.4.1 Laplacian Counts

The most straightforward approach for deriving a

structural frequency proﬁle is to count the occurrence

frequencies of torsion angle labels that match to a

given position of the target, which is followed by a

normalization step. The Laplacian count method is

employed by most of the approaches that have been

proposed for computing structural proﬁles to date (Li

et al., 2012). First a count matrix C is obtained that

contains the occurrence frequencies of the structure

labels of the templates, which is formulated as fol-

lows

C(i, j) =

∑

A( j,k)

δ(T( j, k), i) (1)

where C(i, j) is the (i, j)

entry of the count matrix

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

such that i ∈ {L, A, M, B, E, G, O} is the torsion angle

class, j is the residue position of the target, A( j, k) is

the residue of the k

database protein aligned to the

position, T( j, k) is the corresponding torsion an-

gle class of the template, and δ(T( j, k), i) is the Kro-

necker delta function deﬁned as

δ(t, i) =



1 if t = i

0 otherwise

(2)

In other words, each torsion angle label that is

aligned to the j

position of the target contributes by

a count of 1, which is also known as the Laplacian

count method. Once the count matrix is obtained it

is normalized so that each column sums to 1. This is

formulated as

(i, j) =

(

C(i, j)

∑

C(i, j)

if |A( j)| > 0

0 otherwise

(3)

where A( j) is the set of all residues aligned to the j

residue of the target, |A( j)| is the number of residues

in A( j), and M

is the normalized count matrix. In this

formulation, the residues of the target are divided into

two categories. In the ﬁrst group, we have “aligned”

positions (represented by the condition |A( j)| > 0)

where at least one residue is aligned from a database

protein and in the second group there is the set of “un-

aligned” positions (i.e., the case where |A( j)| = 0) for

which no residues are aligned from any hits. The sec-

ond condition is realized for positions that correspond

to gapped regions and for positions that are left out of

the aligned regions when a local alignment algorithm

is employed.

After the normalization step, the structural proﬁle

matrix can be computed as

M(i, j) =



(i, j) if |A( j)| > 0

(i, j) otherwise

(4)

where M

(i, j) is the background probability of align-

ing a template residue with torsion class i to the j

residue of the target. In this paper, we use predictions

from the ab-initio classiﬁer for the background distri-

bution of torsion angle labels.

2.4.2 Weighing Hits by Integer Powers of

Sequence Identity Scores

A second method for computing structural proﬁles

weights templates by integer powers of the sequence

identity score. In HHsearch, this score is computed

for each target template pair and is represented by the

“Identities” ﬁeld as shown in Fig. 2. We ﬁrst divide

this score by 100 and convert it to a weight value. We

then compute an integer power of this weight, which

is used to scale templates that contribute to the struc-

tural proﬁle. This is expressed in the equations below

C(i, j) =

∑

A( j,k)

θ(T( j, k), i) (5)

θ(t, i) =



if t = i

0 otherwise

(6)

where C is the count matrix, θ is the new occurrence

count function replacing the Kronecker delta in Eq.

2, I is the sequence identity score of the k

template

and a is an integer that represents the strength of the

ampliﬁcation one wishes to impose on the structurally

similar templates. The remaining terms are the same

as their counterparts in Eqs. 1 and 2. This type of

template scaling has two beneﬁts. The ﬁrst one is re-

lated to scaling templates by sequence identity scores,

which increases the contribution of structurally closer

templates while reducing the votes of distant ones.

The second beneﬁt comes by taking integer powers

of I, which manages the situation where a handful of

structurally similar templates are followed by many

less similar or distant templates. In such a scenario,

if we use the Laplacian counts as in Section 2.4.1

or weigh templates by sequence identity scores only

(i.e., a = 1) the contribution of the similar templates

would be suppressed by many structurally less simi-

lar candidates. To further amplify the effect of struc-

turally similar templates and to reduce the contribu-

tion of false positives (i.e., noise) it is useful to take

integer powers of the sequence identity scores as for-

mulated in Eq. 6. Once we compute the count matrix,

we normalize it as in Eq. 3. All the other steps in de-

riving the structural proﬁle are the same as in Section

2.4.1.

2.4.3 Incorporating Quality of Templates

In addition to taking integer powers of the sequence

identity score, it is also possible to include other

weight factors to the score function in Eq. 6. One

such measure assesses the experimental quality of the

templates and is proposed in (Pollastri et al., 2007;

Walsh et al., 2009). When we employ this approach

to score the templates, Eq. 6 takes the following form:

θ(t, i) =



if t = i

0 otherwise

(7)

where q is the quality of the template computed as X-

ray resolution + R-factor / 20 as proposed in (Hobohm

and Sander, 1994). According to this measure, a tem-

plate with a higher experimental quality has a lower

q parameter. Since this measure requires the X-ray

ConstructingStructuralProfilesforProteinTorsionAnglePrediction

resolution of the template, we apply it to those tem-

plates that have been solved by the X-ray method only

ignoring the remaining templates for the target.

2.4.4 Incorporating Other Alignment Scores

When two proteins are aligned to each other, typ-

ically several score terms are calculated for assess-

ing the statistical signiﬁcance including e-value, raw

similarity score, and percentage of sequence iden-

tity. Employing these terms in scaling the templates

could also be useful in constructing a structural pro-

ﬁle. With this motivation, we ﬁrst incorporated the

e-value score into the occurrence count function by

converting it to a multiplicative weight factor as in

(Wu and Zhang, 2008b). This is formulated as

θ(t, i) =



if t = i

0 otherwise

(8)

where w

is the e-value weight deﬁned as







1 if E < 10

−10

−0.05log

(E) + 0.5 if 10

−10

≤ E < 10

0 otherwise

(9)

such that E is the E-value of the alignment. Note that

we dropped the quality term as it did not bring any

signiﬁcant beneﬁts, which is veriﬁed by our simula-

tions. According to Eq. 9, w

is set to 1 when the

target-template similarity is above a certain threshold

(E − value < 10

−10

) and decreases linearly as the E-

value of the target-template alignment is greater than

−10

until it becomes considerably high (i.e., 10

in which case it is set to zero.

In addition to the E-value, we also considered in-

corporating the overall raw similarity score of the

alignment into our structural proﬁles. For this pur-

pose, we modiﬁed the occurrence count function as

θ(t, i) =



if t = i

0 otherwise

(10)

where s is the raw score of the alignment. For HH-

search, this is the overall similarity score obtained at

the end of the HMM-HMM alignment, which is com-

puted as the sum of the similarities of the aligned pro-

ﬁle columns minus the gap penalties (Soding, 2005).

A slight variation of this approach normalizes the raw

score with the length of the aligned region as

θ(t, i) =



if t = i

0 otherwise

(11)

such that L is the length of the aligned region and is

given as the ﬁeld denoted as “Aligned

columns” in

HHsearch’s output (see Fig. 2).

2.4.5 Scaling Columns of the Alignment

Up to this point, we scaled the templates uniformly

throughout the aligned positions without discriminat-

ing the individual columns of the alignment. In this

section, we explain an approach for amplifying lo-

cal regions within an alignment that could potentially

contribute more accurate torsion label information

and suppressing those that could be locally more dis-

tant. For this purpose, we include the similarity score

between the aligned residues from a BLOSUM ma-

trix into the occurrence count function as formulated

below

θ(t, i) =



if t = i

0 otherwise

(12)

where b is the similarity score such as BLOSUM ma-

trix score between the j

residue of the target and

the residue in the template that is aligned to the target

residue. When the two residues are biologically simi-

lar to each other we expect this score term to be larger

than the term obtained from dissimilar pairings. This

approach has the potential to amplify local matches

between motifs that are common both in the target

and the template. Note that normalizing the sequence

alignment score with L is optional. For instance, a

slightly modiﬁed version of Eq. 12 does not perform

such type of normalization:

θ(t, i) =



if t = i

0 otherwise

(13)

2.5 Prediction Model

2.5.1 Ab-initio Predictor

Our ab-initio torsion class predictor is a hybrid ar-

chitecture, in which four dynamic Bayesian network

(DBN) models are combined with a neural network.

Two of the DBNs use PSSMs derived by PSI-BLAST

(Altschul et al., 1997) and the other two use PSSMs

from the HHMAKE module of the HHsearch method

(Soding, 2005). Details of the ab-initio predictor can

be found in (Aydin et al., 2011; Aydin et al., 2012).

For simplicity we treat the output signal of the neural

network as a probability distribution due to the con-

straints it satisﬁes (i.e., it sums to 1 and takes values

from 0 to 1). This distribution is denoted as P

|x),

which represents the ab-initio likelihood of the j

residue to have t

as the torsion angle label given x,

the set of input features around position j. Hence our

neural network predicts the torsion angle label of the

residue at the center of the feature window by select-

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

ing the particular label with the maximum discrimi-

nant score at the output layer. This is formulated as

∗

= argmax

|x). (14)

2.5.2 Committee Predictor

The committee predictor combines the ab-initio pre-

dictions of torsion angle classes with the structural

frequencyproﬁle according to the following equation:

|x, y) =



(1− λ)P

|x) + λP

|y) if aligned

|x) if unaligned

(15)

where P

|x, y) is the combined likelihood of having

torsion angle label t

for the residue at position j, λ is

the weight assigned to the structural proﬁle, x is the

feature set of the ab-initio predictor described in Sec-

tion 2.5.1, y is the amino acid sequence of the target,

|x) is the distribution of torsion angle classes ob-

tained from the ab-initio predictor, and P

|y) is the

structural proﬁle computed from the templates. Ac-

cording to this equation, the combined likelihood is

the weighted average of the ab-initio predictions and

the structural proﬁle for positions that are aligned to

at least one template residue and it becomes equal to

the likelihood of the ab-initio predictor only for the

remaining positions.

After computing P

|x, y), we predict the torsion

angle class of the j

residue as the particular label

that maximizes P

|x, y) as

∗

= argmax

|x, y), (16)

where t

∗

is the ﬁnal prediction of the committee

method.

2.6 Datasets

2.6.1 PDB-PC90 Dataset

To obtain the PDB-PC90 dataset, we used the PISCES

server (Wang and Dunbrack, Jr., 2003; Wang and

Dunbrack, Jr., 2005) with the following set of crite-

ria: percent identity threshold of 90%, resolution cut-

off of 2.5

A, and R-value cutoff of 1.0. We also used

PISCES to ﬁlter out non-X-ray and C

-only struc-

tures and to remove short (< 30 amino acids) and long

(> 10000 amino acids) chains. This dataset contained

17056 chains.

2.6.2 Training and Test Set

We randomly selected 5161 proteins from the PDB-

PC90 dataset. Among those, we randomly selected

a set of 994 proteins for the test set. From the set

of 5161 proteins, we then removed those proteins

that are similar to the test set using a 10% sequence

identity threshold. The remaining set contained 4205

chains, which is used to train our ab-initio prediction

method. We computed the HHsearch alignments for

the set of 994 proteins, which are used for computing

the structural proﬁles of torsion angle classes and for

predicting the torsion angle classes.

2.6.3 Similarity Intervals and Subsets of the Test

Set

To distinguish easy targets from difﬁcult ones, we de-

ﬁned similarity intervals using the HHsearch align-

ments from half of the proteins in our test set (see Sec-

tion 2.6.2). For each target, we ﬁrst selected the max-

imum sequence identity score from the set of target-

template alignments. Then we ranked those scores

and deﬁned percentile intervals of sequence identity

with increments of 5%. This initially produced a total

of 20 intervals. We combined the eigth and ninth in-

tervals as the maximum sequence identity scores for

those targets were very close to each other. We also

combined the tenth up to the twentieth intervals since

the maximum sequence identity score was 100% for

all the targets in those bins. This procedure resulted

in a total of 9 sequence identity intervals. In the last

step, we further reduced the number of intervals to 5

by combining the 2nd and 3rd, 4th and 5th, and 7th up

to 9th intervals. The resulting intervals are tabulated

below

According to Table 2, the ﬁrst interval represents

targets with the most distant templates (i.e., those

with the maximum sequence identity score of 26%

or less) and the last interval represents targets that

contain highly similar templates (i.e., those with the

maximum sequence identity score greater than 80%).

Based on this binning, we further divided our test set

of 994 proteins (see the previous section) into 5 sub-

sets such that each contained those targets that fall

into one and only one of the intervals deﬁned in Table

2. The number of proteins and amino acids in each of

these subsets are summarized in Table 3.

Note that the number of proteins in each subset is

not uniformly the same (especially true for the last

set that contains targets from the “High” category)

mainly because datasets have been constructed by

random sampling from PDB without enforcing spe-

ciﬁc constraints for having equal number of samples

in each interval. Nonetheless we have enough sam-

ples in each subset mainly because the torsion angle

class prediction is performed on each residue sepa-

rately. Furthermore the proportion of target positions

that are aligned to at least one template residue is con-

ConstructingStructuralProfilesforProteinTorsionAnglePrediction

Table 2: Intervals of sequence identity scores The intervals are deﬁned by selecting the target-template alignments with

maximum sequence identity scores followed by sorting these scores in ascending order. Percentile increments of 5% results

in a total of 20 bins, which are further reduced to 5 intervals.

Interval Percentiles (%) Max Identity (%)

Low 0-5 0.0-26.0

Medium-Low 5-15 26.0-35.0

Medium 15-25 35.0-50.0

Medium-High 25-30 50.0-80.0

High 30-100 80.0-100.0

Table 3: The ﬁve subsets of the test set with 994 proteins. The number of proteins, the total number of amino acids and the

number of amino acids that are aligned to at least one template residue are shown for each subset. The subsets are derived

based on the intervals deﬁned in Table 2.

Subset # proteins # residues # aligned res.

Low 56 12903 12665

Medium-Low 99 21792 21682

Medium 95 22596 22561

Medium-High 62 15326 15295

High 682 160037 159993

Total 994 232654 232196

siderably high. This shows that a structural proﬁle

column is computed using the aligned templates for

most of the target residues.

3 RESULTS

We ﬁrst compare the torsion angle label accuracy of

the structural proﬁles on positions that are aligned to

templates only. For this purpose, we implemented the

proﬁle methods summarized in Table 4.

In SP4, we modify the power of the sequence

identity score term (i.e., a) according to the similar-

ity interval the target belongs to. For this purpose, we

use the following mapping to deﬁne a:

a =











1 if target in Low Interval

3 if target in Medium-Low Interval

5 if target in Medium Interval

7 if target in Medium-High Interval

9 if target in High Interval

(17)

where the interval of the target is deﬁned according

to Table 2. In SP9 and SP10 we use the BLOSUM62

matrix to scale individual columns of the alignment as

formulated in Eqs. 12 and 13. In SP9, we employed

the BLOSUM scores uniformly for all columns of the

alignment whereas in SP10, we utilized the BLOSUM

scores for targets in the “High” interval only. If the

target belongs to one of the remaining four intervals

then we turn off this score term in Eq. 13. In that case

this equation takes the following form

θ(t, i) =



if t = i

0 otherwise

(18)

Once we compute a structural proﬁle, we predict

the torsion angle class of the aligned target residues

by selecting the particular label that yields the max-

imum value in the corresponding column of the pro-

ﬁle.

Following these deﬁnitions, the torsion angle class

prediction accuracy of the structural proﬁles listed in

Table 4 is summarized in Table 5 below. In this table,

Overall 1 is the number of correctly predicted amino

acids divided by the total number of amino acids for

which a structural proﬁle column is computed (i.e.,

those that are aligned to at least one template). The

second up to the sixth columns show accuracies for

the ﬁve similarity intervals and are computed on each

subset of the test set. Overall 2 is the average of

the ﬁve accuracies obtained for Low to High inter-

vals. It estimates the accuracy we would obtain had

we used equal number of amino acids for each of the

ﬁve intervals. Based on these results, the most ac-

curate structural proﬁle method is SP10 though other

methods such as SP7, SP4 and SP8 are also quite ef-

fective. SP10 outperforms SP1, (the Laplacian count

method), by 14.53%, which is a statistically signiﬁ-

cant improvement. It is also better than SP5 by 1.2%,

which was proposed in (Pollastri et al., 2007; Walsh

et al., 2009). This improvement is also statistically

signiﬁcant (with a p-value < 0.0001 from a two-tailed

Z-test at a signiﬁcance level of 0.01).

After establishing that the new structural proﬁle

methods contain more accurate torsion angle infor-

mation than the approaches proposed in the literature,

we evaluated the accuracy when the structural pro-

ﬁles are combined with an ab-initio predictor. For

this purpose, we trained our ab-initio method us-

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

Table 4: The implemented structural proﬁle methods and their descriptions. Further details can be found in Section 2.4.

Structural Proﬁle and Desciption

SP1: Eq. 1, Laplacian

SP2: Eq. 6, a = 1

SP3: Eq. 6, a = 3

SP4: Eq. 6, a varies wrt similarity interval

SP5: Eq. 7, quality, a = 3, (Pollastri et al.)

SP6: Eq. 8, E-value, a = 3

SP7: Eq. 10, E-value, align. score, a = 3

SP8: Eq. 11, E-value, norm. align. score, a = 3

SP9: Eq. 12, E-value, norm. align. score, BLOSUM62, a = 3

SP10: Eq. 13, E-value, align. score, a varies, BLOSUM62 scores if target is in High interval only

Table 5: 7-state torsion angle prediction accuracy of structural proﬁles. L: Low, ML: Medium-Low, M: Medium, MH:

Medium-High, H: High. Only target residues that are aligned to at least one template are considered.

Proﬁle L ML M MH H Overall 1 Overall 2

SP1 66.49 69.83 71.94 74.46 75.66 74.17 71.68

SP2 66.85 70.71 73.93 80.07 81.96 79.18 74.70

SP3 66.68 72.20 77.82 86.64 92.87 87.64 79.24

SP4 66.85 72.20 79.40 87.92 93.47 88.30 79.97

SP5 66.53 72.13 77.28 87.07 92.73 87.50 79.15

SP6 67.15 73.26 78.89 87.13 93.06 88.03 79.90

SP7 66.48 74.11 80.23 87.66 93.30 88.40 80.36

SP8 66.59 73.91 80.13 87.31 93.25 88.33 80.24

SP9 65.05 72.19 78.30 86.39 93.68 88.14 79.12

SP10 66.55 74.11 80.71 87.36 93.69 88.70 80.48

ing the training set described in Section 2.6.2 and

computed 7-state torsion angle predictions on all the

amino acids of the test set. We then combined those

predictions with a structural proﬁle as in Eq. 15. For

the target residues that were not aligned to any tem-

plate, we simply took predictions from the ab-initio

method. Regarding the λ parameter (i.e., the weight

of the structural proﬁle) we considered two possibil-

ities. The ﬁrst approach sets λ to 0.5 and the second

one modiﬁes it according to the similarity interval of

the target according to the following function

λ =











0.5 if target in Low Interval

0.6 if target in Medium-Low Interval

0.7 if target in Medium Interval

0.8 if target in Medium-High Interval

0.9 if target in High Interval

(19)

In this equation, λ is gradually increased as the

similarity interval of the target approaches to the

“High” interval thereby giving more weight to the

structural proﬁle than the ab-initio predictor. Table 6

summarizes the accuracy of committee predictors that

combine the ab-initio method with various structural

proﬁles. In addition to the overall accuracy measure,

we also included the segment overlap (SOV) measure

that is used in 1D structure prediction to assess the

accuracy at the segmental level (Zemla et al., 1999).

The SOV measure depicts how well the predicted tor-

sion label segments match the true segments and is

biologically more meaningful than the residue level

accuracy.

According to this table, the ab-initio+SP10

method (with variable λ parameter) is better than

the ab-initio+SP5 method in all categories. The im-

provements are 2.78% in Overall 1, 3.40% in Over-

all 2, 3.57% in SOV, 1.65% in Low interval, 2.74%

in Medium-Low interval, 4.90% in Medium inter-

val, 4.82% in Medium-High interval and 2.86% in

High interval. When ab-initio+SP10 is compared

with ab-initio+SP1 (i.e., the Laplacian method) for

λ = 0.5, the improvements are 12.20% in Overall 1,

6.68% in Overall 2, 13.62% in SOV, 0.12% in Low

interval, 2.59% for Medium-Low interval, 4.59% in

Medium interval, 10.43% in Medium-High interval

and 15.66% in High interval. Adjusting the λ param-

eter with respect to the similarity interval was partic-

ularly useful for the ab-initio+SP5 method. In other

words, when λ is set to 0.5 uniformly for all similar-

ity intervals, the accuracy of ab-initio+SP5 dropped

signiﬁcantly higher than the ab-initio+SP10. This

shows that the proposed structural proﬁle SP10 is

more useful than SP5 when combined with the ab-

initio method. This is because torsion label errors of

SP10 and the ab-initio method overlap less as com-

pared to SP5 and therefore SP10 provides a better

complement to the ab-initio predictor. Another ob-

servation one can make is the improvement over the

ab-initio method when structural proﬁles are incor-

ConstructingStructuralProfilesforProteinTorsionAnglePrediction

Table 6: 7-state torsion angle prediction accuracy of methods that incorporate structural proﬁles with ab-initio pre-

dictions. L: Low, ML: Medium-Low, M: Medium, MH: Medium-High, H: High. All target residues in the test set are

considered.

Method L ML M MH H Overall 1 Overall 2 SOV

Ab-initio 72.36 73.96 73.58 73.35 74.01 73.83 73.45 71.33

Ab-initio + SP1, λ = 0.5 74.47 75.42 76.10 76.94 77.96 77.28 76.18 74.49

Ab-initio + SP5, λ = 0.5 72.39 74.19 74.19 75.53 78.28 76.99 74.92 74.52

Ab-initio + SP10, λ = 0.5 74.59 78.01 80.69 87.37 93.62 89.10 82.86 88.11

Ab-initio + SP5, λ as in Eq. 19 72.94 75.21 77.22 83.86 90.99 86.70 80.04 85.19

Ab-initio + SP10, λ as in Eq. 19 74.59 77.95 82.12 88.68 93.85 89.48 83.44 88.76

porated. This is true even for the Low interval (an

improvement of 2.23%) and is partly because of the

sensitive nature of HMM-HMM proﬁle alignments

and also because HHsearch uses predicted secondary

structure from PSIPRED (Jones, 1999) to align a pair

of HMMs.

In addition to the structural proﬁle methods de-

scribed above, we also considered three other ap-

proaches. The ﬁrst one incorporates the probabil-

ity score of HMM-proﬁle alignments into Eq. 13

as a multiplicative factor to globally scale the tem-

plates and the second method incorporates the col-

umn score of HHsearch alignments to amplify local

regions that are well conserved (e.g. motifs). These

two approaches did not bring any reasonable change

in the accuracy measures (result not shown). As a

third approach we considered employing Henikoff

weights (Henikoff and Henikoff, 1994) to scale the

count information of templates before constructing

the structural proﬁles. We applied this weighting pro-

cedure for the following three scenarios: (1) weights

based on matched residues only, (2) weights based

on torsion angle labels only, (3) weights based on

residue and torsion angle tuples. Unfortunately, in

all three cases, torsion angle prediction accuracy was

signiﬁcantly lower than the level achieved by other

scaling methods considered in this paper (result not

shown). Note that we did not consider utilizing a

background distribution in Eq. 4 for torsion angle

labels mainly because we use predictions from our

ab-initio method, which would eventually contain a

more accurate torsion angle representation than a sim-

ple background distribution.

Finally, we would like to state that we are unable

to compare our torsion angle class predictor with the

literature mainly because there is no other work that

performs 7-state torsion angle prediction on the same

set of alphabet. However we had shown in an ear-

lier paper that a 5-state version of our predictor pro-

vides results comparable to the state-of-the-art in the

ab-initio setting (Aydin et al., 2012).

4 CONCLUSIONS

In this paper, we propose novel methods for scaling

templates to construct structural proﬁles of torsion an-

gle states. Though we use the score terms in HH-

search method, our approach is generic and most of

the structural proﬁle methods proposed in this work

can also be implemented using other alignment meth-

ods including PSI-BLAST. Second, the scaling tech-

niques can be applied in many other tasks such as sec-

ondary structure prediction, solvent accessibility pre-

diction, contact map prediction, and 3D structure pre-

diction. Third, they can easily be incorporated into

other methods that have been developed for deriving

structural proﬁles from templates.

The proposed methods can be improved in several

ways. First of all, certain parameters of the method

can be optimized such as the power of sequence iden-

tity score and the weight that is used to combine struc-

tural proﬁles with ab-initio predictions. Addition-

ally, the templates can be scaled in a position-speciﬁc

manner using the conﬁdence scores, which are now

available in HHBlits (the new version of HHsearch).

A third technique uses templates that score within a

window only instead of taking all the templates that

scored above a threshold. This approach can be com-

bined with the existing techniques presented in this

work to reduce the computational cost. Finally, in-

stead of using a linear model, the ab-initio predictions

can be combined with structural proﬁles using more

advanced models such as neural networks. We be-

lieve that all these efforts will potentially improve the

accuracy of structure prediction tasks further.

ACKNOWLEDGEMENTS

This work is supported by grant 113E550 from

3501 TUBITAK National Young Researchers Career

Award.

BIOINFORMATICS2015-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

REFERENCES

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang,

J., Zhang, Z., Miller, W., and Lipman, D. J. (1997).

Gapped BLAST and PSI-BLAST: A new generation

of protein database search programs. 25:3389–3402.

Aydin, Z., Singh, A., Bilmes, J., and Noble, W. S. (2011).

Learning sparse models for a dynamic Bayesian net-

work classiﬁer of protein secondary structure. BMC

Bioinformatics, 12:154.

Aydin, Z., Thompson, J., Bilmes, J., Baker, D., and Noble,

W. S. (2012). Protein torsion angle class prediction

by a hybrid architecture of bayesian and neural net-

works. In 13th International Conference on Bioinfor-

matics and Computational Biology.

Berjanskii, M. V., Neal, S., and Wishart, D. S. (2006).

PREDITOR: a web server for predicting protein tor-

sion angle restraints. Nucleic Acids Research, 34(Web

Server Issue):W:63–69.

Blum, B., Jordan, M., Kim, D., Das, R., Bradley, P., and

Baker, D. (2008). Feature selection methods for im-

proving protein structure prediction with Rosetta. In

Platt, J., Koller, D., Singer, Y., and Roweis, S., editors,

Advances in Neural Information Processing Systems

20, pages 137–144. MIT Press, Cambridge, MA.

Cheng, J., Tegge, A. N., and Baldi, P. (2008). Ma-

chine learning methods for protein structure predic-

tion. IEEE Reviews in Biomedical Engineering, 1:41–

49.

Cong, P., Li, D., Wang, Z., Tang, S., and Li, T. (2013).

Spssm8: An accurate approach for predicting eight-

state secondary structures of proteins. Biochimie,

95(12):2460–2464.

Faraggi, E., Zhang, T., Yang, Y., Kurgan, L., and Zhou, Y.

(2012). SPINE X: improving protein secondary struc-

ture prediction by multistep learning coupled with

prediction of solvent accessible surface area and back-

bone torsion angles. PLoS One, 7(2):e30361.

Henikoff, S. and Henikoff, J. G. (1994). Position-based se-

quence weights. 243:574–578.

Hobohm, U. and Sander, C. (1994). Enlarged representative

set of protein structures. Protein Science, 3:522–524.

Jones, D. T. (1999). Protein secondary structure prediction

based on position-speciﬁc scoring matrices. 292:195–

202.

Li, D., Li, T., Cong, P., Xong, W., and Sun, J. (2012). A

novel structural position-speciﬁc scoring matrix for

the prediction of protein secondary structures. Bioin-

formatics, 28(1):32–39.

Mooney, C. and Pollastri, G. (2009). Beyond the twilight

zone: Automated prediction of structural properties of

proteins by recursive neural networks and remote ho-

mology information. Proteins: Structure, Function,

and Bioinformatics, 77:181–190.

Pollastri, G., Martin, A. J. M., Mooney, C., and Vullo,

A. (2007). Accurate prediction of protein secondary

structure and solvent accessibility by consensus com-

biners of sequence and structure information. BMC

Bioinformatics, 8(201).

Remmert, M., Biegert, A., Hauser, A., and Soding, J.

(2011). Hhblits: lightning-fast iterative protein se-

quence searching by hmm-hmm alignment. Nature

Methods, 9(2):173–175.

Shen, Y., Delaglio, F., Cornilescu, G., and Bax, A. (2009).

TALOS+: a hybrid method for predicting protein

backbone torsion angles from nmr chemical shifts.

Journal of Biomolecular NMR, 44(4):213–223.

Singh, H., Singh, S., and Raghava, G. P. S. (2014). Eval-

uation of protein dihedral angle prediction methods.

PLoS One, 9(8):e105667.

Soding, J. (2005). Protein homology detection by HMM-

HMM comparison. Bioinformatics, 21:951–960.

Soding, J. (2006). Quick guide to HHsearch.

ftp://toolkit.genzentrum.lmu.de/pub/HHsearch/old/

HHsearch/HHsearch1.5.1/HHsearch-guide.pdf.

Soding, J., Remmert, M., and Hauser, A. (2012). HH-suite

for sensitive sequence searching based on hmm-hmm

alignment. ftp://toolkit.genzentrum.lmu.de/pub/HH-

suite/hhsuite-userguide.pdf.

Song, J., Tan, H., Wang, M., Webb, G. I., and Akutsu, T.

(2012). TANGLE: two-level support vector regression

approach for protein backbone torsion angle predic-

tion from primary sequences. PLoS One, 7(2):e30361.

Sun, J., Tang, S., Xiong, W., Cong, P., and Li, T. (2012).

Dsp: a protein shape string and its proﬁle predic-

tion server. Nucleic Acids Research, 40(W1):W298–

W302.

Walsh, I., Bau, D., Martin, A. J. M., Mooney, C., Vullo,

A., and Pollastri, G. (2009). Ab initio and template-

based prediction of multi-class distance maps by two-

dimensional recursive neural networks. BMC Struc-

tural Biology, 9(5).

Wang, G. and Dunbrack, Jr., R. L. (2003). PISCES:

a protein sequence culling server. Bioin-

formatics, 19:1589–1591. Web server at

http://dunbrack.fccc.edu/PISCES.php.

Wang, G. and Dunbrack, Jr., R. L. (2005). PISCES: re-

cent improvements to a pdb sequence culling server.

Nucleic Acids Res., 33:W94–W98. Web server at

http://dunbrack.fccc.edu/PISCES.php.

Wu, S. and Zhang, Y. (2008a). ANGLOR: A composite

machine-learning algorithm for protein backbone tor-

sion angle prediction. PLoS One, 3(10):e3400.

Wu, S. and Zhang, Y. (2008b). MUSTER: Improving pro-

tein sequence proﬁle-proﬁle alignments by using mul-

tiple sources of structure information. Proteins: Struc-

ture, Function, and Bioinformatics, 72(2):547–556.

Zemla, A., Venclovas, C., Fidelis, K., and Rost, B. (1999).

A modiﬁed deﬁnition of Sov, a segment-based mea-

sure for protein secondary structure prediction assess-

ment. Proteins, 34:220–223.

ConstructingStructuralProfilesforProteinTorsionAnglePrediction