Predicting Molecular Functions in Plants using Wavelet-based Motifs

G. Arango-Argoty, A. F. Giraldo-Forero, J. A. Jaramillo-Garzón (1,2), L. Duque-Muñoz (1,2) and G. Castellanos-Dominguez

(1) Signal Processing and Recognition Group, Universidad Nacional de Colombia, Campus la Nubia, Km 7 vía al Magdalena, Manizales, Colombia

(2) Grupo de Máquinas Inteligentes y Reconocimiento de Patrones - MIRP, Instituto Tecnológico Metropolitano, Cll 54A No 30-01, Medellín, Colombia

Keywords:

Amino Acid Properties, Dissimilarity based Classiﬁcation, Molecular Function, Motifs, Wavelet Transform.

Abstract:

Predicting molecular functions of proteins is a fundamental challenge in bioinformatics. Commonly used

algorithms are based on sequence alignments and fail when the training sequences have low percentages of

identity with query proteins, as is the case for non-model organisms such as land plants. On the other

hand, machine learning-based algorithms offer a good alternative for prediction, but most of them ignore that

molecular functions are conditioned by functional domains instead of global features of the whole sequence.

This work presents a novel application of the Wavelet Transform in order to detect discriminant sub-sequences

(motifs) and use them as input for a pattern recognition classiﬁer. The results show that the continuous wavelet

transform is a suitable tool for the identification and characterization of motifs. Also, the proposed classification methodology shows good prediction capabilities for datasets with low percentage of identity among sequences, outperforming BLAST2GO by about 11.5% and PEPSTATS-SVM by 16.4%. In addition, it offers greater interpretability of the obtained results.

1 INTRODUCTION

Functions of gene products are speciﬁed by the

molecular activities they perform. These functions

may include transporting other molecules around,

binding to different compounds or holding molecules

together for fastening reactions. Several computa-

tional methods for protein function prediction use se-

quence alignment tools such as BLASTP (Johnson

et al., 2008), which are designed to transfer functions

from already annotated sequences to the novel ones

based on sequence similarity criteria (Cheng et al.,

2005). In this manner, homologous proteins can be identified under the assumption that amino acids having an important role in protein function and structure cannot mutate without an important effect on protein activity. However, those amino acids can change very slowly in a given protein family during evolution (Liu et al., 2006) and thus, for a set of sequences that span a great evolutionary distance, it is possible to find highly conserved amino acid regions, even if the sequences differ greatly from a global perspective. On the other hand, when the sequence similarity is low, aligned segments are often short and occur by chance, leading to unreliable alignments when the sequences have less than 40% similarity and to unusable alignments when they have less than 20% (Cheng et al., 2005).

Recently, a vast number of predictors based on

pattern recognition techniques have been designed in

an effort to ﬁnd alternative methods that do not rely

solely on alignments. Each one of them computes a

different set of attributes to characterize protein se-

quences, including statistical and physical-chemical

properties of amino acids (Shen and Burger, 2010),

energy concentrations from time-frequency represen-

tations (Gupta et al., 2009), distance measures, word

statistics, Hidden Markov Models, information the-

ory and others (Vinga and Almeida, 2003). However,

most of them only describe global attributes of the

whole protein sequence, ignoring the fact that functional domains may reside in different portions of proteins within the same family. Such recurring patterns are called motifs and they can be used to identify

representative regions of the proteins, revealing po-

tential information about their molecular function.

Nevertheless, only a small portion of proteins have clearly identifiable sorting signals in their sequence and, since proteins are commonly able to perform several molecular functions instead of only one, it remains a considerable challenge to use those motifs for predicting molecular functions with the fewest possible false positives and false negatives.

Arango-Argoty G., Giraldo-Forero A. F., Jaramillo-Garzón J. A., Duque-Muñoz L. and Castellanos-Dominguez G. Predicting Molecular Functions in Plants using Wavelet-based Motifs. DOI: 10.5220/0004234201400145. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2013), pages 140-145. ISBN: 978-989-8565-35-8. © 2013 SCITEPRESS (Science and Technology Publications, Lda.)

The Wavelet Transform (WT) has been previously

used as a powerful tool for mining information in pro-

teins (Murray et al., 2002). Here, a novel applica-

tion of the WT is developed, extending the represen-

tation scheme to a complete classiﬁcation methodol-

ogy. First, a protein is decomposed into a set of sub-

sequences by means of the WT. These sub-sequences

are further clustered to build a set of prototype motifs

representing the original protein sequence set. Prototype motifs are then used as features to build a representation space in which classification rules can be inferred with pattern recognition techniques. The properties of the proposed method are: I) detection of variable-length motifs; II) identification of patterns distributed in any position along the sequences; and III) accurate prediction of protein molecular functions, including proteins associated with multiple functions.

2 MATERIALS AND METHODS

2.1 Experimental Setup

The proposed methodology is depicted in Figure 1. In

step 1, the supervised training set of proteins (molec-

ular function) is preprocessed to extract short sub-

sequences of variable length. These patterns are de-

termined by interactions among adjacent amino acids

represented by wavelet coefﬁcients. In step 2, all the

extracted sub-sequences are clustered to get the pro-

totype motif set. Due to the variable motif length,

the multiple sequence alignment is used to compute

the consensus of all sub-sequences belonging to one

cluster. In step 3, a new protein sequence can be rep-

resented as the minimum distance between the pro-

totypes and its own sequence motifs. Once all pro-

teins are mapped into the set of prototype motifs, a

Support Vector Machine classiﬁer is trained to predict

their molecular function.

All experiments are carried out on land plant (Embryophyta) proteins belonging to nine different molecular functions, as shown in Table 1. The dataset comprises 1008 Embryophyta proteins reported by UNIPROT (Jain et al., 2009) (file version: 24-01-11) with at least one annotation in the molecular function ontology of the Gene Ontology Annotation Project (Barrell et al., 2009) (file version: 22-12-10) and whose evidence of existence is neither unknown nor predicted by computational tools. To avoid bias due to the presence of protein families, the database does not contain protein sequences with a pair-wise similarity above 40%. The web server version of cd-hit (Huang et al., 2010) is used to filter the dataset by similarity; the number of sequences remaining after this process is 564. Classes are defined according to the GO Slim Classification for Plants (Swarbreck et al., 2008).

Figure 1: Main methodology. a) The sequences are converted into numerical signals and the CWT is applied to obtain two-dimensional representations (position in the x-axis and amino acid interaction in the y-axis). Detected motifs are marked with numbers. b) Clustering of detected motifs and logos representation. c) The distance between a query protein and the prototype motifs is used to train/test the classifier.

Table 1: Number of protein sequences per class.

| Function | Entire | Reduced |
|---|---|---|
| NtBind | 109 | 53 |
| TranscFact | 160 | 102 |
| RnaBind | 80 | 52 |
| Nase | 33 | 24 |
| RecepBind | 40 | 27 |
| Transp | 280 | 133 |
| LipBind | 38 | 24 |
| Kinase | 224 | 103 |
| EnzReg | 78 | 46 |

2.2 Extraction of Motifs

Let $S = \{s_i\}$, $i = 1, 2, \dots, M$, be the training set of protein sequences. Then, a given protein $s_i$ of length $n_i$ can be represented as a numerical signal $\eta_i(t)$ that is a function of the position $t$ along its length, by substituting each amino acid with its equivalent value of a given physical-chemical property $I$. After all proteins have been converted into the numerical signal set $\eta = \{\eta_i(t)\}$, they are projected by the Continuous Wavelet Transform (CWT), which is defined as the decomposition of a signal $\eta_i(t)$ as follows:

$$W_i(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} \eta_i(t)\, \varphi\!\left(\frac{t-b}{a}\right) dt, \qquad (1)$$

where $\varphi((t-b)/a)$ is the basis wavelet function at a particular scale $a$ and a translation $b$, with $a, b \in \mathbb{R}$, $a > 0$. This work uses the Gauss mother wavelet due to its smoothing property (Murray et al., 2002).
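A discrete approximation of Eq. (1) can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the first derivative of a Gaussian is assumed as the mother wavelet (the text only states that a Gauss wavelet is used), the integral is replaced by a sum over residue positions, and the names `gauss1` and `cwt` are hypothetical.

```python
import math


def gauss1(u):
    """First derivative of a Gaussian (unnormalized); an ASSUMED mother wavelet."""
    return -u * math.exp(-u * u / 2.0)


def cwt(signal, scales, wavelet=gauss1):
    """Approximate W(a, b) = |a|^(-1/2) * sum_t signal[t] * wavelet((t - b) / a).

    Returns one row per scale a and one column per translation b.
    """
    n = len(signal)
    scalogram = []
    for a in scales:
        norm = 1.0 / math.sqrt(abs(a))
        row = [
            norm * sum(signal[t] * wavelet((t - b) / a) for t in range(n))
            for b in range(n)
        ]
        scalogram.append(row)
    return scalogram


# Hypothetical step signal: the property value jumps between positions 9 and 10.
signal = [0.0] * 10 + [1.0] * 10
W = cwt(signal, scales=[1, 2, 3, 4])

# Position with the largest absolute coefficient at scale a = 4.
strongest = max(range(len(signal)), key=lambda b: abs(W[3][b]))
```

In this toy case the strongest response at the largest scale sits next to the jump in the signal, which is the kind of localized energy concentration the scalogram is used to find. The end of the finite signal also produces a response, so boundary handling would need attention in practice.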

PredictingMolecularFunctionsinPlantsusingWavelet-basedMotifs

141

The resulting matrix $W_i \in \mathbb{R}^{n_a \times n_i}$ is called the “scalogram”, and $n_a$ represents the maximum scale (or motif length) considered for the decomposition. It has been empirically fixed to $n_a = 64$, which provides an acceptable trade-off between time complexity and maximum motif length. $W_i$ provides the localization of frequent sub-sequences within a given sequence $s_i$. Particularly, for regions with a similar amino acid behavior along the sequence, i.e., having high energy concentrations, it is possible to locate the centroid point in the scale-position space. Then, this point grows in the position axis towards both the left and the right sides until the value of the actual position becomes less than the minimal value of the region, and therefore determines the respective set of $n_{m_i}$ motifs for the sequence $s_i$, $\{\xi_{ij} : j = 1, \dots, n_{m_i}\} \subset s_i$. This process is applied to each sequence in $S$.
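The region-growing step described above can be illustrated with a small sketch. It is only one possible reading of the text: the "minimal value of the region" is modeled here as a fixed fraction of the peak energy, which is an assumption, and the function names, the example sequence and the energy values are all hypothetical.

```python
def grow_motif(row, peak, threshold):
    """Expand [left, right] around `peak` while neighbouring values stay >= threshold."""
    left = right = peak
    while left > 0 and row[left - 1] >= threshold:
        left -= 1
    while right < len(row) - 1 and row[right + 1] >= threshold:
        right += 1
    return left, right


def extract_motif(sequence, energy_row, fraction=0.5):
    """Return the sub-sequence around the highest-energy position of one scalogram row.

    `fraction` of the peak value is an ASSUMED stopping threshold.
    """
    peak = max(range(len(energy_row)), key=energy_row.__getitem__)
    left, right = grow_motif(energy_row, peak, fraction * energy_row[peak])
    return sequence[left:right + 1]


# Hypothetical sequence and energy profile (one value per residue).
sequence = "MKTAWRKYGQKLLE"
energy = [0.1, 0.1, 0.2, 0.3, 0.8, 0.9, 1.0, 0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.1]
motif = extract_motif(sequence, energy)
```

Because the window grows until the energy drops below the threshold on each side, the extracted motifs naturally have variable length, which is the property the method relies on.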

Regarding the physical-chemical property $I$ used for converting sequences into numerical signals, a total of 51 indices were selected from the AAINDEX database (Kawashima and Kanehisa, 2000). Such indices cover the six regions of the amino acid properties, aiming to explore different numerical representations.
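As an illustration of how a sequence becomes a numerical signal, the sketch below substitutes each residue with its Kyte-Doolittle hydropathy value. This scale is used purely as an example of an amino acid index; the text does not state which 51 indices were chosen, and the names `KYTE_DOOLITTLE` and `encode` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, index=KYTE_DOOLITTLE):
    """Map each amino acid of `sequence` to its value in the given property index."""
    return [index[aa] for aa in sequence]


signal = encode("WRKYGQK")
```

Swapping the dictionary for a different index changes the numerical representation without touching the rest of the pipeline, which is what makes it practical to compare many properties.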

2.3 Dissimilarity Space Representation

In order to obtain representative motifs within the k-th labeled class, motif subsequences are clustered by using the well known Iterative Self Organizing Data Analysis Technique (ISODATA). For the implementation of the algorithm, the alignment-score distance $d(\cdot,\cdot) \in \mathbb{R}^{+}$ between any two motifs $\xi$ and $\nu$ is defined as follows (subscripts are ignored since the original sequence of each motif is irrelevant in this context):

$$d(\xi,\nu) = \left(1 - \frac{\tilde{d}(\xi,\nu)}{\tilde{d}(\xi,\xi)}\right)\left(1 - \frac{\tilde{d}(\xi,\nu)}{\tilde{d}(\nu,\nu)}\right), \qquad (2)$$

where $\tilde{d}(\cdot,\cdot)$ is the similarity between two sequences $\xi$ and $\nu$, computed as:

$$\tilde{d}(\xi,\nu) = \sum_{l=1}^{n_{\min}} D(\xi(l),\nu(l)), \qquad (3)$$

with $n_{\min}$ being the minimal length of both subsequences under consideration, and $D(\xi(l),\nu(l))$ the value of the scoring matrix for the respective l-th elements of $\xi$ and $\nu$. As scoring matrix $D$, the Point Accepted Mutation matrix (PAM250) is used for the pairwise local alignment, as recommended in (Wheeler, 2002).
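A minimal sketch of the distance in Eqs. (2) and (3) is given below. To keep it self-contained, a toy scoring function (+2 for a match, -1 for a mismatch) stands in for PAM250; that substitution, the position-by-position comparison without gaps, and the function names are assumptions made for illustration.

```python
def toy_score(a, b):
    """Stand-in for a PAM250 lookup: +2 for identical residues, -1 otherwise (ASSUMED)."""
    return 2 if a == b else -1


def similarity(x, y, score=toy_score):
    """Eq. (3): sum of residue scores over the first min(len(x), len(y)) positions."""
    n_min = min(len(x), len(y))
    return sum(score(x[l], y[l]) for l in range(n_min))


def distance(x, y, score=toy_score):
    """Eq. (2): product of the two self-normalized dissimilarity terms."""
    s_xy = similarity(x, y, score)
    return (1 - s_xy / similarity(x, x, score)) * (1 - s_xy / similarity(y, y, score))


d_same = distance("WRKYGQK", "WRKYGQK")
d_close = distance("WRKYGQK", "WRKYGEK")
```

Normalizing by both self-similarities makes the distance symmetric and equal to zero for identical motifs, regardless of how large the raw alignment scores are.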

The ISODATA algorithm produces a set of $n_k$ clusters for each class. Then, as stated in (Schneider, 2002), one prototype motif $\zeta_r$, $r = 1, \dots, n_k$, is generated as the consensus sequence of each cluster. Given the profile matrix $P_r$ with elements $P_r(i,j) = f_r(i,j)/\|C_r\|$, where $f_r(i,j)$ represents the number of occurrences of amino acid $j$ at position $i$ of the multiple sub-sequence alignment $C_r$, each component of the consensus sequence is computed as the most frequent amino acid at that position:

$$\zeta_r(i) = \arg\max_{\forall j} \{P_r(i,j)\}. \qquad (4)$$
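The consensus rule described above can be sketched in a few lines. The example assumes that the sub-sequences of a cluster are already aligned, gap-free and of equal length; the multiple alignment step itself is not shown, and the function names and example cluster are hypothetical.

```python
from collections import Counter


def profile(aligned):
    """Per-position relative frequency of each amino acid in an aligned cluster."""
    n_seqs = len(aligned)
    length = len(aligned[0])
    return [
        {aa: count / n_seqs
         for aa, count in Counter(seq[i] for seq in aligned).items()}
        for i in range(length)
    ]


def consensus(aligned):
    """Pick the most frequent amino acid at every position of the profile."""
    return "".join(max(column, key=column.get) for column in profile(aligned))


# Hypothetical cluster of three aligned motifs.
cluster = ["WRKYGQK", "WRKYGEK", "WKKYGQK"]
prototype = consensus(cluster)
```

Ties between equally frequent residues are resolved here by first occurrence, which is an arbitrary choice; a real implementation would need an explicit tie-breaking rule.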

Once the set of prototype motifs $\{\zeta_r\}$ has been computed, a query protein $z$ can be represented by the minimum alignment-score distances between such prototype motifs and its own motifs $\xi_j$. The scalar-valued r-th component of the feature space representation is computed as:

$$\delta_r = \min_{\forall \xi_j \in z} \{d(\xi_j,\zeta_r)\}, \quad r = 1, 2, \dots, n_p, \qquad (5)$$

where $n_p = \sum_k n_k$. Conceptually, the quantity $\delta_r \in \mathbb{R}^{+}$ is a measure of the extent to which the prototype motif $\zeta_r$ is present in the sequence $z$.
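The dissimilarity representation of Eq. (5) can be sketched as below: each feature is the smallest distance between any motif of the query protein and one prototype. The toy scoring function again replaces PAM250, and the prototypes, query motifs and function names are hypothetical.

```python
def toy_score(a, b):
    """Stand-in for a PAM250 lookup (ASSUMED): +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def similarity(x, y):
    """Sum of residue scores over the first min(len(x), len(y)) positions."""
    return sum(toy_score(x[l], y[l]) for l in range(min(len(x), len(y))))


def distance(x, y):
    """Alignment-score distance built from self-normalized similarities."""
    s_xy = similarity(x, y)
    return (1 - s_xy / similarity(x, x)) * (1 - s_xy / similarity(y, y))


def dissimilarity_features(query_motifs, prototypes):
    """One feature per prototype: the minimum distance to any motif of the query."""
    return [min(distance(m, p) for m in query_motifs) for p in prototypes]


prototypes = ["WRKYGQK", "AAAA"]      # hypothetical prototype motifs
query_motifs = ["WRKYGEK", "AAAA"]    # hypothetical motifs of one query protein
features = dissimilarity_features(query_motifs, prototypes)
```

The resulting vector has a fixed length equal to the number of prototypes, so proteins with different numbers of motifs can all be fed to the same classifier.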

2.4 Classiﬁcation Methodology

The entire database is divided into modeling and classification sets: 60% of the sequences are selected to compute the prototype motifs, whereas the remaining 40% are left for testing purposes. The Fast Correlation-Based Filter (FCBF), described in (Yu and Liu, 2003), is used for feature selection. Since basic SVMs are designed only for two-class problems, classification is implemented following the one-against-all strategy, which produces a strong class imbalance. Therefore, the Synthetic Minority Over-sampling Technique (SMOTE) is employed (Chawla et al., 2002). Parameters of the SVM are tuned with a Particle Swarm Optimization algorithm. Validation of the results is obtained by 10-fold cross-validation over the testing set (40%). Sensitivity ($S_n$), specificity ($S_p$), and geometric mean ($G_m$) are used as classification performance measures:

$$S_n = \frac{n_{TP}}{n_{TP} + n_{FN}}, \qquad S_p = \frac{n_{TN}}{n_{TN} + n_{FP}}, \qquad G_m = \sqrt{S_n \cdot S_p},$$

where $n_{TP}$, $n_{FP}$, $n_{TN}$ and $n_{FN}$ denote true positive, false positive, true negative and false negative, respectively.
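The three performance measures named above can be computed from confusion-matrix counts using their standard definitions, as in the sketch below. The counts are hypothetical and the function names are illustrative.

```python
import math


def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that are predicted negative."""
    return tn / (tn + fp)


def geometric_mean(sn, sp):
    """Square root of the product of sensitivity and specificity."""
    return math.sqrt(sn * sp)


# Hypothetical counts for a single one-against-all classifier.
tp, fn, tn, fp = 8, 2, 15, 5
sn = sensitivity(tp, fn)
sp = specificity(tn, fp)
gm = geometric_mean(sn, sp)
```

The geometric mean is useful under class imbalance because it collapses towards zero when either sensitivity or specificity is poor, so a classifier cannot score well by favoring the majority class.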

2.5 Comparison with other Methods

Blast2GO is a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which GO annotations are not available. Annotation based on Blast2GO is carried out in three sequential stages: Blasting, mapping and annotation. For Blasting, the BLASTP algorithm is trained and tested over the same database, following the same validation methodology described in section 2.4. For this purpose, the Blast+ version 2.2.26 software is used (parameters: blosum 62, e-value 10, word size >= 2, -outfmt 5). In the mapping stage, BLAST results are loaded into the BLAST2GO module (-E-Value-Hit-Filter 10 -Annotation CutOff 10 to improve the false positive rate) in order to map these results to the b2g jun11 database. Finally, in the annotation stage the testing sequences are labeled using the evidence code weights proposed by Blast2GO (Conesa and Götz, 2008).

Pepstats-SVM is a pattern recognition approach that uses 37 global features proposed in Pepstats (Saraç, 2010). The same classification framework used in section 2.4 is applied for comparison purposes. The goal of this comparison is to show that, under the same conditions, the prototype motif based method exceeds the performance of methods based on global features.

Figure 2: 1) Classification performance (geometric mean) for several amino acid properties. 2) Selected properties and performance of the ensemble prototype motifs.

3 RESULTS AND DISCUSSION

Figure 2 depicts the prediction performance using 51 amino acid properties from the AAINDEX database. Lipid binding proteins are diverse in sequence, structure, and function (Lin et al., 2006), so lipid binding proved to be the molecular function that showed the highest performance across the whole set of amino acid properties.

Receptor binding proteins interact selectively with one or more specific sites on a receptor molecule (Lodish et al., 1995). Protein receptors are transmembrane proteins whose conformation is given by α and β structures and some specific domains (DNA-binding domains, hormone-binding domains, transmembrane subunits, among others). A clear relationship between the structure of the receptor proteins and the β-turn and α-helix properties was observed.

Nucleases are enzymes that participate in nucleic

acid catabolism and play roles in DNA replication,

cutting DNA molecules into small fragments (en-

donuclease activity) and DNA repair by proofreading

(exonuclease activity) (Lodish et al., 1995). The ac-

cessible residues property showed the best character-

ization, after molecular weight, for nuclease activity

function.

Disease-resistance genes are important in the cells

for the detection of pathogens and induction of de-

fense responses (Bai et al., 2002). These genes

code for proteins that interact selectively and non-

covalently with a nucleotide or any compound by

nucleotide binding sites (NBS). The NBS can affect

the disease resistance (R) protein function through

nucleotide binding (NtBind) or hydrolysis (Martin

et al., 2003). As shown in Figure 2, coiled coil, parallel β-strand, total β-strand and α-helix are the amino acid properties that best represent the NtBind function. This can be explained by the fact that some of these proteins contain a coiled coil domain, and that the structural conformation of the NBS domain, according to the SCOP classification, consists of α and β subunits (Wilson et al., 2009).

Proteins with sequence-specific DNA binding transcription factor activity (TranscFact) function interact selectively and non-covalently with a specific DNA sequence in order to modulate the

transcription of genetic information from DNA to

mRNA (Barrell et al., 2009). The amino acid com-

position and molecular weight are the properties that

best represented this function. The TranscFact class

is the function with the highest number of preserved

motifs (Figure 3). Two conserved prototype motifs

are analyzed using the web tool ScanProsite. Prosite

consists of documentation entries describing protein

domains, families and functional sites (Gattiker et al.,

2002).

Prototype motif 1 belongs to the WRKY domain, an amino acid region defined by the conserved amino acid sequence WRKYGQK that binds to a specific DNA sequence motif. Prototype motif 2 is found in the AP2/ERF domain. The structure of this domain integrates a three-stranded β-sheet and several α helices almost parallel to the β-sheet. It contacts DNA via Arg and Trp residues located in the β-sheet (Gattiker et al., 2002).

Table 2: Sensitivity (Sn), specificity (Sp) and geometric mean (Gm) values over 9 functional classes.

| Function | Blast2GO Sn | Blast2GO Sp | Blast2GO Gm | Wavelet Sn | Wavelet Sp | Wavelet Gm | Pepstats-SVM Sn | Pepstats-SVM Sp | Pepstats-SVM Gm |
|---|---|---|---|---|---|---|---|---|---|
| NtBind | 0.609 | 0.67 | 0.639 | 0.864 | 0.739 | 0.799 | 0.423 | 0.705 | 0.546 |
| TranscFact | 0.854 | 0.771 | 0.811 | 0.756 | 0.731 | 0.744 | 0.619 | 0.837 | 0.72 |
| RnaBind | 0.571 | 0.809 | 0.68 | 0.810 | 0.756 | 0.782 | 0.545 | 0.755 | 0.642 |
| Nase | 0.545 | 0.866 | 0.69 | 1.000 | 0.772 | 0.878 | 0.545 | 0.698 | 0.617 |
| RecepBind | 0.818 | 0.928 | 0.871 | 1.000 | 0.8624 | 0.929 | 0.636 | 0.9092 | 0.758 |
| Transp | 0.741 | 0.729 | 0.735 | 0.741 | 0.754 | 0.748 | 0.618 | 0.643 | 0.63 |
| LipBind | 0.3 | 0.886 | 0.515 | 0.900 | 0.794 | 0.845 | 0.455 | 0.688 | 0.559 |
| Kinase | 0.884 | 0.633 | 0.748 | 0.691 | 0.794 | 0.740 | 0.533 | 0.702 | 0.612 |
| EnzReg | 0.316 | 0.93 | 0.542 | 0.778 | 0.817 | 0.797 | 0.636 | 0.784 | 0.706 |
| Mean | 0.626 | 0.802 | 0.692 | 0.838 | 0.780 | 0.807 | 0.557 | 0.746 | 0.643 |

Figure 3: Logos of conserved prototype motifs for the TranscFact molecular function. Motifs 1 and 2 correspond to the plant transcription factor domains WRKY and AP2/ERF, respectively.

Having analyzed the prediction performances, an

ensemble of classiﬁers was trained with the best fea-

tures for each class. Those features are marked with

circles in Figure 2.

By comparing the results shown in Table 2, it is possible to infer that the proposed wavelet-based method outperforms the other methods in seven out of nine classes. The classification results of the proposed method are lower than the results of Blast2GO in only two cases, namely TranscFact and Kinase. Moreover, the proposed method is the most sensitive of the three methods shown, reducing the number of false negatives. The geometric mean between sensitivity and specificity is computed as a global performance measure, showing that the wavelet-based methodology exceeds the performance of Blast2GO by about 11.5% on average (0.807 versus 0.692) and that of Pepstats-SVM by 16.4% (0.807 versus 0.643).
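The averages quoted above can be re-derived from the geometric-mean columns of Table 2. The short check below copies those values and recomputes the per-method means, their differences, and the classes in which the wavelet-based method has the highest geometric mean; the variable names are illustrative only.

```python
# Geometric-mean (Gm) values per class, copied from Table 2.
classes = ["NtBind", "TranscFact", "RnaBind", "Nase", "RecepBind",
           "Transp", "LipBind", "Kinase", "EnzReg"]
gm = {
    "Blast2GO":     [0.639, 0.811, 0.680, 0.690, 0.871, 0.735, 0.515, 0.748, 0.542],
    "Wavelet":      [0.799, 0.744, 0.782, 0.878, 0.929, 0.748, 0.845, 0.740, 0.797],
    "Pepstats-SVM": [0.546, 0.720, 0.642, 0.617, 0.758, 0.630, 0.559, 0.612, 0.706],
}

# Mean Gm per method and the absolute gain of the wavelet-based method.
mean_gm = {method: sum(values) / len(values) for method, values in gm.items()}
gain_over_blast2go = mean_gm["Wavelet"] - mean_gm["Blast2GO"]
gain_over_pepstats = mean_gm["Wavelet"] - mean_gm["Pepstats-SVM"]

# Classes where the wavelet-based method reaches the highest Gm of the three.
wavelet_wins = [
    name for i, name in enumerate(classes)
    if gm["Wavelet"][i] == max(values[i] for values in gm.values())
]
```

The recomputed gains are absolute differences in mean geometric mean, that is, percentage points rather than relative improvements.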

4 CONCLUSIONS

In this paper, a methodology for molecular function prediction in plants is proposed. The approach explores the distribution of the proteins by computing a set of prototype motifs. These motifs are then used to train a classifier whose predictions improve on the performance of the two other explored methods, Pepstats and Blast2GO. For this purpose, an enhanced version of the previous work (Arango-Argoty et al., 2011) was used, whose main feature is the use of the continuous wavelet transform to identify and characterize protein motifs. This transform can provide accurate information about the structure of a protein and hence the structures/motifs related to each molecular function. Because the protein database contains sequences with low identity (< 40%), the prototype motifs proved to be discriminative and representative. Thus, the classification performance based on wavelet-motif detection improves the results achieved by 1) a method based on global features of the proteins (Pepstats), showing that simple peptide statistics are not enough to classify GO terms, and 2) a method based on similarities (Blast2GO), since that approach loses sensitivity when the identity among sequences is low. Finally, the proposed methodology offers a more complete interpretation of the obtained results since: a) the method is able to distinguish the most representative properties of the amino acids for each class and b) it identifies the motifs associated with each molecular function. A possible direction for future research is the use of robust methods, such as hidden Markov models, for clustering and computation of prototype motifs.

BIOINFORMATICS2013-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

144

ACKNOWLEDGEMENTS

This work was partially funded by the Research office (DIMA) at the Universidad Nacional de Colombia at Manizales and the Colombian National Research Centre (COLCIENCIAS) through grant No. 111952128388 and the “Jóvenes Investigadores e Innovadores 2010”, Convenio Interadministrativo Especial de Cooperación No. 146 de enero 24 de 2011 between COLCIENCIAS and Universidad Nacional de Colombia Sede Manizales.

REFERENCES

Arango-Argoty, G., Jaramillo-Garzón, J. A., Röthlisberger, S., and Castellanos-Domínguez, C. G. (2011). Protein subcellular location prediction based on variable-length motifs detection and dissimilarity based classification. Annual International Conference of the IEEE EMBS, (76).

Bai, J., Pennill, L., Ning, J., Lee, S., Ramalingam, J.,

Webb, C., Zhao, B., Sun, Q., Nelson, J., Leach, J.,

et al. (2002). Diversity in nucleotide binding site–

leucine-rich repeat genes in cereals. Genome research,

12(12):1871.

Barrell, D., Dimmer, E., Huntley, R., Binns, D.,

O’Donovan, C., and Apweiler, R. (2009). The GOA

database in 2009–an integrated Gene Ontology Anno-

tation resource. Nucleic acids research, 37(Database

issue):D396.

Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, W.

(2002). SMOTE: synthetic minority over-sampling

technique. Journal of Artiﬁcial Intelligence Research,

16(1):321–357.

Cheng, B., Carbonell, J., and Klein-Seetharaman, J. (2005).

Protein classiﬁcation based on text document classiﬁ-

cation techniques. Proteins: Structures, Function and

Bioinformatics, 58:955–970.

Conesa, A. and Götz, S. (2008). Blast2GO: A Comprehensive Suite for Functional Analysis in Plant Genomics. International journal of plant genomics, 2008:619832.

Gattiker, A., Gasteiger, E., and Bairoch, A. (2002). Scan-

Prosite: a reference implementation of a PROSITE

scanning tool. Applied Bioinformatics, 1(2):107–108.

Gupta, R., Mittal, A., Singh, K., Narang, V., and Roy, S.

(2009). Time-series approach to protein classiﬁcation

problem. Engineering in Medicine and Biology Mag-

azine, 28(4):32–37.

Huang, Y., Niu, B., Gao, Y., Fu, L., and Li, W. (2010). Cd-

hit suite: a web server for clustering and comparing

biological sequences. Bioinformatics, 26(5):680–682.

Jain, E., Bairoch, A., Duvaud, S., Phan, I., Redaschi, N.,

Suzek, B., Martin, M., McGarvey, P., and Gasteiger,

E. (2009). Infrastructure for the life sciences: de-

sign and implementation of the UniProt website. BMC

bioinformatics, 10(1):136.

Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., and Madden, T. (2008). NCBI BLAST: a better web interface. Nucleic acids research, 36(suppl 2):W5–W9.

Kawashima, S. and Kanehisa, M. (2000). AAindex: amino acid index database. Nucleic acids research, 28(1):374.

Lin, H., Han, L., Zhang, H., Zheng, C., Xie, B., and Chen,

Y. (2006). Prediction of the functional class of lipid

binding proteins from sequence-derived properties ir-

respective of sequence similarity. Journal of lipid re-

search, 47(4):824.

Liu, X., Korde, N., Jakob, U., and Leichert, L. (2006).

CoSMoS: conserved sequence motif search in the pro-

teome. BMC bioinformatics, 7(1):37.

Lodish, H., Berk, A., Zipursky, S., Matsudaira, P., Balti-

more, D., and Darnell, J. (1995). Molecular cell biol-

ogy. New York.

Martin, G., Bogdanove, A., and Sessa, G. (2003). Under-

standing the functions of plant disease resistance pro-

teins. Annual review of plant biology, 54(1):23–61.

Murray, K., Gorse, D., and Thornton, J. (2002). Wavelet transforms for the characterization and detection of repeating motifs. Journal of molecular biology, 316(2):341–363.

Saraç, O. (2010). GOPred: GO Molecular Function Prediction by Combined Classifiers. PloS one, 5(8):1–11.

Schneider, T. (2002). Consensus sequence zen. Applied

bioinformatics, 1(3):111.

Shen, Y. and Burger, G. (2010). TESTLoc: protein sub-

cellular localization prediction from EST data. BMC

bioinformatics, 11(1):563.

Swarbreck, D., Wilks, C., Lamesch, P., Berardini, T. Z., Garcia-Hernandez, M., Foerster, H., Li, D., Meyer, T., Muller, R., Ploetz, L., Radenbaugh, A., Singh, S., Swing, V., Tissier, C., Zhang, P., and Huala, E. (2008). The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic acids research, 36.

Vinga, S. and Almeida, J. (2003). Alignment-free sequence

comparison: a review. Bioinformatics, 19(4):513.

Wheeler, D. (2002). Selecting the right protein-scoring ma-

trix. Current Protocols in Bioinformatics, pages 3–5.

Wilson, D., Pethica, R., Zhou, Y., Talbot, C., Vogel, C., Madera, M., Chothia, C., and Gough, J. (2009). SUPERFAMILY: sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic acids research, 37(suppl 1):D380.

Yu, L. and Liu, H. (2003). Feature selection for high-

dimensional data: A fast correlation-based ﬁlter so-

lution. In Machine Learning-International Workshop

then Conference-, volume 20, page 856.

PredictingMolecularFunctionsinPlantsusingWavelet-basedMotifs

145