A Texture-based Classification Method for Proteins in Two-Dimensional Electrophoresis Gel Images

A Feature Selection Method using Support Vector Machines and Genetic Algorithms

Carlos Fernandez-Lozano, Jose A. Seoane, Marcos Gestal, Daniel Rivero,

Julian Dorado and Alejandro Pazos

Information and Communications Technologies Department. Faculty of Computer Science, University of A Coruña,

Campus Elviña s/n, 15071, A Coruña, Spain

Keywords: Texture Analysis, Feature Selection, Electrophoresis, Support Vector Machines, Genetic Algorithm.

Abstract: In this paper, the influence of textural information is studied in two-dimensional electrophoresis gel images.

A Genetic Algorithm-based feature selection technique is used in order to select the most representative

textural features and reduce the original set (296 features) to a more efficient subset. Such a method makes use

of a Support Vector Machines classifier. Different experiments have been performed in which the pattern set

is divided into two parts (training and validation) by extracting 30%, 20% or 0% of the training data for

validation, and a 10-fold cross-validation is used for validation. When 0% is extracted, the training set itself is

used for validation. For each division, 10 different trials have been run. Experiments have been carried out

in order to measure the behaviour of the system and to achieve the most representative textural features for

the classification of proteins in two-dimensional gel electrophoresis images. This information can be useful

for a protein segmentation process.

1 INTRODUCTION

Proteomics is the study of protein properties in a cell

or tissue aimed at obtaining a global integrated view

of disease, physiological and biochemical processes

of cells and regulatory networks. One of the most

powerful techniques, widely used to analyze

complex protein mixtures extracted from cells,

tissues, or other biological samples, is two-

dimensional polyacrylamide gel electrophoresis

(2D-PAGE). In this method, proteins are classified

by molecular weight (MWt) and iso-electric point

(pI) using a controlled laboratory process and digital

imaging equipment. Proteins in a

sample can also be separated with several

other techniques, such as chromatography or

mass spectrometry.

The main advantages of this approach are its

robustness, its parallelism and its unique ability to

analyze complete proteins at high resolution,

keeping them intact and being able to isolate them

entirely; however, this method also has several

drawbacks (Rabilloud, Chevallet et al., 2010).

In this work, the most representative group of

textural features is selected using Genetic

Algorithms.

2 THEORETICAL BACKGROUND

The method proposed in this work intends to assist

in 2D-PAGE image analysis by studying the textural

information present within them. To do so, a novel

combination of Genetic Algorithms (Holland, 1975)

and Support Vector Machines (Vapnik, 1979) is

presented. In this section, the main techniques used

are briefly introduced and explained.

One of the most important characteristics used

for identifying objects or regions of interest in an

image is texture, related to the spatial (statistical)

distribution of the grey levels within an image

(Haralick, Shanmugam et al., 1973). Texture is a

property of a surface and can be regarded as the regular

spatial organization of complex patterns; it is always

present, even if only as a non-dominant

feature.

Fernandez-Lozano C., Seoane J., Gestal M., Rivero D., Dorado J. and Pazos A.

A Texture-based Classification Method for Proteins in Two-Dimensional Electrophoresis Gel Images - A Feature Selection Method using Support Vector

Machines and Genetic Algorithms.

DOI: 10.5220/0004208704010404

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2013), pages 401-404

ISBN: 978-989-8565-47-1

© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

Genetic Algorithms (GAs) are search techniques

inspired by Darwinian Evolution and developed by

Holland in the 1970s (Holland, 1975). In a GA, an

initial population of individuals, i.e. possible

solutions defined within the domain of a fitness

function to be optimized, is evolved by means of

genetic operators: selection, crossover and mutation.

The selection operator ensures the survival of the

fittest, while the crossover represents the mating

between individuals, and the mutation operator

introduces random modifications. GAs possess

effective exploration and exploitation capabilities: they

explore the search space in parallel while exploiting the

information about the quality of the individuals

evaluated so far (Goldberg, 1989).
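
The evolutionary loop described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the bit-string encoding, the toy objective `count_ones`, the binary tournament, the one-point crossover and every parameter value are assumptions chosen for brevity.

```python
import random


def evolve(fitness, n_genes, pop_size=20, generations=30,
           p_cross=0.9, p_mut=0.02, seed=0):
    """Minimise `fitness` over bit strings with a basic generational GA."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # Selection: two binary tournaments pick the parents.
            p1 = min(rng.sample(pop, 2), key=fitness)
            p2 = min(rng.sample(pop, 2), key=fitness)
            child = list(p1)
            # Crossover: one-point recombination of the two parents.
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_genes)
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        # Keep track of the best individual seen so far.
        best = min(pop + [best], key=fitness)
    return best


def count_ones(bits):
    """Toy objective: the number of 1-bits (lower is better)."""
    return sum(bits)
```

Calling `evolve(count_ones, n_genes=12)` returns the best bit string found; with a fixed `seed` the run is reproducible.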

Vapnik introduced Support Vector Machines

(SVMs) in the late 1970s on the foundation of

statistical learning theory (Vapnik, 1979). The basic

implementation deals with two-class problems in

which data are separated by a hyperplane defined by

a number of support vectors. This hyperplane

separates the positive from the negative examples and is

oriented such that the distance between the boundary

and the nearest data point in each class is maximal;

the nearest data points, known as support vectors, are used

to define the margins (Burges, 1998).

These classifiers have also proven to be

exceptionally efficient in classification problems of

higher dimensionality (Chapelle, Haffner et al.,

1999; Moulin, Alves Da Silva et al., 2004), because

of their ability to generalize in high-dimensional

spaces, such as the ones spanned by texture patterns.
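
The role of the hyperplane and its margin can be illustrated with a short Python sketch. It assumes a linear two-class SVM whose weight vector `w` and bias `b` are already known; the values in the usage note are hand-picked for illustration and are not the result of training on the data of this work.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b of a sample relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

With `w = (1.0, 1.0)` and `b = -3.0`, the points `(1, 1)` and `(2, 2)` have decision values of -1 and +1, so they lie exactly on the two margins and play the role of support vectors; the margin width is 2 / ||w||.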

3 MATERIALS

In order to generate the dataset, ten 2D-PAGE

images of different types of tissues and different

experimental conditions were used. These images

are similar to the ones used by G.-Z. Yang (Imperial

College of Science, Technology and Medicine,

London). It is important to note that Hunt et al.

(Hunt, Thomas et al., 2005) determined that 7-8 is

the minimum acceptable number of samples for a

proteomic study.

For each image, 50 regions of interest (ROIs)

representing proteins and 50 representing non-proteins

(noise, black non-protein regions, and

background) were selected to build a training set

with 1000 samples. The selection was a double-blind process:

two clinicians each selected as many ROIs as they

considered appropriate and then, from the ROIs they had in common,

selected proteins that are representative

(isolated, overlapped, big, small, darker, etc.).

4 PROPOSED METHOD

The first step in texture analysis is texture feature

extraction from the ROIs. With a specialized

software package called MaZda (Szczypiński et al., 2009), 296

texture features are computed for each element in

the training set. These features are based on the

image histogram, co-occurrence matrix, run-length

matrix, image gradients, autoregressive models and

wavelet analysis. Histogram-related measures

constitute the first-order statistics proposed by

Haralick (Haralick, Shanmugam et al., 1973), whereas

second-order statistics are those derived from the

Spatial Distribution Grey-Level Matrices (SDGM).

All these feature sets were included in the

dataset. The normalization method applied was the

one set by default in MaZda: image intensities were

normalized in the range from 1 to $N_g = 2^k$, where $k$ is

the number of bits per pixel used to encode the

image under analysis.

In this work, the GA is aimed at finding the smallest

feature subset able to yield a fitness value above a

threshold. Besides optimizing the complexity of the

classifier, feature selection may also improve the

classifier's quality. In fact, classification accuracy

could even improve if noisy or dependent features

are removed.

GAs for feature selection were first proposed by

Siedlecki and Sklansky (Siedlecki and Sklansky,

1989). Many studies have been done on GAs for

feature selection since then (Kudo and Sklansky,

1998), concluding that a GA is suitable for finding

optimal solutions to large problems with more than

40 features to select from.

A GA for feature selection can be used in

combination with a classifier such as SVM, KNN or

ANN, optimizing it. In our method, based on both

GA and SVM, there is no fixed number of

variables. As the GA continuously reduces the

number of variables that characterize the samples, a

pruned search is implemented. The fitness function

(1) considers not only the classification results but

also the number of variables used for such a

classification, so it is defined as the sum of two

factors, one related to the classification results and

another to the number of variables selected.

Regarding classification results, taking the F-measure

into account apparently gives better results than

using only the accuracy obtained with image

features (Müller, Demuth et al., 2008; Tamboli and

Shah, 2011). The F-measure is a function of the

recall (true positive rate or sensitivity: the proportion of

actual positives that are correctly identified as

such) and the precision (or positive predictive value: the

proportion of positive test results that are true

positives).
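
The definitions of recall and precision given above translate directly into code. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall, `beta = 1`); the text does not state which weighting was used, so `beta` is exposed as a parameter. The function names are illustrative only.

```python
def precision(tp, fp):
    """Proportion of positive predictions that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Proportion of actual positives that are correctly identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 is an assumption)."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p + r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

For example, with 90 true positives, 10 false positives and 5 false negatives, the precision is 0.9, the recall is 90/95 and the balanced F-measure is 180/195.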

 



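$$\text{fitness} = (1 - F\text{-measure}) + g(n_{\text{active}}) \qquad (1)$$

where $n_{\text{active}}$ is the number of active genes (selected variables) and $g$ is a term that grows with it.

Because the best individual is the one with the lowest fitness value, individuals with fewer active genes are

favored.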
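
A two-term fitness of the kind described for (1) could look like the following Python sketch. The exact form of the variable-count term is an assumption: here it is the fraction of active genes scaled by a hypothetical `penalty_weight`, and an empty feature subset is treated as infeasible. Lower values are better.

```python
def fitness(f_measure_value, mask, penalty_weight=1.0):
    """Sum of a classification term and a feature-count term (to be minimised).

    f_measure_value: F-measure obtained by the classifier with the features
                     selected in `mask` (computed elsewhere).
    mask:            chromosome as a list of 0/1 genes, 1 = feature selected.
    penalty_weight:  hypothetical weight of the feature-count term.
    """
    n_active = sum(mask)
    if n_active == 0:
        # Assumption: a chromosome that selects no feature cannot be evaluated.
        return float("inf")
    classification_term = 1.0 - f_measure_value
    size_term = penalty_weight * n_active / len(mask)
    return classification_term + size_term
```

With the same F-measure, a chromosome with one active gene out of four scores lower (better) than one with two active genes, which is how smaller subsets are rewarded.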

5 EXPERIMENTAL RESULTS

The method proposed in this work requires the

division of the pattern set into two parts. To avoid

overfitting, this work proposes to split the training

dataset into training and validation sets to perform a

validation of the obtained results. Once the GA

finishes, the best individual found (the one with

lowest fitness value) is tested, using a 10-fold cross

validation (10-fold CV), to calculate the error of the

proposed model with the validation set and using

only the features in the best individual chromosome.

This involves dividing the validation set into k = 10

complementary subsets; in each round, the model is evaluated on

1 subset, retained as the validation data, while the

remaining k-1 subsets are used as training data.

This second partitioning means that either training or

validation could be carried out with a very small

number of data points. In this case, either the training or the

validation set will surely not be representative of

the search space that is being explored. Different

experiments have been performed to verify this. In

these experiments the pattern set has been divided

into two parts (training and validation), extracting a

total of 30%, 20% and 0% of the training data to the

validation set. When 0% is extracted, no separate

validation set is held out and the training set is

used for validation using cross-validation. For each

division, 10 different trials have been run; a

different seed is used to randomly divide the

elements of the dataset each time.
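
The partitioning scheme described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the experiments: samples are referred to by index, the helper names are hypothetical, and folds are built by simple striding over the shuffled indices.

```python
import random


def split_patterns(n_samples, validation_fraction, seed):
    """Randomly split sample indices into (training, validation) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * validation_fraction)
    return indices[n_val:], indices[:n_val]


def k_fold(indices, k=10):
    """Yield (fit, held_out) index lists, holding out each of k folds once."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield fit, held_out


def validation_rounds(n_samples, validation_fraction, seed, k=10):
    """Build the k cross-validation rounds for one trial of one division."""
    training, validation = split_patterns(n_samples, validation_fraction, seed)
    # With a 0% division nothing is held out, so the training set is reused.
    pool = validation if validation else training
    return list(k_fold(pool, k))
```

For the 1000-sample pattern set, `validation_rounds(1000, 0.3, seed)` yields 10 rounds of 270 fitting and 30 held-out samples each, which makes concrete how few points each round sees in the 70-30 division. Repeating the call with 10 different seeds gives the 10 trials per division.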

The parameter domains of the feature selection

method were initially adjusted based on the literature:

population size from 100 to 250 individuals, elitism

from 0% to 2%, crossover probability from 80% to

98% and mutation probability from 1% to 5%. One-point,

two-point, scattered, arithmetic and heuristic

crossover functions were tested. Regarding the

selection function, uniform, roulette and tournament

functions were evaluated, together with uniform and Gaussian

mutation functions.

The final combination set the population size to 250

individuals, with no elitism, 95% crossover probability, 2%

mutation probability, scattered crossover,

tournament selection and uniform mutation.
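
The three operators of the final combination can be sketched for binary chromosomes as follows. The operator names are interpreted under their common definitions, which is an assumption: scattered crossover takes each gene from either parent according to a random binary mask, uniform mutation redraws a gene at random with a given rate, and tournament selection keeps the contender with the lowest fitness. The tournament size and the `FINAL_CONFIG` dictionary layout are hypothetical.

```python
import random

# Values reported in the text; the dictionary itself is only illustrative.
FINAL_CONFIG = {
    "population_size": 250,
    "elite_count": 0,
    "crossover_probability": 0.95,
    "mutation_probability": 0.02,
}


def tournament_selection(population, fitness_values, rng, size=2):
    """Return the contender with the lowest fitness among `size` random picks."""
    contenders = rng.sample(range(len(population)), size)
    return population[min(contenders, key=lambda i: fitness_values[i])]


def scattered_crossover(parent_a, parent_b, rng):
    """Take each gene from parent_a or parent_b according to a random mask."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def uniform_mutation(chromosome, rng, rate=FINAL_CONFIG["mutation_probability"]):
    """Redraw each gene uniformly from {0, 1} with probability `rate`."""
    return [rng.randint(0, 1) if rng.random() < rate else g for g in chromosome]
```

A child for the next generation would be produced by selecting two parents with `tournament_selection`, recombining them with `scattered_crossover` and passing the result through `uniform_mutation`, using a shared `random.Random` instance as `rng`.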

The SVM parameter domains are set as follows: the kernel

function is linear, quadratic, polynomial (order

ranging from 3 to 10) or Gaussian radial basis,

with the sigma parameter ranging from 0.1 to 10 and the C

parameter from 1 to 100. The RBF(2) kernel

function is selected as the most accurate for solving

this problem.
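
For reference, a Gaussian radial basis kernel and a search grid over sigma and C can be sketched as below. The parameterization exp(-||x - y||^2 / (2 sigma^2)) is one common convention and is assumed here; the grid values in the usage note are hypothetical, since the step sizes explored within the stated ranges are not given.

```python
import itertools
import math


def rbf_kernel(x, y, sigma):
    """Gaussian RBF similarity between two feature vectors (assumed convention)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def parameter_grid(sigmas, costs):
    """Enumerate every (sigma, C) pair to be evaluated."""
    return list(itertools.product(sigmas, costs))
```

For instance, `parameter_grid([0.1, 1, 10], [1, 10, 100])` lists nine candidate configurations spanning the stated ranges; identical vectors always give a kernel value of 1, and the value decays towards 0 as the vectors move apart.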

Experimental results are shown in Table 1, separated

into training results and validation results calculated

using 10-fold CV and the validation set for each

division. Results are the mean over the 10 trials,

with the standard deviation in brackets. The best results in

validation are achieved for the 0% division.

In these 10 trials, some features seem to be more

relevant and appear recurrently as the solution of

each trial. Skewness, S(0,5) InvDfMom, S(2,2)

Correlat and S(0,4) InvDfMom appear in at least 5

solutions. Skewness is a measure of the degree of

asymmetry of the image histogram distribution.

Correlation analyzes the linear dependency of gray

levels of neighboring pixels; when the scale of local

texture is larger than the distance, this measure is

typically high. Inverse difference moment is the

inverse of the contrast of the co-occurrence matrix, so it

is a measure of the amount of local uniformity

present in the image.
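
As an illustration of the first-order feature mentioned above, the skewness of the grey-level distribution of a ROI can be computed as follows. The sketch uses the population (biased) estimator, which is an assumption; the exact estimator used by the texture software has not been checked here.

```python
def histogram_skewness(pixels):
    """Third standardized moment of a flat list of grey-level values."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    if variance == 0:
        # A perfectly uniform region has no asymmetry to measure.
        return 0.0
    third_moment = sum((p - mean) ** 3 for p in pixels) / n
    return third_moment / variance ** 1.5
```

A symmetric distribution gives a value of zero, while a ROI with a few much brighter pixels than the rest gives a positive value.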

The co-occurrence matrix in MaZda (Szczypiński et

al., 2009) is symmetric and the image is normalized.

Co-occurrence based parameters are computed up to

20 times, for (d,0), (0,d), (d,d) and (d,-d), where the

distance d ranges from 1 to 5.
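
The 20 offsets and a symmetric, normalized co-occurrence matrix can be sketched in Python as follows, together with the inverse difference moment. Several details are assumptions made for the sketch: grey levels are 0-based integers, the first offset component is taken as the column displacement and the second as the row displacement, and pairs falling outside the ROI are ignored.

```python
def cooccurrence_offsets(max_distance=5):
    """List the offsets (d,0), (0,d), (d,d), (d,-d) for d = 1..max_distance."""
    return [offset for d in range(1, max_distance + 1)
            for offset in ((d, 0), (0, d), (d, d), (d, -d))]


def cooccurrence_matrix(image, offset, levels):
    """Symmetric co-occurrence matrix of `image`, normalized to sum to 1."""
    dx, dy = offset
    rows, cols = len(image), len(image[0])
    counts = [[0.0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                # Count the pair in both directions to make the matrix symmetric.
                counts[i][j] += 1
                counts[j][i] += 1
                total += 2
    if total == 0:
        return counts
    return [[value / total for value in row] for row in counts]


def inverse_difference_moment(p):
    """Local uniformity: large when co-occurring grey levels are similar."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))
```

A constant 2x2 image gives an inverse difference moment of 1 for offset (1,0), whereas a 2x2 checkerboard gives 0.5, reflecting its lower local uniformity.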

6 SUMMARY AND CONCLUSIONS

In this work we present a method for classification

of proteins in two-dimensional electrophoresis gels

using textural information. The proposed method is

based on a feature selection process using GAs

(Holland, 1975) and SVMs (Vapnik, 1979).

A dataset with 10 images, 100 ROIs for each one

and 296 features per ROI was created. Two different

clinicians performed the manual protein

detection. As this is a high-variability process, a

refinement step based on the correlation of the

results of the two clinicians was performed in order

to select enough representative proteins.

Different experiments have been performed and

in these experiments the pattern set has been divided

into two parts (training and validation) extracting a

total of 30 %, 20% and 0% of the training data to the

validation set.

The proposed method has been successfully

applied to different real images, including images

with high complexity, which means a larger number

of proteins and larger deformation between images.

Furthermore, the method presented has important

implications for the analysis of two-dimensional

electrophoresis gel images in the sense that this

classification step can be very useful in order to

discard over-segmented areas after a protein

segmentation or identification process.

ACKNOWLEDGEMENTS

This work is supported by the General Directorate of

Culture, Education and University Management of

the Xunta de Galicia (Ref. 10SIN105004PR).

REFERENCES

Burges, C. J. C. (1998). A tutorial on support vector

machines for pattern recognition. Data Mining and

Knowledge Discovery 2(2): 121-167.

Chapelle, O., P. Haffner, et al. (1999). Support vector

machines for histogram-based image classification.

IEEE Transactions on Neural Networks 10(5): 1055-

1064.

Goldberg, D. (1989). Genetic Algorithms in Search,

Optimization, and Machine Learning, Addison-Wesley

Professional.

Haralick, R. M., K. Shanmugam, et al. (1973). Textural

features for image classification. IEEE Transactions

on Systems, Man and Cybernetics SMC-3(6): 610-621.

Holland, J. H. (1975). Adaptation in natural and artificial

systems: an introductory analysis with applications to

biology, control, and artificial intelligence, University

of Michigan Press.

Hunt, S. M. N., M. R. Thomas, et al. (2005). Optimal

Replication and the Importance of Experimental

Design for Gel-Based Quantitative Proteomics.

Journal of Proteome Research 4(3): 809-819.

Kudo, M. and J. Sklansky (1998). A comparative

evaluation of medium- and large-scale feature

selectors for pattern classifiers. Kybernetika 34(4):

429-434.

Moulin, L. S., A. P. Alves Da Silva, et al. (2004). Support

vector machines for transient stability analysis of

large-scale power systems. IEEE Transactions on

Power Systems 19(2): 818-825.

Müller, M., B. Demuth, et al. (2008). An evolutionary

approach for learning motion class patterns. Lecture

Notes in Computer Science 5096: 365-374.

Rabilloud, T., M. Chevallet, et al. (2010). Two-

dimensional gel electrophoresis in proteomics: Past,

present and future. Journal of Proteomics 73(11):

2064-2077.

Siedlecki, W. and J. Sklansky (1989). A note on genetic

algorithms for large-scale feature selection. Pattern

Recognition Letters 10(5): 335-347.

Szczypiński, P. M., M. Strzelecki, et al. (2009). MaZda-A

software package for image texture analysis.

Computer Methods and Programs in Biomedicine

94(1): 66-76.

Tamboli, A. S. and M. A. Shah (2011). A Generic

Structure of Object Classification Using Genetic

Programming. Communication Systems and Network

Technologies (CSNT), 2011 International Conference

on.

Vapnik, V. N. (1979). Estimation of dependences based on

empirical data [in Russian]. Nauka, English translation

Springer Verlag, 1982.

APPENDIX

Table 1: Results separated in training and validation (mean over 10 trials, standard deviation in brackets).

Division    Accuracy          Sensitivity       Specificity       F

Training
70-30       0.9128 (0.0103)   0.9450 (0.0180)   0.8816 (0.0131)   0.9141 (0.0107)
80-20       0.9183 (0.0093)   0.9504 (0.0118)   0.8866 (0.0086)   0.9202 (0.0097)
100-0       0.9236 (0.0035)   0.9593 (0.0050)   0.8880 (0.0072)   0.9262 (0.0032)

Validation
70-30       0.9073 (0.0058)   0.9442 (0.0112)   0.8699 (0.0118)   0.9110 (0.0057)
80-20       0.9165 (0.0090)   0.9518 (0.0107)   0.8812 (0.0135)   0.9192 (0.0085)
100-0       0.9254 (0.0040)   0.9580 (0.0037)   0.8928 (0.0068)   0.9277 (0.0037)

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

404