present even if they could exist as a non-dominant
feature.
Genetic Algorithms (GAs) are search techniques
inspired by Darwinian Evolution and developed by
Holland in the 1970s (Holland, 1975). In a GA, an
initial population of individuals, i.e. possible
solutions defined within the domain of a fitness
function to be optimized, is evolved by means of
genetic operators: selection, crossover and mutation.
The selection operator ensures the survival of the
fittest, while the crossover represents the mating
between individuals, and the mutation operator
introduces random modifications. GAs possess effective exploration and exploitation capabilities: they explore the search space in parallel, exploiting the information about the quality of the individuals evaluated so far (Goldberg, 1989).
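To make the three operators concrete, the following is a minimal sketch of a binary GA; the tournament selection, one-point crossover and bit-flip mutation shown here are common textbook choices, not necessarily the variants used in this work.

```python
import random

def evolve(fitness, genome_len, pop_size=50, generations=100,
           p_cross=0.8, p_mut=0.01):
    """Minimal binary GA: tournament selection, one-point crossover, bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in pop]

        def select():
            # Tournament selection: the fitter of two random individuals survives
            a, b = random.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            if random.random() < p_cross:            # crossover: mating of two parents
                cut = random.randrange(1, genome_len)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):                   # mutation: random bit flips
                nxt.append([1 - g if random.random() < p_mut else g for g in child])
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# Toy usage: maximize the number of ones in a 20-bit string
best = evolve(fitness=sum, genome_len=20)
print(best, sum(best))
```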
Vapnik introduced Support Vector Machines (SVMs) in the late 1970s on the foundation of statistical learning theory (Vapnik, 1979). The basic implementation deals with two-class problems in which data are separated by a hyperplane defined by a number of support vectors. This hyperplane separates the positive from the negative examples and is oriented so that the distance between the boundary and the nearest data point in each class is maximal; these nearest data points, which define the margin, are known as support vectors (Burges, 1998).
These classifiers have also proven to be
exceptionally efficient in classification problems of
higher dimensionality (Chapelle, Haffner et al.,
1999; Moulin, Alves Da Silva et al., 2004), because
of their ability to generalize in high-dimensional
spaces, such as the ones spanned by texture patterns.
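As an illustration of the maximal-margin idea, the sketch below trains a linear SVM on synthetic high-dimensional data with scikit-learn; the library and the synthetic data are choices made for this example, not part of the original work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class, high-dimensional data standing in for texture features
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)    # maximal-margin hyperplane
print("support vectors per class:", clf.n_support_)  # the points defining the margin
print("test accuracy:", clf.score(X_te, y_te))
```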
3 MATERIALS
In order to generate the dataset, ten 2D-PAGE
images of different types of tissues and different
experimental conditions were used. These images
are similar to the ones used by G.-Z. Yang (Imperial
College of Science, Technology and Medicine,
London). It is important to note that Hunt et al. (Hunt, Thomas et al., 2005) determined that 7-8 is the minimum acceptable number of samples for a proteomic study.
For each image, 50 regions of interest (ROIs) representing proteins and 50 representing non-proteins (noise, black non-protein regions, and background) were selected, building a training set of 1000 samples. Selection followed a double-blind process: two clinicians independently marked as many ROIs as they considered appropriate, and then, from the ROIs common to both, they selected representative proteins (isolated, overlapped, big, small, darker, etc.).
4 PROPOSED METHOD
The first step in texture analysis is texture feature extraction from the ROIs. Using the specialized software MaZda (Szczypiński et al., 2009), 296 texture features are computed for each element in the training set. These features are based on the image histogram, co-occurrence matrix, run-length matrix, image gradients, autoregressive models and wavelet analysis. Histogram-related measures constitute the first-order statistics proposed by Haralick (Haralick, Shanmugam et al., 1973), while second-order statistics are those derived from the Spatial Distribution Grey-Level Matrices (SDGM).
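MaZda itself is a standalone application; as a rough illustration of just one of the six feature families it computes, the sketch below derives a few co-occurrence-matrix features with scikit-image, which is a substitute for, not a reproduction of, MaZda's implementation.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # skimage >= 0.19 spelling

def glcm_features(roi, levels=256):
    """A few co-occurrence-matrix features for one 8-bit ROI."""
    glcm = graycomatrix(roi, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    return {prop: graycoprops(glcm, prop).ravel()
            for prop in ("contrast", "correlation", "energy", "homogeneity")}

# Toy usage on a random patch standing in for a protein-spot ROI
roi = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
print(glcm_features(roi))
```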
All these feature sets were included in the
dataset. The normalization method applied was the
one set by default in Mazda: image intensities were
normalized in the range from 1 to Ng = 2^k, where k is the number of bits per pixel used to encode the image under analysis.
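Assuming this default corresponds to a simple min-max rescaling (an assumption; MaZda's exact procedure may differ, e.g. by clipping outliers first), the normalization can be written as:

```python
import numpy as np

def normalize(img, k=8):
    """Rescale intensities to [1, 2**k] by min-max normalization (assumed
    interpretation of MaZda's default; the actual procedure may differ)."""
    lo, hi = float(img.min()), float(img.max())
    ng = 2 ** k
    return 1.0 + (img.astype(np.float64) - lo) * (ng - 1) / (hi - lo)
```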
In this work, GA is aimed at finding the smallest
feature subset able to yield a fitness value above a
threshold. Besides optimizing the complexity of the
classifier, feature selection may also improve the
classifier's quality. In fact, classification accuracy
could even improve if noisy or dependent features
are removed.
GAs for feature selection were first proposed by Siedlecki and Sklansky (Siedlecki and Sklansky, 1989). Many studies on GA-based feature selection have been carried out since then (Kudo and Sklansky, 1998), concluding that GAs are suitable for finding optimal solutions to large problems with more than 40 features to select from.
GA for feature selection can be used in combination with a classifier such as SVM, KNN or ANN, optimizing it. In our method, based on both GA and SVM, there is no fixed number of variables: as the GA continuously reduces the number of variables that characterize the samples, a pruned search is implemented. The fitness function (1) considers not only the classification results but also the number of variables used for such a classification, so it is defined as the sum of two factors, one related to the classification results and another to the number of variables selected, as sketched below.
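Equation (1) itself is not reproduced on this page; as a rough sketch, the fitness below assumes a weighted sum of a cross-validated F-measure term and a reward for small feature subsets, with an illustrative weight (alpha) and an arbitrary SVM configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(mask, X, y, alpha=0.8):
    """Fitness of one GA individual (a binary feature mask).

    Illustrative form only: a weighted sum of a classification term (the
    cross-validated F-measure of an SVM on the selected features) and a
    reward for using few variables; the weights and the exact shape of the
    paper's Eq. (1) are assumptions.
    """
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 0.0                       # an empty subset cannot classify
    f1 = cross_val_score(SVC(kernel="linear"), X[:, selected], y,
                         scoring="f1", cv=5).mean()
    return alpha * f1 + (1 - alpha) * (1 - selected.size / X.shape[1])
```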
Regarding the classification term, taking the F-measure into account apparently gives better results than using only the accuracy obtained with image features (Müller, Demuth et al., 2008; Tamboli and